diff --git a/docs/_posts/2014-04-02-the-1st-rocksdb-local-meetup-held-on-march-27-2014.markdown b/docs/_posts/2014-04-02-the-1st-rocksdb-local-meetup-held-on-march-27-2014.markdown index 6c6a439338..af93472092 100644 --- a/docs/_posts/2014-04-02-the-1st-rocksdb-local-meetup-held-on-march-27-2014.markdown +++ b/docs/_posts/2014-04-02-the-1st-rocksdb-local-meetup-held-on-march-27-2014.markdown @@ -39,3 +39,13 @@ A very interesting question asked by a massive number of guests is: does RocksDB When will be the next meetup? We haven't decided yet. We will see whether the community is interested in it and how it can help RocksDB grow. If you have any questions or feedback for the meetup or RocksDB, please let us know in [our Facebook group](https://www.facebook.com/groups/rocksdb.dev/). + +### Comments + +**[Rajiv](geetasen@gmail.com)** + +Have any of these talks been recorded and if so will they be published? + +**[Igor Canadi](icanadi@fb.com)** + +Yes, I think we plan to publish them soon. diff --git a/docs/_posts/2014-05-14-lock.markdown b/docs/_posts/2014-05-14-lock.markdown index d99e59f11d..a2720a87d5 100644 --- a/docs/_posts/2014-05-14-lock.markdown +++ b/docs/_posts/2014-05-14-lock.markdown @@ -52,3 +52,38 @@ When I said there was only one single major lock, I was lying. In RocksDB, all L (2) [New PlainTable format](//github.com/facebook/rocksdb/wiki/PlainTable-Format) (optimized for SST in ramfs/tmpfs) does not organize data by blocks. Data are located by memory addresses so no block cache is needed. With all of those improvements, lock contention is not a bottleneck anymore, which is shown in our [memory-only benchmark](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks) . Furthermore, lock contentions are not causing some huge (50 milliseconds+) latency outliers they used to cause. + +### Comments + +**[Lee Hounshell](lee@apsalar.com)** + +Please post an example of reading the same rocksdb concurrently. + +We are using the latest 3.0 rocksdb; however, when two separate processes +try and open the same rocksdb for reading, only one of the open requests +succeed. The other open always fails with “db/LOCK: Resource temporarily unavailable” So far we have not found an option that allows sharing the rocksdb for reads. An example would be most appreciated. + +**[Siying Dong](siying.d@fb.com)** + +Sorry for the delay. We don’t have feature support for this scenario yet. Here is an example you can work around this problem. You can build a snapshot of the DB by doing this: + +1. create a separate directory on the same host for a snapshot of the DB. +1. call `DB::DisableFileDeletions()` +1. call `DB::GetLiveFiles()` to get a full list of the files. +1. for all the files except manifest, add a hardlink file in your new directory pointing to the original file +1. copy the manifest file and truncate the size (you can read the comments of `DB::GetLiveFiles()` for more information) +1. call `DB::EnableFileDeletions()` +1. now you can open the snapshot directory in another process to access those files. Please remember to delete the directory after reading the data to allow those files to be recycled. + +By the way, the best way to ask those questions is in our [facebook group](https://www.facebook.com/groups/rocksdb.dev/). Let us know if you need any further help. + +**[Darshan](darshan.ghumare@gmail.com)** + +Will this consistency problem of RocksDB all occurs in case of single put/write? +What all ACID properties is supported by RocksDB, only durability irrespective of single or batch write? + +**[Siying Dong](siying.d@fb.com)** + +We recently [introduced optimistic transaction](https://reviews.facebook.net/D33435) which can help you ensure all of ACID. + +This blog post is mainly about optimizations in implementation. The RocksDB consistency semantic is not changed. diff --git a/docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown b/docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown index 305ab36241..210b4de983 100644 --- a/docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown +++ b/docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown @@ -37,3 +37,9 @@ To make sure the format works efficiently with empty queries, we added a bloom f These are the design goals and basic ideas of PlainTable file format. For detailed information, see [this wiki page](https://github.com/facebook/rocksdb/wiki/PlainTable-Format). [1] Bloom filter checks typically require multiple memory access. However, because they are independent, they usually do not make the CPU pipeline stale. In any case, we improved the bloom filter to improve data locality - we may cover this further in a future blog post. + +### Comments + +**[Siying Dong](siying.d@fb.com)** + +Does [http://rocksdb.org/feed/](http://rocksdb.org/feed/) work? diff --git a/docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown b/docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown index b8b531b74c..aca75bdefe 100644 --- a/docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown +++ b/docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown @@ -88,3 +88,13 @@ The new approach entails one issue: a thread can visit GetImpl() once but can ne (3) After any flush/compaction, the background thread performs a sweep (CAS) across all threads’ local storage and frees encountered SuperVersion. A reader thread must re-acquire a new SuperVersion reference on its next visit. + +### Comments + +**[David Barbour](dmbarbour@gmail.com)** + +Please post an example of reading the same rocksdb concurrently. + +We are using the latest 3.0 rocksdb; however, when two separate processes +try and open the same rocksdb for reading, only one of the open requests +succeed. The other open always fails with “db/LOCK: Resource temporarily unavailable” So far we have not found an option that allows sharing the rocksdb for reads. An example would be most appreciated. diff --git a/docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown b/docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown index ee70e7b486..03787e9984 100644 --- a/docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown +++ b/docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown @@ -80,4 +80,11 @@ When talking to our customers, there are couple of issues that keep reoccurring. As we increase deployment of RocksDB, engineers are spending more time on debugging RocksDB issues. We plan to improve user experience when running RocksDB. The goal is to reduce TTD (time-to-debug). The work includes monitoring, visualizations and documentations. -[1][ http://blog.parse.com/announcements/mongodb-rocksdb-parse/](http://blog.parse.com/announcements/mongodb-rocksdb-parse/) +[1]( http://blog.parse.com/announcements/mongodb-rocksdb-parse/](http://blog.parse.com/announcements/mongodb-rocksdb-parse/) + + +### Comments + +**[Mike](allspace2012@outlook.com)** + +What’s the status of this roadmap? “RocksDB on cheaper storage media”, has this been implemented? diff --git a/docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown b/docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown index b7266f5000..c248a86b59 100644 --- a/docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown +++ b/docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown @@ -12,3 +12,17 @@ Over the past 6 months we have seen a number of use cases where RocksDB is succe We at Microsoft Bing could not be left behind. As a result we are happy to [announce](http://bit.ly/1OmWBT9) the availability of the Windows Port created here at Microsoft which we intend to use as a storage option for one of our key/value data stores. We are happy to make this available for the community. Keep tuned for more announcements to come. + +### Comments + +**[Siying Dong](siying.d@fb.com)** + +Appreciate your contributions to RocksDB project! I believe it will benefits many users! + +**[empresas sevilla](oxofkx@gmail.com)** + +Magnifico artículo|, un placer leer el blog + +**[jak usunac](tomogedac@o2.pl)** + +I believe it will benefits too diff --git a/docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown b/docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown index 45843f29e0..78173b5736 100644 --- a/docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown +++ b/docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown @@ -4,7 +4,7 @@ layout: post author: sdong category: blog redirect_from: - -/blog/2537/analysis-file-read-latency-by-level/ + - /blog/2537/analysis-file-read-latency-by-level/ --- In many use cases of RocksDB, people rely on OS page cache for caching compressed data. With this approach, verifying effective of the OS page caching is challenging, because file system is a black box to users. @@ -16,7 +16,7 @@ In order to make tuning easier, we added new instrumentation to help users analy The output looks like this: -```bash +``` ** Level 0 read latency histogram (micros): Count: 696 Average: 489.8118 StdDev: 222.40 Min: 3.0000 Median: 452.3077 Max: 1896.0000 @@ -148,3 +148,95 @@ Percentiles: P50: 376.00 P75: 438.00 P99: 1421.68 P99.9: 4164.43 P99.99: 9056.52 In this example, you can see we only issued 696 reads from level 0 while issued 25 million reads from level 5. The latency distribution is also clearly shown among those reads. This will be helpful for users to analysis OS page cache efficiency. Currently the read latency per level includes reads from data blocks, index blocks, as well as bloom filter blocks. We are also working on a feature to break down those three type of blocks. + +### Comments + +**[Tao Feng](fengtao04@gmail.com)** + +Is this feature also included in RocksJava? + +**[Siying Dong](siying.d@fb.com)** + +Should be. As long as you enable statistics, you should be able to get the value from `RocksDB.getProperty()` with property `rocksdb.dbstats`. Let me know if you can’t find it. + +**[chiddu](cnbscience@gmail.com)** + +> In this example, you can see we only issued 696 reads from level 0 while issued 256K reads from level 5. + +Isn’t it 2.5 M of reads instead of 256K ? . + +Also could anyone please provide more description on the histogram ? especially + +> Count: 25583746 Average: 421.1326 StdDev: 385.11 +> Min: 1.0000 Median: 376.0011 Max: 202444.0000 +> Percentiles: P50: 376.00 P75: 438.00 P99: 1421.68 P99.9: 4164.43 P99.99: 9056.52 + +and + +> [ 0, 1 ) 2351 0.009% 0.009% +> [ 1, 2 ) 6077 0.024% 0.033% +> [ 2, 3 ) 8471 0.033% 0.066% +> [ 3, 4 ) 788 0.003% 0.069%” + +thanks in advance + +**[Siying Dong](siying.d@fb.com)** + +Thank you for pointing out the mistake. I fixed it now. + +In this output, there are 2.5 million samples, average latency is 421 micro seconds, with standard deviation 385. Median is 376, max value is 202 milliseconds. 0.009% has value of 1, 0.024% has value of 1, 0.033% has value of 2. Accumulated value from 0 to 2 is 0.066%. + +Hope it helps. + +**[chiddu](cnbscience@gmail.com)** + +Thank you Siying for the quick reply, I was running couple of benchmark testing to check the performance of rocksdb on SSD. One of the test is similar to what is mentioned in the wiki, TEST 4 : Random read , except the key_size is 10 and value_size is 20. I am inserting 1 billion hashes and reading 1 billion hashes with 32 threads. The histogram shows something like this + +``` +Level 5 read latency histogram (micros): +Count: 7133903059 Average: 480.4357 StdDev: 309.18 +Min: 0.0000 Median: 551.1491 Max: 224142.0000 +Percentiles: P50: 551.15 P75: 651.44 P99: 996.52 P99.9: 2073.07 P99.99: 3196.32 +—————————————————— +[ 0, 1 ) 28587385 0.401% 0.401% +[ 1, 2 ) 686572516 9.624% 10.025% ## +[ 2, 3 ) 567317522 7.952% 17.977% ## +[ 3, 4 ) 44979472 0.631% 18.608% +[ 4, 5 ) 50379685 0.706% 19.314% +[ 5, 6 ) 64930061 0.910% 20.224% +[ 6, 7 ) 22613561 0.317% 20.541% +…………more…………. +``` + +If I understand your previous comment correctly, + +1. How is it that the count is around 7 billion when I have only inserted 1 billion hashes ? is the stat broken ? +1. What does the percentiles and the numbers signify ? +1. 0, 1 ) 28587385 0.401% 0.401% what does this “28587385” stand for in the histogram row ? + +**[Siying Dong](siying.d@fb.com)** + +If I remember correctly, with db_bench, if you specify –num=1000000000 –threads=32, it is every thread reading one billion keys, total of 32 billions. Is it the case you ran into? + +28,587,385 means that number of data points take the value [0,1) +28,587,385 / 7,133,903,058 = 0.401% provides percentage. + +**[chiddu](cnbscience@gmail.com)** + +I do have `num=1000000000` and `t=32`. The script says reading 1 billion hashes and not 32 billion hashes. + +this is the script on which I have used + +``` +echo “Load 1B keys sequentially into database…..” +bpl=10485760;overlap=10;mcz=2;del=300000000;levels=6;ctrig=4; delay=8; stop=12; wbn=3; mbc=20; mb=67108864;wbs=134217728; dds=1; sync=0; r=1000000000; t=1; vs=20; bs=4096; cs=1048576; of=500000; si=1000000; ./db_bench –benchmarks=fillseq –disable_seek_compaction=1 –mmap_read=0 –statistics=1 –histogram=1 –num=$r –threads=$t –value_size=$vs –block_size=$bs –cache_size=$cs –bloom_bits=10 –cache_numshardbits=6 –open_files=$of –verify_checksum=1 –db=/data/mysql/leveldb/test –sync=$sync –disable_wal=1 –compression_type=none –stats_interval=$si –compression_ratio=0.5 –disable_data_sync=$dds –write_buffer_size=$wbs –target_file_size_base=$mb –max_write_buffer_number=$wbn –max_background_compactions=$mbc –level0_file_num_compaction_trigger=$ctrig –level0_slowdown_writes_trigger=$delay –level0_stop_writes_trigger=$stop –num_levels=$levels –delete_obsolete_files_period_micros=$del –min_level_to_compress=$mcz –max_grandparent_overlap_factor=$overlap –stats_per_interval=1 –max_bytes_for_level_base=$bpl –use_existing_db=0 –key_size=10 + +echo “Reading 1B keys in database in random order….” +bpl=10485760;overlap=10;mcz=2;del=300000000;levels=6;ctrig=4; delay=8; stop=12; wbn=3; mbc=20; mb=67108864;wbs=134217728; dds=0; sync=0; r=1000000000; t=32; vs=20; bs=4096; cs=1048576; of=500000; si=1000000; ./db_bench –benchmarks=readrandom –disable_seek_compaction=1 –mmap_read=0 –statistics=1 –histogram=1 –num=$r –threads=$t –value_size=$vs –block_size=$bs –cache_size=$cs –bloom_bits=10 –cache_numshardbits=6 –open_files=$of –verify_checksum=1 –db=/some_data_base –sync=$sync –disable_wal=1 –compression_type=none –stats_interval=$si –compression_ratio=0.5 –disable_data_sync=$dds –write_buffer_size=$wbs –target_file_size_base=$mb –max_write_buffer_number=$wbn –max_background_compactions=$mbc –level0_file_num_compaction_trigger=$ctrig –level0_slowdown_writes_trigger=$delay –level0_stop_writes_trigger=$stop –num_levels=$levels –delete_obsolete_files_period_micros=$del –min_level_to_compress=$mcz –max_grandparent_overlap_factor=$overlap –stats_per_interval=1 –max_bytes_for_level_base=$bpl –use_existing_db=1 –key_size=10 +``` + +After running this script, there were no issues wrt to loading billion hashes , but when it came to reading part, its been almost 4 days and still I have only read 7 billion hashes and have read 200 million hashes in 2 and half days. Is there something which is missing in db_bench or something which I am missing ? + +**[Siying Dong](siying.d@fb.com)** + +It’s a printing error then. If you have `num=1000000000` and `t=32`, it will be 32 threads, and each reads 1 billion keys. diff --git a/docs/_posts/2016-01-29-compaction_pri.markdown b/docs/_posts/2016-01-29-compaction_pri.markdown index 534ae4e19f..42fff622e4 100644 --- a/docs/_posts/2016-01-29-compaction_pri.markdown +++ b/docs/_posts/2016-01-29-compaction_pri.markdown @@ -41,3 +41,9 @@ Usually people use [compaction filters](https://github.com/facebook/rocksdb/blob In all, there three choices of compaction priority modes optimizing different scenarios. if you have a new use case, we suggest you start with options.compaction_pri=kOldestSmallestSeqFirst (note it is not the default one for backward compatible reason). If you want to further optimize your use case, you can try other two use cases if your use cases apply. If you have good ideas about better compaction picker approach, you are welcome to implement and benchmark it. We'll be glad to review and merge your a pull requests. + +### Comments + +**[Mark Callaghan](mdcallag@gmail.com)** + +Performance results for compaction_pri values and linkbench are explained at [http://smalldatum.blogspot.com/2016/02/compaction-priority-in-rocksdb.html](http://smalldatum.blogspot.com/2016/02/compaction-priority-in-rocksdb.html)