mirror of https://github.com/facebook/rocksdb.git

commit 166cc6154b: Merge branch 'main' into fix_typo

@@ -632,6 +632,7 @@ set(SOURCES
 cache/secondary_cache.cc
 cache/secondary_cache_adapter.cc
 cache/sharded_cache.cc
+cache/tiered_secondary_cache.cc
 db/arena_wrapped_db_iter.cc
 db/blob/blob_contents.cc
 db/blob/blob_fetcher.cc
@@ -1263,6 +1264,7 @@ if(WITH_TESTS)
 cache/cache_test.cc
 cache/compressed_secondary_cache_test.cc
 cache/lru_cache_test.cc
+cache/tiered_secondary_cache_test.cc
 db/blob/blob_counting_iterator_test.cc
 db/blob/blob_file_addition_test.cc
 db/blob/blob_file_builder_test.cc
HISTORY.md (45 changes)

@@ -1,6 +1,51 @@
# Rocksdb Change Log
> NOTE: Entries for next release do not go here. Follow instructions in `unreleased_history/README.txt`

## 8.7.0 (09/22/2023)
### New Features
* Added an experimental new "automatic" variant of HyperClockCache that does not require a prior estimate of the average size of cache entries. This variant is activated when HyperClockCacheOptions::estimated_entry_charge = 0 and has essentially the same concurrency benefits as the existing HyperClockCache.
* Add a new statistic `COMPACTION_CPU_TOTAL_TIME` that records cumulative compaction CPU time. This ticker is updated regularly while a compaction is running.
* Add `GetEntity()` API for ReadOnly DB and Secondary DB.
* Add a new iterator API `Iterator::Refresh(const Snapshot *)` that allows an iterator to be refreshed while using the input snapshot to read.
* Added a new read option `merge_operand_count_threshold`. When the number of merge operands applied during a successful point lookup exceeds this threshold, the query will return a special OK status with a new subcode `kMergeOperandThresholdExceeded`. Applications might use this signal to take action to reduce the number of merge operands for the affected key(s), for example by running a compaction.
* For `NewRibbonFilterPolicy()`, made the `bloom_before_level` option mutable through the Configurable interface and the SetOptions API, allowing dynamic switching between all-Bloom and all-Ribbon configurations, and configurations in between. See comments on `NewRibbonFilterPolicy()`.
* RocksDB now allows the block cache to be stacked on top of a compressed secondary cache and a non-volatile secondary cache, thus creating a three-tier cache. To set it up, use the `NewTieredCache()` API in rocksdb/cache.h (see the sketch after this list).
* Added a new wide-column aware full merge API called `FullMergeV3` to `MergeOperator`. `FullMergeV3` supports wide columns both as base value and merge result, which enables the application to perform more general transformations during merges. For backward compatibility, the default implementation implements the earlier logic of applying the merge operation to the default column of any wide-column entities. Specifically, if there is no base value or the base value is a plain key-value, the default implementation falls back to `FullMergeV2`. If the base value is a wide-column entity, the default implementation invokes `FullMergeV2` to perform the merge on the default column, and leaves any other columns unchanged.
* Add wide column support to ldb commands (scan, dump, idump, dump_wal) and the sst_dump tool's scan command.

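To make the tiered setup concrete, here is a minimal sketch assembled from the options exercised by the tests added later in this commit (`HyperClockCacheOptions`, `TieredCacheOptions`, `NewTieredCache()`). Values are illustrative, `MakeSharedCache()` on the options struct is an assumption, and the wiring of the primary cache options into `TieredCacheOptions` is not shown in this diff, so it is left as a comment.

```cpp
#include <memory>
#include <rocksdb/cache.h>

using namespace ROCKSDB_NAMESPACE;

// "Automatic" HyperClockCache variant: estimated_entry_charge == 0 means no
// prior estimate of the average entry size is required.
std::shared_ptr<Cache> MakeAutoHyperClockCache() {
  HyperClockCacheOptions opts(/*_capacity=*/64 << 20,
                              /*_estimated_entry_charge=*/0,
                              /*_num_shard_bits=*/-1);
  return opts.MakeSharedCache();  // assumed helper on the options struct
}

// Three-tier block cache: specify the combined budget for the primary and
// compressed secondary tiers plus the fraction given to the compressed tier.
// Capacities set in the nested cache options are ignored (see the Public API
// Changes below).
std::shared_ptr<Cache> MakeThreeTierCache() {
  TieredCacheOptions opts;
  opts.cache_type = PrimaryCacheType::kCacheTypeLRU;  // LRU or HyperClockCache
  opts.total_capacity = 100 << 20;
  opts.compressed_secondary_ratio = 0.3;
  opts.comp_cache_opts.num_shard_bits = 0;
  // opts.adm_policy selects the admission policy; attaching the primary
  // cache's LRUCacheOptions/HyperClockCacheOptions is not shown in this diff
  // and is omitted here.
  return NewTieredCache(opts);
}
```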
### Public API Changes
* Expose more information about input files used in table creation (if any) in `CompactionFilter::Context`. See `CompactionFilter::Context::input_start_level` and `CompactionFilter::Context::input_table_properties` for more.
* The default value of `Options::compaction_readahead_size` is changed from 0 to 2MB.
* When using LZ4 compression, the `acceleration` parameter is configurable by setting the negated value in `CompressionOptions::level`. For example, `CompressionOptions::level=-10` will set `acceleration=10`.
* The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache`, is provided to dynamically update the total capacity, the ratio of the compressed cache, and the admission policy (see the sketch after this list).
* The `NewTieredVolatileCache()` API in rocksdb/cache.h has been renamed to `NewTieredCache()`.

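A minimal sketch of the dynamic-update calls, mirroring the new `DynamicUpdate` tests later in this commit; passing -1 leaves the total capacity unchanged, and the declaration is assumed to live in rocksdb/cache.h next to `NewTieredCache()`.

```cpp
#include <cassert>
#include <memory>
#include <rocksdb/cache.h>

using namespace ROCKSDB_NAMESPACE;

void ResizeTieredCache(const std::shared_ptr<Cache>& tiered_cache) {
  // Grow the combined primary + compressed-secondary budget to 130MB; the
  // configured compressed_secondary_ratio decides the split between tiers.
  Status s = UpdateTieredCache(tiered_cache, 130 << 20);
  assert(s.ok());

  // Keep the total capacity (-1) but give 40% of it to the compressed tier.
  s = UpdateTieredCache(tiered_cache, -1, /*compressed_secondary_ratio=*/0.4);
  assert(s.ok());
}
```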
### Behavior Changes
* Compaction read performance will regress when `Options::compaction_readahead_size` is explicitly set to 0.
* Universal size amp compaction will conditionally exclude some of the newest L0 files when selecting input, with a small negative impact to size amp. This is to prevent a large number of L0 files from being locked by a size amp compaction, potentially leading to a write stop after a few more flushes.
* Change the ldb scan command delimiter from ':' to '==>'.

### Bug Fixes
* Fix a bug where, if there is an error reading from offset 0 of a file from L1+ and the file is not the first file in the sorted run, data can be lost in compaction and read/scan can return incorrect results.
* Fix a bug where an iterator may return incorrect results for DeleteRange() users if there was an error reading from a file.
* Fix a bug with atomic_flush=true that can cause the DB to get stuck after a flush fails (#11872).
* Fix a bug where RocksDB (with atomic_flush=false) can delete output SST files of pending flushes when a previous concurrent flush fails (#11865). This can result in the DB entering read-only state with an error message like `IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_87732_4230653031040984171/000013.sst`.
* Fix an assertion fault during seek with async_io when readahead trimming is enabled.
* When the compressed secondary cache capacity is reduced to 0, it should be completely disabled. Before this fix, inserts and lookups would still go to the backing `LRUCache` before returning, thus incurring locking overhead. With this fix, inserts and lookups are no-ops and do not add any overhead.
* Updating the tiered cache (cache allocated using NewTieredCache()) by calling SetCapacity() on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But SetCapacity() would just set the primary cache capacity. With this fix, the user always specifies the total budget and the compressed secondary cache ratio on creation. Subsequently, SetCapacity() will distribute the new capacity across the two caches by the same ratio.
* Fixed a bug in `MultiGet` for cleaning up SuperVersion acquired with locking db mutex.
* Fix a bug where the row cache can falsely return kNotFound even though the row cache entry is hit.
* Fixed a race condition in `GenericRateLimiter` that could cause it to stop granting requests.
* Fix a bug (Issue #10257) where the DB can hang after a write stall since no compaction is scheduled (#11764).
* Add a fix for async_io where, during seek, when reading a block for seeking a target key in a file without any readahead, the iterator aligned the read on a page boundary and read more than necessary. This increased the storage read bandwidth usage.
* Fix an issue in the sst dump tool to handle bounds specified for data with user-defined timestamps.
* When auto_readahead_size is enabled, update the readahead upper bound during readahead trimming when a reseek changes iterate_upper_bound dynamically.
* Fixed a bug where `rocksdb.file.read.verify.file.checksums.micros` is not populated.

### Performance Improvements
* Added additional improvements in tuning readahead_size during scans when auto_readahead_size is enabled. However, it is not supported with the Iterator::Prev operation and will return a NotSupported error.
* During async_io, the Seek happens in 2 phases. Phase 1 starts an asynchronous read on a block cache miss, and phase 2 waits for it to complete and finishes the seek. In both phases, it tries to look up the data block in the block cache before looking in the prefetch buffer. This is optimized by doing the block cache lookup only in the first phase, which saves some CPU.

## 8.6.0 (08/18/2023)
### New Features
* Added enhanced data integrity checking on SST files with new format_version=6. Performance impact is very small or negligible. Previously, if SST data was misplaced or re-arranged by the storage layer, it could pass block checksum with higher than 1 in 4 billion probability. With format_version=6, block checksums depend on which file they are in and the location within the file. This way, misplaced SST data is no more likely to pass checksum verification than randomly corrupted data. Also in format_version=6, SST footers are checksum-protected.

@@ -17,7 +17,7 @@ There are few options when compiling RocksDB:
 * `make check` will compile and run all the unit tests. `make check` will compile RocksDB in debug mode.

 * `make all` will compile our static library, and all our tools and unit tests. Our tools
-depend on gflags. You will need to have gflags installed to run `make all`. This will compile RocksDB in debug mode. Don't
+depend on gflags 2.2.0 or newer. You will need to have gflags installed to run `make all`. This will compile RocksDB in debug mode. Don't
 use binaries compiled by `make all` in production.

 * By default the binary we produce is optimized for the CPU you're compiling on
@@ -77,7 +77,7 @@ most processors made since roughly 2013.
     git clone https://github.com/gflags/gflags.git
     cd gflags
-    git checkout v2.0
+    git checkout v2.2.0
     ./configure && make && sudo make install

 **Notice**: Once installed, please add the include path for gflags to your `CPATH` environment variable and the

Makefile (3 changes)

@@ -1885,6 +1885,9 @@ compressed_secondary_cache_test: $(OBJ_DIR)/cache/compressed_secondary_cache_tes
 lru_cache_test: $(OBJ_DIR)/cache/lru_cache_test.o $(TEST_LIBRARY) $(LIBRARY)
 	$(AM_LINK)

+tiered_secondary_cache_test: $(OBJ_DIR)/cache/tiered_secondary_cache_test.o $(TEST_LIBRARY) $(LIBRARY)
+	$(AM_LINK)
+
 range_del_aggregator_test: $(OBJ_DIR)/db/range_del_aggregator_test.o $(TEST_LIBRARY) $(LIBRARY)
 	$(AM_LINK)

TARGETS (10 changes)

@ -3,8 +3,6 @@
|
|||
# --> DO NOT EDIT MANUALLY <--
|
||||
# This file is a Facebook-specific integration for buck builds, so can
|
||||
# only be validated by Facebook employees.
|
||||
#
|
||||
# @noautodeps @nocodemods
|
||||
load("//rocks/buckifier:defs.bzl", "cpp_library_wrapper","rocks_cpp_library_wrapper","cpp_binary_wrapper","cpp_unittest_wrapper","fancy_bench_wrapper","add_c_test_wrapper")
|
||||
|
||||
|
||||
|
@ -21,6 +19,7 @@ cpp_library_wrapper(name="rocksdb_lib", srcs=[
|
|||
"cache/secondary_cache.cc",
|
||||
"cache/secondary_cache_adapter.cc",
|
||||
"cache/sharded_cache.cc",
|
||||
"cache/tiered_secondary_cache.cc",
|
||||
"db/arena_wrapped_db_iter.cc",
|
||||
"db/blob/blob_contents.cc",
|
||||
"db/blob/blob_fetcher.cc",
|
||||
|
@ -394,6 +393,7 @@ rocks_cpp_library_wrapper(name="rocksdb_stress_lib", srcs=[
|
|||
"db_stress_tool/db_stress_stat.cc",
|
||||
"db_stress_tool/db_stress_test_base.cc",
|
||||
"db_stress_tool/db_stress_tool.cc",
|
||||
"db_stress_tool/db_stress_wide_merge_operator.cc",
|
||||
"db_stress_tool/expected_state.cc",
|
||||
"db_stress_tool/expected_value.cc",
|
||||
"db_stress_tool/multi_ops_txns_stress.cc",
|
||||
|
@ -5475,6 +5475,12 @@ cpp_unittest_wrapper(name="tiered_compaction_test",
|
|||
extra_compiler_flags=[])
|
||||
|
||||
|
||||
cpp_unittest_wrapper(name="tiered_secondary_cache_test",
|
||||
srcs=["cache/tiered_secondary_cache_test.cc"],
|
||||
deps=[":rocksdb_test_lib"],
|
||||
extra_compiler_flags=[])
|
||||
|
||||
|
||||
cpp_unittest_wrapper(name="timer_queue_test",
|
||||
srcs=["util/timer_queue_test.cc"],
|
||||
deps=[":rocksdb_test_lib"],
|
||||
|
|
|
@ -6,8 +6,6 @@ rocksdb_target_header_template = """# This file \100generated by:
|
|||
# --> DO NOT EDIT MANUALLY <--
|
||||
# This file is a Facebook-specific integration for buck builds, so can
|
||||
# only be validated by Facebook employees.
|
||||
#
|
||||
# @noautodeps @nocodemods
|
||||
load("//rocks/buckifier:defs.bzl", "cpp_library_wrapper","rocks_cpp_library_wrapper","cpp_binary_wrapper","cpp_unittest_wrapper","fancy_bench_wrapper","add_c_test_wrapper")
|
||||
|
||||
"""
|
||||
|
|
|
@ -66,6 +66,41 @@ static std::unordered_map<std::string, OptionTypeInfo>
|
|||
OptionTypeFlags::kMutable}},
|
||||
};
|
||||
|
||||
namespace {
|
||||
static void NoopDelete(Cache::ObjectPtr /*obj*/,
|
||||
MemoryAllocator* /*allocator*/) {
|
||||
assert(false);
|
||||
}
|
||||
|
||||
static size_t SliceSize(Cache::ObjectPtr obj) {
|
||||
return static_cast<Slice*>(obj)->size();
|
||||
}
|
||||
|
||||
static Status SliceSaveTo(Cache::ObjectPtr from_obj, size_t from_offset,
|
||||
size_t length, char* out) {
|
||||
const Slice& slice = *static_cast<Slice*>(from_obj);
|
||||
std::memcpy(out, slice.data() + from_offset, length);
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
static Status NoopCreate(const Slice& /*data*/, CompressionType /*type*/,
|
||||
CacheTier /*source*/, Cache::CreateContext* /*ctx*/,
|
||||
MemoryAllocator* /*allocator*/,
|
||||
Cache::ObjectPtr* /*out_obj*/,
|
||||
size_t* /*out_charge*/) {
|
||||
assert(false);
|
||||
return Status::NotSupported();
|
||||
}
|
||||
|
||||
static Cache::CacheItemHelper kBasicCacheItemHelper(CacheEntryRole::kMisc,
|
||||
&NoopDelete);
|
||||
} // namespace
|
||||
|
||||
const Cache::CacheItemHelper kSliceCacheItemHelper{
|
||||
CacheEntryRole::kMisc, &NoopDelete, &SliceSize,
|
||||
&SliceSaveTo, &NoopCreate, &kBasicCacheItemHelper,
|
||||
};
|
||||
|
||||
Status SecondaryCache::CreateFromString(
|
||||
const ConfigOptions& config_options, const std::string& value,
|
||||
std::shared_ptr<SecondaryCache>* result) {
|
||||
|
|
|
@ -290,7 +290,8 @@ Status SaveToFn(Cache::ObjectPtr from_obj, size_t /*from_offset*/,
|
|||
return Status::OK();
|
||||
}
|
||||
|
||||
Status CreateFn(const Slice& data, Cache::CreateContext* /*context*/,
|
||||
Status CreateFn(const Slice& data, CompressionType /*type*/,
|
||||
CacheTier /*source*/, Cache::CreateContext* /*context*/,
|
||||
MemoryAllocator* /*allocator*/, Cache::ObjectPtr* out_obj,
|
||||
size_t* out_charge) {
|
||||
*out_obj = new char[data.size()];
|
||||
|
|
|
@ -25,7 +25,8 @@ Status WarmInCache(Cache* cache, const Slice& key, const Slice& saved,
|
|||
assert(helper->create_cb);
|
||||
Cache::ObjectPtr value;
|
||||
size_t charge;
|
||||
Status st = helper->create_cb(saved, create_context,
|
||||
Status st = helper->create_cb(saved, CompressionType::kNoCompression,
|
||||
CacheTier::kVolatileTier, create_context,
|
||||
cache->memory_allocator(), &value, &charge);
|
||||
if (st.ok()) {
|
||||
st =
|
||||
|
|
|
@ -273,9 +273,10 @@ class ConcurrentCacheReservationManager
|
|||
std::size_t total_mem_used = cache_res_mgr_->GetTotalMemoryUsed();
|
||||
Status s;
|
||||
if (!increase) {
|
||||
assert(total_mem_used >= memory_used_delta);
|
||||
s = cache_res_mgr_->UpdateCacheReservation(total_mem_used -
|
||||
memory_used_delta);
|
||||
s = cache_res_mgr_->UpdateCacheReservation(
|
||||
(total_mem_used > memory_used_delta)
|
||||
? (total_mem_used - memory_used_delta)
|
||||
: 0);
|
||||
} else {
|
||||
s = cache_res_mgr_->UpdateCacheReservation(total_mem_used +
|
||||
memory_used_delta);
|
||||
|
|
|
@ -19,8 +19,10 @@ ChargedCache::ChargedCache(std::shared_ptr<Cache> cache,
|
|||
|
||||
Status ChargedCache::Insert(const Slice& key, ObjectPtr obj,
|
||||
const CacheItemHelper* helper, size_t charge,
|
||||
Handle** handle, Priority priority) {
|
||||
Status s = target_->Insert(key, obj, helper, charge, handle, priority);
|
||||
Handle** handle, Priority priority,
|
||||
const Slice& compressed_val, CompressionType type) {
|
||||
Status s = target_->Insert(key, obj, helper, charge, handle, priority,
|
||||
compressed_val, type);
|
||||
if (s.ok()) {
|
||||
// Insert may cause the cache entry eviction if the cache is full. So we
|
||||
// directly call the reservation manager to update the total memory used
|
||||
|
|
|
@ -22,9 +22,11 @@ class ChargedCache : public CacheWrapper {
|
|||
ChargedCache(std::shared_ptr<Cache> cache,
|
||||
std::shared_ptr<Cache> block_cache);
|
||||
|
||||
Status Insert(const Slice& key, ObjectPtr obj, const CacheItemHelper* helper,
|
||||
size_t charge, Handle** handle = nullptr,
|
||||
Priority priority = Priority::LOW) override;
|
||||
Status Insert(
|
||||
const Slice& key, ObjectPtr obj, const CacheItemHelper* helper,
|
||||
size_t charge, Handle** handle = nullptr,
|
||||
Priority priority = Priority::LOW, const Slice& compressed_val = Slice(),
|
||||
CompressionType type = CompressionType::kNoCompression) override;
|
||||
|
||||
Cache::Handle* Lookup(const Slice& key, const CacheItemHelper* helper,
|
||||
CreateContext* create_context,
|
||||
|
|
|
@ -2218,6 +2218,9 @@ bool AutoHyperClockTable::Grow(InsertState& state) {
|
|||
// forward" due to length_info_ being out-of-date.
|
||||
CatchUpLengthInfoNoWait(grow_home);
|
||||
|
||||
// See usage in DoInsert()
|
||||
state.likely_empty_slot = grow_home;
|
||||
|
||||
// Success
|
||||
return true;
|
||||
}
|
||||
|
@ -2847,14 +2850,15 @@ AutoHyperClockTable::HandleImpl* AutoHyperClockTable::DoInsert(
|
|||
// We could go searching through the chain for any duplicate, but that's
|
||||
// not typically helpful, except for the REDUNDANT block cache stats.
|
||||
// (Inferior duplicates will age out with eviction.) However, we do skip
|
||||
// insertion if the home slot already has a match (already_matches below),
|
||||
// so that we keep better CPU cache locality when we can.
|
||||
// insertion if the home slot (or some other we happen to probe) already
|
||||
// has a match (already_matches below). This helps to keep better locality
|
||||
// when we can.
|
||||
//
|
||||
// And we can do that as part of searching for an available slot to
|
||||
// insert the new entry, because our preferred location and first slot
|
||||
// checked will be the home slot.
|
||||
//
|
||||
// As the table initially grows to size few entries will be in the same
|
||||
// As the table initially grows to size, few entries will be in the same
|
||||
// cache line as the chain head. However, churn in the cache relatively
|
||||
// quickly improves the proportion of entries sharing that cache line with
|
||||
// the chain head. Data:
|
||||
|
@ -2877,12 +2881,22 @@ AutoHyperClockTable::HandleImpl* AutoHyperClockTable::DoInsert(
|
|||
|
||||
size_t idx = home;
|
||||
bool already_matches = false;
|
||||
if (!TryInsert(proto, arr[idx], initial_countdown, take_ref,
|
||||
&already_matches)) {
|
||||
if (already_matches) {
|
||||
return nullptr;
|
||||
}
|
||||
|
||||
bool already_matches_ignore = false;
|
||||
if (TryInsert(proto, arr[idx], initial_countdown, take_ref,
|
||||
&already_matches)) {
|
||||
assert(idx == home);
|
||||
} else if (already_matches) {
|
||||
return nullptr;
|
||||
// Here we try to populate newly-opened slots in the table, but not
|
||||
// when we can add something to its home slot. This makes the structure
|
||||
// more performant more quickly on (initial) growth. We ignore "already
|
||||
// matches" in this case because it is unlikely and difficult to
|
||||
// incorporate logic for here cleanly and efficiently.
|
||||
} else if (UNLIKELY(state.likely_empty_slot > 0) &&
|
||||
TryInsert(proto, arr[state.likely_empty_slot], initial_countdown,
|
||||
take_ref, &already_matches_ignore)) {
|
||||
idx = state.likely_empty_slot;
|
||||
} else {
|
||||
// We need to search for an available slot outside of the home.
|
||||
// Linear hashing provides nice resizing but does typically mean
|
||||
// that some heads (home locations) have (in expectation) twice as
|
||||
|
@ -2892,54 +2906,28 @@ AutoHyperClockTable::HandleImpl* AutoHyperClockTable::DoInsert(
|
|||
//
|
||||
// This means that if we just use linear probing (by a small constant)
|
||||
// to find an available slot, part of the structure could easily fill up
|
||||
// and resot to linear time operations even when the overall load factor
|
||||
// and resort to linear time operations even when the overall load factor
|
||||
// is only modestly high, like 70%. Even though each slot has its own CPU
|
||||
// cache line, there is likely a small locality benefit (e.g. TLB and
|
||||
// paging) to iterating one by one, but obviously not with the linear
|
||||
// hashing imbalance.
|
||||
// cache line, there appears to be a small locality benefit (e.g. TLB and
|
||||
// paging) to iterating one by one, as long as we don't run afoul of the
|
||||
// linear hashing imbalance.
|
||||
//
|
||||
// In a traditional non-concurrent structure, we could keep a "free list"
|
||||
// to ensure immediate access to an available slot, but maintaining such
|
||||
// a structure could require more cross-thread coordination to ensure
|
||||
// all entries are eventually available to all threads.
|
||||
//
|
||||
// The way we solve this problem is to use linear probing but try to
|
||||
// correct for the linear hashing imbalance (when probing beyond the
|
||||
// home slot). If the home is high load (minimum shift) we choose an
|
||||
// alternate location, uniformly among all slots, to linear probe from.
|
||||
//
|
||||
// Supporting data: we can use FixedHyperClockCache to get a baseline
|
||||
// of near-ideal distribution of occupied slots, with its uniform
|
||||
// distribution and double hashing.
|
||||
// $ ./cache_bench -cache_type=fixed_hyper_clock_cache -histograms=0
|
||||
// -cache_size=1300000000
|
||||
// ...
|
||||
// Slot occupancy stats: Overall 59% (156629/262144),
|
||||
// Min/Max/Window = 47%/70%/500, MaxRun{Pos/Neg} = 22/15
|
||||
//
|
||||
// Now we can try various sizes between powers of two with AutoHCC to see
|
||||
// how bad the MaxRun can be.
|
||||
// $ for I in `seq 8 15`; do
|
||||
// ./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0
|
||||
// -cache_size=${I}00000000 2>&1 | grep clock_cache.cc; done
|
||||
// where the worst case MaxRun was with I=11:
|
||||
// Slot occupancy stats: Overall 59% (132528/221094),
|
||||
// Min/Max/Window = 44%/73%/500, MaxRun{Pos/Neg} = 64/19
|
||||
//
|
||||
// The large table size offers a large sample size to be confident that
|
||||
// this is an acceptable level of clustering (max ~3x probe length)
|
||||
// compared to no clustering. Increasing the max load factor to 0.7
|
||||
// increases the MaxRun above 100, potentially much closer to a tipping
|
||||
// point.
|
||||
|
||||
// TODO? remember a freed entry from eviction, possibly in thread local
|
||||
|
||||
size_t start = home;
|
||||
if (orig_home_shift == LengthInfoToMinShift(state.saved_length_info)) {
|
||||
start = FastRange64(proto.hashed_key[0], used_length);
|
||||
}
|
||||
idx = start;
|
||||
for (int cycles = 0;;) {
|
||||
// The way we solve this problem is to use unit-increment linear probing
|
||||
// with a small bound, and then fall back on big jumps to have a good
|
||||
// chance of finding a slot in an under-populated region quickly if that
|
||||
// doesn't work.
|
||||
size_t i = 0;
|
||||
constexpr size_t kMaxLinearProbe = 4;
|
||||
for (; i < kMaxLinearProbe; i++) {
|
||||
idx++;
|
||||
if (idx >= used_length) {
|
||||
idx -= used_length;
|
||||
}
|
||||
if (TryInsert(proto, arr[idx], initial_countdown, take_ref,
|
||||
&already_matches)) {
|
||||
break;
|
||||
|
@ -2947,26 +2935,59 @@ AutoHyperClockTable::HandleImpl* AutoHyperClockTable::DoInsert(
|
|||
if (already_matches) {
|
||||
return nullptr;
|
||||
}
|
||||
++idx;
|
||||
if (idx >= used_length) {
|
||||
// In case the structure has grown, double-check
|
||||
StartInsert(state);
|
||||
used_length = LengthInfoToUsedLength(state.saved_length_info);
|
||||
}
|
||||
if (i == kMaxLinearProbe) {
|
||||
// Keep searching, but change to a search method that should quickly
|
||||
// find any under-populated region. Switching to an increment based
|
||||
// on the golden ratio helps with that, but we also inject some minor
|
||||
// variation (less than 2%, 1 in 2^6) to avoid clustering effects on
|
||||
// this larger increment (if it were a fixed value in steady state
|
||||
// operation). Here we are primarily using upper bits of hashed_key[1]
|
||||
// while home is based on lowest bits.
|
||||
uint64_t incr_ratio = 0x9E3779B185EBCA87U + (proto.hashed_key[1] >> 6);
|
||||
size_t incr = FastRange64(incr_ratio, used_length);
|
||||
assert(incr > 0);
|
||||
size_t start = idx;
|
||||
for (;; i++) {
|
||||
idx += incr;
|
||||
if (idx >= used_length) {
|
||||
idx = 0;
|
||||
// Wrap around (faster than %)
|
||||
idx -= used_length;
|
||||
}
|
||||
}
|
||||
if (idx == start) {
|
||||
// Cycling back should not happen unless there is enough random churn
|
||||
// in parallel that we happen to hit each slot at a time that it's
|
||||
// occupied, which is really only feasible for small structures, though
|
||||
// with linear probing to find empty slots, "small" here might be
|
||||
// larger than for double hashing.
|
||||
assert(used_length <= 256);
|
||||
++cycles;
|
||||
if (cycles > 2) {
|
||||
// Fall back on standalone insert in case something goes awry to
|
||||
// cause this
|
||||
if (idx == start) {
|
||||
// We have just completed a cycle that might not have covered all
|
||||
// slots. (incr and used_length could have common factors.)
|
||||
// Increment for the next cycle, which eventually ensures complete
|
||||
// iteration over the set of slots before repeating.
|
||||
idx++;
|
||||
if (idx >= used_length) {
|
||||
idx -= used_length;
|
||||
}
|
||||
start++;
|
||||
if (start >= used_length) {
|
||||
start -= used_length;
|
||||
}
|
||||
if (i >= used_length) {
|
||||
used_length = LengthInfoToUsedLength(
|
||||
length_info_.load(std::memory_order_acquire));
|
||||
if (i >= used_length * 2) {
|
||||
// Cycling back should not happen unless there is enough random
|
||||
// churn in parallel that we happen to hit each slot at a time
|
||||
// that it's occupied, which is really only feasible for small
|
||||
// structures, though with linear probing to find empty slots,
|
||||
// "small" here might be larger than for double hashing.
|
||||
assert(used_length <= 256);
|
||||
// Fall back on standalone insert in case something goes awry to
|
||||
// cause this
|
||||
return nullptr;
|
||||
}
|
||||
}
|
||||
}
|
||||
if (TryInsert(proto, arr[idx], initial_countdown, take_ref,
|
||||
&already_matches)) {
|
||||
break;
|
||||
}
|
||||
if (already_matches) {
|
||||
return nullptr;
|
||||
}
|
||||
}
|
||||
|
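To make the probe order described in the comments above easier to follow, here is a compact standalone sketch; the function and parameter names are illustrative, it uses a plain modulo where the real code uses FastRange64, and it omits the cycle/start-shift and table-growth handling present in the loop above.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Try slots in the order DoInsert would: the home slot, a short unit-stride
// linear probe, then large per-key jumps derived from the golden-ratio
// constant so an under-populated region is reached quickly even when some
// home chains are overloaded by the linear-hashing imbalance.
bool ProbeForSlot(size_t home, size_t used_length, uint64_t hashed_key1,
                  const std::function<bool(size_t)>& try_insert_at) {
  size_t idx = home;
  if (try_insert_at(idx)) return true;            // preferred: the home slot
  constexpr size_t kMaxLinearProbe = 4;
  for (size_t i = 0; i < kMaxLinearProbe; ++i) {  // short unit-stride probe
    if (++idx >= used_length) idx -= used_length;
    if (try_insert_at(idx)) return true;
  }
  // Large jumps: golden-ratio constant plus a little per-key variation so the
  // increment is not the same for every key in steady state.
  uint64_t incr_ratio = 0x9E3779B185EBCA87ULL + (hashed_key1 >> 6);
  size_t incr = static_cast<size_t>(incr_ratio % used_length);
  if (incr == 0) incr = 1;
  for (size_t tries = 0; tries < 2 * used_length; ++tries) {
    idx += incr;
    if (idx >= used_length) idx -= used_length;   // wrap around, cheaper than %
    if (try_insert_at(idx)) return true;
  }
  return false;  // caller falls back on a standalone (detached) insert
}
```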
@ -3481,6 +3502,10 @@ void AutoHyperClockTable::Evict(size_t requested_charge, InsertState& state,
|
|||
|
||||
for (HandleImpl* h : to_finish_eviction) {
|
||||
TrackAndReleaseEvictedEntry(h, data);
|
||||
// NOTE: setting likely_empty_slot here can cause us to reduce the
|
||||
// portion of "at home" entries, probably because an evicted entry
|
||||
// is more likely to come back than a random new entry and would be
|
||||
// unable to go into its home slot.
|
||||
}
|
||||
to_finish_eviction.clear();
|
||||
|
||||
|
|
|
@ -822,6 +822,7 @@ class AutoHyperClockTable : public BaseClockTable {
|
|||
// For BaseClockTable::Insert
|
||||
struct InsertState {
|
||||
uint64_t saved_length_info = 0;
|
||||
size_t likely_empty_slot = 0;
|
||||
};
|
||||
|
||||
void StartInsert(InsertState& state);
|
||||
|
|
|
@ -11,6 +11,7 @@
|
|||
|
||||
#include "memory/memory_allocator_impl.h"
|
||||
#include "monitoring/perf_context_imp.h"
|
||||
#include "util/coding.h"
|
||||
#include "util/compression.h"
|
||||
#include "util/string_util.h"
|
||||
|
||||
|
@ -22,17 +23,20 @@ CompressedSecondaryCache::CompressedSecondaryCache(
|
|||
cache_options_(opts),
|
||||
cache_res_mgr_(std::make_shared<ConcurrentCacheReservationManager>(
|
||||
std::make_shared<CacheReservationManagerImpl<CacheEntryRole::kMisc>>(
|
||||
cache_))) {}
|
||||
cache_))),
|
||||
disable_cache_(opts.capacity == 0) {}
|
||||
|
||||
CompressedSecondaryCache::~CompressedSecondaryCache() {
|
||||
assert(cache_res_mgr_->GetTotalReservedCacheSize() == 0);
|
||||
}
|
||||
CompressedSecondaryCache::~CompressedSecondaryCache() {}
|
||||
|
||||
std::unique_ptr<SecondaryCacheResultHandle> CompressedSecondaryCache::Lookup(
|
||||
const Slice& key, const Cache::CacheItemHelper* helper,
|
||||
Cache::CreateContext* create_context, bool /*wait*/, bool advise_erase,
|
||||
bool& kept_in_sec_cache) {
|
||||
assert(helper);
|
||||
if (disable_cache_) {
|
||||
return nullptr;
|
||||
}
|
||||
|
||||
std::unique_ptr<SecondaryCacheResultHandle> handle;
|
||||
kept_in_sec_cache = false;
|
||||
Cache::Handle* lru_handle = cache_->Lookup(key);
|
||||
|
@ -49,40 +53,65 @@ std::unique_ptr<SecondaryCacheResultHandle> CompressedSecondaryCache::Lookup(
|
|||
CacheAllocationPtr* ptr{nullptr};
|
||||
CacheAllocationPtr merged_value;
|
||||
size_t handle_value_charge{0};
|
||||
const char* data_ptr = nullptr;
|
||||
CacheTier source = CacheTier::kVolatileCompressedTier;
|
||||
CompressionType type = cache_options_.compression_type;
|
||||
if (cache_options_.enable_custom_split_merge) {
|
||||
CacheValueChunk* value_chunk_ptr =
|
||||
reinterpret_cast<CacheValueChunk*>(handle_value);
|
||||
merged_value = MergeChunksIntoValue(value_chunk_ptr, handle_value_charge);
|
||||
ptr = &merged_value;
|
||||
data_ptr = ptr->get();
|
||||
} else {
|
||||
uint32_t type_32 = static_cast<uint32_t>(type);
|
||||
uint32_t source_32 = static_cast<uint32_t>(source);
|
||||
ptr = reinterpret_cast<CacheAllocationPtr*>(handle_value);
|
||||
handle_value_charge = cache_->GetCharge(lru_handle);
|
||||
data_ptr = ptr->get();
|
||||
data_ptr = GetVarint32Ptr(data_ptr, data_ptr + 1,
|
||||
static_cast<uint32_t*>(&type_32));
|
||||
type = static_cast<CompressionType>(type_32);
|
||||
data_ptr = GetVarint32Ptr(data_ptr, data_ptr + 1,
|
||||
static_cast<uint32_t*>(&source_32));
|
||||
source = static_cast<CacheTier>(source_32);
|
||||
handle_value_charge -= (data_ptr - ptr->get());
|
||||
}
|
||||
MemoryAllocator* allocator = cache_options_.memory_allocator.get();
|
||||
|
||||
Status s;
|
||||
Cache::ObjectPtr value{nullptr};
|
||||
size_t charge{0};
|
||||
if (cache_options_.compression_type == kNoCompression ||
|
||||
cache_options_.do_not_compress_roles.Contains(helper->role)) {
|
||||
s = helper->create_cb(Slice(ptr->get(), handle_value_charge),
|
||||
create_context, allocator, &value, &charge);
|
||||
} else {
|
||||
UncompressionContext uncompression_context(cache_options_.compression_type);
|
||||
UncompressionInfo uncompression_info(uncompression_context,
|
||||
UncompressionDict::GetEmptyDict(),
|
||||
cache_options_.compression_type);
|
||||
if (source == CacheTier::kVolatileCompressedTier) {
|
||||
if (cache_options_.compression_type == kNoCompression ||
|
||||
cache_options_.do_not_compress_roles.Contains(helper->role)) {
|
||||
s = helper->create_cb(Slice(data_ptr, handle_value_charge),
|
||||
kNoCompression, CacheTier::kVolatileTier,
|
||||
create_context, allocator, &value, &charge);
|
||||
} else {
|
||||
UncompressionContext uncompression_context(
|
||||
cache_options_.compression_type);
|
||||
UncompressionInfo uncompression_info(uncompression_context,
|
||||
UncompressionDict::GetEmptyDict(),
|
||||
cache_options_.compression_type);
|
||||
|
||||
size_t uncompressed_size{0};
|
||||
CacheAllocationPtr uncompressed = UncompressData(
|
||||
uncompression_info, (char*)ptr->get(), handle_value_charge,
|
||||
&uncompressed_size, cache_options_.compress_format_version, allocator);
|
||||
size_t uncompressed_size{0};
|
||||
CacheAllocationPtr uncompressed =
|
||||
UncompressData(uncompression_info, (char*)data_ptr,
|
||||
handle_value_charge, &uncompressed_size,
|
||||
cache_options_.compress_format_version, allocator);
|
||||
|
||||
if (!uncompressed) {
|
||||
cache_->Release(lru_handle, /*erase_if_last_ref=*/true);
|
||||
return nullptr;
|
||||
if (!uncompressed) {
|
||||
cache_->Release(lru_handle, /*erase_if_last_ref=*/true);
|
||||
return nullptr;
|
||||
}
|
||||
s = helper->create_cb(Slice(uncompressed.get(), uncompressed_size),
|
||||
kNoCompression, CacheTier::kVolatileTier,
|
||||
create_context, allocator, &value, &charge);
|
||||
}
|
||||
s = helper->create_cb(Slice(uncompressed.get(), uncompressed_size),
|
||||
} else {
|
||||
// The item was not compressed by us. Let the helper create_cb
|
||||
// uncompress it
|
||||
s = helper->create_cb(Slice(data_ptr, handle_value_charge), type, source,
|
||||
create_context, allocator, &value, &charge);
|
||||
}
|
||||
|
||||
|
@ -107,41 +136,56 @@ std::unique_ptr<SecondaryCacheResultHandle> CompressedSecondaryCache::Lookup(
|
|||
return handle;
|
||||
}
|
||||
|
||||
Status CompressedSecondaryCache::Insert(const Slice& key,
|
||||
Cache::ObjectPtr value,
|
||||
const Cache::CacheItemHelper* helper,
|
||||
bool force_insert) {
|
||||
if (value == nullptr) {
|
||||
return Status::InvalidArgument();
|
||||
bool CompressedSecondaryCache::MaybeInsertDummy(const Slice& key) {
|
||||
auto internal_helper = GetHelper(cache_options_.enable_custom_split_merge);
|
||||
Cache::Handle* lru_handle = cache_->Lookup(key);
|
||||
if (lru_handle == nullptr) {
|
||||
PERF_COUNTER_ADD(compressed_sec_cache_insert_dummy_count, 1);
|
||||
// Insert a dummy handle if the handle is evicted for the first time.
|
||||
cache_->Insert(key, /*obj=*/nullptr, internal_helper, /*charge=*/0)
|
||||
.PermitUncheckedError();
|
||||
return true;
|
||||
} else {
|
||||
cache_->Release(lru_handle, /*erase_if_last_ref=*/false);
|
||||
}
|
||||
|
||||
return false;
|
||||
}
|
||||
|
||||
Status CompressedSecondaryCache::InsertInternal(
|
||||
const Slice& key, Cache::ObjectPtr value,
|
||||
const Cache::CacheItemHelper* helper, CompressionType type,
|
||||
CacheTier source) {
|
||||
if (source != CacheTier::kVolatileCompressedTier &&
|
||||
cache_options_.enable_custom_split_merge) {
|
||||
// We don't support custom split/merge for the tiered case
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
auto internal_helper = GetHelper(cache_options_.enable_custom_split_merge);
|
||||
if (!force_insert) {
|
||||
Cache::Handle* lru_handle = cache_->Lookup(key);
|
||||
if (lru_handle == nullptr) {
|
||||
PERF_COUNTER_ADD(compressed_sec_cache_insert_dummy_count, 1);
|
||||
// Insert a dummy handle if the handle is evicted for the first time.
|
||||
return cache_->Insert(key, /*obj=*/nullptr, internal_helper,
|
||||
/*charge=*/0);
|
||||
} else {
|
||||
cache_->Release(lru_handle, /*erase_if_last_ref=*/false);
|
||||
}
|
||||
}
|
||||
char header[10];
|
||||
char* payload = header;
|
||||
payload = EncodeVarint32(payload, static_cast<uint32_t>(type));
|
||||
payload = EncodeVarint32(payload, static_cast<uint32_t>(source));
|
||||
|
||||
size_t size = (*helper->size_cb)(value);
|
||||
size_t header_size = payload - header;
|
||||
size_t data_size = (*helper->size_cb)(value);
|
||||
size_t total_size = data_size + header_size;
|
||||
CacheAllocationPtr ptr =
|
||||
AllocateBlock(size, cache_options_.memory_allocator.get());
|
||||
AllocateBlock(total_size, cache_options_.memory_allocator.get());
|
||||
char* data_ptr = ptr.get() + header_size;
|
||||
|
||||
Status s = (*helper->saveto_cb)(value, 0, size, ptr.get());
|
||||
Status s = (*helper->saveto_cb)(value, 0, data_size, data_ptr);
|
||||
if (!s.ok()) {
|
||||
return s;
|
||||
}
|
||||
Slice val(ptr.get(), size);
|
||||
Slice val(data_ptr, data_size);
|
||||
|
||||
std::string compressed_val;
|
||||
if (cache_options_.compression_type != kNoCompression &&
|
||||
type == kNoCompression &&
|
||||
!cache_options_.do_not_compress_roles.Contains(helper->role)) {
|
||||
PERF_COUNTER_ADD(compressed_sec_cache_uncompressed_bytes, size);
|
||||
PERF_COUNTER_ADD(compressed_sec_cache_uncompressed_bytes, data_size);
|
||||
CompressionOptions compression_opts;
|
||||
CompressionContext compression_context(cache_options_.compression_type,
|
||||
compression_opts);
|
||||
|
@ -159,12 +203,14 @@ Status CompressedSecondaryCache::Insert(const Slice& key,
|
|||
}
|
||||
|
||||
val = Slice(compressed_val);
|
||||
size = compressed_val.size();
|
||||
PERF_COUNTER_ADD(compressed_sec_cache_compressed_bytes, size);
|
||||
data_size = compressed_val.size();
|
||||
total_size = header_size + data_size;
|
||||
PERF_COUNTER_ADD(compressed_sec_cache_compressed_bytes, data_size);
|
||||
|
||||
if (!cache_options_.enable_custom_split_merge) {
|
||||
ptr = AllocateBlock(size, cache_options_.memory_allocator.get());
|
||||
memcpy(ptr.get(), compressed_val.data(), size);
|
||||
ptr = AllocateBlock(total_size, cache_options_.memory_allocator.get());
|
||||
data_ptr = ptr.get() + header_size;
|
||||
memcpy(data_ptr, compressed_val.data(), data_size);
|
||||
}
|
||||
}
|
||||
|
||||
|
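For clarity, a standalone illustration of the entry layout that InsertInternal() writes and Lookup() parses back: a small header carrying the compression type and source tier as varint32s, followed by the (possibly compressed) payload. The varint helpers below are minimal stand-ins for the ones in util/coding.h, not RocksDB code.

```cpp
#include <cstdint>
#include <string>

// Minimal varint32 encode/decode; CompressionType and CacheTier fit in one byte.
static char* PutVarint32(char* dst, uint32_t v) {
  while (v >= 0x80) { *dst++ = static_cast<char>(v | 0x80); v >>= 7; }
  *dst++ = static_cast<char>(v);
  return dst;
}
static const char* GetVarint32(const char* p, uint32_t* v) {
  uint32_t result = 0;
  for (int shift = 0;; shift += 7) {
    uint8_t byte = static_cast<uint8_t>(*p++);
    result |= static_cast<uint32_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) break;
  }
  *v = result;
  return p;
}

// [varint type][varint source][payload]; the charge stored in the LRU cache
// covers the whole block, so Lookup subtracts the header size back out.
std::string PackEntry(uint32_t type, uint32_t source, const std::string& data) {
  char header[10];
  char* end = PutVarint32(PutVarint32(header, type), source);
  return std::string(header, end - header) + data;
}
void UnpackEntry(const std::string& block, uint32_t* type, uint32_t* source,
                 std::string* data) {
  const char* p = GetVarint32(GetVarint32(block.data(), type), source);
  *data = block.substr(p - block.data());
}
```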
@ -175,17 +221,52 @@ Status CompressedSecondaryCache::Insert(const Slice& key,
|
|||
SplitValueIntoChunks(val, cache_options_.compression_type, charge);
|
||||
return cache_->Insert(key, value_chunks_head, internal_helper, charge);
|
||||
} else {
|
||||
std::memcpy(ptr.get(), header, header_size);
|
||||
CacheAllocationPtr* buf = new CacheAllocationPtr(std::move(ptr));
|
||||
return cache_->Insert(key, buf, internal_helper, size);
|
||||
return cache_->Insert(key, buf, internal_helper, total_size);
|
||||
}
|
||||
}
|
||||
|
||||
Status CompressedSecondaryCache::Insert(const Slice& key,
|
||||
Cache::ObjectPtr value,
|
||||
const Cache::CacheItemHelper* helper,
|
||||
bool force_insert) {
|
||||
if (value == nullptr) {
|
||||
return Status::InvalidArgument();
|
||||
}
|
||||
|
||||
if (!force_insert && MaybeInsertDummy(key)) {
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
return InsertInternal(key, value, helper, kNoCompression,
|
||||
CacheTier::kVolatileCompressedTier);
|
||||
}
|
||||
|
||||
Status CompressedSecondaryCache::InsertSaved(
|
||||
const Slice& key, const Slice& saved, CompressionType type = kNoCompression,
|
||||
CacheTier source = CacheTier::kVolatileTier) {
|
||||
if (type == kNoCompression) {
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
auto slice_helper = &kSliceCacheItemHelper;
|
||||
if (MaybeInsertDummy(key)) {
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
return InsertInternal(
|
||||
key, static_cast<Cache::ObjectPtr>(const_cast<Slice*>(&saved)),
|
||||
slice_helper, type, source);
|
||||
}
|
||||
|
||||
void CompressedSecondaryCache::Erase(const Slice& key) { cache_->Erase(key); }
|
||||
|
||||
Status CompressedSecondaryCache::SetCapacity(size_t capacity) {
|
||||
MutexLock l(&capacity_mutex_);
|
||||
cache_options_.capacity = capacity;
|
||||
cache_->SetCapacity(capacity);
|
||||
disable_cache_ = capacity == 0;
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
|
|
|
@ -80,6 +80,9 @@ class CompressedSecondaryCache : public SecondaryCache {
|
|||
const Cache::CacheItemHelper* helper,
|
||||
bool force_insert) override;
|
||||
|
||||
Status InsertSaved(const Slice& key, const Slice& saved, CompressionType type,
|
||||
CacheTier source) override;
|
||||
|
||||
std::unique_ptr<SecondaryCacheResultHandle> Lookup(
|
||||
const Slice& key, const Cache::CacheItemHelper* helper,
|
||||
Cache::CreateContext* create_context, bool /*wait*/, bool advise_erase,
|
||||
|
@ -130,12 +133,19 @@ class CompressedSecondaryCache : public SecondaryCache {
|
|||
CacheAllocationPtr MergeChunksIntoValue(const void* chunks_head,
|
||||
size_t& charge);
|
||||
|
||||
bool MaybeInsertDummy(const Slice& key);
|
||||
|
||||
Status InsertInternal(const Slice& key, Cache::ObjectPtr value,
|
||||
const Cache::CacheItemHelper* helper,
|
||||
CompressionType type, CacheTier source);
|
||||
|
||||
// TODO: clean up to use cleaner interfaces in typed_cache.h
|
||||
const Cache::CacheItemHelper* GetHelper(bool enable_custom_split_merge) const;
|
||||
std::shared_ptr<Cache> cache_;
|
||||
CompressedSecondaryCacheOptions cache_options_;
|
||||
mutable port::Mutex capacity_mutex_;
|
||||
std::shared_ptr<ConcurrentCacheReservationManager> cache_res_mgr_;
|
||||
bool disable_cache_;
|
||||
};
|
||||
|
||||
} // namespace ROCKSDB_NAMESPACE
|
||||
|
|
|
@ -989,11 +989,11 @@ class CompressedSecCacheTestWithTiered
|
|||
CompressedSecCacheTestWithTiered() {
|
||||
LRUCacheOptions lru_opts;
|
||||
HyperClockCacheOptions hcc_opts(
|
||||
/*_capacity=*/70 << 20,
|
||||
/*_capacity=*/0,
|
||||
/*_estimated_entry_charge=*/256 << 10,
|
||||
/*_num_shard_bits=*/0);
|
||||
TieredVolatileCacheOptions opts;
|
||||
lru_opts.capacity = 70 << 20;
|
||||
TieredCacheOptions opts;
|
||||
lru_opts.capacity = 0;
|
||||
lru_opts.num_shard_bits = 0;
|
||||
lru_opts.high_pri_pool_ratio = 0;
|
||||
opts.cache_type = std::get<0>(GetParam());
|
||||
|
@ -1004,9 +1004,11 @@ class CompressedSecCacheTestWithTiered
|
|||
}
|
||||
opts.adm_policy = std::get<1>(GetParam());
|
||||
;
|
||||
opts.comp_cache_opts.capacity = 30 << 20;
|
||||
opts.comp_cache_opts.capacity = 0;
|
||||
opts.comp_cache_opts.num_shard_bits = 0;
|
||||
cache_ = NewTieredVolatileCache(opts);
|
||||
opts.total_capacity = 100 << 20;
|
||||
opts.compressed_secondary_ratio = 0.3;
|
||||
cache_ = NewTieredCache(opts);
|
||||
cache_res_mgr_ =
|
||||
std::make_shared<CacheReservationManagerImpl<CacheEntryRole::kMisc>>(
|
||||
cache_);
|
||||
|
@ -1023,7 +1025,7 @@ class CompressedSecCacheTestWithTiered
|
|||
protected:
|
||||
CacheReservationManager* cache_res_mgr() { return cache_res_mgr_.get(); }
|
||||
|
||||
Cache* GetTieredCache() { return cache_.get(); }
|
||||
std::shared_ptr<Cache> GetTieredCache() { return cache_; }
|
||||
|
||||
Cache* GetCache() {
|
||||
return static_cast_with_check<CacheWithSecondaryAdapter, Cache>(
|
||||
|
@ -1110,7 +1112,7 @@ TEST_P(CompressedSecCacheTestWithTiered, AdmissionPolicy) {
|
|||
return;
|
||||
}
|
||||
|
||||
Cache* tiered_cache = GetTieredCache();
|
||||
Cache* tiered_cache = GetTieredCache().get();
|
||||
Cache* cache = GetCache();
|
||||
std::vector<CacheKey> keys;
|
||||
std::vector<std::string> vals;
|
||||
|
@ -1165,6 +1167,151 @@ TEST_P(CompressedSecCacheTestWithTiered, AdmissionPolicy) {
|
|||
ASSERT_EQ(handle1, nullptr);
|
||||
}
|
||||
|
||||
TEST_P(CompressedSecCacheTestWithTiered, DynamicUpdate) {
|
||||
CompressedSecondaryCache* sec_cache =
|
||||
reinterpret_cast<CompressedSecondaryCache*>(GetSecondaryCache());
|
||||
std::shared_ptr<Cache> tiered_cache = GetTieredCache();
|
||||
|
||||
// Use EXPECT_PRED3 instead of EXPECT_NEAR to avoid too many size_t to
|
||||
// double explicit casts
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, GetCache()->GetUsage(), (30 << 20),
|
||||
GetPercent(30 << 20, 1));
|
||||
size_t sec_capacity;
|
||||
ASSERT_OK(sec_cache->GetCapacity(sec_capacity));
|
||||
ASSERT_EQ(sec_capacity, (30 << 20));
|
||||
|
||||
ASSERT_OK(UpdateTieredCache(tiered_cache, 130 << 20));
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, GetCache()->GetUsage(), (39 << 20),
|
||||
GetPercent(39 << 20, 1));
|
||||
ASSERT_OK(sec_cache->GetCapacity(sec_capacity));
|
||||
ASSERT_EQ(sec_capacity, (39 << 20));
|
||||
|
||||
ASSERT_OK(UpdateTieredCache(tiered_cache, 70 << 20));
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, GetCache()->GetUsage(), (21 << 20),
|
||||
GetPercent(21 << 20, 1));
|
||||
ASSERT_OK(sec_cache->GetCapacity(sec_capacity));
|
||||
ASSERT_EQ(sec_capacity, (21 << 20));
|
||||
|
||||
ASSERT_OK(UpdateTieredCache(tiered_cache, 100 << 20));
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, GetCache()->GetUsage(), (30 << 20),
|
||||
GetPercent(30 << 20, 1));
|
||||
ASSERT_OK(sec_cache->GetCapacity(sec_capacity));
|
||||
ASSERT_EQ(sec_capacity, (30 << 20));
|
||||
|
||||
ASSERT_OK(UpdateTieredCache(tiered_cache, -1, 0.4));
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, GetCache()->GetUsage(), (40 << 20),
|
||||
GetPercent(40 << 20, 1));
|
||||
ASSERT_OK(sec_cache->GetCapacity(sec_capacity));
|
||||
ASSERT_EQ(sec_capacity, (40 << 20));
|
||||
|
||||
ASSERT_OK(UpdateTieredCache(tiered_cache, -1, 0.2));
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, GetCache()->GetUsage(), (20 << 20),
|
||||
GetPercent(20 << 20, 1));
|
||||
ASSERT_OK(sec_cache->GetCapacity(sec_capacity));
|
||||
ASSERT_EQ(sec_capacity, (20 << 20));
|
||||
|
||||
ASSERT_OK(UpdateTieredCache(tiered_cache, -1, 1.0));
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, GetCache()->GetUsage(), (100 << 20),
|
||||
GetPercent(100 << 20, 1));
|
||||
ASSERT_OK(sec_cache->GetCapacity(sec_capacity));
|
||||
ASSERT_EQ(sec_capacity, 100 << 20);
|
||||
|
||||
ASSERT_OK(UpdateTieredCache(tiered_cache, -1, 0.0));
|
||||
// Only check usage for LRU cache. HCC shows a 64KB usage for some reason
|
||||
if (std::get<0>(GetParam()) == PrimaryCacheType::kCacheTypeLRU) {
|
||||
ASSERT_EQ(GetCache()->GetUsage(), 0);
|
||||
}
|
||||
ASSERT_OK(sec_cache->GetCapacity(sec_capacity));
|
||||
ASSERT_EQ(sec_capacity, 0);
|
||||
|
||||
ASSERT_NOK(UpdateTieredCache(tiered_cache, -1, 0.3));
|
||||
// Only check usage for LRU cache. HCC shows a 64KB usage for some reason
|
||||
if (std::get<0>(GetParam()) == PrimaryCacheType::kCacheTypeLRU) {
|
||||
ASSERT_EQ(GetCache()->GetUsage(), 0);
|
||||
}
|
||||
ASSERT_OK(sec_cache->GetCapacity(sec_capacity));
|
||||
ASSERT_EQ(sec_capacity, 0);
|
||||
}
|
||||
|
||||
TEST_P(CompressedSecCacheTestWithTiered, DynamicUpdateWithReservation) {
|
||||
CompressedSecondaryCache* sec_cache =
|
||||
reinterpret_cast<CompressedSecondaryCache*>(GetSecondaryCache());
|
||||
std::shared_ptr<Cache> tiered_cache = GetTieredCache();
|
||||
|
||||
ASSERT_OK(cache_res_mgr()->UpdateCacheReservation(10 << 20));
|
||||
// Use EXPECT_PRED3 instead of EXPECT_NEAR to avoid too many size_t to
|
||||
// double explicit casts
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, GetCache()->GetUsage(), (37 << 20),
|
||||
GetPercent(37 << 20, 1));
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, sec_cache->TEST_GetUsage(), (3 << 20),
|
||||
GetPercent(3 << 20, 1));
|
||||
size_t sec_capacity;
|
||||
ASSERT_OK(sec_cache->GetCapacity(sec_capacity));
|
||||
ASSERT_EQ(sec_capacity, (30 << 20));
|
||||
|
||||
ASSERT_OK(UpdateTieredCache(tiered_cache, 70 << 20));
|
||||
// Only check usage for LRU cache. HCC is slightly off for some reason
|
||||
if (std::get<0>(GetParam()) == PrimaryCacheType::kCacheTypeLRU) {
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, GetCache()->GetUsage(), (28 << 20),
|
||||
GetPercent(28 << 20, 1));
|
||||
}
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, sec_cache->TEST_GetUsage(), (3 << 20),
|
||||
GetPercent(3 << 20, 1));
|
||||
ASSERT_OK(sec_cache->GetCapacity(sec_capacity));
|
||||
ASSERT_EQ(sec_capacity, (21 << 20));
|
||||
|
||||
ASSERT_OK(UpdateTieredCache(tiered_cache, 130 << 20));
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, GetCache()->GetUsage(), (46 << 20),
|
||||
GetPercent(46 << 20, 1));
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, sec_cache->TEST_GetUsage(), (3 << 20),
|
||||
GetPercent(3 << 20, 1));
|
||||
ASSERT_OK(sec_cache->GetCapacity(sec_capacity));
|
||||
ASSERT_EQ(sec_capacity, (39 << 20));
|
||||
|
||||
ASSERT_OK(UpdateTieredCache(tiered_cache, 100 << 20));
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, GetCache()->GetUsage(), (37 << 20),
|
||||
GetPercent(37 << 20, 1));
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, sec_cache->TEST_GetUsage(), (3 << 20),
|
||||
GetPercent(3 << 20, 1));
|
||||
ASSERT_OK(sec_cache->GetCapacity(sec_capacity));
|
||||
ASSERT_EQ(sec_capacity, (30 << 20));
|
||||
|
||||
ASSERT_OK(UpdateTieredCache(tiered_cache, -1, 0.39));
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, GetCache()->GetUsage(), (45 << 20),
|
||||
GetPercent(45 << 20, 1));
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, sec_cache->TEST_GetUsage(), (4 << 20),
|
||||
GetPercent(4 << 20, 1));
|
||||
ASSERT_OK(sec_cache->GetCapacity(sec_capacity));
|
||||
ASSERT_EQ(sec_capacity, (39 << 20));
|
||||
|
||||
ASSERT_OK(UpdateTieredCache(tiered_cache, -1, 0.2));
|
||||
// Only check usage for LRU cache. HCC is slightly off for some reason
|
||||
if (std::get<0>(GetParam()) == PrimaryCacheType::kCacheTypeLRU) {
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, GetCache()->GetUsage(), (28 << 20),
|
||||
GetPercent(28 << 20, 1));
|
||||
}
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, sec_cache->TEST_GetUsage(), (2 << 20),
|
||||
GetPercent(2 << 20, 1));
|
||||
ASSERT_OK(sec_cache->GetCapacity(sec_capacity));
|
||||
ASSERT_EQ(sec_capacity, (20 << 20));
|
||||
|
||||
ASSERT_OK(UpdateTieredCache(tiered_cache, -1, 1.0));
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, GetCache()->GetUsage(), (100 << 20),
|
||||
GetPercent(100 << 20, 1));
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, sec_cache->TEST_GetUsage(), (10 << 20),
|
||||
GetPercent(10 << 20, 1));
|
||||
ASSERT_OK(sec_cache->GetCapacity(sec_capacity));
|
||||
ASSERT_EQ(sec_capacity, 100 << 20);
|
||||
|
||||
ASSERT_OK(UpdateTieredCache(tiered_cache, -1, 0.0));
|
||||
EXPECT_PRED3(CacheUsageWithinBounds, GetCache()->GetUsage(), (10 << 20),
|
||||
GetPercent(10 << 20, 1));
|
||||
ASSERT_OK(sec_cache->GetCapacity(sec_capacity));
|
||||
ASSERT_EQ(sec_capacity, 0);
|
||||
|
||||
ASSERT_OK(cache_res_mgr()->UpdateCacheReservation(0));
|
||||
}
|
||||
|
||||
INSTANTIATE_TEST_CASE_P(
|
||||
CompressedSecCacheTests, CompressedSecCacheTestWithTiered,
|
||||
::testing::Values(
|
||||
|
|
|
@ -983,13 +983,14 @@ class TestSecondaryCache : public SecondaryCache {
|
|||
|
||||
using ResultMap = std::unordered_map<std::string, ResultType>;
|
||||
|
||||
explicit TestSecondaryCache(size_t capacity)
|
||||
explicit TestSecondaryCache(size_t capacity, bool insert_saved = false)
|
||||
: cache_(NewLRUCache(capacity, 0, false, 0.5 /* high_pri_pool_ratio */,
|
||||
nullptr, kDefaultToAdaptiveMutex,
|
||||
kDontChargeCacheMetadata)),
|
||||
num_inserts_(0),
|
||||
num_lookups_(0),
|
||||
inject_failure_(false) {}
|
||||
inject_failure_(false),
|
||||
insert_saved_(insert_saved) {}
|
||||
|
||||
const char* Name() const override { return "TestSecondaryCache"; }
|
||||
|
||||
|
@ -1020,6 +1021,17 @@ class TestSecondaryCache : public SecondaryCache {
|
|||
return cache_.Insert(key, buf, size);
|
||||
}
|
||||
|
||||
Status InsertSaved(const Slice& key, const Slice& saved,
|
||||
CompressionType /*type*/ = kNoCompression,
|
||||
CacheTier /*source*/ = CacheTier::kVolatileTier) override {
|
||||
if (insert_saved_) {
|
||||
return Insert(key, const_cast<Slice*>(&saved), &kSliceCacheItemHelper,
|
||||
/*force_insert=*/true);
|
||||
} else {
|
||||
return Status::OK();
|
||||
}
|
||||
}
|
||||
|
||||
std::unique_ptr<SecondaryCacheResultHandle> Lookup(
|
||||
const Slice& key, const Cache::CacheItemHelper* helper,
|
||||
Cache::CreateContext* create_context, bool /*wait*/,
|
||||
|
@ -1048,7 +1060,8 @@ class TestSecondaryCache : public SecondaryCache {
|
|||
char* ptr = cache_.Value(handle);
|
||||
size_t size = DecodeFixed64(ptr);
|
||||
ptr += sizeof(uint64_t);
|
||||
s = helper->create_cb(Slice(ptr, size), create_context,
|
||||
s = helper->create_cb(Slice(ptr, size), kNoCompression,
|
||||
CacheTier::kVolatileTier, create_context,
|
||||
/*alloc*/ nullptr, &value, &charge);
|
||||
}
|
||||
if (s.ok()) {
|
||||
|
@ -1137,6 +1150,7 @@ class TestSecondaryCache : public SecondaryCache {
|
|||
uint32_t num_inserts_;
|
||||
uint32_t num_lookups_;
|
||||
bool inject_failure_;
|
||||
bool insert_saved_;
|
||||
std::string ckey_prefix_;
|
||||
ResultMap result_map_;
|
||||
};
|
||||
|
@ -1167,7 +1181,7 @@ INSTANTIATE_TEST_CASE_P(DBSecondaryCacheTest, DBSecondaryCacheTest,
|
|||
|
||||
TEST_P(BasicSecondaryCacheTest, BasicTest) {
|
||||
std::shared_ptr<TestSecondaryCache> secondary_cache =
|
||||
std::make_shared<TestSecondaryCache>(4096);
|
||||
std::make_shared<TestSecondaryCache>(4096, true);
|
||||
std::shared_ptr<Cache> cache =
|
||||
NewCache(1024 /* capacity */, 0 /* num_shard_bits */,
|
||||
false /* strict_capacity_limit */, secondary_cache);
|
||||
|
@ -1224,7 +1238,7 @@ TEST_P(BasicSecondaryCacheTest, BasicTest) {
|
|||
|
||||
TEST_P(BasicSecondaryCacheTest, StatsTest) {
|
||||
std::shared_ptr<TestSecondaryCache> secondary_cache =
|
||||
std::make_shared<TestSecondaryCache>(4096);
|
||||
std::make_shared<TestSecondaryCache>(4096, true);
|
||||
std::shared_ptr<Cache> cache =
|
||||
NewCache(1024 /* capacity */, 0 /* num_shard_bits */,
|
||||
false /* strict_capacity_limit */, secondary_cache);
|
||||
|
@ -1278,7 +1292,7 @@ TEST_P(BasicSecondaryCacheTest, StatsTest) {
|
|||
|
||||
TEST_P(BasicSecondaryCacheTest, BasicFailTest) {
|
||||
std::shared_ptr<TestSecondaryCache> secondary_cache =
|
||||
std::make_shared<TestSecondaryCache>(2048);
|
||||
std::make_shared<TestSecondaryCache>(2048, true);
|
||||
std::shared_ptr<Cache> cache =
|
||||
NewCache(1024 /* capacity */, 0 /* num_shard_bits */,
|
||||
false /* strict_capacity_limit */, secondary_cache);
|
||||
|
@ -1320,7 +1334,7 @@ TEST_P(BasicSecondaryCacheTest, BasicFailTest) {
|
|||
|
||||
TEST_P(BasicSecondaryCacheTest, SaveFailTest) {
|
||||
std::shared_ptr<TestSecondaryCache> secondary_cache =
|
||||
std::make_shared<TestSecondaryCache>(2048);
|
||||
std::make_shared<TestSecondaryCache>(2048, true);
|
||||
std::shared_ptr<Cache> cache =
|
||||
NewCache(1024 /* capacity */, 0 /* num_shard_bits */,
|
||||
false /* strict_capacity_limit */, secondary_cache);
|
||||
|
@ -1361,7 +1375,7 @@ TEST_P(BasicSecondaryCacheTest, SaveFailTest) {
|
|||
|
||||
TEST_P(BasicSecondaryCacheTest, CreateFailTest) {
|
||||
std::shared_ptr<TestSecondaryCache> secondary_cache =
|
||||
std::make_shared<TestSecondaryCache>(2048);
|
||||
std::make_shared<TestSecondaryCache>(2048, true);
|
||||
std::shared_ptr<Cache> cache =
|
||||
NewCache(1024 /* capacity */, 0 /* num_shard_bits */,
|
||||
false /* strict_capacity_limit */, secondary_cache);
|
||||
|
@ -1402,7 +1416,7 @@ TEST_P(BasicSecondaryCacheTest, CreateFailTest) {
|
|||
TEST_P(BasicSecondaryCacheTest, FullCapacityTest) {
|
||||
for (bool strict_capacity_limit : {false, true}) {
|
||||
std::shared_ptr<TestSecondaryCache> secondary_cache =
|
||||
std::make_shared<TestSecondaryCache>(2048);
|
||||
std::make_shared<TestSecondaryCache>(2048, true);
|
||||
std::shared_ptr<Cache> cache =
|
||||
NewCache(1024 /* capacity */, 0 /* num_shard_bits */,
|
||||
strict_capacity_limit, secondary_cache);
|
||||
|
@ -2021,8 +2035,9 @@ class CacheWithStats : public CacheWrapper {
|
|||
|
||||
Status Insert(const Slice& key, Cache::ObjectPtr value,
|
||||
const CacheItemHelper* helper, size_t charge,
|
||||
Handle** handle = nullptr,
|
||||
Priority priority = Priority::LOW) override {
|
||||
Handle** handle = nullptr, Priority priority = Priority::LOW,
|
||||
const Slice& /*compressed*/ = Slice(),
|
||||
CompressionType /*type*/ = kNoCompression) override {
|
||||
insert_count_++;
|
||||
return target_->Insert(key, value, helper, charge, handle, priority);
|
||||
}
|
||||
|
@ -2115,7 +2130,7 @@ TEST_P(DBSecondaryCacheTest, LRUCacheDumpLoadBasic) {
|
|||
// we have a new cache it is empty, then, before we do the Get, we do the
|
||||
// dumpload
|
||||
std::shared_ptr<TestSecondaryCache> secondary_cache =
|
||||
std::make_shared<TestSecondaryCache>(2048 * 1024);
|
||||
std::make_shared<TestSecondaryCache>(2048 * 1024, true);
|
||||
// This time with secondary cache
|
||||
base_cache = NewCache(1024 * 1024 /* capacity */, 0 /* num_shard_bits */,
|
||||
false /* strict_capacity_limit */, secondary_cache);
|
||||
|
@ -2271,7 +2286,7 @@ TEST_P(DBSecondaryCacheTest, LRUCacheDumpLoadWithFilter) {
|
|||
// we have a new cache it is empty, then, before we do the Get, we do the
|
||||
// dumpload
|
||||
std::shared_ptr<TestSecondaryCache> secondary_cache =
|
||||
std::make_shared<TestSecondaryCache>(2048 * 1024);
|
||||
std::make_shared<TestSecondaryCache>(2048 * 1024, true);
|
||||
// This time with secondary_cache
|
||||
base_cache = NewCache(1024 * 1024 /* capacity */, 0 /* num_shard_bits */,
|
||||
false /* strict_capacity_limit */, secondary_cache);
|
||||
|
|
|
@ -9,37 +9,4 @@
|
|||
|
||||
namespace ROCKSDB_NAMESPACE {
|
||||
|
||||
namespace {
|
||||
|
||||
void NoopDelete(Cache::ObjectPtr, MemoryAllocator*) {}
|
||||
|
||||
size_t SliceSize(Cache::ObjectPtr obj) {
|
||||
return static_cast<Slice*>(obj)->size();
|
||||
}
|
||||
|
||||
Status SliceSaveTo(Cache::ObjectPtr from_obj, size_t from_offset, size_t length,
|
||||
char* out) {
|
||||
const Slice& slice = *static_cast<Slice*>(from_obj);
|
||||
std::memcpy(out, slice.data() + from_offset, length);
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
Status FailCreate(const Slice&, Cache::CreateContext*, MemoryAllocator*,
|
||||
Cache::ObjectPtr*, size_t*) {
|
||||
return Status::NotSupported("Only for dumping data into SecondaryCache");
|
||||
}
|
||||
|
||||
} // namespace
|
||||
|
||||
Status SecondaryCache::InsertSaved(const Slice& key, const Slice& saved) {
|
||||
static Cache::CacheItemHelper helper_no_secondary{CacheEntryRole::kMisc,
|
||||
&NoopDelete};
|
||||
static Cache::CacheItemHelper helper{
|
||||
CacheEntryRole::kMisc, &NoopDelete, &SliceSize,
|
||||
&SliceSaveTo, &FailCreate, &helper_no_secondary};
|
||||
// NOTE: depends on Insert() being synchronous, not keeping pointer `&saved`
|
||||
return Insert(key, const_cast<Slice*>(&saved), &helper,
|
||||
/*force_insert=*/true);
|
||||
}
|
||||
|
||||
} // namespace ROCKSDB_NAMESPACE
|
||||
|
|
|
@@ -5,6 +5,7 @@
|
|||
|
||||
#include "cache/secondary_cache_adapter.h"
|
||||
|
||||
#include "cache/tiered_secondary_cache.h"
|
||||
#include "monitoring/perf_context_imp.h"
|
||||
#include "util/cast_util.h"
|
||||
|
||||
|
@@ -17,6 +18,7 @@ struct Dummy {
|
|||
};
|
||||
const Dummy kDummy{};
|
||||
Cache::ObjectPtr const kDummyObj = const_cast<Dummy*>(&kDummy);
|
||||
const char* kTieredCacheName = "TieredCache";
|
||||
} // namespace
|
||||
|
||||
// When CacheWithSecondaryAdapter is constructed with the distribute_cache_res
|
||||
|
@@ -107,11 +109,11 @@ CacheWithSecondaryAdapter::~CacheWithSecondaryAdapter() {
|
|||
// use after free
|
||||
target_->SetEvictionCallback({});
|
||||
#ifndef NDEBUG
|
||||
if (distribute_cache_res_) {
|
||||
if (distribute_cache_res_ && !ratio_changed_) {
|
||||
size_t sec_capacity = 0;
|
||||
Status s = secondary_cache_->GetCapacity(sec_capacity);
|
||||
assert(s.ok());
|
||||
assert(pri_cache_res_->GetTotalReservedCacheSize() == sec_capacity);
|
||||
assert(pri_cache_res_->GetTotalMemoryUsed() == sec_capacity);
|
||||
}
|
||||
#endif // NDEBUG
|
||||
}
|
||||
|
@@ -119,7 +121,8 @@ CacheWithSecondaryAdapter::~CacheWithSecondaryAdapter() {
|
|||
bool CacheWithSecondaryAdapter::EvictionHandler(const Slice& key,
|
||||
Handle* handle, bool was_hit) {
|
||||
auto helper = GetCacheItemHelper(handle);
|
||||
if (helper->IsSecondaryCacheCompatible()) {
|
||||
if (helper->IsSecondaryCacheCompatible() &&
|
||||
adm_policy_ != TieredAdmissionPolicy::kAdmPolicyThreeQueue) {
|
||||
auto obj = target_->Value(handle);
|
||||
// Ignore dummy entry
|
||||
if (obj != kDummyObj) {
|
||||
|
@@ -225,7 +228,9 @@ Cache::Handle* CacheWithSecondaryAdapter::Promote(
|
|||
Status CacheWithSecondaryAdapter::Insert(const Slice& key, ObjectPtr value,
|
||||
const CacheItemHelper* helper,
|
||||
size_t charge, Handle** handle,
|
||||
Priority priority) {
|
||||
Priority priority,
|
||||
const Slice& compressed_value,
|
||||
CompressionType type) {
|
||||
Status s = target_->Insert(key, value, helper, charge, handle, priority);
|
||||
if (s.ok() && value == nullptr && distribute_cache_res_) {
|
||||
size_t sec_charge = static_cast<size_t>(charge * (sec_cache_res_ratio_));
|
||||
|
@@ -234,6 +239,13 @@ Status CacheWithSecondaryAdapter::Insert(const Slice& key, ObjectPtr value,
|
|||
s = pri_cache_res_->UpdateCacheReservation(sec_charge, /*increase=*/false);
|
||||
assert(s.ok());
|
||||
}
|
||||
// Warm up the secondary cache with the compressed block. The secondary
|
||||
// cache may choose to ignore it based on the admission policy.
|
||||
if (value != nullptr && !compressed_value.empty() &&
|
||||
adm_policy_ == TieredAdmissionPolicy::kAdmPolicyThreeQueue) {
|
||||
Status status = secondary_cache_->InsertSaved(key, compressed_value, type);
|
||||
assert(status.ok() || status.IsNotSupported());
|
||||
}
|
||||
|
||||
return s;
|
||||
}
|
||||
|
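With this change, a Cache::Insert() caller can optionally hand over the compressed form of the block together with its compression type; under kAdmPolicyThreeQueue the adapter forwards that payload to the secondary cache via InsertSaved(). A minimal sketch of such a call, where the helper, block pointers, and charge are illustrative placeholders rather than symbols from this patch:

// Illustrative only: my_block_helper, uncompressed_block, compressed_block
// and uncompressed_charge are placeholders supplied by the caller.
Cache::Handle* handle = nullptr;
Status s = cache->Insert(key, uncompressed_block, my_block_helper,
                         uncompressed_charge, &handle, Cache::Priority::LOW,
                         /*compressed_value=*/compressed_block,
                         /*type=*/kLZ4Compression);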
@@ -406,19 +418,196 @@ std::string CacheWithSecondaryAdapter::GetPrintableOptions() const {
|
|||
}
|
||||
|
||||
const char* CacheWithSecondaryAdapter::Name() const {
|
||||
// To the user, at least for now, it looks like they configured the underlying
// cache with a secondary cache, so we pretend to be that cache
|
||||
return target_->Name();
|
||||
if (distribute_cache_res_) {
|
||||
return kTieredCacheName;
|
||||
} else {
|
||||
// To the user, at least for now, it looks like they configured the underlying
// cache with a secondary cache, so we pretend to be that cache
|
||||
return target_->Name();
|
||||
}
|
||||
}
|
||||
|
||||
std::shared_ptr<Cache> NewTieredVolatileCache(
|
||||
TieredVolatileCacheOptions& opts) {
|
||||
if (!opts.cache_opts) {
|
||||
// Update the total cache capacity. If we're distributing cache reservations
|
||||
// to both primary and secondary, then update the pri_cache_res_ reservation
|
||||
// as well. At the moment, we don't have a good way of handling the case
|
||||
// where the new capacity < total cache reservations.
|
||||
void CacheWithSecondaryAdapter::SetCapacity(size_t capacity) {
|
||||
size_t sec_capacity = static_cast<size_t>(
|
||||
capacity * (distribute_cache_res_ ? sec_cache_res_ratio_ : 0.0));
|
||||
size_t old_sec_capacity = 0;
|
||||
|
||||
if (distribute_cache_res_) {
|
||||
MutexLock m(&mutex_);
|
||||
|
||||
Status s = secondary_cache_->GetCapacity(old_sec_capacity);
|
||||
if (!s.ok()) {
|
||||
return;
|
||||
}
|
||||
if (old_sec_capacity > sec_capacity) {
|
||||
// We're shrinking the cache. We do things in the following order to
|
||||
// avoid a temporary spike in usage over the configured capacity -
|
||||
// 1. Lower the secondary cache capacity
|
||||
// 2. Credit an equal amount (by decreasing pri_cache_res_) to the
|
||||
// primary cache
|
||||
// 3. Decrease the primary cache capacity to the total budget
|
||||
s = secondary_cache_->SetCapacity(sec_capacity);
|
||||
if (s.ok()) {
|
||||
s = pri_cache_res_->UpdateCacheReservation(
|
||||
old_sec_capacity - sec_capacity,
|
||||
/*increase=*/false);
|
||||
assert(s.ok());
|
||||
target_->SetCapacity(capacity);
|
||||
}
|
||||
} else {
|
||||
// We're expanding the cache. Do it in the following order to avoid
|
||||
// unnecessary evictions -
|
||||
// 1. Increase the primary cache capacity to total budget
|
||||
// 2. Reserve additional memory in primary on behalf of secondary (by
|
||||
// increasing pri_cache_res_ reservation)
|
||||
// 3. Increase secondary cache capacity
|
||||
target_->SetCapacity(capacity);
|
||||
s = pri_cache_res_->UpdateCacheReservation(
|
||||
sec_capacity - old_sec_capacity,
|
||||
/*increase=*/true);
|
||||
assert(s.ok());
|
||||
s = secondary_cache_->SetCapacity(sec_capacity);
|
||||
assert(s.ok());
|
||||
}
|
||||
} else {
|
||||
// No cache reservation distribution. Just set the primary cache capacity.
|
||||
target_->SetCapacity(capacity);
|
||||
}
|
||||
}
|
||||
|
||||
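As a concrete illustration of the shrink path above (numbers are illustrative, not from this patch): with distribute_cache_res_ set, a 64MB total budget and a 0.25 secondary ratio give a 16MB secondary capacity. Shrinking the total to 32MB first lowers the secondary capacity to 8MB, then credits the 8MB difference back by reducing pri_cache_res_, and only then drops the primary capacity to 32MB, so total usage never temporarily exceeds the configured budget.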
// Update the secondary/primary allocation ratio (remember, the primary
|
||||
// capacity is the total memory budget when distribute_cache_res_ is true).
|
||||
// When the ratio changes, we may accumulate some error in the calculations
|
||||
// for secondary cache inflate/deflate and pri_cache_res_ reservations.
|
||||
// This is due to the rounding of the reservation amount.
|
||||
//
|
||||
// We rely on the current pri_cache_res_ total memory used to estimate the
|
||||
// new secondary cache reservation after the ratio change. For this reason,
|
||||
// once the ratio is lowered to 0.0 (effectively disabling the secondary
|
||||
// cache and pri_cache_res_ total mem used going down to 0), we cannot
|
||||
// increase the ratio and re-enable it. We might remove this limitation
|
||||
// in the future.
|
||||
Status CacheWithSecondaryAdapter::UpdateCacheReservationRatio(
|
||||
double compressed_secondary_ratio) {
|
||||
if (!distribute_cache_res_ || sec_cache_res_ratio_ == 0.0) {
|
||||
return Status::NotSupported();
|
||||
}
|
||||
|
||||
MutexLock m(&mutex_);
|
||||
size_t pri_capacity = target_->GetCapacity();
|
||||
size_t sec_capacity =
|
||||
static_cast<size_t>(pri_capacity * compressed_secondary_ratio);
|
||||
size_t old_sec_capacity;
|
||||
Status s = secondary_cache_->GetCapacity(old_sec_capacity);
|
||||
if (!s.ok()) {
|
||||
return s;
|
||||
}
|
||||
|
||||
assert(old_sec_capacity >= pri_cache_res_->GetTotalMemoryUsed());
|
||||
size_t old_sec_reserved =
|
||||
old_sec_capacity - pri_cache_res_->GetTotalMemoryUsed();
|
||||
// Calculate the new secondary cache reservation
|
||||
size_t sec_reserved = static_cast<size_t>(
|
||||
old_sec_reserved *
|
||||
(double)(compressed_secondary_ratio / sec_cache_res_ratio_));
|
||||
sec_cache_res_ratio_ = compressed_secondary_ratio;
|
||||
if (sec_capacity > old_sec_capacity) {
|
||||
// We're increasing the ratio, thus ending up with a larger secondary
|
||||
// cache and a smaller usable primary cache capacity. Similar to
|
||||
// SetCapacity(), we try to avoid a temporary increase in total usage
|
||||
// beyond the configured capacity -
|
||||
// 1. A higher secondary cache ratio means it gets a higher share of
|
||||
// cache reservations. So first account for that by deflating the
|
||||
// secondary cache
|
||||
// 2. Increase pri_cache_res_ reservation to reflect the new secondary
|
||||
// cache utilization (increase in capacity - increase in share of cache
|
||||
// reservation)
|
||||
// 3. Increase secondary cache capacity
|
||||
assert(sec_reserved > old_sec_reserved || sec_reserved == 0);
|
||||
s = secondary_cache_->Deflate(sec_reserved - old_sec_reserved);
|
||||
assert(s.ok());
|
||||
s = pri_cache_res_->UpdateCacheReservation(
|
||||
(sec_capacity - old_sec_capacity) - (sec_reserved - old_sec_reserved),
|
||||
/*increase=*/true);
|
||||
assert(s.ok());
|
||||
s = secondary_cache_->SetCapacity(sec_capacity);
|
||||
assert(s.ok());
|
||||
} else {
|
||||
// We're shrinking the ratio. Try to avoid unnecessary evictions -
|
||||
// 1. Lower the secondary cache capacity
|
||||
// 2. Decrease pri_cache_res_ reservation to reflect lower secondary
|
||||
// cache utilization (decrease in capacity - decrease in share of cache
|
||||
// reservations)
|
||||
// 3. Inflate the secondary cache to give it back the reduction in its
|
||||
// share of cache reservations
|
||||
assert(old_sec_reserved > sec_reserved || sec_reserved == 0);
|
||||
s = secondary_cache_->SetCapacity(sec_capacity);
|
||||
if (s.ok()) {
|
||||
s = pri_cache_res_->UpdateCacheReservation(
|
||||
(old_sec_capacity - sec_capacity) - (old_sec_reserved - sec_reserved),
|
||||
/*increase=*/false);
|
||||
assert(s.ok());
|
||||
s = secondary_cache_->Inflate(old_sec_reserved - sec_reserved);
|
||||
assert(s.ok());
|
||||
}
|
||||
}
|
||||
|
||||
#ifndef NDEBUG
|
||||
// As mentioned in the function comments, we may accumulate some errors when
|
||||
// the ratio is changed. We set a flag here which disables some assertions
|
||||
// in the destructor
|
||||
ratio_changed_ = true;
|
||||
#endif
|
||||
return s;
|
||||
}
|
||||
|
||||
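To make the reservation bookkeeping above concrete (illustrative numbers only): with a 100MB total budget, raising the ratio from 0.2 to 0.4 moves the secondary capacity from 20MB to 40MB. If pri_cache_res_ currently holds 15MB, then old_sec_reserved is 20 - 15 = 5MB and the new sec_reserved is 5 * (0.4 / 0.2) = 10MB, so the secondary cache is first deflated by 5MB, the pri_cache_res_ reservation grows by (40 - 20) - (10 - 5) = 15MB, and finally the secondary capacity is raised to 40MB.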
Status CacheWithSecondaryAdapter::UpdateAdmissionPolicy(
|
||||
TieredAdmissionPolicy adm_policy) {
|
||||
adm_policy_ = adm_policy;
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
std::shared_ptr<Cache> NewTieredCache(const TieredCacheOptions& _opts) {
|
||||
if (!_opts.cache_opts) {
|
||||
return nullptr;
|
||||
}
|
||||
|
||||
if (opts.adm_policy >= TieredAdmissionPolicy::kAdmPolicyMax) {
|
||||
return nullptr;
|
||||
TieredCacheOptions opts = _opts;
|
||||
{
|
||||
bool valid_adm_policy = true;
|
||||
|
||||
switch (_opts.adm_policy) {
|
||||
case TieredAdmissionPolicy::kAdmPolicyAuto:
|
||||
// Select an appropriate default policy
|
||||
if (opts.adm_policy == TieredAdmissionPolicy::kAdmPolicyAuto) {
|
||||
if (opts.nvm_sec_cache) {
|
||||
opts.adm_policy = TieredAdmissionPolicy::kAdmPolicyThreeQueue;
|
||||
} else {
|
||||
opts.adm_policy = TieredAdmissionPolicy::kAdmPolicyPlaceholder;
|
||||
}
|
||||
}
|
||||
break;
|
||||
case TieredAdmissionPolicy::kAdmPolicyPlaceholder:
|
||||
case TieredAdmissionPolicy::kAdmPolicyAllowCacheHits:
|
||||
if (opts.nvm_sec_cache) {
|
||||
valid_adm_policy = false;
|
||||
}
|
||||
break;
|
||||
case TieredAdmissionPolicy::kAdmPolicyThreeQueue:
|
||||
if (!opts.nvm_sec_cache) {
|
||||
valid_adm_policy = false;
|
||||
}
|
||||
break;
|
||||
default:
|
||||
valid_adm_policy = false;
|
||||
}
|
||||
if (!valid_adm_policy) {
|
||||
return nullptr;
|
||||
}
|
||||
}
|
||||
|
||||
std::shared_ptr<Cache> cache;
|
||||
|
@@ -426,21 +615,56 @@ std::shared_ptr<Cache> NewTieredVolatileCache(
|
|||
LRUCacheOptions cache_opts =
|
||||
*(static_cast_with_check<LRUCacheOptions, ShardedCacheOptions>(
|
||||
opts.cache_opts));
|
||||
cache_opts.capacity += opts.comp_cache_opts.capacity;
|
||||
cache_opts.capacity = opts.total_capacity;
|
||||
cache = cache_opts.MakeSharedCache();
|
||||
} else if (opts.cache_type == PrimaryCacheType::kCacheTypeHCC) {
|
||||
HyperClockCacheOptions cache_opts =
|
||||
*(static_cast_with_check<HyperClockCacheOptions, ShardedCacheOptions>(
|
||||
opts.cache_opts));
|
||||
cache_opts.capacity += opts.comp_cache_opts.capacity;
|
||||
cache_opts.capacity = opts.total_capacity;
|
||||
cache = cache_opts.MakeSharedCache();
|
||||
} else {
|
||||
return nullptr;
|
||||
}
|
||||
std::shared_ptr<SecondaryCache> sec_cache;
|
||||
opts.comp_cache_opts.capacity = static_cast<size_t>(
|
||||
opts.total_capacity * opts.compressed_secondary_ratio);
|
||||
sec_cache = NewCompressedSecondaryCache(opts.comp_cache_opts);
|
||||
|
||||
if (opts.nvm_sec_cache) {
|
||||
if (opts.adm_policy == TieredAdmissionPolicy::kAdmPolicyThreeQueue) {
|
||||
sec_cache = std::make_shared<TieredSecondaryCache>(
|
||||
sec_cache, opts.nvm_sec_cache,
|
||||
TieredAdmissionPolicy::kAdmPolicyThreeQueue);
|
||||
} else {
|
||||
return nullptr;
|
||||
}
|
||||
}
|
||||
|
||||
return std::make_shared<CacheWithSecondaryAdapter>(
|
||||
cache, sec_cache, opts.adm_policy, /*distribute_cache_res=*/true);
|
||||
}
|
||||
|
||||
Status UpdateTieredCache(const std::shared_ptr<Cache>& cache,
|
||||
int64_t total_capacity,
|
||||
double compressed_secondary_ratio,
|
||||
TieredAdmissionPolicy adm_policy) {
|
||||
if (!cache || strcmp(cache->Name(), kTieredCacheName)) {
|
||||
return Status::InvalidArgument();
|
||||
}
|
||||
CacheWithSecondaryAdapter* tiered_cache =
|
||||
static_cast<CacheWithSecondaryAdapter*>(cache.get());
|
||||
|
||||
Status s;
|
||||
if (total_capacity > 0) {
|
||||
tiered_cache->SetCapacity(total_capacity);
|
||||
}
|
||||
if (compressed_secondary_ratio >= 0.0 && compressed_secondary_ratio <= 1.0) {
|
||||
s = tiered_cache->UpdateCacheReservationRatio(compressed_secondary_ratio);
|
||||
}
|
||||
if (adm_policy < TieredAdmissionPolicy::kAdmPolicyMax) {
|
||||
s = tiered_cache->UpdateAdmissionPolicy(adm_policy);
|
||||
}
|
||||
return s;
|
||||
}
|
||||
} // namespace ROCKSDB_NAMESPACE
|
||||
|
|
|
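For orientation, a rough sketch of how an application might drive these two entry points, based on the options the new tests later in this patch exercise; the capacities and the ratio below are arbitrary illustrative values:

LRUCacheOptions lru_opts;
lru_opts.num_shard_bits = 0;
TieredCacheOptions opts;
opts.cache_opts = &lru_opts;
opts.cache_type = PrimaryCacheType::kCacheTypeLRU;
opts.total_capacity = 64 << 20;          // primary + compressed secondary
opts.compressed_secondary_ratio = 0.25;  // share given to the compressed tier
opts.adm_policy = TieredAdmissionPolicy::kAdmPolicyAuto;
std::shared_ptr<Cache> tiered = NewTieredCache(opts);

// Later, the budget, the split, and the admission policy can be adjusted:
Status s = UpdateTieredCache(tiered, /*total_capacity=*/128 << 20,
                             /*compressed_secondary_ratio=*/0.2,
                             TieredAdmissionPolicy::kAdmPolicyPlaceholder);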
@@ -20,10 +20,12 @@ class CacheWithSecondaryAdapter : public CacheWrapper {
|
|||
|
||||
~CacheWithSecondaryAdapter() override;
|
||||
|
||||
Status Insert(const Slice& key, ObjectPtr value,
|
||||
const CacheItemHelper* helper, size_t charge,
|
||||
Handle** handle = nullptr,
|
||||
Priority priority = Priority::LOW) override;
|
||||
Status Insert(
|
||||
const Slice& key, ObjectPtr value, const CacheItemHelper* helper,
|
||||
size_t charge, Handle** handle = nullptr,
|
||||
Priority priority = Priority::LOW,
|
||||
const Slice& compressed_value = Slice(),
|
||||
CompressionType type = CompressionType::kNoCompression) override;
|
||||
|
||||
Handle* Lookup(const Slice& key, const CacheItemHelper* helper,
|
||||
CreateContext* create_context,
|
||||
|
@@ -43,6 +45,12 @@ class CacheWithSecondaryAdapter : public CacheWrapper {
|
|||
|
||||
const char* Name() const override;
|
||||
|
||||
void SetCapacity(size_t capacity) override;
|
||||
|
||||
Status UpdateCacheReservationRatio(double ratio);
|
||||
|
||||
Status UpdateAdmissionPolicy(TieredAdmissionPolicy adm_policy);
|
||||
|
||||
Cache* TEST_GetCache() { return target_.get(); }
|
||||
|
||||
SecondaryCache* TEST_GetSecondaryCache() { return secondary_cache_.get(); }
|
||||
|
@@ -73,6 +81,10 @@ class CacheWithSecondaryAdapter : public CacheWrapper {
|
|||
// Fraction of a cache memory reservation to be assigned to the secondary
|
||||
// cache
|
||||
double sec_cache_res_ratio_;
|
||||
port::Mutex mutex_;
|
||||
#ifndef NDEBUG
|
||||
bool ratio_changed_ = false;
|
||||
#endif
|
||||
};
|
||||
|
||||
} // namespace ROCKSDB_NAMESPACE
|
||||
|
|
|
@@ -170,9 +170,12 @@ class ShardedCache : public ShardedCacheBase {
|
|||
[s_c_l](CacheShard* cs) { cs->SetStrictCapacityLimit(s_c_l); });
|
||||
}
|
||||
|
||||
Status Insert(const Slice& key, ObjectPtr obj, const CacheItemHelper* helper,
|
||||
size_t charge, Handle** handle = nullptr,
|
||||
Priority priority = Priority::LOW) override {
|
||||
Status Insert(
|
||||
const Slice& key, ObjectPtr obj, const CacheItemHelper* helper,
|
||||
size_t charge, Handle** handle = nullptr,
|
||||
Priority priority = Priority::LOW,
|
||||
const Slice& /*compressed_value*/ = Slice(),
|
||||
CompressionType /*type*/ = CompressionType::kNoCompression) override {
|
||||
assert(helper);
|
||||
HashVal hash = CacheShard::ComputeHash(key, hash_seed_);
|
||||
auto h_out = reinterpret_cast<HandleImpl**>(handle);
|
||||
|
|
|
@@ -0,0 +1,119 @@
|
|||
// Copyright (c) Meta Platforms, Inc. and affiliates.
|
||||
// This source code is licensed under both the GPLv2 (found in the
|
||||
// COPYING file in the root directory) and Apache 2.0 License
|
||||
// (found in the LICENSE.Apache file in the root directory).
|
||||
|
||||
#include "cache/tiered_secondary_cache.h"
|
||||
|
||||
namespace ROCKSDB_NAMESPACE {
|
||||
|
||||
// Creation callback for use in the lookup path. It calls the upper layer
|
||||
// create_cb to create the object, and optionally calls the compressed
|
||||
// secondary cache InsertSaved to save the compressed block. If
|
||||
// advise_erase is set, it means the primary cache wants the block to be
|
||||
// erased in the secondary cache, so we skip calling InsertSaved.
|
||||
//
|
||||
// For the time being, we assume that all blocks in the nvm tier belong to
|
||||
// the primary block cache (i.e CacheTier::kVolatileTier). That can be changed
|
||||
// if we implement demotion from the compressed secondary cache to the nvm
|
||||
// cache in the future.
|
||||
Status TieredSecondaryCache::MaybeInsertAndCreate(
|
||||
const Slice& data, CompressionType type, CacheTier source,
|
||||
Cache::CreateContext* ctx, MemoryAllocator* allocator,
|
||||
Cache::ObjectPtr* out_obj, size_t* out_charge) {
|
||||
TieredSecondaryCache::CreateContext* context =
|
||||
static_cast<TieredSecondaryCache::CreateContext*>(ctx);
|
||||
assert(source == CacheTier::kVolatileTier);
|
||||
if (!context->advise_erase && type != kNoCompression) {
|
||||
// Attempt to insert into compressed secondary cache
|
||||
// TODO: Don't hardcode the source
|
||||
context->comp_sec_cache->InsertSaved(*context->key, data, type, source)
|
||||
.PermitUncheckedError();
|
||||
}
|
||||
// Primary cache will accept the object, so call its helper to create
|
||||
// the object
|
||||
return context->helper->create_cb(data, type, source, context->inner_ctx,
|
||||
allocator, out_obj, out_charge);
|
||||
}
|
||||
|
||||
// The lookup first looks up in the compressed secondary cache. If it's a miss,
|
||||
// then the nvm cache lookup is called. The cache item helper and create
|
||||
// context are wrapped in order to intercept the creation callback to make
|
||||
// the decision on promoting to the compressed secondary cache.
|
||||
std::unique_ptr<SecondaryCacheResultHandle> TieredSecondaryCache::Lookup(
|
||||
const Slice& key, const Cache::CacheItemHelper* helper,
|
||||
Cache::CreateContext* create_context, bool wait, bool advise_erase,
|
||||
bool& kept_in_sec_cache) {
|
||||
bool dummy = false;
|
||||
std::unique_ptr<SecondaryCacheResultHandle> result =
|
||||
target()->Lookup(key, helper, create_context, wait, advise_erase,
|
||||
/*kept_in_sec_cache=*/dummy);
|
||||
// We never want the item to spill back into the secondary cache
|
||||
kept_in_sec_cache = true;
|
||||
if (result) {
|
||||
assert(result->IsReady());
|
||||
return result;
|
||||
}
|
||||
|
||||
// If wait is true, then we can be a bit more efficient and avoid a memory
|
||||
// allocation for the CreateContext.
|
||||
const Cache::CacheItemHelper* outer_helper =
|
||||
TieredSecondaryCache::GetHelper();
|
||||
if (wait) {
|
||||
TieredSecondaryCache::CreateContext ctx;
|
||||
ctx.key = &key;
|
||||
ctx.advise_erase = advise_erase;
|
||||
ctx.helper = helper;
|
||||
ctx.inner_ctx = create_context;
|
||||
ctx.comp_sec_cache = target();
|
||||
|
||||
return nvm_sec_cache_->Lookup(key, outer_helper, &ctx, wait, advise_erase,
|
||||
kept_in_sec_cache);
|
||||
}
|
||||
|
||||
// If wait is false, i.e. it's an async lookup, we have to allocate a result
|
||||
// handle for tracking purposes. Embed the CreateContext inside the handle
|
||||
// so we need only allocate memory once instead of twice.
|
||||
std::unique_ptr<ResultHandle> handle(new ResultHandle());
|
||||
handle->ctx()->key = &key;
|
||||
handle->ctx()->advise_erase = advise_erase;
|
||||
handle->ctx()->helper = helper;
|
||||
handle->ctx()->inner_ctx = create_context;
|
||||
handle->ctx()->comp_sec_cache = target();
|
||||
handle->SetInnerHandle(nvm_sec_cache_->Lookup(
|
||||
key, outer_helper, handle->ctx(), wait, advise_erase, kept_in_sec_cache));
|
||||
if (!handle->inner_handle()) {
|
||||
handle.reset();
|
||||
} else {
|
||||
result.reset(handle.release());
|
||||
}
|
||||
|
||||
return result;
|
||||
}
|
||||
|
||||
// Call the nvm cache WaitAll to complete the lookups
|
||||
void TieredSecondaryCache::WaitAll(
|
||||
std::vector<SecondaryCacheResultHandle*> handles) {
|
||||
std::vector<SecondaryCacheResultHandle*> nvm_handles;
|
||||
std::vector<ResultHandle*> my_handles;
|
||||
nvm_handles.reserve(handles.size());
|
||||
for (auto handle : handles) {
|
||||
// The handle could belong to the compressed secondary cache. Skip if
|
||||
// that's the case.
|
||||
if (handle->IsReady()) {
|
||||
continue;
|
||||
}
|
||||
ResultHandle* hdl = static_cast<ResultHandle*>(handle);
|
||||
nvm_handles.push_back(hdl->inner_handle());
|
||||
my_handles.push_back(hdl);
|
||||
}
|
||||
nvm_sec_cache_->WaitAll(nvm_handles);
|
||||
for (auto handle : my_handles) {
|
||||
assert(handle->IsReady());
|
||||
auto nvm_handle = handle->inner_handle();
|
||||
handle->SetSize(nvm_handle->Size());
|
||||
handle->SetValue(nvm_handle->Value());
|
||||
}
|
||||
}
|
||||
|
||||
} // namespace ROCKSDB_NAMESPACE
|
|
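The sync/async split above mirrors how the block cache layer drives a SecondaryCache: lookups issued with wait=false return pending handles that are later completed in one WaitAll() batch. A hedged sketch of that calling pattern, where keys, helper, ctx, and tiered_sec_cache stand in for whatever the caller supplies:

// Illustrative only: keys, helper, ctx and tiered_sec_cache are placeholders.
std::vector<std::unique_ptr<SecondaryCacheResultHandle>> pending;
for (const Slice& key : keys) {
  bool kept_in_sec_cache = false;
  auto h = tiered_sec_cache->Lookup(key, helper, ctx, /*wait=*/false,
                                    /*advise_erase=*/false, kept_in_sec_cache);
  if (h && !h->IsReady()) {
    pending.push_back(std::move(h));
  }
}
std::vector<SecondaryCacheResultHandle*> raw;
for (auto& h : pending) {
  raw.push_back(h.get());
}
tiered_sec_cache->WaitAll(raw);
// After WaitAll(), each handle's Size() and Value() reflect the nvm result.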
@@ -0,0 +1,155 @@
|
|||
// Copyright (c) Meta Platforms, Inc. and affiliates.
|
||||
// This source code is licensed under both the GPLv2 (found in the
|
||||
// COPYING file in the root directory) and Apache 2.0 License
|
||||
// (found in the LICENSE.Apache file in the root directory).
|
||||
|
||||
#pragma once
|
||||
|
||||
#include "rocksdb/cache.h"
|
||||
#include "rocksdb/secondary_cache.h"
|
||||
|
||||
namespace ROCKSDB_NAMESPACE {
|
||||
|
||||
// A SecondaryCache that implements stacking of a compressed secondary cache
|
||||
// and a non-volatile (local flash) cache. It implements an admission
|
||||
// policy of warming the bottommost tier (local flash) with compressed
|
||||
// blocks from the SST on misses, and on hits in the bottommost tier,
|
||||
// promoting to the compressed and/or primary block cache. The admission
|
||||
// policies of the primary block cache and compressed secondary cache remain
|
||||
// unchanged - promote on second access. There is no demotion of blocks
|
||||
// evicted from a tier. They are just discarded.
|
||||
//
|
||||
// In order to properly handle compressed blocks directly read from SSTs, and
|
||||
// to allow writeback of blocks compressed by the compressed secondary
|
||||
// cache in the future, we make use of the compression type and source
|
||||
// cache tier arguments in InsertSaved.
|
||||
class TieredSecondaryCache : public SecondaryCacheWrapper {
|
||||
public:
|
||||
TieredSecondaryCache(std::shared_ptr<SecondaryCache> comp_sec_cache,
|
||||
std::shared_ptr<SecondaryCache> nvm_sec_cache,
|
||||
TieredAdmissionPolicy adm_policy)
|
||||
: SecondaryCacheWrapper(comp_sec_cache), nvm_sec_cache_(nvm_sec_cache) {
|
||||
#ifndef NDEBUG
|
||||
assert(adm_policy == TieredAdmissionPolicy::kAdmPolicyThreeQueue);
|
||||
#else
|
||||
(void)adm_policy;
|
||||
#endif
|
||||
}
|
||||
|
||||
~TieredSecondaryCache() override {}
|
||||
|
||||
const char* Name() const override { return "TieredSecondaryCache"; }
|
||||
|
||||
// This is a no-op as we currently don't allow demotion (i.e
|
||||
// insertion by the upper layer) of evicted blocks.
|
||||
virtual Status Insert(const Slice& /*key*/, Cache::ObjectPtr /*obj*/,
|
||||
const Cache::CacheItemHelper* /*helper*/,
|
||||
bool /*force_insert*/) override {
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
// Warm up the nvm tier directly
|
||||
virtual Status InsertSaved(
|
||||
const Slice& key, const Slice& saved,
|
||||
CompressionType type = CompressionType::kNoCompression,
|
||||
CacheTier source = CacheTier::kVolatileTier) override {
|
||||
return nvm_sec_cache_->InsertSaved(key, saved, type, source);
|
||||
}
|
||||
|
||||
virtual std::unique_ptr<SecondaryCacheResultHandle> Lookup(
|
||||
const Slice& key, const Cache::CacheItemHelper* helper,
|
||||
Cache::CreateContext* create_context, bool wait, bool advise_erase,
|
||||
bool& kept_in_sec_cache) override;
|
||||
|
||||
virtual void WaitAll(
|
||||
std::vector<SecondaryCacheResultHandle*> handles) override;
|
||||
|
||||
private:
|
||||
struct CreateContext : public Cache::CreateContext {
|
||||
const Slice* key;
|
||||
bool advise_erase;
|
||||
const Cache::CacheItemHelper* helper;
|
||||
Cache::CreateContext* inner_ctx;
|
||||
std::shared_ptr<SecondaryCacheResultHandle> inner_handle;
|
||||
SecondaryCache* comp_sec_cache;
|
||||
};
|
||||
|
||||
class ResultHandle : public SecondaryCacheResultHandle {
|
||||
public:
|
||||
~ResultHandle() override {}
|
||||
|
||||
bool IsReady() override {
|
||||
return !inner_handle_ || inner_handle_->IsReady();
|
||||
}
|
||||
|
||||
void Wait() override {
|
||||
inner_handle_->Wait();
|
||||
Complete();
|
||||
}
|
||||
|
||||
size_t Size() override { return size_; }
|
||||
|
||||
Cache::ObjectPtr Value() override { return value_; }
|
||||
|
||||
void Complete() {
|
||||
assert(IsReady());
|
||||
size_ = inner_handle_->Size();
|
||||
value_ = inner_handle_->Value();
|
||||
inner_handle_.reset();
|
||||
}
|
||||
|
||||
void SetInnerHandle(std::unique_ptr<SecondaryCacheResultHandle>&& handle) {
|
||||
inner_handle_ = std::move(handle);
|
||||
}
|
||||
|
||||
void SetSize(size_t size) { size_ = size; }
|
||||
|
||||
void SetValue(Cache::ObjectPtr val) { value_ = val; }
|
||||
|
||||
CreateContext* ctx() { return &ctx_; }
|
||||
|
||||
SecondaryCacheResultHandle* inner_handle() { return inner_handle_.get(); }
|
||||
|
||||
private:
|
||||
std::unique_ptr<SecondaryCacheResultHandle> inner_handle_;
|
||||
CreateContext ctx_;
|
||||
size_t size_;
|
||||
Cache::ObjectPtr value_;
|
||||
};
|
||||
|
||||
static void NoopDelete(Cache::ObjectPtr /*obj*/,
|
||||
MemoryAllocator* /*allocator*/) {
|
||||
assert(false);
|
||||
}
|
||||
static size_t ZeroSize(Cache::ObjectPtr /*obj*/) {
|
||||
assert(false);
|
||||
return 0;
|
||||
}
|
||||
static Status NoopSaveTo(Cache::ObjectPtr /*from_obj*/,
|
||||
size_t /*from_offset*/, size_t /*length*/,
|
||||
char* /*out_buf*/) {
|
||||
assert(false);
|
||||
return Status::OK();
|
||||
}
|
||||
static Status MaybeInsertAndCreate(const Slice& data, CompressionType type,
|
||||
CacheTier source,
|
||||
Cache::CreateContext* ctx,
|
||||
MemoryAllocator* allocator,
|
||||
Cache::ObjectPtr* out_obj,
|
||||
size_t* out_charge);
|
||||
|
||||
static const Cache::CacheItemHelper* GetHelper() {
|
||||
const static Cache::CacheItemHelper basic_helper(CacheEntryRole::kMisc,
|
||||
&NoopDelete);
|
||||
const static Cache::CacheItemHelper maybe_insert_and_create_helper{
|
||||
CacheEntryRole::kMisc, &NoopDelete, &ZeroSize,
|
||||
&NoopSaveTo, &MaybeInsertAndCreate, &basic_helper,
|
||||
};
|
||||
return &maybe_insert_and_create_helper;
|
||||
}
|
||||
|
||||
std::shared_ptr<SecondaryCache> comp_sec_cache_;
|
||||
std::shared_ptr<SecondaryCache> nvm_sec_cache_;
|
||||
};
|
||||
|
||||
} // namespace ROCKSDB_NAMESPACE
|
|
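As seen in secondary_cache_adapter.cc earlier in this patch, this wrapper is what NewTieredCache() places between the compressed secondary cache and the nvm cache when the three-queue policy is selected. A rough sketch of that stacking, where comp_opts and my_nvm_cache are illustrative placeholders:

// Illustrative only: comp_opts is a CompressedSecondaryCacheOptions instance
// and my_nvm_cache is some SecondaryCache backed by local flash.
std::shared_ptr<SecondaryCache> compressed =
    NewCompressedSecondaryCache(comp_opts);
std::shared_ptr<SecondaryCache> stacked =
    std::make_shared<TieredSecondaryCache>(
        compressed, my_nvm_cache,
        TieredAdmissionPolicy::kAdmPolicyThreeQueue);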
@@ -0,0 +1,711 @@
|
|||
// Copyright (c) Meta Platforms, Inc. and affiliates.
|
||||
// This source code is licensed under both the GPLv2 (found in the
|
||||
// COPYING file in the root directory) and Apache 2.0 License
|
||||
// (found in the LICENSE.Apache file in the root directory).
|
||||
//
|
||||
#include "cache/compressed_secondary_cache.h"
|
||||
#include "cache/secondary_cache_adapter.h"
|
||||
#include "db/db_test_util.h"
|
||||
#include "rocksdb/cache.h"
|
||||
#include "rocksdb/secondary_cache.h"
|
||||
#include "typed_cache.h"
|
||||
#include "util/random.h"
|
||||
|
||||
namespace ROCKSDB_NAMESPACE {
|
||||
|
||||
class TestSecondaryCache : public SecondaryCache {
|
||||
public:
|
||||
explicit TestSecondaryCache(size_t capacity)
|
||||
: cache_(NewLRUCache(capacity, 0, false, 0.5 /* high_pri_pool_ratio */,
|
||||
nullptr, kDefaultToAdaptiveMutex,
|
||||
kDontChargeCacheMetadata)),
|
||||
num_insert_saved_(0),
|
||||
num_hits_(0),
|
||||
num_misses_(0) {}
|
||||
|
||||
const char* Name() const override { return "TestSecondaryCache"; }
|
||||
|
||||
Status Insert(const Slice& /*key*/, Cache::ObjectPtr /*value*/,
|
||||
const Cache::CacheItemHelper* /*helper*/,
|
||||
bool /*force_insert*/) override {
|
||||
assert(false);
|
||||
return Status::NotSupported();
|
||||
}
|
||||
|
||||
Status InsertSaved(const Slice& key, const Slice& saved,
|
||||
CompressionType type = kNoCompression,
|
||||
CacheTier source = CacheTier::kVolatileTier) override {
|
||||
CheckCacheKeyCommonPrefix(key);
|
||||
size_t size;
|
||||
char* buf;
|
||||
Status s;
|
||||
|
||||
num_insert_saved_++;
|
||||
size = saved.size();
|
||||
buf = new char[size + sizeof(uint64_t) + 2 * sizeof(uint16_t)];
|
||||
EncodeFixed64(buf, size);
|
||||
buf += sizeof(uint64_t);
|
||||
EncodeFixed16(buf, type);
|
||||
buf += sizeof(uint16_t);
|
||||
EncodeFixed16(buf, (uint16_t)source);
|
||||
buf += sizeof(uint16_t);
|
||||
memcpy(buf, saved.data(), size);
|
||||
buf -= sizeof(uint64_t) + 2 * sizeof(uint16_t);
|
||||
if (!s.ok()) {
|
||||
delete[] buf;
|
||||
return s;
|
||||
}
|
||||
return cache_.Insert(key, buf, size);
|
||||
}
|
||||
|
||||
std::unique_ptr<SecondaryCacheResultHandle> Lookup(
|
||||
const Slice& key, const Cache::CacheItemHelper* helper,
|
||||
Cache::CreateContext* create_context, bool wait, bool /*advise_erase*/,
|
||||
bool& kept_in_sec_cache) override {
|
||||
std::string key_str = key.ToString();
|
||||
TEST_SYNC_POINT_CALLBACK("TestSecondaryCache::Lookup", &key_str);
|
||||
|
||||
std::unique_ptr<SecondaryCacheResultHandle> secondary_handle;
|
||||
kept_in_sec_cache = false;
|
||||
|
||||
TypedHandle* handle = cache_.Lookup(key);
|
||||
if (handle) {
|
||||
num_hits_++;
|
||||
Cache::ObjectPtr value = nullptr;
|
||||
size_t charge = 0;
|
||||
Status s;
|
||||
char* ptr = cache_.Value(handle);
|
||||
CompressionType type;
|
||||
CacheTier source;
|
||||
size_t size = DecodeFixed64(ptr);
|
||||
ptr += sizeof(uint64_t);
|
||||
type = static_cast<CompressionType>(DecodeFixed16(ptr));
|
||||
ptr += sizeof(uint16_t);
|
||||
source = static_cast<CacheTier>(DecodeFixed16(ptr));
|
||||
assert(source == CacheTier::kVolatileTier);
|
||||
ptr += sizeof(uint16_t);
|
||||
s = helper->create_cb(Slice(ptr, size), type, source, create_context,
|
||||
/*alloc*/ nullptr, &value, &charge);
|
||||
if (s.ok()) {
|
||||
secondary_handle.reset(new TestSecondaryCacheResultHandle(
|
||||
cache_.get(), handle, value, charge, /*ready=*/wait));
|
||||
kept_in_sec_cache = true;
|
||||
} else {
|
||||
cache_.Release(handle);
|
||||
}
|
||||
} else {
|
||||
num_misses_++;
|
||||
}
|
||||
return secondary_handle;
|
||||
}
|
||||
|
||||
bool SupportForceErase() const override { return false; }
|
||||
|
||||
void Erase(const Slice& /*key*/) override {}
|
||||
|
||||
void WaitAll(std::vector<SecondaryCacheResultHandle*> handles) override {
|
||||
for (SecondaryCacheResultHandle* handle : handles) {
|
||||
TestSecondaryCacheResultHandle* sec_handle =
|
||||
static_cast<TestSecondaryCacheResultHandle*>(handle);
|
||||
EXPECT_FALSE(sec_handle->IsReady());
|
||||
sec_handle->SetReady();
|
||||
}
|
||||
}
|
||||
|
||||
std::string GetPrintableOptions() const override { return ""; }
|
||||
|
||||
uint32_t num_insert_saved() { return num_insert_saved_; }
|
||||
|
||||
uint32_t num_hits() { return num_hits_; }
|
||||
|
||||
uint32_t num_misses() { return num_misses_; }
|
||||
|
||||
void CheckCacheKeyCommonPrefix(const Slice& key) {
|
||||
Slice current_prefix(key.data(), OffsetableCacheKey::kCommonPrefixSize);
|
||||
if (ckey_prefix_.empty()) {
|
||||
ckey_prefix_ = current_prefix.ToString();
|
||||
} else {
|
||||
EXPECT_EQ(ckey_prefix_, current_prefix.ToString());
|
||||
}
|
||||
}
|
||||
|
||||
private:
|
||||
class TestSecondaryCacheResultHandle : public SecondaryCacheResultHandle {
|
||||
public:
|
||||
TestSecondaryCacheResultHandle(Cache* cache, Cache::Handle* handle,
|
||||
Cache::ObjectPtr value, size_t size,
|
||||
bool ready)
|
||||
: cache_(cache),
|
||||
handle_(handle),
|
||||
value_(value),
|
||||
size_(size),
|
||||
is_ready_(ready) {}
|
||||
|
||||
~TestSecondaryCacheResultHandle() override { cache_->Release(handle_); }
|
||||
|
||||
bool IsReady() override { return is_ready_; }
|
||||
|
||||
void Wait() override {}
|
||||
|
||||
Cache::ObjectPtr Value() override {
|
||||
assert(is_ready_);
|
||||
return value_;
|
||||
}
|
||||
|
||||
size_t Size() override { return Value() ? size_ : 0; }
|
||||
|
||||
void SetReady() { is_ready_ = true; }
|
||||
|
||||
private:
|
||||
Cache* cache_;
|
||||
Cache::Handle* handle_;
|
||||
Cache::ObjectPtr value_;
|
||||
size_t size_;
|
||||
bool is_ready_;
|
||||
};
|
||||
|
||||
using SharedCache =
|
||||
BasicTypedSharedCacheInterface<char[], CacheEntryRole::kMisc>;
|
||||
using TypedHandle = SharedCache::TypedHandle;
|
||||
SharedCache cache_;
|
||||
uint32_t num_insert_saved_;
|
||||
uint32_t num_hits_;
|
||||
uint32_t num_misses_;
|
||||
std::string ckey_prefix_;
|
||||
};
|
||||
|
||||
class DBTieredSecondaryCacheTest : public DBTestBase {
|
||||
public:
|
||||
DBTieredSecondaryCacheTest()
|
||||
: DBTestBase("db_tiered_secondary_cache_test", /*env_do_fsync=*/true) {}
|
||||
|
||||
std::shared_ptr<Cache> NewCache(size_t pri_capacity,
|
||||
size_t compressed_capacity,
|
||||
size_t nvm_capacity,
|
||||
TieredAdmissionPolicy adm_policy =
|
||||
TieredAdmissionPolicy::kAdmPolicyAuto) {
|
||||
LRUCacheOptions lru_opts;
|
||||
TieredCacheOptions opts;
|
||||
lru_opts.capacity = 0;
|
||||
lru_opts.num_shard_bits = 0;
|
||||
lru_opts.high_pri_pool_ratio = 0;
|
||||
opts.cache_opts = &lru_opts;
|
||||
opts.cache_type = PrimaryCacheType::kCacheTypeLRU;
|
||||
opts.comp_cache_opts.capacity = 0;
|
||||
opts.comp_cache_opts.num_shard_bits = 0;
|
||||
opts.total_capacity = pri_capacity + compressed_capacity;
|
||||
opts.compressed_secondary_ratio =
|
||||
(double)compressed_capacity / opts.total_capacity;
|
||||
if (nvm_capacity > 0) {
|
||||
nvm_sec_cache_.reset(new TestSecondaryCache(nvm_capacity));
|
||||
opts.nvm_sec_cache = nvm_sec_cache_;
|
||||
}
|
||||
opts.adm_policy = adm_policy;
|
||||
cache_ = NewTieredCache(opts);
|
||||
assert(cache_ != nullptr);
|
||||
|
||||
return cache_;
|
||||
}
|
||||
|
||||
TestSecondaryCache* nvm_sec_cache() { return nvm_sec_cache_.get(); }
|
||||
|
||||
CompressedSecondaryCache* compressed_secondary_cache() {
|
||||
return static_cast<CompressedSecondaryCache*>(
|
||||
static_cast<CacheWithSecondaryAdapter*>(cache_.get())
|
||||
->TEST_GetSecondaryCache());
|
||||
}
|
||||
|
||||
private:
|
||||
std::shared_ptr<Cache> cache_;
|
||||
std::shared_ptr<TestSecondaryCache> nvm_sec_cache_;
|
||||
};
|
||||
|
||||
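The NewCache() helper above derives the tiered options from the three capacities it is given: total_capacity is the sum of the primary and compressed capacities, and compressed_secondary_ratio is the compressed share of that total. For example, the NewCache(256 * 1024, 5 * 1024, 256 * 1024) call in BasicTest below yields total_capacity = 261KB and compressed_secondary_ratio = 5/261, about 0.019, with a 256KB TestSecondaryCache attached as the nvm tier.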
// In this test, the block size is set to 4096. Each value is 1007 bytes, so
|
||||
// each data block contains exactly 4 KV pairs. Metadata blocks are not
|
||||
// cached, so we can accurately estimate the cache usage.
|
||||
TEST_F(DBTieredSecondaryCacheTest, BasicTest) {
|
||||
if (!LZ4_Supported()) {
|
||||
ROCKSDB_GTEST_SKIP("This test requires LZ4 support.");
|
||||
return;
|
||||
}
|
||||
|
||||
BlockBasedTableOptions table_options;
|
||||
// We want a block cache of size 5KB, and a compressed secondary cache of
|
||||
// size 5KB. However, we specify a block cache size of 256KB here in order
|
||||
// to take into account the cache reservation in the block cache on
|
||||
// behalf of the compressed cache. The unit of cache reservation is 256KB.
|
||||
// The effective block cache capacity will be calculated as 256 + 5 = 261KB,
|
||||
// and 256KB will be reserved for the compressed cache, leaving 5KB for
|
||||
// the primary block cache. We only have to worry about this here because
|
||||
// the cache size is so small.
|
||||
table_options.block_cache = NewCache(256 * 1024, 5 * 1024, 256 * 1024);
|
||||
table_options.block_size = 4 * 1024;
|
||||
table_options.cache_index_and_filter_blocks = false;
|
||||
Options options = GetDefaultOptions();
|
||||
options.create_if_missing = true;
|
||||
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
|
||||
|
||||
// Disable paranoid_file_checks so that flush will not read back the newly
|
||||
// written file
|
||||
options.paranoid_file_checks = false;
|
||||
DestroyAndReopen(options);
|
||||
Random rnd(301);
|
||||
const int N = 256;
|
||||
for (int i = 0; i < N; i++) {
|
||||
std::string p_v;
|
||||
test::CompressibleString(&rnd, 0.5, 1007, &p_v);
|
||||
ASSERT_OK(Put(Key(i), p_v));
|
||||
}
|
||||
|
||||
ASSERT_OK(Flush());
|
||||
|
||||
// The first 2 Gets, for keys 0 and 5, will load the corresponding data
|
||||
// blocks as they will be cache misses. The nvm secondary cache will be
|
||||
// warmed up with the compressed blocks
|
||||
std::string v = Get(Key(0));
|
||||
ASSERT_EQ(1007, v.size());
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 1u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 1u);
|
||||
|
||||
v = Get(Key(5));
|
||||
ASSERT_EQ(1007, v.size());
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 2u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 2u);
|
||||
|
||||
// At this point, the nvm cache is warmed up with the data blocks for 0
|
||||
// and 5. The next Get will look up the block in nvm and will be a hit.
|
||||
// It will be created as a standalone entry in memory, and a placeholder
|
||||
// will be inserted in the primary and compressed caches.
|
||||
v = Get(Key(0));
|
||||
ASSERT_EQ(1007, v.size());
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 2u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 2u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 1u);
|
||||
|
||||
// For this Get, the primary and compressed only have placeholders for
|
||||
// the required data block. So we will look up the nvm cache and find the
|
||||
// block there. This time, the block will be promoted to the primary
|
||||
// block cache. No promotion to the compressed secondary cache happens,
|
||||
// and it will retain the placeholder.
|
||||
v = Get(Key(0));
|
||||
ASSERT_EQ(1007, v.size());
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 2u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 2u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 2u);
|
||||
|
||||
// This Get will find the data block in the primary cache.
|
||||
v = Get(Key(0));
|
||||
ASSERT_EQ(1007, v.size());
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 2u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 2u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 2u);
|
||||
|
||||
// We repeat the sequence for key 5. This will end up evicting the block
|
||||
// for 0 from the in-memory cache.
|
||||
v = Get(Key(5));
|
||||
ASSERT_EQ(1007, v.size());
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 2u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 2u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 3u);
|
||||
|
||||
v = Get(Key(5));
|
||||
ASSERT_EQ(1007, v.size());
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 2u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 2u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 4u);
|
||||
|
||||
v = Get(Key(5));
|
||||
ASSERT_EQ(1007, v.size());
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 2u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 2u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 4u);
|
||||
|
||||
// This Get for key 0 will find the data block in nvm. Since the compressed
|
||||
// cache still has the placeholder, the block (compressed) will be
|
||||
// admitted. It is then inserted into the primary as a standalone entry.
|
||||
v = Get(Key(0));
|
||||
ASSERT_EQ(1007, v.size());
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 2u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 2u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 5u);
|
||||
|
||||
// This Get for key 0 will find the data block in the compressed secondary
|
||||
// cache.
|
||||
v = Get(Key(0));
|
||||
ASSERT_EQ(1007, v.size());
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 2u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 2u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 5u);
|
||||
|
||||
Destroy(options);
|
||||
}
|
||||
|
||||
// This test is very similar to BasicTest, except it calls MultiGet rather
|
||||
// than Get, in order to exercise the async lookup and WaitAll path.
|
||||
TEST_F(DBTieredSecondaryCacheTest, BasicMultiGetTest) {
|
||||
if (!LZ4_Supported()) {
|
||||
ROCKSDB_GTEST_SKIP("This test requires LZ4 support.");
|
||||
return;
|
||||
}
|
||||
|
||||
BlockBasedTableOptions table_options;
|
||||
table_options.block_cache = NewCache(260 * 1024, 10 * 1024, 256 * 1024);
|
||||
table_options.block_size = 4 * 1024;
|
||||
table_options.cache_index_and_filter_blocks = false;
|
||||
Options options = GetDefaultOptions();
|
||||
options.create_if_missing = true;
|
||||
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
|
||||
|
||||
options.paranoid_file_checks = false;
|
||||
DestroyAndReopen(options);
|
||||
Random rnd(301);
|
||||
const int N = 256;
|
||||
for (int i = 0; i < N; i++) {
|
||||
std::string p_v;
|
||||
test::CompressibleString(&rnd, 0.5, 1007, &p_v);
|
||||
ASSERT_OK(Put(Key(i), p_v));
|
||||
}
|
||||
|
||||
ASSERT_OK(Flush());
|
||||
|
||||
std::vector<std::string> keys;
|
||||
std::vector<std::string> values;
|
||||
|
||||
keys.push_back(Key(0));
|
||||
keys.push_back(Key(4));
|
||||
keys.push_back(Key(8));
|
||||
values = MultiGet(keys, /*snapshot=*/nullptr, /*async=*/true);
|
||||
ASSERT_EQ(values.size(), keys.size());
|
||||
for (auto value : values) {
|
||||
ASSERT_EQ(1007, value.size());
|
||||
}
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 3u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 3u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 0u);
|
||||
|
||||
keys.clear();
|
||||
values.clear();
|
||||
keys.push_back(Key(12));
|
||||
keys.push_back(Key(16));
|
||||
keys.push_back(Key(20));
|
||||
values = MultiGet(keys, /*snapshot=*/nullptr, /*async=*/true);
|
||||
ASSERT_EQ(values.size(), keys.size());
|
||||
for (auto value : values) {
|
||||
ASSERT_EQ(1007, value.size());
|
||||
}
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 6u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 6u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 0u);
|
||||
|
||||
keys.clear();
|
||||
values.clear();
|
||||
keys.push_back(Key(0));
|
||||
keys.push_back(Key(4));
|
||||
keys.push_back(Key(8));
|
||||
values = MultiGet(keys, /*snapshot=*/nullptr, /*async=*/true);
|
||||
ASSERT_EQ(values.size(), keys.size());
|
||||
for (auto value : values) {
|
||||
ASSERT_EQ(1007, value.size());
|
||||
}
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 6u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 6u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 3u);
|
||||
|
||||
keys.clear();
|
||||
values.clear();
|
||||
keys.push_back(Key(0));
|
||||
keys.push_back(Key(4));
|
||||
keys.push_back(Key(8));
|
||||
values = MultiGet(keys, /*snapshot=*/nullptr, /*async=*/true);
|
||||
ASSERT_EQ(values.size(), keys.size());
|
||||
for (auto value : values) {
|
||||
ASSERT_EQ(1007, value.size());
|
||||
}
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 6u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 6u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 6u);
|
||||
|
||||
keys.clear();
|
||||
values.clear();
|
||||
keys.push_back(Key(0));
|
||||
keys.push_back(Key(4));
|
||||
keys.push_back(Key(8));
|
||||
values = MultiGet(keys, /*snapshot=*/nullptr, /*async=*/true);
|
||||
ASSERT_EQ(values.size(), keys.size());
|
||||
for (auto value : values) {
|
||||
ASSERT_EQ(1007, value.size());
|
||||
}
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 6u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 6u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 6u);
|
||||
|
||||
keys.clear();
|
||||
values.clear();
|
||||
keys.push_back(Key(12));
|
||||
keys.push_back(Key(16));
|
||||
keys.push_back(Key(20));
|
||||
values = MultiGet(keys, /*snapshot=*/nullptr, /*async=*/true);
|
||||
ASSERT_EQ(values.size(), keys.size());
|
||||
for (auto value : values) {
|
||||
ASSERT_EQ(1007, value.size());
|
||||
}
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 6u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 6u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 9u);
|
||||
|
||||
keys.clear();
|
||||
values.clear();
|
||||
keys.push_back(Key(12));
|
||||
keys.push_back(Key(16));
|
||||
keys.push_back(Key(20));
|
||||
values = MultiGet(keys, /*snapshot=*/nullptr, /*async=*/true);
|
||||
ASSERT_EQ(values.size(), keys.size());
|
||||
for (auto value : values) {
|
||||
ASSERT_EQ(1007, value.size());
|
||||
}
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 6u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 6u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 12u);
|
||||
|
||||
keys.clear();
|
||||
values.clear();
|
||||
keys.push_back(Key(12));
|
||||
keys.push_back(Key(16));
|
||||
keys.push_back(Key(20));
|
||||
values = MultiGet(keys, /*snapshot=*/nullptr, /*async=*/true);
|
||||
ASSERT_EQ(values.size(), keys.size());
|
||||
for (auto value : values) {
|
||||
ASSERT_EQ(1007, value.size());
|
||||
}
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 6u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 6u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 12u);
|
||||
|
||||
Destroy(options);
|
||||
}
|
||||
|
||||
TEST_F(DBTieredSecondaryCacheTest, WaitAllTest) {
|
||||
if (!LZ4_Supported()) {
|
||||
ROCKSDB_GTEST_SKIP("This test requires LZ4 support.");
|
||||
return;
|
||||
}
|
||||
|
||||
BlockBasedTableOptions table_options;
|
||||
table_options.block_cache = NewCache(250 * 1024, 20 * 1024, 256 * 1024);
|
||||
table_options.block_size = 4 * 1024;
|
||||
table_options.cache_index_and_filter_blocks = false;
|
||||
Options options = GetDefaultOptions();
|
||||
options.create_if_missing = true;
|
||||
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
|
||||
|
||||
options.paranoid_file_checks = false;
|
||||
DestroyAndReopen(options);
|
||||
Random rnd(301);
|
||||
const int N = 256;
|
||||
for (int i = 0; i < N; i++) {
|
||||
std::string p_v;
|
||||
test::CompressibleString(&rnd, 0.5, 1007, &p_v);
|
||||
ASSERT_OK(Put(Key(i), p_v));
|
||||
}
|
||||
|
||||
ASSERT_OK(Flush());
|
||||
|
||||
std::vector<std::string> keys;
|
||||
std::vector<std::string> values;
|
||||
|
||||
keys.push_back(Key(0));
|
||||
keys.push_back(Key(4));
|
||||
keys.push_back(Key(8));
|
||||
values = MultiGet(keys, /*snapshot=*/nullptr, /*async=*/true);
|
||||
ASSERT_EQ(values.size(), keys.size());
|
||||
for (auto value : values) {
|
||||
ASSERT_EQ(1007, value.size());
|
||||
}
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 3u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 3u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 0u);
|
||||
|
||||
keys.clear();
|
||||
values.clear();
|
||||
keys.push_back(Key(12));
|
||||
keys.push_back(Key(16));
|
||||
keys.push_back(Key(20));
|
||||
values = MultiGet(keys, /*snapshot=*/nullptr, /*async=*/true);
|
||||
ASSERT_EQ(values.size(), keys.size());
|
||||
for (auto value : values) {
|
||||
ASSERT_EQ(1007, value.size());
|
||||
}
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 6u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 6u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 0u);
|
||||
|
||||
// Insert placeholders for 4 in primary and compressed
|
||||
std::string val = Get(Key(4));
|
||||
|
||||
// Force placeholder 4 out of primary
|
||||
keys.clear();
|
||||
values.clear();
|
||||
keys.push_back(Key(24));
|
||||
keys.push_back(Key(28));
|
||||
keys.push_back(Key(32));
|
||||
keys.push_back(Key(36));
|
||||
values = MultiGet(keys, /*snapshot=*/nullptr, /*async=*/true);
|
||||
ASSERT_EQ(values.size(), keys.size());
|
||||
for (auto value : values) {
|
||||
ASSERT_EQ(1007, value.size());
|
||||
}
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 10u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 10u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 1u);
|
||||
|
||||
// Now read 4 again. This will create a placeholder in primary, and insert
|
||||
// in compressed secondary since it already has a placeholder
|
||||
val = Get(Key(4));
|
||||
|
||||
// Now read 0, 4 and 8. While 4 is already in the compressed secondary
|
||||
// cache, 0 and 8 will be read asynchronously from the nvm tier. The
|
||||
// WaitAll will be called for all 3 blocks.
|
||||
keys.clear();
|
||||
values.clear();
|
||||
keys.push_back(Key(0));
|
||||
keys.push_back(Key(4));
|
||||
keys.push_back(Key(8));
|
||||
values = MultiGet(keys, /*snapshot=*/nullptr, /*async=*/true);
|
||||
ASSERT_EQ(values.size(), keys.size());
|
||||
for (auto value : values) {
|
||||
ASSERT_EQ(1007, value.size());
|
||||
}
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 10u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 10u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 4u);
|
||||
|
||||
Destroy(options);
|
||||
}
|
||||
|
||||
// This test is for iteration. It iterates through a set of keys in two
|
||||
// passes. First pass loads the compressed blocks into the nvm tier, and
|
||||
// the second pass should hit all of those blocks.
|
||||
TEST_F(DBTieredSecondaryCacheTest, IterateTest) {
|
||||
if (!LZ4_Supported()) {
|
||||
ROCKSDB_GTEST_SKIP("This test requires LZ4 support.");
|
||||
return;
|
||||
}
|
||||
|
||||
BlockBasedTableOptions table_options;
|
||||
table_options.block_cache = NewCache(250 * 1024, 10 * 1024, 256 * 1024);
|
||||
table_options.block_size = 4 * 1024;
|
||||
table_options.cache_index_and_filter_blocks = false;
|
||||
Options options = GetDefaultOptions();
|
||||
options.create_if_missing = true;
|
||||
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
|
||||
|
||||
options.paranoid_file_checks = false;
|
||||
DestroyAndReopen(options);
|
||||
Random rnd(301);
|
||||
const int N = 256;
|
||||
for (int i = 0; i < N; i++) {
|
||||
std::string p_v;
|
||||
test::CompressibleString(&rnd, 0.5, 1007, &p_v);
|
||||
ASSERT_OK(Put(Key(i), p_v));
|
||||
}
|
||||
|
||||
ASSERT_OK(Flush());
|
||||
|
||||
ReadOptions ro;
|
||||
ro.readahead_size = 256 * 1024;
|
||||
auto iter = dbfull()->NewIterator(ro);
|
||||
iter->SeekToFirst();
|
||||
for (int i = 0; i < 31; ++i) {
|
||||
ASSERT_EQ(Key(i), iter->key().ToString());
|
||||
ASSERT_EQ(1007, iter->value().size());
|
||||
iter->Next();
|
||||
}
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 8u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 8u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 0u);
|
||||
delete iter;
|
||||
|
||||
iter = dbfull()->NewIterator(ro);
|
||||
iter->SeekToFirst();
|
||||
for (int i = 0; i < 31; ++i) {
|
||||
ASSERT_EQ(Key(i), iter->key().ToString());
|
||||
ASSERT_EQ(1007, iter->value().size());
|
||||
iter->Next();
|
||||
}
|
||||
ASSERT_EQ(nvm_sec_cache()->num_insert_saved(), 8u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_misses(), 8u);
|
||||
ASSERT_EQ(nvm_sec_cache()->num_hits(), 8u);
|
||||
delete iter;
|
||||
|
||||
Destroy(options);
|
||||
}
|
||||
|
||||
class DBTieredAdmPolicyTest
|
||||
: public DBTieredSecondaryCacheTest,
|
||||
public testing::WithParamInterface<TieredAdmissionPolicy> {};
|
||||
|
||||
TEST_P(DBTieredAdmPolicyTest, CompressedOnlyTest) {
|
||||
if (!LZ4_Supported()) {
|
||||
ROCKSDB_GTEST_SKIP("This test requires LZ4 support.");
|
||||
return;
|
||||
}
|
||||
|
||||
BlockBasedTableOptions table_options;
|
||||
// We want a block cache of size 10KB, and a compressed secondary cache of
|
||||
// size 10KB. However, we specify a block cache size of 256KB here in order
|
||||
// to take into account the cache reservation in the block cache on
|
||||
// behalf of the compressed cache. The unit of cache reservation is 256KB.
|
||||
// The effective block cache capacity will be calculated as 256 + 10 = 266KB,
|
||||
// and 256KB will be reserved for the compressed cache, leaving 10KB for
|
||||
// the primary block cache. We only have to worry about this here because
|
||||
// the cache size is so small.
|
||||
table_options.block_cache = NewCache(256 * 1024, 10 * 1024, 0, GetParam());
|
||||
table_options.block_size = 4 * 1024;
|
||||
table_options.cache_index_and_filter_blocks = false;
|
||||
Options options = GetDefaultOptions();
|
||||
options.create_if_missing = true;
|
||||
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
|
||||
|
||||
size_t comp_cache_usage = compressed_secondary_cache()->TEST_GetUsage();
|
||||
// Disable paranoid_file_checks so that flush will not read back the newly
|
||||
// written file
|
||||
options.paranoid_file_checks = false;
|
||||
DestroyAndReopen(options);
|
||||
Random rnd(301);
|
||||
const int N = 256;
|
||||
for (int i = 0; i < N; i++) {
|
||||
std::string p_v;
|
||||
test::CompressibleString(&rnd, 0.5, 1007, &p_v);
|
||||
ASSERT_OK(Put(Key(i), p_v));
|
||||
}
|
||||
|
||||
ASSERT_OK(Flush());
|
||||
|
||||
// The first 2 Gets, for keys 0 and 5, will load the corresponding data
|
||||
// blocks as they will be cache misses. Since this is a 2-tier cache (
|
||||
// primary and compressed), no warm-up should happen with the compressed
|
||||
// blocks.
|
||||
std::string v = Get(Key(0));
|
||||
ASSERT_EQ(1007, v.size());
|
||||
|
||||
v = Get(Key(5));
|
||||
ASSERT_EQ(1007, v.size());
|
||||
|
||||
ASSERT_EQ(compressed_secondary_cache()->TEST_GetUsage(), comp_cache_usage);
|
||||
|
||||
Destroy(options);
|
||||
}
|
||||
|
||||
INSTANTIATE_TEST_CASE_P(
|
||||
DBTieredAdmPolicyTest, DBTieredAdmPolicyTest,
|
||||
::testing::Values(TieredAdmissionPolicy::kAdmPolicyAuto,
|
||||
TieredAdmissionPolicy::kAdmPolicyPlaceholder,
|
||||
TieredAdmissionPolicy::kAdmPolicyAllowCacheHits));
|
||||
|
||||
} // namespace ROCKSDB_NAMESPACE
|
||||
|
||||
int main(int argc, char** argv) {
|
||||
ROCKSDB_NAMESPACE::port::InstallStackTraceHandler();
|
||||
::testing::InitGoogleTest(&argc, argv);
|
||||
return RUN_ALL_TESTS();
|
||||
}
|
|
@@ -234,15 +234,19 @@ class FullTypedCacheHelperFns : public BasicTypedCacheHelperFns<TValue> {
|
|||
return Status::OK();
|
||||
}
|
||||
|
||||
static Status Create(const Slice& data, CreateContext* context,
|
||||
static Status Create(const Slice& data, CompressionType type,
|
||||
CacheTier source, CreateContext* context,
|
||||
MemoryAllocator* allocator, ObjectPtr* out_obj,
|
||||
size_t* out_charge) {
|
||||
std::unique_ptr<TValue> value = nullptr;
|
||||
if (source != CacheTier::kVolatileTier) {
|
||||
return Status::InvalidArgument();
|
||||
}
|
||||
if constexpr (sizeof(TCreateContext) > 0) {
|
||||
TCreateContext* tcontext = static_cast<TCreateContext*>(context);
|
||||
tcontext->Create(&value, out_charge, data, allocator);
|
||||
tcontext->Create(&value, out_charge, data, type, allocator);
|
||||
} else {
|
||||
TCreateContext::Create(&value, out_charge, data, allocator);
|
||||
TCreateContext::Create(&value, out_charge, data, type, allocator);
|
||||
}
|
||||
*out_obj = UpCastValue(value.release());
|
||||
return Status::OK();
|
||||
|
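Existing creation callbacks have to adopt the extended signature shown here, which now also receives the compression type and the source cache tier. A hedged sketch of such a callback; MyObject and the handling of the type argument are placeholders, not part of this patch:

// Illustrative only: MyObject is a placeholder for a caller-defined type.
static Status MyCreate(const Slice& data, CompressionType type,
                       CacheTier source, Cache::CreateContext* /*ctx*/,
                       MemoryAllocator* /*allocator*/,
                       Cache::ObjectPtr* out_obj, size_t* out_charge) {
  if (source != CacheTier::kVolatileTier) {
    // Mirror FullTypedCacheHelperFns::Create above: only accept data meant
    // for the volatile (primary) tier.
    return Status::InvalidArgument();
  }
  // `type` tells the callback whether `data` still needs to be decompressed.
  *out_obj = new MyObject(data, type);
  *out_charge = data.size();
  return Status::OK();
}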
@@ -301,13 +305,15 @@ class FullTypedCacheInterface
|
|||
inline Status InsertFull(
|
||||
const Slice& key, TValuePtr value, size_t charge,
|
||||
TypedHandle** handle = nullptr, Priority priority = Priority::LOW,
|
||||
CacheTier lowest_used_cache_tier = CacheTier::kNonVolatileBlockTier) {
|
||||
CacheTier lowest_used_cache_tier = CacheTier::kNonVolatileBlockTier,
|
||||
const Slice& compressed = Slice(),
|
||||
CompressionType type = CompressionType::kNoCompression) {
|
||||
auto untyped_handle = reinterpret_cast<Handle**>(handle);
|
||||
auto helper = lowest_used_cache_tier == CacheTier::kNonVolatileBlockTier
|
||||
auto helper = lowest_used_cache_tier > CacheTier::kVolatileTier
|
||||
? GetFullHelper()
|
||||
: GetBasicHelper();
|
||||
return this->cache_->Insert(key, UpCastValue(value), helper, charge,
|
||||
untyped_handle, priority);
|
||||
untyped_handle, priority, compressed, type);
|
||||
}
|
||||
|
||||
// Like SecondaryCache::InsertSaved, with SecondaryCache compatibility
|
||||
|
@ -319,9 +325,9 @@ class FullTypedCacheInterface
|
|||
size_t* out_charge = nullptr) {
|
||||
ObjectPtr value;
|
||||
size_t charge;
|
||||
Status st = GetFullHelper()->create_cb(data, create_context,
|
||||
this->cache_->memory_allocator(),
|
||||
&value, &charge);
|
||||
Status st = GetFullHelper()->create_cb(
|
||||
data, kNoCompression, CacheTier::kVolatileTier, create_context,
|
||||
this->cache_->memory_allocator(), &value, &charge);
|
||||
if (out_charge) {
|
||||
*out_charge = charge;
|
||||
}
|
||||
|
@ -340,7 +346,7 @@ class FullTypedCacheInterface
|
|||
const Slice& key, TCreateContext* create_context = nullptr,
|
||||
Priority priority = Priority::LOW, Statistics* stats = nullptr,
|
||||
CacheTier lowest_used_cache_tier = CacheTier::kNonVolatileBlockTier) {
|
||||
if (lowest_used_cache_tier == CacheTier::kNonVolatileBlockTier) {
|
||||
if (lowest_used_cache_tier > CacheTier::kVolatileTier) {
|
||||
return reinterpret_cast<TypedHandle*>(this->cache_->Lookup(
|
||||
key, GetFullHelper(), create_context, priority, stats));
|
||||
} else {
|
||||
|
@ -352,7 +358,7 @@ class FullTypedCacheInterface
|
|||
inline void StartAsyncLookupFull(
|
||||
TypedAsyncLookupHandle& async_handle,
|
||||
CacheTier lowest_used_cache_tier = CacheTier::kNonVolatileBlockTier) {
|
||||
if (lowest_used_cache_tier == CacheTier::kNonVolatileBlockTier) {
|
||||
if (lowest_used_cache_tier > CacheTier::kVolatileTier) {
|
||||
async_handle.helper = GetFullHelper();
|
||||
this->cache_->StartAsyncLookup(async_handle);
|
||||
} else {
|
||||
|
|
|
@ -46,7 +46,8 @@ class BlobContents {
|
|||
class BlobContentsCreator : public Cache::CreateContext {
|
||||
public:
|
||||
static void Create(std::unique_ptr<BlobContents>* out, size_t* out_charge,
|
||||
const Slice& contents, MemoryAllocator* alloc) {
|
||||
const Slice& contents, CompressionType /*type*/,
|
||||
MemoryAllocator* alloc) {
|
||||
auto raw = new BlobContents(AllocateAndCopyBlock(contents, alloc),
|
||||
contents.size());
|
||||
out->reset(raw);
|
||||
|
|
|
@ -585,7 +585,8 @@ Status BlobFileReader::UncompressBlobIfNeeded(
|
|||
assert(result);
|
||||
|
||||
if (compression_type == kNoCompression) {
|
||||
BlobContentsCreator::Create(result, nullptr, value_slice, allocator);
|
||||
BlobContentsCreator::Create(result, nullptr, value_slice, kNoCompression,
|
||||
allocator);
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
|
|
|
@ -294,12 +294,12 @@ Status BuildTable(
|
|||
if (!s.ok() || empty) {
|
||||
builder->Abandon();
|
||||
} else {
|
||||
std::string seqno_time_mapping_str;
|
||||
std::string seqno_to_time_mapping_str;
|
||||
seqno_to_time_mapping.Encode(
|
||||
seqno_time_mapping_str, meta->fd.smallest_seqno,
|
||||
seqno_to_time_mapping_str, meta->fd.smallest_seqno,
|
||||
meta->fd.largest_seqno, meta->file_creation_time);
|
||||
builder->SetSeqnoTimeTableProperties(
|
||||
seqno_time_mapping_str,
|
||||
seqno_to_time_mapping_str,
|
||||
ioptions.compaction_style == CompactionStyle::kCompactionStyleFIFO
|
||||
? meta->file_creation_time
|
||||
: meta->oldest_ancester_time);
|
||||
|
|
|
@ -1119,7 +1119,7 @@ Compaction* ColumnFamilyData::PickCompaction(
|
|||
GetName(), mutable_options, mutable_db_options, current_->storage_info(),
|
||||
log_buffer);
|
||||
if (result != nullptr) {
|
||||
result->SetInputVersion(current_);
|
||||
result->FinalizeInputInfo(current_);
|
||||
}
|
||||
return result;
|
||||
}
|
||||
|
@ -1203,7 +1203,7 @@ Compaction* ColumnFamilyData::CompactRange(
|
|||
compact_range_options, begin, end, compaction_end, conflict,
|
||||
max_file_num_to_ignore, trim_ts);
|
||||
if (result != nullptr) {
|
||||
result->SetInputVersion(current_);
|
||||
result->FinalizeInputInfo(current_);
|
||||
}
|
||||
TEST_SYNC_POINT("ColumnFamilyData::CompactRange:Return");
|
||||
return result;
|
||||
|
|
|
@ -8,6 +8,7 @@
|
|||
// found in the LICENSE file. See the AUTHORS file for names of contributors.
|
||||
|
||||
#include <algorithm>
|
||||
#include <atomic>
|
||||
#include <string>
|
||||
#include <thread>
|
||||
#include <vector>
|
||||
|
@ -27,6 +28,7 @@
|
|||
#include "test_util/testharness.h"
|
||||
#include "test_util/testutil.h"
|
||||
#include "util/coding.h"
|
||||
#include "util/defer.h"
|
||||
#include "util/string_util.h"
|
||||
#include "utilities/fault_injection_env.h"
|
||||
#include "utilities/merge_operators.h"
|
||||
|
@ -2169,13 +2171,57 @@ TEST_P(ColumnFamilyTest, FlushStaleColumnFamilies) {
|
|||
Close();
|
||||
}
|
||||
|
||||
namespace {
|
||||
struct CountOptionsFilesFs : public FileSystemWrapper {
|
||||
explicit CountOptionsFilesFs(const std::shared_ptr<FileSystem>& t)
|
||||
: FileSystemWrapper(t) {}
|
||||
const char* Name() const override { return "CountOptionsFilesFs"; }
|
||||
|
||||
IOStatus NewWritableFile(const std::string& f, const FileOptions& file_opts,
|
||||
std::unique_ptr<FSWritableFile>* r,
|
||||
IODebugContext* dbg) override {
|
||||
if (f.find("OPTIONS-") != std::string::npos) {
|
||||
options_files_created.fetch_add(1, std::memory_order_relaxed);
|
||||
}
|
||||
return FileSystemWrapper::NewWritableFile(f, file_opts, r, dbg);
|
||||
}
|
||||
|
||||
std::atomic<int> options_files_created{};
|
||||
};
|
||||
} // namespace
|
||||
|
||||
TEST_P(ColumnFamilyTest, CreateMissingColumnFamilies) {
|
||||
Status s = TryOpen({"one", "two"});
|
||||
ASSERT_TRUE(!s.ok());
|
||||
db_options_.create_missing_column_families = true;
|
||||
s = TryOpen({"default", "one", "two"});
|
||||
ASSERT_TRUE(s.ok());
|
||||
// Can't accidentally add CFs to an existing DB
|
||||
Open();
|
||||
Close();
|
||||
ASSERT_FALSE(db_options_.create_missing_column_families);
|
||||
ASSERT_NOK(TryOpen({"one", "two"}));
|
||||
|
||||
// Nor accidentally create in a new DB
|
||||
Destroy();
|
||||
db_options_.create_if_missing = true;
|
||||
ASSERT_NOK(TryOpen({"one", "two"}));
|
||||
|
||||
// Only with the option (new DB case)
|
||||
db_options_.create_missing_column_families = true;
|
||||
// Also setup to count number of options files created (see check below)
|
||||
auto my_fs =
|
||||
std::make_shared<CountOptionsFilesFs>(db_options_.env->GetFileSystem());
|
||||
auto my_env = std::make_unique<CompositeEnvWrapper>(db_options_.env, my_fs);
|
||||
SaveAndRestore<Env*> save_restore_env(&db_options_.env, my_env.get());
|
||||
|
||||
ASSERT_OK(TryOpen({"default", "one", "two"}));
|
||||
Close();
|
||||
|
||||
// An older version would write an updated options file for each column
|
||||
// family created under create_missing_column_families, which would be
|
||||
// quadratic I/O in the number of column families.
|
||||
ASSERT_EQ(my_fs->options_files_created.load(), 1);
|
||||
|
||||
// Add to existing DB case
|
||||
ASSERT_OK(TryOpen({"default", "one", "two", "three", "four"}));
|
||||
Close();
|
||||
ASSERT_EQ(my_fs->options_files_created.load(), 2);
|
||||
}
|
||||
|
||||
TEST_P(ColumnFamilyTest, SanitizeOptions) {
|
||||
|
|
|
@ -204,32 +204,36 @@ bool Compaction::IsFullCompaction(
|
|||
return num_files_in_compaction == total_num_files;
|
||||
}
|
||||
|
||||
const TablePropertiesCollection& Compaction::GetTableProperties() {
|
||||
if (!input_table_properties_initialized_) {
|
||||
const ReadOptions read_options(Env::IOActivity::kCompaction);
|
||||
for (size_t i = 0; i < num_input_levels(); ++i) {
|
||||
for (const FileMetaData* fmd : *(this->inputs(i))) {
|
||||
std::shared_ptr<const TableProperties> tp;
|
||||
std::string file_name =
|
||||
TableFileName(immutable_options_.cf_paths, fmd->fd.GetNumber(),
|
||||
fmd->fd.GetPathId());
|
||||
Status s = input_version_->GetTableProperties(read_options, &tp, fmd,
|
||||
&file_name);
|
||||
if (s.ok()) {
|
||||
table_properties_[file_name] = tp;
|
||||
} else {
|
||||
ROCKS_LOG_ERROR(immutable_options_.info_log,
|
||||
"Unable to load table properties for file %" PRIu64
|
||||
" --- %s\n",
|
||||
fmd->fd.GetNumber(), s.ToString().c_str());
|
||||
}
|
||||
Status Compaction::InitInputTableProperties() {
|
||||
if (!input_table_properties_.empty()) {
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
Status s;
|
||||
const ReadOptions read_options(Env::IOActivity::kCompaction);
|
||||
assert(input_version_);
|
||||
for (size_t i = 0; i < num_input_levels(); ++i) {
|
||||
for (const FileMetaData* fmd : *(this->inputs(i))) {
|
||||
std::shared_ptr<const TableProperties> tp;
|
||||
std::string file_name =
|
||||
TableFileName(immutable_options_.cf_paths, fmd->fd.GetNumber(),
|
||||
fmd->fd.GetPathId());
|
||||
s = input_version_->GetTableProperties(read_options, &tp, fmd,
|
||||
&file_name);
|
||||
if (s.ok()) {
|
||||
input_table_properties_[file_name] = tp;
|
||||
} else {
|
||||
ROCKS_LOG_ERROR(immutable_options_.info_log,
|
||||
"Unable to load table properties for file %" PRIu64
|
||||
" --- %s\n",
|
||||
fmd->fd.GetNumber(), s.ToString().c_str());
|
||||
input_table_properties_.clear();
|
||||
return s;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
input_table_properties_initialized_ = true;
|
||||
};
|
||||
|
||||
return table_properties_;
|
||||
return s;
|
||||
}
|
||||
|
||||
Compaction::Compaction(
|
||||
|
@ -774,8 +778,17 @@ std::unique_ptr<CompactionFilter> Compaction::CreateCompactionFilter() const {
|
|||
CompactionFilter::Context context;
|
||||
context.is_full_compaction = is_full_compaction_;
|
||||
context.is_manual_compaction = is_manual_compaction_;
|
||||
context.input_start_level = start_level_;
|
||||
context.column_family_id = cfd_->GetID();
|
||||
context.reason = TableFileCreationReason::kCompaction;
|
||||
context.input_table_properties = GetInputTableProperties();
|
||||
if (context.input_table_properties.empty()) {
|
||||
ROCKS_LOG_WARN(
|
||||
immutable_options_.info_log,
|
||||
"Unable to set `input_table_properties` of `CompactionFilter::Context` "
|
||||
"for compaction.");
|
||||
}
|
||||
|
||||
return cfd_->ioptions()->compaction_filter_factory->CreateCompactionFilter(
|
||||
context);
|
||||
}
|
||||
|
|
|
@ -289,7 +289,14 @@ class Compaction {
|
|||
// is the sum of all input file sizes.
|
||||
uint64_t OutputFilePreallocationSize() const;
|
||||
|
||||
void SetInputVersion(Version* input_version);
|
||||
// TODO(hx235): eventually we should consider `InitInputTableProperties()`'s
|
||||
// status and fail the compaction if needed
|
||||
// TODO(hx235): consider making this function part of the construction so we
|
||||
// don't forget to call it
|
||||
void FinalizeInputInfo(Version* input_version) {
|
||||
SetInputVersion(input_version);
|
||||
InitInputTableProperties().PermitUncheckedError();
|
||||
}
|
||||
|
||||
struct InputLevelSummaryBuffer {
|
||||
char buffer[128];
|
||||
|
@ -326,16 +333,20 @@ class Compaction {
|
|||
int output_level, VersionStorageInfo* vstorage,
|
||||
const std::vector<CompactionInputFiles>& inputs);
|
||||
|
||||
// If called before a compaction finishes, will return
|
||||
// table properties of all compaction input files.
|
||||
// If called after a compaction finished, will return
|
||||
// table properties of all compaction input and output files.
|
||||
const TablePropertiesCollection& GetTableProperties();
|
||||
const TablePropertiesCollection& GetInputTableProperties() const {
|
||||
return input_table_properties_;
|
||||
}
|
||||
|
||||
// TODO(hx235): consider making this function symmetric to
|
||||
// InitInputTableProperties()
|
||||
void SetOutputTableProperties(
|
||||
const std::string& file_name,
|
||||
const std::shared_ptr<const TableProperties>& tp) {
|
||||
table_properties_[file_name] = tp;
|
||||
output_table_properties_[file_name] = tp;
|
||||
}
|
||||
|
||||
const TablePropertiesCollection& GetOutputTableProperties() const {
|
||||
return output_table_properties_;
|
||||
}
|
||||
|
||||
Slice GetSmallestUserKey() const { return smallest_user_key_; }
|
||||
|
@ -432,6 +443,10 @@ class Compaction {
|
|||
const int output_level);
|
||||
|
||||
private:
|
||||
void SetInputVersion(Version* input_version);
|
||||
|
||||
Status InitInputTableProperties();
|
||||
|
||||
// mark (or clear) all files that are being compacted
|
||||
void MarkFilesBeingCompacted(bool mark_as_compacted);
|
||||
|
||||
|
@ -522,9 +537,8 @@ class Compaction {
|
|||
// Does input compression match the output compression?
|
||||
bool InputCompressionMatchesOutput() const;
|
||||
|
||||
bool input_table_properties_initialized_ = false;
|
||||
// table properties of output files
|
||||
TablePropertiesCollection table_properties_;
|
||||
TablePropertiesCollection input_table_properties_;
|
||||
TablePropertiesCollection output_table_properties_;
|
||||
|
||||
// smallest user keys in compaction
|
||||
// includes timestamp if user-defined timestamp is enabled.
|
||||
|
|
|
@ -288,23 +288,23 @@ void CompactionJob::Prepare() {
|
|||
|
||||
if (preserve_time_duration > 0) {
|
||||
const ReadOptions read_options(Env::IOActivity::kCompaction);
|
||||
// setup seqno_time_mapping_
|
||||
seqno_time_mapping_.SetMaxTimeDuration(preserve_time_duration);
|
||||
// setup seqno_to_time_mapping_
|
||||
seqno_to_time_mapping_.SetMaxTimeDuration(preserve_time_duration);
|
||||
for (const auto& each_level : *c->inputs()) {
|
||||
for (const auto& fmd : each_level.files) {
|
||||
std::shared_ptr<const TableProperties> tp;
|
||||
Status s =
|
||||
cfd->current()->GetTableProperties(read_options, &tp, fmd, nullptr);
|
||||
if (s.ok()) {
|
||||
seqno_time_mapping_.Add(tp->seqno_to_time_mapping)
|
||||
seqno_to_time_mapping_.Add(tp->seqno_to_time_mapping)
|
||||
.PermitUncheckedError();
|
||||
seqno_time_mapping_.Add(fmd->fd.smallest_seqno,
|
||||
fmd->oldest_ancester_time);
|
||||
seqno_to_time_mapping_.Add(fmd->fd.smallest_seqno,
|
||||
fmd->oldest_ancester_time);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
auto status = seqno_time_mapping_.Sort();
|
||||
auto status = seqno_to_time_mapping_.Sort();
|
||||
if (!status.ok()) {
|
||||
ROCKS_LOG_WARN(db_options_.info_log,
|
||||
"Invalid sequence number to time mapping: Status: %s",
|
||||
|
@ -320,13 +320,17 @@ void CompactionJob::Prepare() {
|
|||
preserve_time_min_seqno_ = 0;
|
||||
preclude_last_level_min_seqno_ = 0;
|
||||
} else {
|
||||
seqno_time_mapping_.TruncateOldEntries(_current_time);
|
||||
seqno_to_time_mapping_.TruncateOldEntries(_current_time);
|
||||
uint64_t preserve_time =
|
||||
static_cast<uint64_t>(_current_time) > preserve_time_duration
|
||||
? _current_time - preserve_time_duration
|
||||
: 0;
|
||||
// GetProximalSeqnoBeforeTime tells us the last seqno known to have been
|
||||
// written at or before the given time. + 1 to get the minimum we should
|
||||
// preserve without excluding anything that might have been written on or
|
||||
// after the given time.
|
||||
preserve_time_min_seqno_ =
|
||||
seqno_time_mapping_.GetOldestSequenceNum(preserve_time);
|
||||
seqno_to_time_mapping_.GetProximalSeqnoBeforeTime(preserve_time) + 1;
|
||||
if (c->immutable_options()->preclude_last_level_data_seconds > 0) {
|
||||
uint64_t preclude_last_level_time =
|
||||
static_cast<uint64_t>(_current_time) >
|
||||
|
@ -335,7 +339,9 @@ void CompactionJob::Prepare() {
|
|||
c->immutable_options()->preclude_last_level_data_seconds
|
||||
: 0;
|
||||
preclude_last_level_min_seqno_ =
|
||||
seqno_time_mapping_.GetOldestSequenceNum(preclude_last_level_time);
|
||||
seqno_to_time_mapping_.GetProximalSeqnoBeforeTime(
|
||||
preclude_last_level_time) +
|
||||
1;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
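The "+ 1" in the hunk above is easiest to see with a tiny worked example (a sketch with made-up numbers, relying only on the behavior the comment above describes):

// Worked example. Suppose the recorded mapping is:
//   seqno 100 -> t=1000,  seqno 200 -> t=2000,  seqno 300 -> t=3000
// and the cutoff computed above is preserve_time = 2000.
//
// GetProximalSeqnoBeforeTime(2000) reports 200: the last sequence number
// known to have been written at or before t=2000. Any write that happened
// strictly after t=2000 must carry a larger sequence number, hence:
const uint64_t proximal_seqno = 200;  // result of the lookup above
const uint64_t preserve_time_min_seqno = proximal_seqno + 1;  // == 201
// Sequence numbers >= 201 are preserved; everything <= 200 is known to be
// older than the cutoff and may lose its time information.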
@ -1570,7 +1576,7 @@ Status CompactionJob::FinishCompactionOutputFile(
|
|||
|
||||
const uint64_t current_entries = outputs.NumEntries();
|
||||
|
||||
s = outputs.Finish(s, seqno_time_mapping_);
|
||||
s = outputs.Finish(s, seqno_to_time_mapping_);
|
||||
|
||||
if (s.ok()) {
|
||||
// With accurate smallest and largest key, we can get a slightly more
|
||||
|
@ -1938,6 +1944,7 @@ Status CompactionJob::OpenCompactionOutputFile(SubcompactionState* sub_compact,
|
|||
db_options_.stats, listeners, db_options_.file_checksum_gen_factory.get(),
|
||||
tmp_set.Contains(FileType::kTableFile), false));
|
||||
|
||||
// TODO(hx235): pass in the correct `oldest_key_time` instead of `0`
|
||||
TableBuilderOptions tboptions(
|
||||
*cfd->ioptions(), *(sub_compact->compaction->mutable_cf_options()),
|
||||
cfd->internal_comparator(), cfd->int_tbl_prop_collector_factories(),
|
||||
|
@ -1979,7 +1986,7 @@ bool CompactionJob::UpdateCompactionStats(uint64_t* num_input_range_del) {
|
|||
|
||||
bool has_error = false;
|
||||
const ReadOptions read_options(Env::IOActivity::kCompaction);
|
||||
const auto& input_table_properties = compaction->GetTableProperties();
|
||||
const auto& input_table_properties = compaction->GetInputTableProperties();
|
||||
for (int input_level = 0;
|
||||
input_level < static_cast<int>(compaction->num_input_levels());
|
||||
++input_level) {
|
||||
|
|
|
@ -350,7 +350,7 @@ class CompactionJob {
|
|||
|
||||
// Stores the sequence number to time mapping gathered from all input files
|
||||
// it also collects the smallest_seqno -> oldest_ancester_time from the SST.
|
||||
SeqnoToTimeMapping seqno_time_mapping_;
|
||||
SeqnoToTimeMapping seqno_to_time_mapping_;
|
||||
|
||||
// Minimal sequence number for preserving the time information. The time info
|
||||
// older than this sequence number won't be preserved after the compaction and
|
||||
|
|
|
@ -644,7 +644,7 @@ class CompactionJobTestBase : public testing::Test {
|
|||
mutable_cf_options_.max_compaction_bytes, 0, kNoCompression,
|
||||
cfd->GetLatestMutableCFOptions()->compression_opts,
|
||||
Temperature::kUnknown, max_subcompactions, grandparents, true);
|
||||
compaction.SetInputVersion(cfd->current());
|
||||
compaction.FinalizeInputInfo(cfd->current());
|
||||
|
||||
assert(db_options_.info_log);
|
||||
LogBuffer log_buffer(InfoLogLevel::INFO_LEVEL, db_options_.info_log.get());
|
||||
|
|
|
@ -18,16 +18,18 @@ void CompactionOutputs::NewBuilder(const TableBuilderOptions& tboptions) {
|
|||
builder_.reset(NewTableBuilder(tboptions, file_writer_.get()));
|
||||
}
|
||||
|
||||
Status CompactionOutputs::Finish(const Status& intput_status,
|
||||
const SeqnoToTimeMapping& seqno_time_mapping) {
|
||||
Status CompactionOutputs::Finish(
|
||||
const Status& intput_status,
|
||||
const SeqnoToTimeMapping& seqno_to_time_mapping) {
|
||||
FileMetaData* meta = GetMetaData();
|
||||
assert(meta != nullptr);
|
||||
Status s = intput_status;
|
||||
if (s.ok()) {
|
||||
std::string seqno_time_mapping_str;
|
||||
seqno_time_mapping.Encode(seqno_time_mapping_str, meta->fd.smallest_seqno,
|
||||
meta->fd.largest_seqno, meta->file_creation_time);
|
||||
builder_->SetSeqnoTimeTableProperties(seqno_time_mapping_str,
|
||||
std::string seqno_to_time_mapping_str;
|
||||
seqno_to_time_mapping.Encode(
|
||||
seqno_to_time_mapping_str, meta->fd.smallest_seqno,
|
||||
meta->fd.largest_seqno, meta->file_creation_time);
|
||||
builder_->SetSeqnoTimeTableProperties(seqno_to_time_mapping_str,
|
||||
meta->oldest_ancester_time);
|
||||
s = builder_->Finish();
|
||||
|
||||
|
|
|
@ -107,7 +107,7 @@ class CompactionOutputs {
|
|||
|
||||
// Finish the current output file
|
||||
Status Finish(const Status& intput_status,
|
||||
const SeqnoToTimeMapping& seqno_time_mapping);
|
||||
const SeqnoToTimeMapping& seqno_to_time_mapping);
|
||||
|
||||
// Update output table properties from table builder
|
||||
void UpdateTableProperties() {
|
||||
|
|
|
@ -1216,6 +1216,7 @@ class PrecludeLastLevelTest : public DBTestBase {
|
|||
PrecludeLastLevelTest()
|
||||
: DBTestBase("preclude_last_level_test", /*env_do_fsync=*/false) {
|
||||
mock_clock_ = std::make_shared<MockSystemClock>(env_->GetSystemClock());
|
||||
mock_clock_->SetCurrentTime(kMockStartTime);
|
||||
mock_env_ = std::make_unique<CompositeEnvWrapper>(env_, mock_clock_);
|
||||
}
|
||||
|
||||
|
@ -1223,6 +1224,10 @@ class PrecludeLastLevelTest : public DBTestBase {
|
|||
std::unique_ptr<Env> mock_env_;
|
||||
std::shared_ptr<MockSystemClock> mock_clock_;
|
||||
|
||||
// Sufficient starting time that preserve time doesn't under-flow into
|
||||
// pre-history
|
||||
static constexpr uint32_t kMockStartTime = 10000000;
|
||||
|
||||
void SetUp() override {
|
||||
mock_clock_->InstallTimedWaitFixCallback();
|
||||
SyncPoint::GetInstance()->SetCallBack(
|
||||
|
@ -1231,7 +1236,7 @@ class PrecludeLastLevelTest : public DBTestBase {
|
|||
reinterpret_cast<PeriodicTaskScheduler*>(arg);
|
||||
periodic_task_scheduler_ptr->TEST_OverrideTimer(mock_clock_.get());
|
||||
});
|
||||
mock_clock_->SetCurrentTime(0);
|
||||
mock_clock_->SetCurrentTime(kMockStartTime);
|
||||
}
|
||||
};
|
||||
|
||||
|
@ -1249,11 +1254,6 @@ TEST_F(PrecludeLastLevelTest, MigrationFromPreserveTimeManualCompaction) {
|
|||
options.num_levels = kNumLevels;
|
||||
DestroyAndReopen(options);
|
||||
|
||||
// pass some time first, otherwise the first a few keys write time are going
|
||||
// to be zero, and internally zero has special meaning: kUnknownSeqnoTime
|
||||
dbfull()->TEST_WaitForPeriodicTaskRun(
|
||||
[&] { mock_clock_->MockSleepForSeconds(static_cast<int>(kKeyPerSec)); });
|
||||
|
||||
int sst_num = 0;
|
||||
// Write files that are overlap and enough to trigger compaction
|
||||
for (; sst_num < kNumTrigger; sst_num++) {
|
||||
|
@ -1311,11 +1311,6 @@ TEST_F(PrecludeLastLevelTest, MigrationFromPreserveTimeAutoCompaction) {
|
|||
options.num_levels = kNumLevels;
|
||||
DestroyAndReopen(options);
|
||||
|
||||
// pass some time first, otherwise the first a few keys write time are going
|
||||
// to be zero, and internally zero has special meaning: kUnknownSeqnoTime
|
||||
dbfull()->TEST_WaitForPeriodicTaskRun(
|
||||
[&] { mock_clock_->MockSleepForSeconds(static_cast<int>(kKeyPerSec)); });
|
||||
|
||||
int sst_num = 0;
|
||||
// Write files that are overlap and enough to trigger compaction
|
||||
for (; sst_num < kNumTrigger; sst_num++) {
|
||||
|
@ -1387,11 +1382,6 @@ TEST_F(PrecludeLastLevelTest, MigrationFromPreserveTimePartial) {
|
|||
options.num_levels = kNumLevels;
|
||||
DestroyAndReopen(options);
|
||||
|
||||
// pass some time first, otherwise the first a few keys write time are going
|
||||
// to be zero, and internally zero has special meaning: kUnknownSeqnoTime
|
||||
dbfull()->TEST_WaitForPeriodicTaskRun(
|
||||
[&] { mock_clock_->MockSleepForSeconds(static_cast<int>(kKeyPerSec)); });
|
||||
|
||||
int sst_num = 0;
|
||||
// Write files that are overlap and enough to trigger compaction
|
||||
for (; sst_num < kNumTrigger; sst_num++) {
|
||||
|
@ -1514,11 +1504,6 @@ TEST_F(PrecludeLastLevelTest, LastLevelOnlyCompactionPartial) {
|
|||
options.num_levels = kNumLevels;
|
||||
DestroyAndReopen(options);
|
||||
|
||||
// pass some time first, otherwise the first a few keys write time are going
|
||||
// to be zero, and internally zero has special meaning: kUnknownSeqnoTime
|
||||
dbfull()->TEST_WaitForPeriodicTaskRun(
|
||||
[&] { mock_clock_->MockSleepForSeconds(static_cast<int>(kKeyPerSec)); });
|
||||
|
||||
int sst_num = 0;
|
||||
// Write files that are overlap and enough to trigger compaction
|
||||
for (; sst_num < kNumTrigger; sst_num++) {
|
||||
|
@ -1592,11 +1577,6 @@ TEST_P(PrecludeLastLevelTestWithParms, LastLevelOnlyCompactionNoPreclude) {
|
|||
options.num_levels = kNumLevels;
|
||||
DestroyAndReopen(options);
|
||||
|
||||
// pass some time first, otherwise the first a few keys write time are going
|
||||
// to be zero, and internally zero has special meaning: kUnknownSeqnoTime
|
||||
dbfull()->TEST_WaitForPeriodicTaskRun(
|
||||
[&] { mock_clock_->MockSleepForSeconds(static_cast<int>(kKeyPerSec)); });
|
||||
|
||||
Random rnd(301);
|
||||
int sst_num = 0;
|
||||
// Write files that are overlap and enough to trigger compaction
|
||||
|
@ -1906,11 +1886,6 @@ TEST_F(PrecludeLastLevelTest, PartialPenultimateLevelCompaction) {
|
|||
options.num_levels = kNumLevels;
|
||||
DestroyAndReopen(options);
|
||||
|
||||
// pass some time first, otherwise the first a few keys write time are going
|
||||
// to be zero, and internally zero has special meaning: kUnknownSeqnoTime
|
||||
dbfull()->TEST_WaitForPeriodicTaskRun(
|
||||
[&] { mock_clock_->MockSleepForSeconds(static_cast<int>(10)); });
|
||||
|
||||
Random rnd(301);
|
||||
|
||||
for (int i = 0; i < 300; i++) {
|
||||
|
@ -1996,7 +1971,13 @@ TEST_F(PrecludeLastLevelTest, PartialPenultimateLevelCompaction) {
|
|||
Close();
|
||||
}
|
||||
|
||||
TEST_F(PrecludeLastLevelTest, RangeDelsCauseFileEndpointsToOverlap) {
|
||||
// FIXME broken test:
|
||||
// dbfull()->TEST_WaitForCompact()
|
||||
// Corruption: force_consistency_checks(DEBUG): VersionBuilder: L5 has
|
||||
// overlapping ranges:
|
||||
// file #14 largest key: '6B6579303030303134' seq:32, type:1 vs.
|
||||
// file #19 smallest key: '6B6579303030303130' seq:10, type:1
|
||||
TEST_F(PrecludeLastLevelTest, DISABLED_RangeDelsCauseFileEndpointsToOverlap) {
|
||||
const int kNumLevels = 7;
|
||||
const int kSecondsPerKey = 10;
|
||||
const int kNumFiles = 3;
|
||||
|
@ -2017,12 +1998,6 @@ TEST_F(PrecludeLastLevelTest, RangeDelsCauseFileEndpointsToOverlap) {
|
|||
options.target_file_size_base = kFileBytes;
|
||||
DestroyAndReopen(options);
|
||||
|
||||
// pass some time first, otherwise the first a few keys write time are going
|
||||
// to be zero, and internally zero has special meaning: kUnknownSeqnoTime
|
||||
dbfull()->TEST_WaitForPeriodicTaskRun([&] {
|
||||
mock_clock_->MockSleepForSeconds(static_cast<int>(kSecondsPerKey));
|
||||
});
|
||||
|
||||
// Flush an L0 file with the following contents (new to old):
|
||||
//
|
||||
// Range deletions [4, 6) [7, 8) [9, 11)
|
||||
|
@ -2139,7 +2114,6 @@ TEST_F(PrecludeLastLevelTest, RangeDelsCauseFileEndpointsToOverlap) {
|
|||
Close();
|
||||
}
|
||||
|
||||
|
||||
} // namespace ROCKSDB_NAMESPACE
|
||||
|
||||
int main(int argc, char** argv) {
|
||||
|
|
|
@ -3654,10 +3654,12 @@ class DBBasicTestMultiGet : public DBTestBase {
|
|||
|
||||
Status Insert(const Slice& key, Cache::ObjectPtr value,
|
||||
const CacheItemHelper* helper, size_t charge,
|
||||
Handle** handle = nullptr,
|
||||
Priority priority = Priority::LOW) override {
|
||||
Handle** handle = nullptr, Priority priority = Priority::LOW,
|
||||
const Slice& compressed = Slice(),
|
||||
CompressionType type = kNoCompression) override {
|
||||
num_inserts_++;
|
||||
return target_->Insert(key, value, helper, charge, handle, priority);
|
||||
return target_->Insert(key, value, helper, charge, handle, priority,
|
||||
compressed, type);
|
||||
}
|
||||
|
||||
Handle* Lookup(const Slice& key, const CacheItemHelper* helper,
|
||||
|
|
|
@ -294,7 +294,9 @@ class ReadOnlyCacheWrapper : public CacheWrapper {
|
|||
|
||||
Status Insert(const Slice& /*key*/, Cache::ObjectPtr /*value*/,
|
||||
const CacheItemHelper* /*helper*/, size_t /*charge*/,
|
||||
Handle** /*handle*/, Priority /*priority*/) override {
|
||||
Handle** /*handle*/, Priority /*priority*/,
|
||||
const Slice& /*compressed*/,
|
||||
CompressionType /*type*/) override {
|
||||
return Status::NotSupported();
|
||||
}
|
||||
};
|
||||
|
@ -628,13 +630,15 @@ class MockCache : public LRUCache {
|
|||
|
||||
Status Insert(const Slice& key, Cache::ObjectPtr value,
|
||||
const Cache::CacheItemHelper* helper, size_t charge,
|
||||
Handle** handle, Priority priority) override {
|
||||
Handle** handle, Priority priority, const Slice& compressed,
|
||||
CompressionType type) override {
|
||||
if (priority == Priority::LOW) {
|
||||
low_pri_insert_count++;
|
||||
} else {
|
||||
high_pri_insert_count++;
|
||||
}
|
||||
return LRUCache::Insert(key, value, helper, charge, handle, priority);
|
||||
return LRUCache::Insert(key, value, helper, charge, handle, priority,
|
||||
compressed, type);
|
||||
}
|
||||
};
|
||||
|
||||
|
|
|
@ -166,9 +166,12 @@ class ChangeFilter : public CompactionFilter {
|
|||
class KeepFilterFactory : public CompactionFilterFactory {
|
||||
public:
|
||||
explicit KeepFilterFactory(bool check_context = false,
|
||||
bool check_context_cf_id = false)
|
||||
bool check_context_cf_id = false,
|
||||
bool check_context_input_table_properties = false)
|
||||
: check_context_(check_context),
|
||||
check_context_cf_id_(check_context_cf_id),
|
||||
check_context_input_table_properties_(
|
||||
check_context_input_table_properties),
|
||||
compaction_filter_created_(false) {}
|
||||
|
||||
std::unique_ptr<CompactionFilter> CreateCompactionFilter(
|
||||
|
@ -176,6 +179,11 @@ class KeepFilterFactory : public CompactionFilterFactory {
|
|||
if (check_context_) {
|
||||
EXPECT_EQ(expect_full_compaction_.load(), context.is_full_compaction);
|
||||
EXPECT_EQ(expect_manual_compaction_.load(), context.is_manual_compaction);
|
||||
EXPECT_EQ(expect_input_start_level_.load(), context.input_start_level);
|
||||
}
|
||||
if (check_context_input_table_properties_) {
|
||||
EXPECT_TRUE(expect_input_table_properties_ ==
|
||||
context.input_table_properties);
|
||||
}
|
||||
if (check_context_cf_id_) {
|
||||
EXPECT_EQ(expect_cf_id_.load(), context.column_family_id);
|
||||
|
@ -189,9 +197,15 @@ class KeepFilterFactory : public CompactionFilterFactory {
|
|||
const char* Name() const override { return "KeepFilterFactory"; }
|
||||
bool check_context_;
|
||||
bool check_context_cf_id_;
|
||||
// `check_context_input_table_properties_` can be true only when access to
// `expect_input_table_properties_` is synchronized, since we can't have
// std::atomic<TablePropertiesCollection>, unfortunately
bool check_context_input_table_properties_;
|
||||
std::atomic_bool expect_full_compaction_;
|
||||
std::atomic_bool expect_manual_compaction_;
|
||||
std::atomic<uint32_t> expect_cf_id_;
|
||||
std::atomic<int> expect_input_start_level_;
|
||||
TablePropertiesCollection expect_input_table_properties_;
|
||||
bool compaction_filter_created_;
|
||||
};
|
||||
|
||||
|
@ -654,7 +668,9 @@ TEST_F(DBTestCompactionFilter, CompactionFilterWithMergeOperator) {
|
|||
}
|
||||
|
||||
TEST_F(DBTestCompactionFilter, CompactionFilterContextManual) {
|
||||
KeepFilterFactory* filter = new KeepFilterFactory(true, true);
|
||||
KeepFilterFactory* filter = new KeepFilterFactory(
|
||||
true /* check_context */, true /* check_context_cf_id */,
|
||||
true /* check_context_input_table_properties */);
|
||||
|
||||
Options options = CurrentOptions();
|
||||
options.compaction_style = kCompactionStyleUniversal;
|
||||
|
@ -662,8 +678,9 @@ TEST_F(DBTestCompactionFilter, CompactionFilterContextManual) {
|
|||
options.compression = kNoCompression;
|
||||
options.level0_file_num_compaction_trigger = 8;
|
||||
Reopen(options);
|
||||
const int kNumFiles = 3;
|
||||
int num_keys_per_file = 400;
|
||||
for (int j = 0; j < 3; j++) {
|
||||
for (int j = 0; j < kNumFiles; j++) {
|
||||
// Write several keys.
|
||||
const std::string value(10, 'x');
|
||||
for (int i = 0; i < num_keys_per_file; i++) {
|
||||
|
@ -683,6 +700,11 @@ TEST_F(DBTestCompactionFilter, CompactionFilterContextManual) {
|
|||
filter->expect_manual_compaction_.store(true);
|
||||
filter->expect_full_compaction_.store(true);
|
||||
filter->expect_cf_id_.store(0);
|
||||
filter->expect_input_start_level_.store(0);
|
||||
ASSERT_OK(dbfull()->GetPropertiesOfAllTables(
|
||||
&filter->expect_input_table_properties_));
|
||||
ASSERT_TRUE(filter->expect_input_table_properties_.size() == kNumFiles);
|
||||
|
||||
ASSERT_OK(dbfull()->CompactRange(CompactRangeOptions(), nullptr, nullptr));
|
||||
ASSERT_EQ(cfilter_count, 700);
|
||||
ASSERT_EQ(NumSortedRuns(0), 1);
|
||||
|
|
|
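The test above exercises the new `input_table_properties` and `input_start_level` fields of `CompactionFilter::Context` from inside a test factory. A minimal sketch of a user-side factory consuming the same fields (not part of this commit; PropsAwareFilterFactory, KeepAllFilter, and the size threshold are made-up for illustration):

#include <memory>

#include "rocksdb/compaction_filter.h"
#include "rocksdb/table_properties.h"

using namespace ROCKSDB_NAMESPACE;

namespace {
// Trivial filter that keeps every entry; stands in for real filtering logic.
class KeepAllFilter : public CompactionFilter {
 public:
  bool Filter(int /*level*/, const Slice& /*key*/, const Slice& /*value*/,
              std::string* /*new_value*/,
              bool* /*value_changed*/) const override {
    return false;  // false == keep the entry
  }
  const char* Name() const override { return "KeepAllFilter"; }
};

class PropsAwareFilterFactory : public CompactionFilterFactory {
 public:
  std::unique_ptr<CompactionFilter> CreateCompactionFilter(
      const CompactionFilter::Context& context) override {
    uint64_t input_entries = 0;
    for (const auto& kv : context.input_table_properties) {
      input_entries += kv.second->num_entries;
    }
    // Hypothetical policy: only instantiate a filter for compactions rooted
    // at L0 that carry a non-trivial number of input entries.
    if (context.input_start_level == 0 && input_entries > 1000000) {
      return std::make_unique<KeepAllFilter>();
    }
    return nullptr;  // no filtering for other compactions
  }
  const char* Name() const override { return "PropsAwareFilterFactory"; }
};
}  // namespace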
@ -3193,6 +3193,279 @@ INSTANTIATE_TEST_CASE_P(DBFlushDirectIOTest, DBFlushDirectIOTest,
|
|||
|
||||
INSTANTIATE_TEST_CASE_P(DBAtomicFlushTest, DBAtomicFlushTest, testing::Bool());
|
||||
|
||||
TEST_F(DBFlushTest, NonAtomicFlushRollbackPendingFlushes) {
|
||||
// Fix a bug when atomic_flush=false.
// The bug can happen as follows:
|
||||
// Start Flush0 for memtable M0 to SST0
|
||||
// Start Flush1 for memtable M1 to SST1
|
||||
// Flush1 returns OK, but doesn't install to MANIFEST and lets whoever flushes
// M0 take care of it
// Flush0 finishes with a retryable IOError
|
||||
// - It rolls back M0, (incorrectly) not M1
// - Deletes SST1 and SST2
|
||||
//
|
||||
// Auto-recovery will start Flush2 for M0, it does not pick up M1 since it
|
||||
// thinks that M1 is flushed
|
||||
// Flush2 writes SST3 and finishes OK, tries to install SST3 and SST2
|
||||
// Error opening SST2 since it's already deleted
|
||||
//
|
||||
// The fix is to let Flush0 also rollback M1.
|
||||
Options opts = CurrentOptions();
|
||||
opts.atomic_flush = false;
|
||||
opts.memtable_factory.reset(test::NewSpecialSkipListFactory(1));
|
||||
opts.max_write_buffer_number = 64;
|
||||
opts.max_background_flushes = 4;
|
||||
env_->SetBackgroundThreads(4, Env::HIGH);
|
||||
DestroyAndReopen(opts);
|
||||
std::atomic_int flush_count = 0;
|
||||
SyncPoint::GetInstance()->ClearAllCallBacks();
|
||||
SyncPoint::GetInstance()->DisableProcessing();
|
||||
SyncPoint::GetInstance()->SetCallBack(
|
||||
"FlushJob::WriteLevel0Table:s", [&](void* s_ptr) {
|
||||
int c = flush_count.fetch_add(1);
|
||||
if (c == 0) {
|
||||
Status* s = (Status*)(s_ptr);
|
||||
IOStatus io_error = IOStatus::IOError("injected foobar");
|
||||
io_error.SetRetryable(true);
|
||||
*s = io_error;
|
||||
TEST_SYNC_POINT("Let mem1 flush start");
|
||||
TEST_SYNC_POINT("Wait for mem1 flush to finish");
|
||||
}
|
||||
});
|
||||
SyncPoint::GetInstance()->LoadDependency(
|
||||
{{"Let mem1 flush start", "Mem1 flush starts"},
|
||||
{"DBImpl::BGWorkFlush:done", "Wait for mem1 flush to finish"},
|
||||
{"RecoverFromRetryableBGIOError:RecoverSuccess",
|
||||
"Wait for error recover"}});
|
||||
// Need first flush to wait for the second flush to finish
|
||||
SyncPoint::GetInstance()->EnableProcessing();
|
||||
ASSERT_OK(Put(Key(1), "val1"));
|
||||
// trigger bg flush mem0
|
||||
ASSERT_OK(Put(Key(2), "val2"));
|
||||
TEST_SYNC_POINT("Mem1 flush starts");
|
||||
// trigger bg flush mem1
|
||||
ASSERT_OK(Put(Key(3), "val3"));
|
||||
|
||||
TEST_SYNC_POINT("Wait for error recover");
|
||||
ASSERT_EQ(1, NumTableFilesAtLevel(0));
|
||||
SyncPoint::GetInstance()->ClearAllCallBacks();
|
||||
SyncPoint::GetInstance()->DisableProcessing();
|
||||
}
|
||||
|
||||
TEST_F(DBFlushTest, AbortNonAtomicFlushWhenBGError) {
|
||||
// Fix a bug when atomic_flush=false.
// The bug can happen as follows:
|
||||
// Start Flush0 for memtable M0 to SST0
|
||||
// Start Flush1 for memtable M1 to SST1
|
||||
// Flush1 returns OK, but doesn't install output to MANIFEST and lets whoever
// flushes M0 take care of it
// Start Flush2 for memtable M2 to SST2
|
||||
// Flush0 finishes with a retryable IOError
|
||||
// - It rollbacks M0 AND M1
|
||||
// - Deletes SST1 and SST2
|
||||
// Flush2 finishes, does not rollback M2,
|
||||
// - releases the pending file number that keeps SST2 alive
|
||||
// - deletes SST2
|
||||
//
|
||||
// Then auto-recovery starts, error opening SST2 when try to install
|
||||
// flush result
|
||||
//
|
||||
// The fix is to let Flush2 rollback M2 if it finds that
|
||||
// there is a background error.
|
||||
Options opts = CurrentOptions();
|
||||
opts.atomic_flush = false;
|
||||
opts.memtable_factory.reset(test::NewSpecialSkipListFactory(1));
|
||||
opts.max_write_buffer_number = 64;
|
||||
opts.max_background_flushes = 4;
|
||||
env_->SetBackgroundThreads(4, Env::HIGH);
|
||||
DestroyAndReopen(opts);
|
||||
std::atomic_int flush_count = 0;
|
||||
SyncPoint::GetInstance()->ClearAllCallBacks();
|
||||
SyncPoint::GetInstance()->DisableProcessing();
|
||||
SyncPoint::GetInstance()->SetCallBack(
|
||||
"FlushJob::WriteLevel0Table:s", [&](void* s_ptr) {
|
||||
int c = flush_count.fetch_add(1);
|
||||
if (c == 0) {
|
||||
Status* s = (Status*)(s_ptr);
|
||||
IOStatus io_error = IOStatus::IOError("injected foobar");
|
||||
io_error.SetRetryable(true);
|
||||
*s = io_error;
|
||||
TEST_SYNC_POINT("Let mem1 flush start");
|
||||
TEST_SYNC_POINT("Wait for mem1 flush to finish");
|
||||
|
||||
TEST_SYNC_POINT("Let mem2 flush start");
|
||||
TEST_SYNC_POINT("Wait for mem2 to start writing table");
|
||||
}
|
||||
});
|
||||
|
||||
SyncPoint::GetInstance()->SetCallBack(
|
||||
"FlushJob::WriteLevel0Table", [&](void* mems) {
|
||||
autovector<MemTable*>* mems_ptr = (autovector<MemTable*>*)mems;
|
||||
if ((*mems_ptr)[0]->GetID() == 3) {
|
||||
TEST_SYNC_POINT("Mem2 flush starts writing table");
|
||||
TEST_SYNC_POINT("Mem2 flush waits until rollback");
|
||||
}
|
||||
});
|
||||
SyncPoint::GetInstance()->LoadDependency(
|
||||
{{"Let mem1 flush start", "Mem1 flush starts"},
|
||||
{"DBImpl::BGWorkFlush:done", "Wait for mem1 flush to finish"},
|
||||
{"Let mem2 flush start", "Mem2 flush starts"},
|
||||
{"Mem2 flush starts writing table",
|
||||
"Wait for mem2 to start writing table"},
|
||||
{"RollbackMemtableFlush", "Mem2 flush waits until rollback"},
|
||||
{"RecoverFromRetryableBGIOError:RecoverSuccess",
|
||||
"Wait for error recover"}});
|
||||
SyncPoint::GetInstance()->EnableProcessing();
|
||||
|
||||
ASSERT_OK(Put(Key(1), "val1"));
|
||||
// trigger bg flush mem0
|
||||
ASSERT_OK(Put(Key(2), "val2"));
|
||||
TEST_SYNC_POINT("Mem1 flush starts");
|
||||
// trigger bg flush mem1
|
||||
ASSERT_OK(Put(Key(3), "val3"));
|
||||
|
||||
TEST_SYNC_POINT("Mem2 flush starts");
|
||||
ASSERT_OK(Put(Key(4), "val4"));
|
||||
|
||||
TEST_SYNC_POINT("Wait for error recover");
|
||||
// Recovery flush writes 3 memtables together into 1 file.
|
||||
ASSERT_EQ(1, NumTableFilesAtLevel(0));
|
||||
SyncPoint::GetInstance()->ClearAllCallBacks();
|
||||
SyncPoint::GetInstance()->DisableProcessing();
|
||||
}
|
||||
|
||||
TEST_F(DBFlushTest, NonAtomicNormalFlushAbortWhenBGError) {
|
||||
Options opts = CurrentOptions();
|
||||
opts.atomic_flush = false;
|
||||
opts.memtable_factory.reset(test::NewSpecialSkipListFactory(1));
|
||||
opts.max_write_buffer_number = 64;
|
||||
opts.max_background_flushes = 1;
|
||||
env_->SetBackgroundThreads(2, Env::HIGH);
|
||||
DestroyAndReopen(opts);
|
||||
SyncPoint::GetInstance()->ClearAllCallBacks();
|
||||
SyncPoint::GetInstance()->DisableProcessing();
|
||||
std::atomic_int flush_write_table_count = 0;
|
||||
SyncPoint::GetInstance()->SetCallBack(
|
||||
"FlushJob::WriteLevel0Table:s", [&](void* s_ptr) {
|
||||
int c = flush_write_table_count.fetch_add(1);
|
||||
if (c == 0) {
|
||||
Status* s = (Status*)(s_ptr);
|
||||
IOStatus io_error = IOStatus::IOError("injected foobar");
|
||||
io_error.SetRetryable(true);
|
||||
*s = io_error;
|
||||
}
|
||||
});
|
||||
|
||||
SyncPoint::GetInstance()->EnableProcessing();
|
||||
SyncPoint::GetInstance()->LoadDependency(
|
||||
{{"Let error recovery start",
|
||||
"RecoverFromRetryableBGIOError:BeforeStart"},
|
||||
{"RecoverFromRetryableBGIOError:RecoverSuccess",
|
||||
"Wait for error recover"}});
|
||||
|
||||
ASSERT_OK(Put(Key(1), "val1"));
|
||||
// trigger bg flush0 for mem0
|
||||
ASSERT_OK(Put(Key(2), "val2"));
|
||||
// Not checking status since this wait can finish before flush starts.
|
||||
dbfull()->TEST_WaitForFlushMemTable().PermitUncheckedError();
|
||||
|
||||
// trigger bg flush1 for mem1, should see bg error and abort
|
||||
// before picking a memtable to flush
|
||||
ASSERT_OK(Put(Key(3), "val3"));
|
||||
ASSERT_NOK(dbfull()->TEST_WaitForFlushMemTable());
|
||||
ASSERT_EQ(0, NumTableFilesAtLevel(0));
|
||||
|
||||
TEST_SYNC_POINT("Let error recovery start");
|
||||
TEST_SYNC_POINT("Wait for error recover");
|
||||
// Recovery flush writes 2 memtables together into 1 file.
|
||||
ASSERT_EQ(1, NumTableFilesAtLevel(0));
|
||||
// 1 for flush 0 and 1 for recovery flush
|
||||
ASSERT_EQ(2, flush_write_table_count);
|
||||
SyncPoint::GetInstance()->ClearAllCallBacks();
|
||||
SyncPoint::GetInstance()->DisableProcessing();
|
||||
}
|
||||
|
||||
TEST_F(DBFlushTest, DBStuckAfterAtomicFlushError) {
|
||||
// Test for a bug with atomic flush where DB can become stuck
|
||||
// after a flush error. A repro timeline:
|
||||
//
|
||||
// Start Flush0 for mem0
|
||||
// Start Flush1 for mem1
|
||||
// Now Flush1 will wait for Flush0 to install mem0
|
||||
// Flush0 finishes with retryable IOError, rollbacks mem0
|
||||
// Resume starts and waits for background job to finish, i.e., Flush1
|
||||
// Fill memtable again, trigger Flush2 for mem0
|
||||
// Flush2 will get error status, and not rollback mem0, see code in
|
||||
// https://github.com/facebook/rocksdb/blob/b927ba5936216861c2c35ab68f50ba4a78e65747/db/db_impl/db_impl_compaction_flush.cc#L725
|
||||
//
|
||||
// DB is stuck since mem0 can never be picked now
|
||||
//
|
||||
// The fix is to rollback mem0 in Flush2, and let Flush1 also abort upon
|
||||
// background error besides waiting for older memtables to be installed.
|
||||
// The recovery flush in this case should pick up all memtables
|
||||
// and write them to a single L0 file.
|
||||
Options opts = CurrentOptions();
|
||||
opts.atomic_flush = true;
|
||||
opts.memtable_factory.reset(test::NewSpecialSkipListFactory(1));
|
||||
opts.max_write_buffer_number = 64;
|
||||
opts.max_background_flushes = 4;
|
||||
env_->SetBackgroundThreads(4, Env::HIGH);
|
||||
DestroyAndReopen(opts);
|
||||
|
||||
std::atomic_int flush_count = 0;
|
||||
SyncPoint::GetInstance()->ClearAllCallBacks();
|
||||
SyncPoint::GetInstance()->DisableProcessing();
|
||||
SyncPoint::GetInstance()->SetCallBack(
|
||||
"FlushJob::WriteLevel0Table:s", [&](void* s_ptr) {
|
||||
int c = flush_count.fetch_add(1);
|
||||
if (c == 0) {
|
||||
Status* s = (Status*)(s_ptr);
|
||||
IOStatus io_error = IOStatus::IOError("injected foobar");
|
||||
io_error.SetRetryable(true);
|
||||
*s = io_error;
|
||||
TEST_SYNC_POINT("Let flush for mem1 start");
|
||||
// Wait for Flush1 to start waiting to install flush result
|
||||
TEST_SYNC_POINT("Wait for flush for mem1");
|
||||
}
|
||||
});
|
||||
SyncPoint::GetInstance()->LoadDependency(
|
||||
{{"Let flush for mem1 start", "Flush for mem1"},
|
||||
{"DBImpl::AtomicFlushMemTablesToOutputFiles:WaitCV",
|
||||
"Wait for flush for mem1"},
|
||||
{"RecoverFromRetryableBGIOError:BeforeStart",
|
||||
"Wait for resume to start"},
|
||||
{"Recovery should continue here",
|
||||
"RecoverFromRetryableBGIOError:BeforeStart2"},
|
||||
{"RecoverFromRetryableBGIOError:RecoverSuccess",
|
||||
"Wait for error recover"}});
|
||||
SyncPoint::GetInstance()->EnableProcessing();
|
||||
ASSERT_OK(Put(Key(1), "val1"));
|
||||
// trigger Flush0 for mem0
|
||||
ASSERT_OK(Put(Key(2), "val2"));
|
||||
|
||||
// trigger Flush1 for mem1
|
||||
TEST_SYNC_POINT("Flush for mem1");
|
||||
ASSERT_OK(Put(Key(3), "val3"));
|
||||
|
||||
// Wait until resume started to schedule another flush
|
||||
TEST_SYNC_POINT("Wait for resume to start");
|
||||
// This flush should not be scheduled due to bg error
|
||||
ASSERT_OK(Put(Key(4), "val4"));
|
||||
|
||||
// TEST_WaitForBackgroundWork() returns background error
|
||||
// after all background work is done.
|
||||
ASSERT_NOK(dbfull()->TEST_WaitForBackgroundWork());
|
||||
// Flush should abort and not writing any table
|
||||
ASSERT_EQ(0, NumTableFilesAtLevel(0));
|
||||
|
||||
// Wait until this flush is done.
|
||||
TEST_SYNC_POINT("Recovery should continue here");
|
||||
TEST_SYNC_POINT("Wait for error recover");
|
||||
// error recovery can schedule new flushes, but should not
|
||||
// encounter error
|
||||
ASSERT_OK(dbfull()->TEST_WaitForBackgroundWork());
|
||||
ASSERT_EQ(1, NumTableFilesAtLevel(0));
|
||||
}
|
||||
} // namespace ROCKSDB_NAMESPACE
|
||||
|
||||
int main(int argc, char** argv) {
|
||||
|
|
|
@ -273,8 +273,9 @@ DBImpl::DBImpl(const DBOptions& options, const std::string& dbname,
|
|||
periodic_task_functions_.emplace(PeriodicTaskType::kFlushInfoLog,
|
||||
[this]() { this->FlushInfoLog(); });
|
||||
periodic_task_functions_.emplace(
|
||||
PeriodicTaskType::kRecordSeqnoTime,
|
||||
[this]() { this->RecordSeqnoToTimeMapping(); });
|
||||
PeriodicTaskType::kRecordSeqnoTime, [this]() {
|
||||
this->RecordSeqnoToTimeMapping(/*populate_historical_seconds=*/0);
|
||||
});
|
||||
|
||||
versions_.reset(new VersionSet(dbname_, &immutable_db_options_, file_options_,
|
||||
table_cache_.get(), write_buffer_manager_,
|
||||
|
@ -387,13 +388,18 @@ Status DBImpl::ResumeImpl(DBRecoverContext context) {
|
|||
}
|
||||
}
|
||||
|
||||
// We cannot guarantee consistency of the WAL. So force flush Memtables of
|
||||
// all the column families
|
||||
if (s.ok()) {
|
||||
FlushOptions flush_opts;
|
||||
// We allow flush to stall write since we are trying to resume from error.
|
||||
flush_opts.allow_write_stall = true;
|
||||
s = FlushAllColumnFamilies(flush_opts, context.flush_reason);
|
||||
if (context.flush_reason == FlushReason::kErrorRecoveryRetryFlush) {
|
||||
s = RetryFlushesForErrorRecovery(FlushReason::kErrorRecoveryRetryFlush,
|
||||
true /* wait */);
|
||||
} else {
|
||||
// We cannot guarantee consistency of the WAL. So force flush Memtables of
|
||||
// all the column families
|
||||
FlushOptions flush_opts;
|
||||
// We allow flush to stall write since we are trying to resume from error.
|
||||
flush_opts.allow_write_stall = true;
|
||||
s = FlushAllColumnFamilies(flush_opts, context.flush_reason);
|
||||
}
|
||||
if (!s.ok()) {
|
||||
ROCKS_LOG_INFO(immutable_db_options_.info_log,
|
||||
"DB resume requested but failed due to Flush failure [%s]",
|
||||
|
@ -453,6 +459,25 @@ Status DBImpl::ResumeImpl(DBRecoverContext context) {
|
|||
if (shutdown_initiated_) {
|
||||
s = Status::ShutdownInProgress();
|
||||
}
|
||||
if (s.ok() && context.flush_after_recovery) {
|
||||
// Since we drop all non-recovery flush requests during recovery,
// and new memtables may fill up during recovery,
// schedule one more round of flush.
Status status = RetryFlushesForErrorRecovery(
|
||||
FlushReason::kCatchUpAfterErrorRecovery, false /* wait */);
|
||||
if (!status.ok()) {
|
||||
// FlushAllColumnFamilies internally should take care of setting
|
||||
// background error if needed.
|
||||
ROCKS_LOG_INFO(immutable_db_options_.info_log,
|
||||
"The catch up flush after successful recovery failed [%s]",
|
||||
s.ToString().c_str());
|
||||
}
|
||||
// FlushAllColumnFamilies releases and re-acquires mutex.
|
||||
if (shutdown_initiated_) {
|
||||
s = Status::ShutdownInProgress();
|
||||
}
|
||||
}
|
||||
|
||||
if (s.ok()) {
|
||||
for (auto cfd : *versions_->GetColumnFamilySet()) {
|
||||
SchedulePendingCompaction(cfd);
|
||||
|
@ -748,7 +773,6 @@ void DBImpl::PrintStatistics() {
|
|||
}
|
||||
|
||||
Status DBImpl::StartPeriodicTaskScheduler() {
|
||||
|
||||
#ifndef NDEBUG
|
||||
// It is only used by tests to disable the scheduler
bool disable_scheduler = false;
|
||||
|
@ -792,42 +816,82 @@ Status DBImpl::StartPeriodicTaskScheduler() {
|
|||
return s;
|
||||
}
|
||||
|
||||
Status DBImpl::RegisterRecordSeqnoTimeWorker() {
|
||||
uint64_t min_time_duration = std::numeric_limits<uint64_t>::max();
|
||||
uint64_t max_time_duration = std::numeric_limits<uint64_t>::min();
|
||||
Status DBImpl::RegisterRecordSeqnoTimeWorker(bool from_db_open) {
|
||||
uint64_t min_preserve_seconds = std::numeric_limits<uint64_t>::max();
|
||||
uint64_t max_preserve_seconds = std::numeric_limits<uint64_t>::min();
|
||||
bool mapping_was_empty = false;
|
||||
{
|
||||
InstrumentedMutexLock l(&mutex_);
|
||||
|
||||
for (auto cfd : *versions_->GetColumnFamilySet()) {
|
||||
// preserve time is the max of 2 options.
|
||||
uint64_t preserve_time_duration =
|
||||
uint64_t preserve_seconds =
|
||||
std::max(cfd->ioptions()->preserve_internal_time_seconds,
|
||||
cfd->ioptions()->preclude_last_level_data_seconds);
|
||||
if (!cfd->IsDropped() && preserve_time_duration > 0) {
|
||||
min_time_duration = std::min(preserve_time_duration, min_time_duration);
|
||||
max_time_duration = std::max(preserve_time_duration, max_time_duration);
|
||||
if (!cfd->IsDropped() && preserve_seconds > 0) {
|
||||
min_preserve_seconds = std::min(preserve_seconds, min_preserve_seconds);
|
||||
max_preserve_seconds = std::max(preserve_seconds, max_preserve_seconds);
|
||||
}
|
||||
}
|
||||
if (min_time_duration == std::numeric_limits<uint64_t>::max()) {
|
||||
seqno_time_mapping_.Resize(0, 0);
|
||||
if (min_preserve_seconds == std::numeric_limits<uint64_t>::max()) {
|
||||
seqno_to_time_mapping_.Resize(0, 0);
|
||||
} else {
|
||||
seqno_time_mapping_.Resize(min_time_duration, max_time_duration);
|
||||
seqno_to_time_mapping_.Resize(min_preserve_seconds, max_preserve_seconds);
|
||||
}
|
||||
mapping_was_empty = seqno_to_time_mapping_.Empty();
|
||||
}
|
||||
// FIXME: because we released the db mutex, there's a race here where
|
||||
// if e.g. I create or drop two column families in parallel, I might end up
|
||||
// with the periodic task scheduler in the wrong state. We don't want to
|
||||
// just keep holding the mutex, however, because of global timer and mutex
|
||||
// in PeriodicTaskScheduler.
|
||||
|
||||
uint64_t seqno_time_cadence = 0;
|
||||
if (min_time_duration != std::numeric_limits<uint64_t>::max()) {
|
||||
if (min_preserve_seconds != std::numeric_limits<uint64_t>::max()) {
|
||||
// round up to 1 when the time_duration is smaller than
|
||||
// kMaxSeqnoTimePairsPerCF
|
||||
seqno_time_cadence =
|
||||
(min_time_duration + SeqnoToTimeMapping::kMaxSeqnoTimePairsPerCF - 1) /
|
||||
SeqnoToTimeMapping::kMaxSeqnoTimePairsPerCF;
|
||||
seqno_time_cadence = (min_preserve_seconds +
|
||||
SeqnoToTimeMapping::kMaxSeqnoTimePairsPerCF - 1) /
|
||||
SeqnoToTimeMapping::kMaxSeqnoTimePairsPerCF;
|
||||
}
|
||||
|
||||
Status s;
|
||||
if (seqno_time_cadence == 0) {
|
||||
s = periodic_task_scheduler_.Unregister(PeriodicTaskType::kRecordSeqnoTime);
|
||||
} else {
|
||||
// Before registering the periodic task, we need to be sure to fulfill two
|
||||
// promises:
|
||||
// 1) Any DB created with preserve/preclude options set from the beginning
|
||||
// will get pre-allocated seqnos with pre-populated time mappings back to
|
||||
// the times we are interested in. (This will enable future import of data
|
||||
// while preserving rough write time. We can only do this reliably from
|
||||
// DB::Open, as otherwise there could be a race between CreateColumnFamily
|
||||
// and the first Write to the DB, and seqno-to-time mappings need to be
|
||||
// monotonic.
|
||||
// 2) In any DB, any data written after setting preserve/preclude options
|
||||
// must have a reasonable time estimate (so that we can accurately place
|
||||
// the data), which means at least one entry in seqno_to_time_mapping_.
|
||||
if (from_db_open && GetLatestSequenceNumber() == 0) {
|
||||
// Pre-allocate seqnos and pre-populate historical mapping
|
||||
assert(mapping_was_empty);
|
||||
|
||||
// We can simply modify these, before writes are allowed
|
||||
constexpr uint64_t kMax = SeqnoToTimeMapping::kMaxSeqnoTimePairsPerSST;
|
||||
versions_->SetLastAllocatedSequence(kMax);
|
||||
versions_->SetLastPublishedSequence(kMax);
|
||||
versions_->SetLastSequence(kMax);
|
||||
// Pre-populate mappings for reserved sequence numbers.
|
||||
RecordSeqnoToTimeMapping(max_preserve_seconds);
|
||||
} else if (mapping_was_empty) {
|
||||
// To ensure there is at least one mapping, we need a non-zero sequence
|
||||
// number. Outside of DB::Open, we have to be careful.
|
||||
versions_->EnsureNonZeroSequence();
|
||||
assert(GetLatestSequenceNumber() > 0);
|
||||
|
||||
// Ensure at least one mapping (or log a warning)
|
||||
RecordSeqnoToTimeMapping(/*populate_historical_seconds=*/0);
|
||||
}
|
||||
|
||||
s = periodic_task_scheduler_.Register(
|
||||
PeriodicTaskType::kRecordSeqnoTime,
|
||||
periodic_task_functions_.at(PeriodicTaskType::kRecordSeqnoTime),
|
||||
|
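The cadence computed earlier in this hunk trades mapping granularity against the per-CF entry budget. Rough arithmetic as a sketch (the real value of kMaxSeqnoTimePairsPerCF is whatever SeqnoToTimeMapping defines; 100 is assumed here purely for illustration):

constexpr uint64_t kAssumedMaxPairsPerCF = 100;     // assumption
uint64_t min_preserve_seconds = 3 * 24 * 60 * 60;   // e.g. preserve 3 days
uint64_t cadence_seconds =
    (min_preserve_seconds + kAssumedMaxPairsPerCF - 1) / kAssumedMaxPairsPerCF;
// cadence_seconds == 2592, i.e. one seqno/time sample roughly every 43
// minutes keeps the per-CF mapping within its size budget while still
// covering the whole 3-day preserve window.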
@ -2010,7 +2074,6 @@ Status DBImpl::GetImpl(const ReadOptions& read_options, const Slice& key,
|
|||
|
||||
assert(get_impl_options.column_family);
|
||||
|
||||
|
||||
if (read_options.timestamp) {
|
||||
const Status s = FailIfTsMismatchCf(get_impl_options.column_family,
|
||||
*(read_options.timestamp));
|
||||
|
@ -2220,6 +2283,12 @@ Status DBImpl::GetImpl(const ReadOptions& read_options, const Slice& key,
|
|||
RecordTick(stats_, NUMBER_KEYS_READ);
|
||||
size_t size = 0;
|
||||
if (s.ok()) {
|
||||
const auto& merge_threshold = read_options.merge_operand_count_threshold;
|
||||
if (merge_threshold.has_value() &&
|
||||
merge_context.GetNumOperands() > merge_threshold.value()) {
|
||||
s = Status::OkMergeOperandThresholdExceeded();
|
||||
}
|
||||
|
||||
if (get_impl_options.get_value) {
|
||||
if (get_impl_options.value) {
|
||||
size = get_impl_options.value->size();
|
||||
|
@ -2489,8 +2558,15 @@ std::vector<Status> DBImpl::MultiGet(
|
|||
}
|
||||
|
||||
if (s.ok()) {
|
||||
const auto& merge_threshold = read_options.merge_operand_count_threshold;
|
||||
if (merge_threshold.has_value() &&
|
||||
merge_context.GetNumOperands() > merge_threshold.value()) {
|
||||
s = Status::OkMergeOperandThresholdExceeded();
|
||||
}
|
||||
|
||||
bytes_read += value->size();
|
||||
num_found++;
|
||||
|
||||
curr_value_size += value->size();
|
||||
if (curr_value_size > read_options.value_size_soft_limit) {
|
||||
while (++keys_read < num_keys) {
|
||||
|
@ -3175,6 +3251,12 @@ Status DBImpl::MultiGetImpl(
|
|||
assert(key->s);
|
||||
|
||||
if (key->s->ok()) {
|
||||
const auto& merge_threshold = read_options.merge_operand_count_threshold;
|
||||
if (merge_threshold.has_value() &&
|
||||
key->merge_context.GetNumOperands() > merge_threshold) {
|
||||
*(key->s) = Status::OkMergeOperandThresholdExceeded();
|
||||
}
|
||||
|
||||
if (key->value) {
|
||||
bytes_read += key->value->size();
|
||||
} else {
|
||||
|
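The Get/MultiGet changes above return a special OK status once the number of applied merge operands crosses read_options.merge_operand_count_threshold. A sketch of how a caller might consume that signal (not part of this commit; the threshold value, GetAndMaybeCompact name, and the compact-on-signal reaction are illustrative, and the exact subcode accessor is assumed to be Status::subcode() with SubCode::kMergeOperandThresholdExceeded):

#include <string>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

using namespace ROCKSDB_NAMESPACE;

void GetAndMaybeCompact(DB* db, const Slice& key) {
  ReadOptions ro;
  ro.merge_operand_count_threshold = 16;  // tolerate up to 16 merge operands
  std::string value;
  Status s = db->Get(ro, db->DefaultColumnFamily(), key, &value);
  if (s.ok() &&
      s.subcode() == Status::SubCode::kMergeOperandThresholdExceeded) {
    // The lookup succeeded but applied more merge operands than the
    // threshold. One possible reaction: compact the affected key range so
    // the operands collapse back into a single value.
    Slice begin = key;
    Slice end = key;
    db->CompactRange(CompactRangeOptions(), &begin, &end)
        .PermitUncheckedError();
  }
}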
@ -3255,14 +3337,34 @@ void DBImpl::MultiGetEntity(const ReadOptions& _read_options,
|
|||
statuses, sorted_input);
|
||||
}
|
||||
|
||||
Status DBImpl::WrapUpCreateColumnFamilies(
|
||||
const std::vector<const ColumnFamilyOptions*>& cf_options) {
|
||||
// NOTE: this function is skipped for create_missing_column_families and
|
||||
// DB::Open, so new functionality here might need to go into Open also.
|
||||
bool register_worker = false;
|
||||
for (auto* opts_ptr : cf_options) {
|
||||
if (opts_ptr->preserve_internal_time_seconds > 0 ||
|
||||
opts_ptr->preclude_last_level_data_seconds > 0) {
|
||||
register_worker = true;
|
||||
break;
|
||||
}
|
||||
}
|
||||
// Attempt both follow-up actions even if one fails
|
||||
Status s = WriteOptionsFile(true /*need_mutex_lock*/,
|
||||
true /*need_enter_write_thread*/);
|
||||
if (register_worker) {
|
||||
s.UpdateIfOk(RegisterRecordSeqnoTimeWorker(/*from_db_open=*/false));
|
||||
}
|
||||
return s;
|
||||
}
|
||||
|
||||
Status DBImpl::CreateColumnFamily(const ColumnFamilyOptions& cf_options,
|
||||
const std::string& column_family,
|
||||
ColumnFamilyHandle** handle) {
|
||||
assert(handle != nullptr);
|
||||
Status s = CreateColumnFamilyImpl(cf_options, column_family, handle);
|
||||
if (s.ok()) {
|
||||
s = WriteOptionsFile(true /*need_mutex_lock*/,
|
||||
true /*need_enter_write_thread*/);
|
||||
s.UpdateIfOk(WrapUpCreateColumnFamilies({&cf_options}));
|
||||
}
|
||||
return s;
|
||||
}
|
||||
|
@ -3286,11 +3388,7 @@ Status DBImpl::CreateColumnFamilies(
|
|||
success_once = true;
|
||||
}
|
||||
if (success_once) {
|
||||
Status persist_options_status = WriteOptionsFile(
|
||||
true /*need_mutex_lock*/, true /*need_enter_write_thread*/);
|
||||
if (s.ok() && !persist_options_status.ok()) {
|
||||
s = persist_options_status;
|
||||
}
|
||||
s.UpdateIfOk(WrapUpCreateColumnFamilies({&cf_options}));
|
||||
}
|
||||
return s;
|
||||
}
|
||||
|
@ -3303,6 +3401,8 @@ Status DBImpl::CreateColumnFamilies(
|
|||
size_t num_cf = column_families.size();
|
||||
Status s;
|
||||
bool success_once = false;
|
||||
std::vector<const ColumnFamilyOptions*> cf_opts;
|
||||
cf_opts.reserve(num_cf);
|
||||
for (size_t i = 0; i < num_cf; i++) {
|
||||
ColumnFamilyHandle* handle;
|
||||
s = CreateColumnFamilyImpl(column_families[i].options,
|
||||
|
@ -3312,13 +3412,10 @@ Status DBImpl::CreateColumnFamilies(
|
|||
}
|
||||
handles->push_back(handle);
|
||||
success_once = true;
|
||||
cf_opts.push_back(&column_families[i].options);
|
||||
}
|
||||
if (success_once) {
|
||||
Status persist_options_status = WriteOptionsFile(
|
||||
true /*need_mutex_lock*/, true /*need_enter_write_thread*/);
|
||||
if (s.ok() && !persist_options_status.ok()) {
|
||||
s = persist_options_status;
|
||||
}
|
||||
s.UpdateIfOk(WrapUpCreateColumnFamilies(cf_opts));
|
||||
}
|
||||
return s;
|
||||
}
|
||||
|
@ -3406,10 +3503,6 @@ Status DBImpl::CreateColumnFamilyImpl(const ColumnFamilyOptions& cf_options,
|
|||
}
|
||||
} // InstrumentedMutexLock l(&mutex_)
|
||||
|
||||
if (cf_options.preserve_internal_time_seconds > 0 ||
|
||||
cf_options.preclude_last_level_data_seconds > 0) {
|
||||
s = RegisterRecordSeqnoTimeWorker();
|
||||
}
|
||||
sv_context.Clean();
|
||||
// this is outside the mutex
|
||||
if (s.ok()) {
|
||||
|
@ -3503,7 +3596,7 @@ Status DBImpl::DropColumnFamilyImpl(ColumnFamilyHandle* column_family) {
|
|||
|
||||
if (cfd->ioptions()->preserve_internal_time_seconds > 0 ||
|
||||
cfd->ioptions()->preclude_last_level_data_seconds > 0) {
|
||||
s = RegisterRecordSeqnoTimeWorker();
|
||||
s = RegisterRecordSeqnoTimeWorker(/*from_db_open=*/false);
|
||||
}
|
||||
|
||||
if (s.ok()) {
|
||||
|
@ -4088,7 +4181,6 @@ Status DBImpl::GetPropertiesOfTablesInRange(ColumnFamilyHandle* column_family,
|
|||
return s;
|
||||
}
|
||||
|
||||
|
||||
const std::string& DBImpl::GetName() const { return dbname_; }
|
||||
|
||||
Env* DBImpl::GetEnv() const { return env_; }
|
||||
|
@ -4106,7 +4198,6 @@ SystemClock* DBImpl::GetSystemClock() const {
|
|||
return immutable_db_options_.clock;
|
||||
}
|
||||
|
||||
|
||||
Status DBImpl::StartIOTrace(const TraceOptions& trace_options,
|
||||
std::unique_ptr<TraceWriter>&& trace_writer) {
|
||||
assert(trace_writer != nullptr);
|
||||
|
@ -4119,7 +4210,6 @@ Status DBImpl::EndIOTrace() {
|
|||
return Status::OK();
|
||||
}
|
||||
|
||||
|
||||
Options DBImpl::GetOptions(ColumnFamilyHandle* column_family) const {
|
||||
InstrumentedMutexLock l(&mutex_);
|
||||
auto cfh = static_cast_with_check<ColumnFamilyHandleImpl>(column_family);
|
||||
|
@@ -6334,19 +6424,51 @@ Status DBImpl::GetCreationTimeOfOldestFile(uint64_t* creation_time) {
  }
}

void DBImpl::RecordSeqnoToTimeMapping() {
  // Get time first then sequence number, so the actual time of seqno is <=
  // unix_time recorded
  int64_t unix_time = 0;
  immutable_db_options_.clock->GetCurrentTime(&unix_time)
      .PermitUncheckedError();  // Ignore error
void DBImpl::RecordSeqnoToTimeMapping(uint64_t populate_historical_seconds) {
  // TECHNICALITY: Sample last sequence number *before* time, as prescribed
  // for SeqnoToTimeMapping. We don't know how long it has been since the last
  // sequence number was written, so we at least have a one-sided bound by
  // sampling in this order.
  SequenceNumber seqno = GetLatestSequenceNumber();
  int64_t unix_time_signed = 0;
  immutable_db_options_.clock->GetCurrentTime(&unix_time_signed)
      .PermitUncheckedError();  // Ignore error
  uint64_t unix_time = static_cast<uint64_t>(unix_time_signed);
  bool appended = false;
  {
    InstrumentedMutexLock l(&mutex_);
    appended = seqno_time_mapping_.Append(seqno, unix_time);
    if (populate_historical_seconds > 0) {
      if (seqno > 1 && unix_time > populate_historical_seconds) {
        // seqno=0 is reserved
        SequenceNumber from_seqno = 1;
        appended = seqno_to_time_mapping_.PrePopulate(
            from_seqno, seqno, unix_time - populate_historical_seconds,
            unix_time);
      } else {
        // One of these will fail
        assert(seqno > 1);
        assert(unix_time > populate_historical_seconds);
      }
    } else {
      assert(seqno > 0);
      appended = seqno_to_time_mapping_.Append(seqno, unix_time);
    }
  }
  if (!appended) {
  if (populate_historical_seconds > 0) {
    if (appended) {
      ROCKS_LOG_INFO(
          immutable_db_options_.info_log,
          "Pre-populated sequence number to time entries: [1,%" PRIu64
          "] -> [%" PRIu64 ",%" PRIu64 "]",
          seqno, unix_time - populate_historical_seconds, unix_time);
    } else {
      ROCKS_LOG_WARN(
          immutable_db_options_.info_log,
          "Failed to pre-populate sequence number to time entries: [1,%" PRIu64
          "] -> [%" PRIu64 ",%" PRIu64 "]",
          seqno, unix_time - populate_historical_seconds, unix_time);
    }
  } else if (!appended) {
    ROCKS_LOG_WARN(immutable_db_options_.info_log,
                   "Failed to insert sequence number to time entry: %" PRIu64
                   " -> %" PRIu64,
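Judging from the log message above, pre-population maps the existing seqno range [1, last] onto the requested historical time window ending at "now". A rough, self-contained sketch of that idea is below; it is only an illustration of the interpolation, not the actual SeqnoToTimeMapping::PrePopulate implementation, and overflow handling is omitted:

#include <algorithm>
#include <cstdint>
#include <map>

// Conceptual sketch: spread seqnos [1, last_seqno] evenly over
// [start_time, end_time], oldest seqno mapping to oldest time.
std::map<uint64_t, uint64_t> PrePopulateSketch(uint64_t last_seqno,
                                               uint64_t start_time,
                                               uint64_t end_time,
                                               uint64_t max_entries = 100) {
  std::map<uint64_t, uint64_t> mapping;
  if (last_seqno < 1 || end_time <= start_time) {
    return mapping;  // nothing sensible to record
  }
  const uint64_t entries = std::min(max_entries, last_seqno);
  const uint64_t denom = entries > 1 ? entries - 1 : 1;
  for (uint64_t i = 0; i < entries; ++i) {
    // Linear interpolation between the seqno and time endpoints.
    const uint64_t seqno = 1 + (last_seqno - 1) * i / denom;
    const uint64_t time = start_time + (end_time - start_time) * i / denom;
    mapping[seqno] = time;
  }
  return mapping;
}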
@@ -1197,6 +1197,10 @@ class DBImpl : public DB {

  const PeriodicTaskScheduler& TEST_GetPeriodicTaskScheduler() const;

  static Status TEST_ValidateOptions(const DBOptions& db_options) {
    return ValidateOptions(db_options);
  }

#endif  // NDEBUG

  // persist stats to column family "_persistent_stats"

@@ -1208,8 +1212,11 @@ class DBImpl : public DB {
  // flush LOG out of application buffer
  void FlushInfoLog();

  // record current sequence number to time mapping
  void RecordSeqnoToTimeMapping();
  // record current sequence number to time mapping. If
  // populate_historical_seconds > 0 then pre-populate all the
  // sequence numbers from [1, last] to map to [now minus
  // populate_historical_seconds, now].
  void RecordSeqnoToTimeMapping(uint64_t populate_historical_seconds);

  // Interface to block and signal the DB in case of stalling writes by
  // WriteBufferManager. Each DBImpl object contains ptr to WBMStallInterface.
@ -1819,10 +1826,15 @@ class DBImpl : public DB {
|
|||
|
||||
const Status CreateArchivalDirectory();
|
||||
|
||||
// Create a column family, without some of the follow-up work yet
|
||||
Status CreateColumnFamilyImpl(const ColumnFamilyOptions& cf_options,
|
||||
const std::string& cf_name,
|
||||
ColumnFamilyHandle** handle);
|
||||
|
||||
// Follow-up work to user creating a column family or (families)
|
||||
Status WrapUpCreateColumnFamilies(
|
||||
const std::vector<const ColumnFamilyOptions*>& cf_options);
|
||||
|
||||
Status DropColumnFamilyImpl(ColumnFamilyHandle* column_family);
|
||||
|
||||
// Delete any unneeded files and stale in-memory entries.
|
||||
|
@ -1947,6 +1959,8 @@ class DBImpl : public DB {
|
|||
const autovector<ColumnFamilyData*>& provided_candidate_cfds = {},
|
||||
bool entered_write_thread = false);
|
||||
|
||||
Status RetryFlushesForErrorRecovery(FlushReason flush_reason, bool wait);
|
||||
|
||||
// Wait until flushing this column family won't stall writes
|
||||
Status WaitUntilFlushWouldNotStallWrites(ColumnFamilyData* cfd,
|
||||
bool* flush_needed);
|
||||
|
@ -2095,6 +2109,12 @@ class DBImpl : public DB {
|
|||
#endif /* !NDEBUG */
|
||||
};
|
||||
|
||||
// In case of atomic flush, generates a `FlushRequest` for the latest atomic
|
||||
// cuts for these `cfds`. Atomic cuts are recorded in
|
||||
// `AssignAtomicFlushSeq()`. For each entry in `cfds`, all CFDs sharing the
|
||||
// same latest atomic cut must also be present.
|
||||
//
|
||||
// REQUIRES: mutex held
|
||||
void GenerateFlushRequest(const autovector<ColumnFamilyData*>& cfds,
|
||||
FlushReason flush_reason, FlushRequest* req);
|
||||
|
||||
|
@ -2146,7 +2166,7 @@ class DBImpl : public DB {
|
|||
// Cancel scheduled periodic tasks
|
||||
Status CancelPeriodicTaskScheduler();
|
||||
|
||||
Status RegisterRecordSeqnoTimeWorker();
|
||||
Status RegisterRecordSeqnoTimeWorker(bool from_db_open);
|
||||
|
||||
void PrintStatistics();
|
||||
|
||||
|
@ -2748,9 +2768,9 @@ class DBImpl : public DB {
|
|||
// Pointer to WriteBufferManager stalling interface.
|
||||
std::unique_ptr<StallInterface> wbm_stall_;
|
||||
|
||||
// seqno_time_mapping_ stores the sequence number to time mapping, it's not
|
||||
// seqno_to_time_mapping_ stores the sequence number to time mapping, it's not
|
||||
// thread safe, both read and write need db mutex hold.
|
||||
SeqnoToTimeMapping seqno_time_mapping_;
|
||||
SeqnoToTimeMapping seqno_to_time_mapping_;
|
||||
|
||||
// Stop write token that is acquired when first LockWAL() is called.
|
||||
// Destroyed when last UnlockWAL() is called. Controlled by DB mutex.
|
||||
|
|
|
@ -222,9 +222,10 @@ Status DBImpl::FlushMemTableToOutputFile(
|
|||
// `snapshot_seqs` has already been computed before this function starts.
|
||||
// Recording the max memtable ID ensures that the flush job does not flush
|
||||
// a memtable without knowing such snapshot(s).
|
||||
uint64_t max_memtable_id = needs_to_sync_closed_wals
|
||||
? cfd->imm()->GetLatestMemTableID()
|
||||
: std::numeric_limits<uint64_t>::max();
|
||||
uint64_t max_memtable_id =
|
||||
needs_to_sync_closed_wals
|
||||
? cfd->imm()->GetLatestMemTableID(false /* for_atomic_flush */)
|
||||
: std::numeric_limits<uint64_t>::max();
|
||||
|
||||
// If needs_to_sync_closed_wals is false, then the flush job will pick ALL
|
||||
// existing memtables of the column family when PickMemTable() is called
|
||||
|
@ -250,7 +251,7 @@ Status DBImpl::FlushMemTableToOutputFile(
|
|||
GetCompressionFlush(*cfd->ioptions(), mutable_cf_options), stats_,
|
||||
&event_logger_, mutable_cf_options.report_bg_io_stats,
|
||||
true /* sync_output_directory */, true /* write_manifest */, thread_pri,
|
||||
io_tracer_, seqno_time_mapping_, db_id_, db_session_id_,
|
||||
io_tracer_, seqno_to_time_mapping_, db_id_, db_session_id_,
|
||||
cfd->GetFullHistoryTsLow(), &blob_callback_);
|
||||
FileMetaData file_meta;
|
||||
|
||||
|
@ -284,6 +285,24 @@ Status DBImpl::FlushMemTableToOutputFile(
|
|||
// If the log sync failed, we do not need to pick memtable. Otherwise,
|
||||
// num_flush_not_started_ needs to be rollback.
|
||||
TEST_SYNC_POINT("DBImpl::FlushMemTableToOutputFile:BeforePickMemtables");
|
||||
// Exit a flush due to bg error should not set bg error again.
|
||||
bool skip_set_bg_error = false;
|
||||
if (s.ok() && !error_handler_.GetBGError().ok() &&
|
||||
error_handler_.IsBGWorkStopped() &&
|
||||
flush_reason != FlushReason::kErrorRecovery &&
|
||||
flush_reason != FlushReason::kErrorRecoveryRetryFlush) {
|
||||
// Error recovery in progress, should not pick memtable which excludes
|
||||
// them from being picked up by recovery flush.
|
||||
// This ensures that when bg error is set, no new flush can pick
|
||||
// memtables.
|
||||
skip_set_bg_error = true;
|
||||
s = error_handler_.GetBGError();
|
||||
assert(!s.ok());
|
||||
ROCKS_LOG_BUFFER(log_buffer,
|
||||
"[JOB %d] Skip flush due to background error %s",
|
||||
job_context->job_id, s.ToString().c_str());
|
||||
}
|
||||
|
||||
if (s.ok()) {
|
||||
flush_job.PickMemTable();
|
||||
need_cancel = true;
|
||||
|
@ -304,7 +323,8 @@ Status DBImpl::FlushMemTableToOutputFile(
|
|||
// is unlocked by the current thread.
|
||||
if (s.ok()) {
|
||||
s = flush_job.Run(&logs_with_prep_tracker_, &file_meta,
|
||||
&switched_to_mempurge);
|
||||
&switched_to_mempurge, &skip_set_bg_error,
|
||||
&error_handler_);
|
||||
need_cancel = false;
|
||||
}
|
||||
|
||||
|
@ -345,7 +365,8 @@ Status DBImpl::FlushMemTableToOutputFile(
|
|||
}
|
||||
}
|
||||
|
||||
if (!s.ok() && !s.IsShutdownInProgress() && !s.IsColumnFamilyDropped()) {
|
||||
if (!s.ok() && !s.IsShutdownInProgress() && !s.IsColumnFamilyDropped() &&
|
||||
!skip_set_bg_error) {
|
||||
if (log_io_s.ok()) {
|
||||
// Error while writing to MANIFEST.
|
||||
// In fact, versions_->io_status() can also be the result of renaming
|
||||
|
@ -502,7 +523,7 @@ Status DBImpl::AtomicFlushMemTablesToOutputFiles(
|
|||
GetCompressionFlush(*cfd->ioptions(), mutable_cf_options), stats_,
|
||||
&event_logger_, mutable_cf_options.report_bg_io_stats,
|
||||
false /* sync_output_directory */, false /* write_manifest */,
|
||||
thread_pri, io_tracer_, seqno_time_mapping_, db_id_, db_session_id_,
|
||||
thread_pri, io_tracer_, seqno_to_time_mapping_, db_id_, db_session_id_,
|
||||
cfd->GetFullHistoryTsLow(), &blob_callback_));
|
||||
}
|
||||
|
||||
|
@ -557,6 +578,21 @@ Status DBImpl::AtomicFlushMemTablesToOutputFiles(
|
|||
pick_status.push_back(false);
|
||||
}
|
||||
|
||||
bool flush_for_recovery =
|
||||
bg_flush_args[0].flush_reason_ == FlushReason::kErrorRecovery ||
|
||||
bg_flush_args[0].flush_reason_ == FlushReason::kErrorRecoveryRetryFlush;
|
||||
bool skip_set_bg_error = false;
|
||||
|
||||
if (s.ok() && !error_handler_.GetBGError().ok() &&
|
||||
error_handler_.IsBGWorkStopped() && !flush_for_recovery) {
|
||||
s = error_handler_.GetBGError();
|
||||
skip_set_bg_error = true;
|
||||
assert(!s.ok());
|
||||
ROCKS_LOG_BUFFER(log_buffer,
|
||||
"[JOB %d] Skip flush due to background error %s",
|
||||
job_context->job_id, s.ToString().c_str());
|
||||
}
|
||||
|
||||
if (s.ok()) {
|
||||
for (int i = 0; i != num_cfs; ++i) {
|
||||
jobs[i]->PickMemTable();
|
||||
|
@ -621,7 +657,10 @@ Status DBImpl::AtomicFlushMemTablesToOutputFiles(
|
|||
}
|
||||
}
|
||||
}
|
||||
} else {
|
||||
} else if (!skip_set_bg_error) {
|
||||
// When `skip_set_bg_error` is true, no memtable is picked so
|
||||
// there is no need to call Cancel() or RollbackMemtableFlush().
|
||||
//
|
||||
// Need to undo atomic flush if something went wrong, i.e. s is not OK and
|
||||
// it is not because of CF drop.
|
||||
// Have to cancel the flush jobs that have NOT executed because we need to
|
||||
|
@ -634,8 +673,8 @@ Status DBImpl::AtomicFlushMemTablesToOutputFiles(
|
|||
for (int i = 0; i != num_cfs; ++i) {
|
||||
if (exec_status[i].second.ok() && exec_status[i].first) {
|
||||
auto& mems = jobs[i]->GetMemTables();
|
||||
cfds[i]->imm()->RollbackMemtableFlush(mems,
|
||||
file_meta[i].fd.GetNumber());
|
||||
cfds[i]->imm()->RollbackMemtableFlush(
|
||||
mems, /*rollback_succeeding_memtables=*/false);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -677,10 +716,7 @@ Status DBImpl::AtomicFlushMemTablesToOutputFiles(
|
|||
};
|
||||
|
||||
bool resuming_from_bg_err =
|
||||
error_handler_.IsDBStopped() ||
|
||||
(bg_flush_args[0].flush_reason_ == FlushReason::kErrorRecovery ||
|
||||
bg_flush_args[0].flush_reason_ ==
|
||||
FlushReason::kErrorRecoveryRetryFlush);
|
||||
error_handler_.IsDBStopped() || flush_for_recovery;
|
||||
while ((!resuming_from_bg_err || error_handler_.GetRecoveryError().ok())) {
|
||||
std::pair<Status, bool> res = wait_to_install_func();
|
||||
|
||||
|
@ -691,15 +727,27 @@ Status DBImpl::AtomicFlushMemTablesToOutputFiles(
|
|||
s = res.first;
|
||||
break;
|
||||
} else if (!res.second) {
|
||||
// we are the oldest immutable memtable
|
||||
break;
|
||||
}
|
||||
// We are not the oldest immutable memtable
|
||||
TEST_SYNC_POINT_CALLBACK(
|
||||
"DBImpl::AtomicFlushMemTablesToOutputFiles:WaitCV", &res);
|
||||
//
|
||||
// If bg work is stopped, recovery thread first calls
|
||||
// WaitForBackgroundWork() before proceeding to flush for recovery. This
|
||||
// flush can block WaitForBackgroundWork() while waiting for recovery
|
||||
// flush to install result. To avoid this deadlock, we should abort here
|
||||
// if there is background error.
|
||||
if (!flush_for_recovery && error_handler_.IsBGWorkStopped() &&
|
||||
!error_handler_.GetBGError().ok()) {
|
||||
s = error_handler_.GetBGError();
|
||||
assert(!s.ok());
|
||||
break;
|
||||
}
|
||||
atomic_flush_install_cv_.Wait();
|
||||
|
||||
resuming_from_bg_err =
|
||||
error_handler_.IsDBStopped() ||
|
||||
(bg_flush_args[0].flush_reason_ == FlushReason::kErrorRecovery ||
|
||||
bg_flush_args[0].flush_reason_ ==
|
||||
FlushReason::kErrorRecoveryRetryFlush);
|
||||
resuming_from_bg_err = error_handler_.IsDBStopped() || flush_for_recovery;
|
||||
}
|
||||
|
||||
if (!resuming_from_bg_err) {
|
||||
|
@ -715,6 +763,17 @@ Status DBImpl::AtomicFlushMemTablesToOutputFiles(
|
|||
// installation.
|
||||
s = error_handler_.GetRecoveryError();
|
||||
}
|
||||
// Since we are not installing these memtables, need to rollback
|
||||
// to allow future flush job to pick up these memtables.
|
||||
if (!s.ok()) {
|
||||
for (int i = 0; i != num_cfs; ++i) {
|
||||
assert(exec_status[i].first);
|
||||
assert(exec_status[i].second.ok());
|
||||
auto& mems = jobs[i]->GetMemTables();
|
||||
cfds[i]->imm()->RollbackMemtableFlush(
|
||||
mems, /*rollback_succeeding_memtables=*/false);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if (s.ok()) {
|
||||
|
@ -818,7 +877,7 @@ Status DBImpl::AtomicFlushMemTablesToOutputFiles(
|
|||
|
||||
// Need to undo atomic flush if something went wrong, i.e. s is not OK and
|
||||
// it is not because of CF drop.
|
||||
if (!s.ok() && !s.IsColumnFamilyDropped()) {
|
||||
if (!s.ok() && !s.IsColumnFamilyDropped() && !skip_set_bg_error) {
|
||||
if (log_io_s.ok()) {
|
||||
// Error while writing to MANIFEST.
|
||||
// In fact, versions_->io_status() can also be the result of renaming
|
||||
|
@ -1462,7 +1521,8 @@ Status DBImpl::CompactFilesImpl(
|
|||
// without releasing the lock, so we're guaranteed a compaction can be formed.
|
||||
assert(c != nullptr);
|
||||
|
||||
c->SetInputVersion(version);
|
||||
c->FinalizeInputInfo(version);
|
||||
|
||||
// deletion compaction currently not allowed in CompactFiles.
|
||||
assert(!c->deletion_compaction());
|
||||
|
||||
|
@ -2171,7 +2231,8 @@ void DBImpl::GenerateFlushRequest(const autovector<ColumnFamilyData*>& cfds,
|
|||
// cfd may be null, see DBImpl::ScheduleFlushes
|
||||
continue;
|
||||
}
|
||||
uint64_t max_memtable_id = cfd->imm()->GetLatestMemTableID();
|
||||
uint64_t max_memtable_id = cfd->imm()->GetLatestMemTableID(
|
||||
immutable_db_options_.atomic_flush /* for_atomic_flush */);
|
||||
req->cfd_to_max_mem_id_to_persist.emplace(cfd, max_memtable_id);
|
||||
}
|
||||
}
|
||||
|
@ -2215,15 +2276,7 @@ Status DBImpl::FlushMemTable(ColumnFamilyData* cfd,
|
|||
}
|
||||
WaitForPendingWrites();
|
||||
|
||||
if (flush_reason != FlushReason::kErrorRecoveryRetryFlush &&
|
||||
(!cfd->mem()->IsEmpty() || !cached_recoverable_state_empty_.load())) {
|
||||
// Note that, when flush reason is kErrorRecoveryRetryFlush, during the
|
||||
// auto retry resume, we want to avoid creating new small memtables.
|
||||
// Therefore, SwitchMemtable will not be called. Also, since ResumeImpl
|
||||
// will iterate through all the CFs and call FlushMemtable during auto
|
||||
// retry resume, it is possible that in some CFs,
|
||||
// cfd->imm()->NumNotFlushed() = 0. In this case, so no flush request will
|
||||
// be created and scheduled, status::OK() will be returned.
|
||||
if (!cfd->mem()->IsEmpty() || !cached_recoverable_state_empty_.load()) {
|
||||
s = SwitchMemtable(cfd, &context);
|
||||
}
|
||||
const uint64_t flush_memtable_id = std::numeric_limits<uint64_t>::max();
|
||||
|
@ -2232,10 +2285,10 @@ Status DBImpl::FlushMemTable(ColumnFamilyData* cfd,
|
|||
!cached_recoverable_state_empty_.load()) {
|
||||
FlushRequest req{flush_reason, {{cfd, flush_memtable_id}}};
|
||||
flush_reqs.emplace_back(std::move(req));
|
||||
memtable_ids_to_wait.emplace_back(cfd->imm()->GetLatestMemTableID());
|
||||
memtable_ids_to_wait.emplace_back(
|
||||
cfd->imm()->GetLatestMemTableID(false /* for_atomic_flush */));
|
||||
}
|
||||
if (immutable_db_options_.persist_stats_to_disk &&
|
||||
flush_reason != FlushReason::kErrorRecoveryRetryFlush) {
|
||||
if (immutable_db_options_.persist_stats_to_disk) {
|
||||
ColumnFamilyData* cfd_stats =
|
||||
versions_->GetColumnFamilySet()->GetColumnFamily(
|
||||
kPersistentStatsColumnFamilyName);
|
||||
|
@ -2261,7 +2314,8 @@ Status DBImpl::FlushMemTable(ColumnFamilyData* cfd,
|
|||
FlushRequest req{flush_reason, {{cfd_stats, flush_memtable_id}}};
|
||||
flush_reqs.emplace_back(std::move(req));
|
||||
memtable_ids_to_wait.emplace_back(
|
||||
cfd_stats->imm()->GetLatestMemTableID());
|
||||
cfd_stats->imm()->GetLatestMemTableID(
|
||||
false /* for_atomic_flush */));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -2312,8 +2366,7 @@ Status DBImpl::FlushMemTable(ColumnFamilyData* cfd,
|
|||
}
|
||||
s = WaitForFlushMemTables(
|
||||
cfds, flush_memtable_ids,
|
||||
(flush_reason == FlushReason::kErrorRecovery ||
|
||||
flush_reason == FlushReason::kErrorRecoveryRetryFlush));
|
||||
flush_reason == FlushReason::kErrorRecovery /* resuming_from_bg_err */);
|
||||
InstrumentedMutexLock lock_guard(&mutex_);
|
||||
for (auto* tmp_cfd : cfds) {
|
||||
tmp_cfd->UnrefAndTryDelete();
|
||||
|
@ -2408,8 +2461,7 @@ Status DBImpl::AtomicFlushMemTables(
|
|||
}
|
||||
|
||||
for (auto cfd : cfds) {
|
||||
if ((cfd->mem()->IsEmpty() && cached_recoverable_state_empty_.load()) ||
|
||||
flush_reason == FlushReason::kErrorRecoveryRetryFlush) {
|
||||
if (cfd->mem()->IsEmpty() && cached_recoverable_state_empty_.load()) {
|
||||
continue;
|
||||
}
|
||||
cfd->Ref();
|
||||
|
@ -2454,8 +2506,7 @@ Status DBImpl::AtomicFlushMemTables(
|
|||
}
|
||||
s = WaitForFlushMemTables(
|
||||
cfds, flush_memtable_ids,
|
||||
(flush_reason == FlushReason::kErrorRecovery ||
|
||||
flush_reason == FlushReason::kErrorRecoveryRetryFlush));
|
||||
flush_reason == FlushReason::kErrorRecovery /* resuming_from_bg_err */);
|
||||
InstrumentedMutexLock lock_guard(&mutex_);
|
||||
for (auto* cfd : cfds) {
|
||||
cfd->UnrefAndTryDelete();
|
||||
|
@ -2464,6 +2515,68 @@ Status DBImpl::AtomicFlushMemTables(
|
|||
return s;
|
||||
}
|
||||
|
||||
Status DBImpl::RetryFlushesForErrorRecovery(FlushReason flush_reason,
|
||||
bool wait) {
|
||||
mutex_.AssertHeld();
|
||||
assert(flush_reason == FlushReason::kErrorRecoveryRetryFlush ||
|
||||
flush_reason == FlushReason::kCatchUpAfterErrorRecovery);
|
||||
|
||||
// Collect referenced CFDs.
|
||||
autovector<ColumnFamilyData*> cfds;
|
||||
for (ColumnFamilyData* cfd : *versions_->GetColumnFamilySet()) {
|
||||
if (!cfd->IsDropped() && cfd->initialized() &&
|
||||
cfd->imm()->NumNotFlushed() != 0) {
|
||||
cfd->Ref();
|
||||
cfd->imm()->FlushRequested();
|
||||
cfds.push_back(cfd);
|
||||
}
|
||||
}
|
||||
|
||||
// Submit flush requests for all immutable memtables needing flush.
|
||||
// `flush_memtable_ids` will be populated such that all immutable
|
||||
// memtables eligible for flush are waited on before this function
|
||||
// returns.
|
||||
autovector<uint64_t> flush_memtable_ids;
|
||||
if (immutable_db_options_.atomic_flush) {
|
||||
FlushRequest flush_req;
|
||||
GenerateFlushRequest(cfds, flush_reason, &flush_req);
|
||||
SchedulePendingFlush(flush_req);
|
||||
for (auto& iter : flush_req.cfd_to_max_mem_id_to_persist) {
|
||||
flush_memtable_ids.push_back(iter.second);
|
||||
}
|
||||
} else {
|
||||
for (auto cfd : cfds) {
|
||||
flush_memtable_ids.push_back(
|
||||
cfd->imm()->GetLatestMemTableID(false /* for_atomic_flush */));
|
||||
// Impose no bound on the highest memtable ID flushed. There is no
|
||||
// reason to do so outside of atomic flush.
|
||||
FlushRequest flush_req{
|
||||
flush_reason,
|
||||
{{cfd,
|
||||
std::numeric_limits<uint64_t>::max() /* max_mem_id_to_persist */}}};
|
||||
SchedulePendingFlush(flush_req);
|
||||
}
|
||||
}
|
||||
MaybeScheduleFlushOrCompaction();
|
||||
|
||||
Status s;
|
||||
if (wait) {
|
||||
mutex_.Unlock();
|
||||
autovector<const uint64_t*> flush_memtable_id_ptrs;
|
||||
for (auto& flush_memtable_id : flush_memtable_ids) {
|
||||
flush_memtable_id_ptrs.push_back(&flush_memtable_id);
|
||||
}
|
||||
s = WaitForFlushMemTables(cfds, flush_memtable_id_ptrs,
|
||||
true /* resuming_from_bg_err */);
|
||||
mutex_.Lock();
|
||||
}
|
||||
|
||||
for (auto* cfd : cfds) {
|
||||
cfd->UnrefAndTryDelete();
|
||||
}
|
||||
return s;
|
||||
}
|
||||
|
||||
// Calling FlushMemTable(), whether from DB::Flush() or from Backup Engine, can
|
||||
// cause write stall, for example if one memtable is being flushed already.
|
||||
// This method tries to avoid write stall (similar to CompactRange() behavior)
|
||||
|
@ -2677,6 +2790,11 @@ void DBImpl::MaybeScheduleFlushOrCompaction() {
|
|||
// There has been a hard error and this call is not part of the recovery
|
||||
// sequence. Bail out here so we don't get into an endless loop of
|
||||
// scheduling BG work which will again call this function
|
||||
//
|
||||
// Note that a non-recovery flush can still be scheduled if
|
||||
// error_handler_.IsRecoveryInProgress() returns true. We rely on
|
||||
// BackgroundCallFlush() to check flush reason and drop non-recovery
|
||||
// flushes.
|
||||
return;
|
||||
} else if (shutting_down_.load(std::memory_order_acquire)) {
|
||||
// DB is being deleted; no more background compactions
|
||||
|
@ -3006,6 +3124,24 @@ Status DBImpl::BackgroundFlush(bool* made_progress, JobContext* job_context,
|
|||
// This cfd is already referenced
|
||||
FlushRequest flush_req = PopFirstFromFlushQueue();
|
||||
FlushReason flush_reason = flush_req.flush_reason;
|
||||
if (!error_handler_.GetBGError().ok() && error_handler_.IsBGWorkStopped() &&
|
||||
flush_reason != FlushReason::kErrorRecovery &&
|
||||
flush_reason != FlushReason::kErrorRecoveryRetryFlush) {
|
||||
// Stop non-recovery flush when bg work is stopped
|
||||
// Note that we drop the flush request here.
|
||||
// Recovery thread should schedule further flushes after bg error
|
||||
// is cleared.
|
||||
status = error_handler_.GetBGError();
|
||||
assert(!status.ok());
|
||||
ROCKS_LOG_BUFFER(log_buffer,
|
||||
"[JOB %d] Abort flush due to background error %s",
|
||||
job_context->job_id, status.ToString().c_str());
|
||||
*reason = flush_reason;
|
||||
for (auto item : flush_req.cfd_to_max_mem_id_to_persist) {
|
||||
item.first->UnrefAndTryDelete();
|
||||
}
|
||||
return status;
|
||||
}
|
||||
if (!immutable_db_options_.atomic_flush &&
|
||||
ShouldRescheduleFlushRequestToRetainUDT(flush_req)) {
|
||||
assert(flush_req.cfd_to_max_mem_id_to_persist.size() == 1);
|
||||
|
@ -3148,9 +3284,9 @@ void DBImpl::BackgroundCallFlush(Env::Priority thread_pri) {
|
|||
bg_cv_.SignalAll(); // In case a waiter can proceed despite the error
|
||||
mutex_.Unlock();
|
||||
ROCKS_LOG_ERROR(immutable_db_options_.info_log,
|
||||
"Waiting after background flush error: %s"
|
||||
"[JOB %d] Waiting after background flush error: %s"
|
||||
"Accumulated background error counts: %" PRIu64,
|
||||
s.ToString().c_str(), error_cnt);
|
||||
job_context.job_id, s.ToString().c_str(), error_cnt);
|
||||
log_buffer.FlushBufferToLog();
|
||||
LogFlush(immutable_db_options_.info_log);
|
||||
immutable_db_options_.clock->SleepForMicroseconds(1000000);
|
||||
|
@ -3954,7 +4090,12 @@ void DBImpl::BuildCompactionJobInfo(
|
|||
compaction_job_info->base_input_level = c->start_level();
|
||||
compaction_job_info->output_level = c->output_level();
|
||||
compaction_job_info->stats = compaction_job_stats;
|
||||
compaction_job_info->table_properties = c->GetTableProperties();
|
||||
const auto& input_table_properties = c->GetInputTableProperties();
|
||||
const auto& output_table_properties = c->GetOutputTableProperties();
|
||||
compaction_job_info->table_properties.insert(input_table_properties.begin(),
|
||||
input_table_properties.end());
|
||||
compaction_job_info->table_properties.insert(output_table_properties.begin(),
|
||||
output_table_properties.end());
|
||||
compaction_job_info->compaction_reason = c->compaction_reason();
|
||||
compaction_job_info->compression = c->output_compression();
|
||||
|
||||
|
|
|
@ -306,7 +306,7 @@ const PeriodicTaskScheduler& DBImpl::TEST_GetPeriodicTaskScheduler() const {
|
|||
|
||||
SeqnoToTimeMapping DBImpl::TEST_GetSeqnoToTimeMapping() const {
|
||||
InstrumentedMutexLock l(&mutex_);
|
||||
return seqno_time_mapping_;
|
||||
return seqno_to_time_mapping_;
|
||||
}
|
||||
|
||||
|
||||
|
|
|
@@ -25,6 +25,7 @@
#include "rocksdb/wal_filter.h"
#include "test_util/sync_point.h"
#include "util/rate_limiter_impl.h"
#include "util/string_util.h"
#include "util/udt_util.h"

namespace ROCKSDB_NAMESPACE {

@@ -291,6 +292,18 @@ Status DBImpl::ValidateOptions(const DBOptions& db_options) {
        "writes in direct IO require writable_file_max_buffer_size > 0");
  }

  if (db_options.daily_offpeak_time_utc != "") {
    int start_time, end_time;
    if (!TryParseTimeRangeString(db_options.daily_offpeak_time_utc, start_time,
                                 end_time)) {
      return Status::InvalidArgument(
          "daily_offpeak_time_utc should be set in the format HH:mm-HH:mm "
          "(e.g. 04:30-07:30)");
    } else if (start_time == end_time) {
      return Status::InvalidArgument(
          "start_time and end_time cannot be the same");
    }
  }
  return Status::OK();
}

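TryParseTimeRangeString is the real helper used above; the sketch below is a simplified stand-in that shows the expected HH:mm-HH:mm format and the midnight wrap-around behavior exercised by the OffPeakTimes test later in this diff (for example "23:30-04:30"). ParseOffpeakSketch and IsOffpeakSketch are illustrative names, not RocksDB APIs:

#include <cstdio>
#include <string>

// Parse "HH:mm-HH:mm" into minutes since midnight; reject identical endpoints.
bool ParseOffpeakSketch(const std::string& range, int* start_min, int* end_min) {
  int sh, sm, eh, em;
  if (std::sscanf(range.c_str(), "%d:%d-%d:%d", &sh, &sm, &eh, &em) != 4) {
    return false;
  }
  if (sh < 0 || sh > 23 || eh < 0 || eh > 23 || sm < 0 || sm > 59 || em < 0 ||
      em > 59) {
    return false;
  }
  *start_min = sh * 60 + sm;
  *end_min = eh * 60 + em;
  return *start_min != *end_min;
}

// Off-peak if "now" falls in [start, end], allowing the window to wrap past
// midnight (start > end means "after start today or before end tomorrow").
bool IsOffpeakSketch(int now_min, int start_min, int end_min) {
  if (start_min <= end_min) {
    return start_min <= now_min && now_min <= end_min;
  }
  return now_min >= start_min || now_min <= end_min;
}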
@ -1298,7 +1311,7 @@ Status DBImpl::RecoverLogFiles(const std::vector<uint64_t>& wal_numbers,
|
|||
flushed = true;
|
||||
|
||||
cfd->CreateNewMemtable(*cfd->GetLatestMutableCFOptions(),
|
||||
*next_sequence);
|
||||
*next_sequence - 1);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -1649,7 +1662,7 @@ Status DBImpl::WriteLevel0TableForRecovery(int job_id, ColumnFamilyData* cfd,
|
|||
TableFileCreationReason::kRecovery, 0 /* oldest_key_time */,
|
||||
0 /* file_creation_time */, db_id_, db_session_id_,
|
||||
0 /* target_file_size */, meta.fd.GetNumber());
|
||||
SeqnoToTimeMapping empty_seqno_time_mapping;
|
||||
SeqnoToTimeMapping empty_seqno_to_time_mapping;
|
||||
Version* version = cfd->current();
|
||||
version->Ref();
|
||||
const ReadOptions read_option(Env::IOActivity::kDBOpen);
|
||||
|
@ -1661,7 +1674,7 @@ Status DBImpl::WriteLevel0TableForRecovery(int job_id, ColumnFamilyData* cfd,
|
|||
snapshot_seqs, earliest_write_conflict_snapshot, kMaxSequenceNumber,
|
||||
snapshot_checker, paranoid_file_checks, cfd->internal_stats(), &io_s,
|
||||
io_tracer_, BlobFileCreationReason::kRecovery,
|
||||
empty_seqno_time_mapping, &event_logger_, job_id, Env::IO_HIGH,
|
||||
empty_seqno_to_time_mapping, &event_logger_, job_id, Env::IO_HIGH,
|
||||
nullptr /* table_properties */, write_hint,
|
||||
nullptr /*full_history_ts_low*/, &blob_callback_, version,
|
||||
&num_input_entries);
|
||||
|
@ -2056,7 +2069,9 @@ Status DBImpl::Open(const DBOptions& db_options, const std::string& dbname,
|
|||
// missing column family, create it
|
||||
ColumnFamilyHandle* handle = nullptr;
|
||||
impl->mutex_.Unlock();
|
||||
s = impl->CreateColumnFamily(cf.options, cf.name, &handle);
|
||||
// NOTE: the work normally done in WrapUpCreateColumnFamilies will
|
||||
// be done separately below.
|
||||
s = impl->CreateColumnFamilyImpl(cf.options, cf.name, &handle);
|
||||
impl->mutex_.Lock();
|
||||
if (s.ok()) {
|
||||
handles->push_back(handle);
|
||||
|
@ -2232,7 +2247,7 @@ Status DBImpl::Open(const DBOptions& db_options, const std::string& dbname,
|
|||
}
|
||||
|
||||
if (s.ok()) {
|
||||
s = impl->RegisterRecordSeqnoTimeWorker();
|
||||
s = impl->RegisterRecordSeqnoTimeWorker(/*from_db_open=*/true);
|
||||
}
|
||||
if (!s.ok()) {
|
||||
for (auto* h : *handles) {
|
||||
|
|
|
@ -885,7 +885,7 @@ Status DBImplSecondary::CompactWithoutInstallation(
|
|||
*mutable_cf_options, mutable_db_options_, 0));
|
||||
assert(c != nullptr);
|
||||
|
||||
c->SetInputVersion(version);
|
||||
c->FinalizeInputInfo(version);
|
||||
|
||||
// Create output directory if it's not existed yet
|
||||
std::unique_ptr<FSDirectory> output_dir;
|
||||
|
|
db/db_iter.cc (132 lines changed)
|
@ -83,7 +83,8 @@ DBIter::DBIter(Env* _env, const ReadOptions& read_options,
|
|||
cfd_(cfd),
|
||||
timestamp_ub_(read_options.timestamp),
|
||||
timestamp_lb_(read_options.iter_start_ts),
|
||||
timestamp_size_(timestamp_ub_ ? timestamp_ub_->size() : 0) {
|
||||
timestamp_size_(timestamp_ub_ ? timestamp_ub_->size() : 0),
|
||||
auto_readahead_size_(read_options.auto_readahead_size) {
|
||||
RecordTick(statistics_, NO_ITERATOR_CREATED);
|
||||
if (pin_thru_lifetime_) {
|
||||
pinned_iters_mgr_.StartPinning();
|
||||
|
@@ -238,6 +239,31 @@ bool DBIter::SetValueAndColumnsFromEntity(Slice slice) {
  return true;
}

bool DBIter::SetValueAndColumnsFromMergeResult(const Status& merge_status,
                                               ValueType result_type) {
  if (!merge_status.ok()) {
    valid_ = false;
    status_ = merge_status;
    return false;
  }

  if (result_type == kTypeWideColumnEntity) {
    if (!SetValueAndColumnsFromEntity(saved_value_)) {
      assert(!valid_);
      return false;
    }

    valid_ = true;
    return true;
  }

  assert(result_type == kTypeValue);
  SetValueAndColumnsFromPlain(pinned_value_.data() ? pinned_value_
                                                   : saved_value_);
  valid_ = true;
  return true;
}

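This new helper captures what the MergeWith*BaseValue methods further below have in common: the merge result can now be either a plain value or a wide-column entity, and the iterator state is set accordingly. As a self-contained illustration of that shape, and only as a conceptual model rather than the RocksDB internals, the three base-value cases and the two result kinds can be pictured like this, using put-style semantics in which the last operand wins:

#include <string>
#include <utility>
#include <variant>
#include <vector>

using PlainValue = std::string;
using WideColumnEntity = std::vector<std::pair<std::string, std::string>>;
using BaseValue = std::variant<std::monostate, PlainValue, WideColumnEntity>;
using MergeResult = std::variant<PlainValue, WideColumnEntity>;

// Toy merge: no base or plain base yields a plain value; a wide-column base
// keeps its columns and only the default (front) column is overwritten.
MergeResult ApplyMergeSketch(const BaseValue& base,
                             const std::vector<std::string>& operands) {
  const std::string last = operands.empty() ? "" : operands.back();
  if (std::holds_alternative<WideColumnEntity>(base)) {
    WideColumnEntity entity = std::get<WideColumnEntity>(base);
    if (entity.empty()) {
      entity.emplace_back("", last);
    } else {
      entity.front().second = last;  // assume front() is the default column
    }
    return entity;
  }
  return last;
}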
// PRE: saved_key_ has the current user key if skipping_saved_key
|
||||
// POST: saved_key_ should have the next user key if valid_,
|
||||
// if the current entry is a result of merge
|
||||
|
@ -554,8 +580,7 @@ bool DBIter::MergeValuesNewToOld() {
|
|||
if (kTypeValue == ikey.type) {
|
||||
// hit a put, merge the put value with operands and store the
|
||||
// final result in saved_value_. We are done!
|
||||
const Slice val = iter_.value();
|
||||
if (!Merge(&val, ikey.user_key)) {
|
||||
if (!MergeWithPlainBaseValue(iter_.value(), ikey.user_key)) {
|
||||
return false;
|
||||
}
|
||||
// iter_ is positioned after put
|
||||
|
@ -584,7 +609,7 @@ bool DBIter::MergeValuesNewToOld() {
|
|||
return false;
|
||||
}
|
||||
valid_ = true;
|
||||
if (!Merge(&blob_value_, ikey.user_key)) {
|
||||
if (!MergeWithPlainBaseValue(blob_value_, ikey.user_key)) {
|
||||
return false;
|
||||
}
|
||||
|
||||
|
@ -598,7 +623,7 @@ bool DBIter::MergeValuesNewToOld() {
|
|||
}
|
||||
return true;
|
||||
} else if (kTypeWideColumnEntity == ikey.type) {
|
||||
if (!MergeEntity(iter_.value(), ikey.user_key)) {
|
||||
if (!MergeWithWideColumnBaseValue(iter_.value(), ikey.user_key)) {
|
||||
return false;
|
||||
}
|
||||
|
||||
|
@ -628,7 +653,7 @@ bool DBIter::MergeValuesNewToOld() {
|
|||
// a deletion marker.
|
||||
// feed null as the existing value to the merge operator, such that
|
||||
// client can differentiate this scenario and do things accordingly.
|
||||
if (!Merge(nullptr, saved_key_.GetUserKey())) {
|
||||
if (!MergeWithNoBaseValue(saved_key_.GetUserKey())) {
|
||||
return false;
|
||||
}
|
||||
assert(status_.ok());
|
||||
|
@ -719,15 +744,22 @@ bool DBIter::ReverseToBackward() {
|
|||
// When current_entry_is_merged_ is true, iter_ may be positioned on the next
|
||||
// key, which may not exist or may have prefix different from current.
|
||||
// If that's the case, seek to saved_key_.
|
||||
if (current_entry_is_merged_ &&
|
||||
(!expect_total_order_inner_iter() || !iter_.Valid())) {
|
||||
//
|
||||
// In case of auto_readahead_size enabled, index_iter moves forward during
|
||||
// forward scan for block cache lookup and points to different block. If Prev
|
||||
// op is called, it needs to call SeekForPrev to point to right index_iter_ in
|
||||
// BlockBasedTableIterator. This only happens when direction is changed from
|
||||
// forward to backward.
|
||||
if ((current_entry_is_merged_ &&
|
||||
(!expect_total_order_inner_iter() || !iter_.Valid())) ||
|
||||
auto_readahead_size_) {
|
||||
IterKey last_key;
|
||||
// Using kMaxSequenceNumber and kValueTypeForSeek
|
||||
// (not kValueTypeForSeekForPrev) to seek to a key strictly smaller
|
||||
// than saved_key_.
|
||||
last_key.SetInternalKey(ParsedInternalKey(
|
||||
saved_key_.GetUserKey(), kMaxSequenceNumber, kValueTypeForSeek));
|
||||
if (!expect_total_order_inner_iter()) {
|
||||
if (!expect_total_order_inner_iter() || auto_readahead_size_) {
|
||||
iter_.SeekForPrev(last_key.GetInternalKey());
|
||||
} else {
|
||||
// Some iterators may not support SeekForPrev(), so we avoid using it
|
||||
|
@ -979,7 +1011,7 @@ bool DBIter::FindValueForCurrentKey() {
|
|||
if (last_not_merge_type == kTypeDeletion ||
|
||||
last_not_merge_type == kTypeSingleDeletion ||
|
||||
last_not_merge_type == kTypeDeletionWithTimestamp) {
|
||||
if (!Merge(nullptr, saved_key_.GetUserKey())) {
|
||||
if (!MergeWithNoBaseValue(saved_key_.GetUserKey())) {
|
||||
return false;
|
||||
}
|
||||
return true;
|
||||
|
@ -994,7 +1026,7 @@ bool DBIter::FindValueForCurrentKey() {
|
|||
return false;
|
||||
}
|
||||
valid_ = true;
|
||||
if (!Merge(&blob_value_, saved_key_.GetUserKey())) {
|
||||
if (!MergeWithPlainBaseValue(blob_value_, saved_key_.GetUserKey())) {
|
||||
return false;
|
||||
}
|
||||
|
||||
|
@ -1002,14 +1034,15 @@ bool DBIter::FindValueForCurrentKey() {
|
|||
|
||||
return true;
|
||||
} else if (last_not_merge_type == kTypeWideColumnEntity) {
|
||||
if (!MergeEntity(pinned_value_, saved_key_.GetUserKey())) {
|
||||
if (!MergeWithWideColumnBaseValue(pinned_value_,
|
||||
saved_key_.GetUserKey())) {
|
||||
return false;
|
||||
}
|
||||
|
||||
return true;
|
||||
} else {
|
||||
assert(last_not_merge_type == kTypeValue);
|
||||
if (!Merge(&pinned_value_, saved_key_.GetUserKey())) {
|
||||
if (!MergeWithPlainBaseValue(pinned_value_, saved_key_.GetUserKey())) {
|
||||
return false;
|
||||
}
|
||||
return true;
|
||||
|
@ -1185,8 +1218,7 @@ bool DBIter::FindValueForCurrentKeyUsingSeek() {
|
|||
}
|
||||
|
||||
if (ikey.type == kTypeValue) {
|
||||
const Slice val = iter_.value();
|
||||
if (!Merge(&val, saved_key_.GetUserKey())) {
|
||||
if (!MergeWithPlainBaseValue(iter_.value(), saved_key_.GetUserKey())) {
|
||||
return false;
|
||||
}
|
||||
return true;
|
||||
|
@ -1205,7 +1237,7 @@ bool DBIter::FindValueForCurrentKeyUsingSeek() {
|
|||
return false;
|
||||
}
|
||||
valid_ = true;
|
||||
if (!Merge(&blob_value_, saved_key_.GetUserKey())) {
|
||||
if (!MergeWithPlainBaseValue(blob_value_, saved_key_.GetUserKey())) {
|
||||
return false;
|
||||
}
|
||||
|
||||
|
@ -1213,7 +1245,8 @@ bool DBIter::FindValueForCurrentKeyUsingSeek() {
|
|||
|
||||
return true;
|
||||
} else if (ikey.type == kTypeWideColumnEntity) {
|
||||
if (!MergeEntity(iter_.value(), saved_key_.GetUserKey())) {
|
||||
if (!MergeWithWideColumnBaseValue(iter_.value(),
|
||||
saved_key_.GetUserKey())) {
|
||||
return false;
|
||||
}
|
||||
|
||||
|
@ -1227,7 +1260,7 @@ bool DBIter::FindValueForCurrentKeyUsingSeek() {
|
|||
}
|
||||
}
|
||||
|
||||
if (!Merge(nullptr, saved_key_.GetUserKey())) {
|
||||
if (!MergeWithNoBaseValue(saved_key_.GetUserKey())) {
|
||||
return false;
|
||||
}
|
||||
|
||||
|
@ -1250,47 +1283,42 @@ bool DBIter::FindValueForCurrentKeyUsingSeek() {
|
|||
return true;
|
||||
}
|
||||
|
||||
bool DBIter::Merge(const Slice* val, const Slice& user_key) {
|
||||
bool DBIter::MergeWithNoBaseValue(const Slice& user_key) {
|
||||
// `op_failure_scope` (an output parameter) is not provided (set to nullptr)
|
||||
// since a failure must be propagated regardless of its value.
|
||||
Status s = MergeHelper::TimedFullMerge(
|
||||
merge_operator_, user_key, val, merge_context_.GetOperands(),
|
||||
&saved_value_, logger_, statistics_, clock_, &pinned_value_,
|
||||
/* update_num_ops_stats */ true,
|
||||
/* op_failure_scope */ nullptr);
|
||||
if (!s.ok()) {
|
||||
valid_ = false;
|
||||
status_ = s;
|
||||
return false;
|
||||
}
|
||||
|
||||
SetValueAndColumnsFromPlain(pinned_value_.data() ? pinned_value_
|
||||
: saved_value_);
|
||||
|
||||
valid_ = true;
|
||||
return true;
|
||||
ValueType result_type;
|
||||
const Status s = MergeHelper::TimedFullMerge(
|
||||
merge_operator_, user_key, MergeHelper::kNoBaseValue,
|
||||
merge_context_.GetOperands(), logger_, statistics_, clock_,
|
||||
/* update_num_ops_stats */ true, &saved_value_, &pinned_value_,
|
||||
&result_type, /* op_failure_scope */ nullptr);
|
||||
return SetValueAndColumnsFromMergeResult(s, result_type);
|
||||
}
|
||||
|
||||
bool DBIter::MergeEntity(const Slice& entity, const Slice& user_key) {
|
||||
bool DBIter::MergeWithPlainBaseValue(const Slice& value,
|
||||
const Slice& user_key) {
|
||||
// `op_failure_scope` (an output parameter) is not provided (set to nullptr)
|
||||
// since a failure must be propagated regardless of its value.
|
||||
Status s = MergeHelper::TimedFullMergeWithEntity(
|
||||
merge_operator_, user_key, entity, merge_context_.GetOperands(),
|
||||
&saved_value_, logger_, statistics_, clock_,
|
||||
/* update_num_ops_stats */ true,
|
||||
/* op_failure_scope */ nullptr);
|
||||
if (!s.ok()) {
|
||||
valid_ = false;
|
||||
status_ = s;
|
||||
return false;
|
||||
}
|
||||
ValueType result_type;
|
||||
const Status s = MergeHelper::TimedFullMerge(
|
||||
merge_operator_, user_key, MergeHelper::kPlainBaseValue, value,
|
||||
merge_context_.GetOperands(), logger_, statistics_, clock_,
|
||||
/* update_num_ops_stats */ true, &saved_value_, &pinned_value_,
|
||||
&result_type, /* op_failure_scope */ nullptr);
|
||||
return SetValueAndColumnsFromMergeResult(s, result_type);
|
||||
}
|
||||
|
||||
if (!SetValueAndColumnsFromEntity(saved_value_)) {
|
||||
return false;
|
||||
}
|
||||
|
||||
valid_ = true;
|
||||
return true;
|
||||
bool DBIter::MergeWithWideColumnBaseValue(const Slice& entity,
|
||||
const Slice& user_key) {
|
||||
// `op_failure_scope` (an output parameter) is not provided (set to nullptr)
|
||||
// since a failure must be propagated regardless of its value.
|
||||
ValueType result_type;
|
||||
const Status s = MergeHelper::TimedFullMerge(
|
||||
merge_operator_, user_key, MergeHelper::kWideBaseValue, entity,
|
||||
merge_context_.GetOperands(), logger_, statistics_, clock_,
|
||||
/* update_num_ops_stats */ true, &saved_value_, &pinned_value_,
|
||||
&result_type, /* op_failure_scope */ nullptr);
|
||||
return SetValueAndColumnsFromMergeResult(s, result_type);
|
||||
}
|
||||
|
||||
// Move backwards until the key smaller than saved_key_.
|
||||
|
|
db/db_iter.h (11 lines changed)
|
@@ -313,14 +313,20 @@ class DBIter final : public Iterator {

  bool SetValueAndColumnsFromEntity(Slice slice);

  bool SetValueAndColumnsFromMergeResult(const Status& merge_status,
                                         ValueType result_type);

  void ResetValueAndColumns() {
    value_.clear();
    wide_columns_.clear();
  }

  // The following methods perform the actual merge operation for the
  // no base value/plain base value/wide-column base value cases.
  // If user-defined timestamp is enabled, `user_key` includes timestamp.
  bool Merge(const Slice* val, const Slice& user_key);
  bool MergeEntity(const Slice& entity, const Slice& user_key);
  bool MergeWithNoBaseValue(const Slice& user_key);
  bool MergeWithPlainBaseValue(const Slice& value, const Slice& user_key);
  bool MergeWithWideColumnBaseValue(const Slice& entity, const Slice& user_key);

  const SliceTransform* prefix_extractor_;
  Env* const env_;

@@ -396,6 +402,7 @@ class DBIter final : public Iterator {
  const Slice* const timestamp_lb_;
  const size_t timestamp_size_;
  std::string saved_timestamp_;
  bool auto_readahead_size_;
};

// Return a new iterator that converts internal keys (yielded by
|
|
@ -6,9 +6,12 @@
|
|||
#include <vector>
|
||||
|
||||
#include "db/db_test_util.h"
|
||||
#include "db/dbformat.h"
|
||||
#include "db/forward_iterator.h"
|
||||
#include "port/stack_trace.h"
|
||||
#include "rocksdb/merge_operator.h"
|
||||
#include "rocksdb/snapshot.h"
|
||||
#include "rocksdb/utilities/debug.h"
|
||||
#include "util/random.h"
|
||||
#include "utilities/merge_operators.h"
|
||||
#include "utilities/merge_operators/string_append/stringappend2.h"
|
||||
|
@ -202,7 +205,6 @@ TEST_F(DBMergeOperatorTest, MergeErrorOnIteration) {
|
|||
VerifyDBInternal({{"k1", "v1"}, {"k2", "corrupted"}, {"k2", "v2"}});
|
||||
}
|
||||
|
||||
|
||||
TEST_F(DBMergeOperatorTest, MergeOperatorFailsWithMustMerge) {
|
||||
// This is like a mini-stress test dedicated to `OpFailureScope::kMustMerge`.
|
||||
// Some or most of it might be deleted upon adding that option to the actual
|
||||
|
@ -358,6 +360,98 @@ TEST_F(DBMergeOperatorTest, MergeOperatorFailsWithMustMerge) {
|
|||
}
|
||||
}
|
||||
|
||||
TEST_F(DBMergeOperatorTest, MergeOperandThresholdExceeded) {
|
||||
Options options = CurrentOptions();
|
||||
options.create_if_missing = true;
|
||||
options.merge_operator = MergeOperators::CreatePutOperator();
|
||||
options.env = env_;
|
||||
Reopen(options);
|
||||
|
||||
std::vector<Slice> keys{"foo", "bar", "baz"};
|
||||
|
||||
// Write base values.
|
||||
for (const auto& key : keys) {
|
||||
ASSERT_OK(Put(key, key.ToString() + "0"));
|
||||
}
|
||||
|
||||
// Write merge operands. Note that the first key has 1 merge operand, the
|
||||
// second one has 2 merge operands, and the third one has 3 merge operands.
|
||||
// Also, we'll take some snapshots to make sure the merge operands are
|
||||
// preserved during flush.
|
||||
std::vector<ManagedSnapshot> snapshots;
|
||||
snapshots.reserve(3);
|
||||
|
||||
for (size_t i = 0; i < keys.size(); ++i) {
|
||||
snapshots.emplace_back(db_);
|
||||
|
||||
const std::string suffix = std::to_string(i + 1);
|
||||
|
||||
for (size_t j = i; j < keys.size(); ++j) {
|
||||
ASSERT_OK(Merge(keys[j], keys[j].ToString() + suffix));
|
||||
}
|
||||
}
|
||||
|
||||
// Verify the results and status codes of various types of point lookups.
|
||||
auto verify = [&](const std::optional<size_t>& threshold) {
|
||||
ReadOptions read_options;
|
||||
read_options.merge_operand_count_threshold = threshold;
|
||||
|
||||
// Check Get()
|
||||
{
|
||||
for (size_t i = 0; i < keys.size(); ++i) {
|
||||
PinnableSlice value;
|
||||
const Status status =
|
||||
db_->Get(read_options, db_->DefaultColumnFamily(), keys[i], &value);
|
||||
ASSERT_OK(status);
|
||||
ASSERT_EQ(status.IsOkMergeOperandThresholdExceeded(),
|
||||
threshold.has_value() && i + 1 > threshold.value());
|
||||
ASSERT_EQ(value, keys[i].ToString() + std::to_string(i + 1));
|
||||
}
|
||||
}
|
||||
|
||||
// Check old-style MultiGet()
|
||||
{
|
||||
std::vector<std::string> values;
|
||||
std::vector<Status> statuses = db_->MultiGet(read_options, keys, &values);
|
||||
|
||||
for (size_t i = 0; i < keys.size(); ++i) {
|
||||
ASSERT_OK(statuses[i]);
|
||||
ASSERT_EQ(statuses[i].IsOkMergeOperandThresholdExceeded(),
|
||||
threshold.has_value() && i + 1 > threshold.value());
|
||||
ASSERT_EQ(values[i], keys[i].ToString() + std::to_string(i + 1));
|
||||
}
|
||||
}
|
||||
|
||||
// Check batched MultiGet()
|
||||
{
|
||||
std::vector<PinnableSlice> values(keys.size());
|
||||
std::vector<Status> statuses(keys.size());
|
||||
db_->MultiGet(read_options, db_->DefaultColumnFamily(), keys.size(),
|
||||
keys.data(), values.data(), statuses.data());
|
||||
|
||||
for (size_t i = 0; i < keys.size(); ++i) {
|
||||
ASSERT_OK(statuses[i]);
|
||||
ASSERT_EQ(statuses[i].IsOkMergeOperandThresholdExceeded(),
|
||||
threshold.has_value() && i + 1 > threshold.value());
|
||||
ASSERT_EQ(values[i], keys[i].ToString() + std::to_string(i + 1));
|
||||
}
|
||||
}
|
||||
};
|
||||
|
||||
// Test the case when the feature is disabled as well as various thresholds.
|
||||
verify(std::nullopt);
|
||||
for (size_t i = 0; i < 5; ++i) {
|
||||
verify(i);
|
||||
}
|
||||
|
||||
// Flush and try again to test the case when results are served from SSTs.
|
||||
ASSERT_OK(Flush());
|
||||
verify(std::nullopt);
|
||||
for (size_t i = 0; i < 5; ++i) {
|
||||
verify(i);
|
||||
}
|
||||
}
|
||||
|
||||
TEST_F(DBMergeOperatorTest, DataBlockBinaryAndHash) {
|
||||
// Basic test to check that merge operator works with data block index type
|
||||
// DataBlockBinaryAndHash.
|
||||
|
@ -857,6 +951,98 @@ TEST_P(PerConfigMergeOperatorPinningTest, Randomized) {
|
|||
VerifyDBFromMap(true_data);
|
||||
}
|
||||
|
||||
TEST_F(DBMergeOperatorTest, MaxSuccessiveMergesBaseValues) {
|
||||
Options options = CurrentOptions();
|
||||
options.create_if_missing = true;
|
||||
options.merge_operator = MergeOperators::CreatePutOperator();
|
||||
options.max_successive_merges = 1;
|
||||
options.env = env_;
|
||||
Reopen(options);
|
||||
|
||||
constexpr char foo[] = "foo";
|
||||
constexpr char bar[] = "bar";
|
||||
constexpr char baz[] = "baz";
|
||||
constexpr char qux[] = "qux";
|
||||
constexpr char corge[] = "corge";
|
||||
|
||||
// No base value
|
||||
{
|
||||
constexpr char key[] = "key1";
|
||||
|
||||
ASSERT_OK(db_->Merge(WriteOptions(), db_->DefaultColumnFamily(), key, foo));
|
||||
ASSERT_OK(db_->Merge(WriteOptions(), db_->DefaultColumnFamily(), key, bar));
|
||||
|
||||
PinnableSlice result;
|
||||
ASSERT_OK(
|
||||
db_->Get(ReadOptions(), db_->DefaultColumnFamily(), key, &result));
|
||||
ASSERT_EQ(result, bar);
|
||||
|
||||
// We expect the second Merge to be converted to a Put because of
|
||||
// max_successive_merges.
|
||||
constexpr size_t max_key_versions = 8;
|
||||
std::vector<KeyVersion> key_versions;
|
||||
ASSERT_OK(GetAllKeyVersions(db_, db_->DefaultColumnFamily(), key, key,
|
||||
max_key_versions, &key_versions));
|
||||
ASSERT_EQ(key_versions.size(), 2);
|
||||
ASSERT_EQ(key_versions[0].type, kTypeValue);
|
||||
ASSERT_EQ(key_versions[1].type, kTypeMerge);
|
||||
}
|
||||
|
||||
// Plain base value
|
||||
{
|
||||
constexpr char key[] = "key2";
|
||||
|
||||
ASSERT_OK(db_->Put(WriteOptions(), db_->DefaultColumnFamily(), key, foo));
|
||||
ASSERT_OK(db_->Merge(WriteOptions(), db_->DefaultColumnFamily(), key, bar));
|
||||
ASSERT_OK(db_->Merge(WriteOptions(), db_->DefaultColumnFamily(), key, baz));
|
||||
|
||||
PinnableSlice result;
|
||||
ASSERT_OK(
|
||||
db_->Get(ReadOptions(), db_->DefaultColumnFamily(), key, &result));
|
||||
ASSERT_EQ(result, baz);
|
||||
|
||||
// We expect the second Merge to be converted to a Put because of
|
||||
// max_successive_merges.
|
||||
constexpr size_t max_key_versions = 8;
|
||||
std::vector<KeyVersion> key_versions;
|
||||
ASSERT_OK(GetAllKeyVersions(db_, db_->DefaultColumnFamily(), key, key,
|
||||
max_key_versions, &key_versions));
|
||||
ASSERT_EQ(key_versions.size(), 3);
|
||||
ASSERT_EQ(key_versions[0].type, kTypeValue);
|
||||
ASSERT_EQ(key_versions[1].type, kTypeMerge);
|
||||
ASSERT_EQ(key_versions[2].type, kTypeValue);
|
||||
}
|
||||
|
||||
// Wide-column base value
|
||||
{
|
||||
constexpr char key[] = "key3";
|
||||
const WideColumns columns{{kDefaultWideColumnName, foo}, {bar, baz}};
|
||||
|
||||
ASSERT_OK(db_->PutEntity(WriteOptions(), db_->DefaultColumnFamily(), key,
|
||||
columns));
|
||||
ASSERT_OK(db_->Merge(WriteOptions(), db_->DefaultColumnFamily(), key, qux));
|
||||
ASSERT_OK(
|
||||
db_->Merge(WriteOptions(), db_->DefaultColumnFamily(), key, corge));
|
||||
|
||||
PinnableWideColumns result;
|
||||
ASSERT_OK(db_->GetEntity(ReadOptions(), db_->DefaultColumnFamily(), key,
|
||||
&result));
|
||||
const WideColumns expected{{kDefaultWideColumnName, corge}, {bar, baz}};
|
||||
ASSERT_EQ(result.columns(), expected);
|
||||
|
||||
// We expect the second Merge to be converted to a PutEntity because of
|
||||
// max_successive_merges.
|
||||
constexpr size_t max_key_versions = 8;
|
||||
std::vector<KeyVersion> key_versions;
|
||||
ASSERT_OK(GetAllKeyVersions(db_, db_->DefaultColumnFamily(), key, key,
|
||||
max_key_versions, &key_versions));
|
||||
ASSERT_EQ(key_versions.size(), 3);
|
||||
ASSERT_EQ(key_versions[0].type, kTypeWideColumnEntity);
|
||||
ASSERT_EQ(key_versions[1].type, kTypeMerge);
|
||||
ASSERT_EQ(key_versions[2].type, kTypeWideColumnEntity);
|
||||
}
|
||||
}
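The test above relies on max_successive_merges converting an incoming Merge into a full Put (or PutEntity) once the limit of buffered merge operands is reached, now also when the existing value is a wide-column entity. A minimal usage sketch of the option follows; the function name and path handling are illustrative, and the put-style merge operator mirrors the one used in the test:

#include "rocksdb/db.h"
#include "utilities/merge_operators.h"

using namespace ROCKSDB_NAMESPACE;

// Sketch: with max_successive_merges == 1, the second Merge() for a key is
// applied eagerly during the write and stored as a base value instead of
// piling up another merge operand.
Status OpenWithMergeLimit(const std::string& path, DB** db) {
  Options options;
  options.create_if_missing = true;
  options.merge_operator = MergeOperators::CreatePutOperator();
  options.max_successive_merges = 1;
  return DB::Open(options, path, db);
}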
} // namespace ROCKSDB_NAMESPACE
|
||||
|
||||
int main(int argc, char** argv) {
|
||||
|
|
|
@ -19,6 +19,7 @@
|
|||
#include "rocksdb/convenience.h"
|
||||
#include "rocksdb/rate_limiter.h"
|
||||
#include "rocksdb/stats_history.h"
|
||||
#include "test_util/mock_time_env.h"
|
||||
#include "test_util/sync_point.h"
|
||||
#include "test_util/testutil.h"
|
||||
#include "util/random.h"
|
||||
|
@ -1033,6 +1034,198 @@ TEST_F(DBOptionsTest, SetFIFOCompactionOptions) {
|
|||
ASSERT_EQ(fifo_temp_opt[1].age, 30000);
|
||||
}
|
||||
|
||||
TEST_F(DBOptionsTest, OffPeakTimes) {
|
||||
Options options;
|
||||
options.create_if_missing = true;
|
||||
Random rnd(test::RandomSeed());
|
||||
|
||||
auto verify_invalid = [&]() {
|
||||
Status s = DBImpl::TEST_ValidateOptions(options);
|
||||
ASSERT_NOK(s);
|
||||
ASSERT_TRUE(s.IsInvalidArgument());
|
||||
};
|
||||
|
||||
auto verify_valid = [&]() {
|
||||
Status s = DBImpl::TEST_ValidateOptions(options);
|
||||
ASSERT_OK(s);
|
||||
ASSERT_FALSE(s.IsInvalidArgument());
|
||||
};
|
||||
std::vector<std::string> invalid_cases = {
|
||||
"06:30-",
|
||||
"-23:30", // Both need to be set
|
||||
"00:00-00:00",
|
||||
"06:30-06:30" // Start time cannot be the same as end time
|
||||
"12:30 PM-23:30",
|
||||
"12:01AM-11:00PM", // Invalid format
|
||||
"01:99-22:00", // Invalid value for minutes
|
||||
"00:00-24:00", // 24:00 is an invalid value
|
||||
"6-7",
|
||||
"6:-7",
|
||||
"06:31.42-7:00",
|
||||
"6.31:42-7:00",
|
||||
"6:0-7:",
|
||||
"15:0.2-3:.7",
|
||||
":00-00:02",
|
||||
"02:00-:00",
|
||||
"random-value",
|
||||
"No:No-Hi:Hi",
|
||||
};
|
||||
|
||||
std::vector<std::string> valid_cases = {
|
||||
"", // Not enabled. Valid case
|
||||
"06:30-11:30",
|
||||
"06:30-23:30",
|
||||
"13:30-14:30",
|
||||
"00:00-23:59", // Entire Day
|
||||
"23:30-01:15", // From 11:30PM to 1:15AM next day. Valid case.
|
||||
"1:0000000000000-2:000000000042", // Weird, but we can parse the int.
|
||||
};
|
||||
|
||||
for (std::string invalid_case : invalid_cases) {
|
||||
options.daily_offpeak_time_utc = invalid_case;
|
||||
verify_invalid();
|
||||
}
|
||||
for (std::string valid_case : valid_cases) {
|
||||
    options.daily_offpeak_time_utc = valid_case;
    verify_valid();
  }

  auto verify_is_now_offpeak = [&](bool expected, int now_utc_hour,
                                   int now_utc_minute, int now_utc_second = 0) {
    auto mock_clock = std::make_shared<MockSystemClock>(env_->GetSystemClock());
    // Add some extra random days to current time
    int days = rnd.Uniform(100);
    mock_clock->SetCurrentTime(days * 86400 + now_utc_hour * 3600 +
                               now_utc_minute * 60 + now_utc_second);
    Status s = DBImpl::TEST_ValidateOptions(options);
    ASSERT_OK(s);
    auto db_options = MutableDBOptions(options);
    ASSERT_EQ(expected, db_options.IsNowOffPeak(mock_clock.get()));
  };

  options.daily_offpeak_time_utc = "";
  verify_is_now_offpeak(false, 12, 30);

  options.daily_offpeak_time_utc = "06:30-11:30";
  verify_is_now_offpeak(false, 5, 30);
  verify_is_now_offpeak(true, 6, 30);
  verify_is_now_offpeak(true, 10, 30);
  verify_is_now_offpeak(true, 11, 30);
  verify_is_now_offpeak(false, 13, 30);

  options.daily_offpeak_time_utc = "23:30-04:30";
  verify_is_now_offpeak(false, 6, 30);
  verify_is_now_offpeak(true, 23, 30);
  verify_is_now_offpeak(true, 0, 0);
  verify_is_now_offpeak(true, 1, 0);
  verify_is_now_offpeak(true, 4, 30);
  verify_is_now_offpeak(false, 4, 31);

  // Entire day offpeak
  options.daily_offpeak_time_utc = "00:00-23:59";
  verify_is_now_offpeak(true, 0, 0);
  verify_is_now_offpeak(true, 12, 00);
  verify_is_now_offpeak(true, 23, 59);
  verify_is_now_offpeak(true, 23, 59, 1);
  verify_is_now_offpeak(true, 23, 59, 59);

  // Open the db and test by Get/SetDBOptions
  options.daily_offpeak_time_utc = "";
  DestroyAndReopen(options);
  ASSERT_EQ("", dbfull()->GetDBOptions().daily_offpeak_time_utc);
  for (std::string invalid_case : invalid_cases) {
    ASSERT_NOK(
        dbfull()->SetDBOptions({{"daily_offpeak_time_utc", invalid_case}}));
  }
  for (std::string valid_case : valid_cases) {
    ASSERT_OK(dbfull()->SetDBOptions({{"daily_offpeak_time_utc", valid_case}}));
    ASSERT_EQ(valid_case, dbfull()->GetDBOptions().daily_offpeak_time_utc);
  }
  Close();

  // Sets off-peak time from 11:30PM to 4:30AM next day.
  // Starting at 1:30PM, use mock sleep to make time pass
  // and see if IsNowOffPeak() returns correctly per time changes
  int now_hour = 13;
  int now_minute = 30;
  options.daily_offpeak_time_utc = "23:30-04:30";
  auto mock_clock = std::make_shared<MockSystemClock>(env_->GetSystemClock());
  auto mock_env = std::make_unique<CompositeEnvWrapper>(env_, mock_clock);
  // Add some extra random days to current time
  int days = rnd.Uniform(100);
  mock_clock->SetCurrentTime(days * 86400 + now_hour * 3600 + now_minute * 60);
  options.env = mock_env.get();

  // Starting at 1:30PM. It's not off-peak
  DestroyAndReopen(options);
  ASSERT_FALSE(MutableDBOptions(dbfull()->GetDBOptions())
                   .IsNowOffPeak(mock_clock.get()));

  // Now it's at 4:30PM. Still not off-peak
  mock_clock->MockSleepForSeconds(3 * 3600);
  ASSERT_FALSE(MutableDBOptions(dbfull()->GetDBOptions())
                   .IsNowOffPeak(mock_clock.get()));

  // Now it's at 11:30PM. It's off-peak
  mock_clock->MockSleepForSeconds(7 * 3600);
  ASSERT_TRUE(MutableDBOptions(dbfull()->GetDBOptions())
                  .IsNowOffPeak(mock_clock.get()));

  // Now it's at 2:30AM next day. It's still off-peak
  mock_clock->MockSleepForSeconds(3 * 3600);
  ASSERT_TRUE(MutableDBOptions(dbfull()->GetDBOptions())
                  .IsNowOffPeak(mock_clock.get()));

  // Now it's at 4:30AM. It's still off-peak
  mock_clock->MockSleepForSeconds(2 * 3600);
  ASSERT_TRUE(MutableDBOptions(dbfull()->GetDBOptions())
                  .IsNowOffPeak(mock_clock.get()));

  // Sleep for one more minute. It's at 4:31AM. It's no longer off-peak
  mock_clock->MockSleepForSeconds(60);
  ASSERT_FALSE(MutableDBOptions(dbfull()->GetDBOptions())
                   .IsNowOffPeak(mock_clock.get()));
  Close();

  // Entire day offpeak
  options.daily_offpeak_time_utc = "00:00-23:59";
  DestroyAndReopen(options);
  // It doesn't matter what time it is. It should be just offpeak.
  ASSERT_TRUE(MutableDBOptions(dbfull()->GetDBOptions())
                  .IsNowOffPeak(mock_clock.get()));

  // Mock Sleep for 3 hours. It's still off-peak
  mock_clock->MockSleepForSeconds(3 * 3600);
  ASSERT_TRUE(MutableDBOptions(dbfull()->GetDBOptions())
                  .IsNowOffPeak(mock_clock.get()));

  // Mock Sleep for 20 hours. It's still off-peak
  mock_clock->MockSleepForSeconds(20 * 3600);
  ASSERT_TRUE(MutableDBOptions(dbfull()->GetDBOptions())
                  .IsNowOffPeak(mock_clock.get()));

  // Mock Sleep for 59 minutes. It's still off-peak
  mock_clock->MockSleepForSeconds(59 * 60);
  ASSERT_TRUE(MutableDBOptions(dbfull()->GetDBOptions())
                  .IsNowOffPeak(mock_clock.get()));

  // Mock Sleep for 59 seconds. It's still off-peak
  mock_clock->MockSleepForSeconds(59);
  ASSERT_TRUE(MutableDBOptions(dbfull()->GetDBOptions())
                  .IsNowOffPeak(mock_clock.get()));

  // Mock Sleep for 1 second (exactly 24h passed). It's still off-peak
  mock_clock->MockSleepForSeconds(1);
  ASSERT_TRUE(MutableDBOptions(dbfull()->GetDBOptions())
                  .IsNowOffPeak(mock_clock.get()));
  // Another second for sanity check
  mock_clock->MockSleepForSeconds(1);
  ASSERT_TRUE(MutableDBOptions(dbfull()->GetDBOptions())
                  .IsNowOffPeak(mock_clock.get()));

  Close();
}
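The test above pins down the semantics of the "HH:mm-HH:mm" off-peak window checked by MutableDBOptions::IsNowOffPeak: both endpoints are inclusive at minute granularity and the window may wrap past midnight. The helper below is a minimal standalone sketch of those semantics only; it is not the RocksDB implementation, and its parsing is deliberately simplistic.

#include <string>

// Returns true if the UTC wall-clock time (hour, minute) falls inside a
// "HH:mm-HH:mm" window. Endpoints are inclusive; a window whose end is
// earlier than its start wraps around midnight (e.g. "23:30-04:30" covers
// 23:30 through 04:30 of the next day). An empty or malformed string means
// "no off-peak window".
bool InOffpeakWindow(const std::string& window, int hour, int minute) {
  if (window.size() != 11 || window[5] != '-') {
    return false;
  }
  auto to_minutes = [](const std::string& hhmm) {
    return std::stoi(hhmm.substr(0, 2)) * 60 + std::stoi(hhmm.substr(3, 2));
  };
  const int start = to_minutes(window.substr(0, 5));
  const int end = to_minutes(window.substr(6, 5));
  const int now = hour * 60 + minute;
  if (start <= end) {
    return start <= now && now <= end;
  }
  // Window wraps around midnight
  return now >= start || now <= end;
}

With this sketch, InOffpeakWindow("23:30-04:30", 4, 30) is true and InOffpeakWindow("23:30-04:30", 4, 31) is false, matching the assertions above.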
TEST_F(DBOptionsTest, CompactionReadaheadSizeChange) {
  for (bool use_direct_reads : {true, false}) {
    SpecialEnv env(env_);

db/db_test.cc
@@ -644,6 +644,33 @@ TEST_F(DBTest, ReadFromPersistedTier) {
        ASSERT_OK(db_->Get(ropt, handles_[1], "bar", &value));
      }

      const auto check_multiget_func =
          [&](const ReadOptions& read_opts,
              std::vector<ColumnFamilyHandle*> cfhs, std::vector<Slice>& keys,
              std::vector<std::string>& values,
              bool batched) -> std::vector<Status> {
        if (!batched) {
          return db_->MultiGet(read_opts, cfhs, keys, &values);
        } else {
          size_t num_keys = keys.size();
          std::vector<Status> statuses;
          std::vector<PinnableSlice> pinnable_values;
          statuses.resize(num_keys);
          pinnable_values.resize(num_keys);
          values.resize(num_keys);
          db_->MultiGet(read_opts, cfhs[0], num_keys, keys.data(),
                        pinnable_values.data(), statuses.data(), false);
          for (size_t i = 0; i < statuses.size(); ++i) {
            if (statuses[i].ok()) {
              values[i].assign(pinnable_values[i].data(),
                               pinnable_values[i].size());
              pinnable_values[i].Reset();
            }
          }
          return statuses;
        }
      };

      // Multiget
      std::vector<ColumnFamilyHandle*> multiget_cfs;
      multiget_cfs.push_back(handles_[1]);

@@ -652,14 +679,17 @@ TEST_F(DBTest, ReadFromPersistedTier) {
      multiget_keys.push_back("foo");
      multiget_keys.push_back("bar");
      std::vector<std::string> multiget_values;
      auto statuses =
          db_->MultiGet(ropt, multiget_cfs, multiget_keys, &multiget_values);
      if (wopt.disableWAL) {
        ASSERT_TRUE(statuses[0].IsNotFound());
        ASSERT_TRUE(statuses[1].IsNotFound());
      } else {
        ASSERT_OK(statuses[0]);
        ASSERT_OK(statuses[1]);
      for (int i = 0; i < 2; i++) {
        bool batched = i == 0;
        auto statuses = check_multiget_func(ropt, multiget_cfs, multiget_keys,
                                            multiget_values, batched);
        if (wopt.disableWAL) {
          ASSERT_TRUE(statuses[0].IsNotFound());
          ASSERT_TRUE(statuses[1].IsNotFound());
        } else {
          ASSERT_OK(statuses[0]);
          ASSERT_OK(statuses[1]);
        }
      }

      // 2nd round: flush and put a new value in memtable.

@@ -683,16 +713,21 @@ TEST_F(DBTest, ReadFromPersistedTier) {
      // Expect same result in multiget
      multiget_cfs.push_back(handles_[1]);
      multiget_keys.push_back("rocksdb");
      statuses =
          db_->MultiGet(ropt, multiget_cfs, multiget_keys, &multiget_values);
      ASSERT_TRUE(statuses[0].ok());
      ASSERT_EQ("first", multiget_values[0]);
      ASSERT_TRUE(statuses[1].ok());
      ASSERT_EQ("one", multiget_values[1]);
      if (wopt.disableWAL) {
        ASSERT_TRUE(statuses[2].IsNotFound());
      } else {
        ASSERT_OK(statuses[2]);
      multiget_values.clear();

      for (int i = 0; i < 2; i++) {
        bool batched = i == 0;
        auto statuses = check_multiget_func(ropt, multiget_cfs, multiget_keys,
                                            multiget_values, batched);
        ASSERT_TRUE(statuses[0].ok());
        ASSERT_EQ("first", multiget_values[0]);
        ASSERT_TRUE(statuses[1].ok());
        ASSERT_EQ("one", multiget_values[1]);
        if (wopt.disableWAL) {
          ASSERT_TRUE(statuses[2].IsNotFound());
        } else {
          ASSERT_OK(statuses[2]);
        }
      }

      // 3rd round: delete and flush

@@ -712,17 +747,21 @@ TEST_F(DBTest, ReadFromPersistedTier) {
      ASSERT_TRUE(db_->Get(ropt, handles_[1], "rocksdb", &value).ok());
      ASSERT_EQ(value, "hello");

      statuses =
          db_->MultiGet(ropt, multiget_cfs, multiget_keys, &multiget_values);
      ASSERT_TRUE(statuses[0].IsNotFound());
      if (wopt.disableWAL) {
        ASSERT_TRUE(statuses[1].ok());
        ASSERT_EQ("one", multiget_values[1]);
      } else {
        ASSERT_TRUE(statuses[1].IsNotFound());
      multiget_values.clear();
      for (int i = 0; i < 2; i++) {
        bool batched = i == 0;
        auto statuses = check_multiget_func(ropt, multiget_cfs, multiget_keys,
                                            multiget_values, batched);
        ASSERT_TRUE(statuses[0].IsNotFound());
        if (wopt.disableWAL) {
          ASSERT_TRUE(statuses[1].ok());
          ASSERT_EQ("one", multiget_values[1]);
        } else {
          ASSERT_TRUE(statuses[1].IsNotFound());
        }
        ASSERT_TRUE(statuses[2].ok());
        ASSERT_EQ("hello", multiget_values[2]);
      }
      ASSERT_TRUE(statuses[2].ok());
      ASSERT_EQ("hello", multiget_values[2]);
      if (wopt.disableWAL == 0) {
        DestroyAndReopen(options);
      }
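The check_multiget_func lambda above exercises both MultiGet call shapes: the container-based overload that returns a std::vector<Status>, and the batched overload that writes into caller-provided PinnableSlice and Status arrays. The snippet below is a minimal usage sketch of the batched overload from application code; `db` and `cf` are assumed to be an already-open DB* and ColumnFamilyHandle*.

#include <string>
#include <vector>
#include "rocksdb/db.h"

void BatchedMultiGetSketch(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
  std::vector<rocksdb::Slice> keys = {"foo", "bar"};
  std::vector<rocksdb::PinnableSlice> values(keys.size());
  std::vector<rocksdb::Status> statuses(keys.size());
  db->MultiGet(rocksdb::ReadOptions(), cf, keys.size(), keys.data(),
               values.data(), statuses.data(), /*sorted_input=*/false);
  for (size_t i = 0; i < keys.size(); ++i) {
    if (statuses[i].ok()) {
      // values[i] pins the underlying buffer until Reset() or destruction,
      // which avoids one copy compared to the std::string-based overload.
      std::string copy(values[i].data(), values[i].size());
      values[i].Reset();
    }
  }
}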
@@ -6992,8 +7031,9 @@ TEST_F(DBTest, RowCache) {
    using CacheWrapper::CacheWrapper;
    const char* Name() const override { return "FailInsertionCache"; }
    Status Insert(const Slice&, Cache::ObjectPtr, const CacheItemHelper*,
                  size_t, Handle** = nullptr,
                  Priority = Priority::LOW) override {
                  size_t, Handle** = nullptr, Priority = Priority::LOW,
                  const Slice& /*compressed*/ = Slice(),
                  CompressionType /*type*/ = kNoCompression) override {
      return Status::MemoryLimit();
    }
  };
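For context, a wrapper like FailInsertionCache is typically installed over a real cache so that everything except the overridden Insert() is forwarded. The lines below are an assumed wiring sketch only (the capacity and the assumption that CacheWrapper's constructor takes the wrapped cache are illustrative, not taken from the test):

// Assumed setup; the actual test wires its own cache.
std::shared_ptr<Cache> base = NewLRUCache(8 << 20 /* capacity */);
Options options;
options.row_cache = std::make_shared<FailInsertionCache>(base);
// With this row cache installed, every row-cache insertion reports
// Status::MemoryLimit(), which is the failure path the test exercises.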
@@ -699,6 +699,7 @@ void DBTestBase::Destroy(const Options& options, bool delete_cf_paths) {
}

Status DBTestBase::ReadOnlyReopen(const Options& options) {
  Close();
  MaybeInstallTimeElapseOnlySleep(options);
  return DB::OpenForReadOnly(options, dbname_, &db_);
}
@@ -1716,12 +1717,12 @@ TargetCacheChargeTrackingCache<R>::TargetCacheChargeTrackingCache(
      cache_charge_increments_sum_(0) {}

template <CacheEntryRole R>
Status TargetCacheChargeTrackingCache<R>::Insert(const Slice& key,
                                                 ObjectPtr value,
                                                 const CacheItemHelper* helper,
                                                 size_t charge, Handle** handle,
                                                 Priority priority) {
  Status s = target_->Insert(key, value, helper, charge, handle, priority);
Status TargetCacheChargeTrackingCache<R>::Insert(
    const Slice& key, ObjectPtr value, const CacheItemHelper* helper,
    size_t charge, Handle** handle, Priority priority, const Slice& compressed,
    CompressionType type) {
  Status s = target_->Insert(key, value, helper, charge, handle, priority,
                             compressed, type);
  if (helper == kCrmHelper) {
    if (last_peak_tracked_) {
      cache_charge_peak_ = 0;
@@ -233,6 +233,8 @@ class SpecialEnv : public EnvWrapper {
    size_t GetUniqueId(char* id, size_t max_size) const override {
      return base_->GetUniqueId(id, max_size);
    }
    uint64_t GetFileSize() final { return base_->GetFileSize(); }
  };
  class ManifestFile : public WritableFile {
   public:

@@ -345,6 +346,7 @@ class SpecialEnv : public EnvWrapper {
    Status Allocate(uint64_t offset, uint64_t len) override {
      return base_->Allocate(offset, len);
    }
    uint64_t GetFileSize() final { return base_->GetFileSize(); }

   private:
    SpecialEnv* env_;

@@ -936,8 +938,9 @@ class TargetCacheChargeTrackingCache : public CacheWrapper {

  Status Insert(const Slice& key, ObjectPtr value,
                const CacheItemHelper* helper, size_t charge,
                Handle** handle = nullptr,
                Priority priority = Priority::LOW) override;
                Handle** handle = nullptr, Priority priority = Priority::LOW,
                const Slice& compressed = Slice(),
                CompressionType type = kNoCompression) override;

  using Cache::Release;
  bool Release(Handle* handle, bool erase_if_last_ref = false) override;
@@ -1617,6 +1617,105 @@ TEST_F(DBBasicTestWithTimestamp, MultiGetRangeFiltering) {
  Close();
}

TEST_F(DBBasicTestWithTimestamp, GetWithRowCache) {
  Options options = CurrentOptions();
  options.env = env_;
  options.create_if_missing = true;
  options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
  LRUCacheOptions cache_options;
  cache_options.capacity = 8192;
  options.row_cache = cache_options.MakeSharedRowCache();

  const size_t kTimestampSize = Timestamp(0, 0).size();
  TestComparator test_cmp(kTimestampSize);
  options.comparator = &test_cmp;
  DestroyAndReopen(options);

  WriteOptions write_opts;
  std::string ts_early = Timestamp(1, 0);
  std::string ts_later = Timestamp(10, 0);
  Slice ts_later_slice = ts_later;

  const Snapshot* snap_with_nothing = db_->GetSnapshot();
  ASSERT_OK(db_->Put(write_opts, "foo", ts_early, "bar"));
  const Snapshot* snap_with_foo = db_->GetSnapshot();

  // Ensure file has sequence number greater than snapshot_with_foo
  for (int i = 0; i < 10; i++) {
    std::string numStr = std::to_string(i);
    ASSERT_OK(db_->Put(write_opts, numStr, ts_later, numStr));
  }
  ASSERT_OK(Flush());
  ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_HIT), 0);
  ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_MISS), 0);

  ReadOptions read_opts;
  read_opts.timestamp = &ts_later_slice;

  std::string read_value;
  std::string read_ts;
  Status s = db_->Get(read_opts, "foo", &read_value, &read_ts);
  ASSERT_OK(s);
  ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_HIT), 0);
  ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_MISS), 1);
  ASSERT_EQ(read_ts, ts_early);

  s = db_->Get(read_opts, "foo", &read_value, &read_ts);
  ASSERT_OK(s);
  ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_HIT), 1);
  ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_MISS), 1);
  // Row cache is not storing the ts when record is inserted/updated.
  // To be fixed after enabling ROW_CACHE with timestamp.
  // ASSERT_EQ(read_ts, ts_early);

  {
    std::string ts_nothing = Timestamp(0, 0);
    Slice ts_nothing_slice = ts_nothing;
    read_opts.timestamp = &ts_nothing_slice;
    s = db_->Get(read_opts, "foo", &read_value, &read_ts);
    ASSERT_TRUE(s.IsNotFound());
    ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_HIT), 1);
    ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_MISS), 2);

    read_opts.timestamp = &ts_later_slice;
    s = db_->Get(read_opts, "foo", &read_value, &read_ts);
    ASSERT_OK(s);
    ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_HIT), 2);
    ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_MISS), 2);
  }

  {
    read_opts.snapshot = snap_with_foo;

    s = db_->Get(read_opts, "foo", &read_value, &read_ts);
    ASSERT_OK(s);
    ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_HIT), 2);
    ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_MISS), 3);

    s = db_->Get(read_opts, "foo", &read_value, &read_ts);
    ASSERT_OK(s);
    ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_HIT), 3);
    ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_MISS), 3);
  }

  {
    read_opts.snapshot = snap_with_nothing;
    s = db_->Get(read_opts, "foo", &read_value, &read_ts);
    ASSERT_TRUE(s.IsNotFound());
    ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_HIT), 3);
    ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_MISS), 4);

    s = db_->Get(read_opts, "foo", &read_value, &read_ts);
    ASSERT_TRUE(s.IsNotFound());
    ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_HIT), 3);
    ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_MISS), 5);
  }

  db_->ReleaseSnapshot(snap_with_nothing);
  db_->ReleaseSnapshot(snap_with_foo);
  Close();
}

TEST_P(DBBasicTestWithTimestampTableOptions, MultiGetPrefixFilter) {
  Options options = CurrentOptions();
  options.env = env_;

@@ -4375,4 +4474,4 @@ int main(int argc, char** argv) {
  ::testing::InitGoogleTest(&argc, argv);
  RegisterCustomObjects(argc, argv);
  return RUN_ALL_TESTS();
}
}
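The test above configures the row cache the same way an application would. Below is a minimal application-side sketch of that configuration and of reading the ROW_CACHE_HIT / ROW_CACHE_MISS tickers; the capacity, database path, and key are illustrative only, and error handling is reduced to the bare minimum.

#include <cstdint>
#include <string>
#include "rocksdb/cache.h"
#include "rocksdb/db.h"
#include "rocksdb/statistics.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.statistics = rocksdb::CreateDBStatistics();

  rocksdb::LRUCacheOptions row_cache_opts;
  row_cache_opts.capacity = 16 << 20;  // 16 MiB, illustrative
  options.row_cache = row_cache_opts.MakeSharedRowCache();

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rowcache_demo", &db);
  if (!s.ok()) return 1;

  s = db->Put(rocksdb::WriteOptions(), "k", "v");
  s = db->Flush(rocksdb::FlushOptions());
  std::string value;
  s = db->Get(rocksdb::ReadOptions(), "k", &value);  // read from SST, fills row cache
  s = db->Get(rocksdb::ReadOptions(), "k", &value);  // served from row cache
  uint64_t hits = options.statistics->getTickerCount(rocksdb::ROW_CACHE_HIT);
  uint64_t misses = options.statistics->getTickerCount(rocksdb::ROW_CACHE_MISS);
  (void)hits;
  (void)misses;
  delete db;
  return 0;
}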
@@ -655,6 +655,7 @@ const Status& ErrorHandler::StartRecoverFromRetryableBGIOError(
  }

  recovery_in_prog_ = true;
  TEST_SYNC_POINT("StartRecoverFromRetryableBGIOError::in_progress");
  recovery_thread_.reset(
      new port::Thread(&ErrorHandler::RecoverFromRetryableBGIOError, this));

@@ -669,14 +670,18 @@ const Status& ErrorHandler::StartRecoverFromRetryableBGIOError(
// mutex is released.
void ErrorHandler::RecoverFromRetryableBGIOError() {
  TEST_SYNC_POINT("RecoverFromRetryableBGIOError:BeforeStart");
  TEST_SYNC_POINT("RecoverFromRetryableBGIOError:BeforeStart2");
  InstrumentedMutexLock l(db_mutex_);
  if (end_recovery_) {
    EventHelpers::NotifyOnErrorRecoveryEnd(db_options_.listeners, bg_error_,
                                           Status::ShutdownInProgress(),
                                           db_mutex_);

    recovery_in_prog_ = false;
    return;
  }
  DBRecoverContext context = recover_context_;
  context.flush_after_recovery = true;
  int resume_count = db_options_.max_bgerror_resume_count;
  uint64_t wait_interval = db_options_.bgerror_resume_retry_interval;
  uint64_t retry_count = 0;

@@ -686,6 +691,7 @@ void ErrorHandler::RecoverFromRetryableBGIOError() {
    EventHelpers::NotifyOnErrorRecoveryEnd(db_options_.listeners, bg_error_,
                                           Status::ShutdownInProgress(),
                                           db_mutex_);
    recovery_in_prog_ = false;
    return;
  }
  TEST_SYNC_POINT("RecoverFromRetryableBGIOError:BeforeResume0");
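The recovery thread above retries resumption at most max_bgerror_resume_count times, sleeping bgerror_resume_retry_interval between attempts. The sketch below only illustrates that bounded-retry shape in isolation; attempt_recovery() is a placeholder and nothing here is the RocksDB code path itself.

#include <chrono>
#include <cstdint>
#include <thread>

bool attempt_recovery();  // placeholder for the actual resume attempt

// Returns true if recovery succeeded within the retry budget.
bool RecoverWithBoundedRetries(int max_resume_count,
                               uint64_t retry_interval_us) {
  for (int attempt = 0; attempt < max_resume_count; ++attempt) {
    if (attempt_recovery()) {
      return true;  // background error cleared, writes can resume
    }
    std::this_thread::sleep_for(
        std::chrono::microseconds(retry_interval_us));
  }
  return false;  // give up; the DB stays in its error state until a manual resume
}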
@@ -19,10 +19,13 @@ class DBImpl;
// FlushReason, which tells the flush job why this flush is called.
struct DBRecoverContext {
  FlushReason flush_reason;
  bool flush_after_recovery;

  DBRecoverContext() : flush_reason(FlushReason::kErrorRecovery) {}

  DBRecoverContext(FlushReason reason) : flush_reason(reason) {}
  DBRecoverContext()
      : flush_reason(FlushReason::kErrorRecovery),
        flush_after_recovery(false) {}
  DBRecoverContext(FlushReason reason)
      : flush_reason(reason), flush_after_recovery(false) {}
};

class ErrorHandler {
@@ -240,6 +240,8 @@ void EventHelpers::NotifyOnErrorRecoveryEnd(
      info.new_bg_error.PermitUncheckedError();
    }
    db_mutex->Lock();
  } else {
    old_bg_error.PermitUncheckedError();
  }
}
@ -79,6 +79,8 @@ const char* GetFlushReasonString(FlushReason flush_reason) {
|
|||
return "Error Recovery Retry Flush";
|
||||
case FlushReason::kWalFull:
|
||||
return "WAL Full";
|
||||
case FlushReason::kCatchUpAfterErrorRecovery:
|
||||
return "Catch Up After Error Recovery";
|
||||
default:
|
||||
return "Invalid";
|
||||
}
|
||||
|
@ -98,7 +100,7 @@ FlushJob::FlushJob(
|
|||
Statistics* stats, EventLogger* event_logger, bool measure_io_stats,
|
||||
const bool sync_output_directory, const bool write_manifest,
|
||||
Env::Priority thread_pri, const std::shared_ptr<IOTracer>& io_tracer,
|
||||
const SeqnoToTimeMapping& seqno_time_mapping, const std::string& db_id,
|
||||
const SeqnoToTimeMapping& seqno_to_time_mapping, const std::string& db_id,
|
||||
const std::string& db_session_id, std::string full_history_ts_low,
|
||||
BlobFileCompletionCallback* blob_callback)
|
||||
: dbname_(dbname),
|
||||
|
@ -134,7 +136,7 @@ FlushJob::FlushJob(
|
|||
clock_(db_options_.clock),
|
||||
full_history_ts_low_(std::move(full_history_ts_low)),
|
||||
blob_callback_(blob_callback),
|
||||
db_impl_seqno_time_mapping_(seqno_time_mapping) {
|
||||
db_impl_seqno_to_time_mapping_(seqno_to_time_mapping) {
|
||||
// Update the thread status to indicate flush.
|
||||
ReportStartedFlush();
|
||||
TEST_SYNC_POINT("FlushJob::FlushJob()");
|
||||
|
@ -215,7 +217,8 @@ void FlushJob::PickMemTable() {
|
|||
}
|
||||
|
||||
Status FlushJob::Run(LogsWithPrepTracker* prep_tracker, FileMetaData* file_meta,
|
||||
bool* switched_to_mempurge) {
|
||||
bool* switched_to_mempurge, bool* skipped_since_bg_error,
|
||||
ErrorHandler* error_handler) {
|
||||
TEST_SYNC_POINT("FlushJob::Start");
|
||||
db_mutex_->AssertHeld();
|
||||
assert(pick_memtable_called);
|
||||
|
@ -303,17 +306,32 @@ Status FlushJob::Run(LogsWithPrepTracker* prep_tracker, FileMetaData* file_meta,
|
|||
}
|
||||
|
||||
if (!s.ok()) {
|
||||
cfd_->imm()->RollbackMemtableFlush(mems_, meta_.fd.GetNumber());
|
||||
cfd_->imm()->RollbackMemtableFlush(
|
||||
mems_, /*rollback_succeeding_memtables=*/!db_options_.atomic_flush);
|
||||
} else if (write_manifest_) {
|
||||
TEST_SYNC_POINT("FlushJob::InstallResults");
|
||||
// Replace immutable memtable with the generated Table
|
||||
s = cfd_->imm()->TryInstallMemtableFlushResults(
|
||||
cfd_, mutable_cf_options_, mems_, prep_tracker, versions_, db_mutex_,
|
||||
meta_.fd.GetNumber(), &job_context_->memtables_to_free, db_directory_,
|
||||
log_buffer_, &committed_flush_jobs_info_,
|
||||
!(mempurge_s.ok()) /* write_edit : true if no mempurge happened (or if aborted),
|
||||
assert(!db_options_.atomic_flush);
|
||||
if (!db_options_.atomic_flush &&
|
||||
flush_reason_ != FlushReason::kErrorRecovery &&
|
||||
flush_reason_ != FlushReason::kErrorRecoveryRetryFlush &&
|
||||
error_handler && !error_handler->GetBGError().ok() &&
|
||||
error_handler->IsBGWorkStopped()) {
|
||||
cfd_->imm()->RollbackMemtableFlush(
|
||||
mems_, /*rollback_succeeding_memtables=*/!db_options_.atomic_flush);
|
||||
s = error_handler->GetBGError();
|
||||
if (skipped_since_bg_error) {
|
||||
*skipped_since_bg_error = true;
|
||||
}
|
||||
} else {
|
||||
TEST_SYNC_POINT("FlushJob::InstallResults");
|
||||
// Replace immutable memtable with the generated Table
|
||||
s = cfd_->imm()->TryInstallMemtableFlushResults(
|
||||
cfd_, mutable_cf_options_, mems_, prep_tracker, versions_, db_mutex_,
|
||||
meta_.fd.GetNumber(), &job_context_->memtables_to_free, db_directory_,
|
||||
log_buffer_, &committed_flush_jobs_info_,
|
||||
!(mempurge_s.ok()) /* write_edit : true if no mempurge happened (or if aborted),
|
||||
but 'false' if mempurge successful: no new min log number
|
||||
or new level 0 file path to write to manifest. */);
|
||||
}
|
||||
}
|
||||
|
||||
if (s.ok() && file_meta != nullptr) {
|
||||
|
@ -833,10 +851,11 @@ Status FlushJob::WriteLevel0Table() {
|
|||
Status s;
|
||||
|
||||
SequenceNumber smallest_seqno = mems_.front()->GetEarliestSequenceNumber();
|
||||
if (!db_impl_seqno_time_mapping_.Empty()) {
|
||||
// make a local copy, as the seqno_time_mapping from db_impl is not thread
|
||||
// safe, which will be used while not holding the db_mutex.
|
||||
seqno_to_time_mapping_ = db_impl_seqno_time_mapping_.Copy(smallest_seqno);
|
||||
if (!db_impl_seqno_to_time_mapping_.Empty()) {
|
||||
// make a local copy, as the seqno_to_time_mapping from db_impl is not
|
||||
// thread safe, which will be used while not holding the db_mutex.
|
||||
seqno_to_time_mapping_ =
|
||||
db_impl_seqno_to_time_mapping_.Copy(smallest_seqno);
|
||||
}
|
||||
|
||||
std::vector<BlobFileAddition> blob_file_additions;
|
||||
|
@ -965,6 +984,7 @@ Status FlushJob::WriteLevel0Table() {
|
|||
&table_properties_, write_hint, full_history_ts_low,
|
||||
blob_callback_, base_, &num_input_entries,
|
||||
&memtable_payload_bytes, &memtable_garbage_bytes);
|
||||
TEST_SYNC_POINT_CALLBACK("FlushJob::WriteLevel0Table:s", &s);
|
||||
// TODO: Cleanup io_status in BuildTable and table builders
|
||||
assert(!s.ok() || io_s.ok());
|
||||
io_s.PermitUncheckedError();
|
||||
|
|
|
@@ -83,9 +83,14 @@ class FlushJob {
  // Require db_mutex held.
  // Once PickMemTable() is called, either Run() or Cancel() has to be called.
  void PickMemTable();
  // @param skipped_since_bg_error If not nullptr and if atomic_flush=false,
  // then it is set to true if flush installation is skipped and memtable
  // is rolled back due to existing background error.
  Status Run(LogsWithPrepTracker* prep_tracker = nullptr,
             FileMetaData* file_meta = nullptr,
             bool* switched_to_mempurge = nullptr);
             bool* switched_to_mempurge = nullptr,
             bool* skipped_since_bg_error = nullptr,
             ErrorHandler* error_handler = nullptr);
  void Cancel();
  const autovector<MemTable*>& GetMemTables() const { return mems_; }

@@ -205,9 +210,9 @@ class FlushJob {
  const std::string full_history_ts_low_;
  BlobFileCompletionCallback* blob_callback_;

  // reference to the seqno_time_mapping_ in db_impl.h, not safe to read without
  // db mutex
  const SeqnoToTimeMapping& db_impl_seqno_time_mapping_;
  // reference to the seqno_to_time_mapping_ in db_impl.h, not safe to read
  // without db mutex
  const SeqnoToTimeMapping& db_impl_seqno_to_time_mapping_;
  SeqnoToTimeMapping seqno_to_time_mapping_;

  // Keeps track of the newest user-defined timestamp for this flush job if
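For reference, the new FlushJob::Run() out-parameter is consumed roughly as follows. This is a call-shape sketch only, inferred from the header above; flush_job, file_meta, and error_handler are assumed to be set up the way DBImpl does internally.

// Call shape only; not a standalone program.
bool switched_to_mempurge = false;
bool skipped_since_bg_error = false;
Status s = flush_job.Run(/*prep_tracker=*/nullptr, &file_meta,
                         &switched_to_mempurge, &skipped_since_bg_error,
                         &error_handler);
if (skipped_since_bg_error) {
  // Installation was skipped and the memtables were rolled back because a
  // background error was already set; `s` carries that background error
  // rather than a new one produced by this flush.
}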
@@ -457,7 +457,8 @@ TEST_F(FlushJobTest, FlushMemtablesMultipleColumnFamilies) {
    // Verify that imm is empty
    ASSERT_EQ(std::numeric_limits<uint64_t>::max(),
              all_cfds[k]->imm()->GetEarliestMemTableID());
    ASSERT_EQ(0, all_cfds[k]->imm()->GetLatestMemTableID());
    ASSERT_EQ(0, all_cfds[k]->imm()->GetLatestMemTableID(
                     false /* for_atomic_flush */));
    ++k;
  }
@ -1054,25 +1054,15 @@ static bool SaveValue(void* arg, const char* entry) {
|
|||
assert(s->do_merge);
|
||||
|
||||
if (s->value || s->columns) {
|
||||
std::string result;
|
||||
// `op_failure_scope` (an output parameter) is not provided (set to
|
||||
// nullptr) since a failure must be propagated regardless of its
|
||||
// value.
|
||||
*(s->status) = MergeHelper::TimedFullMerge(
|
||||
merge_operator, s->key->user_key(), &v,
|
||||
merge_context->GetOperands(), &result, s->logger, s->statistics,
|
||||
s->clock, /* result_operand */ nullptr,
|
||||
/* update_num_ops_stats */ true,
|
||||
merge_operator, s->key->user_key(),
|
||||
MergeHelper::kPlainBaseValue, v, merge_context->GetOperands(),
|
||||
s->logger, s->statistics, s->clock,
|
||||
/* update_num_ops_stats */ true, s->value, s->columns,
|
||||
/* op_failure_scope */ nullptr);
|
||||
|
||||
if (s->status->ok()) {
|
||||
if (s->value) {
|
||||
*(s->value) = std::move(result);
|
||||
} else {
|
||||
assert(s->columns);
|
||||
s->columns->SetPlainValue(std::move(result));
|
||||
}
|
||||
}
|
||||
}
|
||||
} else if (s->value) {
|
||||
s->value->assign(v.data(), v.size());
|
||||
|
@ -1117,35 +1107,15 @@ static bool SaveValue(void* arg, const char* entry) {
|
|||
} else if (*(s->merge_in_progress)) {
|
||||
assert(s->do_merge);
|
||||
|
||||
if (s->value) {
|
||||
Slice value_of_default;
|
||||
*(s->status) = WideColumnSerialization::GetValueOfDefaultColumn(
|
||||
v, value_of_default);
|
||||
if (s->status->ok()) {
|
||||
// `op_failure_scope` (an output parameter) is not provided (set
|
||||
// to nullptr) since a failure must be propagated regardless of
|
||||
// its value.
|
||||
*(s->status) = MergeHelper::TimedFullMerge(
|
||||
merge_operator, s->key->user_key(), &value_of_default,
|
||||
merge_context->GetOperands(), s->value, s->logger,
|
||||
s->statistics, s->clock, /* result_operand */ nullptr,
|
||||
/* update_num_ops_stats */ true,
|
||||
/* op_failure_scope */ nullptr);
|
||||
}
|
||||
} else if (s->columns) {
|
||||
std::string result;
|
||||
// `op_failure_scope` (an output parameter) is not provided (set to
|
||||
// nullptr) since a failure must be propagated regardless of its
|
||||
// value.
|
||||
*(s->status) = MergeHelper::TimedFullMergeWithEntity(
|
||||
merge_operator, s->key->user_key(), v,
|
||||
merge_context->GetOperands(), &result, s->logger, s->statistics,
|
||||
s->clock, /* update_num_ops_stats */ true,
|
||||
if (s->value || s->columns) {
|
||||
// `op_failure_scope` (an output parameter) is not provided (set
|
||||
// to nullptr) since a failure must be propagated regardless of
|
||||
// its value.
|
||||
*(s->status) = MergeHelper::TimedFullMerge(
|
||||
merge_operator, s->key->user_key(), MergeHelper::kWideBaseValue,
|
||||
v, merge_context->GetOperands(), s->logger, s->statistics,
|
||||
s->clock, /* update_num_ops_stats */ true, s->value, s->columns,
|
||||
/* op_failure_scope */ nullptr);
|
||||
|
||||
if (s->status->ok()) {
|
||||
*(s->status) = s->columns->SetWideColumnValue(std::move(result));
|
||||
}
|
||||
}
|
||||
} else if (s->value) {
|
||||
Slice value_of_default;
|
||||
|
@ -1176,25 +1146,14 @@ static bool SaveValue(void* arg, const char* entry) {
|
|||
case kTypeRangeDeletion: {
|
||||
if (*(s->merge_in_progress)) {
|
||||
if (s->value || s->columns) {
|
||||
std::string result;
|
||||
// `op_failure_scope` (an output parameter) is not provided (set to
|
||||
// nullptr) since a failure must be propagated regardless of its
|
||||
// value.
|
||||
*(s->status) = MergeHelper::TimedFullMerge(
|
||||
merge_operator, s->key->user_key(), nullptr,
|
||||
merge_context->GetOperands(), &result, s->logger, s->statistics,
|
||||
s->clock, /* result_operand */ nullptr,
|
||||
/* update_num_ops_stats */ true,
|
||||
merge_operator, s->key->user_key(), MergeHelper::kNoBaseValue,
|
||||
merge_context->GetOperands(), s->logger, s->statistics,
|
||||
s->clock, /* update_num_ops_stats */ true, s->value, s->columns,
|
||||
/* op_failure_scope */ nullptr);
|
||||
|
||||
if (s->status->ok()) {
|
||||
if (s->value) {
|
||||
*(s->value) = std::move(result);
|
||||
} else {
|
||||
assert(s->columns);
|
||||
s->columns->SetPlainValue(std::move(result));
|
||||
}
|
||||
}
|
||||
} else {
|
||||
// We have found a final value (a base deletion) and have newer
|
||||
// merge operands that we do not intend to merge. Nothing remains
|
||||
|
@ -1227,25 +1186,14 @@ static bool SaveValue(void* arg, const char* entry) {
|
|||
if (s->do_merge && merge_operator->ShouldMerge(
|
||||
merge_context->GetOperandsDirectionBackward())) {
|
||||
if (s->value || s->columns) {
|
||||
std::string result;
|
||||
// `op_failure_scope` (an output parameter) is not provided (set to
|
||||
// nullptr) since a failure must be propagated regardless of its
|
||||
// value.
|
||||
*(s->status) = MergeHelper::TimedFullMerge(
|
||||
merge_operator, s->key->user_key(), nullptr,
|
||||
merge_context->GetOperands(), &result, s->logger, s->statistics,
|
||||
s->clock, /* result_operand */ nullptr,
|
||||
/* update_num_ops_stats */ true,
|
||||
merge_operator, s->key->user_key(), MergeHelper::kNoBaseValue,
|
||||
merge_context->GetOperands(), s->logger, s->statistics,
|
||||
s->clock, /* update_num_ops_stats */ true, s->value, s->columns,
|
||||
/* op_failure_scope */ nullptr);
|
||||
|
||||
if (s->status->ok()) {
|
||||
if (s->value) {
|
||||
*(s->value) = std::move(result);
|
||||
} else {
|
||||
assert(s->columns);
|
||||
s->columns->SetPlainValue(std::move(result));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
*(s->found_final_value) = true;
|
||||
|
|
|
@@ -434,23 +434,57 @@ void MemTableList::PickMemtablesToFlush(uint64_t max_memtable_id,
}

void MemTableList::RollbackMemtableFlush(const autovector<MemTable*>& mems,
                                         uint64_t /*file_number*/) {
                                         bool rollback_succeeding_memtables) {
  TEST_SYNC_POINT("RollbackMemtableFlush");
  AutoThreadOperationStageUpdater stage_updater(
      ThreadStatus::STAGE_MEMTABLE_ROLLBACK);
  assert(!mems.empty());

  // If the flush was not successful, then just reset state.
  // Maybe a succeeding attempt to flush will be successful.
#ifndef NDEBUG
  for (MemTable* m : mems) {
    assert(m->flush_in_progress_);
    assert(m->file_number_ == 0);

    m->flush_in_progress_ = false;
    m->flush_completed_ = false;
    m->edit_.Clear();
    num_flush_not_started_++;
  }
  imm_flush_needed.store(true, std::memory_order_release);
#endif

  if (rollback_succeeding_memtables && !mems.empty()) {
    std::list<MemTable*>& memlist = current_->memlist_;
    auto it = memlist.rbegin();
    for (; *it != mems[0] && it != memlist.rend(); ++it) {
    }
    // mems should be in memlist
    assert(*it == mems[0]);
    if (*it == mems[0]) {
      ++it;
    }
    while (it != memlist.rend()) {
      MemTable* m = *it;
      // Only rollback complete, not in-progress,
      // in_progress can be flushes that are still writing SSTs
      if (m->flush_completed_) {
        m->flush_in_progress_ = false;
        m->flush_completed_ = false;
        m->edit_.Clear();
        m->file_number_ = 0;
        num_flush_not_started_++;
        ++it;
      } else {
        break;
      }
    }
  }

  for (MemTable* m : mems) {
    if (m->flush_in_progress_) {
      assert(m->file_number_ == 0);
      m->file_number_ = 0;
      m->flush_in_progress_ = false;
      m->flush_completed_ = false;
      m->edit_.Clear();
      num_flush_not_started_++;
    }
  }
  if (!mems.empty()) {
    imm_flush_needed.store(true, std::memory_order_release);
  }
}

// Try record a successful flush in the manifest file. It might just return
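The rollback_succeeding_memtables path above scans the immutable list from oldest to youngest, finds the oldest picked memtable, and then resets the younger memtables whose flushes already completed, stopping at the first one still in progress. The following is a simplified standalone sketch of just that scan, using plain structs instead of MemTable; it is illustrative only.

#include <cassert>
#include <list>

struct Mem {
  bool flush_in_progress = false;
  bool flush_completed = false;
};

// memlist is ordered youngest (front) to oldest (back), mirroring
// MemTableList's internal list. Starting from `oldest_picked` (mems[0] in the
// real code), walk toward younger memtables and reset any whose flush has
// already completed; stop at the first one that is still being written out.
void RollbackYoungerCompleted(std::list<Mem*>& memlist, Mem* oldest_picked) {
  auto it = memlist.rbegin();
  while (it != memlist.rend() && *it != oldest_picked) {
    ++it;
  }
  assert(it != memlist.rend());  // the picked memtable must be in the list
  for (++it; it != memlist.rend() && (*it)->flush_completed; ++it) {
    (*it)->flush_completed = false;
    (*it)->flush_in_progress = false;
  }
}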
@@ -271,8 +271,20 @@ class MemTableList {

  // Reset status of the given memtable list back to pending state so that
  // they can get picked up again on the next round of flush.
  //
  // @param rollback_succeeding_memtables If true, will rollback adjacent
  // younger memtables whose flush is completed. Specifically, suppose the
  // current immutable memtables are M_0,M_1...M_N ordered from youngest to
  // oldest. Suppose that the youngest memtable in `mems` is M_K. We will try to
  // rollback M_K-1, M_K-2... until the first memtable whose flush is
  // not completed. These are the memtables that would have been installed
  // by this flush job if it were to succeed. This flag is currently used
  // by non atomic_flush rollback.
  // Note that we also do rollback in `write_manifest_cb` by calling
  // `RemoveMemTablesOrRestoreFlags()`. There we rollback the entire batch so
  // it is similar to what we do here with rollback_succeeding_memtables=true.
  void RollbackMemtableFlush(const autovector<MemTable*>& mems,
                             uint64_t file_number);
                             bool rollback_succeeding_memtables);

  // Try commit a successful flush in the manifest file. It might just return
  // Status::OK letting a concurrent flush do the actual recording.

@@ -374,11 +386,21 @@ class MemTableList {
    return memlist.back()->GetID();
  }

  uint64_t GetLatestMemTableID() const {
  uint64_t GetLatestMemTableID(bool for_atomic_flush) const {
    auto& memlist = current_->memlist_;
    if (memlist.empty()) {
      return 0;
    }
    if (for_atomic_flush) {
      // Scan the memtable list from new to old
      for (auto it = memlist.begin(); it != memlist.end(); ++it) {
        MemTable* m = *it;
        if (m->atomic_flush_seqno_ != kMaxSequenceNumber) {
          return m->GetID();
        }
      }
      return 0;
    }
    return memlist.front()->GetID();
  }
@@ -682,7 +682,7 @@ TEST_F(MemTableListTest, FlushPendingTest) {
  ASSERT_FALSE(list.imm_flush_needed.load(std::memory_order_acquire));

  // Revert flush
  list.RollbackMemtableFlush(to_flush, 0);
  list.RollbackMemtableFlush(to_flush, false);
  ASSERT_FALSE(list.IsFlushPending());
  ASSERT_TRUE(list.imm_flush_needed.load(std::memory_order_acquire));
  to_flush.clear();

@@ -732,7 +732,7 @@ TEST_F(MemTableListTest, FlushPendingTest) {
  ASSERT_FALSE(list.imm_flush_needed.load(std::memory_order_acquire));

  // Rollback first pick of tables
  list.RollbackMemtableFlush(to_flush, 0);
  list.RollbackMemtableFlush(to_flush, false);
  ASSERT_TRUE(list.IsFlushPending());
  ASSERT_TRUE(list.imm_flush_needed.load(std::memory_order_acquire));
  to_flush.clear();

@@ -833,7 +833,7 @@ TEST_F(MemTableListTest, FlushPendingTest) {
  // Add another table
  list.Add(tables[5], &to_delete);
  ASSERT_EQ(1, list.NumNotFlushed());
  ASSERT_EQ(5, list.GetLatestMemTableID());
  ASSERT_EQ(5, list.GetLatestMemTableID(false /* for_atomic_flush */));
  memtable_id = 4;
  // Pick tables to flush. The tables to pick must have ID smaller than or
  // equal to 4. Therefore, no table will be selected in this case.
@@ -24,6 +24,7 @@
#include "rocksdb/system_clock.h"
#include "table/format.h"
#include "table/internal_iterator.h"
#include "util/overload.h"

namespace ROCKSDB_NAMESPACE {
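The merge_helper changes that follow rework MergeHelper::TimedFullMerge around the wide-column-aware FullMergeV3 interface, dispatching on the kind of base value (none, plain, or wide-column) and visiting the merge output. As a point of reference for readers writing their own operators, the class below is a minimal append-style merge operator using the long-standing FullMergeV2 interface; operators that only implement FullMergeV2 continue to work with these read paths. The operator name and the comma separator are illustrative choices, not anything defined by RocksDB.

#include <string>
#include "rocksdb/merge_operator.h"

// Concatenates the base value and all operands with ',' between them.
class StringAppendMerge : public rocksdb::MergeOperator {
 public:
  const char* Name() const override { return "StringAppendMerge"; }

  bool FullMergeV2(const MergeOperationInput& in,
                   MergeOperationOutput* out) const override {
    out->new_value.clear();
    if (in.existing_value != nullptr) {
      // Start from the existing plain value, if any.
      out->new_value.assign(in.existing_value->data(),
                            in.existing_value->size());
    }
    for (const rocksdb::Slice& operand : in.operand_list) {
      if (!out->new_value.empty()) {
        out->new_value.push_back(',');
      }
      out->new_value.append(operand.data(), operand.size());
    }
    return true;  // returning false would surface as a merge failure status
  }
};

Installing it is a one-liner in Options, e.g. options.merge_operator = std::make_shared<StringAppendMerge>(); after which Merge() calls accumulate operands that get combined by this function during reads and compactions.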
@ -57,120 +58,326 @@ MergeHelper::MergeHelper(Env* env, const Comparator* user_comparator,
|
|||
}
|
||||
}
|
||||
|
||||
Status MergeHelper::TimedFullMerge(
|
||||
const MergeOperator* merge_operator, const Slice& key, const Slice* value,
|
||||
const std::vector<Slice>& operands, std::string* result, Logger* logger,
|
||||
Statistics* statistics, SystemClock* clock, Slice* result_operand,
|
||||
bool update_num_ops_stats,
|
||||
MergeOperator::OpFailureScope* op_failure_scope) {
|
||||
assert(merge_operator != nullptr);
|
||||
|
||||
if (operands.empty()) {
|
||||
assert(value != nullptr && result != nullptr);
|
||||
result->assign(value->data(), value->size());
|
||||
return Status::OK();
|
||||
}
|
||||
template <typename Visitor>
|
||||
Status MergeHelper::TimedFullMergeCommonImpl(
|
||||
const MergeOperator* merge_operator, const Slice& key,
|
||||
MergeOperator::MergeOperationInputV3::ExistingValue&& existing_value,
|
||||
const std::vector<Slice>& operands, Logger* logger, Statistics* statistics,
|
||||
SystemClock* clock, bool update_num_ops_stats,
|
||||
MergeOperator::OpFailureScope* op_failure_scope, Visitor&& visitor) {
|
||||
assert(merge_operator);
|
||||
assert(!operands.empty());
|
||||
|
||||
if (update_num_ops_stats) {
|
||||
RecordInHistogram(statistics, READ_NUM_MERGE_OPERANDS,
|
||||
static_cast<uint64_t>(operands.size()));
|
||||
}
|
||||
|
||||
const MergeOperator::MergeOperationInputV3 merge_in(
|
||||
key, std::move(existing_value), operands, logger);
|
||||
MergeOperator::MergeOperationOutputV3 merge_out;
|
||||
|
||||
bool success = false;
|
||||
Slice tmp_result_operand(nullptr, 0);
|
||||
const MergeOperator::MergeOperationInput merge_in(key, value, operands,
|
||||
logger);
|
||||
MergeOperator::MergeOperationOutput merge_out(*result, tmp_result_operand);
|
||||
|
||||
{
|
||||
// Setup to time the merge
|
||||
StopWatchNano timer(clock, statistics != nullptr);
|
||||
PERF_TIMER_GUARD(merge_operator_time_nanos);
|
||||
|
||||
// Do the merge
|
||||
success = merge_operator->FullMergeV2(merge_in, &merge_out);
|
||||
|
||||
if (tmp_result_operand.data()) {
|
||||
// FullMergeV2 result is an existing operand
|
||||
if (result_operand != nullptr) {
|
||||
*result_operand = tmp_result_operand;
|
||||
} else {
|
||||
result->assign(tmp_result_operand.data(), tmp_result_operand.size());
|
||||
}
|
||||
} else if (result_operand) {
|
||||
*result_operand = Slice(nullptr, 0);
|
||||
}
|
||||
success = merge_operator->FullMergeV3(merge_in, &merge_out);
|
||||
|
||||
RecordTick(statistics, MERGE_OPERATION_TOTAL_TIME,
|
||||
statistics ? timer.ElapsedNanos() : 0);
|
||||
}
|
||||
|
||||
if (op_failure_scope != nullptr) {
|
||||
*op_failure_scope = merge_out.op_failure_scope;
|
||||
// Apply default per merge_operator.h
|
||||
if (*op_failure_scope == MergeOperator::OpFailureScope::kDefault) {
|
||||
*op_failure_scope = MergeOperator::OpFailureScope::kTryMerge;
|
||||
}
|
||||
}
|
||||
|
||||
if (!success) {
|
||||
RecordTick(statistics, NUMBER_MERGE_FAILURES);
|
||||
|
||||
if (op_failure_scope) {
|
||||
*op_failure_scope = merge_out.op_failure_scope;
|
||||
// Apply default per merge_operator.h
|
||||
if (*op_failure_scope == MergeOperator::OpFailureScope::kDefault) {
|
||||
*op_failure_scope = MergeOperator::OpFailureScope::kTryMerge;
|
||||
}
|
||||
}
|
||||
|
||||
return Status::Corruption(Status::SubCode::kMergeOperatorFailed);
|
||||
}
|
||||
|
||||
return Status::OK();
|
||||
return std::visit(std::forward<Visitor>(visitor),
|
||||
std::move(merge_out.new_value));
|
||||
}
|
||||
|
||||
Status MergeHelper::TimedFullMergeWithEntity(
|
||||
const MergeOperator* merge_operator, const Slice& key, Slice base_entity,
|
||||
const std::vector<Slice>& operands, std::string* result, Logger* logger,
|
||||
Statistics* statistics, SystemClock* clock, bool update_num_ops_stats,
|
||||
Status MergeHelper::TimedFullMergeImpl(
|
||||
const MergeOperator* merge_operator, const Slice& key,
|
||||
MergeOperator::MergeOperationInputV3::ExistingValue&& existing_value,
|
||||
const std::vector<Slice>& operands, Logger* logger, Statistics* statistics,
|
||||
SystemClock* clock, bool update_num_ops_stats, std::string* result,
|
||||
Slice* result_operand, ValueType* result_type,
|
||||
MergeOperator::OpFailureScope* op_failure_scope) {
|
||||
WideColumns base_columns;
|
||||
assert(result);
|
||||
assert(result_type);
|
||||
|
||||
{
|
||||
const Status s =
|
||||
WideColumnSerialization::Deserialize(base_entity, base_columns);
|
||||
if (!s.ok()) {
|
||||
return s;
|
||||
}
|
||||
auto visitor = overload{
|
||||
[&](std::string&& new_value) -> Status {
|
||||
*result_type = kTypeValue;
|
||||
|
||||
if (result_operand) {
|
||||
*result_operand = Slice(nullptr, 0);
|
||||
}
|
||||
|
||||
*result = std::move(new_value);
|
||||
|
||||
return Status::OK();
|
||||
},
|
||||
[&](MergeOperator::MergeOperationOutputV3::NewColumns&& new_columns)
|
||||
-> Status {
|
||||
*result_type = kTypeWideColumnEntity;
|
||||
|
||||
if (result_operand) {
|
||||
*result_operand = Slice(nullptr, 0);
|
||||
}
|
||||
|
||||
result->clear();
|
||||
|
||||
WideColumns sorted_columns;
|
||||
sorted_columns.reserve(new_columns.size());
|
||||
|
||||
for (const auto& column : new_columns) {
|
||||
sorted_columns.emplace_back(column.first, column.second);
|
||||
}
|
||||
|
||||
WideColumnsHelper::SortColumns(sorted_columns);
|
||||
|
||||
return WideColumnSerialization::Serialize(sorted_columns, *result);
|
||||
},
|
||||
[&](Slice&& operand) -> Status {
|
||||
*result_type = kTypeValue;
|
||||
|
||||
if (result_operand) {
|
||||
*result_operand = operand;
|
||||
result->clear();
|
||||
} else {
|
||||
result->assign(operand.data(), operand.size());
|
||||
}
|
||||
|
||||
return Status::OK();
|
||||
}};
|
||||
|
||||
return TimedFullMergeCommonImpl(merge_operator, key,
|
||||
std::move(existing_value), operands, logger,
|
||||
statistics, clock, update_num_ops_stats,
|
||||
op_failure_scope, std::move(visitor));
|
||||
}
|
||||
|
||||
Status MergeHelper::TimedFullMergeImpl(
|
||||
const MergeOperator* merge_operator, const Slice& key,
|
||||
MergeOperator::MergeOperationInputV3::ExistingValue&& existing_value,
|
||||
const std::vector<Slice>& operands, Logger* logger, Statistics* statistics,
|
||||
SystemClock* clock, bool update_num_ops_stats, std::string* result_value,
|
||||
PinnableWideColumns* result_entity,
|
||||
MergeOperator::OpFailureScope* op_failure_scope) {
|
||||
assert(result_value || result_entity);
|
||||
assert(!result_value || !result_entity);
|
||||
|
||||
auto visitor = overload{
|
||||
[&](std::string&& new_value) -> Status {
|
||||
if (result_value) {
|
||||
*result_value = std::move(new_value);
|
||||
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
assert(result_entity);
|
||||
result_entity->SetPlainValue(std::move(new_value));
|
||||
|
||||
return Status::OK();
|
||||
},
|
||||
[&](MergeOperator::MergeOperationOutputV3::NewColumns&& new_columns)
|
||||
-> Status {
|
||||
if (result_value) {
|
||||
if (!new_columns.empty() &&
|
||||
new_columns.front().first == kDefaultWideColumnName) {
|
||||
*result_value = std::move(new_columns.front().second);
|
||||
} else {
|
||||
result_value->clear();
|
||||
}
|
||||
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
assert(result_entity);
|
||||
|
||||
WideColumns sorted_columns;
|
||||
sorted_columns.reserve(new_columns.size());
|
||||
|
||||
for (const auto& column : new_columns) {
|
||||
sorted_columns.emplace_back(column.first, column.second);
|
||||
}
|
||||
|
||||
WideColumnsHelper::SortColumns(sorted_columns);
|
||||
|
||||
std::string result;
|
||||
const Status s =
|
||||
WideColumnSerialization::Serialize(sorted_columns, result);
|
||||
if (!s.ok()) {
|
||||
result_entity->Reset();
|
||||
return s;
|
||||
}
|
||||
|
||||
return result_entity->SetWideColumnValue(std::move(result));
|
||||
},
|
||||
[&](Slice&& operand) -> Status {
|
||||
if (result_value) {
|
||||
result_value->assign(operand.data(), operand.size());
|
||||
|
||||
return Status::OK();
|
||||
}
|
||||
|
||||
assert(result_entity);
|
||||
result_entity->SetPlainValue(operand);
|
||||
|
||||
return Status::OK();
|
||||
}};
|
||||
|
||||
return TimedFullMergeCommonImpl(merge_operator, key,
|
||||
std::move(existing_value), operands, logger,
|
||||
statistics, clock, update_num_ops_stats,
|
||||
op_failure_scope, std::move(visitor));
|
||||
}
|
||||
|
||||
Status MergeHelper::TimedFullMerge(
|
||||
const MergeOperator* merge_operator, const Slice& key, NoBaseValueTag,
|
||||
const std::vector<Slice>& operands, Logger* logger, Statistics* statistics,
|
||||
SystemClock* clock, bool update_num_ops_stats, std::string* result,
|
||||
Slice* result_operand, ValueType* result_type,
|
||||
MergeOperator::OpFailureScope* op_failure_scope) {
|
||||
MergeOperator::MergeOperationInputV3::ExistingValue existing_value;
|
||||
|
||||
return TimedFullMergeImpl(merge_operator, key, std::move(existing_value),
|
||||
operands, logger, statistics, clock,
|
||||
update_num_ops_stats, result, result_operand,
|
||||
result_type, op_failure_scope);
|
||||
}
|
||||
|
||||
Status MergeHelper::TimedFullMerge(
|
||||
const MergeOperator* merge_operator, const Slice& key, PlainBaseValueTag,
|
||||
const Slice& value, const std::vector<Slice>& operands, Logger* logger,
|
||||
Statistics* statistics, SystemClock* clock, bool update_num_ops_stats,
|
||||
std::string* result, Slice* result_operand, ValueType* result_type,
|
||||
MergeOperator::OpFailureScope* op_failure_scope) {
|
||||
MergeOperator::MergeOperationInputV3::ExistingValue existing_value(value);
|
||||
|
||||
return TimedFullMergeImpl(merge_operator, key, std::move(existing_value),
|
||||
operands, logger, statistics, clock,
|
||||
update_num_ops_stats, result, result_operand,
|
||||
result_type, op_failure_scope);
|
||||
}
|
||||
|
||||
Status MergeHelper::TimedFullMerge(
|
||||
const MergeOperator* merge_operator, const Slice& key, WideBaseValueTag,
|
||||
const Slice& entity, const std::vector<Slice>& operands, Logger* logger,
|
||||
Statistics* statistics, SystemClock* clock, bool update_num_ops_stats,
|
||||
std::string* result, Slice* result_operand, ValueType* result_type,
|
||||
MergeOperator::OpFailureScope* op_failure_scope) {
|
||||
MergeOperator::MergeOperationInputV3::ExistingValue existing_value;
|
||||
|
||||
Slice entity_copy(entity);
|
||||
WideColumns existing_columns;
|
||||
|
||||
const Status s =
|
||||
WideColumnSerialization::Deserialize(entity_copy, existing_columns);
|
||||
if (!s.ok()) {
|
||||
return s;
|
||||
}
|
||||
|
||||
const bool has_default_column =
|
||||
WideColumnsHelper::HasDefaultColumn(base_columns);
|
||||
existing_value = std::move(existing_columns);
|
||||
|
||||
Slice value_of_default;
|
||||
if (has_default_column) {
|
||||
value_of_default = base_columns[0].value();
|
||||
return TimedFullMergeImpl(merge_operator, key, std::move(existing_value),
|
||||
operands, logger, statistics, clock,
|
||||
update_num_ops_stats, result, result_operand,
|
||||
result_type, op_failure_scope);
|
||||
}
|
||||
|
||||
Status MergeHelper::TimedFullMerge(
|
||||
const MergeOperator* merge_operator, const Slice& key, WideBaseValueTag,
|
||||
const WideColumns& columns, const std::vector<Slice>& operands,
|
||||
Logger* logger, Statistics* statistics, SystemClock* clock,
|
||||
bool update_num_ops_stats, std::string* result, Slice* result_operand,
|
||||
ValueType* result_type, MergeOperator::OpFailureScope* op_failure_scope) {
|
||||
MergeOperator::MergeOperationInputV3::ExistingValue existing_value(columns);
|
||||
|
||||
return TimedFullMergeImpl(merge_operator, key, std::move(existing_value),
|
||||
operands, logger, statistics, clock,
|
||||
update_num_ops_stats, result, result_operand,
|
||||
result_type, op_failure_scope);
|
||||
}
|
||||
|
||||
Status MergeHelper::TimedFullMerge(
|
||||
const MergeOperator* merge_operator, const Slice& key, NoBaseValueTag,
|
||||
const std::vector<Slice>& operands, Logger* logger, Statistics* statistics,
|
||||
SystemClock* clock, bool update_num_ops_stats, std::string* result_value,
|
||||
PinnableWideColumns* result_entity,
|
||||
MergeOperator::OpFailureScope* op_failure_scope) {
|
||||
MergeOperator::MergeOperationInputV3::ExistingValue existing_value;
|
||||
|
||||
return TimedFullMergeImpl(merge_operator, key, std::move(existing_value),
|
||||
operands, logger, statistics, clock,
|
||||
update_num_ops_stats, result_value, result_entity,
|
||||
op_failure_scope);
|
||||
}
|
||||
|
||||
Status MergeHelper::TimedFullMerge(
|
||||
const MergeOperator* merge_operator, const Slice& key, PlainBaseValueTag,
|
||||
const Slice& value, const std::vector<Slice>& operands, Logger* logger,
|
||||
Statistics* statistics, SystemClock* clock, bool update_num_ops_stats,
|
||||
std::string* result_value, PinnableWideColumns* result_entity,
|
||||
MergeOperator::OpFailureScope* op_failure_scope) {
|
||||
MergeOperator::MergeOperationInputV3::ExistingValue existing_value(value);
|
||||
|
||||
return TimedFullMergeImpl(merge_operator, key, std::move(existing_value),
|
||||
operands, logger, statistics, clock,
|
||||
update_num_ops_stats, result_value, result_entity,
|
||||
op_failure_scope);
|
||||
}
|
||||
|
||||
Status MergeHelper::TimedFullMerge(
|
||||
const MergeOperator* merge_operator, const Slice& key, WideBaseValueTag,
|
||||
const Slice& entity, const std::vector<Slice>& operands, Logger* logger,
|
||||
Statistics* statistics, SystemClock* clock, bool update_num_ops_stats,
|
||||
std::string* result_value, PinnableWideColumns* result_entity,
|
||||
MergeOperator::OpFailureScope* op_failure_scope) {
|
||||
MergeOperator::MergeOperationInputV3::ExistingValue existing_value;
|
||||
|
||||
Slice entity_copy(entity);
|
||||
WideColumns existing_columns;
|
||||
|
||||
const Status s =
|
||||
WideColumnSerialization::Deserialize(entity_copy, existing_columns);
|
||||
if (!s.ok()) {
|
||||
return s;
|
||||
}
|
||||
|
||||
std::string merge_result;
|
||||
existing_value = std::move(existing_columns);
|
||||
|
||||
{
|
||||
const Status s = TimedFullMerge(merge_operator, key, &value_of_default,
|
||||
operands, &merge_result, logger, statistics,
|
||||
clock, nullptr /* result_operand */,
|
||||
update_num_ops_stats, op_failure_scope);
|
||||
if (!s.ok()) {
|
||||
return s;
|
||||
}
|
||||
}
|
||||
return TimedFullMergeImpl(merge_operator, key, std::move(existing_value),
|
||||
operands, logger, statistics, clock,
|
||||
update_num_ops_stats, result_value, result_entity,
|
||||
op_failure_scope);
|
||||
}
|
||||
|
||||
if (has_default_column) {
|
||||
base_columns[0].value() = merge_result;
|
||||
Status MergeHelper::TimedFullMerge(
|
||||
const MergeOperator* merge_operator, const Slice& key, WideBaseValueTag,
|
||||
const WideColumns& columns, const std::vector<Slice>& operands,
|
||||
Logger* logger, Statistics* statistics, SystemClock* clock,
|
||||
bool update_num_ops_stats, std::string* result_value,
|
||||
PinnableWideColumns* result_entity,
|
||||
MergeOperator::OpFailureScope* op_failure_scope) {
|
||||
MergeOperator::MergeOperationInputV3::ExistingValue existing_value(columns);
|
||||
|
||||
const Status s = WideColumnSerialization::Serialize(base_columns, *result);
|
||||
if (!s.ok()) {
|
||||
return s;
|
||||
}
|
||||
} else {
|
||||
const Status s =
|
||||
WideColumnSerialization::Serialize(merge_result, base_columns, *result);
|
||||
if (!s.ok()) {
|
||||
return s;
|
||||
}
|
||||
}
|
||||
|
||||
return Status::OK();
|
||||
return TimedFullMergeImpl(merge_operator, key, std::move(existing_value),
|
||||
operands, logger, statistics, clock,
|
||||
update_num_ops_stats, result_value, result_entity,
|
||||
op_failure_scope);
|
||||
}
|
||||
|
||||
// PRE: iter points to the first merge type entry
|
||||
|
@ -288,7 +495,7 @@ Status MergeHelper::MergeUntil(InternalIterator* iter,
|
|||
// hit a put/delete/single delete
|
||||
// => merge the put value or a nullptr with operands_
|
||||
// => store result in operands_.back() (and update keys_.back())
|
||||
// => change the entry type to kTypeValue for keys_.back()
|
||||
// => change the entry type for keys_.back()
|
||||
// We are done! Success!
|
||||
|
||||
// If there are no operands, just return the Status::OK(). That will cause
|
||||
|
@ -301,24 +508,23 @@ Status MergeHelper::MergeUntil(InternalIterator* iter,
|
|||
// TODO: if we're in compaction and it's a put, it would be nice to run
|
||||
// compaction filter on it.
|
||||
std::string merge_result;
|
||||
ValueType merge_result_type;
|
||||
MergeOperator::OpFailureScope op_failure_scope;
|
||||
|
||||
if (range_del_agg &&
|
||||
range_del_agg->ShouldDelete(
|
||||
ikey, RangeDelPositioningMode::kForwardTraversal)) {
|
||||
s = TimedFullMerge(user_merge_operator_, ikey.user_key, nullptr,
|
||||
merge_context_.GetOperands(), &merge_result, logger_,
|
||||
stats_, clock_,
|
||||
/* result_operand */ nullptr,
|
||||
/* update_num_ops_stats */ false, &op_failure_scope);
|
||||
s = TimedFullMerge(user_merge_operator_, ikey.user_key, kNoBaseValue,
|
||||
merge_context_.GetOperands(), logger_, stats_,
|
||||
clock_, /* update_num_ops_stats */ false,
|
||||
&merge_result, /* result_operand */ nullptr,
|
||||
&merge_result_type, &op_failure_scope);
|
||||
} else if (ikey.type == kTypeValue) {
|
||||
const Slice val = iter->value();
|
||||
|
||||
s = TimedFullMerge(user_merge_operator_, ikey.user_key, &val,
|
||||
merge_context_.GetOperands(), &merge_result, logger_,
|
||||
stats_, clock_,
|
||||
/* result_operand */ nullptr,
|
||||
/* update_num_ops_stats */ false, &op_failure_scope);
|
||||
s = TimedFullMerge(user_merge_operator_, ikey.user_key, kPlainBaseValue,
|
||||
iter->value(), merge_context_.GetOperands(), logger_,
|
||||
stats_, clock_, /* update_num_ops_stats */ false,
|
||||
&merge_result, /* result_operand */ nullptr,
|
||||
&merge_result_type, &op_failure_scope);
|
||||
} else if (ikey.type == kTypeBlobIndex) {
BlobIndex blob_index;

@ -348,22 +554,23 @@ Status MergeHelper::MergeUntil(InternalIterator* iter,
c_iter_stats->total_blob_bytes_read += bytes_read;
}
s = TimedFullMerge(user_merge_operator_, ikey.user_key, &blob_value,
merge_context_.GetOperands(), &merge_result, logger_,
stats_, clock_,
/* result_operand */ nullptr,
/* update_num_ops_stats */ false, &op_failure_scope);
s = TimedFullMerge(user_merge_operator_, ikey.user_key, kPlainBaseValue,
blob_value, merge_context_.GetOperands(), logger_,
stats_, clock_, /* update_num_ops_stats */ false,
&merge_result, /* result_operand */ nullptr,
&merge_result_type, &op_failure_scope);
} else if (ikey.type == kTypeWideColumnEntity) {
s = TimedFullMergeWithEntity(
user_merge_operator_, ikey.user_key, iter->value(),
merge_context_.GetOperands(), &merge_result, logger_, stats_,
clock_, /* update_num_ops_stats */ false, &op_failure_scope);
s = TimedFullMerge(user_merge_operator_, ikey.user_key, kWideBaseValue,
iter->value(), merge_context_.GetOperands(), logger_,
stats_, clock_, /* update_num_ops_stats */ false,
&merge_result, /* result_operand */ nullptr,
&merge_result_type, &op_failure_scope);
} else {
s = TimedFullMerge(user_merge_operator_, ikey.user_key, nullptr,
merge_context_.GetOperands(), &merge_result, logger_,
stats_, clock_,
/* result_operand */ nullptr,
/* update_num_ops_stats */ false, &op_failure_scope);
s = TimedFullMerge(user_merge_operator_, ikey.user_key, kNoBaseValue,
merge_context_.GetOperands(), logger_, stats_,
clock_, /* update_num_ops_stats */ false,
&merge_result, /* result_operand */ nullptr,
&merge_result_type, &op_failure_scope);
}

// We store the result in keys_.back() and operands_.back()

@ -371,10 +578,12 @@ Status MergeHelper::MergeUntil(InternalIterator* iter,
if (s.ok()) {
// The original key encountered
original_key = std::move(keys_.back());
orig_ikey.type = ikey.type == kTypeWideColumnEntity
? kTypeWideColumnEntity
: kTypeValue;
assert(merge_result_type == kTypeValue ||
merge_result_type == kTypeWideColumnEntity);
orig_ikey.type = merge_result_type;
UpdateInternalKey(&original_key, orig_ikey.sequence, orig_ikey.type);
keys_.clear();
merge_context_.Clear();
keys_.emplace_front(std::move(original_key));

@ -499,19 +708,24 @@ Status MergeHelper::MergeUntil(InternalIterator* iter,
assert(merge_context_.GetNumOperands() >= 1);
assert(merge_context_.GetNumOperands() == keys_.size());
std::string merge_result;
ValueType merge_result_type;
MergeOperator::OpFailureScope op_failure_scope;
s = TimedFullMerge(user_merge_operator_, orig_ikey.user_key, nullptr,
merge_context_.GetOperands(), &merge_result, logger_,
stats_, clock_,
/* result_operand */ nullptr,
/* update_num_ops_stats */ false, &op_failure_scope);
s = TimedFullMerge(user_merge_operator_, orig_ikey.user_key, kNoBaseValue,
merge_context_.GetOperands(), logger_, stats_, clock_,
/* update_num_ops_stats */ false, &merge_result,
/* result_operand */ nullptr, &merge_result_type,
&op_failure_scope);
if (s.ok()) {
// The original key encountered
// We are certain that keys_ is not empty here (see assertions couple of
// lines before).
original_key = std::move(keys_.back());
orig_ikey.type = kTypeValue;
assert(merge_result_type == kTypeValue ||
merge_result_type == kTypeWideColumnEntity);
orig_ikey.type = merge_result_type;
UpdateInternalKey(&original_key, orig_ikey.sequence, orig_ikey.type);
keys_.clear();
merge_context_.Clear();
keys_.emplace_front(std::move(original_key));

@ -41,30 +41,94 @@ class MergeHelper {
Statistics* stats = nullptr,
const std::atomic<bool>* shutting_down = nullptr);

// Wrapper around MergeOperator::FullMergeV2() that records perf statistics.
// Result of merge will be written to result if status returned is OK.
// If operands is empty, the value will simply be copied to result.
// Set `update_num_ops_stats` to true if it is from a user read, so that
// the latency is sensitive.
// Wrappers around MergeOperator::FullMergeV3() that record perf statistics.
// Set `update_num_ops_stats` to true if it is from a user read so that
// the corresponding statistics are updated.
// Returns one of the following statuses:
// - OK: Entries were successfully merged.
// - Corruption: Merge operator reported unsuccessful merge. The scope of the
// damage will be stored in `*op_failure_scope` when `op_failure_scope` is
// not nullptr

// Empty tag types to disambiguate overloads
struct NoBaseValueTag {};
static constexpr NoBaseValueTag kNoBaseValue{};

struct PlainBaseValueTag {};
static constexpr PlainBaseValueTag kPlainBaseValue{};

struct WideBaseValueTag {};
static constexpr WideBaseValueTag kWideBaseValue{};

// Variants that expose the merge result directly (in serialized form for wide
// columns) as well as its value type. Used by iterator and compaction.
static Status TimedFullMerge(const MergeOperator* merge_operator,
const Slice& key, const Slice* value,
const Slice& key, NoBaseValueTag,
const std::vector<Slice>& operands,
std::string* result, Logger* logger,
Statistics* statistics, SystemClock* clock,
Slice* result_operand, bool update_num_ops_stats,
Logger* logger, Statistics* statistics,
SystemClock* clock, bool update_num_ops_stats,
std::string* result, Slice* result_operand,
ValueType* result_type,
MergeOperator::OpFailureScope* op_failure_scope);

static Status TimedFullMergeWithEntity(
const MergeOperator* merge_operator, const Slice& key, Slice base_entity,
const std::vector<Slice>& operands, std::string* result, Logger* logger,
static Status TimedFullMerge(
const MergeOperator* merge_operator, const Slice& key, PlainBaseValueTag,
const Slice& value, const std::vector<Slice>& operands, Logger* logger,
Statistics* statistics, SystemClock* clock, bool update_num_ops_stats,
std::string* result, Slice* result_operand, ValueType* result_type,
MergeOperator::OpFailureScope* op_failure_scope);

static Status TimedFullMerge(
const MergeOperator* merge_operator, const Slice& key, WideBaseValueTag,
const Slice& entity, const std::vector<Slice>& operands, Logger* logger,
Statistics* statistics, SystemClock* clock, bool update_num_ops_stats,
std::string* result, Slice* result_operand, ValueType* result_type,
MergeOperator::OpFailureScope* op_failure_scope);

static Status TimedFullMerge(
const MergeOperator* merge_operator, const Slice& key, WideBaseValueTag,
const WideColumns& columns, const std::vector<Slice>& operands,
Logger* logger, Statistics* statistics, SystemClock* clock,
bool update_num_ops_stats, std::string* result, Slice* result_operand,
ValueType* result_type, MergeOperator::OpFailureScope* op_failure_scope);

// Variants that expose the merge result translated to the form requested by
// the client. (For example, if the result is a wide-column structure but the
// client requested the results in plain-value form, the value of the default
// column is returned.) Used by point lookups.
static Status TimedFullMerge(const MergeOperator* merge_operator,
const Slice& key, NoBaseValueTag,
const std::vector<Slice>& operands,
Logger* logger, Statistics* statistics,
SystemClock* clock, bool update_num_ops_stats,
std::string* result_value,
PinnableWideColumns* result_entity,
MergeOperator::OpFailureScope* op_failure_scope);

static Status TimedFullMerge(
const MergeOperator* merge_operator, const Slice& key, PlainBaseValueTag,
const Slice& value, const std::vector<Slice>& operands, Logger* logger,
Statistics* statistics, SystemClock* clock, bool update_num_ops_stats,
std::string* result_value, PinnableWideColumns* result_entity,
MergeOperator::OpFailureScope* op_failure_scope);

static Status TimedFullMerge(
const MergeOperator* merge_operator, const Slice& key, WideBaseValueTag,
const Slice& entity, const std::vector<Slice>& operands, Logger* logger,
Statistics* statistics, SystemClock* clock, bool update_num_ops_stats,
std::string* result_value, PinnableWideColumns* result_entity,
MergeOperator::OpFailureScope* op_failure_scope);

static Status TimedFullMerge(const MergeOperator* merge_operator,
const Slice& key, WideBaseValueTag,
const WideColumns& columns,
const std::vector<Slice>& operands,
Logger* logger, Statistics* statistics,
SystemClock* clock, bool update_num_ops_stats,
std::string* result_value,
PinnableWideColumns* result_entity,
MergeOperator::OpFailureScope* op_failure_scope);

// During compaction, merge entries until we hit
// - a corrupted key
// - a Put/Delete,

@ -198,6 +262,30 @@ class MergeHelper {
// This is a best-effort facility, so memory_order_relaxed is sufficient.
return shutting_down_ && shutting_down_->load(std::memory_order_relaxed);
}

template <typename Visitor>
static Status TimedFullMergeCommonImpl(
const MergeOperator* merge_operator, const Slice& key,
MergeOperator::MergeOperationInputV3::ExistingValue&& existing_value,
const std::vector<Slice>& operands, Logger* logger,
Statistics* statistics, SystemClock* clock, bool update_num_ops_stats,
MergeOperator::OpFailureScope* op_failure_scope, Visitor&& visitor);

static Status TimedFullMergeImpl(
const MergeOperator* merge_operator, const Slice& key,
MergeOperator::MergeOperationInputV3::ExistingValue&& existing_value,
const std::vector<Slice>& operands, Logger* logger,
Statistics* statistics, SystemClock* clock, bool update_num_ops_stats,
std::string* result, Slice* result_operand, ValueType* result_type,
MergeOperator::OpFailureScope* op_failure_scope);

static Status TimedFullMergeImpl(
const MergeOperator* merge_operator, const Slice& key,
MergeOperator::MergeOperationInputV3::ExistingValue&& existing_value,
const std::vector<Slice>& operands, Logger* logger,
Statistics* statistics, SystemClock* clock, bool update_num_ops_stats,
std::string* result_value, PinnableWideColumns* result_entity,
MergeOperator::OpFailureScope* op_failure_scope);
};

// MergeOutputIterator can be used to iterate over the result of a merge.

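For illustration, the empty tag arguments above let the compiler pick the right overload at the call site instead of overloading on a nullable base-value pointer. A minimal standalone sketch of the same tag-dispatch pattern, using simplified stand-in types rather than the actual RocksDB signatures:

#include <iostream>
#include <string>
#include <vector>

// Empty tag types, mirroring the kNoBaseValue / kPlainBaseValue pattern above
// (names and types here are illustrative only).
struct NoBaseTag {};
inline constexpr NoBaseTag kNoBase{};
struct PlainBaseTag {};
inline constexpr PlainBaseTag kPlainBase{};

// Overload chosen when there is no existing base value.
std::string Merge(NoBaseTag, const std::vector<std::string>& operands) {
  std::string result;
  for (const auto& op : operands) result += op;
  return result;
}

// Overload chosen when the base value is a plain value.
std::string Merge(PlainBaseTag, const std::string& base,
                  const std::vector<std::string>& operands) {
  std::string result = base;
  for (const auto& op : operands) result += op;
  return result;
}

int main() {
  std::vector<std::string> ops{"+1", "+2"};
  // The tag makes the caller's intent explicit at the call site.
  std::cout << Merge(kNoBase, ops) << "\n";             // "+1+2"
  std::cout << Merge(kPlainBase, "base", ops) << "\n";  // "base+1+2"
}
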
@ -42,15 +42,16 @@ class PeriodicTaskScheduler {
PeriodicTaskScheduler& operator=(const PeriodicTaskScheduler&) = delete;
PeriodicTaskScheduler& operator=(PeriodicTaskScheduler&&) = delete;

// Register a task with its default repeat period
// Register a task with its default repeat period. Thread safe call.
Status Register(PeriodicTaskType task_type, const PeriodicTaskFunc& fn);

// Register a task with specified repeat period. 0 is an invalid argument
// (kInvalidPeriodSec). To stop the task, please use Unregister() specifically
// (kInvalidPeriodSec). To stop the task, please use Unregister().
// Thread safe call.
Status Register(PeriodicTaskType task_type, const PeriodicTaskFunc& fn,
uint64_t repeat_period_seconds);

// Unregister the task
// Unregister the task. Thread safe call.
Status Unregister(PeriodicTaskType task_type);

#ifndef NDEBUG

@ -105,4 +106,3 @@ class PeriodicTaskScheduler {
};

} // namespace ROCKSDB_NAMESPACE

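For illustration only: a toy stand-in for the contract documented above (an explicit period of 0 is rejected, Unregister stops the task, and calls are guarded by a mutex for thread safety). The names and types below are hypothetical and are not the RocksDB internals.

#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <mutex>
#include <string>

class ToyScheduler {
 public:
  // Register a named callback with an explicit period; 0 is invalid.
  bool Register(const std::string& name, std::function<void()> fn,
                uint64_t period_secs) {
    if (period_secs == 0) return false;  // 0 is an invalid period
    std::lock_guard<std::mutex> lock(mu_);
    tasks_[name] = Task{std::move(fn), period_secs};
    return true;
  }
  // Stop a previously registered task.
  bool Unregister(const std::string& name) {
    std::lock_guard<std::mutex> lock(mu_);
    return tasks_.erase(name) > 0;
  }

 private:
  struct Task {
    std::function<void()> fn;
    uint64_t period_secs;
  };
  std::mutex mu_;
  std::map<std::string, Task> tasks_;
};

int main() {
  ToyScheduler sched;
  std::cout << sched.Register("dump_stats", [] {}, 0) << "\n";    // 0: rejected
  std::cout << sched.Register("dump_stats", [] {}, 600) << "\n";  // 1: accepted
  std::cout << sched.Unregister("dump_stats") << "\n";            // 1: stopped
}
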
@ -471,7 +471,7 @@ class Repairer {
0 /* file_creation_time */, "DB Repairer" /* db_id */, db_session_id_,
0 /*target_file_size*/, meta.fd.GetNumber());

SeqnoToTimeMapping empty_seqno_time_mapping;
SeqnoToTimeMapping empty_seqno_to_time_mapping;
status = BuildTable(
dbname_, /* versions */ nullptr, immutable_db_options_, tboptions,
file_options_, read_options, table_cache_.get(), iter.get(),

@ -479,8 +479,9 @@ class Repairer {
{}, kMaxSequenceNumber, kMaxSequenceNumber, snapshot_checker,
false /* paranoid_file_checks*/, nullptr /* internal_stats */, &io_s,
nullptr /*IOTracer*/, BlobFileCreationReason::kRecovery,
empty_seqno_time_mapping, nullptr /* event_logger */, 0 /* job_id */,
Env::IO_HIGH, nullptr /* table_properties */, write_hint);
empty_seqno_to_time_mapping, nullptr /* event_logger */,
0 /* job_id */, Env::IO_HIGH, nullptr /* table_properties */,
write_hint);
ROCKS_LOG_INFO(db_options_.info_log,
"Log #%" PRIu64 ": %d ops saved to Table #%" PRIu64 " %s",
log, counter, meta.fd.GetNumber(),

@ -12,13 +12,13 @@
#include "rocksdb/utilities/debug.h"
#include "test_util/mock_time_env.h"

namespace ROCKSDB_NAMESPACE {

class SeqnoTimeTest : public DBTestBase {
public:
SeqnoTimeTest() : DBTestBase("seqno_time_test", /*env_do_fsync=*/false) {
mock_clock_ = std::make_shared<MockSystemClock>(env_->GetSystemClock());
mock_clock_->SetCurrentTime(kMockStartTime);
mock_env_ = std::make_unique<CompositeEnvWrapper>(env_, mock_clock_);
}

@ -26,6 +26,10 @@ class SeqnoTimeTest : public DBTestBase {
std::unique_ptr<Env> mock_env_;
std::shared_ptr<MockSystemClock> mock_clock_;

// Sufficient starting time that preserve time doesn't under-flow into
// pre-history
static constexpr uint32_t kMockStartTime = 10000000;

void SetUp() override {
mock_clock_->InstallTimedWaitFixCallback();
SyncPoint::GetInstance()->SetCallBack(

@ -34,6 +38,7 @@ class SeqnoTimeTest : public DBTestBase {
reinterpret_cast<PeriodicTaskScheduler*>(arg);
periodic_task_scheduler_ptr->TEST_OverrideTimer(mock_clock_.get());
});
mock_clock_->SetCurrentTime(kMockStartTime);
}

// make sure the file is not in cache, otherwise it won't have IO info

@ -77,11 +82,6 @@ TEST_F(SeqnoTimeTest, TemperatureBasicUniversal) {
options.num_levels = kNumLevels;
DestroyAndReopen(options);

// pass some time first, otherwise the first a few keys write time are going
// to be zero, and internally zero has special meaning: kUnknownSeqnoTime
dbfull()->TEST_WaitForPeriodicTaskRun(
[&] { mock_clock_->MockSleepForSeconds(static_cast<int>(kKeyPerSec)); });

int sst_num = 0;
// Write files that are overlap and enough to trigger compaction
for (; sst_num < kNumTrigger; sst_num++) {

@ -189,11 +189,6 @@ TEST_F(SeqnoTimeTest, TemperatureBasicLevel) {
options.disable_auto_compactions = true;
DestroyAndReopen(options);

// pass some time first, otherwise the first a few keys write time are going
// to be zero, and internally zero has special meaning: kUnknownSeqnoTime
dbfull()->TEST_WaitForPeriodicTaskRun(
[&] { mock_clock_->MockSleepForSeconds(static_cast<int>(10)); });

int sst_num = 0;
// Write files that are overlap
for (; sst_num < 4; sst_num++) {

@ -320,7 +315,9 @@ TEST_P(SeqnoTimeTablePropTest, BasicSeqnoToTimeMapping) {
DestroyAndReopen(options);

std::set<uint64_t> checked_file_nums;
SequenceNumber start_seq = dbfull()->GetLatestSequenceNumber();
SequenceNumber start_seq = dbfull()->GetLatestSequenceNumber() + 1;
uint64_t start_time = mock_clock_->NowSeconds();

// Write a key every 10 seconds
for (int i = 0; i < 200; i++) {
ASSERT_OK(Put(Key(i), "value"));

@ -338,21 +335,20 @@ TEST_P(SeqnoTimeTablePropTest, BasicSeqnoToTimeMapping) {
ASSERT_FALSE(tp_mapping.Empty());
auto seqs = tp_mapping.TEST_GetInternalMapping();
// about ~20 seqs->time entries, because the sample rate is 10000/100, and it
// passes 2k time.
ASSERT_GE(seqs.size(), 19);
ASSERT_LE(seqs.size(), 21);
SequenceNumber seq_end = dbfull()->GetLatestSequenceNumber();
for (auto i = start_seq; i < start_seq + 10; i++) {
ASSERT_LE(tp_mapping.GetOldestApproximateTime(i), (i + 1) * 10);
}
start_seq += 10;
// passes 2k time. Add (roughly) one for starting entry.
ASSERT_GE(seqs.size(), 20);
ASSERT_LE(seqs.size(), 22);
SequenceNumber seq_end = dbfull()->GetLatestSequenceNumber() + 1;
for (auto i = start_seq; i < seq_end; i++) {
// The result is within the range
ASSERT_GE(tp_mapping.GetOldestApproximateTime(i), (i - 10) * 10);
ASSERT_LE(tp_mapping.GetOldestApproximateTime(i), (i + 10) * 10);
ASSERT_GE(tp_mapping.GetProximalTimeBeforeSeqno(i),
start_time + (i - start_seq) * 10 - 100);
ASSERT_LE(tp_mapping.GetProximalTimeBeforeSeqno(i),
start_time + (i - start_seq) * 10);
}
checked_file_nums.insert(it->second->orig_file_number);
start_seq = seq_end;
start_time = mock_clock_->NowSeconds();

// Write a key every 1 seconds
for (int i = 0; i < 200; i++) {

@ -360,7 +356,7 @@ TEST_P(SeqnoTimeTablePropTest, BasicSeqnoToTimeMapping) {
dbfull()->TEST_WaitForPeriodicTaskRun(
[&] { mock_clock_->MockSleepForSeconds(static_cast<int>(1)); });
}
seq_end = dbfull()->GetLatestSequenceNumber();
seq_end = dbfull()->GetLatestSequenceNumber() + 1;
ASSERT_OK(Flush());
tables_props.clear();
ASSERT_OK(dbfull()->GetPropertiesOfAllTables(&tables_props));

@ -382,13 +378,14 @@ TEST_P(SeqnoTimeTablePropTest, BasicSeqnoToTimeMapping) {
ASSERT_GE(seqs.size(), 1);
ASSERT_LE(seqs.size(), 3);
for (auto i = start_seq; i < seq_end; i++) {
// The result is not very accurate, as there is more data write within small
// range of time
ASSERT_GE(tp_mapping.GetOldestApproximateTime(i), (i - start_seq) + 1000);
ASSERT_LE(tp_mapping.GetOldestApproximateTime(i), (i - start_seq) + 3000);
ASSERT_GE(tp_mapping.GetProximalTimeBeforeSeqno(i),
start_time + (i - start_seq) - 100);
ASSERT_LE(tp_mapping.GetProximalTimeBeforeSeqno(i),
start_time + (i - start_seq));
}
checked_file_nums.insert(it->second->orig_file_number);
start_seq = seq_end;
start_time = mock_clock_->NowSeconds();

// Write a key every 200 seconds
for (int i = 0; i < 200; i++) {

@ -396,7 +393,7 @@ TEST_P(SeqnoTimeTablePropTest, BasicSeqnoToTimeMapping) {
dbfull()->TEST_WaitForPeriodicTaskRun(
[&] { mock_clock_->MockSleepForSeconds(static_cast<int>(200)); });
}
seq_end = dbfull()->GetLatestSequenceNumber();
seq_end = dbfull()->GetLatestSequenceNumber() + 1;
ASSERT_OK(Flush());
tables_props.clear();
ASSERT_OK(dbfull()->GetPropertiesOfAllTables(&tables_props));

@ -417,20 +414,18 @@ TEST_P(SeqnoTimeTablePropTest, BasicSeqnoToTimeMapping) {
// The sequence number -> time entries should be maxed
ASSERT_GE(seqs.size(), 99);
ASSERT_LE(seqs.size(), 101);
for (auto i = start_seq; i < seq_end - 99; i++) {
// likely the first 100 entries reports 0
ASSERT_LE(tp_mapping.GetOldestApproximateTime(i), (i - start_seq) + 3000);
}
start_seq += 101;

for (auto i = start_seq; i < seq_end; i++) {
ASSERT_GE(tp_mapping.GetOldestApproximateTime(i),
(i - start_seq) * 200 + 22200);
ASSERT_LE(tp_mapping.GetOldestApproximateTime(i),
(i - start_seq) * 200 + 22600);
// aged out entries allowed to report time=0
if ((seq_end - i) * 200 <= 10000) {
ASSERT_GE(tp_mapping.GetProximalTimeBeforeSeqno(i),
start_time + (i - start_seq) * 200 - 100);
}
ASSERT_LE(tp_mapping.GetProximalTimeBeforeSeqno(i),
start_time + (i - start_seq) * 200);
}
checked_file_nums.insert(it->second->orig_file_number);
start_seq = seq_end;
start_time = mock_clock_->NowSeconds();

// Write a key every 100 seconds
for (int i = 0; i < 200; i++) {

@ -438,7 +433,7 @@ TEST_P(SeqnoTimeTablePropTest, BasicSeqnoToTimeMapping) {
dbfull()->TEST_WaitForPeriodicTaskRun(
[&] { mock_clock_->MockSleepForSeconds(static_cast<int>(100)); });
}
seq_end = dbfull()->GetLatestSequenceNumber();
seq_end = dbfull()->GetLatestSequenceNumber() + 1;
ASSERT_OK(Flush());
tables_props.clear();
ASSERT_OK(dbfull()->GetPropertiesOfAllTables(&tables_props));

@ -484,18 +479,15 @@ TEST_P(SeqnoTimeTablePropTest, BasicSeqnoToTimeMapping) {
seqs = tp_mapping.TEST_GetInternalMapping();
ASSERT_GE(seqs.size(), 99);
ASSERT_LE(seqs.size(), 101);
for (auto i = start_seq; i < seq_end - 99; i++) {
// likely the first 100 entries reports 0
ASSERT_LE(tp_mapping.GetOldestApproximateTime(i),
(i - start_seq) * 100 + 50000);
}
start_seq += 101;

for (auto i = start_seq; i < seq_end; i++) {
ASSERT_GE(tp_mapping.GetOldestApproximateTime(i),
(i - start_seq) * 100 + 52200);
ASSERT_LE(tp_mapping.GetOldestApproximateTime(i),
(i - start_seq) * 100 + 52400);
// aged out entries allowed to report time=0
// FIXME: should be <=
if ((seq_end - i) * 100 < 10000) {
ASSERT_GE(tp_mapping.GetProximalTimeBeforeSeqno(i),
start_time + (i - start_seq) * 100 - 100);
}
ASSERT_LE(tp_mapping.GetProximalTimeBeforeSeqno(i),
start_time + (i - start_seq) * 100);
}
ASSERT_OK(db_->Close());
}

@ -620,14 +612,12 @@ TEST_P(SeqnoTimeTablePropTest, MultiCFs) {
ASSERT_GE(seqs.size(), 99);
ASSERT_LE(seqs.size(), 101);

for (int j = 0; j < 2; j++) {
for (int i = 0; i < 200; i++) {
ASSERT_OK(Put(0, Key(i), "value"));
dbfull()->TEST_WaitForPeriodicTaskRun(
[&] { mock_clock_->MockSleepForSeconds(static_cast<int>(100)); });
}
ASSERT_OK(Flush(0));
}
ASSERT_OK(dbfull()->TEST_WaitForCompact());
tables_props.clear();
ASSERT_OK(dbfull()->GetPropertiesOfAllTables(handles_[0], &tables_props));

@ -735,8 +725,9 @@ TEST_P(SeqnoTimeTablePropTest, SeqnoToTimeMappingUniversal) {
ASSERT_OK(tp_mapping.Sort());
ASSERT_FALSE(tp_mapping.Empty());
auto seqs = tp_mapping.TEST_GetInternalMapping();
ASSERT_GE(seqs.size(), 10 - 1);
ASSERT_LE(seqs.size(), 10 + 1);
// Add (roughly) one for starting entry.
ASSERT_GE(seqs.size(), 10);
ASSERT_LE(seqs.size(), 10 + 2);
}

// Trigger a compaction

@ -826,6 +817,179 @@ TEST_P(SeqnoTimeTablePropTest, SeqnoToTimeMappingUniversal) {
Close();
}

TEST_P(SeqnoTimeTablePropTest, PrePopulateInDB) {
Options base_options = CurrentOptions();
base_options.env = mock_env_.get();
base_options.disable_auto_compactions = true;
base_options.create_missing_column_families = true;
Options track_options = base_options;
constexpr uint32_t kPreserveSecs = 1234567;
SetTrackTimeDurationOptions(kPreserveSecs, track_options);
SeqnoToTimeMapping sttm;
SequenceNumber latest_seqno;
uint64_t start_time, end_time;

// #### DB#1, #2: No pre-population without preserve/preclude ####
// #### But a single entry is added when preserve/preclude enabled ####
for (bool with_write : {false, true}) {
SCOPED_TRACE("with_write=" + std::to_string(with_write));
DestroyAndReopen(base_options);
sttm = dbfull()->TEST_GetSeqnoToTimeMapping();
ASSERT_TRUE(sttm.Empty());
ASSERT_EQ(db_->GetLatestSequenceNumber(), 0U);

if (with_write) {
// Ensure that writes before new CF with preserve/preclude option don't
// interfere with the seqno-to-time mapping getting a starting entry.
ASSERT_OK(Put("foo", "bar"));
ASSERT_OK(Flush());
}

// Unfortunately, if we add a CF with preserve/preclude option after
// open, that does not reserve seqnos with pre-populated time mappings.
CreateColumnFamilies({"one"}, track_options);

// No pre-population (unfortunately), just a single starting entry
sttm = dbfull()->TEST_GetSeqnoToTimeMapping();
latest_seqno = db_->GetLatestSequenceNumber();
start_time = mock_clock_->NowSeconds();
ASSERT_EQ(sttm.Size(), 1);
ASSERT_EQ(latest_seqno, 1U);
// Current time maps to starting entry / seqno
ASSERT_EQ(sttm.GetProximalSeqnoBeforeTime(start_time), 1U);
// Any older times are unknown.
ASSERT_EQ(sttm.GetProximalSeqnoBeforeTime(start_time - 1),
kUnknownSeqnoBeforeAll);

// Now check that writes can proceed normally (passing about 20% of preserve
// time)
for (int i = 0; i < 20; i++) {
ASSERT_OK(Put(Key(i), "value"));
dbfull()->TEST_WaitForPeriodicTaskRun([&] {
mock_clock_->MockSleepForSeconds(static_cast<int>(kPreserveSecs / 99));
});
}
ASSERT_OK(Flush());

// Check that mappings are getting populated
sttm = dbfull()->TEST_GetSeqnoToTimeMapping();
latest_seqno = db_->GetLatestSequenceNumber();
end_time = mock_clock_->NowSeconds();
ASSERT_EQ(sttm.Size(), 21);
ASSERT_EQ(sttm.GetProximalSeqnoBeforeTime(end_time), latest_seqno);
ASSERT_EQ(sttm.GetProximalSeqnoBeforeTime(start_time), 1U);
ASSERT_EQ(sttm.GetProximalSeqnoBeforeTime(start_time - 1),
kUnknownSeqnoBeforeAll);
}

// ### DB#3, #4: Read-only DB with preserve/preclude after not ####
// Make sure we don't hit issues with read-only DBs, which don't need
// the mapping in the DB state (though it wouldn't hurt anything)
for (bool with_write : {false, true}) {
SCOPED_TRACE("with_write=" + std::to_string(with_write));
DestroyAndReopen(base_options);
if (with_write) {
ASSERT_OK(Put("foo", "bar"));
ASSERT_OK(Flush());
}

ASSERT_OK(ReadOnlyReopen(base_options));
if (with_write) {
ASSERT_EQ(Get("foo"), "bar");
}
sttm = dbfull()->TEST_GetSeqnoToTimeMapping();
ASSERT_EQ(sttm.Size(), 0);

ASSERT_OK(ReadOnlyReopen(track_options));
if (with_write) {
ASSERT_EQ(Get("foo"), "bar");
}
sttm = dbfull()->TEST_GetSeqnoToTimeMapping();
ASSERT_EQ(sttm.Size(), 0);
}

// #### DB#5: Destroy and open with preserve/preclude option ####
DestroyAndReopen(track_options);

// Ensure pre-population
constexpr auto kPrePopPairs = SeqnoToTimeMapping::kMaxSeqnoTimePairsPerSST;
sttm = dbfull()->TEST_GetSeqnoToTimeMapping();
latest_seqno = db_->GetLatestSequenceNumber();
start_time = mock_clock_->NowSeconds();
ASSERT_EQ(sttm.Size(), kPrePopPairs);
// One non-zero sequence number per pre-populated pair (this could be
// revised if we want to use interpolation for better approximate time
// mappings with no guarantee of erring in just one direction).
ASSERT_EQ(latest_seqno, kPrePopPairs);
// Current time maps to last pre-allocated seqno
ASSERT_EQ(sttm.GetProximalSeqnoBeforeTime(start_time), latest_seqno);
// Oldest tracking time maps to first pre-allocated seqno
ASSERT_EQ(sttm.GetProximalSeqnoBeforeTime(start_time - kPreserveSecs), 1);

// In more detail, check that estimated seqnos (pre-allocated) are uniformly
// spread over the tracked time.
for (auto ratio : {0.0, 0.433, 0.678, 0.987, 1.0}) {
// Round up query time
uint64_t t = start_time - kPreserveSecs +
static_cast<uint64_t>(ratio * kPreserveSecs + 0.9999999);
// Round down estimated seqno
SequenceNumber s =
static_cast<SequenceNumber>(ratio * (latest_seqno - 1)) + 1;
// Match
ASSERT_EQ(sttm.GetProximalSeqnoBeforeTime(t), s);
}

// Now check that writes can proceed normally (passing about 20% of preserve
// time)
for (int i = 0; i < 20; i++) {
ASSERT_OK(Put(Key(i), "value"));
dbfull()->TEST_WaitForPeriodicTaskRun([&] {
mock_clock_->MockSleepForSeconds(static_cast<int>(kPreserveSecs / 99));
});
}
ASSERT_OK(Flush());

// Can still see some pre-populated mappings, though some displaced
sttm = dbfull()->TEST_GetSeqnoToTimeMapping();
latest_seqno = db_->GetLatestSequenceNumber();
end_time = mock_clock_->NowSeconds();
ASSERT_EQ(sttm.Size(), kPrePopPairs);
ASSERT_EQ(sttm.GetProximalSeqnoBeforeTime(end_time), latest_seqno);
ASSERT_EQ(sttm.GetProximalSeqnoBeforeTime(start_time - kPreserveSecs / 2),
kPrePopPairs / 2);
ASSERT_EQ(sttm.GetProximalSeqnoBeforeTime(start_time - kPreserveSecs),
kUnknownSeqnoBeforeAll);

// Make sure we don't hit issues with read-only DBs, which don't need
// the mapping in the DB state (though it wouldn't hurt anything)
ASSERT_OK(ReadOnlyReopen(track_options));
ASSERT_EQ(Get(Key(0)), "value");
sttm = dbfull()->TEST_GetSeqnoToTimeMapping();
ASSERT_EQ(sttm.Size(), 0);

// #### DB#6: Destroy and open+create an extra CF with preserve/preclude ####
// (default CF does not have the option)
Destroy(track_options);
ReopenWithColumnFamilies({"default", "one"},
List({base_options, track_options}));

// Ensure pre-population (not as exhaustive checking here)
sttm = dbfull()->TEST_GetSeqnoToTimeMapping();
latest_seqno = db_->GetLatestSequenceNumber();
start_time = mock_clock_->NowSeconds();
ASSERT_EQ(sttm.Size(), kPrePopPairs);
// One non-zero sequence number per pre-populated pair (this could be
// revised if we want to use interpolation for better approximate time
// mappings with no guarantee of erring in just one direction).
ASSERT_EQ(latest_seqno, kPrePopPairs);
// Current time maps to last pre-allocated seqno
ASSERT_EQ(sttm.GetProximalSeqnoBeforeTime(start_time), latest_seqno);
// Oldest tracking time maps to first pre-allocated seqno
ASSERT_EQ(sttm.GetProximalSeqnoBeforeTime(start_time - kPreserveSecs), 1);

Close();
}

TEST_F(SeqnoTimeTest, MappingAppend) {
SeqnoToTimeMapping test(/*max_time_duration=*/100, /*max_capacity=*/10);

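For intuition about the uniform-spread check in the test above, here is the rounding arithmetic in isolation, as a small standalone sketch with illustrative constants (these are not the values from the test run):

#include <cstdint>
#include <iostream>

int main() {
  // Illustrative stand-ins for the test's quantities (not the real values).
  const uint64_t kPreserveSecs = 1000;  // tracked time span
  const uint64_t start_time = 50000;    // "now" when the DB was opened
  const uint64_t latest_seqno = 100;    // number of pre-allocated seqnos

  for (double ratio : {0.0, 0.5, 1.0}) {
    // Round the query time up, as the test does...
    uint64_t t = start_time - kPreserveSecs +
                 static_cast<uint64_t>(ratio * kPreserveSecs + 0.9999999);
    // ...and round the expected seqno down, so the comparison errs toward
    // an older (smaller) seqno for a given time.
    uint64_t s = static_cast<uint64_t>(ratio * (latest_seqno - 1)) + 1;
    std::cout << "time " << t << " -> seqno " << s << "\n";
  }
}
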
@ -843,8 +1007,9 @@ TEST_F(SeqnoTimeTest, MappingAppend) {
ASSERT_FALSE(test.Append(8, 12));
ASSERT_EQ(size, test.Size());

// Append with the same seqno, newer time will be accepted
ASSERT_TRUE(test.Append(10, 12));
// Append with the same seqno, newer time is rejected because that makes
// GetProximalSeqnoBeforeTime queries worse (see later test)
ASSERT_FALSE(test.Append(10, 12));
ASSERT_EQ(size, test.Size());
// older time will be ignored
ASSERT_FALSE(test.Append(10, 9));

@ -853,25 +1018,220 @@ TEST_F(SeqnoTimeTest, MappingAppend) {
// new seqno with old time will be ignored
ASSERT_FALSE(test.Append(12, 8));
ASSERT_EQ(size, test.Size());

// new seqno with same time is accepted by replacing last entry
// (improves GetProximalSeqnoBeforeTime queries without blowing up size)
ASSERT_TRUE(test.Append(12, 11));
ASSERT_EQ(size, test.Size());
}

TEST_F(SeqnoTimeTest, GetOldestApproximateTime) {
TEST_F(SeqnoTimeTest, ProximalFunctions) {
SeqnoToTimeMapping test(/*max_time_duration=*/100, /*max_capacity=*/10);

ASSERT_EQ(test.GetOldestApproximateTime(10), kUnknownSeqnoTime);
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(1), kUnknownTimeBeforeAll);
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(1000000000000U),
kUnknownTimeBeforeAll);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(1), kUnknownSeqnoBeforeAll);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(1000000000000U),
kUnknownSeqnoBeforeAll);

test.Append(3, 10);
// (Taken from example in SeqnoToTimeMapping class comment)
// Time 500 is after seqno 10 and before seqno 11
EXPECT_TRUE(test.Append(10, 500));

ASSERT_EQ(test.GetOldestApproximateTime(2), kUnknownSeqnoTime);
ASSERT_EQ(test.GetOldestApproximateTime(3), 10);
ASSERT_EQ(test.GetOldestApproximateTime(10), 10);
// Seqno too early
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(9), kUnknownTimeBeforeAll);
// We only know that 500 is after 10
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(10), kUnknownTimeBeforeAll);
// Found
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(11), 500U);
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(1000000000000U), 500U);

test.Append(10, 100);
// Time too early
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(499), kUnknownSeqnoBeforeAll);
// Found
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(500), 10U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(501), 10U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(1000000000000U), 10U);

test.Append(100, 1000);
ASSERT_EQ(test.GetOldestApproximateTime(10), 100);
ASSERT_EQ(test.GetOldestApproximateTime(40), 100);
ASSERT_EQ(test.GetOldestApproximateTime(111), 1000);
// More samples
EXPECT_TRUE(test.Append(20, 600));
EXPECT_TRUE(test.Append(30, 700));

EXPECT_EQ(test.GetProximalTimeBeforeSeqno(10), kUnknownTimeBeforeAll);
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(11), 500U);
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(20), 500U);
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(21), 600U);
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(30), 600U);
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(31), 700U);
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(1000000000000U), 700U);

EXPECT_EQ(test.GetProximalSeqnoBeforeTime(499), kUnknownSeqnoBeforeAll);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(500), 10U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(501), 10U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(599), 10U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(600), 20U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(601), 20U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(699), 20U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(700), 30U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(701), 30U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(1000000000000U), 30U);

// Redundant sample ignored
EXPECT_EQ(test.Size(), 3U);
EXPECT_FALSE(test.Append(30, 700));
EXPECT_EQ(test.Size(), 3U);

EXPECT_EQ(test.GetProximalTimeBeforeSeqno(30), 600U);
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(31), 700U);

EXPECT_EQ(test.GetProximalSeqnoBeforeTime(699), 20U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(700), 30U);

// Later sample with same seqno is ignored, to provide best results
// for GetProximalSeqnoBeforeTime function while saving entries
// in SeqnoToTimeMapping.
EXPECT_FALSE(test.Append(30, 800));

EXPECT_EQ(test.GetProximalTimeBeforeSeqno(30), 600U);
// Could return 800, but saving space in SeqnoToTimeMapping instead.
// Can reconsider if/when GetProximalTimeBeforeSeqno is used in
// production.
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(31), 700U);

EXPECT_EQ(test.GetProximalSeqnoBeforeTime(699), 20U);
// If the existing {30, 700} entry were replaced with {30, 800}, this
// would return seqno 20 instead of 30, which would preclude more than
// necessary for "preclude_last_level_data_seconds" feature.
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(700), 30U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(800), 30U);

// Still OK
EXPECT_TRUE(test.Append(40, 900));

EXPECT_EQ(test.GetProximalTimeBeforeSeqno(30), 600U);
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(41), 900U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(899), 30U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(900), 40U);

// Burst of writes during a short time creates an opportunity
// for better results from GetProximalSeqnoBeforeTime(), at the
// expense of GetProximalTimeBeforeSeqno().
EXPECT_TRUE(test.Append(50, 900));

// These are subject to later revision depending on priorities
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(49), 700U);
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(51), 900U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(899), 30U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(900), 50U);
}

TEST_F(SeqnoTimeTest, PrePopulate) {
SeqnoToTimeMapping test(/*max_time_duration=*/100, /*max_capacity=*/10);

EXPECT_EQ(test.Size(), 0U);

// Smallest case is like two Appends
test.PrePopulate(10, 11, 500, 600);

EXPECT_EQ(test.GetProximalTimeBeforeSeqno(10), kUnknownTimeBeforeAll);
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(11), 500U);
EXPECT_EQ(test.GetProximalTimeBeforeSeqno(12), 600U);

test.Clear();

// Populate a small range
uint64_t kTimeIncrement = 1234567;
test.PrePopulate(1, 12, kTimeIncrement, kTimeIncrement * 2);

for (uint64_t i = 0; i <= 12; ++i) {
// NOTE: with 1 and 12 as the pre-populated end points, the duration is
// broken into 11 equal(-ish) spans
uint64_t t = kTimeIncrement + (i * kTimeIncrement) / 11 - 1;
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(t), i);
}

test.Clear();

// Populate an excessively large range (in the future we might want to
// interpolate estimated times for seqnos between entries)
test.PrePopulate(1, 34567, kTimeIncrement, kTimeIncrement * 2);

for (auto ratio : {0.0, 0.433, 0.678, 0.987, 1.0}) {
// Round up query time
uint64_t t = kTimeIncrement +
static_cast<uint64_t>(ratio * kTimeIncrement + 0.9999999);
// Round down estimated seqno
SequenceNumber s = static_cast<SequenceNumber>(ratio * (34567 - 1)) + 1;
// Match
// TODO: for now this is exact, but in the future might need approximation
// bounds to account for limited samples.
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(t), s);
}
}

TEST_F(SeqnoTimeTest, TruncateOldEntries) {
constexpr uint64_t kMaxTimeDuration = 42;
SeqnoToTimeMapping test(kMaxTimeDuration, /*max_capacity=*/10);

EXPECT_EQ(test.Size(), 0U);

// Safe on empty mapping
test.TruncateOldEntries(500);

EXPECT_EQ(test.Size(), 0U);

// (Taken from example in SeqnoToTimeMapping class comment)
// Time 500 is after seqno 10 and before seqno 11
EXPECT_TRUE(test.Append(10, 500));
EXPECT_TRUE(test.Append(20, 600));
EXPECT_TRUE(test.Append(30, 700));
EXPECT_TRUE(test.Append(40, 800));
EXPECT_TRUE(test.Append(50, 900));

EXPECT_EQ(test.Size(), 5U);

EXPECT_EQ(test.GetProximalSeqnoBeforeTime(500), 10U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(599), 10U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(600), 20U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(699), 20U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(700), 30U);
// etc.

// Must keep first entry
test.TruncateOldEntries(500 + kMaxTimeDuration);
EXPECT_EQ(test.Size(), 5U);
test.TruncateOldEntries(599 + kMaxTimeDuration);
EXPECT_EQ(test.Size(), 5U);

// Purges first entry
test.TruncateOldEntries(600 + kMaxTimeDuration);
EXPECT_EQ(test.Size(), 4U);

EXPECT_EQ(test.GetProximalSeqnoBeforeTime(500), kUnknownSeqnoBeforeAll);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(599), kUnknownSeqnoBeforeAll);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(600), 20U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(699), 20U);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(700), 30U);

// No effect
test.TruncateOldEntries(600 + kMaxTimeDuration);
EXPECT_EQ(test.Size(), 4U);
test.TruncateOldEntries(699 + kMaxTimeDuration);
EXPECT_EQ(test.Size(), 4U);

// Purges next two
test.TruncateOldEntries(899 + kMaxTimeDuration);
EXPECT_EQ(test.Size(), 2U);

EXPECT_EQ(test.GetProximalSeqnoBeforeTime(799), kUnknownSeqnoBeforeAll);
EXPECT_EQ(test.GetProximalSeqnoBeforeTime(899), 40U);

// Always keep last entry, to have a non-trivial seqno bound
test.TruncateOldEntries(10000000);
EXPECT_EQ(test.Size(), 1U);

EXPECT_EQ(test.GetProximalSeqnoBeforeTime(10000000), 50U);
}

TEST_F(SeqnoTimeTest, Sort) {

@ -930,10 +1290,10 @@ TEST_F(SeqnoTimeTest, EncodeDecodeBasic) {
for (SequenceNumber seq = 0; seq <= 1000; seq++) {
// test has the more accurate time mapping, encode only pick
// kMaxSeqnoTimePairsPerSST number of entries, which is less accurate
uint64_t target_time = test.GetOldestApproximateTime(seq);
ASSERT_GE(decoded.GetOldestApproximateTime(seq),
uint64_t target_time = test.GetProximalTimeBeforeSeqno(seq);
ASSERT_GE(decoded.GetProximalTimeBeforeSeqno(seq),
target_time < 200 ? 0 : target_time - 200);
ASSERT_LE(decoded.GetOldestApproximateTime(seq), target_time);
ASSERT_LE(decoded.GetProximalTimeBeforeSeqno(seq), target_time);
}
}

@ -11,14 +11,34 @@

namespace ROCKSDB_NAMESPACE {

uint64_t SeqnoToTimeMapping::GetOldestApproximateTime(
const SequenceNumber seqno) const {
SeqnoToTimeMapping::pair_const_iterator SeqnoToTimeMapping::FindGreaterTime(
uint64_t time) const {
return std::upper_bound(pairs_.cbegin(), pairs_.cend(),
SeqnoTimePair{0, time}, SeqnoTimePair::TimeLess);
}

SeqnoToTimeMapping::pair_const_iterator SeqnoToTimeMapping::FindGreaterEqSeqno(
SequenceNumber seqno) const {
return std::lower_bound(pairs_.cbegin(), pairs_.cend(),
SeqnoTimePair{seqno, 0}, SeqnoTimePair::SeqnoLess);
}

SeqnoToTimeMapping::pair_const_iterator SeqnoToTimeMapping::FindGreaterSeqno(
SequenceNumber seqno) const {
return std::upper_bound(pairs_.cbegin(), pairs_.cend(),
SeqnoTimePair{seqno, 0}, SeqnoTimePair::SeqnoLess);
}

uint64_t SeqnoToTimeMapping::GetProximalTimeBeforeSeqno(
SequenceNumber seqno) const {
assert(is_sorted_);
auto it = std::upper_bound(seqno_time_mapping_.begin(),
seqno_time_mapping_.end(), seqno);
if (it == seqno_time_mapping_.begin()) {
return 0;
// Find the last entry with a seqno strictly less than the given seqno.
// First, find the first entry >= the given seqno (or end)
auto it = FindGreaterEqSeqno(seqno);
if (it == pairs_.cbegin()) {
return kUnknownTimeBeforeAll;
}
// Then return data from previous.
it--;
return it->time;
}

@ -28,44 +48,47 @@ void SeqnoToTimeMapping::Add(SequenceNumber seqno, uint64_t time) {
return;
}
is_sorted_ = false;
seqno_time_mapping_.emplace_back(seqno, time);
pairs_.emplace_back(seqno, time);
}

void SeqnoToTimeMapping::TruncateOldEntries(const uint64_t now) {
assert(is_sorted_);

if (max_time_duration_ == 0) {
// No cutoff time
return;
}

const uint64_t cut_off_time =
now > max_time_duration_ ? now - max_time_duration_ : 0;
assert(cut_off_time <= now);  // no overflow

auto it = std::upper_bound(
seqno_time_mapping_.begin(), seqno_time_mapping_.end(), cut_off_time,
[](uint64_t target, const SeqnoTimePair& other) -> bool {
return target < other.time;
});
if (it == seqno_time_mapping_.begin()) {
if (now < max_time_duration_) {
// Would under-flow
return;
}
it--;
seqno_time_mapping_.erase(seqno_time_mapping_.begin(), it);

const uint64_t cut_off_time = now - max_time_duration_;
assert(cut_off_time <= now);  // no under/overflow

auto it = FindGreaterTime(cut_off_time);
if (it == pairs_.cbegin()) {
return;
}
// Move back one, to the entry that would be used to return a good seqno from
// GetProximalSeqnoBeforeTime(cut_off_time)
--it;
// Remove everything strictly before that entry
pairs_.erase(pairs_.cbegin(), std::move(it));
}

SequenceNumber SeqnoToTimeMapping::GetOldestSequenceNum(uint64_t time) {
SequenceNumber SeqnoToTimeMapping::GetProximalSeqnoBeforeTime(uint64_t time) {
assert(is_sorted_);

auto it = std::upper_bound(
seqno_time_mapping_.begin(), seqno_time_mapping_.end(), time,
[](uint64_t target, const SeqnoTimePair& other) -> bool {
return target < other.time;
});
if (it == seqno_time_mapping_.begin()) {
return 0;
// Find the last entry with a time <= the given time.
// First, find the first entry > the given time (or end).
auto it = FindGreaterTime(time);
if (it == pairs_.cbegin()) {
return kUnknownSeqnoBeforeAll;
}
it--;
// Then return data from previous.
--it;
return it->seqno;
}

@ -84,15 +107,13 @@ void SeqnoToTimeMapping::Encode(std::string& dest, const SequenceNumber start,
return;
}

auto start_it = std::upper_bound(seqno_time_mapping_.begin(),
seqno_time_mapping_.end(), start);
if (start_it != seqno_time_mapping_.begin()) {
auto start_it = FindGreaterSeqno(start);
if (start_it != pairs_.begin()) {
start_it--;
}

auto end_it = std::upper_bound(seqno_time_mapping_.begin(),
seqno_time_mapping_.end(), end);
if (end_it == seqno_time_mapping_.begin()) {
auto end_it = FindGreaterSeqno(end);
if (end_it == pairs_.begin()) {
return;
}
if (start_it >= end_it) {

@ -108,7 +129,7 @@ void SeqnoToTimeMapping::Encode(std::string& dest, const SequenceNumber start,
}
}
// to include the first element
if (start_it != seqno_time_mapping_.begin()) {
if (start_it != pairs_.begin()) {
start_it--;
}

@ -166,14 +187,14 @@ void SeqnoToTimeMapping::Encode(std::string& dest, const SequenceNumber start,
SeqnoTimePair base;
for (auto it = start_it; it < end_it; it++) {
assert(base < *it);
SeqnoTimePair val = *it - base;
SeqnoTimePair val = it->ComputeDelta(base);
base = *it;
val.Encode(dest);
}
}

Status SeqnoToTimeMapping::Add(const std::string& seqno_time_mapping_str) {
Slice input(seqno_time_mapping_str);
Status SeqnoToTimeMapping::Add(const std::string& pairs_str) {
Slice input(pairs_str);
if (input.empty()) {
return Status::OK();
}

@ -189,8 +210,8 @@ Status SeqnoToTimeMapping::Add(const std::string& seqno_time_mapping_str) {
if (!s.ok()) {
return s;
}
val.Add(base);
seqno_time_mapping_.emplace_back(val);
val.ApplyDelta(base);
pairs_.emplace_back(val);
base = val;
}
return Status::OK();

@ -222,33 +243,58 @@ bool SeqnoToTimeMapping::Append(SequenceNumber seqno, uint64_t time) {
return false;
}
if (seqno == Last().seqno) {
Last().time = time;
return true;
}
if (time == Last().time) {
// new sequence has the same time as old one, no need to add new mapping
// Updating Last() would hurt GetProximalSeqnoBeforeTime() queries, so
// NOT doing it (for now)
return false;
}
if (time == Last().time) {
// Updating Last() here helps GetProximalSeqnoBeforeTime() queries, so
// doing it (for now)
Last().seqno = seqno;
return true;
}
}

seqno_time_mapping_.emplace_back(seqno, time);
pairs_.emplace_back(seqno, time);

if (seqno_time_mapping_.size() > max_capacity_) {
seqno_time_mapping_.pop_front();
if (pairs_.size() > max_capacity_) {
// FIXME: be smarter about how we erase to avoid data falling off the
// front prematurely.
pairs_.pop_front();
}
return true;
}

bool SeqnoToTimeMapping::PrePopulate(SequenceNumber from_seqno,
SequenceNumber to_seqno,
uint64_t from_time, uint64_t to_time) {
assert(Empty());
assert(from_seqno > 0);
assert(to_seqno > from_seqno);
assert(from_time > kUnknownTimeBeforeAll);
assert(to_time >= from_time);

// TODO: smartly limit this to max_capacity_ representative samples
for (auto i = from_seqno; i <= to_seqno; i++) {
uint64_t t = from_time + (to_time - from_time) * (i - from_seqno) /
(to_seqno - from_seqno);
pairs_.emplace_back(i, t);
}

return /*success*/ true;
}

bool SeqnoToTimeMapping::Resize(uint64_t min_time_duration,
uint64_t max_time_duration) {
uint64_t new_max_capacity =
CalculateMaxCapacity(min_time_duration, max_time_duration);
if (new_max_capacity == max_capacity_) {
return false;
} else if (new_max_capacity < seqno_time_mapping_.size()) {
uint64_t delta = seqno_time_mapping_.size() - new_max_capacity;
seqno_time_mapping_.erase(seqno_time_mapping_.begin(),
seqno_time_mapping_.begin() + delta);
} else if (new_max_capacity < pairs_.size()) {
uint64_t delta = pairs_.size() - new_max_capacity;
// FIXME: be smarter about how we erase to avoid data falling off the
// front prematurely.
pairs_.erase(pairs_.begin(), pairs_.begin() + delta);
}
max_capacity_ = new_max_capacity;
return true;

@ -258,16 +304,16 @@ Status SeqnoToTimeMapping::Sort() {
if (is_sorted_) {
return Status::OK();
}
if (seqno_time_mapping_.empty()) {
if (pairs_.empty()) {
is_sorted_ = true;
return Status::OK();
}

std::deque<SeqnoTimePair> copy = std::move(seqno_time_mapping_);
std::deque<SeqnoTimePair> copy = std::move(pairs_);

std::sort(copy.begin(), copy.end());

seqno_time_mapping_.clear();
pairs_.clear();

// remove seqno = 0, which may have special meaning, like zeroed out data
while (copy.front().seqno == 0) {

@ -285,12 +331,12 @@ Status SeqnoToTimeMapping::Sort() {
assert(it.seqno > prev.seqno);
// If a larger sequence number has an older time which is not useful, skip
if (it.time > prev.time) {
seqno_time_mapping_.push_back(prev);
pairs_.push_back(prev);
prev = it;
}
}
}
seqno_time_mapping_.emplace_back(prev);
pairs_.emplace_back(prev);

is_sorted_ = true;
return Status::OK();

@ -298,7 +344,7 @@ Status SeqnoToTimeMapping::Sort() {

std::string SeqnoToTimeMapping::ToHumanString() const {
std::string ret;
for (const auto& seq_time : seqno_time_mapping_) {
for (const auto& seq_time : pairs_) {
AppendNumberTo(&ret, seq_time.seqno);
ret.append("->");
AppendNumberTo(&ret, seq_time.time);

@ -310,13 +356,11 @@ std::string SeqnoToTimeMapping::ToHumanString() const {
SeqnoToTimeMapping SeqnoToTimeMapping::Copy(
SequenceNumber smallest_seqno) const {
SeqnoToTimeMapping ret;
auto it = std::upper_bound(seqno_time_mapping_.begin(),
seqno_time_mapping_.end(), smallest_seqno);
if (it != seqno_time_mapping_.begin()) {
auto it = FindGreaterSeqno(smallest_seqno);
if (it != pairs_.begin()) {
it--;
}
std::copy(it, seqno_time_mapping_.end(),
std::back_inserter(ret.seqno_time_mapping_));
std::copy(it, pairs_.end(), std::back_inserter(ret.pairs_));
return ret;
}

@ -330,12 +374,4 @@ uint64_t SeqnoToTimeMapping::CalculateMaxCapacity(uint64_t min_time_duration,
max_time_duration * kMaxSeqnoTimePairsPerCF / min_time_duration);
}

SeqnoToTimeMapping::SeqnoTimePair SeqnoToTimeMapping::SeqnoTimePair::operator-(
const SeqnoTimePair& other) const {
SeqnoTimePair res;
res.seqno = seqno - other.seqno;
res.time = time - other.time;
return res;
}

} // namespace ROCKSDB_NAMESPACE

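For illustration, the proximal lookups above share one pattern: binary-search for the first entry past the query, then step back one. A minimal standalone sketch of that pattern, with simplified stand-in types rather than the RocksDB classes:

#include <algorithm>
#include <cstdint>
#include <deque>
#include <iostream>
#include <iterator>

struct Pair {
  uint64_t seqno;
  uint64_t time;
};

// Find the first entry with time strictly greater than the query, then step
// back one. Returns 0 (the "unknown" sentinel) when every entry is later
// than the query time.
uint64_t ProximalSeqnoBeforeTime(const std::deque<Pair>& pairs, uint64_t t) {
  auto it = std::upper_bound(
      pairs.cbegin(), pairs.cend(), t,
      [](uint64_t target, const Pair& p) { return target < p.time; });
  if (it == pairs.cbegin()) {
    return 0;  // nothing known at or before this time
  }
  return std::prev(it)->seqno;
}

int main() {
  // Same shape as the class-comment example: 10->500, 20->600, 30->700.
  std::deque<Pair> pairs{{10, 500}, {20, 600}, {30, 700}};
  std::cout << ProximalSeqnoBeforeTime(pairs, 499) << "\n";  // 0 (unknown)
  std::cout << ProximalSeqnoBeforeTime(pairs, 500) << "\n";  // 10
  std::cout << ProximalSeqnoBeforeTime(pairs, 699) << "\n";  // 20
  std::cout << ProximalSeqnoBeforeTime(pairs, 700) << "\n";  // 30
}
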
@ -18,20 +18,32 @@
|
|||
|
||||
namespace ROCKSDB_NAMESPACE {
|
||||
|
||||
constexpr uint64_t kUnknownSeqnoTime = 0;
|
||||
constexpr uint64_t kUnknownTimeBeforeAll = 0;
|
||||
constexpr SequenceNumber kUnknownSeqnoBeforeAll = 0;
|
||||
|
||||
// SeqnoToTimeMapping stores the sequence number to time mapping, so given a
|
||||
// sequence number it can estimate the oldest possible time for that sequence
|
||||
// number. For example:
|
||||
// 10 -> 100
|
||||
// 50 -> 300
|
||||
// then if a key has seqno 19, the OldestApproximateTime would be 100, for 51 it
|
||||
// would be 300.
|
||||
// As it's a sorted list, the new entry is inserted from the back. The old data
|
||||
// will be popped from the front if they're no longer used.
|
||||
// SeqnoToTimeMapping stores a sampled mapping from sequence numbers to
|
||||
// unix times (seconds since epoch). This information provides rough bounds
|
||||
// between sequence numbers and their write times, but is primarily designed
|
||||
// for getting a best lower bound on the sequence number of data written no
|
||||
// later than a specified time.
|
||||
//
|
||||
// Note: the data struct is not thread safe, both read and write need to be
|
||||
// synchronized by caller.
|
||||
// For ease of sampling, it is assumed that the recorded time in each pair
|
||||
// comes at or after the sequence number and before the next sequence number,
|
||||
// so this example:
|
||||
//
|
||||
// Seqno: 10, 11, ... 20, 21, ... 30, 31, ...
|
||||
// Time: ... 500 ... 600 ... 700 ...
|
||||
//
|
||||
// would be represented as
|
||||
// 10 -> 500
|
||||
// 20 -> 600
|
||||
// 30 -> 700
|
||||
//
|
||||
// In typical operation, the list is sorted, both among seqnos and among times,
|
||||
// with a bounded number of entries, but some public working states violate
|
||||
// these constraints.
|
||||
//
|
||||
// NOT thread safe - requires external synchronization.
|
||||
class SeqnoToTimeMapping {
|
||||
public:
|
||||
// Maximum number of entries can be encoded into SST. The data is delta encode
|
||||
|
@ -63,28 +75,33 @@ class SeqnoToTimeMapping {
|
|||
// Decode the value from input Slice and remove it from the input
|
||||
Status Decode(Slice& input);
|
||||
|
||||
// subtraction of 2 SeqnoTimePair
|
||||
SeqnoTimePair operator-(const SeqnoTimePair& other) const;
|
||||
|
||||
// Add 2 values together
|
||||
void Add(const SeqnoTimePair& obj) {
|
||||
seqno += obj.seqno;
|
||||
time += obj.time;
|
||||
// For delta encoding
|
||||
SeqnoTimePair ComputeDelta(const SeqnoTimePair& base) const {
|
||||
return {seqno - base.seqno, time - base.time};
|
||||
}
|
||||
|
||||
// Compare SeqnoTimePair with a sequence number, used for binary search a
|
||||
// sequence number in a list of SeqnoTimePair
|
||||
bool operator<(const SequenceNumber& other) const { return seqno < other; }
|
||||
// For delta decoding
|
||||
void ApplyDelta(const SeqnoTimePair& delta_or_base) {
|
||||
seqno += delta_or_base.seqno;
|
||||
time += delta_or_base.time;
|
||||
}
|
||||
|
||||
// Compare 2 SeqnoTimePair
|
||||
// Ordering used for Sort()
|
||||
bool operator<(const SeqnoTimePair& other) const {
|
||||
return std::tie(seqno, time) < std::tie(other.seqno, other.time);
|
||||
}
|
||||
|
||||
// Check if 2 SeqnoTimePair is the same
|
||||
bool operator==(const SeqnoTimePair& other) const {
|
||||
return std::tie(seqno, time) == std::tie(other.seqno, other.time);
|
||||
}
|
||||
|
||||
static bool SeqnoLess(const SeqnoTimePair& a, const SeqnoTimePair& b) {
|
||||
return a.seqno < b.seqno;
|
||||
}
|
||||
|
||||
static bool TimeLess(const SeqnoTimePair& a, const SeqnoTimePair& b) {
|
||||
return a.time < b.time;
|
||||
}
|
||||
};
|
||||
|
||||
// constractor of SeqnoToTimeMapping
|
||||
|
@ -99,20 +116,40 @@ class SeqnoToTimeMapping {
|
|||
uint64_t max_capacity = 0)
|
||||
: max_time_duration_(max_time_duration), max_capacity_(max_capacity) {}
|
||||
|
||||
// Both seqno range and time range are inclusive. ... TODO
|
||||
//
|
||||
bool PrePopulate(SequenceNumber from_seqno, SequenceNumber to_seqno,
|
||||
uint64_t from_time, uint64_t to_time);
|
||||
|
||||
// Append a new entry to the list. The new entry should be newer than the
|
||||
// existing ones. It maintains the internal sorted status.
|
||||
bool Append(SequenceNumber seqno, uint64_t time);
|
||||
|
||||
// Given a sequence number, estimate it's oldest time
|
||||
uint64_t GetOldestApproximateTime(SequenceNumber seqno) const;
|
||||
// Given a sequence number, return the best (largest / newest) known time
|
||||
// that is no later than the write time of that given sequence number.
|
||||
// If no such specific time is known, returns kUnknownTimeBeforeAll.
|
||||
// Using the example in the class comment above,
|
||||
// GetProximalTimeBeforeSeqno(10) -> kUnknownTimeBeforeAll
|
||||
// GetProximalTimeBeforeSeqno(11) -> 500
|
||||
// GetProximalTimeBeforeSeqno(20) -> 500
|
||||
// GetProximalTimeBeforeSeqno(21) -> 600
|
||||
uint64_t GetProximalTimeBeforeSeqno(SequenceNumber seqno) const;
|
||||
|
||||
// Truncate the old entries based on the current time and max_time_duration_
|
||||
// Remove any entries not needed for GetProximalSeqnoBeforeTime queries of
|
||||
// times older than `now - max_time_duration_`
|
||||
void TruncateOldEntries(uint64_t now);
|
||||
|
||||
// Given a time, return it's oldest possible sequence number
|
||||
SequenceNumber GetOldestSequenceNum(uint64_t time);
|
||||
// Given a time, return the best (largest) sequence number whose write time
|
||||
// is no later than that given time. If no such specific sequence number is
|
||||
// known, returns kUnknownSeqnoBeforeAll. Using the example in the class
|
||||
// comment above,
|
||||
// GetProximalSeqnoBeforeTime(499) -> kUnknownSeqnoBeforeAll
|
||||
// GetProximalSeqnoBeforeTime(500) -> 10
|
||||
// GetProximalSeqnoBeforeTime(599) -> 10
|
||||
// GetProximalSeqnoBeforeTime(600) -> 20
|
||||
SequenceNumber GetProximalSeqnoBeforeTime(uint64_t time);
|
||||
|
||||
// Encode to a binary string
|
||||
// Encode to a binary string. start and end seqno are both inclusive.
|
||||
void Encode(std::string& des, SequenceNumber start, SequenceNumber end,
|
||||
uint64_t now,
|
||||
uint64_t output_size = kMaxSeqnoTimePairsPerSST) const;
|
||||
|
@ -122,10 +159,10 @@ class SeqnoToTimeMapping {
|
|||
void Add(SequenceNumber seqno, uint64_t time);
|
||||
|
||||
// Decode and add the entries to the current obj. The list will be unsorted
|
||||
Status Add(const std::string& seqno_time_mapping_str);
|
||||
Status Add(const std::string& pairs_str);
|
||||
|
||||
// Return the number of entries
|
||||
size_t Size() const { return seqno_time_mapping_.size(); }
|
||||
size_t Size() const { return pairs_.size(); }
|
||||
|
||||
// Reduce the size of internal list
|
||||
bool Resize(uint64_t min_time_duration, uint64_t max_time_duration);
|
||||
|
@ -145,10 +182,10 @@ class SeqnoToTimeMapping {
|
|||
SeqnoToTimeMapping Copy(SequenceNumber smallest_seqno) const;
|
||||
|
||||
// If the internal list is empty
|
||||
bool Empty() const { return seqno_time_mapping_.empty(); }
|
||||
bool Empty() const { return pairs_.empty(); }
|
||||
|
||||
// clear all entries
|
||||
void Clear() { seqno_time_mapping_.clear(); }
|
||||
void Clear() { pairs_.clear(); }
|
||||
|
||||
// return the string for user message
|
||||
// Note: Not efficient, okay for print
|
||||
|
@ -156,7 +193,7 @@ class SeqnoToTimeMapping {
|
|||
|
||||
#ifndef NDEBUG
|
||||
const std::deque<SeqnoTimePair>& TEST_GetInternalMapping() const {
|
||||
return seqno_time_mapping_;
|
||||
return pairs_;
|
||||
}
|
||||
#endif
|
||||
|
||||
|
@ -167,7 +204,7 @@ class SeqnoToTimeMapping {
|
|||
uint64_t max_time_duration_;
|
||||
uint64_t max_capacity_;
|
||||
|
||||
std::deque<SeqnoTimePair> seqno_time_mapping_;
|
||||
std::deque<SeqnoTimePair> pairs_;
|
||||
|
||||
bool is_sorted_ = true;
|
||||
|
||||
|
@ -176,14 +213,14 @@ class SeqnoToTimeMapping {
|
|||
|
||||
SeqnoTimePair& Last() {
|
||||
assert(!Empty());
|
||||
return seqno_time_mapping_.back();
|
||||
return pairs_.back();
|
||||
}
|
||||
|
||||
using pair_const_iterator =
|
||||
std::deque<SeqnoToTimeMapping::SeqnoTimePair>::const_iterator;
|
||||
pair_const_iterator FindGreaterTime(uint64_t time) const;
|
||||
pair_const_iterator FindGreaterSeqno(SequenceNumber seqno) const;
|
||||
pair_const_iterator FindGreaterEqSeqno(SequenceNumber seqno) const;
|
||||
};
|
||||
|
||||
// for searching the sequence number from SeqnoToTimeMapping
|
||||
inline bool operator<(const SequenceNumber& seqno,
|
||||
const SeqnoToTimeMapping::SeqnoTimePair& other) {
|
||||
return seqno < other.seqno;
|
||||
}
|
||||
|
||||
} // namespace ROCKSDB_NAMESPACE
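The GetProximalTimeBeforeSeqno and GetProximalSeqnoBeforeTime comments above describe the lookup semantics with a worked example. The following is a minimal standalone sketch of that "largest entry no later than the query point" lookup over a sorted pair list; it only mirrors the documented semantics and is not the RocksDB implementation, and kUnknownSeqnoBeforeAll is assumed here to be a zero sentinel.

#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative sketch of the documented GetProximalSeqnoBeforeTime semantics;
// SeqnoTime is a simplified stand-in for the real SeqnoTimePair.
struct SeqnoTime {
  uint64_t seqno;
  uint64_t time;
};

constexpr uint64_t kUnknownSeqnoBeforeAll = 0;  // assumed sentinel value

uint64_t ProximalSeqnoBeforeTime(const std::vector<SeqnoTime>& pairs,
                                 uint64_t t) {
  // Find the first entry whose write time is strictly greater than t.
  auto it = std::upper_bound(
      pairs.begin(), pairs.end(), t,
      [](uint64_t time, const SeqnoTime& p) { return time < p.time; });
  if (it == pairs.begin()) {
    return kUnknownSeqnoBeforeAll;  // nothing known to be written at or before t
  }
  return std::prev(it)->seqno;  // largest seqno whose time is <= t
}

// With {{10, 500}, {20, 600}}: t=499 -> kUnknownSeqnoBeforeAll, t=500 -> 10,
// t=599 -> 10, t=600 -> 20, matching the class comment above.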
@ -351,23 +351,25 @@ Status TableCache::GetRangeTombstoneIterator(
|
|||
return s;
|
||||
}
|
||||
|
||||
void TableCache::CreateRowCacheKeyPrefix(const ReadOptions& options,
|
||||
const FileDescriptor& fd,
|
||||
const Slice& internal_key,
|
||||
GetContext* get_context,
|
||||
IterKey& row_cache_key) {
|
||||
uint64_t TableCache::CreateRowCacheKeyPrefix(const ReadOptions& options,
|
||||
const FileDescriptor& fd,
|
||||
const Slice& internal_key,
|
||||
GetContext* get_context,
|
||||
IterKey& row_cache_key) {
|
||||
uint64_t fd_number = fd.GetNumber();
|
||||
// We use the user key as cache key instead of the internal key,
|
||||
// otherwise the whole cache would be invalidated every time the
|
||||
// sequence key increases. However, to support caching snapshot
|
||||
// reads, we append the sequence number (incremented by 1 to
|
||||
// distinguish from 0) only in this case.
|
||||
// reads, we append a sequence number (incremented by 1 to
// distinguish from 0), rather than the internal key's seqno,
// to determine row cache entry visibility.
|
||||
// If the snapshot is larger than the largest seqno in the file,
|
||||
// all data should be exposed to the snapshot, so we treat it
|
||||
// the same as there is no snapshot. The exception is that if
|
||||
// a seq-checking callback is registered, some internal keys
|
||||
// may still be filtered out.
|
||||
uint64_t seq_no = 0;
|
||||
uint64_t cache_entry_seq_no = 0;
|
||||
|
||||
// Maybe we can include the whole file if snapshot == fd.largest_seqno.
|
||||
if (options.snapshot != nullptr &&
|
||||
(get_context->has_callback() ||
|
||||
|
@ -376,18 +378,24 @@ void TableCache::CreateRowCacheKeyPrefix(const ReadOptions& options,
|
|||
// We should consider to use options.snapshot->GetSequenceNumber()
|
||||
// instead of GetInternalKeySeqno(k), which will make the code
|
||||
// easier to understand.
|
||||
seq_no = 1 + GetInternalKeySeqno(internal_key);
|
||||
cache_entry_seq_no = 1 + GetInternalKeySeqno(internal_key);
|
||||
}
|
||||
|
||||
// Compute row cache key.
|
||||
row_cache_key.TrimAppend(row_cache_key.Size(), row_cache_id_.data(),
|
||||
row_cache_id_.size());
|
||||
AppendVarint64(&row_cache_key, fd_number);
|
||||
AppendVarint64(&row_cache_key, seq_no);
|
||||
AppendVarint64(&row_cache_key, cache_entry_seq_no);
|
||||
|
||||
// Provide a sequence number for callback checking on cache hit.
|
||||
// As cache_entry_seq_no starts at 1, decrease its value by 1 to get
// a sequence number aligned with the get context's logic.
|
||||
return cache_entry_seq_no == 0 ? 0 : cache_entry_seq_no - 1;
|
||||
}
|
||||
|
||||
bool TableCache::GetFromRowCache(const Slice& user_key, IterKey& row_cache_key,
|
||||
size_t prefix_size, GetContext* get_context) {
|
||||
size_t prefix_size, GetContext* get_context,
|
||||
SequenceNumber seq_no) {
|
||||
bool found = false;
|
||||
|
||||
row_cache_key.TrimAppend(prefix_size, user_key.data(), user_key.size());
|
||||
|
@ -404,8 +412,10 @@ bool TableCache::GetFromRowCache(const Slice& user_key, IterKey& row_cache_key,
|
|||
// get_context.pinnable_slice_. Cache entry is released when
|
||||
// get_context.pinnable_slice_ is reset.
|
||||
row_cache.RegisterReleaseAsCleanup(row_handle, value_pinner);
|
||||
// If row cache hit, knowing cache key is the same to row_cache_key,
|
||||
// can use row_cache_key's seq no to construct InternalKey.
|
||||
replayGetContextLog(*row_cache.Value(row_handle), user_key, get_context,
|
||||
&value_pinner);
|
||||
&value_pinner, seq_no);
|
||||
RecordTick(ioptions_.stats, ROW_CACHE_HIT);
|
||||
found = true;
|
||||
} else {
|
||||
|
@ -428,13 +438,14 @@ Status TableCache::Get(
|
|||
IterKey row_cache_key;
|
||||
std::string row_cache_entry_buffer;
|
||||
|
||||
// Check row cache if enabled. Since row cache does not currently store
|
||||
// sequence numbers, we cannot use it if we need to fetch the sequence.
|
||||
// Check row cache if enabled.
|
||||
// Reuse row_cache_key sequence number when row cache hits.
|
||||
if (ioptions_.row_cache && !get_context->NeedToReadSequence()) {
|
||||
auto user_key = ExtractUserKey(k);
|
||||
CreateRowCacheKeyPrefix(options, fd, k, get_context, row_cache_key);
|
||||
uint64_t cache_entry_seq_no =
|
||||
CreateRowCacheKeyPrefix(options, fd, k, get_context, row_cache_key);
|
||||
done = GetFromRowCache(user_key, row_cache_key, row_cache_key.Size(),
|
||||
get_context);
|
||||
get_context, cache_entry_seq_no);
|
||||
if (!done) {
|
||||
row_cache_entry = &row_cache_entry_buffer;
|
||||
}
|
||||
|
@ -718,4 +729,4 @@ uint64_t TableCache::ApproximateSize(
|
|||
|
||||
return result;
|
||||
}
|
||||
} // namespace ROCKSDB_NAMESPACE
|
||||
} // namespace ROCKSDB_NAMESPACE
|
|
@ -262,15 +262,18 @@ class TableCache {
|
|||
// Create a key prefix for looking up the row cache. The prefix is of the
|
||||
// format row_cache_id + fd_number + seq_no. Later, the user key can be
|
||||
// appended to form the full key
|
||||
void CreateRowCacheKeyPrefix(const ReadOptions& options,
|
||||
const FileDescriptor& fd,
|
||||
const Slice& internal_key,
|
||||
GetContext* get_context, IterKey& row_cache_key);
|
||||
// Return the sequence number that determines the visibility of row_cache_key
|
||||
uint64_t CreateRowCacheKeyPrefix(const ReadOptions& options,
|
||||
const FileDescriptor& fd,
|
||||
const Slice& internal_key,
|
||||
GetContext* get_context,
|
||||
IterKey& row_cache_key);
|
||||
|
||||
// Helper function to lookup the row cache for a key. It appends the
|
||||
// user key to row_cache_key at offset prefix_size
|
||||
bool GetFromRowCache(const Slice& user_key, IterKey& row_cache_key,
|
||||
size_t prefix_size, GetContext* get_context);
|
||||
size_t prefix_size, GetContext* get_context,
|
||||
SequenceNumber seq_no = kMaxSequenceNumber);
|
||||
|
||||
const ImmutableOptions& ioptions_;
|
||||
const FileOptions& file_options_;
|
||||
|
@ -283,4 +286,4 @@ class TableCache {
|
|||
std::string db_session_id_;
|
||||
};
|
||||
|
||||
} // namespace ROCKSDB_NAMESPACE
|
||||
} // namespace ROCKSDB_NAMESPACE
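CreateRowCacheKeyPrefix, as changed above, now returns the sequence number that determines row cache entry visibility, and GetFromRowCache takes it back on a hit. A rough sketch of the key-prefix layout those comments describe is shown below; MakeRowCacheKeyPrefix and the local PutVarint64 are illustrative stand-ins, not the actual TableCache or util/coding helpers.

#include <cstdint>
#include <string>

// Sketch of the prefix layout described above:
// row_cache_id + varint(file_number) + varint(cache_entry_seq_no), where
// cache_entry_seq_no is 0 when no snapshot applies; otherwise it is a
// visibility seqno incremented by 1, so a hit can recover it as (value - 1).
void PutVarint64(std::string* dst, uint64_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

std::string MakeRowCacheKeyPrefix(const std::string& row_cache_id,
                                  uint64_t file_number,
                                  uint64_t cache_entry_seq_no) {
  std::string prefix = row_cache_id;
  PutVarint64(&prefix, file_number);
  PutVarint64(&prefix, cache_entry_seq_no);
  return prefix;
}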
@ -219,10 +219,16 @@ struct FileMetaData {
|
|||
// refers to. 0 is an invalid value; BlobDB numbers the files starting from 1.
|
||||
uint64_t oldest_blob_file_number = kInvalidBlobFileNumber;
|
||||
|
||||
// The file could be the compaction output from other SST files, which could
|
||||
// in turn be outputs for compact older SST files. We track the memtable
|
||||
// flush timestamp for the oldest SST file that eventually contribute data
|
||||
// to this file. 0 means the information is not available.
|
||||
// For a flush output file, the oldest ancestor time is the oldest key time in
// the file. If the oldest key time is not available, the flush time is used.
//
// For a compaction output file, the oldest ancestor time is the oldest among
// all the oldest key times of its input files, since the file could be the
// compaction output from other SST files, which could in turn be outputs for
// compacting older SST files. If that's not available, the creation time of
// this compaction output file is used.
|
||||
//
|
||||
// 0 means the information is not available.
|
||||
uint64_t oldest_ancester_time = kUnknownOldestAncesterTime;
|
||||
|
||||
// Unix time when the SST file is created.
|
||||
|
|
|
@ -2527,21 +2527,16 @@ void Version::Get(const ReadOptions& read_options, const LookupKey& k,
|
|||
// merge_operands are in saver and we hit the beginning of the key history
|
||||
// do a final merge of nullptr and operands;
|
||||
if (value || columns) {
|
||||
std::string result;
|
||||
// `op_failure_scope` (an output parameter) is not provided (set to
|
||||
// nullptr) since a failure must be propagated regardless of its value.
|
||||
*status = MergeHelper::TimedFullMerge(
|
||||
merge_operator_, user_key, nullptr, merge_context->GetOperands(),
|
||||
&result, info_log_, db_statistics_, clock_,
|
||||
/* result_operand */ nullptr, /* update_num_ops_stats */ true,
|
||||
/* op_failure_scope */ nullptr);
|
||||
merge_operator_, user_key, MergeHelper::kNoBaseValue,
|
||||
merge_context->GetOperands(), info_log_, db_statistics_, clock_,
|
||||
/* update_num_ops_stats */ true, value ? value->GetSelf() : nullptr,
|
||||
columns, /* op_failure_scope */ nullptr);
|
||||
if (status->ok()) {
|
||||
if (LIKELY(value != nullptr)) {
|
||||
*(value->GetSelf()) = std::move(result);
|
||||
value->PinSelf();
|
||||
} else {
|
||||
assert(columns != nullptr);
|
||||
columns->SetPlainValue(std::move(result));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -2778,22 +2773,19 @@ void Version::MultiGet(const ReadOptions& read_options, MultiGetRange* range,
|
|||
}
|
||||
// merge_operands are in saver and we hit the beginning of the key history
|
||||
// do a final merge of nullptr and operands;
|
||||
std::string result;
|
||||
|
||||
// `op_failure_scope` (an output parameter) is not provided (set to
|
||||
// nullptr) since a failure must be propagated regardless of its value.
|
||||
*status = MergeHelper::TimedFullMerge(
|
||||
merge_operator_, user_key, nullptr, iter->merge_context.GetOperands(),
|
||||
&result, info_log_, db_statistics_, clock_,
|
||||
/* result_operand */ nullptr, /* update_num_ops_stats */ true,
|
||||
merge_operator_, user_key, MergeHelper::kNoBaseValue,
|
||||
iter->merge_context.GetOperands(), info_log_, db_statistics_, clock_,
|
||||
/* update_num_ops_stats */ true,
|
||||
iter->value ? iter->value->GetSelf() : nullptr, iter->columns,
|
||||
/* op_failure_scope */ nullptr);
|
||||
if (LIKELY(iter->value != nullptr)) {
|
||||
*iter->value->GetSelf() = std::move(result);
|
||||
iter->value->PinSelf();
|
||||
range->AddValueSize(iter->value->size());
|
||||
} else {
|
||||
assert(iter->columns);
|
||||
iter->columns->SetPlainValue(std::move(result));
|
||||
range->AddValueSize(iter->columns->serialized_size());
|
||||
}
|
||||
|
||||
|
@ -7239,6 +7231,20 @@ Status VersionSet::VerifyFileMetadata(const ReadOptions& read_options,
|
|||
return status;
|
||||
}
|
||||
|
||||
void VersionSet::EnsureNonZeroSequence() {
|
||||
uint64_t expected = 0;
|
||||
// Update each from 0->1, in order, or abort if any becomes non-zero in
|
||||
// parallel
|
||||
if (last_allocated_sequence_.compare_exchange_strong(expected, 1)) {
|
||||
if (last_published_sequence_.compare_exchange_strong(expected, 1)) {
|
||||
(void)last_sequence_.compare_exchange_strong(expected, 1);
|
||||
}
|
||||
}
|
||||
assert(last_allocated_sequence_.load() > 0);
|
||||
assert(last_published_sequence_.load() > 0);
|
||||
assert(last_sequence_.load() > 0);
|
||||
}
|
||||
|
||||
ReactiveVersionSet::ReactiveVersionSet(
|
||||
const std::string& dbname, const ImmutableDBOptions* _db_options,
|
||||
const FileOptions& _file_options, Cache* table_cache,
|
||||
|
|
|
@ -1342,6 +1342,9 @@ class VersionSet {
|
|||
last_allocated_sequence_.store(s, std::memory_order_seq_cst);
|
||||
}
|
||||
|
||||
// Allocate a dummy sequence number as needed to ensure last is non-zero.
|
||||
void EnsureNonZeroSequence();
|
||||
|
||||
// Note: memory_order_release must be sufficient
|
||||
uint64_t FetchAddLastAllocatedSequence(uint64_t s) {
|
||||
return last_allocated_sequence_.fetch_add(s, std::memory_order_seq_cst);
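EnsureNonZeroSequence above promotes each of the three sequence counters from 0 to 1 with compare_exchange_strong, tolerating another thread that advances a counter first. A minimal standalone illustration of that promotion idiom (independent of VersionSet) follows.

#include <atomic>
#include <cassert>
#include <cstdint>

// If the counter is still 0, bump it to 1. If another thread already advanced
// it, the CAS fails and the counter is simply observed to be non-zero.
void EnsureNonZero(std::atomic<uint64_t>& counter) {
  uint64_t expected = 0;
  counter.compare_exchange_strong(expected, 1);
  assert(counter.load() > 0);
}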
@ -10,6 +10,7 @@
|
|||
#include "db/db_test_util.h"
|
||||
#include "port/stack_trace.h"
|
||||
#include "test_util/testutil.h"
|
||||
#include "util/overload.h"
|
||||
#include "utilities/merge_operators.h"
|
||||
|
||||
namespace ROCKSDB_NAMESPACE {
|
||||
|
@ -690,6 +691,397 @@ TEST_F(DBWideBasicTest, MergeEntity) {
|
|||
verify_merge_ops_post_compaction();
|
||||
}
|
||||
|
||||
class DBWideMergeV3Test : public DBWideBasicTest {
|
||||
protected:
|
||||
void RunTest(const WideColumns& first_expected,
|
||||
const WideColumns& second_expected,
|
||||
const WideColumns& third_expected) {
|
||||
// Note: we'll take some snapshots to prevent merging during flush
|
||||
snapshots_.reserve(6);
|
||||
|
||||
// Test reading from memtables
|
||||
WriteKeyValues();
|
||||
VerifyKeyValues(first_expected, second_expected, third_expected);
|
||||
VerifyMergeOperandCount(first_key, 2);
|
||||
VerifyMergeOperandCount(second_key, 3);
|
||||
VerifyMergeOperandCount(third_key, 3);
|
||||
|
||||
// Test reading from SST files
|
||||
ASSERT_OK(Flush());
|
||||
VerifyKeyValues(first_expected, second_expected, third_expected);
|
||||
VerifyMergeOperandCount(first_key, 2);
|
||||
VerifyMergeOperandCount(second_key, 3);
|
||||
VerifyMergeOperandCount(third_key, 3);
|
||||
|
||||
// Test reading from SSTs after compaction. Note that we write the same KVs
|
||||
// and flush again so we have two overlapping files. We also release the
|
||||
// snapshots so that the compaction can merge all keys.
|
||||
WriteKeyValues();
|
||||
ASSERT_OK(Flush());
|
||||
|
||||
snapshots_.clear();
|
||||
|
||||
ASSERT_OK(db_->CompactRange(CompactRangeOptions(), /* begin */ nullptr,
|
||||
/* end */ nullptr));
|
||||
VerifyKeyValues(first_expected, second_expected, third_expected);
|
||||
VerifyMergeOperandCount(first_key, 1);
|
||||
VerifyMergeOperandCount(second_key, 1);
|
||||
VerifyMergeOperandCount(third_key, 1);
|
||||
}
|
||||
|
||||
void WriteKeyValues() {
|
||||
// Base values
|
||||
ASSERT_OK(db_->Delete(WriteOptions(), db_->DefaultColumnFamily(),
|
||||
first_key)); // no base value
|
||||
ASSERT_OK(db_->Put(WriteOptions(), db_->DefaultColumnFamily(), second_key,
|
||||
second_base_value)); // plain base value
|
||||
ASSERT_OK(db_->PutEntity(WriteOptions(), db_->DefaultColumnFamily(),
|
||||
third_key,
|
||||
third_columns)); // wide-column base value
|
||||
|
||||
snapshots_.emplace_back(db_);
|
||||
|
||||
// First round of merge operands
|
||||
ASSERT_OK(db_->Merge(WriteOptions(), db_->DefaultColumnFamily(), first_key,
|
||||
first_merge_op1));
|
||||
ASSERT_OK(db_->Merge(WriteOptions(), db_->DefaultColumnFamily(), second_key,
|
||||
second_merge_op1));
|
||||
ASSERT_OK(db_->Merge(WriteOptions(), db_->DefaultColumnFamily(), third_key,
|
||||
third_merge_op1));
|
||||
|
||||
snapshots_.emplace_back(db_);
|
||||
|
||||
// Second round of merge operands
|
||||
ASSERT_OK(db_->Merge(WriteOptions(), db_->DefaultColumnFamily(), first_key,
|
||||
first_merge_op2));
|
||||
ASSERT_OK(db_->Merge(WriteOptions(), db_->DefaultColumnFamily(), second_key,
|
||||
second_merge_op2));
|
||||
ASSERT_OK(db_->Merge(WriteOptions(), db_->DefaultColumnFamily(), third_key,
|
||||
third_merge_op2));
|
||||
|
||||
snapshots_.emplace_back(db_);
|
||||
}
|
||||
|
||||
void VerifyKeyValues(const WideColumns& first_expected,
|
||||
const WideColumns& second_expected,
|
||||
const WideColumns& third_expected) {
|
||||
assert(!first_expected.empty() &&
|
||||
first_expected[0].name() == kDefaultWideColumnName);
|
||||
assert(!second_expected.empty() &&
|
||||
second_expected[0].name() == kDefaultWideColumnName);
|
||||
assert(!third_expected.empty() &&
|
||||
third_expected[0].name() == kDefaultWideColumnName);
|
||||
|
||||
// Get
|
||||
{
|
||||
PinnableSlice result;
|
||||
ASSERT_OK(db_->Get(ReadOptions(), db_->DefaultColumnFamily(), first_key,
|
||||
&result));
|
||||
ASSERT_EQ(result, first_expected[0].value());
|
||||
}
|
||||
|
||||
{
|
||||
PinnableSlice result;
|
||||
ASSERT_OK(db_->Get(ReadOptions(), db_->DefaultColumnFamily(), second_key,
|
||||
&result));
|
||||
ASSERT_EQ(result, second_expected[0].value());
|
||||
}
|
||||
|
||||
{
|
||||
PinnableSlice result;
|
||||
ASSERT_OK(db_->Get(ReadOptions(), db_->DefaultColumnFamily(), third_key,
|
||||
&result));
|
||||
ASSERT_EQ(result, third_expected[0].value());
|
||||
}
|
||||
|
||||
// MultiGet
|
||||
{
|
||||
std::array<Slice, num_keys> keys{{first_key, second_key, third_key}};
|
||||
std::array<PinnableSlice, num_keys> values;
|
||||
std::array<Status, num_keys> statuses;
|
||||
|
||||
db_->MultiGet(ReadOptions(), db_->DefaultColumnFamily(), num_keys,
|
||||
keys.data(), values.data(), statuses.data());
|
||||
ASSERT_OK(statuses[0]);
|
||||
ASSERT_EQ(values[0], first_expected[0].value());
|
||||
ASSERT_OK(statuses[1]);
|
||||
ASSERT_EQ(values[1], second_expected[0].value());
|
||||
ASSERT_OK(statuses[2]);
|
||||
ASSERT_EQ(values[2], third_expected[0].value());
|
||||
}
|
||||
|
||||
// GetEntity
|
||||
{
|
||||
PinnableWideColumns result;
|
||||
|
||||
ASSERT_OK(db_->GetEntity(ReadOptions(), db_->DefaultColumnFamily(),
|
||||
first_key, &result));
|
||||
ASSERT_EQ(result.columns(), first_expected);
|
||||
}
|
||||
|
||||
{
|
||||
PinnableWideColumns result;
|
||||
|
||||
ASSERT_OK(db_->GetEntity(ReadOptions(), db_->DefaultColumnFamily(),
|
||||
second_key, &result));
|
||||
ASSERT_EQ(result.columns(), second_expected);
|
||||
}
|
||||
|
||||
{
|
||||
PinnableWideColumns result;
|
||||
|
||||
ASSERT_OK(db_->GetEntity(ReadOptions(), db_->DefaultColumnFamily(),
|
||||
third_key, &result));
|
||||
ASSERT_EQ(result.columns(), third_expected);
|
||||
}
|
||||
|
||||
// MultiGetEntity
|
||||
{
|
||||
std::array<Slice, num_keys> keys{{first_key, second_key, third_key}};
|
||||
std::array<PinnableWideColumns, num_keys> results;
|
||||
std::array<Status, num_keys> statuses;
|
||||
|
||||
db_->MultiGetEntity(ReadOptions(), db_->DefaultColumnFamily(), num_keys,
|
||||
keys.data(), results.data(), statuses.data());
|
||||
ASSERT_OK(statuses[0]);
|
||||
ASSERT_EQ(results[0].columns(), first_expected);
|
||||
ASSERT_OK(statuses[1]);
|
||||
ASSERT_EQ(results[1].columns(), second_expected);
|
||||
ASSERT_OK(statuses[2]);
|
||||
ASSERT_EQ(results[2].columns(), third_expected);
|
||||
}
|
||||
|
||||
// Iterator
|
||||
{
|
||||
std::unique_ptr<Iterator> iter(db_->NewIterator(ReadOptions()));
|
||||
|
||||
iter->SeekToFirst();
|
||||
ASSERT_TRUE(iter->Valid());
|
||||
ASSERT_OK(iter->status());
|
||||
ASSERT_EQ(iter->key(), first_key);
|
||||
ASSERT_EQ(iter->value(), first_expected[0].value());
|
||||
ASSERT_EQ(iter->columns(), first_expected);
|
||||
|
||||
iter->Next();
|
||||
ASSERT_TRUE(iter->Valid());
|
||||
ASSERT_OK(iter->status());
|
||||
ASSERT_EQ(iter->key(), second_key);
|
||||
ASSERT_EQ(iter->value(), second_expected[0].value());
|
||||
ASSERT_EQ(iter->columns(), second_expected);
|
||||
|
||||
iter->Next();
|
||||
ASSERT_TRUE(iter->Valid());
|
||||
ASSERT_OK(iter->status());
|
||||
ASSERT_EQ(iter->key(), third_key);
|
||||
ASSERT_EQ(iter->value(), third_expected[0].value());
|
||||
ASSERT_EQ(iter->columns(), third_expected);
|
||||
|
||||
iter->Next();
|
||||
ASSERT_FALSE(iter->Valid());
|
||||
ASSERT_OK(iter->status());
|
||||
|
||||
iter->SeekToLast();
|
||||
ASSERT_TRUE(iter->Valid());
|
||||
ASSERT_OK(iter->status());
|
||||
ASSERT_EQ(iter->key(), third_key);
|
||||
ASSERT_EQ(iter->value(), third_expected[0].value());
|
||||
ASSERT_EQ(iter->columns(), third_expected);
|
||||
|
||||
iter->Prev();
|
||||
ASSERT_TRUE(iter->Valid());
|
||||
ASSERT_OK(iter->status());
|
||||
ASSERT_EQ(iter->key(), second_key);
|
||||
ASSERT_EQ(iter->value(), second_expected[0].value());
|
||||
ASSERT_EQ(iter->columns(), second_expected);
|
||||
|
||||
iter->Prev();
|
||||
ASSERT_TRUE(iter->Valid());
|
||||
ASSERT_OK(iter->status());
|
||||
ASSERT_EQ(iter->key(), first_key);
|
||||
ASSERT_EQ(iter->value(), first_expected[0].value());
|
||||
ASSERT_EQ(iter->columns(), first_expected);
|
||||
|
||||
iter->Prev();
|
||||
ASSERT_FALSE(iter->Valid());
|
||||
ASSERT_OK(iter->status());
|
||||
}
|
||||
}
|
||||
|
||||
void VerifyMergeOperandCount(const Slice& key, int expected_merge_ops) {
|
||||
GetMergeOperandsOptions get_merge_opts;
|
||||
get_merge_opts.expected_max_number_of_operands = expected_merge_ops;
|
||||
|
||||
std::vector<PinnableSlice> merge_operands(expected_merge_ops);
|
||||
int number_of_operands = 0;
|
||||
|
||||
ASSERT_OK(db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(),
|
||||
key, merge_operands.data(), &get_merge_opts,
|
||||
&number_of_operands));
|
||||
ASSERT_EQ(number_of_operands, expected_merge_ops);
|
||||
}
|
||||
|
||||
std::vector<ManagedSnapshot> snapshots_;
|
||||
|
||||
static constexpr size_t num_keys = 3;
|
||||
|
||||
static constexpr char first_key[] = "first";
|
||||
static constexpr char first_merge_op1[] = "hello";
|
||||
static constexpr char first_merge_op1_upper[] = "HELLO";
|
||||
static constexpr char first_merge_op2[] = "world";
|
||||
static constexpr char first_merge_op2_upper[] = "WORLD";
|
||||
|
||||
static constexpr char second_key[] = "second";
|
||||
static constexpr char second_base_value[] = "foo";
|
||||
static constexpr char second_base_value_upper[] = "FOO";
|
||||
static constexpr char second_merge_op1[] = "bar";
|
||||
static constexpr char second_merge_op1_upper[] = "BAR";
|
||||
static constexpr char second_merge_op2[] = "baz";
|
||||
static constexpr char second_merge_op2_upper[] = "BAZ";
|
||||
|
||||
static constexpr char third_key[] = "third";
|
||||
static const WideColumns third_columns;
|
||||
static constexpr char third_merge_op1[] = "three";
|
||||
static constexpr char third_merge_op1_upper[] = "THREE";
|
||||
static constexpr char third_merge_op2[] = "four";
|
||||
static constexpr char third_merge_op2_upper[] = "FOUR";
|
||||
};
|
||||
|
||||
const WideColumns DBWideMergeV3Test::third_columns{{"one", "ONE"},
|
||||
{"two", "TWO"}};
|
||||
|
||||
TEST_F(DBWideMergeV3Test, MergeV3WideColumnOutput) {
|
||||
// A test merge operator that always returns a wide-column result. It adds any
|
||||
// base values and merge operands to a single wide-column entity, and converts
|
||||
// all column values to uppercase. In addition, it puts "none", "plain", or
|
||||
// "wide" into the value of the default column depending on the type of the
|
||||
// base value (if any).
|
||||
static constexpr char kNone[] = "none";
|
||||
static constexpr char kPlain[] = "plain";
|
||||
static constexpr char kWide[] = "wide";
|
||||
|
||||
class WideColumnOutputMergeOperator : public MergeOperator {
|
||||
public:
|
||||
bool FullMergeV3(const MergeOperationInputV3& merge_in,
|
||||
MergeOperationOutputV3* merge_out) const override {
|
||||
assert(merge_out);
|
||||
|
||||
merge_out->new_value = MergeOperationOutputV3::NewColumns();
|
||||
auto& new_columns =
|
||||
std::get<MergeOperationOutputV3::NewColumns>(merge_out->new_value);
|
||||
|
||||
auto upper = [](std::string str) {
|
||||
for (char& c : str) {
|
||||
c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
|
||||
}
|
||||
|
||||
return str;
|
||||
};
|
||||
|
||||
std::visit(overload{[&](const std::monostate&) {
|
||||
new_columns.emplace_back(
|
||||
kDefaultWideColumnName.ToString(), kNone);
|
||||
},
|
||||
[&](const Slice& value) {
|
||||
new_columns.emplace_back(
|
||||
kDefaultWideColumnName.ToString(), kPlain);
|
||||
|
||||
const std::string val = value.ToString();
|
||||
new_columns.emplace_back(val, upper(val));
|
||||
},
|
||||
[&](const WideColumns& columns) {
|
||||
new_columns.emplace_back(
|
||||
kDefaultWideColumnName.ToString(), kWide);
|
||||
|
||||
for (const auto& column : columns) {
|
||||
new_columns.emplace_back(
|
||||
column.name().ToString(),
|
||||
upper(column.value().ToString()));
|
||||
}
|
||||
}},
|
||||
merge_in.existing_value);
|
||||
|
||||
for (const auto& operand : merge_in.operand_list) {
|
||||
const std::string op = operand.ToString();
|
||||
new_columns.emplace_back(op, upper(op));
|
||||
}
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
const char* Name() const override {
|
||||
return "WideColumnOutputMergeOperator";
|
||||
}
|
||||
};
|
||||
|
||||
Options options = GetDefaultOptions();
|
||||
options.create_if_missing = true;
|
||||
options.merge_operator = std::make_shared<WideColumnOutputMergeOperator>();
|
||||
Reopen(options);
|
||||
|
||||
// Expected results
|
||||
// Lexicographical order: [default] < hello < world
|
||||
const WideColumns first_expected{{kDefaultWideColumnName, kNone},
|
||||
{first_merge_op1, first_merge_op1_upper},
|
||||
{first_merge_op2, first_merge_op2_upper}};
|
||||
// Lexicographical order: [default] < bar < baz < foo
|
||||
const WideColumns second_expected{
|
||||
{kDefaultWideColumnName, kPlain},
|
||||
{second_merge_op1, second_merge_op1_upper},
|
||||
{second_merge_op2, second_merge_op2_upper},
|
||||
{second_base_value, second_base_value_upper}};
|
||||
// Lexicographical order: [default] < four < one < three < two
|
||||
const WideColumns third_expected{
|
||||
{kDefaultWideColumnName, kWide},
|
||||
{third_merge_op2, third_merge_op2_upper},
|
||||
{third_columns[0].name(), third_columns[0].value()},
|
||||
{third_merge_op1, third_merge_op1_upper},
|
||||
{third_columns[1].name(), third_columns[1].value()}};
|
||||
|
||||
RunTest(first_expected, second_expected, third_expected);
|
||||
}
|
||||
|
||||
TEST_F(DBWideMergeV3Test, MergeV3PlainOutput) {
|
||||
// A test merge operator that always returns a plain value as result, namely
|
||||
// the total number of operands serialized as a string. Base values are also
|
||||
// counted as operands; specifically, a plain base value is counted as one
|
||||
// operand, while a wide-column base value is counted as as many operands as
|
||||
// the number of columns.
|
||||
class PlainOutputMergeOperator : public MergeOperator {
|
||||
public:
|
||||
bool FullMergeV3(const MergeOperationInputV3& merge_in,
|
||||
MergeOperationOutputV3* merge_out) const override {
|
||||
assert(merge_out);
|
||||
|
||||
size_t count = 0;
|
||||
std::visit(
|
||||
overload{[&](const std::monostate&) {},
|
||||
[&](const Slice&) { count = 1; },
|
||||
[&](const WideColumns& columns) { count = columns.size(); }},
|
||||
merge_in.existing_value);
|
||||
|
||||
count += merge_in.operand_list.size();
|
||||
|
||||
merge_out->new_value = std::string();
|
||||
std::get<std::string>(merge_out->new_value) = std::to_string(count);
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
const char* Name() const override { return "PlainOutputMergeOperator"; }
|
||||
};
|
||||
|
||||
Options options = GetDefaultOptions();
|
||||
options.create_if_missing = true;
|
||||
options.merge_operator = std::make_shared<PlainOutputMergeOperator>();
|
||||
Reopen(options);
|
||||
|
||||
const WideColumns first_expected{{kDefaultWideColumnName, "2"}};
|
||||
const WideColumns second_expected{{kDefaultWideColumnName, "3"}};
|
||||
const WideColumns third_expected{{kDefaultWideColumnName, "4"}};
|
||||
|
||||
RunTest(first_expected, second_expected, third_expected);
|
||||
}
|
||||
|
||||
TEST_F(DBWideBasicTest, CompactionFilter) {
|
||||
Options options = GetDefaultOptions();
|
||||
options.create_if_missing = true;
|
||||
|
|
|
@ -16,11 +16,9 @@
|
|||
|
||||
namespace ROCKSDB_NAMESPACE {
|
||||
|
||||
Status WideColumnSerialization::SerializeImpl(const Slice* value_of_default,
|
||||
const WideColumns& columns,
|
||||
std::string& output) {
|
||||
const size_t num_columns =
|
||||
value_of_default ? columns.size() + 1 : columns.size();
|
||||
Status WideColumnSerialization::Serialize(const WideColumns& columns,
|
||||
std::string& output) {
|
||||
const size_t num_columns = columns.size();
|
||||
|
||||
if (num_columns > static_cast<size_t>(std::numeric_limits<uint32_t>::max())) {
|
||||
return Status::InvalidArgument("Too many wide columns");
|
||||
|
@ -31,17 +29,6 @@ Status WideColumnSerialization::SerializeImpl(const Slice* value_of_default,
|
|||
PutVarint32(&output, static_cast<uint32_t>(num_columns));
|
||||
|
||||
const Slice* prev_name = nullptr;
|
||||
if (value_of_default) {
|
||||
if (value_of_default->size() >
|
||||
static_cast<size_t>(std::numeric_limits<uint32_t>::max())) {
|
||||
return Status::InvalidArgument("Wide column value too long");
|
||||
}
|
||||
|
||||
PutLengthPrefixedSlice(&output, kDefaultWideColumnName);
|
||||
PutVarint32(&output, static_cast<uint32_t>(value_of_default->size()));
|
||||
|
||||
prev_name = &kDefaultWideColumnName;
|
||||
}
|
||||
|
||||
for (size_t i = 0; i < columns.size(); ++i) {
|
||||
const WideColumn& column = columns[i];
|
||||
|
@ -68,10 +55,6 @@ Status WideColumnSerialization::SerializeImpl(const Slice* value_of_default,
|
|||
prev_name = &name;
|
||||
}
|
||||
|
||||
if (value_of_default) {
|
||||
output.append(value_of_default->data(), value_of_default->size());
|
||||
}
|
||||
|
||||
for (const auto& column : columns) {
|
||||
const Slice& value = column.value();
|
||||
|
||||
|
|
|
@ -44,9 +44,6 @@ class Slice;
|
|||
class WideColumnSerialization {
|
||||
public:
|
||||
static Status Serialize(const WideColumns& columns, std::string& output);
|
||||
static Status Serialize(const Slice& value_of_default,
|
||||
const WideColumns& other_columns,
|
||||
std::string& output);
|
||||
|
||||
static Status Deserialize(Slice& input, WideColumns& columns);
|
||||
|
||||
|
@ -55,23 +52,6 @@ class WideColumnSerialization {
|
|||
static Status GetValueOfDefaultColumn(Slice& input, Slice& value);
|
||||
|
||||
static constexpr uint32_t kCurrentVersion = 1;
|
||||
|
||||
private:
|
||||
static Status SerializeImpl(const Slice* value_of_default,
|
||||
const WideColumns& columns, std::string& output);
|
||||
};
|
||||
|
||||
inline Status WideColumnSerialization::Serialize(const WideColumns& columns,
|
||||
std::string& output) {
|
||||
constexpr Slice* value_of_default = nullptr;
|
||||
|
||||
return SerializeImpl(value_of_default, columns, output);
|
||||
}
|
||||
|
||||
inline Status WideColumnSerialization::Serialize(
|
||||
const Slice& value_of_default, const WideColumns& other_columns,
|
||||
std::string& output) {
|
||||
return SerializeImpl(&value_of_default, other_columns, output);
|
||||
}
|
||||
|
||||
} // namespace ROCKSDB_NAMESPACE
|
||||
|
|
|
@ -124,25 +124,6 @@ TEST(WideColumnSerializationTest, SerializeDeserialize) {
|
|||
}
|
||||
}
|
||||
|
||||
TEST(WideColumnSerializationTest, SerializeWithPrepend) {
|
||||
Slice value_of_default("baz");
|
||||
WideColumns other_columns{{"foo", "bar"}, {"hello", "world"}};
|
||||
|
||||
std::string output;
|
||||
ASSERT_OK(WideColumnSerialization::Serialize(value_of_default, other_columns,
|
||||
output));
|
||||
|
||||
Slice input(output);
|
||||
|
||||
WideColumns deserialized_columns;
|
||||
ASSERT_OK(WideColumnSerialization::Deserialize(input, deserialized_columns));
|
||||
|
||||
WideColumns expected_columns{{kDefaultWideColumnName, value_of_default},
|
||||
other_columns[0],
|
||||
other_columns[1]};
|
||||
ASSERT_EQ(deserialized_columns, expected_columns);
|
||||
}
|
||||
|
||||
TEST(WideColumnSerializationTest, SerializeDuplicateError) {
|
||||
WideColumns columns{{"foo", "bar"}, {"foo", "baz"}};
|
||||
std::string output;
|
||||
|
@ -151,16 +132,6 @@ TEST(WideColumnSerializationTest, SerializeDuplicateError) {
|
|||
WideColumnSerialization::Serialize(columns, output).IsCorruption());
|
||||
}
|
||||
|
||||
TEST(WideColumnSerializationTest, SerializeWithPrependDuplicateError) {
|
||||
Slice value_of_default("baz");
|
||||
WideColumns other_columns{{kDefaultWideColumnName, "dup"}, {"foo", "bar"}};
|
||||
|
||||
std::string output;
|
||||
ASSERT_TRUE(WideColumnSerialization::Serialize(value_of_default,
|
||||
other_columns, output)
|
||||
.IsCorruption());
|
||||
}
|
||||
|
||||
TEST(WideColumnSerializationTest, SerializeOutOfOrderError) {
|
||||
WideColumns columns{{"hello", "world"}, {"foo", "bar"}};
|
||||
std::string output;
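With the prepend-style overload removed above, a default-column value is passed as an ordinary entry of the WideColumns list (first, since the default column name sorts before the others used here). A hedged usage sketch of the remaining Serialize/Deserialize pair, assuming RocksDB's internal db/wide headers:

#include <cassert>
#include <string>

#include "db/wide/wide_column_serialization.h"
#include "rocksdb/wide_columns.h"

using namespace ROCKSDB_NAMESPACE;

void SerializeRoundTripSketch() {
  // The default column leads the list because its name sorts first here.
  WideColumns columns{{kDefaultWideColumnName, "baz"},
                      {"foo", "bar"},
                      {"hello", "world"}};

  std::string output;
  Status s = WideColumnSerialization::Serialize(columns, output);
  assert(s.ok());

  Slice input(output);
  WideColumns deserialized;
  s = WideColumnSerialization::Deserialize(input, deserialized);
  assert(s.ok());
  assert(deserialized == columns);
}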
|
@ -24,6 +24,11 @@ class WideColumnsHelper {
|
|||
return !columns.empty() && columns.front().name() == kDefaultWideColumnName;
|
||||
}
|
||||
|
||||
static bool HasDefaultColumnOnly(const WideColumns& columns) {
|
||||
return columns.size() == 1 &&
|
||||
columns.front().name() == kDefaultWideColumnName;
|
||||
}
|
||||
|
||||
static const Slice& GetDefaultColumn(const WideColumns& columns) {
|
||||
assert(HasDefaultColumn(columns));
|
||||
return columns.front().value();
|
||||
|
|
|
@ -2483,13 +2483,15 @@ class MemTableInserter : public WriteBatch::Handler {
|
|||
}
|
||||
|
||||
if (perform_merge) {
|
||||
// 1) Get the existing value
|
||||
std::string get_value;
|
||||
// 1) Get the existing value. Use the wide column APIs to make sure we
|
||||
// don't lose any columns in the process.
|
||||
PinnableWideColumns existing;
|
||||
|
||||
// Pass in the sequence number so that we also include previous merge
|
||||
// operations in the same batch.
|
||||
SnapshotImpl read_from_snapshot;
|
||||
read_from_snapshot.number_ = sequence_;
|
||||
|
||||
// TODO: plumb Env::IOActivity
|
||||
ReadOptions read_options;
|
||||
read_options.snapshot = &read_from_snapshot;
|
||||
|
@ -2498,26 +2500,47 @@ class MemTableInserter : public WriteBatch::Handler {
|
|||
if (cf_handle == nullptr) {
|
||||
cf_handle = db_->DefaultColumnFamily();
|
||||
}
|
||||
Status get_status = db_->Get(read_options, cf_handle, key, &get_value);
|
||||
|
||||
Status get_status =
|
||||
db_->GetEntity(read_options, cf_handle, key, &existing);
|
||||
if (!get_status.ok()) {
|
||||
// Failed to read a key we know exists. Store the delta in memtable.
|
||||
perform_merge = false;
|
||||
} else {
|
||||
Slice get_value_slice = Slice(get_value);
|
||||
|
||||
// 2) Apply this merge
|
||||
auto merge_operator = moptions->merge_operator;
|
||||
assert(merge_operator);
|
||||
|
||||
const auto& columns = existing.columns();
|
||||
|
||||
Status merge_status;
|
||||
std::string new_value;
|
||||
// `op_failure_scope` (an output parameter) is not provided (set to
|
||||
// nullptr) since a failure must be propagated regardless of its value.
|
||||
Status merge_status = MergeHelper::TimedFullMerge(
|
||||
merge_operator, key, &get_value_slice, {value}, &new_value,
|
||||
moptions->info_log, moptions->statistics,
|
||||
SystemClock::Default().get(), /* result_operand */ nullptr,
|
||||
/* update_num_ops_stats */ false,
|
||||
/* op_failure_scope */ nullptr);
|
||||
ValueType new_value_type;
|
||||
|
||||
if (WideColumnsHelper::HasDefaultColumnOnly(columns)) {
|
||||
// `op_failure_scope` (an output parameter) is not provided (set to
|
||||
// nullptr) since a failure must be propagated regardless of its
|
||||
// value.
|
||||
merge_status = MergeHelper::TimedFullMerge(
|
||||
merge_operator, key, MergeHelper::kPlainBaseValue,
|
||||
WideColumnsHelper::GetDefaultColumn(columns), {value},
|
||||
moptions->info_log, moptions->statistics,
|
||||
SystemClock::Default().get(),
|
||||
/* update_num_ops_stats */ false, &new_value,
|
||||
/* result_operand */ nullptr, &new_value_type,
|
||||
/* op_failure_scope */ nullptr);
|
||||
} else {
|
||||
// `op_failure_scope` (an output parameter) is not provided (set to
|
||||
// nullptr) since a failure must be propagated regardless of its
|
||||
// value.
|
||||
merge_status = MergeHelper::TimedFullMerge(
|
||||
merge_operator, key, MergeHelper::kWideBaseValue, columns,
|
||||
{value}, moptions->info_log, moptions->statistics,
|
||||
SystemClock::Default().get(),
|
||||
/* update_num_ops_stats */ false, &new_value,
|
||||
/* result_operand */ nullptr, &new_value_type,
|
||||
/* op_failure_scope */ nullptr);
|
||||
}
|
||||
|
||||
if (!merge_status.ok()) {
|
||||
// Failed to merge!
|
||||
|
@ -2526,15 +2549,18 @@ class MemTableInserter : public WriteBatch::Handler {
|
|||
} else {
|
||||
// 3) Add value to memtable
|
||||
assert(!concurrent_memtable_writes_);
|
||||
assert(new_value_type == kTypeValue ||
|
||||
new_value_type == kTypeWideColumnEntity);
|
||||
|
||||
if (kv_prot_info != nullptr) {
|
||||
auto merged_kv_prot_info =
|
||||
kv_prot_info->StripC(column_family_id).ProtectS(sequence_);
|
||||
merged_kv_prot_info.UpdateV(value, new_value);
|
||||
merged_kv_prot_info.UpdateO(kTypeMerge, kTypeValue);
|
||||
ret_status = mem->Add(sequence_, kTypeValue, key, new_value,
|
||||
merged_kv_prot_info.UpdateO(kTypeMerge, new_value_type);
|
||||
ret_status = mem->Add(sequence_, new_value_type, key, new_value,
|
||||
&merged_kv_prot_info);
|
||||
} else {
|
||||
ret_status = mem->Add(sequence_, kTypeValue, key, new_value,
|
||||
ret_status = mem->Add(sequence_, new_value_type, key, new_value,
|
||||
nullptr /* kv_prot_info */);
|
||||
}
|
||||
}
|
||||
|
|
|
@ -9,6 +9,7 @@ add_executable(db_stress${ARTIFACT_SUFFIX}
|
|||
db_stress_shared_state.cc
|
||||
db_stress_stat.cc
|
||||
db_stress_test_base.cc
|
||||
db_stress_wide_merge_operator.cc
|
||||
db_stress_tool.cc
|
||||
expected_state.cc
|
||||
expected_value.cc
|
||||
|
|
|
@ -52,11 +52,11 @@ class BatchedOpsStressTest : public StressTest {
|
|||
const std::string k = num + key_body;
|
||||
const std::string v = value_body + num;
|
||||
|
||||
if (FLAGS_use_merge) {
|
||||
batch.Merge(cfh, k, v);
|
||||
} else if (FLAGS_use_put_entity_one_in > 0 &&
|
||||
(value_base % FLAGS_use_put_entity_one_in) == 0) {
|
||||
if (FLAGS_use_put_entity_one_in > 0 &&
|
||||
(value_base % FLAGS_use_put_entity_one_in) == 0) {
|
||||
batch.PutEntity(cfh, k, GenerateWideColumns(value_base, v));
|
||||
} else if (FLAGS_use_merge) {
|
||||
batch.Merge(cfh, k, v);
|
||||
} else {
|
||||
batch.Put(cfh, k, v);
|
||||
}
|
||||
|
|
|
@ -36,18 +36,15 @@ class CfConsistencyStressTest : public StressTest {
|
|||
|
||||
WriteBatch batch;
|
||||
|
||||
const bool use_put_entity = !FLAGS_use_merge &&
|
||||
FLAGS_use_put_entity_one_in > 0 &&
|
||||
(value_base % FLAGS_use_put_entity_one_in) == 0;
|
||||
|
||||
for (auto cf : rand_column_families) {
|
||||
ColumnFamilyHandle* const cfh = column_families_[cf];
|
||||
assert(cfh);
|
||||
|
||||
if (FLAGS_use_merge) {
|
||||
batch.Merge(cfh, k, v);
|
||||
} else if (use_put_entity) {
|
||||
if (FLAGS_use_put_entity_one_in > 0 &&
|
||||
(value_base % FLAGS_use_put_entity_one_in) == 0) {
|
||||
batch.PutEntity(cfh, k, GenerateWideColumns(value_base, v));
|
||||
} else if (FLAGS_use_merge) {
|
||||
batch.Merge(cfh, k, v);
|
||||
} else {
|
||||
batch.Put(cfh, k, v);
|
||||
}
|
||||
|
|
|
@ -17,6 +17,7 @@
|
|||
#include "db_stress_tool/db_stress_compaction_filter.h"
|
||||
#include "db_stress_tool/db_stress_driver.h"
|
||||
#include "db_stress_tool/db_stress_table_properties_collector.h"
|
||||
#include "db_stress_tool/db_stress_wide_merge_operator.h"
|
||||
#include "rocksdb/convenience.h"
|
||||
#include "rocksdb/filter_policy.h"
|
||||
#include "rocksdb/secondary_cache.h"
|
||||
|
@ -511,7 +512,11 @@ void StressTest::PreloadDbAndReopenAsReadOnly(int64_t number_of_keys,
|
|||
ts = GetNowNanos();
|
||||
}
|
||||
|
||||
if (FLAGS_use_merge) {
|
||||
if (FLAGS_use_put_entity_one_in > 0 &&
|
||||
(value_base % FLAGS_use_put_entity_one_in) == 0) {
|
||||
s = db_->PutEntity(write_opts, cfh, key,
|
||||
GenerateWideColumns(value_base, v));
|
||||
} else if (FLAGS_use_merge) {
|
||||
if (!FLAGS_use_txn) {
|
||||
if (FLAGS_user_timestamp_size > 0) {
|
||||
s = db_->Merge(write_opts, cfh, key, ts, v);
|
||||
|
@ -523,9 +528,6 @@ void StressTest::PreloadDbAndReopenAsReadOnly(int64_t number_of_keys,
|
|||
write_opts, /*thread=*/nullptr,
|
||||
[&](Transaction& txn) { return txn.Merge(cfh, key, v); });
|
||||
}
|
||||
} else if (FLAGS_use_put_entity_one_in > 0) {
|
||||
s = db_->PutEntity(write_opts, cfh, key,
|
||||
GenerateWideColumns(value_base, v));
|
||||
} else {
|
||||
if (!FLAGS_use_txn) {
|
||||
if (FLAGS_user_timestamp_size > 0) {
|
||||
|
@ -1235,7 +1237,6 @@ Status StressTest::TestIterate(ThreadState* thread,
|
|||
} else if (options_.prefix_extractor.get() == nullptr) {
|
||||
expect_total_order = true;
|
||||
}
|
||||
|
||||
std::string upper_bound_str;
|
||||
Slice upper_bound;
|
||||
if (thread->rand.OneIn(16)) {
|
||||
|
@ -1246,6 +1247,7 @@ Status StressTest::TestIterate(ThreadState* thread,
|
|||
upper_bound = Slice(upper_bound_str);
|
||||
ro.iterate_upper_bound = &upper_bound;
|
||||
}
|
||||
|
||||
std::string lower_bound_str;
|
||||
Slice lower_bound;
|
||||
if (thread->rand.OneIn(16)) {
|
||||
|
@ -1563,7 +1565,8 @@ void StressTest::VerifyIterator(ThreadState* thread,
|
|||
fprintf(stderr, "iterator has value %s\n",
|
||||
iter->key().ToString(true).c_str());
|
||||
} else {
|
||||
fprintf(stderr, "iterator is not valid\n");
|
||||
fprintf(stderr, "iterator is not valid with status: %s\n",
|
||||
iter->status().ToString().c_str());
|
||||
}
|
||||
*diverged = true;
|
||||
}
|
||||
|
@ -2693,7 +2696,9 @@ void StressTest::Open(SharedState* shared, bool reopen) {
|
|||
|
||||
// If this is for DB reopen, write error injection may have been enabled.
|
||||
// Disable it here in case there is no open fault injection.
|
||||
fault_fs_guard->DisableWriteErrorInjection();
|
||||
if (fault_fs_guard) {
|
||||
fault_fs_guard->DisableWriteErrorInjection();
|
||||
}
|
||||
if (!FLAGS_use_txn) {
|
||||
// Determine whether we need to inject file metadata write failures
|
||||
// during DB reopen. If it does, enable it.
|
||||
|
@ -2752,8 +2757,7 @@ void StressTest::Open(SharedState* shared, bool reopen) {
|
|||
if (s.ok()) {
|
||||
db_ = blob_db;
|
||||
}
|
||||
} else
|
||||
{
|
||||
} else {
|
||||
if (db_preload_finished_.load() && FLAGS_read_only) {
|
||||
s = DB::OpenForReadOnly(DBOptions(options_), FLAGS_db,
|
||||
cf_descriptors, &column_families_, &db_);
|
||||
|
@ -3334,7 +3338,11 @@ void InitializeOptionsFromFlags(
|
|||
if (FLAGS_use_full_merge_v1) {
|
||||
options.merge_operator = MergeOperators::CreateDeprecatedPutOperator();
|
||||
} else {
|
||||
options.merge_operator = MergeOperators::CreatePutOperator();
|
||||
if (FLAGS_use_put_entity_one_in > 0) {
|
||||
options.merge_operator = std::make_shared<DBStressWideMergeOperator>();
|
||||
} else {
|
||||
options.merge_operator = MergeOperators::CreatePutOperator();
|
||||
}
|
||||
}
|
||||
|
||||
if (FLAGS_enable_compaction_filter) {
|
||||
|
|
|
@ -88,6 +88,11 @@ int db_stress_tool(int argc, char** argv) {
|
|||
FaultInjectionTestFS* fs =
|
||||
new FaultInjectionTestFS(raw_env->GetFileSystem());
|
||||
fault_fs_guard.reset(fs);
|
||||
if (FLAGS_write_fault_one_in) {
|
||||
fault_fs_guard->SetFilesystemDirectWritable(false);
|
||||
} else {
|
||||
fault_fs_guard->SetFilesystemDirectWritable(true);
|
||||
}
|
||||
fault_env_guard =
|
||||
std::make_shared<CompositeEnvWrapper>(raw_env, fault_fs_guard);
|
||||
raw_env = fault_env_guard.get();
|
||||
|
@ -303,11 +308,11 @@ int db_stress_tool(int argc, char** argv) {
|
|||
}
|
||||
|
||||
if (FLAGS_use_put_entity_one_in > 0 &&
|
||||
(FLAGS_use_merge || FLAGS_use_full_merge_v1 || FLAGS_use_txn ||
|
||||
FLAGS_test_multi_ops_txns || FLAGS_user_timestamp_size > 0)) {
|
||||
(FLAGS_use_full_merge_v1 || FLAGS_use_txn || FLAGS_test_multi_ops_txns ||
|
||||
FLAGS_user_timestamp_size > 0)) {
|
||||
fprintf(stderr,
|
||||
"PutEntity is currently incompatible with Merge,"
|
||||
" transactions, and user-defined timestamps\n");
|
||||
"Wide columns are incompatible with V1 Merge, transactions, and "
|
||||
"user-defined timestamps\n");
|
||||
exit(1);
|
||||
}
|
||||
|
||||
|
|
|
@ -0,0 +1,51 @@
|
|||
// Copyright (c) Meta Platforms, Inc. and affiliates.
|
||||
// This source code is licensed under both the GPLv2 (found in the
|
||||
// COPYING file in the root directory) and Apache 2.0 License
|
||||
// (found in the LICENSE.Apache file in the root directory).
|
||||
|
||||
#ifdef GFLAGS
|
||||
|
||||
#include "db_stress_tool/db_stress_wide_merge_operator.h"
|
||||
|
||||
#include "db_stress_tool/db_stress_common.h"
|
||||
|
||||
namespace ROCKSDB_NAMESPACE {
|
||||
|
||||
bool DBStressWideMergeOperator::FullMergeV3(
|
||||
const MergeOperationInputV3& merge_in,
|
||||
MergeOperationOutputV3* merge_out) const {
|
||||
assert(!merge_in.operand_list.empty());
|
||||
assert(merge_out);
|
||||
|
||||
const Slice& latest = merge_in.operand_list.back();
|
||||
|
||||
if (latest.size() < sizeof(uint32_t)) {
|
||||
return false;
|
||||
}
|
||||
|
||||
const uint32_t value_base = GetValueBase(latest);
|
||||
|
||||
if (FLAGS_use_put_entity_one_in == 0 ||
|
||||
(value_base % FLAGS_use_put_entity_one_in) != 0) {
|
||||
merge_out->new_value = latest;
|
||||
return true;
|
||||
}
|
||||
|
||||
const auto columns = GenerateWideColumns(value_base, latest);
|
||||
|
||||
merge_out->new_value = MergeOperationOutputV3::NewColumns();
|
||||
auto& new_columns =
|
||||
std::get<MergeOperationOutputV3::NewColumns>(merge_out->new_value);
|
||||
new_columns.reserve(columns.size());
|
||||
|
||||
for (const auto& column : columns) {
|
||||
new_columns.emplace_back(column.name().ToString(),
|
||||
column.value().ToString());
|
||||
}
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
} // namespace ROCKSDB_NAMESPACE
|
||||
|
||||
#endif // GFLAGS
|
|
@ -0,0 +1,27 @@
|
|||
// Copyright (c) Meta Platforms, Inc. and affiliates.
|
||||
// This source code is licensed under both the GPLv2 (found in the
|
||||
// COPYING file in the root directory) and Apache 2.0 License
|
||||
// (found in the LICENSE.Apache file in the root directory).
|
||||
|
||||
#pragma once
|
||||
|
||||
#include "rocksdb/merge_operator.h"
|
||||
|
||||
namespace ROCKSDB_NAMESPACE {
|
||||
|
||||
// A test merge operator that implements the wide-column aware FullMergeV3
|
||||
// interface. Similarly to the simple "put" type merge operators, the merge
|
||||
// result is based on the last merge operand; however, the merge result can
|
||||
// potentially be a wide-column entity, depending on the value base encoded into
|
||||
// the merge operand and the value of the "use_put_entity_one_in" stress test
|
||||
// option. Following the same rule as for writes ensures that the queries
|
||||
// issued by the validation logic receive the expected results.
|
||||
class DBStressWideMergeOperator : public MergeOperator {
|
||||
public:
|
||||
bool FullMergeV3(const MergeOperationInputV3& merge_in,
|
||||
MergeOperationOutputV3* merge_out) const override;
|
||||
|
||||
const char* Name() const override { return "DBStressWideMergeOperator"; }
|
||||
};
|
||||
|
||||
} // namespace ROCKSDB_NAMESPACE
|
|
@ -1279,7 +1279,11 @@ class NonBatchedOpsStressTest : public StressTest {
|
|||
|
||||
Status s;
|
||||
|
||||
if (FLAGS_use_merge) {
|
||||
if (FLAGS_use_put_entity_one_in > 0 &&
|
||||
(value_base % FLAGS_use_put_entity_one_in) == 0) {
|
||||
s = db_->PutEntity(write_opts, cfh, k,
|
||||
GenerateWideColumns(value_base, v));
|
||||
} else if (FLAGS_use_merge) {
|
||||
if (!FLAGS_use_txn) {
|
||||
if (FLAGS_user_timestamp_size == 0) {
|
||||
s = db_->Merge(write_opts, cfh, k, v);
|
||||
|
@ -1291,10 +1295,6 @@ class NonBatchedOpsStressTest : public StressTest {
|
|||
return txn.Merge(cfh, k, v);
|
||||
});
|
||||
}
|
||||
} else if (FLAGS_use_put_entity_one_in > 0 &&
|
||||
(value_base % FLAGS_use_put_entity_one_in) == 0) {
|
||||
s = db_->PutEntity(write_opts, cfh, k,
|
||||
GenerateWideColumns(value_base, v));
|
||||
} else {
|
||||
if (!FLAGS_use_txn) {
|
||||
if (FLAGS_user_timestamp_size == 0) {
|
||||
|
@ -1542,11 +1542,8 @@ class NonBatchedOpsStressTest : public StressTest {
|
|||
const Slice k(key_str);
|
||||
const Slice v(value, value_len);
|
||||
|
||||
const bool use_put_entity =
|
||||
!FLAGS_use_merge && FLAGS_use_put_entity_one_in > 0 &&
|
||||
(value_base % FLAGS_use_put_entity_one_in) == 0;
|
||||
|
||||
if (use_put_entity) {
|
||||
if (FLAGS_use_put_entity_one_in > 0 &&
|
||||
(value_base % FLAGS_use_put_entity_one_in) == 0) {
|
||||
WideColumns columns = GenerateWideColumns(value_base, v);
|
||||
s = sst_file_writer.PutEntity(k, columns);
|
||||
} else {
|
||||
|
@ -2036,9 +2033,7 @@ class NonBatchedOpsStressTest : public StressTest {
|
|||
const Slice slice(value_from_db);
|
||||
const uint32_t value_base_from_db = GetValueBase(slice);
|
||||
if (ExpectedValueHelper::MustHaveNotExisted(expected_value,
|
||||
expected_value) ||
|
||||
!ExpectedValueHelper::InExpectedValueBaseRange(
|
||||
value_base_from_db, expected_value, expected_value)) {
|
||||
expected_value)) {
|
||||
VerificationAbort(shared, msg_prefix + ": Unexpected value found", cf,
|
||||
key, value_from_db, "");
|
||||
return false;
|
||||
|
@ -2047,6 +2042,14 @@ class NonBatchedOpsStressTest : public StressTest {
|
|||
size_t expected_value_data_size =
|
||||
GenerateValue(expected_value.GetValueBase(), expected_value_data,
|
||||
sizeof(expected_value_data));
|
||||
if (!ExpectedValueHelper::InExpectedValueBaseRange(
|
||||
value_base_from_db, expected_value, expected_value)) {
|
||||
VerificationAbort(shared, msg_prefix + ": Unexpected value found", cf,
|
||||
key, value_from_db,
|
||||
Slice(expected_value_data, expected_value_data_size));
|
||||
return false;
|
||||
}
|
||||
// TODO: are the length/memcmp() checks repetitive?
|
||||
if (value_from_db.length() != expected_value_data_size) {
|
||||
VerificationAbort(shared,
|
||||
msg_prefix + ": Length of value read is not equal",
|
||||
|
|
|
@ -364,8 +364,11 @@ Status FilePrefetchBuffer::HandleOverlappingData(
|
|||
size_t second_size = bufs_[second].async_read_in_progress_
|
||||
? bufs_[second].async_req_len_
|
||||
: bufs_[second].buffer_.CurrentSize();
|
||||
if (tmp_offset + tmp_length <= bufs_[second].offset_ + second_size) {
|
||||
uint64_t rounddown_start = bufs_[second].offset_ + second_size;
|
||||
uint64_t rounddown_start = bufs_[second].offset_ + second_size;
|
||||
// Second buffer might be out of bound if first buffer already prefetched
|
||||
// that data.
|
||||
if (tmp_offset + tmp_length <= bufs_[second].offset_ + second_size &&
|
||||
!IsOffsetOutOfBound(rounddown_start)) {
|
||||
uint64_t roundup_end =
|
||||
Roundup(rounddown_start + readahead_size, alignment);
|
||||
uint64_t roundup_len = roundup_end - rounddown_start;
|
||||
|
@ -562,20 +565,24 @@ Status FilePrefetchBuffer::PrefetchAsyncInternal(const IOOptions& opts,
|
|||
roundup_end2 = Roundup(rounddown_start2 + prefetch_size, alignment);
|
||||
}
|
||||
|
||||
uint64_t roundup_len2 = roundup_end2 - rounddown_start2;
|
||||
uint64_t chunk_len2 = 0;
|
||||
CalculateOffsetAndLen(alignment, rounddown_start2, roundup_len2, second,
|
||||
false /*refit_tail*/, chunk_len2);
|
||||
assert(chunk_len2 == 0);
|
||||
// Update the buffer offset.
|
||||
bufs_[second].offset_ = rounddown_start2;
|
||||
assert(roundup_len2 >= chunk_len2);
|
||||
uint64_t read_len2 = static_cast<size_t>(roundup_len2 - chunk_len2);
|
||||
s = ReadAsync(opts, reader, read_len2, rounddown_start2, second);
|
||||
if (!s.ok()) {
|
||||
DestroyAndClearIOHandle(second);
|
||||
bufs_[second].buffer_.Clear();
|
||||
return s;
|
||||
// Second buffer might be out of bound if first buffer already prefetched
|
||||
// that data.
|
||||
if (!IsOffsetOutOfBound(rounddown_start2)) {
|
||||
uint64_t roundup_len2 = roundup_end2 - rounddown_start2;
|
||||
uint64_t chunk_len2 = 0;
|
||||
CalculateOffsetAndLen(alignment, rounddown_start2, roundup_len2, second,
|
||||
false /*refit_tail*/, chunk_len2);
|
||||
assert(chunk_len2 == 0);
|
||||
// Update the buffer offset.
|
||||
bufs_[second].offset_ = rounddown_start2;
|
||||
assert(roundup_len2 >= chunk_len2);
|
||||
uint64_t read_len2 = static_cast<size_t>(roundup_len2 - chunk_len2);
|
||||
s = ReadAsync(opts, reader, read_len2, rounddown_start2, second);
|
||||
if (!s.ok()) {
|
||||
DestroyAndClearIOHandle(second);
|
||||
bufs_[second].buffer_.Clear();
|
||||
return s;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
|
@ -653,8 +660,8 @@ bool FilePrefetchBuffer::TryReadFromCacheUntracked(
|
|||
return false;
|
||||
}
|
||||
}
|
||||
UpdateReadAheadSizeForUpperBound(offset, n);
|
||||
s = Prefetch(opts, reader, offset, n + readahead_size_);
|
||||
size_t current_readahead_size = ReadAheadSizeTuning(offset, n);
|
||||
s = Prefetch(opts, reader, offset, n + current_readahead_size);
|
||||
}
|
||||
if (!s.ok()) {
|
||||
if (status) {
|
||||
|
@ -925,17 +932,22 @@ Status FilePrefetchBuffer::PrefetchAsync(const IOOptions& opts,
|
|||
rounddown_start2 = roundup_end1;
|
||||
}
|
||||
|
||||
roundup_end2 = Roundup(rounddown_start2 + prefetch_size, alignment);
|
||||
uint64_t roundup_len2 = roundup_end2 - rounddown_start2;
|
||||
// Second buffer might be out of bound if first buffer already prefetched
|
||||
// that data.
|
||||
if (!IsOffsetOutOfBound(rounddown_start2)) {
|
||||
roundup_end2 = Roundup(rounddown_start2 + prefetch_size, alignment);
|
||||
uint64_t roundup_len2 = roundup_end2 - rounddown_start2;
|
||||
|
||||
assert(roundup_len2 >= alignment);
|
||||
CalculateOffsetAndLen(alignment, rounddown_start2, roundup_len2, second,
|
||||
false, chunk_len2);
|
||||
assert(chunk_len2 == 0);
|
||||
assert(roundup_len2 >= chunk_len2);
|
||||
read_len2 = static_cast<size_t>(roundup_len2 - chunk_len2);
|
||||
// Update the buffer offset.
|
||||
bufs_[second].offset_ = rounddown_start2;
|
||||
assert(roundup_len2 >= alignment);
|
||||
|
||||
CalculateOffsetAndLen(alignment, rounddown_start2, roundup_len2, second,
|
||||
false, chunk_len2);
|
||||
assert(chunk_len2 == 0);
|
||||
assert(roundup_len2 >= chunk_len2);
|
||||
read_len2 = static_cast<size_t>(roundup_len2 - chunk_len2);
|
||||
// Update the buffer offset.
|
||||
bufs_[second].offset_ = rounddown_start2;
|
||||
}
|
||||
}
|
||||
|
||||
if (read_len1) {
|
||||
|
|
|
@ -90,6 +90,7 @@ class FilePrefetchBuffer {
|
|||
uint64_t num_file_reads_for_auto_readahead = 0,
|
||||
uint64_t upper_bound_offset = 0, FileSystem* fs = nullptr,
|
||||
SystemClock* clock = nullptr, Statistics* stats = nullptr,
|
||||
const std::function<void(uint64_t, size_t, size_t&)>& cb = nullptr,
|
||||
FilePrefetchBufferUsage usage = FilePrefetchBufferUsage::kUnknown)
|
||||
: curr_(0),
|
||||
readahead_size_(readahead_size),
|
||||
|
@ -108,7 +109,8 @@ class FilePrefetchBuffer {
|
|||
clock_(clock),
|
||||
stats_(stats),
|
||||
usage_(usage),
|
||||
upper_bound_offset_(upper_bound_offset) {
|
||||
upper_bound_offset_(upper_bound_offset),
|
||||
readaheadsize_cb_(cb) {
|
||||
assert((num_file_reads_ >= num_file_reads_for_auto_readahead_ + 1) ||
|
||||
(num_file_reads_ == 0));
|
||||
// If ReadOptions.async_io is enabled, data is asynchronously filled in
|
||||
|
@@ -441,6 +443,28 @@ class FilePrefetchBuffer {
    }
  }

  inline bool IsOffsetOutOfBound(uint64_t offset) {
    if (upper_bound_offset_ > 0) {
      return (offset >= upper_bound_offset_);
    }
    return false;
  }

  // Performs tuning to calculate readahead_size.
  size_t ReadAheadSizeTuning(uint64_t offset, size_t n) {
    UpdateReadAheadSizeForUpperBound(offset, n);

    if (readaheadsize_cb_ != nullptr && readahead_size_ > 0) {
      size_t updated_readahead_size = 0;
      readaheadsize_cb_(offset, readahead_size_, updated_readahead_size);
      if (readahead_size_ != updated_readahead_size) {
        RecordTick(stats_, READAHEAD_TRIMMED);
      }
      return updated_readahead_size;
    }
    return readahead_size_;
  }

  std::vector<BufferInfo> bufs_;
  // curr_ represents the index for bufs_ indicating which buffer is being
  // consumed currently.
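
For illustration, a caller-supplied callback matching the std::function<void(uint64_t, size_t, size_t&)> signature above could clamp the suggested readahead so that a prefetch never extends past a limit the caller knows about. The limit value in this sketch is a made-up assumption, not part of this change:

    // Hypothetical readahead-size callback (sketch only).
    uint64_t assumed_limit = 1 << 20;  // illustrative limit, not from this diff
    auto readahead_cb = [assumed_limit](uint64_t offset, size_t suggested,
                                        size_t& updated) {
      if (offset >= assumed_limit) {
        updated = 0;
      } else if (offset + suggested > assumed_limit) {
        updated = static_cast<size_t>(assumed_limit - offset);
      } else {
        updated = suggested;
      }
    };

Such a lambda could be passed as the `cb` constructor argument shown earlier; whenever the size it reports differs from the suggestion, ReadAheadSizeTuning records READAHEAD_TRIMMED.
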
|
@ -487,5 +511,6 @@ class FilePrefetchBuffer {
|
|||
// ReadOptions.auto_readahead_size are set to trim readahead_size upto
|
||||
// upper_bound_offset_ during prefetching.
|
||||
uint64_t upper_bound_offset_ = 0;
|
||||
std::function<void(uint64_t, size_t, size_t&)> readaheadsize_cb_;
|
||||
};
|
||||
} // namespace ROCKSDB_NAMESPACE
|
||||
|
|
|
@@ -654,9 +654,6 @@ TEST_P(PrefetchTest, ConfigureInternalAutoReadaheadSize) {

  SyncPoint::GetInstance()->SetCallBack("FilePrefetchBuffer::Prefetch:Start",
                                        [&](void*) { buff_prefetch_count++; });

  SyncPoint::GetInstance()->EnableProcessing();

  SyncPoint::GetInstance()->EnableProcessing();

  Status s = TryReopen(options);

@@ -1233,6 +1230,271 @@ TEST_P(PrefetchTest, PrefetchWhenReseekwithCache) {
  Close();
}

TEST_P(PrefetchTest, PrefetchWithBlockLookupAutoTuneTest) {
  if (mem_env_ || encrypted_env_) {
    ROCKSDB_GTEST_SKIP("Test requires non-mem or non-encrypted environment");
    return;
  }

  std::shared_ptr<MockFS> fs =
      std::make_shared<MockFS>(FileSystem::Default(), false);

  std::unique_ptr<Env> env(new CompositeEnvWrapper(env_, fs));
  Options options;
  SetGenericOptions(env.get(), /*use_direct_io=*/false, options);
  options.statistics = CreateDBStatistics();
  BlockBasedTableOptions table_options;
  SetBlockBasedTableOptions(table_options);
  options.table_factory.reset(NewBlockBasedTableFactory(table_options));

  Status s = TryReopen(options);
  ASSERT_OK(s);

  Random rnd(309);
  WriteBatch batch;

  for (int i = 0; i < 26; i++) {
    std::string key = "my_key_";

    for (int j = 0; j < 10; j++) {
      key += char('a' + i);
      ASSERT_OK(batch.Put(key, rnd.RandomString(1000)));
    }
  }
  ASSERT_OK(db_->Write(WriteOptions(), &batch));

  std::string start_key = "my_key_a";

  std::string end_key = "my_key_";
  for (int j = 0; j < 10; j++) {
    end_key += char('a' + 25);
  }

  Slice least(start_key.data(), start_key.size());
  Slice greatest(end_key.data(), end_key.size());

  ASSERT_OK(db_->CompactRange(CompactRangeOptions(), &least, &greatest));

  // Try with different num_file_reads_for_auto_readahead values from 0 to 2.
  for (size_t i = 0; i < 3; i++) {
    std::shared_ptr<Cache> cache = NewLRUCache(1024 * 1024, 2);
    table_options.block_cache = cache;
    table_options.no_block_cache = false;
    table_options.num_file_reads_for_auto_readahead = i;
    options.table_factory.reset(NewBlockBasedTableFactory(table_options));

    s = TryReopen(options);
    ASSERT_OK(s);

    // Warm up the cache.
    {
      auto iter = std::unique_ptr<Iterator>(db_->NewIterator(ReadOptions()));

      iter->Seek("my_key_bbb");
      ASSERT_TRUE(iter->Valid());

      iter->Seek("my_key_ccccccccc");
      ASSERT_TRUE(iter->Valid());

      iter->Seek("my_key_ddd");
      ASSERT_TRUE(iter->Valid());

      iter->Seek("my_key_ddddddd");
      ASSERT_TRUE(iter->Valid());

      iter->Seek("my_key_e");
      ASSERT_TRUE(iter->Valid());

      iter->Seek("my_key_eeeee");
      ASSERT_TRUE(iter->Valid());

      iter->Seek("my_key_eeeeeeeee");
      ASSERT_TRUE(iter->Valid());
    }

    ReadOptions ropts;
    ropts.auto_readahead_size = true;
    ReadOptions cmp_ro;
    cmp_ro.auto_readahead_size = false;

    if (std::get<0>(GetParam())) {
      ropts.readahead_size = cmp_ro.readahead_size = 32768;
    }

    // With and without tuning readahead_size.
    {
      ASSERT_OK(options.statistics->Reset());
      // Seek.
      {
        Slice ub = Slice("my_key_uuu");
        Slice* ub_ptr = &ub;
        cmp_ro.iterate_upper_bound = ub_ptr;
        ropts.iterate_upper_bound = ub_ptr;

        auto iter = std::unique_ptr<Iterator>(db_->NewIterator(ropts));
        auto cmp_iter = std::unique_ptr<Iterator>(db_->NewIterator(cmp_ro));

        Slice seek_key = Slice("my_key_aaa");
        iter->Seek(seek_key);
        cmp_iter->Seek(seek_key);

        while (iter->Valid() && cmp_iter->Valid()) {
          if (iter->key() != cmp_iter->key()) {
            // Error
            ASSERT_TRUE(false);
          }
          iter->Next();
          cmp_iter->Next();
        }

        uint64_t readahead_trimmed =
            options.statistics->getAndResetTickerCount(READAHEAD_TRIMMED);
        ASSERT_GT(readahead_trimmed, 0);

        ASSERT_OK(cmp_iter->status());
        ASSERT_OK(iter->status());
      }

      // Reseek with new upper_bound_iterator.
      {
        Slice ub = Slice("my_key_y");
        ropts.iterate_upper_bound = &ub;
        cmp_ro.iterate_upper_bound = &ub;

        auto iter = std::unique_ptr<Iterator>(db_->NewIterator(ropts));
        auto cmp_iter = std::unique_ptr<Iterator>(db_->NewIterator(cmp_ro));

        Slice reseek_key = Slice("my_key_v");
        iter->Seek(reseek_key);
        cmp_iter->Seek(reseek_key);

        while (iter->Valid() && cmp_iter->Valid()) {
          if (iter->key() != cmp_iter->key()) {
            // Error
            ASSERT_TRUE(false);
          }
          iter->Next();
          cmp_iter->Next();
        }

        uint64_t readahead_trimmed =
            options.statistics->getAndResetTickerCount(READAHEAD_TRIMMED);
        ASSERT_GT(readahead_trimmed, 0);

        ASSERT_OK(cmp_iter->status());
        ASSERT_OK(iter->status());
      }
    }
    Close();
  }
}

TEST_F(PrefetchTest, PrefetchWithBlockLookupAutoTuneWithPrev) {
  if (mem_env_ || encrypted_env_) {
    ROCKSDB_GTEST_SKIP("Test requires non-mem or non-encrypted environment");
    return;
  }

  // First param is whether the MockFS supports prefetch or not
  std::shared_ptr<MockFS> fs =
      std::make_shared<MockFS>(FileSystem::Default(), false);

  std::unique_ptr<Env> env(new CompositeEnvWrapper(env_, fs));
  Options options;
  SetGenericOptions(env.get(), /*use_direct_io=*/false, options);
  options.statistics = CreateDBStatistics();
  BlockBasedTableOptions table_options;
  SetBlockBasedTableOptions(table_options);
  std::shared_ptr<Cache> cache = NewLRUCache(1024 * 1024, 2);
  table_options.block_cache = cache;
  table_options.no_block_cache = false;
  options.table_factory.reset(NewBlockBasedTableFactory(table_options));

  Status s = TryReopen(options);
  ASSERT_OK(s);

  Random rnd(309);
  WriteBatch batch;

  for (int i = 0; i < 26; i++) {
    std::string key = "my_key_";

    for (int j = 0; j < 10; j++) {
      key += char('a' + i);
      ASSERT_OK(batch.Put(key, rnd.RandomString(1000)));
    }
  }
  ASSERT_OK(db_->Write(WriteOptions(), &batch));

  std::string start_key = "my_key_a";

  std::string end_key = "my_key_";
  for (int j = 0; j < 10; j++) {
    end_key += char('a' + 25);
  }

  Slice least(start_key.data(), start_key.size());
  Slice greatest(end_key.data(), end_key.size());

  ASSERT_OK(db_->CompactRange(CompactRangeOptions(), &least, &greatest));

  ReadOptions ropts;
  ropts.auto_readahead_size = true;

  {
    // Seek.
    Slice ub = Slice("my_key_uuu");
    Slice* ub_ptr = &ub;
    ropts.iterate_upper_bound = ub_ptr;
    ropts.auto_readahead_size = true;

    ReadOptions cmp_readopts = ropts;
    cmp_readopts.auto_readahead_size = false;

    auto iter = std::unique_ptr<Iterator>(db_->NewIterator(ropts));
    auto cmp_iter = std::unique_ptr<Iterator>(db_->NewIterator(cmp_readopts));

    Slice seek_key = Slice("my_key_bbb");
    {
      cmp_iter->Seek(seek_key);
      ASSERT_TRUE(cmp_iter->Valid());
      ASSERT_OK(cmp_iter->status());

      iter->Seek(seek_key);
      ASSERT_TRUE(iter->Valid());
      ASSERT_OK(iter->status());

      ASSERT_EQ(iter->key(), cmp_iter->key());
    }

    // Prev op should pass with auto tuning of readahead_size.
    {
      cmp_iter->Prev();
      ASSERT_TRUE(cmp_iter->Valid());
      ASSERT_OK(cmp_iter->status());

      iter->Prev();
      ASSERT_OK(iter->status());
      ASSERT_TRUE(iter->Valid());

      ASSERT_EQ(iter->key(), cmp_iter->key());
    }

    // Reseek would follow as usual.
    {
      cmp_iter->Seek(seek_key);
      ASSERT_TRUE(cmp_iter->Valid());
      ASSERT_OK(cmp_iter->status());

      iter->Seek(seek_key);
      ASSERT_OK(iter->status());
      ASSERT_TRUE(iter->Valid());
      ASSERT_EQ(iter->key(), cmp_iter->key());
    }
  }
  Close();
}

// This test verifies the functionality of ReadOptions.adaptive_readahead.
TEST_P(PrefetchTest, DBIterLevelReadAhead) {
  const int kNumKeys = 1000;

@@ -3117,7 +3379,64 @@ TEST_F(FilePrefetchBufferTest, NoSyncWithAsyncIO) {
  // Length should be 4000.
  ASSERT_EQ(async_result.size(), 4000);
  // Data correctness.
  Slice result(content.c_str() + 3000, 4000);
  Slice result(&content[3000], 4000);
  ASSERT_EQ(result.size(), 4000);
  ASSERT_EQ(result, async_result);
}

// This test checks that if, during a seek with async_io, the first buffer has
// already prefetched the data up to the upper_bound offset, the second buffer
// doesn't issue a prefetch.
TEST_F(FilePrefetchBufferTest, IterateUpperBoundTest1) {
  std::string fname = "iterate-upperbound-test1";
  Random rand(0);
  std::string content = rand.RandomString(32768);
  Write(fname, content);

  FileOptions opts;
  std::unique_ptr<RandomAccessFileReader> r;
  Read(fname, opts, &r);

  FilePrefetchBuffer fpb(
      /*readahead_size=*/8192, /*max_readahead_size=*/16384, /*enable=*/true,
      /*track_min_offset=*/false, /*implicit_auto_readahead=*/false,
      /*num_file_reads=*/0, /*num_file_reads_for_auto_readahead=*/0,
      /*upper_bound_offset=*/8000, fs());

  int read_async_called = 0;
  SyncPoint::GetInstance()->SetCallBack(
      "FilePrefetchBuffer::ReadAsync",
      [&](void* /*arg*/) { read_async_called++; });
  SyncPoint::GetInstance()->EnableProcessing();

  Slice async_result;
  // Simulate a seek of 4000 bytes at offset 3000. Due to the readahead
  // settings, it will do 1 read of 4000+1000 (till 8000 - upper bound).
  Status s = fpb.PrefetchAsync(IOOptions(), r.get(), 3000, 4000, &async_result);

  // Platforms that don't have IO uring may not support async IO
  if (s.IsNotSupported()) {
    return;
  }

  ASSERT_TRUE(s.IsTryAgain());
  IOOptions io_opts;
  io_opts.rate_limiter_priority = Env::IOPriority::IO_LOW;
  ASSERT_TRUE(fpb.TryReadFromCacheAsync(io_opts, r.get(), /*offset=*/3000,
                                        /*length=*/4000, &async_result, &s));
  // No sync call should be made.
  HistogramData sst_read_micros;
  stats()->histogramData(SST_READ_MICROS, &sst_read_micros);
  ASSERT_EQ(sst_read_micros.count, 0);

  // Number of async calls should be 1.
  // No prefetching should happen in the second buffer as the first buffer has
  // already prefetched up to the upper bound offset.
  ASSERT_EQ(read_async_called, 1);
  // Length should be 4000.
  ASSERT_EQ(async_result.size(), 4000);
  // Data correctness.
  Slice result(&content[3000], 4000);
  ASSERT_EQ(result.size(), 4000);
  ASSERT_EQ(result, async_result);
}

@@ -13,7 +13,9 @@
#include <string>

#include "rocksdb/cache.h"
#include "rocksdb/compression_type.h"
#include "rocksdb/memory_allocator.h"
#include "rocksdb/options.h"
#include "rocksdb/slice.h"
#include "rocksdb/status.h"

@@ -109,13 +111,18 @@ class Cache {
  // pointer into static data).
  using DeleterFn = void (*)(ObjectPtr obj, MemoryAllocator* allocator);

  // The CreateCallback takes in a buffer from the NVM cache and constructs
  // an object using it. The callback doesn't have ownership of the buffer and
  // The CreateCallback takes in a buffer from the secondary cache and
  // constructs an object using it. The buffer could be compressed or
  // uncompressed, as indicated by the type argument. If compressed,
  // the callback is responsible for uncompressing it using information
  // from the context, such as compression dictionary.
  // The callback doesn't have ownership of the buffer and
  // should copy the contents into its own buffer. The CreateContext* is
  // provided by Lookup and may be used to follow DB- or CF-specific settings.
  // In case of some error, non-OK is returned and the caller should ignore
  // any result in out_obj. (The implementation must clean up after itself.)
  using CreateCallback = Status (*)(const Slice& data, CreateContext* context,
  using CreateCallback = Status (*)(const Slice& data, CompressionType type,
                                    CacheTier source, CreateContext* context,
                                    MemoryAllocator* allocator,
                                    ObjectPtr* out_obj, size_t* out_charge);

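A minimal sketch of a callback with the new signature, for a hypothetical RawBlock object; the type and function names are assumptions for illustration, and a real callback would uncompress the buffer (per the comment above) rather than reject it:

    struct RawBlock {
      std::string data;
    };

    Status CreateRawBlock(const Slice& data, CompressionType type,
                          CacheTier /*source*/, Cache::CreateContext* /*ctx*/,
                          MemoryAllocator* /*allocator*/,
                          Cache::ObjectPtr* out_obj, size_t* out_charge) {
      if (type != CompressionType::kNoCompression) {
        // A real callback would uncompress here, e.g. using a dictionary
        // carried in the create context.
        return Status::NotSupported("sketch handles only uncompressed data");
      }
      // The callback does not own `data`, so copy it into the new object;
      // the object itself is later destroyed via the helper's del_cb.
      auto* block = new RawBlock{data.ToString()};
      *out_obj = block;
      *out_charge = block->data.size();
      return Status::OK();
    }
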
@@ -242,12 +249,19 @@ class Cache {
  // the item is only inserted into the primary cache. It may
  // defer the insertion to the secondary cache as it sees fit.
  //
  // Along with the object pointer, the caller may pass a Slice pointing to
  // the compressed serialized data of the object. If compressed is
  // non-empty, then the caller must pass the type indicating the compression
  // algorithm used. The cache may, optionally, also insert the compressed
  // block into one or more cache tiers.
  //
  // When the inserted entry is no longer needed, it will be destroyed using
  // helper->del_cb (if non-nullptr).
  virtual Status Insert(const Slice& key, ObjectPtr obj,
                        const CacheItemHelper* helper, size_t charge,
                        Handle** handle = nullptr,
                        Priority priority = Priority::LOW) = 0;
  virtual Status Insert(
      const Slice& key, ObjectPtr obj, const CacheItemHelper* helper,
      size_t charge, Handle** handle = nullptr,
      Priority priority = Priority::LOW, const Slice& compressed = Slice(),
      CompressionType type = CompressionType::kNoCompression) = 0;

  // Similar to Insert, but used for creating cache entries that cannot
  // be found with Lookup, such as for memory charging purposes. The

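As a usage sketch of the widened signature: a caller that already holds the compressed serialization of a block can pass it along so the cache may admit it into a compressed or non-volatile tier. All names below are illustrative, and the helper/object are assumed to exist:

    Status InsertWithCompressedCopy(Cache& cache, const Slice& key,
                                    Cache::ObjectPtr obj,
                                    const Cache::CacheItemHelper* helper,
                                    size_t charge,
                                    const std::string& compressed_bytes) {
      // Supplying the compressed form and its compression type is optional;
      // with an empty Slice the call behaves like the old signature.
      return cache.Insert(key, obj, helper, charge, /*handle=*/nullptr,
                          Cache::Priority::LOW, Slice(compressed_bytes),
                          CompressionType::kLZ4Compression);
    }
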
@@ -536,11 +550,14 @@ class CacheWrapper : public Cache {
  // Only function that derived class must provide
  // const char* Name() const override { ... }

  Status Insert(const Slice& key, ObjectPtr value,
                const CacheItemHelper* helper, size_t charge,
                Handle** handle = nullptr,
                Priority priority = Priority::LOW) override {
    return target_->Insert(key, value, helper, charge, handle, priority);
  Status Insert(
      const Slice& key, ObjectPtr value, const CacheItemHelper* helper,
      size_t charge, Handle** handle = nullptr,
      Priority priority = Priority::LOW,
      const Slice& compressed_value = Slice(),
      CompressionType type = CompressionType::kNoCompression) override {
    return target_->Insert(key, value, helper, charge, handle, priority,
                           compressed_value, type);
  }

  Handle* CreateStandalone(const Slice& key, ObjectPtr obj,

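For anyone deriving from CacheWrapper, the forwarding above means the widened parameter list must be repeated in overrides. A hypothetical subclass that only counts insertions might look like this (sketch; assumes <atomic> is available):

    class CountingCacheWrapper : public CacheWrapper {
     public:
      using CacheWrapper::CacheWrapper;
      const char* Name() const override { return "CountingCacheWrapper"; }

      Status Insert(const Slice& key, ObjectPtr value,
                    const CacheItemHelper* helper, size_t charge,
                    Handle** handle = nullptr,
                    Priority priority = Priority::LOW,
                    const Slice& compressed_value = Slice(),
                    CompressionType type =
                        CompressionType::kNoCompression) override {
        insert_count_.fetch_add(1, std::memory_order_relaxed);
        return CacheWrapper::Insert(key, value, helper, charge, handle,
                                    priority, compressed_value, type);
      }

     private:
      std::atomic<uint64_t> insert_count_{0};
    };
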
@@ -275,7 +275,8 @@ struct CompactionOptionsFIFO {
// In the future, we may add more caching layers.
enum class CacheTier : uint8_t {
  kVolatileTier = 0,
  kNonVolatileBlockTier = 0x01,
  kVolatileCompressedTier = 0x01,
  kNonVolatileBlockTier = 0x02,
};

enum UpdateStatus {  // Return status For inplace update callback

@@ -12,6 +12,7 @@
#pragma once

#include <cstdint>
#include <limits>
#include <memory>
#include <string>

@@ -481,22 +482,53 @@ enum TieredAdmissionPolicy {
  // Same as kAdmPolicyPlaceholder, but also if an entry in the primary cache
  // was a hit, then force insert it into the compressed secondary cache
  kAdmPolicyAllowCacheHits,
  // An admission policy for three cache tiers - primary uncompressed,
  // compressed secondary, and a compressed local flash (non-volatile) cache.
  // Each tier is managed as an independent queue.
  kAdmPolicyThreeQueue,
  kAdmPolicyMax,
};

// EXPERIMENTAL
// The following feature is experimental, and the API is subject to change
//
// A 2-tier cache with a primary block cache, and a compressed secondary
// cache. The returned cache instance will internally allocate a primary
// uncompressed cache of the specified type, and a compressed secondary
// cache. Any cache memory reservations, such as WriteBufferManager
// allocations costed to the block cache, will be distributed
// proportionally across both the primary and secondary.
struct TieredVolatileCacheOptions {
  ShardedCacheOptions* cache_opts;
  PrimaryCacheType cache_type;
  TieredAdmissionPolicy adm_policy;
struct TieredCacheOptions {
  ShardedCacheOptions* cache_opts = nullptr;
  PrimaryCacheType cache_type = PrimaryCacheType::kCacheTypeLRU;
  TieredAdmissionPolicy adm_policy = TieredAdmissionPolicy::kAdmPolicyAuto;
  CompressedSecondaryCacheOptions comp_cache_opts;
  // Any capacity specified in LRUCacheOptions, HyperClockCacheOptions and
  // CompressedSecondaryCacheOptions is ignored
  // The total_capacity specified here is taken as the memory budget and
  // divided between the primary block cache and compressed secondary cache
  size_t total_capacity = 0;
  double compressed_secondary_ratio = 0.0;
  // An optional secondary cache that will serve as the persistent cache
  // tier. If present, compressed blocks will be written to this
  // secondary cache.
  std::shared_ptr<SecondaryCache> nvm_sec_cache;
};

extern std::shared_ptr<Cache> NewTieredVolatileCache(
    TieredVolatileCacheOptions& cache_opts);
extern std::shared_ptr<Cache> NewTieredCache(
    const TieredCacheOptions& cache_opts);
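
A setup sketch for the API above. The budget, ratio, compression choice, the HyperClock primary (assuming the kCacheTypeHCC enumerator) and the pre-built nvm_secondary_cache are illustrative assumptions, not recommendations:

    // Primary cache options; capacities set here are ignored per the comment
    // above, since total_capacity provides the shared budget.
    HyperClockCacheOptions primary_opts(/*_capacity=*/0,
                                        /*_estimated_entry_charge=*/0);
    TieredCacheOptions tiered_opts;
    tiered_opts.cache_opts = &primary_opts;
    tiered_opts.cache_type = PrimaryCacheType::kCacheTypeHCC;
    tiered_opts.adm_policy = TieredAdmissionPolicy::kAdmPolicyThreeQueue;
    tiered_opts.comp_cache_opts.compression_type = kZSTD;
    tiered_opts.total_capacity = 4ULL << 30;       // overall memory budget
    tiered_opts.compressed_secondary_ratio = 0.5;  // half for compressed tier
    tiered_opts.nvm_sec_cache = nvm_secondary_cache;  // assumed pre-built tier
    std::shared_ptr<Cache> block_cache = NewTieredCache(tiered_opts);

    BlockBasedTableOptions bbto;
    bbto.block_cache = block_cache;
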

// EXPERIMENTAL
// Dynamically update some of the parameters of a TieredCache. The input
// cache shared_ptr should have been allocated using NewTieredCache.
// At the moment, there are a couple of limitations -
// 1. The total_capacity should be > the WriteBufferManager max size, if
//    using the block cache charging feature
// 2. Once the compressed secondary cache is disabled by setting the
//    compressed_secondary_ratio to 0.0, it cannot be dynamically re-enabled
//    again
extern Status UpdateTieredCache(
    const std::shared_ptr<Cache>& cache, int64_t total_capacity = -1,
    double compressed_secondary_ratio = std::numeric_limits<double>::max(),
    TieredAdmissionPolicy adm_policy = TieredAdmissionPolicy::kAdmPolicyMax);
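
A sketch of a runtime adjustment; unspecified parameters keep their current values via the defaults above. The numbers are illustrative, and block_cache is the handle from the previous sketch:

    // Raise the overall budget to 8 GB and shrink the compressed share to 25%.
    Status st = UpdateTieredCache(block_cache, /*total_capacity=*/8ULL << 30,
                                  /*compressed_secondary_ratio=*/0.25);
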
}  // namespace ROCKSDB_NAMESPACE

@@ -16,6 +16,7 @@

#include "rocksdb/customizable.h"
#include "rocksdb/rocksdb_namespace.h"
#include "rocksdb/table_properties.h"
#include "rocksdb/types.h"
#include "rocksdb/wide_columns.h"

@@ -160,10 +161,20 @@ class CompactionFilter : public Customizable {
    // Whether this table file is created as part of a compaction requested by
    // the client.
    bool is_manual_compaction;
    // The lowest level among all the input files (if any) used in table
    // creation
    int input_start_level = kUnknownStartLevel;
    // The column family that will contain the created table file.
    uint32_t column_family_id;
    // Reason this table file is being created.
    TableFileCreationReason reason;
    // Map from all the input files (if any) used in table creation to their
    // table properties. When there are such input files but RocksDB fails to
    // load their table properties, `input_table_properties` will be an empty
    // map.
    TablePropertiesCollection input_table_properties;

    static const int kUnknownStartLevel = -1;
  };

  virtual ~CompactionFilter() {}
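
To illustrate how the new Context fields might be consumed, a hypothetical factory could pass them to its filter as hints; MyFilter is an assumed CompactionFilter subclass, not part of this change:

    class MyFilterFactory : public CompactionFilterFactory {
     public:
      const char* Name() const override { return "MyFilterFactory"; }

      std::unique_ptr<CompactionFilter> CreateCompactionFilter(
          const CompactionFilter::Context& context) override {
        // Sum the data size of the input tables whose properties were loaded.
        uint64_t input_data_bytes = 0;
        for (const auto& entry : context.input_table_properties) {
          input_data_bytes += entry.second->data_size;
        }
        // input_start_level stays at kUnknownStartLevel (-1) when there are
        // no input files, e.g. for a flush.
        return std::make_unique<MyFilter>(context.input_start_level,
                                          input_data_bytes);
      }
    };
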
Some files were not shown because too many files have changed in this diff.