Support pro-actively erasing obsolete block cache entries (#12694)
Summary: Currently, when files become obsolete, the block cache entries associated with them just age out naturally. With pure LRU this is not too bad: once you "use" enough cache entries to (re-)fill the cache, you are guaranteed to have purged the obsolete entries. However, HyperClockCache is a counting clock cache with a somewhat longer memory, so it could be more negatively impacted by previously-hot cache entries becoming obsolete and taking longer to age out than newer single-hit entries.

Part of the reason we still have this natural aging-out is that there is almost no connection between block cache entries and the file they are associated with. Everything is hashed into the same pool(s) of entries, with nothing like a secondary index based on file, and keeping track of such an index could be expensive.

This change adds a new, mutable CF option `uncache_aggressiveness` for erasing obsolete block cache entries. The process can be speculative, lossy, or unproductive because not all potential block cache entries associated with files will be resident in memory, and attempting to remove them all could be wasted CPU time. Rather than a simple on/off switch, `uncache_aggressiveness` basically tells RocksDB how much CPU you're willing to burn trying to purge obsolete block cache entries. When such efforts are not sufficiently productive for a file, we stop and move on.

The option is in ColumnFamilyOptions so that it is dynamically changeable for already-open files and customizable by CF. Note that this block cache removal happens as part of the process of purging obsolete files, which is often in a background thread (depending on the `background_purge_on_iterator_cleanup` and `avoid_unnecessary_blocking_io` options) rather than along CPU-critical paths.

Notable auxiliary code details:
* Possibly fixing some issues with trivial moves with `only_delete_metadata`: unnecessary TableCache::Evict in that case and missing from the ObsoleteFileInfo move operator. (Not able to reproduce a current failure.)
* Remove suspicious TableCache::Erase() from VersionSet::AddObsoleteBlobFile() (TODO follow-up item)

Marked EXPERIMENTAL until more thorough validation is complete. Direct stats of this functionality are omitted because they could be misleading. Block cache hit rate is a better indicator of benefit, and CPU profiling a better indicator of cost.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12694

Test Plan:
* Unit tests added, including refactoring an existing test to make better use of parameterized tests.
* Added to crash test.
* Performance, sample command:

```
for I in `seq 1 10`; do
  for UA in 300; do
    for CT in lru_cache fixed_hyper_clock_cache auto_hyper_clock_cache; do
      rm -rf /dev/shm/test3
      TEST_TMPDIR=/dev/shm/test3 /usr/bin/time ./db_bench -benchmarks=readwhilewriting -num=13000000 -read_random_exp_range=6 -write_buffer_size=10000000 -bloom_bits=10 -cache_type=$CT -cache_size=390000000 -cache_index_and_filter_blocks=1 -disable_wal=1 -duration=60 -statistics -uncache_aggressiveness=$UA 2>&1 | grep -E 'micros/op|rocksdb.block.cache.data.(hit|miss)|rocksdb.number.keys.(read|written)|maxresident' | awk '/rocksdb.block.cache.data.miss/ { miss = $4 } /rocksdb.block.cache.data.hit/ { hit = $4 } { print } END { print "hit rate = " ((hit * 1.0) / (miss + hit)) }' | tee -a results-$CT-$UA
    done
  done
done
```

Averaging 10 runs each case, block cache data block hit rates:

```
lru_cache
  UA=0   -> hit rate = 0.327, ops/s = 87668,  user CPU sec = 139.0
  UA=300 -> hit rate = 0.336, ops/s = 87960,  user CPU sec = 139.0
fixed_hyper_clock_cache
  UA=0   -> hit rate = 0.336, ops/s = 100069, user CPU sec = 139.9
  UA=300 -> hit rate = 0.343, ops/s = 100104, user CPU sec = 140.2
auto_hyper_clock_cache
  UA=0   -> hit rate = 0.336, ops/s = 97580,  user CPU sec = 140.5
  UA=300 -> hit rate = 0.345, ops/s = 97972,  user CPU sec = 139.8
```

Conclusion: up to roughly 1 percentage point of improved block cache hit rate, likely leading to overall improved efficiency (because the foreground CPU cost of cache misses likely outweighs the background CPU cost of erasure, let alone I/O savings).

Reviewed By: ajkr

Differential Revision: D57932442

Pulled By: pdillinger

fbshipit-source-id: 84a243ca5f965f731f346a4853009780a904af6c
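As an aside (not part of this commit), a minimal sketch of how the new option could be used, since it is a mutable ColumnFamilyOptions member. The DB path and the values 300/100 are arbitrary illustrations; the option is EXPERIMENTAL at this point.

```cpp
// Minimal usage sketch, assuming RocksDB built with this change.
#include <cassert>
#include <memory>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Opt in at open time; 0 (the default) disables obsolete-entry uncaching.
  options.uncache_aggressiveness = 300;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/uncache_demo", &db);
  assert(s.ok());
  std::unique_ptr<rocksdb::DB> db_guard(db);

  // Because the option is mutable, it can also be changed on a live DB
  // (default column family shown here).
  s = db->SetOptions({{"uncache_aggressiveness", "100"}});
  assert(s.ok());
  return 0;
}
```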
This commit is contained in:
parent 44aceb88d0
commit b34cef57b7
@@ -18,6 +18,7 @@
#include "cache/lru_cache.h"
#include "cache/typed_cache.h"
#include "port/stack_trace.h"
#include "table/block_based/block_cache.h"
#include "test_util/secondary_cache_test_util.h"
#include "test_util/testharness.h"
#include "util/coding.h"

@@ -1017,6 +1018,63 @@ INSTANTIATE_TEST_CASE_P(CacheTestInstance, CacheTest,
INSTANTIATE_TEST_CASE_P(CacheTestInstance, LRUCacheTest,
                        testing::Values(secondary_cache_test_util::kLRU));

TEST(MiscBlockCacheTest, UncacheAggressivenessAdvisor) {
  // Aggressiveness to a sequence of Report() calls (as string of 0s and 1s)
  // exactly until the first ShouldContinue() == false.
  const std::vector<std::pair<uint32_t, Slice>> expectedTraces{
      // Aggressiveness 1 aborts on first unsuccessful erasure.
      {1, "0"},
      {1, "11111111111111111111110"},
      // For sufficient evidence, aggressiveness 2 requires a minimum of two
      // unsuccessful erasures.
      {2, "00"},
      {2, "0110"},
      {2, "1100"},
      {2, "011111111111111111111111111111111111111111111111111111111111111100"},
      {2, "0111111111111111111111111111111111110"},
      // For sufficient evidence, aggressiveness 3 and higher require a minimum
      // of three unsuccessful erasures.
      {3, "000"},
      {3, "01010"},
      {3, "111000"},
      {3, "00111111111111111111111111111111111100"},
      {3, "00111111111111111111110"},

      {4, "000"},
      {4, "01010"},
      {4, "111000"},
      {4, "001111111111111111111100"},
      {4, "0011111111111110"},

      {6, "000"},
      {6, "01010"},
      {6, "111000"},
      {6, "00111111111111100"},
      {6, "0011111110"},

      // 69 -> 50% threshold, now up to minimum of 4
      {69, "0000"},
      {69, "010000"},
      {69, "01010000"},
      {69, "101010100010101000"},

      // 230 -> 10% threshold, appropriately higher minimum
      {230, "000000000000"},
      {230, "0000000000010000000000"},
      {230, "00000000000100000000010000000000"}};
  for (const auto& [aggressiveness, t] : expectedTraces) {
    SCOPED_TRACE("aggressiveness=" + std::to_string(aggressiveness) + " with " +
                 t.ToString());
    UncacheAggressivenessAdvisor uaa(aggressiveness);
    for (size_t i = 0; i < t.size(); ++i) {
      SCOPED_TRACE("i=" + std::to_string(i));
      ASSERT_TRUE(uaa.ShouldContinue());
      uaa.Report(t[i] & 1);
    }
    ASSERT_FALSE(uaa.ShouldContinue());
  }
}

}  // namespace ROCKSDB_NAMESPACE

int main(int argc, char** argv) {
@@ -26,6 +26,7 @@
#include "rocksdb/table_properties.h"
#include "table/block_based/block_based_table_reader.h"
#include "table/unique_id_impl.h"
#include "test_util/secondary_cache_test_util.h"
#include "util/compression.h"
#include "util/defer.h"
#include "util/hash.h"

@@ -740,118 +741,6 @@ class LookupLiarCache : public CacheWrapper {
}  // anonymous namespace

TEST_F(DBBlockCacheTest, AddRedundantStats) {
  const size_t capacity = size_t{1} << 25;
  const int num_shard_bits = 0;  // 1 shard
  int iterations_tested = 0;
  for (const std::shared_ptr<Cache>& base_cache :
       {NewLRUCache(capacity, num_shard_bits),
        // FixedHyperClockCache
        HyperClockCacheOptions(
            capacity,
            BlockBasedTableOptions().block_size /*estimated_value_size*/,
            num_shard_bits)
            .MakeSharedCache(),
        // AutoHyperClockCache
        HyperClockCacheOptions(capacity, 0 /*estimated_value_size*/,
                               num_shard_bits)
            .MakeSharedCache()}) {
    if (!base_cache) {
      // Skip clock cache when not supported
      continue;
    }
    ++iterations_tested;
    Options options = CurrentOptions();
    options.create_if_missing = true;
    options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();

    std::shared_ptr<LookupLiarCache> cache =
        std::make_shared<LookupLiarCache>(base_cache);

    BlockBasedTableOptions table_options;
    table_options.cache_index_and_filter_blocks = true;
    table_options.block_cache = cache;
    table_options.filter_policy.reset(NewBloomFilterPolicy(50));
    options.table_factory.reset(NewBlockBasedTableFactory(table_options));
    DestroyAndReopen(options);

    // Create a new table.
    ASSERT_OK(Put("foo", "value"));
    ASSERT_OK(Put("bar", "value"));
    ASSERT_OK(Flush());
    ASSERT_EQ(1, NumTableFilesAtLevel(0));

    // Normal access filter+index+data.
    ASSERT_EQ("value", Get("foo"));

    ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_INDEX_ADD));
    ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_FILTER_ADD));
    ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_DATA_ADD));
    // --------
    ASSERT_EQ(3, TestGetTickerCount(options, BLOCK_CACHE_ADD));

    ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_INDEX_ADD_REDUNDANT));
    ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_FILTER_ADD_REDUNDANT));
    ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_DATA_ADD_REDUNDANT));
    // --------
    ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_ADD_REDUNDANT));

    // Again access filter+index+data, but force redundant load+insert on index
    cache->SetNthLookupNotFound(2);
    ASSERT_EQ("value", Get("bar"));

    ASSERT_EQ(2, TestGetTickerCount(options, BLOCK_CACHE_INDEX_ADD));
    ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_FILTER_ADD));
    ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_DATA_ADD));
    // --------
    ASSERT_EQ(4, TestGetTickerCount(options, BLOCK_CACHE_ADD));

    ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_INDEX_ADD_REDUNDANT));
    ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_FILTER_ADD_REDUNDANT));
    ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_DATA_ADD_REDUNDANT));
    // --------
    ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_ADD_REDUNDANT));

    // Access just filter (with high probability), and force redundant
    // load+insert
    cache->SetNthLookupNotFound(1);
    ASSERT_EQ("NOT_FOUND", Get("this key was not added"));

    EXPECT_EQ(2, TestGetTickerCount(options, BLOCK_CACHE_INDEX_ADD));
    EXPECT_EQ(2, TestGetTickerCount(options, BLOCK_CACHE_FILTER_ADD));
    EXPECT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_DATA_ADD));
    // --------
    EXPECT_EQ(5, TestGetTickerCount(options, BLOCK_CACHE_ADD));

    EXPECT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_INDEX_ADD_REDUNDANT));
    EXPECT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_FILTER_ADD_REDUNDANT));
    EXPECT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_DATA_ADD_REDUNDANT));
    // --------
    EXPECT_EQ(2, TestGetTickerCount(options, BLOCK_CACHE_ADD_REDUNDANT));

    // Access just data, forcing redundant load+insert
    ReadOptions read_options;
    std::unique_ptr<Iterator> iter{db_->NewIterator(read_options)};
    cache->SetNthLookupNotFound(1);
    iter->SeekToFirst();
    ASSERT_TRUE(iter->Valid());
    ASSERT_EQ(iter->key(), "bar");

    EXPECT_EQ(2, TestGetTickerCount(options, BLOCK_CACHE_INDEX_ADD));
    EXPECT_EQ(2, TestGetTickerCount(options, BLOCK_CACHE_FILTER_ADD));
    EXPECT_EQ(2, TestGetTickerCount(options, BLOCK_CACHE_DATA_ADD));
    // --------
    EXPECT_EQ(6, TestGetTickerCount(options, BLOCK_CACHE_ADD));

    EXPECT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_INDEX_ADD_REDUNDANT));
    EXPECT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_FILTER_ADD_REDUNDANT));
    EXPECT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_DATA_ADD_REDUNDANT));
    // --------
    EXPECT_EQ(3, TestGetTickerCount(options, BLOCK_CACHE_ADD_REDUNDANT));
  }
  EXPECT_GE(iterations_tested, 1);
}

TEST_F(DBBlockCacheTest, ParanoidFileChecks) {
  Options options = CurrentOptions();
  options.create_if_missing = true;
@@ -1347,6 +1236,190 @@ TEST_F(DBBlockCacheTest, HyperClockCacheReportProblems)
  EXPECT_EQ(logger->PopCounts(), (std::array<int, 3>{{0, 1, 0}}));
}

class DBBlockCacheTypeTest
    : public DBBlockCacheTest,
      public secondary_cache_test_util::WithCacheTypeParam {};

INSTANTIATE_TEST_CASE_P(DBBlockCacheTypeTestInstance, DBBlockCacheTypeTest,
                        secondary_cache_test_util::GetTestingCacheTypes());

TEST_P(DBBlockCacheTypeTest, AddRedundantStats) {
  BlockBasedTableOptions table_options;

  const size_t capacity = size_t{1} << 25;
  const int num_shard_bits = 0;  // 1 shard
  estimated_value_size_ = table_options.block_size;
  std::shared_ptr<Cache> base_cache =
      NewCache(capacity, num_shard_bits, /*strict_capacity_limit=*/false);
  Options options = CurrentOptions();
  options.create_if_missing = true;
  options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();

  std::shared_ptr<LookupLiarCache> cache =
      std::make_shared<LookupLiarCache>(base_cache);

  table_options.cache_index_and_filter_blocks = true;
  table_options.block_cache = cache;
  table_options.filter_policy.reset(NewBloomFilterPolicy(50));
  options.table_factory.reset(NewBlockBasedTableFactory(table_options));
  DestroyAndReopen(options);

  // Create a new table.
  ASSERT_OK(Put("foo", "value"));
  ASSERT_OK(Put("bar", "value"));
  ASSERT_OK(Flush());
  ASSERT_EQ(1, NumTableFilesAtLevel(0));

  // Normal access filter+index+data.
  ASSERT_EQ("value", Get("foo"));

  ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_INDEX_ADD));
  ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_FILTER_ADD));
  ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_DATA_ADD));
  // --------
  ASSERT_EQ(3, TestGetTickerCount(options, BLOCK_CACHE_ADD));

  ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_INDEX_ADD_REDUNDANT));
  ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_FILTER_ADD_REDUNDANT));
  ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_DATA_ADD_REDUNDANT));
  // --------
  ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_ADD_REDUNDANT));

  // Again access filter+index+data, but force redundant load+insert on index
  cache->SetNthLookupNotFound(2);
  ASSERT_EQ("value", Get("bar"));

  ASSERT_EQ(2, TestGetTickerCount(options, BLOCK_CACHE_INDEX_ADD));
  ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_FILTER_ADD));
  ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_DATA_ADD));
  // --------
  ASSERT_EQ(4, TestGetTickerCount(options, BLOCK_CACHE_ADD));

  ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_INDEX_ADD_REDUNDANT));
  ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_FILTER_ADD_REDUNDANT));
  ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_DATA_ADD_REDUNDANT));
  // --------
  ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_ADD_REDUNDANT));

  // Access just filter (with high probability), and force redundant
  // load+insert
  cache->SetNthLookupNotFound(1);
  ASSERT_EQ("NOT_FOUND", Get("this key was not added"));

  EXPECT_EQ(2, TestGetTickerCount(options, BLOCK_CACHE_INDEX_ADD));
  EXPECT_EQ(2, TestGetTickerCount(options, BLOCK_CACHE_FILTER_ADD));
  EXPECT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_DATA_ADD));
  // --------
  EXPECT_EQ(5, TestGetTickerCount(options, BLOCK_CACHE_ADD));

  EXPECT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_INDEX_ADD_REDUNDANT));
  EXPECT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_FILTER_ADD_REDUNDANT));
  EXPECT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_DATA_ADD_REDUNDANT));
  // --------
  EXPECT_EQ(2, TestGetTickerCount(options, BLOCK_CACHE_ADD_REDUNDANT));

  // Access just data, forcing redundant load+insert
  ReadOptions read_options;
  std::unique_ptr<Iterator> iter{db_->NewIterator(read_options)};
  cache->SetNthLookupNotFound(1);
  iter->SeekToFirst();
  ASSERT_TRUE(iter->Valid());
  ASSERT_EQ(iter->key(), "bar");

  EXPECT_EQ(2, TestGetTickerCount(options, BLOCK_CACHE_INDEX_ADD));
  EXPECT_EQ(2, TestGetTickerCount(options, BLOCK_CACHE_FILTER_ADD));
  EXPECT_EQ(2, TestGetTickerCount(options, BLOCK_CACHE_DATA_ADD));
  // --------
  EXPECT_EQ(6, TestGetTickerCount(options, BLOCK_CACHE_ADD));

  EXPECT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_INDEX_ADD_REDUNDANT));
  EXPECT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_FILTER_ADD_REDUNDANT));
  EXPECT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_DATA_ADD_REDUNDANT));
  // --------
  EXPECT_EQ(3, TestGetTickerCount(options, BLOCK_CACHE_ADD_REDUNDANT));
}

TEST_P(DBBlockCacheTypeTest, Uncache) {
  for (bool partitioned : {false, true}) {
    SCOPED_TRACE("partitioned=" + std::to_string(partitioned));
    for (uint32_t ua : {0, 1, 2, 10000}) {
      SCOPED_TRACE("ua=" + std::to_string(ua));

      Options options = CurrentOptions();
      options.uncache_aggressiveness = ua;
      options.create_if_missing = true;
      options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
      BlockBasedTableOptions table_options;

      const size_t capacity = size_t{1} << 25;
      const int num_shard_bits = 0;  // 1 shard
      estimated_value_size_ = table_options.block_size;
      std::shared_ptr<Cache> cache =
          NewCache(capacity, num_shard_bits, /*strict_capacity_limit=*/false);

      table_options.cache_index_and_filter_blocks = true;
      table_options.block_cache = cache;
      table_options.filter_policy.reset(NewBloomFilterPolicy(10));
      table_options.partition_filters = partitioned;
      table_options.index_type =
          partitioned ? BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch
                      : BlockBasedTableOptions::IndexType::kBinarySearch;
      options.table_factory.reset(NewBlockBasedTableFactory(table_options));
      DestroyAndReopen(options);

      size_t kBaselineCount = 1;  // Because of entry stats collector

      ASSERT_EQ(kBaselineCount, cache->GetOccupancyCount());
      ASSERT_EQ(0U, cache->GetUsage());

      constexpr uint8_t kNumDataBlocks = 10;
      constexpr uint8_t kNumFiles = 3;
      for (int i = 0; i < kNumDataBlocks; i++) {
        // Force some overlap with ordering
        ASSERT_OK(Put(Key((i * 7) % kNumDataBlocks),
                      Random::GetTLSInstance()->RandomBinaryString(
                          static_cast<int>(table_options.block_size))));
        if (i >= kNumDataBlocks - kNumFiles) {
          ASSERT_OK(Flush());
        }
      }
      ASSERT_EQ(int{kNumFiles}, NumTableFilesAtLevel(0));

      for (int i = 0; i < kNumDataBlocks; i++) {
        ASSERT_NE(Get(Key(i)), "NOT_FOUND");
      }

      size_t meta_blocks_per_file = /*index & filter*/ 2U * (1U + partitioned);
      ASSERT_EQ(
          cache->GetOccupancyCount(),
          kBaselineCount + kNumDataBlocks + meta_blocks_per_file * kNumFiles);
      ASSERT_GE(cache->GetUsage(), kNumDataBlocks * table_options.block_size);

      // Combine into one file, making the originals obsolete
      ASSERT_OK(db_->CompactRange({}, nullptr, nullptr));

      for (int i = 0; i < kNumDataBlocks; i++) {
        ASSERT_NE(Get(Key(i)), "NOT_FOUND");
      }

      if (ua == 0) {
        // Expect to see cache entries for new file and obsolete files
        EXPECT_EQ(cache->GetOccupancyCount(),
                  kBaselineCount + kNumDataBlocks * 2U +
                      meta_blocks_per_file * (kNumFiles + 1));
        EXPECT_GE(cache->GetUsage(),
                  kNumDataBlocks * table_options.block_size * 2U);
      } else {
        // Expect only to see cache entries for new file
        EXPECT_EQ(cache->GetOccupancyCount(),
                  kBaselineCount + kNumDataBlocks + meta_blocks_per_file);
        EXPECT_GE(cache->GetUsage(), kNumDataBlocks * table_options.block_size);
        EXPECT_LT(cache->GetUsage(),
                  kNumDataBlocks * table_options.block_size * 2U);
      }
    }
  }
}

class DBBlockCacheKeyTest
    : public DBTestBase,
@@ -413,12 +413,24 @@ void DBImpl::PurgeObsoleteFiles(JobContext& state, bool schedule_only) {
                           state.manifest_delete_files.size());
  // We may ignore the dbname when generating the file names.
  for (auto& file : state.sst_delete_files) {
    if (!file.only_delete_metadata) {
      candidate_files.emplace_back(
          MakeTableFileName(file.metadata->fd.GetNumber()), file.path);
    }
    if (file.metadata->table_reader_handle) {
      table_cache_->Release(file.metadata->table_reader_handle);
    auto* handle = file.metadata->table_reader_handle;
    if (file.only_delete_metadata) {
      if (handle) {
        // Simply release handle of file that is not being deleted
        table_cache_->Release(handle);
      }
    } else {
      // File is being deleted (actually obsolete)
      auto number = file.metadata->fd.GetNumber();
      candidate_files.emplace_back(MakeTableFileName(number), file.path);
      if (handle == nullptr) {
        // For files not "pinned" in table cache
        handle = TableCache::Lookup(table_cache_.get(), number);
      }
      if (handle) {
        TableCache::ReleaseObsolete(table_cache_.get(), handle,
                                    file.uncache_aggressiveness);
      }
    }
    file.DeleteMetadata();
  }

@@ -580,8 +592,6 @@ void DBImpl::PurgeObsoleteFiles(JobContext& state, bool schedule_only) {
    std::string fname;
    std::string dir_to_sync;
    if (type == kTableFile) {
      // evict from cache
      TableCache::Evict(table_cache_.get(), number);
      fname = MakeTableFileName(candidate_file.file_path, number);
      dir_to_sync = candidate_file.file_path;
    } else if (type == kBlobFile) {
@@ -163,6 +163,11 @@ Status TableCache::GetTableReader(
  return s;
}

Cache::Handle* TableCache::Lookup(Cache* cache, uint64_t file_number) {
  Slice key = GetSliceForFileNumber(&file_number);
  return cache->Lookup(key);
}

Status TableCache::FindTable(
    const ReadOptions& ro, const FileOptions& file_options,
    const InternalKeyComparator& internal_comparator,

@@ -727,4 +732,14 @@ uint64_t TableCache::ApproximateSize(
  return result;
}

void TableCache::ReleaseObsolete(Cache* cache, Cache::Handle* h,
                                 uint32_t uncache_aggressiveness) {
  CacheInterface typed_cache(cache);
  TypedHandle* table_handle = reinterpret_cast<TypedHandle*>(h);
  TableReader* table_reader = typed_cache.Value(table_handle);
  table_reader->MarkObsolete(uncache_aggressiveness);
  typed_cache.ReleaseAndEraseIfLastRef(table_handle);
}

}  // namespace ROCKSDB_NAMESPACE

@@ -165,6 +165,14 @@ class TableCache {
  // Evict any entry for the specified file number
  static void Evict(Cache* cache, uint64_t file_number);

  // Handles releasing, erasing, etc. of what should be the last reference
  // to an obsolete file.
  static void ReleaseObsolete(Cache* cache, Cache::Handle* handle,
                              uint32_t uncache_aggressiveness);

  // Return handle to an existing cache entry if there is one
  static Cache::Handle* Lookup(Cache* cache, uint64_t file_number);

  // Find table reader
  // @param skip_filters Disables loading/accessing the filter block
  // @param level == -1 means not specified
@@ -857,10 +857,14 @@ Version::~Version() {
    f->refs--;
    if (f->refs <= 0) {
      assert(cfd_ != nullptr);
      // When not in the process of closing the DB, we'll have a superversion
      // to get current mutable options from
      auto* sv = cfd_->GetSuperVersion();
      uint32_t path_id = f->fd.GetPathId();
      assert(path_id < cfd_->ioptions()->cf_paths.size());
      vset_->obsolete_files_.emplace_back(
          f, cfd_->ioptions()->cf_paths[path_id].path,
          sv ? sv->mutable_cf_options.uncache_aggressiveness : 0,
          cfd_->GetFileMetadataCacheReservationManager());
    }
  }

@@ -5197,6 +5201,10 @@ VersionSet::~VersionSet() {
  column_family_set_.reset();
  for (auto& file : obsolete_files_) {
    if (file.metadata->table_reader_handle) {
      // NOTE: DB is shutting down, so file is probably not obsolete, just
      // no longer referenced by Versions in memory.
      // For more context, see comment on "table_cache_->EraseUnRefEntries()"
      // in DBImpl::CloseHelper().
      table_cache_->Release(file.metadata->table_reader_handle);
      TableCache::Evict(table_cache_, file.metadata->fd.GetNumber());
    }

@@ -797,16 +797,20 @@ struct ObsoleteFileInfo {
  // the file, usually because the file is trivially moved so two FileMetadata
  // are managing the file.
  bool only_delete_metadata = false;
  // To apply to this file
  uint32_t uncache_aggressiveness = 0;

  ObsoleteFileInfo() noexcept
      : metadata(nullptr), only_delete_metadata(false) {}
  ObsoleteFileInfo(FileMetaData* f, const std::string& file_path,
                   uint32_t _uncache_aggressiveness,
                   std::shared_ptr<CacheReservationManager>
                       file_metadata_cache_res_mgr_arg = nullptr)
      : metadata(f),
        path(file_path),
        only_delete_metadata(false),
        file_metadata_cache_res_mgr(file_metadata_cache_res_mgr_arg) {}
        uncache_aggressiveness(_uncache_aggressiveness),
        file_metadata_cache_res_mgr(
            std::move(file_metadata_cache_res_mgr_arg)) {}

  ObsoleteFileInfo(const ObsoleteFileInfo&) = delete;
  ObsoleteFileInfo& operator=(const ObsoleteFileInfo&) = delete;

@@ -816,9 +820,13 @@ struct ObsoleteFileInfo {
  }

  ObsoleteFileInfo& operator=(ObsoleteFileInfo&& rhs) noexcept {
    path = std::move(rhs.path);
    metadata = rhs.metadata;
    rhs.metadata = nullptr;
    path = std::move(rhs.path);
    only_delete_metadata = rhs.only_delete_metadata;
    rhs.only_delete_metadata = false;
    uncache_aggressiveness = rhs.uncache_aggressiveness;
    rhs.uncache_aggressiveness = 0;
    file_metadata_cache_res_mgr = rhs.file_metadata_cache_res_mgr;
    rhs.file_metadata_cache_res_mgr = nullptr;

@@ -1495,10 +1503,7 @@ class VersionSet {
  void GetLiveFilesMetaData(std::vector<LiveFileMetaData>* metadata);

  void AddObsoleteBlobFile(uint64_t blob_file_number, std::string path) {
    assert(table_cache_);

    table_cache_->Erase(GetSliceForKey(&blob_file_number));

    // TODO: Erase file from BlobFileCache?
    obsolete_blob_files_.emplace_back(blob_file_number, std::move(path));
  }

@@ -1676,6 +1681,8 @@ class VersionSet {
  // Current size of manifest file
  uint64_t manifest_file_size_;

  // Obsolete files, or during DB shutdown any files not referenced by what's
  // left of the in-memory LSM state.
  std::vector<ObsoleteFileInfo> obsolete_files_;
  std::vector<ObsoleteBlobFileInfo> obsolete_blob_files_;
  std::vector<std::string> obsolete_manifests_;
@@ -417,6 +417,7 @@ DECLARE_bool(enable_memtable_insert_with_hint_prefix_extractor);
DECLARE_bool(check_multiget_consistency);
DECLARE_bool(check_multiget_entity_consistency);
DECLARE_bool(inplace_update_support);
DECLARE_uint32(uncache_aggressiveness);

constexpr long KB = 1024;
constexpr int kRandomValueMaxFactor = 3;

@@ -1406,4 +1406,11 @@ DEFINE_bool(check_multiget_entity_consistency, true,
DEFINE_bool(inplace_update_support,
            ROCKSDB_NAMESPACE::Options().inplace_update_support,
            "Options.inplace_update_support");

DEFINE_uint32(uncache_aggressiveness,
              ROCKSDB_NAMESPACE::ColumnFamilyOptions().uncache_aggressiveness,
              "Aggressiveness of erasing cache entries that are likely "
              "obsolete. 0 = disabled, 1 = minimum, 100 = moderate, 10000 = "
              "normal max");

#endif  // GFLAGS

@@ -3891,6 +3891,7 @@ void InitializeOptionsFromFlags(
  options.lowest_used_cache_tier =
      static_cast<CacheTier>(FLAGS_lowest_used_cache_tier);
  options.inplace_update_support = FLAGS_inplace_update_support;
  options.uncache_aggressiveness = FLAGS_uncache_aggressiveness;
}

void InitializeOptionsGeneral(
@@ -350,6 +350,48 @@ struct ColumnFamilyOptions : public AdvancedColumnFamilyOptions {
  // Dynamically changeable through SetOptions() API
  uint32_t memtable_max_range_deletions = 0;

  // EXPERIMENTAL
  // When > 0, RocksDB attempts to erase some block cache entries for files
  // that have become obsolete, which means they are about to be deleted.
  // To avoid excessive tracking, this "uncaching" process is iterative and
  // speculative, meaning it could incur extra background CPU effort if the
  // file's blocks are generally not cached. A larger number indicates more
  // willingness to spend CPU time to maximize block cache hit rates by
  // erasing known-obsolete entries.
  //
  // When uncache_aggressiveness=1, block cache entries for an obsolete file
  // are only erased until any attempted erase operation fails because the
  // block is not cached. Then no further attempts are made to erase cached
  // blocks for that file.
  //
  // For larger values, erasure is attempted until evidence indicates that the
  // chance of success is < 0.99^(a-1), where a = uncache_aggressiveness. For
  // example:
  // 2 -> Attempt only while expecting >= 99% successful/useful erasure
  // 11 -> 90%
  // 69 -> 50%
  // 110 -> 33%
  // 230 -> 10%
  // 460 -> 1%
  // 690 -> 0.1%
  // 1000 -> 1 in 23000
  // 10000 -> Always (for all practical purposes)
  // NOTE: UINT32_MAX and nearby values could take additional special meanings
  // in the future.
  //
  // Pinned cache entries (guaranteed present) are always erased if
  // uncache_aggressiveness > 0, but are not used in predicting the chances of
  // successful erasure of non-pinned entries.
  //
  // NOTE: In the case of copied DBs (such as Checkpoints) sharing a block
  // cache, it is possible that a file becoming obsolete doesn't mean its
  // block cache entries (shared among copies) are obsolete. Such a scenario
  // is the best case for uncache_aggressiveness = 0.
  //
  // Once validated in production, the default will likely change to something
  // around 300.
  uint32_t uncache_aggressiveness = 0;

  // Create ColumnFamilyOptions with default values for all fields
  ColumnFamilyOptions();
  // Create ColumnFamilyOptions from Options
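Aside (not part of the patch): a tiny standalone check of the relationship documented in the option comment above, threshold = 0.99^(a - 1), and its inverse a = 1 + ln(p) / ln(0.99) for a target success probability p. The printed values should line up with the table in the comment (e.g. a = 69 gives roughly 0.505, the "50%" row).

```cpp
// Standalone illustration of the uncache_aggressiveness threshold math.
#include <cmath>
#include <cstdio>

int main() {
  const unsigned aggressiveness[] = {2, 11, 69, 110, 230, 460, 690, 1000};
  for (unsigned a : aggressiveness) {
    // threshold = 0.99^(a-1); e.g. a=230 -> ~0.10 (the "10%" row)
    printf("a=%u -> threshold %.4f\n", a, std::pow(0.99, a - 1));
  }
  for (double p : {0.5, 0.1, 0.01}) {
    // Approximate aggressiveness needed to keep trying down to probability p
    printf("p=%.2f -> a ~= %.0f\n", p, 1.0 + std::log(p) / std::log(0.99));
  }
  return 0;
}
```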
@@ -523,6 +523,10 @@ static std::unordered_map<std::string, OptionTypeInfo>
     {offsetof(struct MutableCFOptions, bottommost_file_compaction_delay),
      OptionType::kUInt32T, OptionVerificationType::kNormal,
      OptionTypeFlags::kMutable}},
    {"uncache_aggressiveness",
     {offsetof(struct MutableCFOptions, uncache_aggressiveness),
      OptionType::kUInt32T, OptionVerificationType::kNormal,
      OptionTypeFlags::kMutable}},
    {"block_protection_bytes_per_key",
     {offsetof(struct MutableCFOptions, block_protection_bytes_per_key),
      OptionType::kUInt8T, OptionVerificationType::kNormal,

@@ -1122,11 +1126,12 @@ void MutableCFOptions::Dump(Logger* log) const {
                 report_bg_io_stats);
  ROCKS_LOG_INFO(log, " compression: %d",
                 static_cast<int>(compression));
  ROCKS_LOG_INFO(log,
                 " experimental_mempurge_threshold: %f",
  ROCKS_LOG_INFO(log, " experimental_mempurge_threshold: %f",
                 experimental_mempurge_threshold);
  ROCKS_LOG_INFO(log, " bottommost_file_compaction_delay: %" PRIu32,
                 bottommost_file_compaction_delay);
  ROCKS_LOG_INFO(log, " uncache_aggressiveness: %" PRIu32,
                 uncache_aggressiveness);

  // Universal Compaction Options
  ROCKS_LOG_INFO(log, "compaction_options_universal.size_ratio : %d",

@@ -173,7 +173,8 @@ struct MutableCFOptions {
        compression_per_level(options.compression_per_level),
        memtable_max_range_deletions(options.memtable_max_range_deletions),
        bottommost_file_compaction_delay(
            options.bottommost_file_compaction_delay) {
            options.bottommost_file_compaction_delay),
        uncache_aggressiveness(options.uncache_aggressiveness) {
    RefreshDerivedOptions(options.num_levels, options.compaction_style);
  }

@@ -223,7 +224,9 @@ struct MutableCFOptions {
        memtable_protection_bytes_per_key(0),
        block_protection_bytes_per_key(0),
        sample_for_compression(0),
        memtable_max_range_deletions(0) {}
        memtable_max_range_deletions(0),
        bottommost_file_compaction_delay(0),
        uncache_aggressiveness(0) {}

  explicit MutableCFOptions(const Options& options);

@@ -319,6 +322,7 @@ struct MutableCFOptions {
  std::vector<CompressionType> compression_per_level;
  uint32_t memtable_max_range_deletions;
  uint32_t bottommost_file_compaction_delay;
  uint32_t uncache_aggressiveness;

  // Derived options
  // Per-level target file size.
@@ -274,6 +274,7 @@ void UpdateColumnFamilyOptions(const MutableCFOptions& moptions,
  cf_opts->last_level_temperature = moptions.last_level_temperature;
  cf_opts->default_write_temperature = moptions.default_write_temperature;
  cf_opts->memtable_max_range_deletions = moptions.memtable_max_range_deletions;
  cf_opts->uncache_aggressiveness = moptions.uncache_aggressiveness;
}

void UpdateColumnFamilyOptions(const ImmutableCFOptions& ioptions,

@@ -565,7 +565,8 @@ TEST_F(OptionsSettableTest, ColumnFamilyOptionsAllFieldsSettable) {
      "persist_user_defined_timestamps=true;"
      "block_protection_bytes_per_key=1;"
      "memtable_max_range_deletions=999999;"
      "bottommost_file_compaction_delay=7200;",
      "bottommost_file_compaction_delay=7200;"
      "uncache_aggressiveness=1234;",
      new_options));

  ASSERT_NE(new_options->blob_cache.get(), nullptr);
@@ -135,7 +135,47 @@ extern const uint64_t kBlockBasedTableMagicNumber;
extern const std::string kHashIndexPrefixesBlock;
extern const std::string kHashIndexPrefixesMetadataBlock;

BlockBasedTable::~BlockBasedTable() { delete rep_; }
BlockBasedTable::~BlockBasedTable() {
  if (rep_->uncache_aggressiveness > 0 && rep_->table_options.block_cache) {
    if (rep_->filter) {
      rep_->filter->EraseFromCacheBeforeDestruction(
          rep_->uncache_aggressiveness);
    }
    if (rep_->index_reader) {
      {
        // TODO: Also uncache data blocks known after any gaps in partitioned
        // index. Right now the iterator errors out as soon as there's an
        // index partition not in cache.
        IndexBlockIter iiter_on_stack;
        ReadOptions ropts;
        ropts.read_tier = kBlockCacheTier;  // No I/O
        auto iiter = NewIndexIterator(
            ropts, /*disable_prefix_seek=*/false, &iiter_on_stack,
            /*get_context=*/nullptr, /*lookup_context=*/nullptr);
        std::unique_ptr<InternalIteratorBase<IndexValue>> iiter_unique_ptr;
        if (iiter != &iiter_on_stack) {
          iiter_unique_ptr.reset(iiter);
        }
        // Un-cache the data blocks the index iterator will tell us about
        // without I/O. (NOTE: It's extremely unlikely that a data block
        // will be in block cache without the index block pointing to it
        // also in block cache.)
        UncacheAggressivenessAdvisor advisor(rep_->uncache_aggressiveness);
        for (iiter->SeekToFirst(); iiter->Valid() && advisor.ShouldContinue();
             iiter->Next()) {
          bool erased = EraseFromCache(iiter->value().handle);
          advisor.Report(erased);
        }
        iiter->status().PermitUncheckedError();
      }

      // Un-cache the index block(s)
      rep_->index_reader->EraseFromCacheBeforeDestruction(
          rep_->uncache_aggressiveness);
    }
  }
  delete rep_;
}

namespace {
// Read the block identified by "handle" from "file".

@@ -2668,6 +2708,24 @@ Status BlockBasedTable::VerifyChecksumInMetaBlocks(
  return s;
}

bool BlockBasedTable::EraseFromCache(const BlockHandle& handle) const {
  assert(rep_ != nullptr);

  Cache* const cache = rep_->table_options.block_cache.get();
  if (cache == nullptr) {
    return false;
  }

  CacheKey key = GetCacheKey(rep_->base_cache_key, handle);

  Cache::Handle* const cache_handle = cache->Lookup(key.AsSlice());
  if (cache_handle == nullptr) {
    return false;
  }

  return cache->Release(cache_handle, /*erase_if_last_ref=*/true);
}

bool BlockBasedTable::TEST_BlockInCache(const BlockHandle& handle) const {
  assert(rep_ != nullptr);

@@ -3237,4 +3295,8 @@ void BlockBasedTable::DumpKeyValue(const Slice& key, const Slice& value,
  out_stream << " ------\n";
}

void BlockBasedTable::MarkObsolete(uint32_t uncache_aggressiveness) {
  rep_->uncache_aggressiveness = uncache_aggressiveness;
}

}  // namespace ROCKSDB_NAMESPACE

@@ -183,6 +183,8 @@ class BlockBasedTable : public TableReader {
  Status ApproximateKeyAnchors(const ReadOptions& read_options,
                               std::vector<Anchor>& anchors) override;

  bool EraseFromCache(const BlockHandle& handle) const;

  bool TEST_BlockInCache(const BlockHandle& handle) const;

  // Returns true if the block for the specified key is in cache.

@@ -208,6 +210,8 @@ class BlockBasedTable : public TableReader {
  Status VerifyChecksum(const ReadOptions& readOptions,
                        TableReaderCaller caller) override;

  void MarkObsolete(uint32_t uncache_aggressiveness) override;

  ~BlockBasedTable();

  bool TEST_FilterBlockInCache() const;

@@ -241,6 +245,8 @@ class BlockBasedTable : public TableReader {
                                   FilePrefetchBuffer* /* tail_prefetch_buffer */) {
      return Status::OK();
    }
    virtual void EraseFromCacheBeforeDestruction(
        uint32_t /*uncache_aggressiveness*/) {}
  };

  class IndexReaderCommon;

@@ -619,11 +625,7 @@ struct BlockBasedTable::Rep {

  std::shared_ptr<FragmentedRangeTombstoneList> fragmented_range_dels;

  // FIXME
  // If true, data blocks in this file are definitely ZSTD compressed. If false
  // they might not be. When false we skip creating a ZSTD digested
  // uncompression dictionary. Even if we get a false negative, things should
  // still work, just not as quickly.
  // Context for block cache CreateCallback
  BlockCreateContext create_context;

  // If global_seqno is used, all Keys in this file will have the same

@@ -672,6 +674,10 @@ struct BlockBasedTable::Rep {
  // `end_key` for range deletion entries.
  const bool user_defined_timestamps_persisted;

  // Set to >0 when the file is known to be obsolete and should have its block
  // cache entries evicted on close.
  uint32_t uncache_aggressiveness = 0;

  std::unique_ptr<CacheReservationManager::CacheReservationHandle>
      table_reader_cache_res_handle = nullptr;
@@ -156,4 +156,35 @@ template <typename TUse, typename TBlocklike>
using WithBlocklikeCheck = std::enable_if_t<
    TBlocklike::kCacheEntryRole == CacheEntryRole::kMisc || true, TUse>;

// Helper for the uncache_aggressiveness option
class UncacheAggressivenessAdvisor {
 public:
  UncacheAggressivenessAdvisor(uint32_t uncache_aggressiveness) {
    assert(uncache_aggressiveness > 0);
    allowance_ = std::min(uncache_aggressiveness, uint32_t{3});
    threshold_ = std::pow(0.99, uncache_aggressiveness - 1);
  }
  void Report(bool erased) { ++(erased ? useful_ : not_useful_); }
  bool ShouldContinue() {
    if (not_useful_ < allowance_) {
      return true;
    } else {
      // See UncacheAggressivenessAdvisor unit test
      return (useful_ + 1.0) / (useful_ + not_useful_ - allowance_ + 1.5) >=
             threshold_;
    }
  }

 private:
  // Baseline minimum number of "not useful" to consider stopping, to allow
  // sufficient evidence for checking the threshold. Actual minimum will be
  // higher as threshold gets well below 1.0.
  int allowance_;
  // After allowance, stop if useful ratio is below this threshold
  double threshold_;
  // Counts
  int useful_ = 0;
  int not_useful_ = 0;
};

}  // namespace ROCKSDB_NAMESPACE
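For illustration only (not the RocksDB header above): a standalone re-implementation of the same stopping rule, tracing the unit-test case aggressiveness=69 with the report sequence "0000". With threshold 0.99^68 ~= 0.505 and allowance 3, the advisor tolerates three failed erasures, and the fourth pushes the estimated usefulness below threshold, so it stops.

```cpp
// Standalone sketch of the UncacheAggressivenessAdvisor stopping rule.
#include <algorithm>
#include <cmath>
#include <cstdio>

int main() {
  const unsigned aggressiveness = 69;
  const int allowance = static_cast<int>(std::min(aggressiveness, 3u));
  const double threshold = std::pow(0.99, aggressiveness - 1);  // ~0.505
  int useful = 0, not_useful = 0;
  auto should_continue = [&] {
    return not_useful < allowance ||
           (useful + 1.0) / (useful + not_useful - allowance + 1.5) >= threshold;
  };
  for (const char c : {'0', '0', '0', '0'}) {  // four unsuccessful erasures
    printf("ShouldContinue=%d before report '%c'\n", should_continue() ? 1 : 0, c);
    ++(c == '1' ? useful : not_useful);
  }
  // After three failures: (0+1)/(0+3-3+1.5) = 0.667 >= 0.505 -> keep going.
  // After the fourth:     (0+1)/(0+4-3+1.5) = 0.400 <  0.505 -> stop.
  printf("ShouldContinue=%d at end\n", should_continue() ? 1 : 0);
  return 0;
}
```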
@@ -78,7 +78,7 @@ class CachableEntry {
      return *this;
    }

    ReleaseResource();
    ReleaseResource(/*erase_if_last_ref=*/false);

    value_ = rhs.value_;
    cache_ = rhs.cache_;

@@ -95,7 +95,7 @@ class CachableEntry {
    return *this;
  }

  ~CachableEntry() { ReleaseResource(); }
  ~CachableEntry() { ReleaseResource(/*erase_if_last_ref=*/false); }

  bool IsEmpty() const {
    return value_ == nullptr && cache_ == nullptr && cache_handle_ == nullptr &&

@@ -114,7 +114,12 @@ class CachableEntry {
  bool GetOwnValue() const { return own_value_; }

  void Reset() {
    ReleaseResource();
    ReleaseResource(/*erase_if_last_ref=*/false);
    ResetFields();
  }

  void ResetEraseIfLastRef() {
    ReleaseResource(/*erase_if_last_ref=*/true);
    ResetFields();
  }

@@ -200,10 +205,10 @@ class CachableEntry {
  }

 private:
  void ReleaseResource() noexcept {
  void ReleaseResource(bool erase_if_last_ref) noexcept {
    if (LIKELY(cache_handle_ != nullptr)) {
      assert(cache_ != nullptr);
      cache_->Release(cache_handle_);
      cache_->Release(cache_handle_, erase_if_last_ref);
    } else if (own_value_) {
      delete value_;
    }

@@ -169,6 +169,9 @@ class FilterBlockReader {
    return Status::OK();
  }

  virtual void EraseFromCacheBeforeDestruction(
      uint32_t /*uncache_aggressiveness*/) {}

  virtual bool RangeMayExist(const Slice* /*iterate_upper_bound*/,
                             const Slice& user_key_without_ts,
                             const SliceTransform* prefix_extractor,

@@ -155,6 +155,18 @@ bool FilterBlockReaderCommon<TBlocklike>::IsFilterCompatible(
  }
}

template <typename TBlocklike>
void FilterBlockReaderCommon<TBlocklike>::EraseFromCacheBeforeDestruction(
    uint32_t uncache_aggressiveness) {
  if (uncache_aggressiveness > 0) {
    if (filter_block_.IsCached()) {
      filter_block_.ResetEraseIfLastRef();
    } else {
      table()->EraseFromCache(table()->get_rep()->filter_handle);
    }
  }
}

// Explicitly instantiate templates for both "blocklike" types we use.
// This makes it possible to keep the template definitions in the .cc file.
template class FilterBlockReaderCommon<Block_kFilterPartitionIndex>;

@@ -42,6 +42,9 @@ class FilterBlockReaderCommon : public FilterBlockReader {
                               BlockCacheLookupContext* lookup_context,
                               const ReadOptions& read_options) override;

  void EraseFromCacheBeforeDestruction(
      uint32_t /*uncache_aggressiveness*/) override;

 protected:
  static Status ReadFilterBlock(const BlockBasedTable* table,
                                FilePrefetchBuffer* prefetch_buffer,

@@ -54,4 +54,15 @@ Status BlockBasedTable::IndexReaderCommon::GetOrReadIndexBlock(
                        cache_index_blocks(), get_context, lookup_context,
                        index_block);
}

void BlockBasedTable::IndexReaderCommon::EraseFromCacheBeforeDestruction(
    uint32_t uncache_aggressiveness) {
  if (uncache_aggressiveness > 0) {
    if (index_block_.IsCached()) {
      index_block_.ResetEraseIfLastRef();
    } else {
      table()->EraseFromCache(table()->get_rep()->index_handle);
    }
  }
}
}  // namespace ROCKSDB_NAMESPACE

@@ -24,6 +24,9 @@ class BlockBasedTable::IndexReaderCommon : public BlockBasedTable::IndexReader {
    assert(table_ != nullptr);
  }

  void EraseFromCacheBeforeDestruction(
      uint32_t /*uncache_aggressiveness*/) override;

 protected:
  static Status ReadIndexBlock(const BlockBasedTable* table,
                               FilePrefetchBuffer* prefetch_buffer,
@@ -538,6 +538,51 @@ Status PartitionedFilterBlockReader::CacheDependencies(
  return biter.status();
}

void PartitionedFilterBlockReader::EraseFromCacheBeforeDestruction(
    uint32_t uncache_aggressiveness) {
  // NOTE: essentially a copy of
  // PartitionIndexReader::EraseFromCacheBeforeDestruction
  if (uncache_aggressiveness > 0) {
    CachableEntry<Block_kFilterPartitionIndex> top_level_block;

    GetOrReadFilterBlock(/*no_io=*/true, /*get_context=*/nullptr,
                         /*lookup_context=*/nullptr, &top_level_block,
                         ReadOptions{})
        .PermitUncheckedError();

    if (!filter_map_.empty()) {
      // All partitions present if any
      for (auto& e : filter_map_) {
        e.second.ResetEraseIfLastRef();
      }
    } else if (!top_level_block.IsEmpty()) {
      IndexBlockIter biter;
      const InternalKeyComparator* const comparator = internal_comparator();
      Statistics* kNullStats = nullptr;
      top_level_block.GetValue()->NewIndexIterator(
          comparator->user_comparator(),
          table()->get_rep()->get_global_seqno(
              BlockType::kFilterPartitionIndex),
          &biter, kNullStats, true /* total_order_seek */,
          false /* have_first_key */, index_key_includes_seq(),
          index_value_is_full(), false /* block_contents_pinned */,
          user_defined_timestamps_persisted());

      UncacheAggressivenessAdvisor advisor(uncache_aggressiveness);
      for (biter.SeekToFirst(); biter.Valid() && advisor.ShouldContinue();
           biter.Next()) {
        bool erased = table()->EraseFromCache(biter.value().handle);
        advisor.Report(erased);
      }
      biter.status().PermitUncheckedError();
    }
    top_level_block.ResetEraseIfLastRef();
  }
  // Might be needed to un-cache a pinned top-level block
  FilterBlockReaderCommon<Block_kFilterPartitionIndex>::
      EraseFromCacheBeforeDestruction(uncache_aggressiveness);
}

const InternalKeyComparator* PartitionedFilterBlockReader::internal_comparator()
    const {
  assert(table());

@@ -167,6 +167,8 @@ class PartitionedFilterBlockReader
                  FilterManyFunction filter_function) const;
  Status CacheDependencies(const ReadOptions& ro, bool pin,
                           FilePrefetchBuffer* tail_prefetch_buffer) override;
  void EraseFromCacheBeforeDestruction(
      uint32_t /*uncache_aggressiveness*/) override;

  const InternalKeyComparator* internal_comparator() const;
  bool index_key_includes_seq() const;

@@ -223,4 +223,47 @@ Status PartitionIndexReader::CacheDependencies(
  return s;
}

void PartitionIndexReader::EraseFromCacheBeforeDestruction(
    uint32_t uncache_aggressiveness) {
  // NOTE: essentially a copy of
  // PartitionedFilterBlockReader::EraseFromCacheBeforeDestruction
  if (uncache_aggressiveness > 0) {
    CachableEntry<Block> top_level_block;

    GetOrReadIndexBlock(/*no_io=*/true, /*get_context=*/nullptr,
                        /*lookup_context=*/nullptr, &top_level_block,
                        ReadOptions{})
        .PermitUncheckedError();

    if (!partition_map_.empty()) {
      // All partitions present if any
      for (auto& e : partition_map_) {
        e.second.ResetEraseIfLastRef();
      }
    } else if (!top_level_block.IsEmpty()) {
      IndexBlockIter biter;
      const InternalKeyComparator* const comparator = internal_comparator();
      Statistics* kNullStats = nullptr;
      top_level_block.GetValue()->NewIndexIterator(
          comparator->user_comparator(),
          table()->get_rep()->get_global_seqno(BlockType::kIndex), &biter,
          kNullStats, true /* total_order_seek */, index_has_first_key(),
          index_key_includes_seq(), index_value_is_full(),
          false /* block_contents_pinned */,
          user_defined_timestamps_persisted());

      UncacheAggressivenessAdvisor advisor(uncache_aggressiveness);
      for (biter.SeekToFirst(); biter.Valid() && advisor.ShouldContinue();
           biter.Next()) {
        bool erased = table()->EraseFromCache(biter.value().handle);
        advisor.Report(erased);
      }
      biter.status().PermitUncheckedError();
    }
    top_level_block.ResetEraseIfLastRef();
  }
  // Might be needed to un-cache a pinned top-level block
  BlockBasedTable::IndexReaderCommon::EraseFromCacheBeforeDestruction(
      uncache_aggressiveness);
}
}  // namespace ROCKSDB_NAMESPACE

@@ -42,6 +42,8 @@ class PartitionIndexReader : public BlockBasedTable::IndexReaderCommon {
    // TODO(myabandeh): more accurate estimate of partition_map_ mem usage
    return usage;
  }
  void EraseFromCacheBeforeDestruction(
      uint32_t /*uncache_aggressiveness*/) override;

 private:
  PartitionIndexReader(const BlockBasedTable* t,

@@ -188,6 +188,13 @@ class TableReader {
                                TableReaderCaller /*caller*/) {
    return Status::NotSupported("VerifyChecksum() not supported");
  }

  // Tell the reader that the file should now be obsolete, e.g. as a hint
  // to delete relevant cache entries on destruction. (It might not be safe
  // to "unpin" cache entries until destruction time.)
  virtual void MarkObsolete(uint32_t /*uncache_aggressiveness*/) {
    // no-op as default
  }
};

}  // namespace ROCKSDB_NAMESPACE

@@ -712,6 +712,12 @@ DEFINE_int64(prepopulate_block_cache, 0,
             "Pre-populate hot/warm blocks in block cache. 0 to disable and 1 "
             "to insert during flush");

DEFINE_uint32(uncache_aggressiveness,
              ROCKSDB_NAMESPACE::ColumnFamilyOptions().uncache_aggressiveness,
              "Aggressiveness of erasing cache entries that are likely "
              "obsolete. 0 = disabled, 1 = minimum, 100 = moderate, 10000 = "
              "normal max");

DEFINE_bool(use_data_block_hash_index, false,
            "if use kDataBlockBinaryAndHash "
            "instead of kDataBlockBinarySearch. "

@@ -4293,6 +4299,7 @@ class Benchmark {
        FLAGS_level_compaction_dynamic_level_bytes;
    options.max_bytes_for_level_multiplier =
        FLAGS_max_bytes_for_level_multiplier;
    options.uncache_aggressiveness = FLAGS_uncache_aggressiveness;
    Status s =
        CreateMemTableRepFactory(config_options, &options.memtable_factory);
    if (!s.ok()) {
@@ -3,7 +3,7 @@
from __future__ import absolute_import, division, print_function, unicode_literals

import argparse

import math
import os
import random
import shutil

@@ -148,6 +148,7 @@ default_params = {
        "tiered_fixed_hyper_clock_cache", "tiered_auto_hyper_clock_cache",
        "tiered_auto_hyper_clock_cache"]
    ),
    "uncache_aggressiveness": lambda: int(math.pow(10, 4.0 * random.random()) - 1.0),
    "use_full_merge_v1": lambda: random.randint(0, 1),
    "use_merge": lambda: random.randint(0, 1),
    # use_put_entity_one_in has to be the same across invocations for verification to work, hence no lambda