Hi all,

Our cluster has hit a CephFS metadata-related failure: user requests that touch a specific directory (/danlu/ProjectStorage/, roughly 1.4 billion objects) get stuck. The MDS logs show a recurring loop that prints all dir_frags of certain directories. The issue may have been caused by the frequent changes we made to the filesystem subtree export pins. Can it be resolved by resetting the MDS map with "ceph fs reset <fs name> --yes-i-really-mean-it"?

1. Failure Timeline

1.0 Our CephFS is deployed with multiple MDS nodes. We control MDS load balancing manually with static pins, using 11 ranks in total (Rank 0-10). Among them, MDS Rank 4 (pinned to /danlu/ProjectStorage/) carries a relatively large share of the metadata.

1.1 Abnormal user access was detected, and slow METADATA warnings appeared on MDS 4.

1.2 MDS 4 was restarted. During recovery (the replay or rejoin phase), its memory usage exceeded the physical machine's 384 GB, causing repeated OOM (out-of-memory) kills.

1.3 New ranks (11, 12, 13) were added, but the OOM problem persisted.

1.4 The server hosting MDS Rank 4 was replaced with one that has more memory. The MDS then started normally and became active.

1.5 While traversing the extended attributes of files under Rank 4, we found that /danlu/ProjectStorage/3990/ alone contains roughly 80 million objects. For load balancing, the /danlu/ProjectStorage/3990 subtree was statically pinned to the newly added Rank 11 (see the command sketch after this timeline).

1.6 While Rank 4 was exporting that data to Rank 11, mds.4 restarted automatically because mds_beacon_grace was set too short (120 seconds). Afterwards both Rank 4 and Rank 11 got stuck in an abnormal state (rejoin or clientreplay).

1.7 With the following settings, Rank 4 and Rank 11 returned to active (all of them were reverted later), but /danlu/ProjectStorage/ still failed to serve requests:
    mds_beacon_grace = 3600
    mds_bal_interval = 0
    mds_wipe_sessions = true
    mds_replay_unsafe_with_closed_session = true

1.8 We then tried to move the static pin of /danlu/ProjectStorage/ from Rank 4 to Rank 11.

1.9 Current (stable) state: the process status and logs of Rank 4 and Rank 11 show the md_submit thread running at 100% CPU. The logs indicate that Rank 4 is exporting the /danlu/ProjectStorage/3990 subtree to Rank 11, while Rank 11 reports empty imports (export_empty_import) and bounces them back to Rank 4.
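For reference, the pin changes (1.5, 1.8) and the temporary settings (1.7) amount to commands roughly like the following. This is only a sketch: /mnt/cephfs is just a placeholder for our client mount point, and "ceph config set" stands in for whatever mechanism is used to apply the options.

    # Pin the hot subtree to Rank 11 via the ceph.dir.pin xattr (placeholder mount point)
    setfattr -n ceph.dir.pin -v 11 /mnt/cephfs/danlu/ProjectStorage/3990
    # Verify the pin
    getfattr -n ceph.dir.pin /mnt/cephfs/danlu/ProjectStorage/3990

    # Temporary MDS settings from step 1.7 (reverted afterwards)
    ceph config set mds mds_beacon_grace 3600
    ceph config set mds mds_bal_interval 0
    ceph config set mds mds_wipe_sessions true
    ceph config set mds mds_replay_unsafe_with_closed_session true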
2. FS Status

ceph fs status

dl_cephfs - 186 clients
=========
+------+--------+---------------------------+---------------+-------+-------+
| Rank | State  | MDS                       | Activity      | dns   | inos  |
+------+--------+---------------------------+---------------+-------+-------+
|  0   | active | ceph-prod-45              | Reqs:    5 /s | 114k  | 113k  |
|  1   | active | ceph-prod-46              | Reqs:    2 /s | 118k  | 117k  |
|  2   | active | ceph-prod-47              | Reqs:   11 /s | 5124k | 5065k |
|  3   | active | ceph-prod-10              | Reqs:  117 /s | 2408k | 2396k |
|  4   | active | ceph-prod-01              | Reqs:    0 /s | 31.5k | 26.9k |
|  5   | active | ceph-prod-02              | Reqs:    0 /s | 80.0k | 78.9k |
|  6   | active | ceph-prod-11              | Reqs:    2 /s | 1145k | 1145k |
|  7   | active | ceph-prod-57              | Reqs:    1 /s | 169k  | 168k  |
|  8   | active | ceph-prod-44              | Reqs:   33 /s | 10.1M | 10.1M |
|  9   | active | ceph-prod-20              | Reqs:    4 /s | 196k  | 195k  |
|  10  | active | ceph-prod-43              | Reqs:    2 /s | 1758k | 1751k |
|  11  | active | ceph-prod-48              | Reqs:    0 /s | 2879k | 2849k |
|  12  | active | ceph-prod-60              | Reqs:    0 /s | 1875  | 1738  |
|  13  | active | fuxi-aliyun-ceph-res-tmp3 | Reqs:    3 /s | 80.8k | 60.3k |
+------+--------+---------------------------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata | 2109G | 2741G |
|   cephfs_data   |   data   | 1457T |  205T |
+-----------------+----------+-------+-------+
+--------------------------+
|        Standby MDS       |
+--------------------------+
| fuxi-aliyun-ceph-res-tmp |
+--------------------------+

3. MDS Ops

3.1 MDS.4

{
    "description": "rejoin:client.92313222:258",
    "initiated_at": "2025-08-30 05:24:02.110611",
    "age": 9389.4134921530003,
    "duration": 9389.4136841620002,
    "type_data": {
        "flag_point": "dispatched",
        "reqid": "client.92313222:258",
        "op_type": "no_available_op_found",
        "events": [
            {
                "time": "2025-08-30 05:24:02.110611",
                "event": "initiated"
            },
            {
                "time": "2025-08-30 05:24:02.110611",
                "event": "header_read"
            },
            {
                "time": "2025-08-30 05:24:02.110610",
                "event": "throttled"
            },
            {
                "time": "2025-08-30 05:24:02.110626",
                "event": "all_read"
            },
            {
                "time": "2025-08-30 05:24:02.110655",
                "event": "dispatched"
            }
        ]
    }
}

3.2 MDS.11

{
    "ops": [
        {
            "description": "rejoin:client.87797615:61578323",
            "initiated_at": "2025-08-29 23:36:22.696648",
            "age": 38670.621505474002,
            "duration": 38670.621529709002,
            "type_data": {
                "flag_point": "dispatched",
                "reqid": "client.87797615:61578323",
                "op_type": "no_available_op_found",
                "events": [
                    {
                        "time": "2025-08-29 23:36:22.696648",
                        "event": "initiated"
                    },
                    {
                        "time": "2025-08-29 23:36:22.696648",
                        "event": "header_read"
                    },
                    {
                        "time": "2025-08-29 23:36:22.696646",
                        "event": "throttled"
                    },
                    {
                        "time": "2025-08-29 23:36:22.696659",
                        "event": "all_read"
                    },
                    {
                        "time": "2025-08-29 23:36:23.303274",
                        "event": "dispatched"
                    }
                ]
            }
        }
    ],
    "num_ops": 1
}
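For context, the two ops above come from the MDS admin sockets; something like the following should reproduce them (ceph-prod-01 holds Rank 4 and ceph-prod-48 holds Rank 11, per the status output above):

    # On the host running Rank 4
    ceph daemon mds.ceph-prod-01 dump_ops_in_flight
    # On the host running Rank 11
    ceph daemon mds.ceph-prod-48 dump_ops_in_flight
    # Blocked ops can be listed the same way
    ceph daemon mds.ceph-prod-48 dump_blocked_ops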
4. MDS Logs

4.1 mds.4

--- Start of Loop ---
2025-08-29 21:42:35.749 7fcfa3d45700 7 mds.4.migrator export_dir [dir 0x1000340af7a.00001* /danlu/ProjectStorage/ [2,head] auth{0=3,11=2} pv=173913808 v=173913806 cv=0/0 dir_auth=4 ap=2+1 state=1610613760|exporting f(v3168 91=0+91) n(v15319930 rc2025-07-17 14:10:02.703809 b6931715137150 1953080=1912827+40253) hs=1+0,ss=0+0 dirty=1 | request=1 child=1 subtree=1 subtreetemp=0 replicated=1 dirty=1 authpin=1 0x558fb40f8500] to 11
2025-08-29 21:42:35.749 7fcfa3d45700 7 mds.4.migrator already exporting
2025-08-29 21:42:35.749 7fcfa3d45700 7 mds.4.migrator export_dir [dir 0x1000340af7a.00110* /danlu/ProjectStorage/ [2,head] auth{0=3,11=2} pv=173913735 v=173913732 cv=0/0 dir_auth=4 ap=3+3 state=1610613760|exporting f(v3168 102=0+102) n(v15319926 rc2025-07-24 13:53:23.519020 b448426565839 422059=420844+1215) hs=3+0,ss=0+0 dirty=1 | request=1 child=1 subtree=1 subtreetemp=0 replicated=1 dirty=1 authpin=1 0x558fb40f8a00] to 11
2025-08-29 21:42:35.749 7fcfa3d45700 7 mds.4.migrator already exporting
2025-08-29 21:42:35.749 7fcfa3d45700 7 mds.4.migrator export_dir [dir 0x1000340af7a.00111* /danlu/ProjectStorage/ [2,head] auth{0=3,1=2,2=2,10=2,11=2,13=2} pv=173913731 v=173913730 cv=0/0 dir_auth=4 ap=2+1 state=1610613760|exporting f(v3168 102=0+102) n(v15319926 rc2025-08-27 12:17:12.186080 b25441350802673 324852174=320337481+4514693) hs=2+0,ss=0+0 | request=1 child=1 subtree=1 subtreetemp=0 replicated=1 dirty=1 authpin=1 0x558fb40f8f00] to 11
2025-08-29 21:42:35.749 7fcfa3d45700 7 mds.4.migrator already exporting
2025-08-29 21:42:35.749 7fcfa3d45700 7 mds.4.migrator export_dir [dir 0x1000340af7a.01001* /danlu/ProjectStorage/ [2,head] auth{0=3,1=2,2=2,3=2,6=2,7=2,8=3,9=2,10=2,11=2,13=2} pv=173914304 v=173914298 cv=0/0 dir_auth=4 ap=4+21 state=1610613760|exporting f(v3168 101=0+101) n(v15319926 rc2025-08-28 10:44:08.113080 b96173795872233 155213668=143249296+11964372) hs=2+0,ss=0+0 dirty=2 | request=1 child=1 subtree=1 subtreetemp=0 replicated=1 dirty=1 authpin=1 0x558fb40f9400] to 11
2025-08-29 21:42:35.749 7fcfa3d45700 7 mds.4.migrator already exporting
... many more export_dir / "already exporting" lines for other dir_frags omitted ...
--- End of Loop ---

4.2 mds.11

--- Start of Loop ---
2025-08-29 21:42:31.991 7f6ab9899700 2 mds.11.cache Memory usage: total 50859980, rss 25329048, heap 331976, baseline 331976, 0 / 2849528 inodes have caps, 0 caps, 0 caps per inode
2025-08-29 21:42:31.991 7f6ab9899700 7 mds.11.server recall_client_state: min=100 max=1048576 total=0 flags=0xa
2025-08-29 21:42:31.991 7f6ab9899700 7 mds.11.server recalled 0 client caps.
2025-08-29 21:42:31.991 7f6abc09e700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 30395.764013 secs
2025-08-29 21:42:31.995 7f6abe0a2700 3 mds.11.server handle_client_session client_session(request_renewcaps seq 1895) from client.91614761
2025-08-29 21:42:32.703 7f6abe0a2700 5 mds.ceph-prod-01 handle_mds_map old map epoch 1656906 <= 1656906, discarding
2025-08-29 21:42:32.703 7f6abe0a2700 3 mds.11.server handle_client_session client_session(request_renewcaps seq 1743) from client.91533406
2025-08-29 21:42:32.703 7f6abe0a2700 3 mds.11.server handle_client_session client_session(request_renewcaps seq 1737) from client.76133646
2025-08-29 21:42:32.703 7f6abe0a2700 3 mds.11.server handle_client_session client_session(request_renewcaps seq 1739) from client.91667655
2025-08-29 21:42:32.703 7f6abe0a2700 3 mds.11.server handle_client_session client_session(request_renewcaps seq 1895) from client.92253020
2025-08-29 21:42:32.703 7f6abe0a2700 3 mds.11.server handle_client_session client_session(request_renewcaps seq 1895) from client.92348807
2025-08-29 21:42:32.703 7f6abe0a2700 3 mds.11.server handle_client_session client_session(request_renewcaps seq 1895) from client.91480971
2025-08-29 21:42:32.703 7f6abe0a2700 3 mds.11.server handle_client_session client_session(request_renewcaps seq 1751) from client.87797615
2025-08-29 21:42:32.703 7f6abe0a2700 3 mds.11.server handle_client_session client_session(request_renewcaps seq 1739) from client.68366166
2025-08-29 21:42:32.703 7f6abe0a2700 3 mds.11.server handle_client_session client_session(request_renewcaps seq 1390) from client.66458847
2025-08-29 21:42:32.703 7f6abe0a2700 3 mds.11.server handle_client_session client_session(request_renewcaps seq 506) from client.91440537
2025-08-29 21:42:32.703 7f6abe0a2700 3 mds.11.server handle_client_session client_session(request_renewcaps seq 309) from client.91117710
2025-08-29 21:42:32.703 7f6abe0a2700 3 mds.11.server handle_client_session client_session(request_renewcaps seq 283) from client.92192971
2025-08-29 21:42:32.703 7f6abe0a2700 7 mds.11.locker handle_file_lock a=nudge on (inest mix->lock(2) g=4 dirty) from mds.4 [inode 0x5002aada391 [...2,head] /danlu/ProjectStorage/1352/L36/screenTouchHtml/2025-08-27/ auth{4=2,10=3} v621993 pv621997 ap=4 dirtyparent f(v1 m2025-08-28 10:37:07.785514 1=0+1) n(v2 rc2025-08-28 10:37:07.809515 b29740 4=1+3) (inest mix->lock(2) g=4 dirty) (ifile mix->lock(2) w=1 flushing scatter_wanted) (iversion lock) | dirtyscattered=2 lock=1 importing=0 dirfrag=1 dirtyrstat=0 dirtyparent=1 replicated=1 dirty=1 authpin=1 0x5579af8b5c00]
2025-08-29 21:42:32.703 7f6abe0a2700 7 mds.11.locker handle_file_lock trying nudge on (inest mix->lock(2) g=4 dirty) on [inode 0x5002aada391 [...2,head] /danlu/ProjectStorage/1352/L36/screenTouchHtml/2025-08-27/ auth{4=2,10=3} v621993 pv621997 ap=4 dirtyparent f(v1 m2025-08-28 10:37:07.785514 1=0+1) n(v2 rc2025-08-28 10:37:07.809515 b29740 4=1+3) (inest mix->lock(2) g=4 dirty) (ifile mix->lock(2) w=1 flushing scatter_wanted) (iversion lock) | dirtyscattered=2 lock=1 importing=0 dirfrag=1 dirtyrstat=0 dirtyparent=1 replicated=1 dirty=1 authpin=1 0x5579af8b5c00]
2025-08-29 21:42:32.991 7f6ab9899700 7 mds.11.cache trim bytes_used=6GB limit=32GB reservation=0.05% count=0
2025-08-29 21:42:32.991 7f6ab9899700 7 mds.11.cache trim_lru trimming 0 items from LRU size=2879068 mid=1961726 pintail=0 pinned=76602
2025-08-29 21:42:32.991 7f6ab9899700 7 mds.11.cache trim_lru trimmed 0 items
2025-08-29 21:42:32.991 7f6ab9899700 7 mds.11.migrator export_empty_import [dir 0x5002aef450c.100110000001011* /danlu/ProjectStorage/3990/scrapy_html/ [2,head] auth{4=2} v=228873 cv=0/0 dir_auth=11 state=1073741824 f(v1 m2025-06-10 11:04:10.789649 1091=1091+0) n(v8 rc2025-06-10 11:04:10.789649 b133525839 1091=1091+0) hs=0+0,ss=0+0 | subtree=1 subtreetemp=0 replicated=1 0x557627968000]
2025-08-29 21:42:32.991 7f6ab9899700 7 mds.11.migrator really empty, exporting to 4
2025-08-29 21:42:32.991 7f6ab9899700 7 mds.11.migrator exporting to mds.4 empty import [dir 0x5002aef450c.100110000001011* /danlu/ProjectStorage/3990/scrapy_html/ [2,head] auth{4=2} v=228873 cv=0/0 dir_auth=11 state=1073741824 f(v1 m2025-06-10 11:04:10.789649 1091=1091+0) n(v8 rc2025-06-10 11:04:10.789649 b133525839 1091=1091+0) hs=0+0,ss=0+0 | subtree=1 subtreetemp=0 replicated=1 0x557627968000]
2025-08-29 21:42:32.991 7f6ab9899700 7 mds.11.migrator export_dir [dir 0x5002aef450c.100110000001011* /danlu/ProjectStorage/3990/scrapy_html/ [2,head] auth{4=2} v=228873 cv=0/0 dir_auth=11 state=1073741824 f(v1 m2025-06-10 11:04:10.789649 1091=1091+0) n(v8 rc2025-06-10 11:04:10.789649 b133525839 1091=1091+0) hs=0+0,ss=0+0 | subtree=1 subtreetemp=0 replicated=1 0x557627968000] to 4
... many more export_empty_import lines for other dir_frags omitted ...
--- End of Loop; all of the export_empty_import lines above are printed on every iteration ---

5. Cluster Information

Version: v14.2.21

Current config:

[global]
... cluster IPs omitted ...
# updated on Sep 26, 2021
rbd_cache = false
rbd_op_threads = 4
mon_osd_nearfull_ratio = 0.89
mds_beacon_grace = 300
mon_max_pg_per_osd = 300

[mon]
mon_allow_pool_delete = true
mon osd allow primary affinity = true
mon_osd_cache_size = 40000
rocksdb_cache_size = 5368709120
debug_mon = 5/5
mon_lease_ack_timeout_factor = 4
mon_osd_min_down_reporters = 3

[mds]
mds_cache_memory_limit = 34359738368
#mds_cache_memory_limit = 23622320128
mds_bal_min_rebalance = 1000
mds_log_max_segments = 256
mds_session_blacklist_on_evict = false
mds_session_blacklist_on_timeout = false
mds_cap_revoke_eviction_timeout = 300

[osd]
# updated on Sep 26, 2021
osd_op_threads = 32
osd_op_num_shards = 16
osd_op_num_threads_per_shard_hdd = 1
osd_op_num_threads_per_shard_ssd = 2
#rbd_op_threads = 16
osd_disk_threads = 16
filestore_op_threads = 16
osd_scrub_min_interval = 1209600
osd_scrub_max_interval = 2592000
osd_deep_scrub_interval = 2592000
osd_scrub_begin_hour = 23
osd_scrub_end_hour = 18
osd_scrub_chunk_min = 5
osd_scrub_chunk_max = 25
osd_max_scrubs = 2
osd_op_queue_cut_off = high
osd_max_backfills = 4
bluefs_buffered_io = true
#bluestore_fsck_quick_fix_threads = 2
osd_max_pg_per_osd_hard_ratio = 6
osd_crush_update_on_start = false
osd_scrub_during_recovery = true
osd_repair_during_recovery = true
bluestore_cache_trim_max_skip_pinned = 10000
osd_heartbeat_grace = 60
osd_heartbeat_interval = 60
osd_op_thread_suicide_timeout = 2000
osd_op_thread_timeout = 90

[client]
rgw cache enabled = true
rgw cache lru size = 100000
rgw thread pool size = 4096
rgw_max_concurrent_requests = 4096
rgw override bucket index max shards = 32
rgw lifecycle work time = 00:00-24:00

Best Regards,
Jun He

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx