www.fabiankeil.de/gehacktes/gpt-and-geli-recovery/
Recently the following disk experienced a data corruption event of unknown origin:
[fk@steffen ~]$ diskinfo -v /dev/ada1
/dev/ada1
	512             # sectorsize
	4000785948160   # mediasize in bytes (3.6T)
	7814035055      # mediasize in sectors
	4096            # stripesize
	0               # stripeoffset
	7752018         # Cylinders according to firmware.
	16              # Heads according to firmware.
	63              # Sectors according to firmware.
	ST4000DM005-2DP166      # Disk descr.
	ZDH0ZY73        # Disk ident.
	No              # TRIM/UNMAP support
	5980            # Rotation rate in RPM
	Not_Zoned       # Zone Mode
The disk was working fine while it was in use. Then the system was shut down and the disk was detached. When the disk was supposed to be used again a couple of days later, some sectors were corrupted and the ElectroBSD kernel was no longer able to even read the partition table.
2021-12-05T13:52:00.527737+01:00 steffen kernel <2>1 - - - ada1: <ST4000DM005-2DP166 0001> ACS-3 ATA SATA 3.x device
2021-12-05T13:52:00.527757+01:00 steffen kernel <2>1 - - - ada1: Serial Number ZDH0ZY73
2021-12-05T13:52:00.527777+01:00 steffen kernel <2>1 - - - ada1: 150.000MB/s transfers (SATA, UDMA5, PIO 8192bytes)
2021-12-05T13:52:00.527797+01:00 steffen kernel <2>1 - - - ada1: 3815446MB (7814035055 512 byte sectors)
2021-12-05T13:52:00.527826+01:00 steffen kernel <2>1 - - - ada1: quirks=0x1<4K>
2021-12-05T13:52:00.846081+01:00 steffen kernel <2>1 - - - GEOM: ada1: corrupt or invalid GPT detected.
2021-12-05T13:52:00.846156+01:00 steffen kernel <2>1 - - - GEOM: ada1: GPT rejected -- may not be recoverable.
As a result no devices for the individual partitions were created and the data could not be easily accessed.
The disk didn't store particularly important data, but recovering it was a useful exercise in case another disk with more important data behaves similarly in the future.
The disk itself did not report any problems:
fk@t520.local /home/fk $ssh steffen sudo smartctl -a /dev/ada1
smartctl 7.2 2020-12-30 r5155 [ElectroBSD 12.3-STABLE amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate BarraCuda 3.5
Device Model:     ST4000DM005-2DP166
Serial Number:    ZDH0ZY73
LU WWN Device Id: 5 000c50 0a23b5ea9
Firmware Version: 0001
User Capacity:    4,000,785,948,160 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Fri Dec 17 13:13:40 2021 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		(  601) seconds.
Offline data collection
capabilities: 			 (0x73)	SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 675) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x10a5)	SCT Status supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   064   006    Pre-fail  Always       -       90559
  3 Spin_Up_Time            0x0003   095   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       154
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   080   060   045    Pre-fail  Always       -       93302826
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       552 (209 46 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       154
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   075   053   040    Old_age   Always       -       25 (Min/Max 16/25)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       12
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1940
194 Temperature_Celsius     0x0022   025   047   000    Old_age   Always       -       25 (0 10 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       493h+16m+55.235s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       13556691644
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       49089938488

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       433         -
# 2  Extended offline    Interrupted (host reset)      00%       409         -
# 3  Short offline       Completed without error       00%         1         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
It was known that the disk had once upon a time been partitioned with cloudiatr. The disk therefore had five partitions, and the last one was the most important one as it contained a ZFS data pool which was only accessible through geli. The data pool contained a volume which was accessed through ggated and contained another ZFS pool with another geli layer that was managed with zogftw.
Unfortunately no backup of the partition table was available, but there was a backup of the geli meta data for the data pool:
fk@t520 ~ $sudo geli dump ~/.config/zogftw/geli/metadata-backups/gpt_dpool-ada0_baracuda_4tb.eli
Metadata on /home/fk/.config/zogftw/geli/metadata-backups/gpt_dpool-ada0_baracuda_4tb.eli:
     magic: GEOM::ELI
   version: 7
     flags: 0x0
     ealgo: AES-XTS
    keylen: 128
  provsize: 3985543004160
sectorsize: 512
      keys: 0x01
iterations: 447024
      Salt: 23[...]05
Master Key: 20[...]29
  MD5 hash: ba992035efd4fac233b83c17c354c99b
A backup of the geli meta data for the cloudia2 pool was available as well:
fk@t520 ~ $sudo geli dump ~/.config/zogftw/geli/metadata-backups/cloudia2.eli
Metadata on /home/fk/.config/zogftw/geli/metadata-backups/cloudia2.eli:
     magic: GEOM::ELI
   version: 7
     flags: 0x0
     ealgo: AES-XTS
    keylen: 256
  provsize: 3298534882816
sectorsize: 4096
      keys: 0x01
iterations: 805315
      Salt: 8d[...]25
Master Key: 58[...]a7
  MD5 hash: 3fa6840214b3fa32d3e60dbfe76596f1
In theory it should obviously be possible to ignore the partition table completely and simply use gnop with the right parameters to create a provider that contains a valid geli label and has the proper size.
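The reason the provider size matters can be sketched with a bit of arithmetic (my own illustration, not tool output): geli keeps its metadata in the last sector of a provider, so for the provider size recorded in the backup the label has a fixed expected position.

```python
# geli stores its metadata in the last sector of the provider, so a
# gnop provider of exactly provsize bytes puts the label where geli
# expects it. Numbers are taken from the backed-up metadata above.
provsize = 3985543004160   # provsize from the metadata backup
sectorsize = 512           # sectorsize from the metadata backup
label_offset = provsize - sectorsize
print(label_offset)        # byte offset of the label within the provider
```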
Thus geli was patched to add a search subcommand whose purpose is to:
Search for metadata on the provider, starting at the end and going backwards until valid metadata is found or the beginning of the provider is reached. This subcommand may be useful if, for example, the GPT partition data got corrupted or deleted while the data on the previously accessible partitions is still expected to be valid.
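The scan described above can be sketched as follows; this is a simplified Python model of my own (the real subcommand is C code inside geli and validates far more than the magic string, e.g. the metadata version and MD5 hash), with the provider simulated by an in-memory byte buffer:

```python
# Simplified model of a backwards, sector-aligned metadata search:
# start at the last sector boundary and walk towards the beginning
# until a sector starting with the GEOM::ELI magic is found.
SECTOR_SIZE = 512
MAGIC = b"GEOM::ELI"

def search_eli_metadata(provider: bytes, sector_size: int = SECTOR_SIZE):
    """Return the byte offset of the last sector that starts with the
    GEOM::ELI magic, or None if no candidate metadata is found."""
    last = (len(provider) // sector_size) * sector_size
    for offset in range(last - sector_size, -1, -sector_size):
        if provider[offset:offset + len(MAGIC)] == MAGIC:
            return offset
    return None

# Simulate a 1 MiB "disk" whose geli metadata survived in the last
# sector of a former partition that ended 4 sectors before the disk end.
disk = bytearray(1024 * 1024)
metadata_offset = len(disk) - 5 * SECTOR_SIZE
disk[metadata_offset:metadata_offset + len(MAGIC)] = MAGIC
print(search_eli_metadata(bytes(disk)))
```

Searching backwards is the natural order here because the interesting label usually belongs to the last (and largest) partition, so it is found after few probes.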
While the patch worked as advertised in tests, it failed to discover the geli label on the actual disk, presumably because the label was completely gone or corrupted.
Theoretically it should also be possible to simply put the backup label back on the disk, but if the correct position isn't known, this would additionally require the use of gnop to make sure the ZFS meta data is where ZFS looks for it.
Instead of doing that, an attempt was made to make the kernel less picky about the state of the partition data.
ElectroBSD already inherited a kern.geom.part.check_integrity sysctl from FreeBSD and a patch was created to extend it.
With the patch and kern.geom.part.check_integrity set to 0 the kernel was able to find some partition data:
2021-12-05T15:49:45.631357+01:00 steffen kernel <2>1 - - - GEOM: hdr_lba_end (7814037127) < hdr->hdr_lba_start (40) or hrdr_lba_end >= last (7814035054) for ada1.
2021-12-05T15:49:45.631376+01:00 steffen kernel <2>1 - - - GEOM: Reading sector 4000787029504 of size 512 from ada1.
2021-12-05T15:49:45.631393+01:00 steffen kernel <2>1 - - - GEOM: ada1: the secondary GPT table is corrupt or invalid.
2021-12-05T15:49:45.631409+01:00 steffen kernel <2>1 - - - GEOM: ada1: using the primary only -- recovery suggested.
2021-12-05T15:49:45.631424+01:00 steffen kernel <2>1 - - - GEOM_PART: integrity check failed (ada1, GPT)
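The failed sanity check in the first log line can be reproduced with the logged numbers; the following is a rough Python model of my own, not the kernel code:

```python
def gpt_header_plausible(hdr_lba_start: int, hdr_lba_end: int,
                         last_usable: int) -> bool:
    """Rough model of the logged GPT sanity check: the usable LBA range
    must not be inverted and must end before the sectors reserved for
    the backup GPT (illustrative only, not the actual kernel code)."""
    return hdr_lba_start <= hdr_lba_end < last_usable

# Numbers from the kernel log: the header claims hdr_lba_end 7814037127
# although only LBAs below 7814035054 are available on this disk, so
# the check fails and, without the patch, the GPT would be rejected.
print(gpt_header_plausible(40, 7814037127, 7814035054))  # False
```

This also shows why the corruption looked like a size mismatch: the header describes a disk roughly 2000 sectors larger than the one it sits on.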
gpart also showed a corrupt partition table, but that's better than no partition table at all:
[fk@steffen ~]$ gpart show ada1
=>        40  7814037088  ada1  GPT  (3.6T) [CORRUPT]
          40         512     1  freebsd-boot  (256K)
         552        1496        - free -  (748K)
        2048      409600     2  freebsd-zfs  (200M)
      411648    20971520     3  freebsd-zfs  (10G)
    21383168     8388608     4  freebsd-swap  (4.0G)
    29771776  7784263680     5  freebsd-zfs  (3.6T)
  7814035456        1672        - free -  (836K)
The first four partitions appeared valid, but unfortunately partition five could still not be attached with geli, and gpart's recover subcommand didn't help either.
Deleting partition five and recreating it without specifying a size worked, though:
[fk@steffen ~]$ sudo gpart delete -i 5 /dev/ada1
ada1p5 deleted
[fk@steffen ~]$ sudo gpart add -i 5 -t freebsd-zfs /dev/ada1
ada1p5 added
[fk@steffen ~]$ gpart show ada1
=>        40  7814034982  ada1  GPT  (3.6T) [CORRUPT]
          40         512     1  freebsd-boot  (256K)
         552        1496        - free -  (748K)
        2048      409600     2  freebsd-zfs  (200M)
      411648    20971520     3  freebsd-zfs  (10G)
    21383168     8388608     4  freebsd-swap  (4.0G)
    29771776  7784263240     5  freebsd-zfs  (3.6T)
  7814035016           6        - free -  (3.0K)
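A bit of back-of-the-envelope arithmetic (my own, not tool output) shows why the old partition entry was the problem: assuming the usual GPT layout where the last 34 sectors are reserved for the backup header and table, the old entry overran the usable area while the recreated one fits exactly.

```python
# Why the old partition 5 entry was invalid, using the numbers from
# diskinfo and the two gpart listings above. The "34 reserved sectors"
# (1 backup header + 32 table sectors + protective MBR accounting)
# is the common GPT layout and an assumption here.
sectors = 7814035055                 # mediasize in sectors (diskinfo)
last_usable = sectors - 34           # last LBA usable for partitions
old_end = 29771776 + 7784263680 - 1  # old partition 5: start + size - 1
new_end = 29771776 + 7784263240 - 1  # recreated partition 5
print(old_end > last_usable)         # old entry overruns the usable area
print(new_end <= last_usable)        # recreated entry fits
```

Recreating the partition without a size simply let gpart pick the largest size that still fits, which is why the new partition ended up 440 sectors smaller.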
The [CORRUPT] marker was gone after a reboot.
At first, geli was still not able to read meta data from the fifth partition, but after restoring the backup meta data with the force flag (to ignore the size mismatch) the geli provider could be attached again and the ZFS pool could be imported:
[fk@steffen ~]$ sudo geli restore -f gpt_dpool-ada0_baracuda_4tb.eli /dev/ada1p5
[fk@steffen ~]$ sudo geli dump /dev/ada1p5
Metadata on /dev/ada1p5:
     magic: GEOM::ELI
   version: 7
     flags: 0x0
     ealgo: AES-XTS
    keylen: 128
  provsize: 3985542778880
sectorsize: 512
      keys: 0x01
iterations: 447024
      Salt: 23[...]05
Master Key: 20[...]29
  MD5 hash: 9fdffeb97ca6b34379512a9191cbfaeb
A zpool scrub revealed a few errors, but far fewer than expected:
[fk@steffen ~]$ sudo zpool status -v dpool
  pool: dpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 1 days 09:00:47 with 66 errors on 2021-12-07 18:40:55
config:

	NAME          STATE     READ WRITE CKSUM
	dpool         ONLINE       0     0     0
	  ada1p5.eli  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        dpool/ggated/cloudia2:<0x1>
        dpool/ggated/cloudia2@2017-04-20_21:27:<0x1>
Unfortunately all the errors occurred in the zvol for the cloudia2 pool. The cloudia2 pool could be accessed over ggated using zogftw on another system and was scrubbed as well:
fk@t520.local /home/fk $sudo zpool status -v cloudia2
  pool: cloudia2
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 3 days 21:03:27 with 6 errors on 2021-12-17 08:14:24
config:

	NAME                  STATE     READ WRITE CKSUM
	cloudia2              ONLINE       0     0     0
	  label/cloudia2.eli  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        cloudia2/dvds/the-good-wife/season-2@2016-06-30_16:33:/THE_GOOD_WIFE_S2D3/VIDEO_TS/VTS_03_4.VOB
Apparently all the errors affected the same file but luckily two copies of the file were available on other pools:
fk@t520 ~ $zogftw lookup the-good-wife/season-2
NAME                                   USED  AVAIL  REFER  MOUNTPOINT
cloudia2/dvds/the-good-wife/season-2  41.3G  99.4G  41.3G  /cloudia2/dvds/the-good-wife/season-2
intenso5/dvds/the-good-wife/season-2  41.3G  2.40T  41.3G  /intenso5/dvds/the-good-wife/season-2
wde5/dvds/the-good-wife/season-2      41.3G  7.00G  41.3G  /wde5/dvds/the-good-wife/season-2
Instead of restoring the file right away I decided to keep the partially-corrupt file around until error correction with zfs receive becomes available in OpenZFS.
While it would be great to know how the data corruption occurred, I was unable to figure it out. As it only affected one disk, I suspect that a firmware issue is more likely than a bug in the ElectroBSD patch set or in FreeBSD itself.
While human error can't be ruled out either, I was the only person with access to the disk and I'm not sure how one would accidentally cause a corruption like this.
Finally, just to show that the implemented geli search command actually works if the meta data is still valid:
[fk@steffen ~]$ uname -a
ElectroBSD steffen 12.3-STABLE ElectroBSD 12.3-STABLE #22 electrobsd-n234792-52515feff497-dirty: Fri Dec 17 12:51:48 UTC 2021     fk@steffen:/usr/obj/usr/src/amd64.amd64/sys/ELECTRO_BEER  amd64
[fk@steffen ~]$ sudo geli search /dev/ada1
Searching for GEOM::ELI metadata on /dev/ada1.
Found GEOM::ELI meta data at offset 4000785927680.
Metadata found on /dev/ada1:
     magic: GEOM::ELI
   version: 7
     flags: 0x0
     ealgo: AES-XTS
    keylen: 128
  provsize: 3985542778880
sectorsize: 512
      keys: 0x01
iterations: 447024
      Salt: 23[...]05
Master Key: 20[...]29
  MD5 hash: 9fdffeb97ca6b34379512a9191cbfaeb
Try making the data attachable with:
gnop create -o 15243149312 -s 3985542778880 /dev/ada1
[fk@steffen ~]$ sudo gnop create -o 15243149312 -s 3985542778880 /dev/ada1
[fk@steffen ~]$ sudo geli attach /dev/ada1.nop
Enter passphrase:
[fk@steffen ~]$ sudo zpool import dpool
[fk@steffen ~]$ sudo zpool status -v dpool
  pool: dpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 1 days 09:00:47 with 66 errors on 2021-12-07 18:40:55
config:

	NAME            STATE     READ WRITE CKSUM
	dpool           ONLINE       0     0     0
	  ada1.nop.eli  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        dpool/ggated/cloudia2:<0x1>
        dpool/ggated/cloudia2@2017-04-20_21:27:<0x1>
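The gnop parameters suggested by geli search can be sanity-checked with a little arithmetic of my own: the metadata sits in the last sector of the original provider, so the provider must have started provsize bytes before the end of that sector.

```python
# Derive the gnop -o value from the geli search output: the provider
# ends right after the metadata sector, and started provsize bytes
# earlier. All numbers are taken from the transcript above.
metadata_offset = 4000785927680  # where geli search found the metadata
sectorsize = 512                 # sectorsize from the dumped metadata
provsize = 3985542778880         # provsize from the dumped metadata
gnop_offset = metadata_offset + sectorsize - provsize
print(gnop_offset)               # 15243149312, the suggested -o value
```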