www.fabiankeil.de/gehacktes/gpt-and-geli-recovery/
Recently the following disk experienced a data corruption event of unknown origin:
[fk@steffen ~]$ diskinfo -v /dev/ada1
/dev/ada1
512 # sectorsize
4000785948160 # mediasize in bytes (3.6T)
7814035055 # mediasize in sectors
4096 # stripesize
0 # stripeoffset
7752018 # Cylinders according to firmware.
16 # Heads according to firmware.
63 # Sectors according to firmware.
ST4000DM005-2DP166 # Disk descr.
ZDH0ZY73 # Disk ident.
No # TRIM/UNMAP support
5980 # Rotation rate in RPM
Not_Zoned # Zone Mode
The disk was working fine while it was being used, then the system was shut down and the disk was detached. When the disk was supposed to be used again a couple of days later, some sectors were corrupted and the ElectroBSD kernel was no longer able to even read the partition table.
2021-12-05T13:52:00.527737+01:00 steffen kernel <2>1 - - - ada1: <ST4000DM005-2DP166 0001> ACS-3 ATA SATA 3.x device
2021-12-05T13:52:00.527757+01:00 steffen kernel <2>1 - - - ada1: Serial Number ZDH0ZY73
2021-12-05T13:52:00.527777+01:00 steffen kernel <2>1 - - - ada1: 150.000MB/s transfers (SATA, UDMA5, PIO 8192bytes)
2021-12-05T13:52:00.527797+01:00 steffen kernel <2>1 - - - ada1: 3815446MB (7814035055 512 byte sectors)
2021-12-05T13:52:00.527826+01:00 steffen kernel <2>1 - - - ada1: quirks=0x1<4K>
2021-12-05T13:52:00.846081+01:00 steffen kernel <2>1 - - - GEOM: ada1: corrupt or invalid GPT detected.
2021-12-05T13:52:00.846156+01:00 steffen kernel <2>1 - - - GEOM: ada1: GPT rejected -- may not be recoverable.
As a result no devices for the individual partitions were created and the data could not be easily accessed.
The disk didn't store particularly important data, but recovering it was a useful exercise in case another disk with more important data behaves similarly in the future.
The disk itself did not report any problems:
fk@t520.local /home/fk $ssh steffen sudo smartctl -a /dev/ada1
smartctl 7.2 2020-12-30 r5155 [ElectroBSD 12.3-STABLE amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate BarraCuda 3.5
Device Model: ST4000DM005-2DP166
Serial Number: ZDH0ZY73
LU WWN Device Id: 5 000c50 0a23b5ea9
Firmware Version: 0001
User Capacity: 4,000,785,948,160 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5980 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Dec 17 13:13:40 2021 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 601) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 675) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x10a5) SCT Status supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 100 064 006 Pre-fail Always - 90559
3 Spin_Up_Time 0x0003 095 094 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 154
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 080 060 045 Pre-fail Always - 93302826
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 552 (209 46 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 154
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 0 1
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 075 053 040 Old_age Always - 25 (Min/Max 16/25)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 12
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 1940
194 Temperature_Celsius 0x0022 025 047 000 Old_age Always - 25 (0 10 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 493h+16m+55.235s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 13556691644
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 49089938488
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 433 -
# 2 Extended offline Interrupted (host reset) 00% 409 -
# 3 Short offline Completed without error 00% 1 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
It was known that the disk had once upon a time been partitioned with cloudiatr.
The disk therefore had five partitions, the last one being the most important as it contained a ZFS data pool which was only accessible through geli. The data pool contained a volume which was accessed through ggated and contained another ZFS pool with another geli layer that was managed with zogftw.
Unfortunately no backup of the partition table was available but there was a backup of the geli meta data for the data pool:
fk@t520 ~ $sudo geli dump ~/.config/zogftw/geli/metadata-backups/gpt_dpool-ada0_baracuda_4tb.eli
Metadata on /home/fk/.config/zogftw/geli/metadata-backups/gpt_dpool-ada0_baracuda_4tb.eli:
magic: GEOM::ELI
version: 7
flags: 0x0
ealgo: AES-XTS
keylen: 128
provsize: 3985543004160
sectorsize: 512
keys: 0x01
iterations: 447024
Salt: 23[...]05
Master Key: 20[...]29
MD5 hash: ba992035efd4fac233b83c17c354c99b
A backup of the geli meta data for the cloudia2 pool was available as well:
fk@t520 ~ $sudo geli dump ~/.config/zogftw/geli/metadata-backups/cloudia2.eli
Metadata on /home/fk/.config/zogftw/geli/metadata-backups/cloudia2.eli:
magic: GEOM::ELI
version: 7
flags: 0x0
ealgo: AES-XTS
keylen: 256
provsize: 3298534882816
sectorsize: 4096
keys: 0x01
iterations: 805315
Salt: 8d[...]25
Master Key: 58[...]a7
MD5 hash: 3fa6840214b3fa32d3e60dbfe76596f1
In theory it should obviously be possible to ignore the partition table completely and simply use gnop with the right parameters to create a provider that contains a valid geli label and has the proper size.
Thus geli was patched
to add a search subcommand whose purpose is to:
Search for metadata on the provider, starting at the end and going backwards until valid metadata is found or the beginning of the provider is reached. This subcommand may be useful if, for example, the GPT partition data got corrupted or deleted while the data on the previously accessible partitions is still expected to be valid.
While the patch worked as advertised in tests, it failed to discover the geli label on the actual disk, presumably because the label was completely gone or corrupted.
Theoretically it should also be possible to simply put the backup label back on the disk, but if the correct position isn't known this would additionally require the use of gnop to make sure the ZFS meta data is where ZFS looks for it.
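For reference, geli stores its metadata in the last sector of the provider it is attached to, so for a gnop-created provider the on-disk position of the label follows directly from the gnop offset and size. A small sketch of that relationship, using the offset and provider size that turn up later in this article (the variable names are mine):

```shell
# geli keeps its metadata in the last sector of its provider.
# For a provider created with
#   gnop create -o <offset> -s <size> /dev/ada1
# the label therefore sits at <offset> + <size> - <sectorsize>
# on the raw disk.
sectorsize=512
offset=15243149312      # byte offset of partition five on ada1
size=3985542778880      # geli provider size in bytes
label_pos=$((offset + size - sectorsize))
echo "expecting GEOM::ELI metadata at disk offset ${label_pos}"
```

The result matches the offset the geli search subcommand reports at the end of this article.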
Instead of doing that, an attempt was made to make the kernel less picky about the state of the partition data.
ElectroBSD already inherited a
kern.geom.part.check_integrity sysctl
from FreeBSD
and a patch was created to extend it.
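For completeness, the sysctl can be toggled like this (stock FreeBSD name; the relaxed behaviour described below obviously requires the patch):

```shell
# Disable GEOM_PART's partition-table integrity checks at runtime
# (requires root). A table that was already rejected at boot may
# need a re-taste, or the tunable can be set in /boot/loader.conf:
#   kern.geom.part.check_integrity=0
sysctl kern.geom.part.check_integrity=0
```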
With the patch and kern.geom.part.check_integrity set to 0 the kernel was able to find some partition data:
2021-12-05T15:49:45.631357+01:00 steffen kernel <2>1 - - - GEOM: hdr_lba_end (7814037127) < hdr->hdr_lba_start (40) or hrdr_lba_end >= last (7814035054) for ada1.
2021-12-05T15:49:45.631376+01:00 steffen kernel <2>1 - - - GEOM: Reading sector 4000787029504 of size 512 from ada1.
2021-12-05T15:49:45.631393+01:00 steffen kernel <2>1 - - - GEOM: ada1: the secondary GPT table is corrupt or invalid.
2021-12-05T15:49:45.631409+01:00 steffen kernel <2>1 - - - GEOM: ada1: using the primary only -- recovery suggested.
2021-12-05T15:49:45.631424+01:00 steffen kernel <2>1 - - - GEOM_PART: integrity check failed (ada1, GPT)
gpart also showed a corrupt partition table, but that's better than no partition table at all:
[fk@steffen ~]$ gpart show ada1
=> 40 7814037088 ada1 GPT (3.6T) [CORRUPT]
40 512 1 freebsd-boot (256K)
552 1496 - free - (748K)
2048 409600 2 freebsd-zfs (200M)
411648 20971520 3 freebsd-zfs (10G)
21383168 8388608 4 freebsd-swap (4.0G)
29771776 7784263680 5 freebsd-zfs (3.6T)
7814035456 1672 - free - (836K)
The first four partitions appeared valid, but unfortunately partition
five could still not be attached with geli, and gpart's recover subcommand
didn't help either.
Deleting partition five and recreating it without specifying a size worked, though:
[fk@steffen ~]$ sudo gpart delete -i 5 /dev/ada1
ada1p5 deleted
[fk@steffen ~]$ sudo gpart add -i 5 -t freebsd-zfs /dev/ada1
ada1p5 added
[fk@steffen ~]$ gpart show ada1
=> 40 7814034982 ada1 GPT (3.6T) [CORRUPT]
40 512 1 freebsd-boot (256K)
552 1496 - free - (748K)
2048 409600 2 freebsd-zfs (200M)
411648 20971520 3 freebsd-zfs (10G)
21383168 8388608 4 freebsd-swap (4.0G)
29771776 7784263240 5 freebsd-zfs (3.6T)
7814035016 6 - free - (3.0K)
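As a cross-check, the gnop parameters that expose the recreated partition five on the raw disk can be derived from the sector numbers in the gpart output above (512-byte sectors, as reported by diskinfo). These are the same values the patched geli search subcommand suggests at the end of this article:

```shell
# Convert partition five's start sector and sector count from the
# gpart listing into the byte offset and size gnop expects.
sectorsize=512
start=29771776       # first sector of ada1p5
count=7784263240     # number of sectors in ada1p5
offset=$((start * sectorsize))
size=$((count * sectorsize))
echo "gnop create -o ${offset} -s ${size} /dev/ada1"
```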
The [CORRUPT]
marker was gone after a reboot.
At first, geli was still not able to read meta data from the fifth partition, but after restoring the backup meta data with the force flag (to ignore the size mismatch) the geli provider could be attached again and the ZFS pool could be imported:
[fk@steffen ~]$ sudo geli restore -f gpt_dpool-ada0_baracuda_4tb.eli /dev/ada1p5
[fk@steffen ~]$ sudo geli dump /dev/ada1p5
Metadata on /dev/ada1p5:
magic: GEOM::ELI
version: 7
flags: 0x0
ealgo: AES-XTS
keylen: 128
provsize: 3985542778880
sectorsize: 512
keys: 0x01
iterations: 447024
Salt: 23[...]05
Master Key: 20[...]29
MD5 hash: 9fdffeb97ca6b34379512a9191cbfaeb
A zpool scrub revealed a few errors, but far fewer than expected:
[fk@steffen ~]$ sudo zpool status -v dpool
pool: dpool
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://illumos.org/msg/ZFS-8000-8A
scan: scrub repaired 0 in 1 days 09:00:47 with 66 errors on 2021-12-07 18:40:55
config:
NAME STATE READ WRITE CKSUM
dpool ONLINE 0 0 0
ada1p5.eli ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
dpool/ggated/cloudia2:<0x1>
dpool/ggated/cloudia2@2017-04-20_21:27:<0x1>
Unfortunately all the errors occurred in the zvol for the cloudia2 pool. The cloudia2 pool could be accessed over ggated using zogftw on another system and was scrubbed as well:
fk@t520.local /home/fk $sudo zpool status -v cloudia2
pool: cloudia2
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://illumos.org/msg/ZFS-8000-8A
scan: scrub repaired 0 in 3 days 21:03:27 with 6 errors on 2021-12-17 08:14:24
config:
NAME STATE READ WRITE CKSUM
cloudia2 ONLINE 0 0 0
label/cloudia2.eli ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
cloudia2/dvds/the-good-wife/season-2@2016-06-30_16:33:/THE_GOOD_WIFE_S2D3/VIDEO_TS/VTS_03_4.VOB
Apparently all the errors affected the same file, but luckily two copies of the file were available on other pools:
fk@t520 ~ $zogftw lookup the-good-wife/season-2
NAME                                  USED  AVAIL  REFER  MOUNTPOINT
cloudia2/dvds/the-good-wife/season-2  41.3G 99.4G  41.3G  /cloudia2/dvds/the-good-wife/season-2
intenso5/dvds/the-good-wife/season-2  41.3G 2.40T  41.3G  /intenso5/dvds/the-good-wife/season-2
wde5/dvds/the-good-wife/season-2      41.3G 7.00G  41.3G  /wde5/dvds/the-good-wife/season-2
Instead of restoring the file right away I decided to keep the partially-corrupt file
around until error correction with zfs receive becomes available
in OpenZFS.
While it would be great to know how the data corruption occurred, I was unable to figure it out. As it only affected one disk, I suspect that a firmware issue is more likely than a bug in the ElectroBSD patch set or in FreeBSD itself.
While human error can't be ruled out either, I was the only person with access to the disk and I'm not sure how one would accidentally cause a corruption like this.
Finally, just to show that the implemented geli search command actually works if the meta data is still valid:
[fk@steffen ~]$ uname -a
ElectroBSD steffen 12.3-STABLE ElectroBSD 12.3-STABLE #22 electrobsd-n234792-52515feff497-dirty: Fri Dec 17 12:51:48 UTC 2021 fk@steffen:/usr/obj/usr/src/amd64.amd64/sys/ELECTRO_BEER amd64
[fk@steffen ~]$ sudo geli search /dev/ada1
Searching for GEOM::ELI metadata on /dev/ada1.
Found GEOM::ELI meta data at offset 4000785927680.
Metadata found on /dev/ada1:
magic: GEOM::ELI
version: 7
flags: 0x0
ealgo: AES-XTS
keylen: 128
provsize: 3985542778880
sectorsize: 512
keys: 0x01
iterations: 447024
Salt: 23[...]05
Master Key: 20[...]29
MD5 hash: 9fdffeb97ca6b34379512a9191cbfaeb
Try making the data attachable with: gnop create -o 15243149312 -s 3985542778880 /dev/ada1
[fk@steffen ~]$ sudo gnop create -o 15243149312 -s 3985542778880 /dev/ada1
[fk@steffen ~]$ sudo geli attach /dev/ada1.nop
Enter passphrase:
[fk@steffen ~]$ sudo zpool import dpool
[fk@steffen ~]$ sudo zpool status -v dpool
pool: dpool
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://illumos.org/msg/ZFS-8000-8A
scan: scrub repaired 0 in 1 days 09:00:47 with 66 errors on 2021-12-07 18:40:55
config:
NAME STATE READ WRITE CKSUM
dpool ONLINE 0 0 0
ada1.nop.eli ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
dpool/ggated/cloudia2:<0x1>
dpool/ggated/cloudia2@2017-04-20_21:27:<0x1>