nighttob posted on 2019-10-22 10:02

76,000 hours, weak sectors: repaired and back in service

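Last edited by nighttob on 2019-10-23 20:25

This may well turn out to be the next case where I give up on treatment.

Recently qBit has occasionally been reporting I/O errors, and I/O latency has been fluctuating quite a bit, so I checked the drive's status.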
./storcli /c0/e62/s4 show all
CLI Version = 007.1017.0000.0000 May 10, 2019
Operating system = VMkernel 6.7.0
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.


Drive /c0/e62/s4 :
================

------------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model              Sp Type
------------------------------------------------------------------------------
62:4     12 JBOD  -  1.819 TB SATA HDD N   N  512B ST2000DL003-9VT166 U  -
------------------------------------------------------------------------------

EID=Enclosure Device ID|Slt=Slot No.|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=Unsupported|UGShld=UnConfigured shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported


Drive /c0/e62/s4 - Detailed Information :
=======================================

Drive /c0/e62/s4 State :
======================
Shield Counter = 0
Media Error Count = 740
Other Error Count = 11
Drive Temperature =31C (87.80 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No


Drive /c0/e62/s4 Device attributes :
==================================
SN = 6YD0KDHW
Manufacturer Id = ATA
Model Number = ST2000DL003-9VT166
NAND Vendor = NA
WWN = 5000C50037240706
Firmware Revision = CC32
Raw size = 1.819 TB
Coerced size = 1.818 TB
Non Coerced size = 1.818 TB
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
NCQ setting = Enabled
Write Cache = N/A
Logical Sector Size = 512B
Physical Sector Size = 512B
Connector Name = Port 4 - 7 x1


Drive /c0/e62/s4 Policies/Settings :
==================================
Enclosure position = 1
Connected Port Number = 1(path0)
Sequence Number = 2
Commissioned Spare = No
Emergency Spare = No
Last Predictive Failure Event Sequence Number = 0
Successful diagnostics completion on = N/A
FDE Type = None
SED Capable = No
SED Enabled = No
Secured = No
Cryptographic Erase Capable = No
Sanitize Support = Not supported
Locked = No
Needs EKM Attention = No
PI Eligible = No
Certified = No
Wide Port Capable = No

Port Information :
================

-----------------------------------------
Port Status Linkspeed SAS address
-----------------------------------------
   0 Active 6.0Gb/s   0x4433221106000000
-----------------------------------------


Inquiry Data =
5a 0c ff 3f 37 c8 10 00 00 00 00 00 3f 00 00 00
00 00 00 00 20 20 20 20 20 20 20 20 20 20 20 20
59 36 30 44 44 4b 57 48 00 00 00 00 04 00 43 43
32 33 20 20 20 20 54 53 30 32 30 30 4c 44 30 30
2d 33 56 39 31 54 36 36 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 80
00 40 00 2f 00 40 00 02 00 02 07 00 ff 3f 10 00
3f 00 10 fc fb 00 10 00 ff ff ff 0f 00 00 07 00
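
A side note on the dump above: the bytes look like ATA IDENTIFY DEVICE data, where ASCII strings are stored with the two bytes of every 16-bit word swapped. That reading is an assumption (storcli only labels the block "Inquiry Data"), but decoding the first 96 bytes that way reproduces the SN, firmware revision and model number listed under "Device attributes". A minimal sketch; the helper name `ata_string` is made up for illustration:

```python
# Decode byte-swapped ASCII strings from the "Inquiry Data" dump.
# Assumption: the dump is ATA IDENTIFY DEVICE data (words 10-19 serial,
# 23-26 firmware revision, 27-46 model number).

# First 96 bytes of the dump, copied verbatim.
INQUIRY_HEX = """
5a 0c ff 3f 37 c8 10 00 00 00 00 00 3f 00 00 00
00 00 00 00 20 20 20 20 20 20 20 20 20 20 20 20
59 36 30 44 44 4b 57 48 00 00 00 00 04 00 43 43
32 33 20 20 20 20 54 53 30 32 30 30 4c 44 30 30
2d 33 56 39 31 54 36 36 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 80
"""


def ata_string(data: bytes, first_word: int, last_word: int) -> str:
    """Return the ASCII string stored in the given inclusive word range."""
    chunk = data[first_word * 2:(last_word + 1) * 2]
    swapped = bytearray()
    for i in range(0, len(chunk), 2):
        # Each 16-bit word holds two characters in swapped order.
        swapped += bytes((chunk[i + 1], chunk[i]))
    return swapped.decode("ascii").strip()


data = bytes.fromhex(INQUIRY_HEX)  # fromhex() skips whitespace (Python 3.7+)
serial = ata_string(data, 10, 19)
firmware = ata_string(data, 23, 26)
model = ata_string(data, 27, 46)

print("SN       =", serial)
print("Firmware =", firmware)
print("Model    =", model)
```
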

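If this were a machine running production workloads, seeing Media Errors would settle it: replace the drive, no further thought needed.
But since it's my own machine, I can analyze it a little.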
./storcli /c0/e62/s4 show smart
CLI Version = 007.1017.0000.0000 May 10, 2019
Operating system = VMkernel 6.7.0
Controller = 0
Status = Success
Description = Show Drive Smart Info Succeeded.

Smart Data Info /c0/e62/s4 =
0a 00 01 0f 00 76 57 20 47 22 0b 00 00 00 03 03
00 54 54 00 00 00 00 00 00 00 04 32 00 63 63 d0
07 00 00 00 00 00 05 33 00 64 64 00 00 00 00 00
00 00 07 0f 00 55 3c 04 86 5b 16 00 00 00 09 32
00 0d 0d fb 2b 01 00 00 00 00 0a 13 00 64 64 00
00 00 00 00 00 00 0c 32 00 64 64 79 03 00 00 00
00 00 b7 32 00 64 64 00 00 00 00 00 00 00 b8 32
00 64 64 00 00 00 00 00 00 00 bb 32 00 01 01 dc
05 00 00 00 00 00 bc 32 00 64 01 69 01 03 00 03
00 00 bd 3a 00 5e 5e 06 00 00 00 00 00 00 be 22
00 44 2d 20 00 1d 32 00 00 00 bf 32 00 64 64 00
00 00 00 00 00 00 c0 32 00 64 64 95 03 00 00 00
00 00 c1 32 00 63 63 d9 07 00 00 00 00 00 c2 22
00 20 37 20 00 00 00 11 00 00 c3 1a 00 23 03 20
47 22 0b 00 00 00 c5 12 00 63 63 88 00 00 00 00
00 00 c6 10 00 63 63 88 00 00 00 00 00 00 c7 3e
00 c8 c8 01 00 00 00 00 00 00 f0 00 00 64 fd 2b
ea 00 00 ba 39 2c f1 00 00 64 fd 2f 87 af c1 00
00 00 f2 00 00 64 fd 5a 2f 4f f0 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 82 00 64 02 00 7b
03 00 01 00 01 ff 02 56 01 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 08 0f 08 08 08 1d 1e
1d 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00
00 00 00 00 00 00 00 00 8a 50 e7 6e 7a fb 00 00
00 00 00 00 01 00 95 ff 2f 87 af c1 84 48 01 00
5a 2f 4f f0 c8 bf 1e 00 00 00 00 00 2e bd 30 0c
00 00 00 04 00 00 00 00 20 12 00 00 51 00 0b 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 13
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c5
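From the SMART data (raw values):
05h (Reallocated Sector Count) = 0
09h (Power-On Hours) = 0x12BFB = 76795
B8h (End-to-End Error) = 0
BBh (Reported Uncorrectable Errors) = 0x5DC = 1500
C5h (Current Pending Sector Count) = 0x88 = 136
C6h (Offline Uncorrectable) = 0x88 = 136
So: 76,795 power-on hours, 1500 UNC errors, E2E is 0, 05h is also 0, and there are pending sectors.
It looks like there is a chance this can be "repaired".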
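
For anyone who wants to reproduce the readings above without counting bytes by hand, here is a rough sketch that parses the raw dump. It assumes the usual ATA SMART attribute layout: a 2-byte revision, then 12-byte entries of ID, 2-byte flags, value, worst, 6 raw bytes and 1 reserved byte. Reading the 6 raw bytes as one little-endian integer is a simplification; some vendors pack several fields into them. The names `Attr` and `parse_smart` are illustrative only.

```python
# Parse the SMART attribute table from the `storcli ... show smart` hex dump.
from typing import Dict, NamedTuple

# First 256 bytes of the dump above, copied verbatim (covers 01h through C6h).
SMART_HEX = """
0a 00 01 0f 00 76 57 20 47 22 0b 00 00 00 03 03
00 54 54 00 00 00 00 00 00 00 04 32 00 63 63 d0
07 00 00 00 00 00 05 33 00 64 64 00 00 00 00 00
00 00 07 0f 00 55 3c 04 86 5b 16 00 00 00 09 32
00 0d 0d fb 2b 01 00 00 00 00 0a 13 00 64 64 00
00 00 00 00 00 00 0c 32 00 64 64 79 03 00 00 00
00 00 b7 32 00 64 64 00 00 00 00 00 00 00 b8 32
00 64 64 00 00 00 00 00 00 00 bb 32 00 01 01 dc
05 00 00 00 00 00 bc 32 00 64 01 69 01 03 00 03
00 00 bd 3a 00 5e 5e 06 00 00 00 00 00 00 be 22
00 44 2d 20 00 1d 32 00 00 00 bf 32 00 64 64 00
00 00 00 00 00 00 c0 32 00 64 64 95 03 00 00 00
00 00 c1 32 00 63 63 d9 07 00 00 00 00 00 c2 22
00 20 37 20 00 00 00 11 00 00 c3 1a 00 23 03 20
47 22 0b 00 00 00 c5 12 00 63 63 88 00 00 00 00
00 00 c6 10 00 63 63 88 00 00 00 00 00 00 c7 3e
"""


class Attr(NamedTuple):
    value: int  # normalized current value
    worst: int  # normalized worst value
    raw: int    # 6 raw bytes as one little-endian integer (simplification)


def parse_smart(hex_text: str) -> Dict[int, Attr]:
    data = bytes.fromhex(hex_text)  # fromhex() skips whitespace (Python 3.7+)
    attrs: Dict[int, Attr] = {}
    offset = 2  # skip the 2-byte revision field
    while offset + 12 <= len(data):  # ignore a truncated trailing entry
        entry = data[offset:offset + 12]
        if entry[0] != 0:  # ID 0 marks an unused slot
            raw = int.from_bytes(entry[5:11], "little")
            attrs[entry[0]] = Attr(entry[3], entry[4], raw)
        offset += 12
    return attrs


attrs = parse_smart(SMART_HEX)
for attr_id in (0x05, 0x09, 0xB8, 0xBB, 0xC5, 0xC6):
    print(f"{attr_id:02X}h raw = {attrs[attr_id].raw}")
```
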

Tonight when I get home I'll first check the RAID log to see whether there is an ASC code, then pull the drive and do a full erase.

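nighttob posted on 2019-10-22 22:17

The RAID card log shows the problem started on October 19, reporting sense data 3/11/0 (sense key/ASC/ASCQ), i.e. UNRECOVERED READ ERROR.
Looks like I caught this fairly early.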
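
As a reading aid for the 3/11/0 triple, here is a small lookup sketch. It assumes the log prints the three fields as hexadecimal in sense key/ASC/ASCQ order, and it only contains a handful of well-known SCSI codes, not the full table. `decode_sense` is a made-up helper name.

```python
# Decode a "key/asc/ascq" triple as printed in the controller log.
# Assumption: all three fields are hexadecimal.
from typing import Tuple

SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
}

# (ASC, ASCQ) -> description; deliberately partial.
ASC_ASCQ = {
    (0x0C, 0x00): "WRITE ERROR",
    (0x11, 0x00): "UNRECOVERED READ ERROR",
    (0x11, 0x01): "READ RETRIES EXHAUSTED",
}


def decode_sense(triple: str) -> Tuple[str, str]:
    key, asc, ascq = (int(part, 16) for part in triple.split("/"))
    return (
        SENSE_KEYS.get(key, f"unknown sense key {key:X}h"),
        ASC_ASCQ.get((asc, ascq), f"unknown ASC/ASCQ {asc:02X}h/{ascq:02X}h"),
    )


result = decode_sense("3/11/0")
print(result)
```
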

A full-surface scan with HD Sentinel is running now; it's almost at 50% with nothing abnormal found so far.

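nighttob posted on 2019-10-23 06:59

The results show the problem sits around 1698.6 GB, which is quite far toward the end of the drive.
After the read test finished, 05h, C5h and C6h were all unchanged; only BBh increased. Next I'm running the reinitialize test.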


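nighttob posted on 2019-10-23 20:24

After the scan finished the drive counts as "revived", though it is left with 1737 UNC errors on record.
Plugged it back into the server and carrying on using it.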



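tulei posted on 2019-10-23 22:17

nighttob posted on 2019-10-23 06:59
The results show the problem sits around 1698.6 GB, which is quite far toward the end of the drive.
After the read test finished, 05h, C5h and C6h were all unchanged; only BBh ...

May I ask, does Hard Disk Sentinel also have a disk repair function?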

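nighttob posted on 2019-10-23 22:40

tulei posted on 2019-10-23 22:17
May I ask, does Hard Disk Sentinel also have a disk repair function?

Yes.
I've already repaired several drives with bad sectors of this severity.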

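eterfinity posted on 2019-10-24 13:51

Great post. That power-on time is seriously tough.
I've decided to pull out the 16-bay Taiwanese-brand storage box gathering dust in the storeroom and run every drive in it through this process. Conveniently, they're all drives from 12 years ago too.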

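tulei posted on 2019-10-24 16:31

nighttob posted on 2019-10-23 22:40
Yes.
I've already repaired several drives with bad sectors of this severity.

Could you roughly describe the process of repairing a disk with Hard Disk Sentinel? Thanks in advance.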

nighttob posted on 2019-10-24 17:02

tulei posted on 2019-10-24 16:31
Could you roughly describe the process of repairing a disk with Hard Disk Sentinel? Thanks in advance.

http://bbs.pceva.com.cn/thread-56702-1-1.html

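tiancai2nd posted on 2019-10-25 23:09

Does Sentinel's "bad sector" repair work on the same principle as DiskGenius repairing sectors whose response time exceeds XX milliseconds? What are the pros and cons of each?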

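nighttob posted on 2019-10-25 23:40

tiancai2nd posted on 2019-10-25 23:09
Does Sentinel's "bad sector" repair work on the same principle as DiskGenius repairing sectors whose response time exceeds XX milliseconds? What are the pros and cons of each? ...

No idea.
I only use DG for partition imaging and cloning partitions.
I don't even use its recovery function, because I back up every day.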