Errors in messages. Is one of the disks failing?

garrotte

Greetings, everyone.
A FreeBSD 8.2 amd64 machine has been sitting in the data center for about a year. Configuration:
2x Xeon E5630 2.53GHz
48GB RAM 1333 ECC Registered
LSI SAS controller
3 SAS 146GB 10000 rpm
1 SAS 600GB 10000 rpm
2 SAS 73GB 15000 rpm
2 SATA 1TB 5400 rpm
About a week ago the following started pouring into the logs:

Code:

Nov  2 01:21:13 srv kernel: mpt0: request 0xffffff8000b529e0:45460 timed out for ccb 0xffffff0220267000 (req->ccb 0xffffff0220267000)
Nov  2 01:21:13 srv kernel: mpt0: attempting to abort req 0xffffff8000b529e0:45460 function 0
Nov  2 01:21:14 srv kernel: mpt0: completing timedout/aborted req 0xffffff8000b529e0:45460
Nov  2 01:21:14 srv kernel: mpt0: abort of req 0xffffff8000b529e0:0 completed
Nov  2 01:21:43 srv kernel: (da7:mpt0:0:20:0): READ(10). CDB: 28 0 28 ac de df 0 1 0 0.
Nov  2 01:21:43 srv kernel: (da7:mpt0:0:20:0): CAM status: SCSI Status Error
Nov  2 01:21:43 srv kernel: (da7:mpt0:0:20:0): SCSI status: Check Condition
Nov  2 01:21:43 srv kernel: (da7:mpt0:0:20:0): SCSI sense: ABORTED COMMAND info:28acdf8e asc:11,1 (Read retries exhausted) sks:0x80,0
Nov  2 01:21:43 srv kernel: g_vfs_done():da7s1d[READ(offset=349397925888, length=131072)]error = 5
Nov  2 01:21:52 srv kernel: (da7:mpt0:0:20:0): READ(10). CDB: 28 0 28 ac db 7f 0 0 e0 0.
Nov  2 01:21:52 srv kernel: (da7:mpt0:0:20:0): CAM status: SCSI Status Error
Nov  2 01:21:52 srv kernel: (da7:mpt0:0:20:0): SCSI status: Check Condition
Nov  2 01:21:52 srv kernel: (da7:mpt0:0:20:0): SCSI sense: MEDIUM ERROR info:28acdba0 asc:11,1 (Read retries exhausted) actual retry count: 63
Nov  2 01:21:52 srv kernel: g_vfs_done():da7s1d[READ(offset=349397483520, length=114688)]error = 5
Nov  2 02:27:20 srv kernel: (da7:mpt0:0:20:0): READ(10). CDB: 28 0 28 b0 18 1f 0 1 0 0.
Nov  2 02:27:20 srv kernel: (da7:mpt0:0:20:0): CAM status: SCSI Status Error
Nov  2 02:27:20 srv kernel: (da7:mpt0:0:20:0): SCSI status: Check Condition
Nov  2 02:27:20 srv kernel: (da7:mpt0:0:20:0): SCSI sense: ABORTED COMMAND info:28b01914 asc:11,1 (Read retries exhausted) sks:0x80,0x4
Nov  2 10:15:22 srv kernel: (da7:mpt0:0:20:0): READ(10). CDB: 28 0 28 ac 72 7f 0 1 0 0.
Nov  2 10:15:22 srv kernel: (da7:mpt0:0:20:0): CAM status: SCSI Status Error
Nov  2 10:15:22 srv kernel: (da7:mpt0:0:20:0): SCSI status: Check Condition
Nov  2 10:15:22 srv kernel: (da7:mpt0:0:20:0): SCSI sense: ABORTED COMMAND info:28ac72e9 asc:11,1 (Read retries exhausted) sks:0x80,0
Nov  2 10:16:01 srv kernel: mpt0: request 0xffffff8000b58c50:9650 timed out for ccb 0xffffff0014259800 (req->ccb 0xffffff0014259800)
Nov  2 10:16:01 srv kernel: mpt0: attempting to abort req 0xffffff8000b58c50:9650 function 0
Nov  2 10:16:02 srv kernel: mpt0: completing timedout/aborted req 0xffffff8000b58c50:9650
Nov  2 10:16:02 srv kernel: mpt0: abort of req 0xffffff8000b58c50:0 completed
Nov  2 10:16:22 srv kernel: mpt0: request 0xffffff8000b50160:10102 timed out for ccb 0xffffff008dddd800 (req->ccb 0xffffff008dddd800)
Nov  2 10:16:22 srv kernel: mpt0: attempting to abort req 0xffffff8000b50160:10102 function 0
Nov  2 10:16:22 srv kernel: mpt0: request 0xffffff8000b48fc0:10103 timed out for ccb 0xffffff022026a800 (req->ccb 0xffffff022026a800)
Nov  2 10:16:22 srv kernel: mpt0: request 0xffffff8000b55ad0:10104 timed out for ccb 0xffffff00142b0000 (req->ccb 0xffffff00142b0000)
Nov  2 10:16:22 srv kernel: mpt0: request 0xffffff8000b4d2b0:10105 timed out for ccb 0xffffff00142b1000 (req->ccb 0xffffff00142b1000)
Nov  2 10:16:22 srv kernel: mpt0: request 0xffffff8000b48900:10106 timed out for ccb 0xffffff0220267000 (req->ccb 0xffffff0220267000)
Nov  2 10:16:22 srv kernel: mpt0: completing timedout/aborted req 0xffffff8000b48fc0:10103
Nov  2 10:16:22 srv kernel: mpt0: completing timedout/aborted req 0xffffff8000b55ad0:10104
Nov  2 10:16:22 srv kernel: mpt0: completing timedout/aborted req 0xffffff8000b4d2b0:10105
Nov  2 10:16:22 srv kernel: mpt0: completing timedout/aborted req 0xffffff8000b48900:10106
Nov  2 10:16:22 srv kernel: mpt0: completing timedout/aborted req 0xffffff8000b50160:10102
Nov  2 10:16:22 srv kernel: mpt0: abort of req 0xffffff8000b50160:0 completed
Nov  2 10:17:17 srv kernel: (da7:mpt0:0:20:0): READ(10). CDB: 28 0 28 ac 72 7f 0 1 0 0.
Nov  2 10:17:17 srv kernel: (da7:mpt0:0:20:0): CAM status: SCSI Status Error
Nov  2 10:17:17 srv kernel: (da7:mpt0:0:20:0): SCSI status: Check Condition
Nov  2 10:17:17 srv kernel: (da7:mpt0:0:20:0): SCSI sense: ABORTED COMMAND info:28ac7315 asc:11,1 (Read retries exhausted) sks:0x80,0
Nov  2 10:17:54 srv kernel: (da7:mpt0:0:20:0): READ(10). CDB: 28 0 28 ac 72 7f 0 1 0 0.
Nov  2 10:17:54 srv kernel: (da7:mpt0:0:20:0): CAM status: SCSI Status Error
Nov  2 10:17:54 srv kernel: (da7:mpt0:0:20:0): SCSI status: Check Condition
Nov  2 10:17:54 srv kernel: (da7:mpt0:0:20:0): SCSI sense: ABORTED COMMAND info:28ac7377 asc:11,1 (Read retries exhausted) sks:0x80,0x4

da7 is one of the hard drives:

Code:

da7 at mpt0 bus 0 scbus0 target 20 lun 0
da7: <TOSHIBA MBF260LRC 6102> Fixed Direct Access SCSI-5 device
da7: 300.000MB/s transfers
da7: Command Queueing enabled
da7: 572325MB (1172123568 512 byte sectors: 255H 63S/T 72961C)

The load on it is considerable: it is dedicated exclusively to images, there are a great many files, and the sites hosted on the server get heavy traffic.
So does all of this unambiguously mean the disk is on its way out?
If so, what are the options: replace the disk to be safe, or try to cure it somehow?
Thanks in advance.
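
The READ(10) lines in the log above already say which sectors are failing. As a rough sketch (the function name `decode_read10` and the 512-byte block size default are illustrative assumptions; the block size matches what the da7 probe output reports), the CDB bytes can be decoded like this: byte 0 is the opcode 0x28, bytes 2 to 5 hold the starting LBA, and bytes 7 to 8 hold the transfer length in blocks.

```python
def decode_read10(cdb_hex: str, block_size: int = 512) -> dict:
    """Decode a READ(10) CDB as printed by the kernel,
    e.g. '28 0 28 ac de df 0 1 0 0.' (trailing dot is tolerated)."""
    cdb = [int(tok, 16) for tok in cdb_hex.rstrip(".").split()]
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(bytes(cdb[2:6]), "big")     # bytes 2..5: starting LBA
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")  # bytes 7..8: transfer length
    return {
        "lba": lba,
        "blocks": blocks,
        "disk_offset": lba * block_size,  # byte offset from the start of the disk
        "length": blocks * block_size,    # request size in bytes
    }


# The two requests from the log that are paired with g_vfs_done() lines.
first = decode_read10("28 0 28 ac de df 0 1 0 0.")
second = decode_read10("28 0 28 ac db 7f 0 0 e0 0.")

# g_vfs_done() reports offsets relative to da7s1d, not to the whole disk.
gap_first = first["disk_offset"] - 349397925888
gap_second = second["disk_offset"] - 349397483520
print(first, gap_first // 512)
print(second, gap_second // 512)
```

The decoded lengths (131072 and 114688 bytes) match the lengths in the g_vfs_done() lines, and in both cases the whole-disk offset is 63 sectors larger than the partition-relative one. That would be consistent with da7s1d beginning 63 sectors into the disk, though the thread itself does not show the partition layout.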

Alex Keda

Check SMART on the disks.

garrotte

Code:

Vendor:               TOSHIBA
Product:              MBF260LRC
Revision:             6102
User Capacity:        600,127,266,816 bytes [600 GB]
Logical block size:   512 bytes
Logical Unit id:      0x50000392a80bd1b0
Serial number:        EA80PA7004BM
Device type:          disk
Transport protocol:   SAS
Local Time is:        Wed Nov  2 12:36:40 2011 MSK
Device supports SMART and is Disabled
Temperature Warning Disabled or Not Supported
SMART Health Status: HARDWARE IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=12]

Current Drive Temperature:     25 C
Drive Trip Temperature:        65 C
Manufactured in week 29 of year 2010
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  39
Specified load-unload count over device lifetime:  200000
Accumulated load-unload cycles:  0
Elements in grown defect list: 4090

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0    32561        26         0          0     193115.412          26
write:         0        0         0         0          0       4740.417           0

Non-medium error count:       69

GhOsT_MZ

That is general information; where is the list of SMART attributes?

garrotte

I don't get why it says
Device supports SMART and is Disabled

GhOsT_MZ

Well, SMART can be turned off in the BIOS, so that is worth checking.

garrotte

GhOsT_MZ, it's unlikely to be the BIOS, since SMART is enabled for the other disks.
I seem to have just enabled it with

Code:

smartctl -s on /dev/da7

but, as I understand it, it now has to collect the data first

GhOsT_MZ

Now you need to retrieve it :) You can view it with smartctl:
smartctl -A /dev/da7

garrotte

smartctl -A /dev/da7 prints only this:

Code:

Current Drive Temperature:     25 C
Drive Trip Temperature:        65 C
Manufactured in week 29 of year 2010
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  39
Specified load-unload count over device lifetime:  200000
Accumulated load-unload cycles:  0
Elements in grown defect list: 4344

lap » 2011-11-02 14:21:36

Try a lowercase "a". And have a look at smartctl -h in general.

-a will print something like this:

Code:

[root@nfs log]# smartctl -a /dev/sdb
smartctl 5.41 2011-06-09 r3365 [i686-linux-2.6.40.6-0.fc15.i686.PAE] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F1 RE
Device Model:     SAMSUNG HE103UJ
Serial Number:    S13VJ1DQ514651
LU WWN Device Id: 5 0000f0 009156415
Firmware Version: 1AA01112
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Wed Nov  2 15:20:04 2011 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 114) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (12159) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 203) minutes.
Conveyance self-test routine
recommended polling time:        (  22) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   098   051    Pre-fail  Always       -       17
  3 Spin_Up_Time            0x0007   076   076   011    Pre-fail  Always       -       7890
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       10060
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       27896
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       25
 13 Read_Soft_Error_Rate    0x000e   100   098   000    Old_age   Always       -       20
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   099    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       595
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   050   000    Old_age   Always       -       28 (Min/Max 16/28)
194 Temperature_Celsius     0x0022   070   047   000    Old_age   Always       -       30 (Min/Max 16/32)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       4042
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       4
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 37 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 37 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 83 2c e1 ea  Error: UNC at LBA = 0x0ae12c83 = 182529155

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 60 7f 2c e1 ea 08      21:39:56.350  READ DMA
  c8 00 04 1f 8f 0d e9 08      21:39:56.350  READ DMA
  c8 00 a0 df 2b e1 ea 08      21:39:56.350  READ DMA
  c8 00 80 5f 2b e1 ea 08      21:39:56.350  READ DMA
  c8 00 40 1f 2b e1 ea 08      21:39:56.350  READ DMA

Error 36 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 83 2c e1 ea  Error: UNC at LBA = 0x0ae12c83 = 182529155

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 80 7f 2c e1 ea 08   3d+21:00:16.270  READ DMA
  c8 00 20 1f 8f 0d e9 08   3d+21:00:16.260  READ DMA
  c8 00 60 1f 2c e1 ea 08   3d+21:00:16.250  READ DMA
  c8 00 80 1f 88 49 ef 08   3d+21:00:13.900  READ DMA
  c8 00 80 9f 87 49 ef 08   3d+21:00:13.900  READ DMA

Error 35 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 83 2c e1 ea  Error: UNC at LBA = 0x0ae12c83 = 182529155

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 80 7f 2c e1 ea 08  12d+09:30:44.110  READ DMA
  c8 00 20 1f 8f 0d e9 08  12d+09:30:44.110  READ DMA
  c8 00 60 1f 2c e1 ea 08  12d+09:30:44.110  READ DMA
  c8 00 20 ff 8e 0d e9 08  12d+09:30:44.110  READ DMA
  c8 00 20 df 2a e1 ea 08  12d+09:30:44.110  READ DMA

Error 34 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 83 2c e1 ea  Error: UNC at LBA = 0x0ae12c83 = 182529155

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 80 7f 2c e1 ea 08  17d+06:44:50.726  READ DMA
  c8 00 20 1f 8f 0d e9 08  17d+06:44:50.726  READ DMA
  c8 00 60 1f 2c e1 ea 08  17d+06:44:50.726  READ DMA
  c8 00 80 9f 2b e1 ea 08  17d+06:44:50.716  READ DMA
  c8 00 80 1f 2b e1 ea 08  17d+06:44:50.716  READ DMA

Error 33 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 83 2c e1 ea  Error: UNC at LBA = 0x0ae12c83 = 182529155

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 80 7f 2c e1 ea 08  12d+21:56:01.352  READ DMA
  c8 00 38 47 2c e1 ea 08  12d+21:56:01.332  READ DMA
  c8 00 80 c7 2b e1 ea 08  12d+21:56:01.312  READ DMA
  c8 00 80 47 2b e1 ea 08  12d+21:56:01.282  READ DMA
  c8 00 48 ff 2a e1 ea 08  12d+21:56:01.262  READ DMA

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       20%     27670         182529155
# 2  Short offline       Aborted by host               20%     27646         -
# 3  Extended offline    Aborted by host               90%     27623         -
# 4  Short offline       Aborted by host               20%     27622         -
# 5  Short offline       Aborted by host               20%     27598         -
# 6  Short offline       Aborted by host               20%     27574         -
# 7  Short offline       Aborted by host               20%     27550         -
# 8  Short offline       Aborted by host               20%     27526         -
# 9  Short offline       Aborted by host               20%     27502         -
#10  Short offline       Aborted by host               20%     27478         -
#11  Extended offline    Aborted by host               90%     27455         -
#12  Short offline       Aborted by host               20%     27454         -
#13  Short offline       Aborted by host               20%     27430         -
#14  Short offline       Aborted by host               20%     27406         -
#15  Short offline       Aborted by host               20%     27382         -
#16  Short offline       Aborted by host               20%     27190         -
#17  Short offline       Aborted by host               20%     27166         -
#18  Short offline       Aborted by host               20%     27142         -
#19  Extended offline    Aborted by host               90%     27119         -
#20  Short offline       Aborted by host               20%     27118         -
#21  Short offline       Aborted by host               20%     27094         -

Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

You can also try -x.

garrotte

Already re-read the man page :)
smartctl -a prints more, but still not everything:

Code:

Vendor:               TOSHIBA
Product:              MBF260LRC
Revision:             6102
User Capacity:        600,127,266,816 bytes [600 GB]
Logical block size:   512 bytes
Logical Unit id:      0x50000392a80bd1b0
Serial number:        EA80PA7004BM
Device type:          disk
Transport protocol:   SAS
Local Time is:        Wed Nov  2 15:24:36 2011 MSK
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: HARDWARE IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=12]

Current Drive Temperature:     25 C
Drive Trip Temperature:        65 C
Manufactured in week 29 of year 2010
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  39
Specified load-unload count over device lifetime:  200000
Accumulated load-unload cycles:  0
Elements in grown defect list: 4436

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0    36730        41         0          0     193136.818          41
write:         0        0         0         0          0       4746.971           0

Non-medium error count:       75
No self-tests have been logged
Long (extended) Self Test duration: 5358 seconds [89.3 minutes]

but the line
SMART Health Status: HARDWARE IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=12]
is probably there for a reason :)
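
For SAS/SCSI drives smartctl prints no ATA-style attribute table, so the fields worth watching are the health status line, the grown defect list, and the uncorrected read errors. In this thread the grown defect list went from 4090 at 12:36 to 4436 at 15:24, and uncorrected read errors from 26 to 41. A minimal sketch for pulling those fields out of saved `smartctl -a` text follows; the function name `summarize_scsi_smart` and the rule "anything other than OK counts as failing" are assumptions made for illustration, not part of smartmontools.

```python
import re


def summarize_scsi_smart(text: str) -> dict:
    """Extract a few key fields from SCSI-style `smartctl -a` output."""
    health = re.search(r"^SMART Health Status:\s*(.+)$", text, re.M)
    defects = re.search(r"^Elements in grown defect list:\s*(\d+)", text, re.M)
    read_row = re.search(r"^read:\s+(.*)$", text, re.M)

    status = health.group(1).strip() if health else None
    # The last column of the "read:" row is "Total uncorrected errors".
    uncorrected = int(read_row.group(1).split()[-1]) if read_row else None
    return {
        "health": status,
        "grown_defects": int(defects.group(1)) if defects else None,
        "read_uncorrected": uncorrected,
        # Assumption: a healthy SCSI drive reports exactly "OK" here.
        "failing": status is not None and status != "OK",
    }


# Lines copied from the output posted above.
sample = """\
SMART Health Status: HARDWARE IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=12]
Elements in grown defect list: 4436
read:          0    36730        41         0          0     193136.818          41
"""

report = summarize_scsi_smart(sample)
growth = report["grown_defects"] - 4090  # versus the first reading in the thread
print(report, growth)
```

Comparing successive readings is the useful part: several hundred new grown defects within a few hours, as seen here, is a much stronger signal than any single snapshot.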

garrotte

smartctl -x
printed the same header as smartctl -a, plus a long log like this:

Code:

Background scan results log
  Status: halted due to fatal error
    Accumulated power on time, hours:minutes 6526:26 [391586 minutes]
    Number of background scans performed: 271,  scan progress: 63.27%
    Number of background medium scans performed: 0

   #  when        lba(hex)    [sk,asc,ascq]    reassign_status
   1 6365:17  00000000256b4324  [1,18,7]   Recovered via rewrite in-place
   2 6365:17  00000000256b432d  [1,18,7]   Recovered via rewrite in-place
   3 6365:17  00000000257a406c  [1,18,7]   Recovered via rewrite in-place
   4 6365:18  00000000257e30e1  [1,18,7]   Recovered via rewrite in-place
   5 6365:19  00000000259af7e3  [1,18,7]   Recovered via rewrite in-place
   6 6365:19  00000000259af82f  [1,18,7]   Recovered via rewrite in-place
   7 6365:19  00000000259b38bd  [1,18,7]   Recovered via rewrite in-place
   8 6365:19  00000000259b38f9  [1,18,7]   Recovered via rewrite in-place
   9 6365:20  00000000259e7161  [1,18,7]   Recovered via rewrite in-place
  10 6365:20  00000000259e8a47  [1,18,7]   Recovered via rewrite in-place
  11 6365:20  00000000259e8a27  [1,18,7]   Recovered via rewrite in-place
  12 6365:20  00000000259e8a33  [1,18,7]   Recovered via rewrite in-place
  13 6365:20  00000000259e8a38  [1,18,7]   Recovered via rewrite in-place
  14 6365:20  00000000259e8a46  [1,18,7]   Recovered via rewrite in-place
  15 6365:20  00000000259eb938  [1,18,7]   Recovered via rewrite in-place
  16 6365:21  0000000025a96207  [1,18,7]   Recovered via rewrite in-place
  17 6365:21  0000000025a98e82  [1,18,7]   Recovered via rewrite in-place
  18 6365:22  0000000025ad3d12  [1,18,7]   Recovered via rewrite in-place
  19 6365:22  0000000025adfafd  [1,18,7]   Recovered via rewrite in-place
  20 6365:26  0000000025f96bfb  [1,18,7]   Recovered via rewrite in-place
  21 6365:27  00000000261c1e22  [1,18,7]   Recovered via rewrite in-place
  22 6365:27  00000000261c1e24  [1,18,7]   Recovered via rewrite in-place
  23 6365:27  00000000261c1e26  [1,18,7]   Recovered via rewrite in-place
  24 6365:27  00000000261c1e45  [1,18,7]   Recovered via rewrite in-place
  25 6365:27  00000000261c1e5e  [1,18,7]   Recovered via rewrite in-place
  26 6365:27  00000000261c1e66  [1,18,7]   Recovered via rewrite in-place
  27 6365:27  00000000261c225b  [1,18,7]   Recovered via rewrite in-place
  28 6365:27  00000000261c2eb9  [1,18,7]   Recovered via rewrite in-place
  29 6365:27  00000000261c32f3  [1,18,7]   Recovered via rewrite in-place
  30 6365:27  00000000261c2e75  [1,18,7]   Recovered via rewrite in-place
  31 6365:28  00000000261c4a61  [1,18,7]   Recovered via rewrite in-place
  32 6365:28  00000000261c4a7a  [1,18,7]   Recovered via rewrite in-place
  33 6365:28  00000000261c70db  [1,18,7]   Recovered via rewrite in-place
  34 6365:28  00000000261c9658  [1,18,7]   Recovered via rewrite in-place
  35 6365:28  00000000261c9991  [1,18,7]   Recovered via rewrite in-place
  36 6365:29  00000000261cb5bd  [1,18,7]   Recovered via rewrite in-place
  37 6365:34  0000000026a5fa77  [1,18,7]   Recovered via rewrite in-place
  38 6365:34  0000000026a5f92d  [1,18,7]   Recovered via rewrite in-place
  39 6365:34  0000000026a61731  [1,18,7]   Recovered via rewrite in-place
  40 6365:34  0000000026a61734  [1,18,7]   Recovered via rewrite in-place
  41 6365:35  0000000026a652a1  [1,17,6]   Successfully reassigned
.....................
and so on.
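
With a log this long it helps to tally the entries rather than read them one by one. The sketch below is illustrative only: the regular expression is fitted to the column layout shown in the excerpt above, and the names `ENTRY` and `tally_background_scan` are made up for this example.

```python
import re
from collections import Counter

# One entry: index, power-on time (hours:minutes), 16-digit hex LBA,
# [sense key, asc, ascq], then the free-text reassign status.
ENTRY = re.compile(
    r"^\s*\d+\s+\d+:\d+\s+([0-9a-f]{16})\s+"
    r"\[([0-9a-f]+),([0-9a-f]+),([0-9a-f]+)\]\s+(.+?)\s*$"
)


def tally_background_scan(log_text: str) -> Counter:
    """Count background scan entries by their reassign status."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ENTRY.match(line)
        if match:
            counts[match.group(5)] += 1
    return counts


# Three entries copied from the log above.
sample = """\
  39 6365:34  0000000026a61731  [1,18,7]   Recovered via rewrite in-place
  40 6365:34  0000000026a61734  [1,18,7]   Recovered via rewrite in-place
  41 6365:35  0000000026a652a1  [1,17,6]   Successfully reassigned
"""

counts = tally_background_scan(sample)
print(counts)
```

In the excerpt posted here, entries 1 through 40 were recovered by rewriting in place and entry 41 was reassigned, all within roughly twenty minutes of power-on time.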

hranitel_y2k

The SMART attributes show:

Code:

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       4
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1

4 sectors are flagged UNC (i.e. they cannot be read properly). I usually remapped sectors like that and watched how the drive behaved afterwards. But in your case, replace it.

garrotte

The next day there were so many of these errors that FreeBSD started crashing and dumping core.
In the end we bought a new drive; I pulled everything I could off the failing one with dd, repaired it, and copied it over to the new one.
The old one will go back under warranty.
Thanks to everyone who responded.
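
The thread does not say which dd options were used. The general idea of salvaging a failing disk is to keep reading past bad blocks and pad whatever cannot be read, which is what dd's conv=noerror,sync does. A minimal sketch of that skip-and-pad loop, with the device replaced by a hypothetical `read_block` callable so the logic can be tried without real hardware:

```python
import io


def rescue_copy(read_block, n_blocks: int, block_size: int, out) -> list:
    """Copy n_blocks to `out`; unreadable blocks are replaced with zeros.

    read_block(i) is a hypothetical callable returning the bytes of block i
    or raising OSError (e.g. errno 5, EIO) when the block cannot be read.
    Returns the list of block numbers that failed.
    """
    bad = []
    for i in range(n_blocks):
        try:
            data = read_block(i)
        except OSError:
            data = b""
            bad.append(i)
        # Pad short or failed reads so offsets in the image stay aligned.
        out.write(data.ljust(block_size, b"\0"))
    return bad


def fake_read(i: int) -> bytes:
    """Stand-in for a failing disk: block 2 always returns an I/O error."""
    if i == 2:
        raise OSError(5, "Input/output error")
    return bytes([i + 1]) * 4


image = io.BytesIO()
bad_blocks = rescue_copy(fake_read, 4, 4, image)
print(bad_blocks, len(image.getvalue()))
```

Keeping the list of failed blocks makes it possible to tell afterwards which files in the copied filesystem may contain zero-filled holes.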

forum.lissyara.su
