Synology NAS FAIL Adventure

EEVblog2

Published 26 Mar 2024
My Synology DS418 NAS failed again, but with some really bizarre symptoms.
Previous drive fail: • Synology NAS Western D...
The C2000 bug: • EEVblog #1288 - Synolo...
NAS thermal measurements: • WD Red NAS Thermal Mea...
Western digital Red NAS HDD teardown: • EEVblog 1398 - Western...
If you find my videos useful you may consider supporting the EEVblog on Patreon: / eevblog
Web Site: www.eevblog.com
Main Channel: / eevblog
EEVdiscover: / eevdiscover
AliExpress Affiliate: s.click.aliexpress.com/e/c2LRpe8g
Buy anything through that link and Dave gets a commission at no cost to you.
T-Shirts: teespring.com/stores/eevblog
#ElectronicsCreators #Synology #NAS
Science & Technology

COMMENTS • 289

@marklaffan 1 month ago ⁺⁴³
After years of running RAID arrays, from cheap ones to enterprise systems: this will always happen when a drive partially fails. It has not failed enough for the system to say "crap, you're dead", disconnect it and flag a warning to the user, so the system just sits there trying to write to the one bad drive and bogs down. Only when a drive fails completely does the array finally realise it and shut it down. Stupid, but it happens :)
@TradieTrev 1 month ago ⁺²⁹
Cracked up when he said he threatened it with physical violence. Time for a beer, I think, Dave :D
@thephantom1492 1 month ago ⁺⁵⁹
That can happen when the disk "works" but is super slow. The system never gets an error from the disk, so it keeps waiting for the operation to complete.
Since SATA drives do not have a big command queue, the operations still complete before the system timeout, so the system never drops the drive.
So if the timeout is 15 seconds, with the maximum of 32 queued commands at 4 kB each, that is 128 kB in 15 seconds, or a trip limit of roughly 8.5 kB/sec. That is if the drive supports the maximum NCQ depth of 32 and the system uses it. In reality it can be less.
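The arithmetic in the comment above can be checked with a short sketch. The 15-second timeout, the queue depth of 32 and the 4 KiB request size are the commenter's figures; the function name is made up for illustration, and real host timeouts and queue depths vary.

```python
def min_throughput_before_timeout(timeout_s: float, queue_depth: int, request_bytes: int) -> float:
    """Lowest sustained throughput (bytes/s) at which a full queue of commands
    still completes just inside the timeout, so the host never drops the drive."""
    return queue_depth * request_bytes / timeout_s


# Commenter's figures: 15 s timeout, 32 queued commands, 4 KiB per command.
rate = min_throughput_before_timeout(timeout_s=15, queue_depth=32, request_bytes=4096)
print(f"{rate / 1024:.1f} KiB/s")
```

Anything faster than this trip limit, however slow it feels to the user, would still look like a working drive to a host that only watches for timeouts.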
@ErazerPT 1 month ago ⁺⁴
Yep, that was my thought too. Broken enough to take forever but not enough to be a proper "fail". Had a fair few drives do that over the years, they never quite died but past a point never quite worked properly either.
@DuskHorizon 1 month ago ⁺⁴
That is where buying the "SAN" drives can help, they have TLER (Time Limited Error Recovery). They give up and error out earlier, which tends to be what you want in a RAID array.
@thephantom1492 1 month ago ⁺³
@@DuskHorizon TLER would not have helped. The commands get executed within the time frame. IIRC TLER is 7 seconds. If a command executes in 6, it passes, but you get a few kB/sec of transfer.
@metamon2704 1 month ago
I have a Synology NAS and this happened to me, the drive was flagged as bad with 'timeout' errors.
@MatthewMattoxcube8021 1 month ago ⁺¹⁰
As an enterprise storage guy, I have seen this with other storage arrays, where a disk goes bad and starts throwing garbage on the bus, which took a whole disk shelf offline. (Higher-end storage shelves can detect this and cut power to a slot.)
When this happens you can get very weird behavior.
@marios2liquid 1 month ago ⁺⁴²
I kinda like the pissed off Dave mode
@ksbs2036 1 month ago ⁺⁵
I enjoyed the pre-whining 🙂 "Go Ahead, Make My Day. Write your stupid comment" Lol
@Okurka. 1 month ago ⁺¹
That's the default mode.
@katrinabryce 1 month ago ⁺⁹
Just bear in mind if you are using WD Reds, that they changed the design of them a few years back, and the new design is not suitable for use in NAS devices with more than one drive. After a lot of push-back, they re-introduced the old design as "Red Plus".
So if you are replacing the drive, make sure you get a Red Plus, Red Pro, or Gold; or a Seagate Iron Wolf, Iron Wolf Pro, or Exos.
I've listed the product tiers for WD and Seagate from minimum acceptable to best. Last time I was looking for drives, the more premium tier ones were actually cheaper. Red Plus is the same model as the Reds you currently have in there.
@gibsonblogger 1 month ago ⁺²
Thanks for taking the time to make this video.
@markr9069 1 month ago ⁺⁷
It used to be that you had to use drives with Time Limited Error Recovery (TLER) in RAID configurations to avoid this exact problem. Not sure if your drives support/are enabled for TLER.
@oguzhan001 1 month ago ⁺¹¹
I'm saving that comment in case your cloud "backup" gets vaporized by big pharma.
@CAMintmier 1 month ago ⁺⁶
I wonder how bad that bad drive's SMART readings are. Glad nothing was permanently unrecoverable.
@user-we4og5bz2l 1 month ago
Thanks Dave I look forward to seeing your videos
@Hobypyrocom 1 month ago ⁺⁹
i have HDD from 1998 and it still works perfectly... last 10 years or so i am changing HDDs at least once a year or two... quality of electronics declined so bad lately, seems like the "cold war" era engineers are retired...
@MLeoDaalder 1 month ago ⁺⁷
There were duds back then too. My Dad once (around that 1998 timeframe) came home with a proverbial spring in his step: he had just bought a new computer (Mom was livid, it cost a month's salary). The HDD let out literal smoke during Windows installation.
@ncot_tech 1 month ago ⁺⁷
Your 1998 hard drive probably isn't trying to cram multiple terabytes into the same sized platters and using increasingly more advanced signal processing to get the data back off. remember - hard disk storage has increased by orders of magnitude but the physical size of the devices hasn't changed since the 80s. If you think modern hard drives are less robust, go learn how they work, you'll come away surprised they work at all 😄
@Hobypyrocom 1 month ago ⁺²
@@ncot_tech i know how HDDs work, i am a firmware dev for embedded devices... the problem is that precision and materials engineering have come a long way since the '90s, and with that it's natural to expect that even with higher data density and transfer rates, an important device such as an HDD would be way more robust and better than HDDs back then. the real issue here might be "planned obsolescence" and "cutting corners" to stay competitive...
@Hobypyrocom 1 month ago
@@MLeoDaalder true that, but again, failure nowadays is more common
@larrybud 1 month ago ⁺⁷
@@Hobypyrocom My experience is completely the opposite. You used to have to do scandisk ALL the time back 20+ years ago. Can't remember the last time I had a HD failure.
@edgar9651 1 month ago
Thanks Dave, sometimes there are just strange errors which we never experienced before. You fixed it. That's the important part. Take care.
@dolbyman 1 month ago ⁺¹
I have referenced your older Syno clock degradation video many times in the QNAP forums...very useful (QNAP units had the same issue)
@S95Sedan 1 month ago ⁺⁶
This isn't a Synology-specific thing. It can happen in Windows as well, where the system bogs down to a crawl because of an edge-case drive that can't be read properly, even if it's not running as your main (operating system) drive.
@atkelar 1 month ago ⁺²
The number of times that people actually *do* confuse a RAID or similar storage solution for a "backup" is alarming though. So I wouldn't be too hard on the commenters pointing that out - I even made a video about it way, way back in the days...
@QsTechService1 1 month ago
Thanks for sharing, Dave, appreciate it. Yeah, I did the modification on the same device with the clock signal, for the manufacturer defect in the chipset. This is pretty interesting.
@Rickmakes 1 month ago ⁺¹
My DS918 was acting bizarre the other day. After thorough diagnosing I found that the power brick was failing (indicator light was flickering). I ordered an off brand power supply and was back in business the next day. Lots of anxiety having a NAS down that you rely on for work.
@AK-vx4dy 1 month ago ⁺²
In ancient times some drives "failed" nastily, in a way that they wouldn't signal an error but would do things very, very slowly, so if there is no timeout the system will try to work, but 1000x slower.
@RichardDePas 1 month ago
I've had this same issue with two separate Netgear ReadyNAS units. Failed drive knocked them offline but all indicators showed fine. Learned to pull one drive at a time to find the bad one.
@michael.a.covington 1 month ago
This is something I came across on a laptop just the other day (and have encountered before). A failing disk drive takes longer and longer to respond but may not throw actual errors at all for a while -- then you start getting timeouts.
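The failure mode described in several comments here (slow responses, no reported errors) suggests watching latency rather than error counts. Below is a minimal sketch of that idea; the function name and both thresholds are hypothetical values chosen for illustration, not anything Synology or any drive vendor is known to use.

```python
def looks_degraded(latencies_ms, slow_ms=1000.0, max_slow_fraction=0.05):
    """Flag a drive as suspect when too many I/O requests are very slow,
    even though none of them returned an error.

    latencies_ms: per-request completion times in milliseconds.
    slow_ms / max_slow_fraction: illustrative thresholds (assumptions).
    """
    if not latencies_ms:
        return False
    slow = sum(1 for t in latencies_ms if t >= slow_ms)
    return slow / len(latencies_ms) > max_slow_fraction


healthy = [8, 12, 9, 15, 11] * 20      # ~10 ms per request
failing = healthy + [6000] * 10        # some requests take 6 s but still succeed

print(looks_degraded(healthy))
print(looks_degraded(failing))
```

The point of the sketch is only that a drive answering in six seconds never trips an error-based check, while a latency-based check can still notice it.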
@bikerchrisukk 1 month ago
Sorry to hear about your hassles! Am I right thinking that Synology install the OS across the drives? I thought they loaded it on a separate NVMe or dedicated drive?
@bobert4522 1 month ago ⁺²²
Your login web page may have just been cached, not actually loading the site from the CPU. Always good to check it in incognito.
@dolbyman 1 month ago
I do like that the glass front door breaking warrants a full evacuation of the building
@EEVblog2 1 month ago ⁺¹
People just hear an alarm and think it's a fire alarm. First time it's ever happened here. People did say that it sounded different to the normal fire alarm though.
@Razor2048 1 month ago ⁺¹
For the failed drive, see if you can run something like Spinrite on it, often if a drive runs into a few errors that lead to the "Reallocated sector count" to increase, it will cause some NAS appliances to freak out because the drive will just hang rather than immediately returning an error, thus it can take a while for a NAS to give up. But if the drive runs through those sector issues and the counter stops going up, then you can often get the drive to run again. Though at that point it is best to not trust the drive, but I have recovered arrays where 2 drives failed with a 1 drive redundancy.
One of the drives with that issue where a NAS rejected the drive, I used for a number of years (after completing a level 5 scan using spinrite on the 2TB drive) after to basically test a fresh windows install on various systems, where if a strange issue could be software related, I would just disconnect the original drive and connect the old rejected 2TB drive and test a fresh install of windows.
I largely did that for awhile until I could easily get cheap SSDs, now I use a cheap 240GB SSD for that purpose.
@Bluelagoonstudios 1 month ago
That's why I always have a backup disk outside the NAS; it's a Seagate Enterprise 8TB. Last month the power brick was also dead, so I replaced the damn thing and was back online.
@JoeStuffzAlt 1 month ago ⁺³
I never thought of working straight off a NAS like one like that. I might have to once I get a better life situation because I have more than 1 PC.
I also was thinking of a NAS like that, which hopefully would be easier than maintaining a PC to share with a family, but I don't think the others might be able to fix it if it goes down
@ncot_tech 1 month ago ⁺¹
Working off a NAS is fine if you aren't trying to move giant files. They are definitely easier to maintain than a regular PC though, the web interfaces are designed for end users, and when drives fail you just pop the drive bay open and swap the disk. It doesn't require completely disassembling a PC and fixing it afterwards.
@QuickQuips 1 month ago
It's ideal for media and documents. Plus the plus series can back computers up at a full level.
@JoeStuffzAlt 1 month ago
@@QuickQuips @ncot_tech Thanks for the perspective!
@BoraHorzaGobuchul 1 month ago
To really work off a NAS productively will require a good NAS, ideally with SSDs or enough HDDs, a good fast network card, and a good LAN, ideally 10GbE. Otherwise it may be slow enough to cause aggravation.
@m4d3ng 1 month ago
I'd be getting ready to replace the other drives, too
@cody5495 1 month ago ⁺¹
Crazy i have the exact same model and had this same problem last month
@EmilePolka 1 month ago
Well, the OS itself is actually stored on the actual HDDs, so if something fails, the OS goes along with it.
All Synology NAS units usually only come with a 1GB DOM, just enough to load their proprietary bootloader.
@UpLateGeek 1 month ago
Wow, that sounds like a massive hassle! I've been looking at upgrading the drives in my NAS for the last few months since it's pretty much full, but now I'm thinking it might be a good idea to replace the box with a new one too. Only problem is that I've been putting it off because replacing the drives is pretty expensive, and replacing the NAS as well is just going to make it even more expensive!
@BoraHorzaGobuchul 1 month ago
Why replace a nas? It'd make more sense to get a new one and keep the old box for backup of the new one.
@AzrethK9 1 month ago
Had a similar thing happen to a 3 months old Fujitsu W5010. Total lockup after reboot at the Fujitsu logo. One 2 TB WD Drive of the data Raid 1 was dying.
Never had this happen before with similar setup on dozens of W510/W520/W550/W580 with boot SSD and data HDD Raid 1.
@pedro_8240 1 month ago
Dave, configure a static lease for your NAS.
And you could look into repurposing a PC and running TrueNAS on it, it's fairly easy to setup basic stuff, it's set and forget just like many off the shelf stuff, works great, and gives you the option of easily upgrading it in the future.
@alexquant1335 1 month ago
This has prompted me to backup the config of my Synology!
@theythero123 28 days ago
Been in IT over 20 years and I can confirm this is a thing that happens. Sometimes drives fail in a way where they don't outright die but instead take ages trying to seek or write, timing out, and repeating. It's just one of those issues you have to either just be familiar with or be lucky enough to have some sort of log you can get to. It's annoying but not Synology's fault. It's also expected behavior that you would have a hard time logging in because I'm pretty sure Synology installs their OS to the drives themselves, so if the RAID array is unresponsive the OS would be, too.
I've got no excuse for the dumb third party locator program. I'd rather just log into the router or whatever is handing out DHCP and get the info straight from the horse's mouth. It's also good practice to do static reservations for important equipment and even better practice to just do static IPs outside of the dynamic range.
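The advice about keeping static IPs outside the dynamic range can be sanity-checked with Python's standard `ipaddress` module. This is only a sketch: the function name, the subnet and the DHCP pool boundaries below are hypothetical examples, not values from the video.

```python
import ipaddress


def is_safe_static_ip(candidate: str, network: str, pool_start: str, pool_end: str) -> bool:
    """True if `candidate` is a usable host address in `network`
    and does not fall inside the DHCP pool [pool_start, pool_end]."""
    ip = ipaddress.ip_address(candidate)
    net = ipaddress.ip_network(network)
    in_pool = ipaddress.ip_address(pool_start) <= ip <= ipaddress.ip_address(pool_end)
    usable = ip in net and ip not in (net.network_address, net.broadcast_address)
    return usable and not in_pool


# Hypothetical home LAN: router hands out .100 to .200 dynamically.
print(is_safe_static_ip("192.168.1.10", "192.168.1.0/24", "192.168.1.100", "192.168.1.200"))
print(is_safe_static_ip("192.168.1.150", "192.168.1.0/24", "192.168.1.100", "192.168.1.200"))
```

With an address chosen this way, the NAS stays reachable at a known location even when the DHCP server or a locator tool is misbehaving.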
@ecaparts 1 month ago ⁺¹
Time to upgrade that consumer NAS to enterprise server equipment... Time to check the dumpster room again. Love the rant on future comments about 'Truenas' and 'Nas is not a backup'. 😂😂 👍
@cuteswan 1 month ago ⁺¹
Years ago I got a 4-bay WD RAID 5 thing… two weeks before ThioJoe made a video about why RAID 5 won't cut it anymore. About a year later it did what he'd warned about: It worked when a drive failed but then while rebuilding encountered a read error and completely gave up. On the bright side, I'd paid $26 for the extended warranty and they refunded the full purchase price, and by then high-capacity drives could fit into my new case anyway. In any case, best of luck to you.
@KarlBaron 1 month ago ⁺¹
RAID5 is fine as long as you're using ZFS or BTRFS with checksumming and scheduling regular scrubs. The scrubs act the exact same way as a rebuild does and exercise the drives in the same way, so that any errors or weaknesses are detected before you've lost redundancy.
I admin a ZFS pool with 22 drives, the drives are scrubbed monthly, never lost a sector of data - the UBE rates reported on the white papers for drives that are quoted in the "RAID is dead" videos/blog posts are worst case scenarios so they don't have to honor the drive warranty.
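For context on the "RAID is dead" argument mentioned above, the usual back-of-the-envelope calculation can be sketched as follows. The error rate of 1 in 10^14 bits is an assumed spec-sheet style figure and the drive sizes are hypothetical; as the comment above argues, real drives may do much better than the quoted worst case, so treat the result as an upper-bound style estimate rather than a prediction.

```python
import math


def rebuild_read_error_probability(bytes_read: float, errors_per_bit: float) -> float:
    """Probability of at least one unrecoverable read error while reading
    `bytes_read` bytes, assuming independent errors at `errors_per_bit`."""
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed in a numerically stable way.
    return -math.expm1(bits * math.log1p(-errors_per_bit))


# Hypothetical 4-bay RAID 5 with 4 TB drives: a rebuild reads the 3 surviving drives.
p = rebuild_read_error_probability(bytes_read=3 * 4e12, errors_per_bit=1e-14)
print(f"{p:.0%}")
```

Regular scrubs, as described in the comment, read the same sectors a rebuild would, so latent errors surface while redundancy is still available.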
@katrinabryce 1 month ago ⁺¹
@@KarlBaron ZFS isn't RAID5 though, it is RAIDZ1 (or Z2 etc), which, unlike RAID5, gives you protection from non-catastrophic disk failures. That is where the drive responds, but gives the wrong answer. RAID5 will know there is a problem, but won't know which drive is giving the wrong answer. RAIDZ1 will be able to figure that out.
You should not use BTRFS's RAID5 equivalent as it is not stable.
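The distinction drawn above (parity alone detects a mismatch, checksums identify the culprit) can be shown with a toy model. This is a sketch only: it uses three tiny in-memory blocks, XOR parity and SHA-256, and does not reflect how ZFS, BTRFS or mdraid actually lay out data or store checksums.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (single-parity, RAID 5 style)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def checksum(block):
    return hashlib.sha256(block).digest()


data = [b"AAAA", b"BBBB", b"CCCC"]           # one stripe across three data drives
parity = xor_blocks(data)                     # stored on the parity drive
checksums = [checksum(b) for b in data]       # stored separately from the data

# Drive 1 silently returns wrong data without reporting any error.
read_back = [data[0], b"BxBB", data[2]]

# Parity alone: we learn the stripe is inconsistent, but not which block is wrong.
parity_ok = xor_blocks(read_back) == parity

# Checksums: the bad block is identified, then rebuilt from the others plus parity.
bad = [i for i, b in enumerate(read_back) if checksum(b) != checksums[i]]
others = [b for i, b in enumerate(read_back) if i != bad[0]]
repaired = xor_blocks(others + [parity])

print(parity_ok, bad, repaired)
```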
@simccaffrey 1 month ago
back everything up before you do anything else...especially a rebuild, the most likely time for another drive to go is during the rebuild (in which case you'll lose everything).
@zebo-the-fat 1 month ago
Be sure to back everything up to a good quality cassette tape!
@maxheadrom3088 1 month ago
I would love an investigation into the SATA port multiplier issue that causes a second disk to fail when a first one fails. ("First" and "second" have no relation to their position or number: the first fails from natural causes, and that causes the second to fail as well.) Thanks!
@Xiefux 1 month ago
id suggest buying wd easystore drives (or similar ones) instead. you can take the drive out and use it in a NAS, all you need to do is cover a couple pins on the power connector to get it to work. the drives themselves are basically same ones they use in datacenters, more reliable than most. much cheaper too
@x3roxide 1 month ago
I'd be interested to see the drive health status using something like crystal disk info.
curious as to why the synology nas didn't pick it up.
@russellhltn1396 1 month ago ⁺¹⁴
What I've found is that modern hard drives do their darndest to hide any of their pain from the user. Sometimes even hiding it from diagnostic software. Your only clue is excessive time to respond. Why Synology doesn't have a time out - you'll have to ask them. They probably also have a threshold where it takes a number of read failures to trigger a fault.
@KarlBaron 1 month ago ⁺⁵
Yeah that's the main difference with enterprise/data center drives, NAS drives and consumer/desktop drives. Not so much the hardware itself but the firmware. Enterprise drives have firmware that fails fast to allow for the RAID controller to deal with it, desktop drives spend ages trying to re-read bad blocks.
@jondonnelly4831 1 month ago ⁺⁶
Synology will just tell you to buy Synology brand drives to be sure it is fully supported.
@ncot_tech 1 month ago ⁺³
I've got a four drive TrueNAS machine and have had to replace three drives over the past 8 years. Only 1 of those drives ever gave me a warning before it happened. And the only way I noticed something was wrong was when file transfers took forever. Two of the drives failed because their spindles stopped spinning so at least the machine could tell there was a fault with them. The last drive just started giving SMART errors so it was me who decided to replace it.
Home NAS setups need to make it really really obvious when things are going wrong. We stick these boxes in cupboards, under a pile of wires under our desks and then forget about them. They're not in a data centre being monitored 24/7. I think they should make angry beeping noises or something persistently irritating so their owners can figure out there's a problem. Blinking LEDs and email status messages that get lost in spam aren't the way.
@htwingnut 1 month ago ⁺⁴
@@KarlBaron Most NAS drives you can adjust TLER. I had to set TLER for my disks to 7 seconds, but needed to add smartctl commands to /etc/init/enableTLER.conf file. But yeah, this should be a default behavior of Synology NAS devices.
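The comment does not show the actual commands, so the following is an assumption about what such a setup might look like. `smartctl -l scterc,READ,WRITE` is the smartmontools option for SCT Error Recovery Control, with times given in tenths of a second; the helper names and the device path are hypothetical. Not every drive supports the setting, and on many drives it reportedly does not survive a power cycle, which would explain running it from a startup script.

```python
import subprocess


def scterc_command(device: str, read_s: float = 7.0, write_s: float = 7.0) -> list[str]:
    """Build the smartctl command that sets the error recovery time limits.
    smartctl expects the limits in units of 100 ms, so 7 s becomes 70."""
    read_ds = round(read_s * 10)
    write_ds = round(write_s * 10)
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


def apply_scterc(device: str) -> None:
    """Run the command (needs root and smartmontools installed)."""
    subprocess.run(scterc_command(device), check=True)


# Hypothetical device path; print the command instead of running it.
print(" ".join(scterc_command("/dev/sda")))
```

Building the command separately from running it makes it easy to review what would be executed on each drive before touching real hardware.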
@KarlBaron 1 month ago ⁺²
@@htwingnut yep NAS drives are typically tuned closer to the data center drives with their firmware so that you don’t screw up RAID rebuilds and such, but since they’re still consumer oriented you can’t quite rely on them without researching the models because occasionally you get someone like WD selling you SMR drives
@james141111 1 month ago
Ditched my Synology, it just used to eat drives. I was replacing 2 a year on a 6-bay unit, and I wasn't buying cheap drives. Moved to TrueNAS with SSDs, no problems since.
@IanScottJohnston 1 month ago ⁺¹
I have a 6-bay QNAP which threw 2 drives. I kept them and eventually, for fun, put all 6 in a home-built TrueNAS box, and it's still going strong without error to this day!
@Kifter1983 1 month ago ⁺¹
Drive-related problems on regular PCs would always lock everything up and make the system unusable. This being a dedicated piece of hardware, you'd think they would have that kind of issue under control and well managed. Also, I'd be a bit wary of running the check if there's anything on the array that's crucial. Those file system check programs typically just take an axe to the data and chop out all the bad parts...
@thomasesr 1 month ago ⁺¹
I have a HP server at home, and I was testing some old hard drives to see if there was some data left and one of the drives made the whole server freeze. There is something to do with the S.M.A.R.T that works just enough to make the computer wait for a response that never arrives. When I hard reset the system it would hang on boot while detecting the drives on the array initialisation. And it only started working again after I removed the bad hard drive.
@SionynJones 1 month ago ⁺¹
It's because it relies on SMART data. Steve Gibson from GRC has a good write-up on the failings of SMART-enabled drives.
@stephengentle2815 29 days ago
I had an issue a bit like this just the other day - this drive was one in a ZFS mirror pair in Proxmox but I think the issue (in both my and your case) is likely in the Linux SATA driver. It seems to lock up if it gets these kind of errors, but in my case I could still get in to the server (although it did fail to boot twice too with a kernel bug, but came back up on reset), it’s just certain things wouldn’t load like the ‘Disks’ page in the Proxmox web UI. The dmesg log was full of SATA errors. Seems in certain cases it just keeps trying and trying, slowing stuff down a lot, and never gives up and marks the drive as bad.
It may be that Synology’s software is trying to query something and the kernel is just blocking - perhaps their software is more monolithic so it stops anything from working, unlike Proxmox where only parts of the web UI wouldn’t load for me.
@ralphj4012 1 month ago
You would think something as potentially important as a NAS would have a more real-time alert popup, something like an SNMP window. Used Synology for years and so far (touch wood) only issue has been after a Windows update (well documented resolution).
@jerryfraley5904 1 month ago ⁺⁹
Thanks for sharing this Dave. I use a Synology NAS as a backup solution and also install them at client's facilities. While I did wonder about the location of the primary OS partition, I have never actually looked much at the OS-level allocations across the disk drives. This implies that the system partition may not be mirrored and if corruption or hardware failure impacts the drive, one would be SOL. So tomorrow, going to re-evaluate how we integrate these devices into our businesses. So, THANKS DAVE.
@janbrittenson210 1 month ago ⁺⁵
All drives have a copy of the system volume, and they're mounted raid 1. So you can boot the system from just a single drive, should you need to. The boot loader, kernel, and some various other things reside in flash - enough to download and install a new system to a set of bare drives, but not all of DSM. (Not sure if that's even DSM or some stripped-down version of it.)
@DigitalDependance 1 month ago ⁺²
Particularly if the drives aren't NAS drives. If it hits a bad sector and the drives aren't NAS drives the drive will do extended recovery which can take minutes to complete, which can either break the raid or make it hang if it doesn't respond for an extended period. A NAS drive will fail quickly on the bad sector and map it out and not try the extended recovery. (This is why NAS drives are required)
@BoraHorzaGobuchul 1 month ago
@@DigitalDependance Also, the definition of what makes a drive a "NAS drive" seems to be a bit shady nowadays. I'd say go Enterprise instead; those are more likely to report an error instead of pretending to be fine when they're not.
@DigitalDependance 1 month ago
@@BoraHorzaGobuchul It's the firmware, and you'd also presume they use better components for a longer MTBF when run always-on.
@jeremiefaucher-goulet3365 1 month ago ⁺⁵
My first was a FreeNAS (before the name change). My second was a QNAP, I figured something designed for it would be better.
Big mistake. I get so much trouble from a proprietary consumer solution. From now on, any future ones will be TrueNAS. That one is still running perfectly.
@KarlBaron 1 month ago ⁺²
The upside to Synology is that underneath it's nothing proprietary - it's just Linux mdraid and BTRFS with a pretty UI. Synology even have detailed instructions on their website on how to put the drives from a Synology into a Linux PC and mount them if you need to do recovery. One of the reasons I'm happy to use Synology at home even if I use TrueNAS at work.
@katrinabryce 1 month ago
@@KarlBaron With TrueNAS, you can pull the drives out and recover them in any FreeBSD system. Linux systems that have zfs support will probably also work.
@jeremiefaucher-goulet3365 1 month ago
@@KarlBaron Same as QNAP. I've had the displeasure of trying both.
Never had issues with the OS. It's the proprietary bits that cripples and are unstable whenever you try to do anything more difficult than an SMB share with a local user.
@itsmesb4399 1 month ago
I had a similar thing happen to me with my Dell PowerEdge T420, a failing WD purple drive would cause the whole system to lock up. I thought it was bad RAM, but a few days later the RAID controller told me the drive had a SMART failure and it automatically powered it down.
@dj_paultuk7052 1 month ago ⁺¹
Just my 2c here. IT engineer of 34 years. I have used NAS units at home over the years and have never really been happy with them, plus have had unexplained instances like yours. I have lots of data to store (music mainly), so i built my own server. Dell T110 tower server, these are dirt cheap on ebay. They are quiet and easy to work on. It has an 8 channel disk controller and can use regular SATA drives. So i maxxed it with drives, installed ESXi and then built a Windows Server 2012 R2 file server. So i use the File Server for my storage. The added bonus of this is that i can run GoogleDrive desktop and select specific folders for live backup to the cloud. Plus i use the server for other VMs, PFsense router, PiHole for blocking ads, and so on. I have a small APC UPS to filter its power and give some backup time and the whole setup has been sweet for the last 6 years.
@KarlBaron 1 month ago ⁺³
I admin a TrueNAS machine at work, which is why I use a Synology at home. I don't want to have to deal with admin stuff in my free time. The Synology works fine, it's just Linux with a pretty UI. I've seen the exact problem Dave had here with both Windows and Linux boxen as well, a drive goes into recovery instead of reporting the error to the machine and the OS freezes up.
Synology does everything you describe as well - they have a built-in Google Drive client (along with a Dropbox client that actually works better than the official Windows desktop client with a large number of files). You just log in, check the directories to sync and it does it automatically.
Synology also has everything else you described like built-in docker support, VM support etc etc, all super user friendly. I run a linux VM for experimentation as well as some random docker stuff like homebridge, the unifi controller, etc. My Synology is from 2018 and has been rock-solid.
@Okurka. 1 month ago ⁺³
I guess you have never seen Dave use a computer if you think he can setup a server.
@originalmianos 1 month ago
No more esxi, the personal use one is canned now.
@mattatwar 1 month ago ⁺²
I recommend with the synologys you can use hyper backup to backup to like another file share, external drive or another NAS or even Google drive
@repatch43 1 month ago
I believe the Synology NAS's store their OS on the HDDs, so if the drive is having trouble but hasn't failed completely you can have issues exactly like this.
@yjk_ch 1 month ago
Yeah, which is also why Synology asked if you want to set-up new NAS after inserting HDDs(with Synology OS installed) using hot-swap. It probably boots into small "initial setup OS" if it can't boot into OS from HDD.
@sam2943 1 month ago
When drives are marginal, I try to use Spinrite to see if it can help the drive recover itself. I've had drives that have errors but not errored enough for the OSs to complain.
@tlhIngan 1 month ago ⁺²
Some drives, when they're failing will fail into a state where they work, but not really. It's nothing Synology could do - some drives lie and tell the host controller that it's got the data and please wait, it's coming. Of course, it hits the bad spot on the disk and never quite finishes reading the disk, so it retries and retries and hangs the system. And while it's hanging the system, it's hanging the controller because it's supposed to have the data ready. This is highly dependent on the drive firmware - you never state what drives they are. Most of the "NAS" drives will not do this - they will return with an error or not really quick, while desktop drives often will hang the system trying to get at the data. You can't really implement a timeout as it's a hardware failure - the only way to fix it would be a complete reset of the hardware itself as it's effectively dead. This means even if you hot swap the drive, nothing will happen.
@jeroenlodder5838 1 month ago
Yeah, that’s a drawback of software raid.
@saddle1940 1 month ago
I put 4 3Gig drives in my DS412+ in 2010/2011 and even though it's up 24/7, I haven't touched it since. Probably time to swap in some new drives and blow out the fans. Maybe they really don't make drives like they used to.
@Solkre82 1 month ago
As requested, I prefer to run a box that can do TrueNAS or similar so I'm never really bound to a vendor like Synology. I'll still have failures, but can recover/rebuild on any hardware I choose.
@MatthewSuffidy 1 month ago
Sounds really iffy for data. X drives missing still ok? I saw one NAS server just die and I used foremost on Linux to get as many pictures off of it as possible.
@hairrywolf9242 1 month ago
Also don't forget to update your software on your Synology NAS it looks like you're using a fairly old version, I have the exact same NAS and the interface looks totally different but I keep mine updated even though mine spends most of its time off
@G-Hawks 1 month ago ⁺²
Seriously though, TrueNAS :) lol sorry Dave, had to do it.
@thewhizard 1 month ago
The Synology OS is stored on the drives (?), so it's having trouble booting the OS?
@stephentidwell2022 1 month ago
I’ve seen drives fail like this before. It would even allow windows to attempt booting. Somewhere along that boot process it would just go incredibly slow and never fully boot.
Heck I’ve got a drive right now that works perfectly but has always thrown a caution since new.
Sometimes a drive is in denial about being in good condition and thus causes issues further down the line 😂
@InspectorGadget2014 1 month ago
Sadly, I ran quite recently in to a very similar problem;
In my case the external power-brick failed and with no spare power-brick available I moved the HDD's into another NAS with a built-in powersupply.
Eventually it looked good but quite soon it gave 1x orange led on 1x drive, 2x red leds on 2x drives and 1x drive green.
I also noticed the slow response of the NAS system, I believe the NAS system is waiting on the host's of the respective HDD's to respond.
Even worse, my drives are the infamous WD RED with SMR, so known to cause performance issues when things do go south.
Although I also expect the CPU of the NAS not being that powerful to deal with the many requests (errors) of the HDD's.
I do wonder what drive (brand & model) failed in your case, would not surprise me if it would be (also) SMR-type.
Odd thing is, in my situation, before the power-brick failed, all the drives were green and perfectly operational, and only gave problems after the power-brick failed and the HDD's were moved to another NAS.
I wonder if the old(er) NAS was not checking the health that "properly" of the HDD's?
@diyemc7206 1 month ago
Just as a further test: reseat the drive several times and see if it starts working... if it does, I would think of replacing the NAS.
I built one myself from a used HP Xeon server/workstation. That one all of a sudden started regularly reporting errors (maybe every 2-3 months) on 1 bay. Btw: SMART testing did not show any problem, and the SAS controller suggested problems somewhere on the bus, not on the drives. My best guess is that this bay's connector went bad... or the SAS cable... or some internal connector... Who knows... Anyway, the hardware is 10+ years old and had a 2nd life with me, so now it's time to retire it and migrate to something new :)
@karelmensik2698 1 month ago
I have one DS211+ which kept failing disks once every few months. I always replaced the disk before noticing that it was always slot 2 that failed. All those disks were healthy, but the SATA slot is somehow cursed; it generates errors itself.
@BlackICE1973 1 month ago
Your login screen seems to be from an old software version. Didn't you install all updates? There has been some strange behaviour like this with older software versions when a drive partially failed. With the newest software version this problem does not occur.
Also, this behaviour happens only when the problems are in the system partition area of the first drive.
@phynixheart1583 1 month ago ⁺¹
I've been doing computers for 20+ years professionally. This is not at all the first time I've heard of an issue like this. I've had a few computers that would not boot to BIOS if a particular broken drive was plugged in. This is the same situation. The driver chip on the logic board is likely broken in such a way that it works enough to send firmware information, but won't process further. This causes the CPU to expect data, but either not get it and wait for it, or be interrupted by the drive locking up the system. I've also had this issue a LOT with laptop wireless cards ( mPCI/mPCIe ).
@phynixheart1583 1 month ago
PS: Good documentation (telling and details) of your story!
@humidbeing 1 month ago
When mechanical HDDs were the norm (Win XP days) I saw several computers with bad drives that would run incredibly slow. Like take an hour to boot. Yet they never actually crashed or had any read errors. Running the OEM diagnostics (like Seatools) would eventually turn up some errors after running for like multiple days. Weird.
If your NAS is using RAID-5 with a parity drive, then the CPU has to constantly compute the real data using the parity info.
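To make the parity point concrete, here is a minimal sketch of how XOR parity lets an array recompute a missing block. It is only an illustration: the block contents and the `xor_blocks` helper are made up, real RAID-5 rotates parity across all drives rather than using one dedicated parity drive, and this rebuild work is normally only needed when a member is missing or unreadable.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks of one stripe, each on a different drive.
data = [b"NAS1", b"NAS2", b"NAS3"]

# Parity block the array would store for this stripe.
parity = xor_blocks(data)

# Pretend the drive holding data[1] cannot be read:
# XOR the surviving data blocks with the parity block to get it back.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Every read that touches the missing member costs the CPU this extra XOR pass over the surviving blocks, which is where the slowdown on a degraded array comes from.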
@markwerley6965 1 month ago ⁺¹
I've worked with Synology drives for many years and I do rather love them. In your case, for disk 1, I'd guess the drive had not completely failed. After some number (probably a large number) of failed read attempts, it would actually respond with the data. For whatever reason that delay was not sufficient to trigger a failed state in Synology's DSM OS. It might even be a tuneable parameter.
@kidcarrasco 1 month ago
On some NAS boxes, like Iomega, Lenovo, EMC, Dell, all system config is saved on disk 1; when disk 1 fails, all data in RAID 5, 0 or 1 is lost.
@janjschneider 1 month ago ⁺¹
Did you set up scrubbing of the RAID at regular intervals?
@eidodk 1 month ago
The errors in the log clearly show it's a physical disk error.
@janbrittenson210 1 month ago
The drives recover soft errors on their own by retrying; if successful they don't return an error status for the operation, but do bump the soft error SMART counts on the drive. As far as DSM is concerned the operations succeeded, but it may have taken a huge number of retries. Some drives can be configured to record soft error counts above a certain threshold as hard errors; not sure if DSM has any way to do this (it usually requires drive-specific utilities). When I/O operations take a long time to complete, processes that perform disk I/O (including paging something in or out) will block waiting, and this includes web server threads. I'd suggest checking the error counts on all your drives every six months or so (they're under the SMART status for each drive in DSM as I recall).
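For anyone who wants to do that periodic check from a script rather than the DSM UI, below is a rough sketch that pulls a few raw counters out of text in the format printed by `smartctl -A` (from smartmontools). The sample text and its values are invented, the helper names are hypothetical, and attribute names and meanings vary between drive vendors, so treat the watched list as an assumption rather than a standard.

```python
# Attribute names as smartctl commonly labels them; not every drive reports these.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

# Invented sample in the column layout of `smartctl -A /dev/sdX`.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   055   055   000    Old_age   Always       -       33012
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for rows with a plain integer raw value."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have the raw value in column 10.
        if len(parts) >= 10 and parts[0].isdigit() and parts[9].isdigit():
            attrs[parts[1]] = int(parts[9])
    return attrs

def worrying(attrs, watched=WATCHED):
    """Keep only the watched counters whose raw value is above zero."""
    return {name: attrs[name] for name in watched if attrs.get(name, 0) > 0}

flagged = worrying(parse_smart_attributes(SAMPLE))
```

On the invented sample, only the pending-sector counter is non-zero, so it is the only one flagged. A non-zero count is a prompt to look closer, not proof of failure.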
@EEVblog2 1 month ago
The NAS runs a SMART check every week.
@Monkeh616 1 month ago
@EEVblog2 Which is about useless - not only is the information not standardised, but on top of the numerous bugs in drive firmware, they also flat out lie.
@erikdenhouter 1 month ago
Would love to see the SMART data read with the manufacturer's tool, then you can also see if Windows can access it.
@edgarcornette6387 1 month ago
Dave, I wonder if the Synology NAS might use drive 1 for some kind of cache for the dynamic menu system, or maybe to speed up remote access. They might not have set a timeout correctly on that as well. I don't know how much RAM those have. Does it have a terminal to the command line to check running apps etc.? Anyway, yeah, I bet they use a small amount on drive 1 for something like that.
@tschuuuls486 1 month ago
I think it uses disk 1 to boot from.
@Monkeh616 1 month ago
No, drive 1 isn't treated any differently to any other drive. It just happened to be drive 1 this time.
@6581punk 1 month ago
I ditched these stupid proprietary boxes ages ago. Got a PC case with 24 drive bays, a couple of SAS controller cards and some extender cards which allow more drives on the cards. FreeNAS as the OS, a relatively modest 16GB of RAM and an AMD Ryzen. It also had a 10Gb Ethernet card in there, using copper not fibre.
@erfgzx4 1 month ago
The HDD S.M.A.R.T. status sometimes will not report the correct fault because the faulty part is the HDD read/write head itself, not the disk. The system will keep waiting for the faulty read/write head to read the disk. If the system doesn't have a wait timeout count, it may just keep waiting forever, and the system freezes with no response.
@leaveempty5320 1 month ago
Synology run loads of necessary processes, indexing, creating thumbnails etc which probably don't help drive life. They don't make it easy to disable these features, although you can with SSH and some command line stuff. Try a ps -ef.......
@towmantowman 1 month ago ⁺¹
I have an older DS1515+ and, from messing with it, drive 1 runs my OS and configs, as I don't have built-in memory for running the OS from, unfortunately. So if drive 1 dies on mine it needs a reset, then I can pull my config back from the Synology cloud config backup and it's back to normal again. Does yours have built-in storage like an M.2 at all? I didn't look your model up, but I'm throwing this out there to try and help.
@EEVblog2 1 month ago ⁺¹
Nope, doesn't have that. Has external USB 3 only.
@jeroenlodder5838 1 month ago
In a Synology, all drives run the OS. This is not configurable.
@krz8888888 1 month ago ⁺¹
Geez Dave did someone salt your coffee today 😅
@tlafeir 1 month ago
I had the same problem at a customer location. I replaced it after all the tricks failed. Seems to be an old Synology thing.
@eidodk 1 month ago ⁺¹
The Synology partition is what holds the operating system, it usually installs on drive 1 which is your failed drive - if that fails then it can't run the operating system without failing. When you remove Drive 1, you are ALWAYS required to reinstall the Synology partition. It's a hidden partition on the raid. Removing Drive 1 removes the operating system. That is why it tries to install the operating system again.. All the data is still available, it just needs the operating system reinstalled.
@simontay4851 1 month ago
That's why on my NAS, the OS is on a 4GB disk on module. All the HDDs could fail and it would still start up fine.
@Monkeh616 1 month ago ⁺¹
Even Synology aren't daft enough to have a single point of failure like that. The OS lives on a RAID1 across all drives.
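If you have SSH access, one way to see this for yourself is to look at the Linux software RAID status. The sketch below parses text in the `/proc/mdstat` format and lists arrays that are missing a member. It assumes DSM exposes the standard Linux md status file; the sample text, device names and sizes are invented, and `parse_mdstat` is a hypothetical helper.

```python
import re

# Invented sample in /proc/mdstat format: a small RAID1 spanning the drives
# (here missing one member) plus a RAID5 data array.
SAMPLE = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md2 : active raid5 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      11706589632 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md0 : active raid1 sdb1[1] sdc1[2] sdd1[3]
      2490176 blocks [4/3] [_UUU]

unused devices: <none>
"""

def parse_mdstat(text):
    """Return {array: {"level", "members", "status"}} from mdstat-style text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        head = re.match(r"^(md\d+) : (\w+) (\w+) (.*)$", line)
        if head:
            current = head.group(1)
            arrays[current] = {
                "level": head.group(3),
                "members": head.group(4).split(),
                "status": None,
            }
            continue
        # The status map shows one letter per slot: U = up, _ = missing.
        status = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if status and current:
            arrays[current]["status"] = status.group(3)
    return arrays

arrays = parse_mdstat(SAMPLE)
degraded = [name for name, info in arrays.items() if "_" in (info["status"] or "")]
```

In the invented sample the small RAID1 has lost its first member but still has three good copies, which is the point of mirroring the OS across every drive.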
@Quakes27 1 month ago
Ok so your system reported “Bad Sectors”. This happens when the drive fails. The reason your NAS can boot and operate is because it’s mostly only reading data. It does NOT matter if it’s your computer, this NAS or any other system. You need to have your Synology scrub all contents regularly so it knows when bad sectors start to form. It only knows this when files are accessed. Scrubbing does this. In Synology it’s under Storage Manager -> Storage Pool -> Data Scrubbing. I personally have it set up for the first of the month. It reads all data front to back and when it runs into bad sectors it will “Repair” the data somewhere else on the disk and flag the sector. You will then get a notification that bad sectors have “increased”, meaning the drive is on its way out, which gives you a heads up to begin replacing that drive before issues like this occur.
For more info I recommend a Tekzilla video from many years ago about ‘Are green drives killing your NAS’. Don’t worry about the title of the video. But they explain the scrubbing in detail.
@glynnetolar4423 1 month ago
Western Digital Red SMR drive by chance?
@infango 1 month ago
Wasn't there news a few months ago about faulty PSUs in Synology NAS units killing HDD arrays?
@theythero123 28 days ago
Also if one drive fails and the rest are the same model with the same mileage it's appropriate to expect the others to start cr*pping out soon.
@Agent24Electronics 1 month ago ⁺¹
Sounds like the drive has a bunch of bad sectors it was trying to recover, one after the other. I've seen PCs running very slowly for this reason, the drives had hundreds of bad sectors in SMART but weren't completely dead. Maybe a dying head/preamp or something.
Don't know if you were using a NAS grade drive with TLER or not, this may have helped in such a situation, or maybe it won't if there's a failing head and it's just getting error after error.
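On the TLER point: TLER is Western Digital's name for what the ATA spec calls SCT Error Recovery Control, and smartmontools can query or set it with `smartctl -l scterc`. The sketch below only builds such a command line, it does not run it. The helper name and device path are hypothetical, the 7-second default is just a commonly quoted value, and whether a given drive supports the setting or keeps it across a power cycle is not something this thread confirms.

```python
def scterc_command(device, read_seconds=7.0, write_seconds=7.0):
    """Build (not run) a smartctl call that sets SCT Error Recovery Control limits.

    smartctl expects the limits in tenths of a second, so 7.0 s becomes 70.
    """
    read_ds = round(read_seconds * 10)
    write_ds = round(write_seconds * 10)
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]

# Hypothetical device path; on a real system this list could be handed to subprocess.run.
cmd = scterc_command("/dev/sda")
```

With a limit like this the drive gives up on a bad sector after a bounded time and reports an error, so the RAID layer can fetch the data from the other members instead of waiting on endless retries. As noted above, it may not help if a head is failing and every read errors out.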
@jaro6985 1 month ago
Wonder if there is a way to send an alert after a certain number of disk errors. Though seems like such a rare issue and you might get false positives.
@KarlBaron 1 month ago ⁺²
Synology will send you notifications when there are disk errors (failed reads or increased bad blocks SMART count) via either email or push notifications, whatever you have set up
@QuickQuips 1 month ago
Dang. Maybe a five-drive enclosure with RAID 6/SHR-2 is in the future.
@AirzonesBlasters 1 month ago
Yeah that's "fun"... I recall having a similar thing happen on an HP-UX mainframe with 40 drives... So yeah, I got to spend some time playing whack-a-mole. Basically the drive mostly fails, but is marginally below the failure threshold.
@AirzonesBlasters 1 month ago
BTW... urgh, you use TrueNAS. And urgh... NAS isn't backup, urgh!
@NiddNetworks 1 month ago
From memory, these "domestic" Synology boxes have a small partition on the first disk which holds the working OS, RAID layout and geometry etc. If that's giving errors and retries, it'd definitely slow performance!
Re the wrong IP address, does the unit have two ethernet interfaces?
For those who aren't aware - RAID is NOT backup. Don't think that having RAID is your backup. RAID is just to increase reliability.
Backups are important. Backups which are NOT on the same box. Depending on what you're holding, not in the same room / building / post code!! (Dave uses a cloud based backup, to keep the important stuff safe... Be like Dave!)
@BoraHorzaGobuchul 1 month ago
Syno OS is spread across the disks iirc
@SeanBZA 1 month ago
Take that disk 1 and run a full SMART test on it, it likely will come up with the disk having had to reallocate sectors. Gsmartmon works.
@1kreature 1 month ago
Handling of disk errors is simply too poor on these systems. They have been like this all my time with 'em and I think it is not just the software on the NAS.
It seems many HDs never time out, as if it's their job alone to decide when to give up, and the NAS controller just waits for the drive to either complete or error out.
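To illustrate the alternative, here is a toy sketch of a host-side check that gives up on a drive for being too slow, even if it never returns an error. This is not how DSM or any particular controller works; the class name and all thresholds are invented for the example.

```python
from collections import deque

class SlowDriveDetector:
    """Flag a drive when too many of its recent I/Os exceeded a latency limit."""

    def __init__(self, limit_s=5.0, window=20, max_slow=5):
        self.limit_s = limit_s              # an I/O slower than this counts as "slow"
        self.max_slow = max_slow            # slow I/Os tolerated within the window
        self.recent = deque(maxlen=window)  # sliding window of slow/not-slow flags

    def record(self, latency_s):
        """Record one completed I/O; return True if the drive should be kicked."""
        self.recent.append(latency_s > self.limit_s)
        return sum(self.recent) >= self.max_slow

# Hypothetical latencies in seconds: the I/Os all "succeed", some just take forever.
detector = SlowDriveDetector(limit_s=5.0, window=10, max_slow=3)
decisions = [detector.record(t) for t in [0.01, 9.0, 0.02, 12.0, 14.0]]
```

Here the drive is flagged on the third slow operation. Counting slow completions rather than errors is what catches the "works, but at a crawl" failure described in this thread, at the cost of possibly kicking a healthy drive that was merely busy.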
@XSpImmaLion 1 month ago
It's because a holiday is coming that all hell breaks loose right before it. xD
Weirdly enough, I had kinda the opposite problem with my Synology NAS several years ago.. I think it was back in 2016 or 2017? Something like that. Mine is an older 2-bay model. Opposite in the sense that my Synology went on the too cautious side and gave me lots of headaches for it.
I was coming back from an end-of-year holiday visit to relatives, so it had spent a good month without me checking on it.
At first, seemed like a one drive fail, so crappy but fairly normal, I set the thing up to operate in RAID 1 (mirror), just a matter of switching the defective drive.
But then I noticed BOTH drives had failed. Like, almost simultaneously, from the logs. I could still access the DSM panel, but the contents of both drives were inaccessible.
Like what? That seemed like some BS to me. What are the odds?
And then from the logs I noticed this happened right after an OS update.
There's nothing really truly important on my NAS that I don't have several copies elsewhere, but I just didn't want to rebuild this whole thing from scratch, so I started exploring the thing. It's like, you always make a calculation whether it's easier to try to recover things somehow, or rebuild everything from scratch.
I can't really remember everything I did because... well, it's several years ago and it was all in kind of a panic, didn't even bother to document the whole process because it was a mess of Synology Forum visits, e-mail exchanges, checking out information all around, poking and probing with all sorts of recovery software, plus a bunch of stuff on decrypting hard drives and messing with SMART state. I even got a DIY bodge kit and some DIY instruction on how to reset the SMART status by shorting some crap on the HDDs and whatnot, which I ended up not needing.
My final conclusion on the whole ordeal was that after one Synology update and potentially a power failure, both of my drives decided to flag the SMART system, which took both of them down. I took both of them out and connected to my main desktop.
I think I ended up paying for a piece of software that could either decrypt or perhaps bypass something to read the drives, and then managed to recover all files after running out to buy a new one.
Here's the thing though - after I got all my files back and things settled... I set up both those supposedly failed drives to store some non-essential stuff outside the NAS, more like an external drive setup. They are working to this day, no bad sectors crop up, no weird issues, nothing. So I dunno what the hell happened to flag both drives' SMART system, but they are still working like 8 years later.
@echelonrank3927 1 month ago
These bad boys are notorious for wacky LED indications. If a drive is more than 3 years old the LED will go yellow etc.
@dedr4m 1 month ago
What brand of drive have you used?
Over a decade and a half ago, some models of Seagate's Barracuda mechanical drives would lock up on the first bad sector encountered. One could even mark the first sector as bad and the whole drive became unserviceable, short of a Low Level Format over UART.
The last few years it's been Western Digital whose controllers, in my experience, would just go into a coma and lock up the host system in bizarre ways (even my laptop's SSD at 97% usable life is having issues as though it's at 10%).
Seagate is currently far more reliable than Western Digital. It used to be the other way round.
@SaltCollecta 1 month ago
*sarcastic voice* Wow Dave, imagine using a NAS at all.
@ololh4xx 1 month ago
Dave, here is your answer from a software engineer's perspective: this software-assisted RAID protocol implementation probably is "incomplete" to some extent - RAID can recover from many failure states, but a few key software steps in the overall failure-detection algorithm seem to have been unaccounted for. Basically, this specific failure scenario probably was not included in all of their test cases. The reason being: someone was lazy, inexperienced or completely overburdened with work. The last option is actually somewhat common - sometimes entire major products rest on the shoulders of one or two developers - and that's probably not a good thing. It should always be a whole team - and it should've been done in an "agile" way, even if non-software dev people seem to hate that term.
@Poxenium 1 month ago ⁺²
Dave, you're doing it wrong
@byrd203 1 month ago
Set up email alerts so you get an email if this happens. I have the full email alerts set up on all apps too.
@hermannschaefer4777 1 month ago
Every time Dave talks about computers... oh boy... Anyway, it's quite normal that dead drives block a SATA/SCSI/IDE port or bus. The only way is to disconnect them and get a new one. "Better" RAID controllers like Areca or Infortrend do have a timeout for drives and kick them out of the RAID set. That's one reason you should not use standard drives in RAID arrays, because standard drives tend to drop out quite often (because of different firmware). Most NAS boxes just use a dumb SATA controller with no special logic for RAID arrays and do everything in software. Cheap, but you risk deadlocks like this.