Oldskooler Ramblings

the unlikely child born of the home computer wars

You cannot violate the laws of physics

Posted by Trixter on May 4, 2018

It’s technology refresh time at casa del Trixter.  I’m dabbling in 4K videography, and upgrading my 9-year-old i7-980X system to an i7-8700K to keep up.  Another activity to support this is upgrading the drives in my home-built ZFS-based NAS, where I back up my data before it is additionally backed up to cloud storage.  The NAS’ 4x2TB drives were replaced with 2x8TB and 2x3TB (cost reasons) in a RAID-10 config, and it mostly went well until I started to see disconnection errors during periods of heavy activity (e.g. a zpool scrub):

Apr 30 19:32:07 FORTKNOX kernel: sd 0:0:2:0: [sdc] Device not ready
Apr 30 19:32:07 FORTKNOX kernel: sd 0:0:2:0: [sdc] 
Apr 30 19:32:07 FORTKNOX kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 30 19:32:07 FORTKNOX kernel: sd 0:0:2:0: [sdc] 
Apr 30 19:32:07 FORTKNOX kernel: Sense Key : Not Ready [current] 
Apr 30 19:32:07 FORTKNOX kernel: sd 0:0:2:0: [sdc] 
Apr 30 19:32:07 FORTKNOX kernel: Add. Sense: Logical unit not ready, cause not reportable
Apr 30 19:32:07 FORTKNOX kernel: sd 0:0:2:0: [sdc] CDB: 
Apr 30 19:32:07 FORTKNOX kernel: Read(16): 88 00 00 00 00 00 08 32 11 70 00 00 01 00 00 00
Apr 30 19:32:07 FORTKNOX kernel: end_request: I/O error, dev sdc, sector 137498992
Apr 30 19:32:07 FORTKNOX kernel: sd 0:0:2:0: [sdc] Device not ready
Apr 30 19:32:07 FORTKNOX kernel: sd 0:0:2:0: [sdc]
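
As an aside, the CDB line in that log can be decoded by hand to confirm which read failed. The sketch below assumes the standard READ(16) layout (opcode 0x88, an 8-byte big-endian LBA in bytes 2-9, a 4-byte block count in bytes 10-13). The function name `decode_read16` is made up for illustration and is not part of any tool mentioned in this post.

```python
def decode_read16(cdb_hex: str) -> dict:
    """Decode a SCSI READ(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    return {
        # Bytes 2..9: starting logical block address, big-endian
        "lba": int.from_bytes(cdb[2:10], "big"),
        # Bytes 10..13: number of blocks to transfer, big-endian
        "blocks": int.from_bytes(cdb[10:14], "big"),
    }


# CDB copied verbatim from the kernel log above
info = decode_read16("88 00 00 00 00 00 08 32 11 70 00 00 01 00 00 00")
print(info)  # {'lba': 137498992, 'blocks': 256}
```

The decoded LBA matches the sector in the `end_request` line, so the failure was an ordinary 256-block read that the drive could not service at that moment.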

At first I thought the drive was bad, so I replaced it.  I then saw exactly the same types of errors on the replacement drive, so to make sure I wasn’t sent a bad replacement, I tested the drive in another system and it passed with flying colors.  So now the troubleshooting began:  Switch SATA ports on the motherboard:  No change.  Switch SATA cables: No change.  Switch SATA power cables: No change.  Switch SATA cables and ports with one of the drives that was working:  No change; that specific drive kept reporting “Device not ready”.  I even moved the drive to a different bay to see if the case was crimping the cables to the drive when I put the lid back on:  No change.

It was really starting to confuse me as to why this drive wouldn’t work installed as the 4th drive in my NAS.  I started to doubt the aging Xeon NAS motherboard, so I bought a SAS controller and a SAS-to-SATA forward breakout cable so that the card could handle all of the traffic.  This seemed to work at first, but eventually the errors came back.  I then started swapping SATA breakout ports, then entire SAS cables, then eventually a replacement SAS controller.  In all instances, the errors eventually came back on just that single drive, a drive that worked perfectly in any other system!

The solution didn’t present itself until I started building my replacement desktop system based on the i7-8700K.  In that system, I opted for a modular power supply to keep the cable mess at a minimum (highly recommended; I’ll never go back to non-modular PSUs).  When I was putting my video editing RAID5 drives into the new desktop, I noticed with irritation that each of the modular SATA power cables only had three headers on them instead of four.  This sucked because I was hoping to use one SATA power breakout cable for all four drives, and now I’d have to use two cables which added to the cable clutter inside the case.  This power supply was Gold rated and high wattage, so why only put three SATA power headers on a breakout cable?  In thinking about the problem, I came to the conclusion that the makers of the power supply were likely being conservative, to avoid exceeding the limits of what that rail was designed to provide.

And that’s when I remembered that I was putting four drives on a single rail back on the NAS, and not three like the new power supply was enforcing.  When I moved the misbehaving NAS drive to a SATA power header on another rail, all of the drive disconnection problems went away.  Whoops.

How did this work before?  The power draw of 2x8TB + 2x3TB drives was just high enough to be dodgy, when the previous configuration of older 4x2TB drives was not.  The newer drives draw more power than the older drives did.
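
The arithmetic behind that can be sketched roughly as below. Every number here is a placeholder picked only to show the shape of the check, not a measurement from this NAS or a figure from any datasheet. The rail limit, the per-drive peak currents, and the helper name `headroom_ma` are all assumptions; real values would come from the drive datasheets and the PSU label.

```python
def headroom_ma(peak_draw_ma, limit_ma):
    """Current left on the rail (mA) if every drive hits its peak at once."""
    return limit_ma - sum(peak_draw_ma)


# Placeholder numbers, NOT measurements from the NAS in this post.
RAIL_LIMIT_MA = 7000                   # assumed 12 V budget for one cable/rail
old_drives = [1500] * 4                # assumed peak draw of the 4x2TB set
new_drives = [2000, 2000, 1800, 1800]  # assumed peak draw of 2x8TB + 2x3TB

for label, drives in (("old 4x2TB", old_drives),
                      ("new 2x8TB + 2x3TB", new_drives)):
    spare = headroom_ma(drives, RAIL_LIMIT_MA)
    status = "ok" if spare >= 0 else "over budget"
    print(f"{label}: {spare} mA spare ({status})")
```

With numbers like these, four older drives fit under the limit while the newer mix tips over it only when everything peaks together, which would fit with errors that showed up only under heavy activity.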

Lesson learned, and now I have spare controllers and cables in case there’s a real failure.

7 Responses to “You cannot violate the laws of physics”

  1. Hah, yeah. Power Supply problems are sometimes unintuitive to diagnose because they tend to be so… un-digital.

    I learned my lesson as a teenager, when I gradually upgraded my 50MHz 486 to an IBM-branded Cyrix 6x86 (with several steps in between). At some point during those upgrades, my setup became unstable, the most noticeable symptom being gcc segfaulting or signaling internal errors when building large projects (kernel…) with alarming regularity (but of course no consistency).

    Having no income of my own yet, I had to put up with whatever was in front of me (I could tell a lot of other stories about that), so I tried just about everything to deal with the problem, swapping cables and toggling almost every BIOS setup option a million times.

    I don’t recall at all what it took to finally diagnose the power supply as the problem (though it was likely the fact that the power supply was one of the last remaining components), but I do remember the epiphany when I realized that my meager 150-200W power supply was the cause. It had been perfectly adequate for the 486DX50 PC it was originally a part of, but could not deliver enough power to my newer components.

    In hindsight it was also somewhat obvious, but at the time it hadn’t yet crossed my mind that a “perfectly good and working” power supply could have failure modes between “works” and “doesn’t turn on”.

    • Trixter said

      We can thank enthusiast PC gaming builders for better components in desktops.

      I had a power issue with my 486 the first time I was building it — the mainboard power connectors were not notched and I connected the PSU the wrong way. As soon as I turned it on, my lights went out. Turns out I had tripped the circuit breaker for the entire block of apartments I was in!

      • Oh god, those “P8” and “P9” labeled AT power connectors were the worst. Nice to know they apparently were bad enough to cause power loss for an entire apartment block at least once. 8)

        I still recall the mnemonic of having “all black in the middle” (if anyone’s curious, google for “P8 P9 mainboard power” and it becomes apparent how that applies). The further you go back the less coding there seemed to be. A few years earlier everything was somehow DB9, DB25 or DIN.

  2. Brolin Empey said

    Interesting post but I suspect you meant TB instead of GB for the storage capacities of the modern hard disc drives (HDDs). Also, I gather that a power supply can have a minimum load in addition to a maximum load. This probably does not matter in your case but it is another thing to consider when troubleshooting.

    Which OS are you using for the (NAS) computer using ZFS? Some form/variant of (Open)Solaris? ZFS sounds interesting but I have to use GNU+Linux instead of other *nix OSes because I use some Linux-specific stuff, such as Hamachi (to provide a Virtual LAN for my company). Yes, technically Hamachi is available for Mac OS X too but I do not drink the Apple Kool-Aid. No, I will not use the anachronistic neologism macOS instead of the original name of Mac OS X because Mac OS is the name of the original Mac OS (for m68k and PowerPC, not x86), not the Apple rebranding of NextStep originally named Mac OS X.

    • Trixter said

      I used to run Solaris, then OpenSolaris, then OpenIndiana. I now run Debian Linux with ZFS on Linux, which is very stable, but my next OS will be CentOS to match the activity I’m performing at my day job.

  3. Rich Shealer said

    This reminds me of a Novell Network I installed once. I kept adding new workstations to a new network. When I got to IBM PS/2 computer number X (I don’t remember how many, maybe the high teens or low twenties), the computer wouldn’t connect. I tried another Token Ring patch cable, no go. I took a patch cable from working computer number X-1 and put it on computer number X, and it worked fine. I tried the two “bad” cables on computer number X-1; they didn’t work. In fact none of the remaining cables worked.

    I tried various combinations, removing cables from other stations; they all worked, but these last few did not. At least that’s what it seemed like.

    Long story short, the cables were incorrectly made. The cable was twisted pair, but the pairs were not used properly. The Token Ring adapters were able to make it work for quantity X-1 computers, but computer connection number X was the straw that broke the camel’s back and failed out of the system. Since no one was using the network yet, I didn’t notice whether speeds suffered as the incorrect connections were added, or whether other computers dropped out during this exercise.

    • Trixter said

      I think these are the kinds of things that old electrical engineers always keep in the back of their heads, and work around early. I was mostly “born digital”, so I tend to think in bits only.
