I’m a heavy user of iSCSI and frequently advocate its use. For me it’s been reliable and has saved money. However, over the last year Steve and I have worked with two clients where the use of iSCSI caused an interesting corruption situation.
The first was a database corruption repair engagement. The database was genuinely corrupt, but as we worked to repair it we kept noticing new corruption in places where it hadn’t been before. The engagement quickly changed to helping the customer find the cause, and we narrowed it down to the storage. In this case the client had Dell EqualLogic storage, Dell switches, Dell servers, and VMware. It turned out there was both a bad twinax cable AND really old firmware on the iSCSI switch, and the combination of the two was causing the problem. As soon as the firmware was updated and the cable replaced, the corruption stopped getting worse and we were able to repair the database.
The second instance involved Hyper-V, Dell EqualLogic, and Dell switches. The customer reported corruption in their database. This database was significantly smaller, so DBCC CHECKDB took only a few minutes to run instead of hours. We noticed that each time we ran CHECKDB the database came back corrupt, but in a different way each time. Migrating the VM to a different server fixed the problem, and no corruption repair was needed: the data on disk wasn’t actually corrupt, so once the iSCSI path was fixed CHECKDB came back clean.
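That “different errors every run” pattern is the tell. A minimal sketch of the kind of check we ran is below; the database name is a placeholder, and this isn’t the exact script from the engagement, just the idea: run CHECKDB more than once and compare the reported errors. If the errors shift between runs, suspect the I/O path rather than the pages on disk.

```sql
-- Hypothetical database name "SalesDB" used for illustration.
-- Run the check twice and compare the error output of each run.
-- Stable, identical errors usually mean the pages on disk really are damaged;
-- errors that change from run to run point at the storage path (cables,
-- switches, firmware) corrupting reads in flight.
DBCC CHECKDB (N'SalesDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO
DBCC CHECKDB (N'SalesDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO
```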
The switch firmware was also old in this case. A third client of ours has Dell EqualLogic storage and the same SAN switches as the second customer but has never had a problem. They keep their firmware up to date, and I know from past experience that this specific Dell switch wasn’t a picture of stability in its early firmware versions.
So what did we learn? First, regular DBCC CHECKDB runs are important. Second, regular firmware and software updates are needed. Finally, regular backups are required, no matter how redundant or resilient the system.
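For the first and third lessons, a minimal T-SQL sketch is below. The database name and backup path are placeholders, and the firmware lesson obviously can’t be scripted here; the point is simply that the consistency check, the checksummed backup, and the backup verification should all be routine, scheduled work rather than something you reach for after corruption appears.

```sql
-- Hypothetical names: database "SalesDB", backup path on a separate server.

-- 1. Regular consistency checks catch corruption early.
DBCC CHECKDB (N'SalesDB') WITH NO_INFOMSGS;

-- 2. Back up with CHECKSUM so damaged pages are flagged at backup time.
BACKUP DATABASE SalesDB
    TO DISK = N'\\backupserver\sql\SalesDB_full.bak'
    WITH CHECKSUM, INIT;

-- 3. Verify the backup media is readable (not a substitute for test restores).
RESTORE VERIFYONLY
    FROM DISK = N'\\backupserver\sql\SalesDB_full.bak'
    WITH CHECKSUM;
```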