I seem to run into this more often than I would like. The Dell MD3000i (a rebadged IBM which is also available as a generic LSI product) is a horrible SAN. The worst part is that you can have terrible performance with no way to know why unless you run CLI commands and figure out that the write cache is disabled. There’s no SNMP and no other reliable way to monitor the damn things, which is probably because they want you to buy their EqualLogic line.
Anyway, I had a battery die on a controller again and write cache was disabled. Replacing the battery on that controller did not fix it; it was still saying the write cache was suspended. After a LOT of poking it finally came back, but I don’t know exactly how or why. I do have a theory, though, and I hope it helps someone out some day.
1) It is a known issue that MD3000i units ship with writeCacheEnabled=true and mirrorCacheEnabled=true, which puts the write cache into a suspended mode because it can’t mirror with only one controller. (They come with one controller unless you order two.)
2) Fixing the above problem requires disabling both, then re-enabling only the write cache.
3) Trying the above did not work. The mirror was disabled, but the write cache was still suspended. Somehow, disabling the mirror and forcing it to ignore the battery (cacheWithoutBatteriesEnabled=true) got write cache working again. That’s not a good or safe solution, though.
4) Disabling all write cache settings, doing an offline/online of each controller, and then enabling all the write caches and fixing paths seemed to work.
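For anyone trying to repeat step 4, here is a rough Python sketch that only builds the list of script commands in the order I described; it does not touch the array. The function name `cache_reset_script` is made up for illustration, and the command spellings (`set allVirtualDisks ...`, `set controller [n] availability=...`) are my assumption based on the setting names above, so check them against the CLI guide for your firmware before running anything.

```python
# Hypothetical helper: builds (does NOT run) the command sequence from step 4.
# Command syntax is an assumption; the setting names are spelled as in this post.

def cache_reset_script(controllers=(0, 1), mirror=True):
    """Return the step-4 commands, in order, as a list of strings."""
    cmds = [
        # Turn off every write cache setting first.
        "set allVirtualDisks writeCacheEnabled=false;",
        "set allVirtualDisks mirrorCacheEnabled=false;",
        "set allVirtualDisks cacheWithoutBatteriesEnabled=false;",
    ]
    # Bounce each controller in turn. Wait for one to come back
    # online before taking the next one offline.
    for c in controllers:
        cmds.append(f"set controller [{c}] availability=offline;")
        cmds.append(f"set controller [{c}] availability=online;")
    # Re-enable the write cache, and mirroring only if you have two controllers.
    cmds.append("set allVirtualDisks writeCacheEnabled=true;")
    if mirror:
        cmds.append("set allVirtualDisks mirrorCacheEnabled=true;")
    return cmds


if __name__ == "__main__":
    for line in cache_reset_script():
        print(line)
```

I would run the printed lines by hand, one at a time, rather than feeding the whole thing to the array in one go, since each controller needs time to come back online. On a single-controller unit, call it with `controllers=(0,)` and `mirror=False`. Fixing the paths afterwards is still a manual job.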
I think it refuses to mirror the cache, which keeps the write cache suspended, because the controllers still hold some weird cache setting that doesn’t get cleared until you offline/online both controllers to “reset” them. It’s currently the best theory I have, but it doesn’t explain why I could force it to work by setting cacheWithoutBatteriesEnabled=true.
tl;dr this SAN is a big box of shit.