Author Topic: File Server Issues  (Read 992 times)

Conan

  • Postcount killed Trogdor
  • *****
  • Posts: 844
  • E-points: +44/-12
  • \(_o)/
File Server Issues
« on: September 07, 2015, 12:35:10 am »
FA has been stuck in "read only mode" for much of the past 15 hours after "issues" cropped up that have to do with the file server.

In typical FA fashion, the whole situation has been handled in ridiculous ways.

Quote
Update: We had to reboot a server after it experienced a kernel panic (it was unresponsive via IPMI). Doing some final integrity tests.

Those "checks" took less than 10 minutes before they opened the site up again. Minutes later, they tweeted that thumbnails were generating slowly, and then the whole thing fell over again.

Quote
Note: Site thumbnails are generating slowly for new submissions. We're aware and are working on it.

In usual FA fashion, a problem with thumbnails cropped up a few days ago that was probably a sign something was wrong. They didn't pay much attention to it, and look where we ended up. The "problem" is apparently bad enough that they were scrambling to relocate the data that lives on that server (which Piche said is "I/O", their only known file server).

Quote
We've determined there's an IO/hardware issue with the system which hosts our thumbnails. We're looking into relocating the data currently.
We're currently running full scale RAID integrity checks to try to find the source of the IO errors. Getting the site up is our priority.

Also typical for FA was the attitude that there was nothing wrong, apparently so much so that Dragoneer brought Sciggles along and had her wait in the car, because surely there's nothing a reboot won't fix instantly.

Quote
Sciggles is stuck waiting in the car while we fix this.

Shit. Did I roll the windows down?

The site has come back a few times in the past 15 hours, but each time it falls over as soon as a bunch of people submit content at the same time.

This is the first time I can recall them having a severe issue with the file server itself. It's pretty clear right now that they don't have redundant file servers, either, which doesn't bode well if the problems get worse.

I'm sure IMVU is going to love this.

Also if you were interested in what their rack looks like today, some pictures were tweeted:
https://twitter.com/Dragoneer/status/640633455320150016 - One has to wonder what the storage server "Spike" is doing...
https://twitter.com/Dragoneer/status/640631400748699648 - Switch
https://twitter.com/Dragoneer/status/640630222266728448 - Novastorm is almost 10 years old and is still being used for something.

Conan

Re: File Server Issues
« Reply #1 on: September 07, 2015, 04:00:42 am »
Whelp, turns out the thumbnail drive array's file system got corrupted and they lost all the thumbnails. All the thumbnails and previews need to be regenerated.

Quote
The thumbnail storage drive array came down with a filesystem corruption and has been reformatted. Due to this the data server now has to regenerate all of the previews, which happens on first request. Thumbnails will be loading slower than usual for the duration, which can take a few days. No user data has been lost.
We will continue to monitor the site performance and availability.

Which really sounds like we were so, so close to the "doomsday scenario" people have predicted, where they lose all the user files.
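
For anyone wondering what "regenerate all of the previews, which happens on first request" looks like in practice, here is a rough Python sketch of the general idea. This is not FA's code, and nothing here is known about how their data server actually works. `LazyThumbnailCache`, `fake_render`, and the `.thumb` file naming are all made up for illustration, and the "rendering" is a stand-in that just truncates bytes instead of resizing an image.

```python
import tempfile
from pathlib import Path


class LazyThumbnailCache:
    """Hypothetical sketch: build a thumbnail only when it is first requested."""

    def __init__(self, cache_dir, render):
        self.cache_dir = Path(cache_dir)
        self.render = render   # assumed callable: original bytes -> thumbnail bytes
        self.renders = 0       # counts how often the slow path was taken

    def get(self, submission_id, original):
        path = self.cache_dir / f"{submission_id}.thumb"
        if path.exists():
            # Fast path: thumbnail was already regenerated by an earlier request.
            return path.read_bytes()
        # Slow path: first request since the cache was wiped, so render and store.
        data = self.render(original)
        self.renders += 1
        path.write_bytes(data)
        return data


def fake_render(original: bytes) -> bytes:
    # Stand-in for real image resizing; it only keeps the first four bytes.
    return original[:4]


cache_dir = tempfile.mkdtemp()   # empty directory, like a freshly reformatted array
cache = LazyThumbnailCache(cache_dir, fake_render)
first = cache.get(101, b"full-size image bytes")    # slow: renders and stores
second = cache.get(101, b"full-size image bytes")   # fast: read back from disk
```

The point of doing it this way is that nothing has to be rebuilt up front, but every first view pays the rendering cost, which fits with thumbnails loading slowly for a few days after the wipe.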

venthewolf

  • Posts: 23
  • E-points: +1/-2
  • I'm your browser
Re: File Server Issues
« Reply #2 on: September 07, 2015, 11:01:11 am »