Friday, May 2, 2008

Recent WiscMail Outage

The CIO/Vice Provost for Information Technology sent a detailed message to the Information Technology Committee about Wednesday's WiscMail outage. We will, I'm sure, be getting another update at our next ITC meeting.

Just to quickly translate a few things in the message. First, a "memory leak" basically means that the program keeps claiming more memory without ever giving it back, so the longer it runs, the slower it becomes. Memory leaks are bugs that are occasionally encountered in new versions of software, and in general the right thing to do is to go back to the old version. Second, to understand "rebuild the message storage index database", just focus on "rebuild index." If you rip the index out of the back of a book, to replace it you have to start at the beginning of the book and re-read the whole thing. Without the index, the mail server can't find someone's email messages without searching through all of the messages. You can imagine that building an index for 60,000 inboxes takes a little bit of time.
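For anyone who prefers to see the book-index idea in code, here is a toy sketch in Python. It is only an illustration of the general idea, not how the Sun mail software actually works; the message list and the function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical message store: (owner, subject) pairs in arrival order.
MESSAGES = [
    ("alice", "Meeting notes"),
    ("bob", "Lunch?"),
    ("alice", "Budget draft"),
    ("carol", "Welcome"),
]


def find_without_index(messages, owner):
    """Scan every message in the store to find one person's mail."""
    return [pos for pos, (who, _) in enumerate(messages) if who == owner]


def rebuild_index(messages):
    """Re-read the whole store once, recording where each person's mail is."""
    index = defaultdict(list)
    for pos, (who, _) in enumerate(messages):
        index[who].append(pos)
    return dict(index)


def find_with_index(index, owner):
    """Look up one person's mail directly, without touching anyone else's."""
    return index.get(owner, [])


index = rebuild_index(MESSAGES)
print(find_with_index(index, "alice"))
```

Rebuilding the index has to read every message once, which is the slow part. After that, each lookup goes straight to the right messages instead of scanning the whole store.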

All

Now that WiscMail service is restored, I want to give you an update on what happened.

First, let me apologize for the inconveniences this caused you. I am certain that we all count on e-mail for much of what we do every day and having these services unavailable is a significant disruption for us.

Tuesday morning, the WiscMail team installed an update to the Sun software we use to process mail. It was believed to be a fairly modest upgrade. By noon Tuesday, a few of our mail servers were experiencing a "memory leak," which degraded from poor to no performance rather quickly. The mail team worked with the vendor to address the problems and decided to roll back the upgrade to the previous version of software. The team worked around the clock to restore service for our users. Wednesday morning we experienced performance problems that resulted from the rollback, which required the mail team to rebuild the message store index databases. Full IMAP, POP, and webmail email delivery services were restored early afternoon Wednesday.

It is extremely rare that our WiscMail service experiences an outage. I want to assure you that steps are being taken to make sure every future upgrade does not bring this effect. The mail team is working with Sun Microsystems to fully diagnose the error(s), fix the upgrade and reapply it when we have full assurance it will work under current loads. We will also hold a post-incident review to learn from this event and adjust our strategies as appropriate.

Thank you for your patience and trust.


--
Ron Kraemer
Vice Provost for Information Technology
Chief Information Officer (CIO)
University of Wisconsin-Madison
ron.kraemer@cio.wisc.edu


I'm sure that this will bring up some discussions about the future of WiscMail as well. Here are some things to think about.
  • First, WiscMail is not designed to have "no downtime." To truly take it to the next level, we would need to have multiple active data centers. At present, there is an emergency plan, and all data is continuously stored at multiple locations. If something catastrophic were to happen to the Computer Sciences building, not a single email would be lost. However, if the Computer Sciences building can't recover, it would take a few days to totally restore email service from the backup site. It would be VERY expensive to both fully equip the additional data centers and, more importantly, modify our email system (and the additional software that is necessary to run the email system) to automatically fail over and have just a few minutes at most of downtime. I don't know if there's something to be found in the middle of these two extremes, where we could restore email within, say, a few hours of a catastrophe at Computer Sciences. Clearly, there was no catastrophe at Computer Sciences on Wednesday, so there is a lot of work to do with the current system so it can recover quickly even with one active data center.
  • Second, switching to GMail or some other outsourced email provider is not a "no-brainer" in terms of cost. GMail would reduce some costs and increase others. We don't entirely know how those balance out. In addition, there are a whole host of new legal issues related to data ownership, privacy, and state records requirements that would need to be identified as part of any move. GMail does have multiple data centers, but they too have had the occasional multi-hour outage and are much less forthcoming about those outages.
The big questions will be "How much does email cost?" and "How much is email worth to us?" I don't know how much of this we'll get to discuss at the last ITC meeting of the year, but these issues will certainly be on people's minds, and hopefully we'll be able to have a good discussion on May 16th.

The Associated Students of Madison Shared Governance Committee Blog serves as a space for shared governance appointees and the UW-Madison student body to communicate on issues relating to shared governance. As part of their responsibilities as student representatives, appointees will post a report following each meeting attended.