Sometime on the weekend of January 6-7, tomcat on mustard crashed. It appears to have been a typical crash caused by unclosed logging threads.
When a sysadmin brought tomcat back up, the automated monitoring system sent an HTTP request to one or more Cocoon sites, got a response, and reported the problem as fixed. As far as I am aware, no manual check of the sites was done. The problem was that Cocoon was answering the HTTP request with an error rather than actual content (such as an ISE page). So although the monitor saw a response, all Cocoon-based sites were in fact dead.
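A check that only asks "did anything answer?" cannot tell a Cocoon error page from a real page. As a minimal sketch of a stricter check (not the monitoring system actually in use, which I have not seen), the Python below treats a site as healthy only if the response has status 200, contains a piece of text expected on the real page, and contains none of the known error strings. The function names, the marker text, and the timeout are all hypothetical; the error strings are the ones seen in this incident.

```python
import urllib.error
import urllib.request

# Error strings seen in this incident; extend as needed (assumption:
# they appear verbatim in the body of the Cocoon error page).
ERROR_MARKERS = ("Initialization problem", "TraxProcessor", "QuartzJobScheduler")


def looks_healthy(status, body, required_marker, error_markers=ERROR_MARKERS):
    """Return True only if the response looks like real page content."""
    if status != 200:
        return False
    if any(marker in body for marker in error_markers):
        return False
    # Require some text that only the real page would contain.
    return required_marker in body


def check_site(url, required_marker, timeout=10):
    """Fetch url and apply looks_healthy; any fetch failure counts as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return looks_healthy(resp.status, body, required_marker)
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections and timeouts.
        return False
```

Usage would be something like check_site(url, "some text from the home page") for each Cocoon site, with the marker chosen per site. Whether the Cocoon error page comes back with status 200 or an error status, this check reports the site as down.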
Michael Best discovered the problem and contacted sysadmin. A sysadmin attempted to work through the problem but was unable to solve it. Dr. Best emailed me about 3 or 3:30 on Sunday, and I got the message about 7 or 7:30 and worked through until about 11pm on various solutions.
When I looked at the ISE site I saw a Cocoon error, "Initialization problem", with a reference to TraxProcessor. I tried replacing several Cocoon config files and restarting, but the only thing that changed was the error: instead of TraxProcessor, I got QuartzJobScheduler. I came to the conclusion that a library had somehow been corrupted or was failing to load when Cocoon initialized. My solution was to overwrite the WEB-INF directory with a backup.
Recovering the backup from tape was not possible, as there was a TSM tape problem that could not be resolved Sunday night. The directory was recovered the next morning (Monday) from a copy on a sysadmin's local machine. Cocoon initialized and the sites were restored.
How do we stop this from happening again? I don't know, as I'm not certain I know what really happened. I now have at least two versions of the entire tomcat stack (lettuce and mustard) on my local machine, and a backup on the SAN. Recovery should take only a moment if this DOES happen again.
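To make that recovery repeatable, here is a minimal sketch in Python of taking a timestamped snapshot of a WEB-INF directory and later restoring it. It is an illustration under assumptions, not the procedure used on Monday: the function names and every path passed in are hypothetical, and tomcat is assumed to be stopped before a restore. The broken copy is moved aside rather than deleted, so it can still be examined afterwards.

```python
import shutil
import time
from pathlib import Path


def snapshot(webinf, backup_root):
    """Copy webinf into a new timestamped directory under backup_root."""
    dest = Path(backup_root) / time.strftime("WEB-INF-%Y%m%d-%H%M%S")
    # copytree creates dest (and missing parents) and fails if dest exists.
    shutil.copytree(webinf, dest)
    return dest


def restore(snapshot_dir, webinf, quarantine_root):
    """Replace webinf with snapshot_dir, keeping the broken copy for study.

    Assumes tomcat has been stopped first (not enforced here).
    """
    webinf = Path(webinf)
    quarantine_root = Path(quarantine_root)
    quarantine_root.mkdir(parents=True, exist_ok=True)
    if webinf.exists():
        # Move the suspect directory out of the webapp, not just rename it.
        broken = quarantine_root / time.strftime("broken-%Y%m%d-%H%M%S")
        shutil.move(str(webinf), str(broken))
    shutil.copytree(snapshot_dir, webinf)
```

A snapshot taken after each known-good deploy, kept somewhere other than tape, would have made Sunday night's fix possible without waiting for TSM.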