HCMC Journal

Web cluster issues 2025-08-09 to 2025-08-15

to : Martin Holmes
Minutes: 85

Over the weekend, I got repeated emails from UptimeRobot reporting that our home1t-hosted sites were down for five or ten minutes at a time. After much puzzlement and back-and-forth with sysadmin and the help desk, it turned out that one of the web cluster servers, rangpur, had lost its connection to the CephFS filesystem. As a result, if you happened to get a session on that server, you would see 403 errors for all our sites, but if you got one of the other servers, everything would look normal. It took a day and a half to figure this out. A simple restart of rangpur seemed to fix it, but we are a bit worried that nothing alerted sysadmin when the mount problem occurred. It also seems to us that when similar problems have happened in the past, rangpur has been the problem server, so we are hoping they can discover whether there’s something peculiar or different about that particular server and fix or replace it.
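For what it's worth, the kind of per-server check that would have caught this could look something like the sketch below. It is only an illustration, not anything sysadmin actually runs: the function name `check_mount` and the example path `/mnt/cephfs` are my own assumptions, and I don't know where CephFS is really mounted on the cluster nodes. The idea is to confirm that the path is a mount point and can be listed, and to give up after a timeout, since a broken network mount may hang rather than fail.

```python
import os
import threading


def check_mount(path, timeout=5.0):
    """Return (ok, reason) for a filesystem expected to be mounted at path.

    Hypothetical helper: the probe runs in a daemon thread so that a
    hung network mount cannot block the caller beyond `timeout` seconds.
    """
    result = {}

    def probe():
        try:
            if not os.path.ismount(path):
                result["reason"] = "not a mount point"
                return
            os.listdir(path)  # raises OSError if the mount is unreadable
            result["ok"] = True
        except OSError as exc:
            result["reason"] = f"unreadable: {exc}"

    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout)

    if worker.is_alive():
        return False, "timed out"
    if result.get("ok"):
        return True, "ok"
    return False, result.get("reason", "unknown")


# Example call (the path is an assumption, not the real mount location):
# ok, reason = check_mount("/mnt/cephfs")
```

Run from cron on each node, with a non-OK result sent to whoever is on call, this would flag a single misbehaving server directly, which an external monitor hitting the load-balanced site can only do intermittently.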