We've been having spontaneous reboots on several machines in the last two months or so.
We've had an electrician double-check the power we're getting and all is well.
Looking into potential computer-based issues, I discovered that many people experience this kind of thing with SSDs on Sabayon, Debian, Ubuntu, CentOS, and likely other distros. There do not seem to be any real solutions, but suggestions include the usual:
1) Get the latest firmware for the drive. Right now we have at least two different models of SSD in our machines (haven't checked Martin's yet): OCZ Vertex (96GB running f/w v1.6) and Vertex2 (115GB running f/w v1.29 or 1.33). Firmware and general info on OCZ drives can be found here. Firmware is here
2) Adjust fstab and /etc/rc.local like this
I'll come back to this next week.
<egg_on_face>
Figured it out. The problem was I stupidly changed the security update config to reboot after a security update gets done. It explains everything. The file has been edited, the package updated. Now I wait with my fingers crossed...
</egg_on_face>
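For anyone checking their own machines for the same mistake, here is a minimal sketch. It assumes a Debian/Ubuntu box using unattended-upgrades, where the relevant setting is `Unattended-Upgrade::Automatic-Reboot`; the post above doesn't say which tool or file was actually involved, so the path and the helper name `auto_reboot_enabled` are illustrative only.

```shell
# Hypothetical helper: succeed if the given apt config file has an
# uncommented Automatic-Reboot line set to "true".
auto_reboot_enabled() {
    grep -Eq '^[[:space:]]*Unattended-Upgrade::Automatic-Reboot[[:space:]]+"true"' "$1"
}

# Assumed default location on Debian/Ubuntu; adjust for your setup.
conf=/etc/apt/apt.conf.d/50unattended-upgrades
if [ -f "$conf" ] && auto_reboot_enabled "$conf"; then
    echo "automatic reboot after updates is enabled in $conf"
fi
```

Setting that line to "false" (or commenting it out) should stop the machine rebooting itself after an update.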
SR has added something that requires curl, so add it to the build script.
Discussed timeline and sequence for this.
Main goals:
* A web cluster will be created (only mustard to begin with) that will handle all web apps (in the case of Cocoon apps, it will - as it does now - handle the port 80 end of things). Mustard will become mustard.hcmc.uvic.ca and will be built up with RHEL5-64bit and all current s/w required (Apache, PHP, MapServer, etc.). As far as h/w goes, RE figures the 3GB RAM we have in mustard and lettuce is sufficient.
* Chard DBs will be either retired or migrated to Cress. Chard will then be rebuilt with latest/greatest (RHEL5-64 and so forth) and redeployed (as chard.hcmc.uvic.ca). Once it's handling the DBs entirely on its own we need to decide how to use Cress (more on that in a bit).
The plan is to have all HCMC machines behind the main UVic load balancers; all outside requests go through the load balancers.
Mustard and (eventually) Lettuce will be a web cluster of two, and will be the application servers for all PHP/SQL apps, as well as handling all port 80 requests to our Cocoon apps.
All Tomcat/Cocoon requirements will be handled by Pear and fronted by Mustard/Lettuce/load balancers.
All SQL-type DB requirements will be handled by Chard. (We need to decide how best to use Cress in future. Choices include mothballing it, replacing Lettuce with it, or continuing to use it as a running back-up for Chard. Although the safety advantage of the latter is good, it seems like a bit of a waste of a machine. Anyway, we have time to discuss this amongst ourselves.)
So, the sequence and timeline go something like this:
RE would like to begin work on Mustard by mid next week (as in August 4th). To do that he needs the go ahead from us; I'll be going through a final check today/tomorrow.
Once that's done (RE figures it shouldn't take more than about a week to get mustard back on its feet) Chard gets similar treatment: migrate necessary DBs, mothball the others, rebuild machine, redeploy, populate new DBs. Again, this is about a week's work. Ideally this will all be done by mid-August.
Subsequent work will be to do the same thing to Lettuce and Cress, but we don't have a timeline in mind for them yet.
What HCMC needs to do
Right now:
* Check that mustard is ready to be rebuilt.
*** Martin has already dealt with the most problematic aspects of this, but we should probably make sure that incoming requests to places like the ACH/ALLC 2005 proceedings (like mustard.tapor.uvic.ca/cocoon/ach_abstracts/xq/xhtml.xq?id=176) get handled as redirects to Pear. I notice that if you search for "The Edition Production Technology (EPT) and the ARCHway" you come up with some hard-coded links to mustard - I'm guessing that isn't *too* uncommon.
* as an aside, RE will be moving virtual hosts on mustard to lettuce and viola (there are still two VHs on mustard for ise and tapor)
* RE will give us a dump of DB names on Chard - I'll track down owners and let them know what's going to happen (RE locks the DB, dumps it, restores it on Cress, we test)
* once mustard is ready, RE will provide dev ipnames for our existing virtual hosts, and we can begin migrating our PHP/etc. apps from lettuce to the new mustard.
Mid-term:
* discuss what we're going to do about stuff like Ruby. Plans are (I believe) afoot to rebuild the apps relying on Ruby, but I don't know if there is a timeline yet. RE and I discussed the idea of using Fennel to host a wee VM running the oddball stuff that we plan to drop off over time (we could call it detritus.hcmc.uvic.ca).
* we also need to address the home1t issue. It's still big and unwieldy in an emergency, and we need some of the space for Fennel's OS storage. We have some time, but we really need to trim our use of home1t and make it shrinkable.
Turns out I lied. I tried to image the Win partition with NRH just now and had nothing but failure. I found a post on the forum that said to try it from the CLI, like this:
1. Get the list of disks, note the one with Windows
diskutil list
2. Check to see how much space we could save if we resized it
/usr/local/sbin/ntfsresize --info -f /dev/disk2s3
3. Clone the Windows partition to a disk image
/usr/local/sbin/ntfsclone --save-image -o windows.img /dev/disk2s3
First run through was a test. Results:
Rebuild:sbin admin$ sudo /usr/local/sbin/ntfsresize --info -f /dev/disk0s3
ntfsresize v1.13.0 (libntfs 9:0:0)
Device name : /dev/disk0s3
NTFS volume version: 3.1
Cluster size : 4096 bytes
Current volume size: 31868277248 bytes (31869 MB)
Current device size: 31868280832 bytes (31869 MB)
Checking filesystem consistency ...
100.00 percent completed
Accounting clusters ...
Space in use : 14566 MB (45.7%)
Collecting resizing constraints ...
You might resize at 14565183488 bytes or 14566 MB (freeing 17303 MB).
Please make a test run using both the -n and -s options before real resizing!
So I did a test run:
Rebuild:sbin admin$ sudo /usr/local/sbin/ntfsresize -s 14565183488 -n /dev/disk0s3
ntfsresize v1.13.0 (libntfs 9:0:0)
Device name : /dev/disk0s3
NTFS volume version: 3.1
Cluster size : 4096 bytes
Current volume size: 31868277248 bytes (31869 MB)
Current device size: 31868280832 bytes (31869 MB)
New volume size : 14565179904 bytes (14566 MB)
Checking filesystem consistency ...
100.00 percent completed
Accounting clusters ...
Space in use : 14566 MB (45.7%)
Collecting resizing constraints ...
Needed relocations : 902174 (3696 MB)
Schedule chkdsk for NTFS consistency check at Windows boot time ...
Resetting $LogFile ... (this might take a while)
Relocating needed data ...
ERROR: Extended record needed (5232 > 1024), not yet supported!
Please try to free less space.
I need to come back to this. In the meantime it's back to Winclone.
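Since the error says to free less space, one way to pick a less aggressive target is to add some headroom to the minimum size ntfsresize reported and round up to a whole number of clusters. A minimal sketch of that arithmetic follows. The helper name `ntfs_target_size` and the 10% margin are assumptions for illustration, not something ntfsresize recommends, and there is no guarantee the larger target avoids the extended-record error.

```shell
# Hypothetical helper: pad the minimum size and round up to a cluster boundary.
# Usage: ntfs_target_size MIN_BYTES CLUSTER_BYTES MARGIN_PERCENT
ntfs_target_size() {
    min_bytes=$1
    cluster_bytes=$2
    margin_pct=$3
    # Add the margin using integer arithmetic (result is floored).
    padded=$(( min_bytes * (100 + margin_pct) / 100 ))
    # Round up to the next whole cluster.
    clusters=$(( (padded + cluster_bytes - 1) / cluster_bytes ))
    echo $(( clusters * cluster_bytes ))
}

# Minimum size and cluster size come from the --info output above;
# the 10% margin is a guess.
target=$(ntfs_target_size 14565183488 4096 10)
echo "$target"
```

The result could then go into another dry run, e.g. `sudo /usr/local/sbin/ntfsresize -n -s "$target" /dev/disk0s3`, before attempting a real resize.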
Per Martin's recommendation I suggest we purchase 2 Yamaha S659 DVD players for the labs. They will cost $199 each at A&B Sound. My justification for the recommendation is that DVD playback in the labs is too flaky, convoluted and unreliable. I have easily spent $400 worth of labour on trying to resolve this issue, to no avail.
UPDATE: these have been ordered (early September 2007)
The roles of Squash and Crossroads-B have changed. They are no longer running the Crossroads daemons for the labs, and the access permissions on a per-directory basis are not as robust as we'd like. The current OS (Server 2000) does not provide a granular enough method of restricting access to test materials, which means that we need to set permissions on test directories manually each time (and then turn them off later).
By installing Windows 2003 Server we can set up a permissions cascade that allows direct access to a specific file, but does not allow read access to any of the parent directories. This should mean that we can set read perms on test directories for instructors only. They can then find the test file and send it, and the students can receive it, but not browse for it. The only wrinkle in this cunning plan is that we'll need a complete list of NetLinkIDs for all instructors. There may be a solution in LDAP, however, if I can set permissions on test dirs based on the user's role, which can be obtained by querying UVic's LDAP DB.
So the task is:
Rebuild Crossroads-B with Windows 2003 Server.
Rename it.
Put in a second drive for data.
RoboCopy all the teachingmaterials from Squash to the data drive.
Write a script to mirror the squash\teachingmaterials directory regularly.
Make it a file server.
Enter it into the domain.
Put it in the server room, add it to the IP database, and give it a static IP.
Test accessibility via Sanako apps in Labs.
Research/Install/Configure LDAP connection to accommodate cunning plan above.
Test cunning plan.
Reading back through our blog postings, we discovered the previous occurrence of the same issue back in December. The problem then was caused by deployment of XEP, and apparently it was triggered by confusion of two different versions of saxon.jar (or possibly two copies of the same version in different places). Worked with Greg on a plan to deal with this:
Meanwhile, work on EMLS and ScanCan2 is shelved; EMLS needs large file support, and ScanCan2 needs XEP, so neither will work until this is fixed.
Task was:
Set up the project folders in the new user folders, upload data into eXist where necessary (ScanCan), test all the functionality, and report back/work with Greg to fix anything that's not working.
This is now complete. The only current bug is the character encoding problem which shows up in ScanCan, which will also be deadly for Moses. That must be fixed in the new year, as a priority.
This blog is the location for all work involving software and hardware maintenance, updates, installs, etc., both routine and urgent, in the server room, the labs and the R&D rooms.