LMS News: 25 June 2008
Hardware failure brings Aleph production down

On Friday morning around 8:43, we lost connection to about 40% of our disks. Unfortunately, these were the disks used by production Aleph. We quickly determined that the problem was in the hardware but could not immediately isolate where. We have dual paths between all the servers and the disks, and usually a failure of one component blocks only one of the two paths. This time both paths were down in one part of the disk subsystem.

After about 45 minutes of downtime with no resolution in sight, we activated the COOP Aleph in Tallahassee in read-only mode. This meant that book charges had to be done on the PC backup system or manually, invoices could not be paid, and catalogers could not add or edit records, because all of these transactions require updating Aleph.

Two FCLA sys admins immediately joined three IBM technicians at the computer center to work on the problem. IBM soon upgraded the severity from level 1 to level 2, and by noon to level 3. Level 3 is the group that designed and built the disk system. By mid-afternoon the team had re-established one path to the disks. However, they thought something in the non-operational data path might cause the working path to go down again, so we did not return production Aleph to the SULs. The on-site crew of five continued working until midnight and returned at 8:00 on Saturday morning. Level 3 personnel stayed on into the night going over diagnostic data.

Around 2:00 p.m. on Saturday, the team was able to restore the second data path, but the disk management software could not find four disks (out of the 80 disks that had gone down). The final four disks were recovered around 11:30 p.m. Saturday, and Aleph became operational.

FCLA librarians began testing on Sunday morning to see if they could find anything not working.
By 1:30 everything they had tried worked correctly, so we switched Aleph back to Gainesville from Tallahassee, put out a public email announcement, and called every open circ department to see if they needed help getting the backlogged PC charges into Aleph. Over the weekend, 20 FCLA staff were involved in the support process at some stage or other. Some of those staff were setting up plans to convert the Tallahassee Aleph from read-only mode to update mode on Monday morning if the Gainesville problem was not solved by Sunday night.

The problem turned out to be in IBM firmware on one of the disk boxes. We buy disks 16 at a time. The 16 disks come pre-installed in a box that fits into standard-sized racks at the computer center. Each box has circuitry at its "two front doors", where the two cables attach, giving the dual data paths into the box. Somewhere in the circuitry is some specialized software, installed at the factory, that controls the flow of data to and from the disks. That software is called firmware because it used to be burned into chips that could only be erased with ultraviolet light, thus making it "firm" on the chip.

The firmware went haywire when a system administrator was allocating disk space to the digital archive on an entirely different box of disks on a different set of data paths. Why allocating space on a disk box that is not on the two data paths that went down would trigger the failure is a mystery to us. We have moved the archive disks to another disk controller and installed firmware that should keep the problem from showing up again.
Last modified 19 August 2008 at 9:18 am by jeanp