From: Sylvain Beucler
Subject: [Savannah-hackers-public] Re: [gnu.org #498996] Hard-disk failures on colonialone
Date: Thu, 19 Nov 2009 19:47:17 +0100
User-agent: Mutt/1.5.20 (2009-06-14)

On Thu, Nov 12, 2009 at 12:33:17PM +0100, Sylvain Beucler wrote:
> On Sat, Oct 31, 2009 at 11:13:51AM +0100, Sylvain Beucler wrote:
> > > On Thu, Oct 29, 2009 at 01:20:55PM -0400, Daniel Clark via RT wrote:
> > > > Ah I see, I was waiting for comments on this - should be able to go
> > > > out this weekend to do replacements / reshuffles / etc, but I need to
> > > > know if savannah-hackers has a strong opinion on how to proceed:
> > > > 
> > > > (1) Do we keep the 1TB disks?
> > > > > - Now that the cause of the failure is known to be a software failure,
> > > > > do we forget about this, or still pursue the plan to remove 1.0TB
> > > > > disks that are used nowhere else at the FSF?
> > > > 
> > > > That was mostly a "this makes no sense, but that's the only thing
> > > > that's different about that system" type of response; it is true they
> > > > are not used elsewhere, but if they are actually working fine I am
> > > > fine with doing whatever savannah-hackers wants to do.
> > > > 
> > > > (2) Do we keep the 2 eSATA drives connected?
> > > > > - If not, do you recommend moving everything (but '/') on the 1.5TB
> > > > > disks?
> > > > 
> > > > Again, if they are working fine it's your call; however, the bigger
> > > > issue is whether you want to keep the 2 eSATA / external drives
> > > > connected, since that is a legitimate extra point of failure, and
> > > > there are some cases where errors in the external enclosure can
> > > > bring a system down (although it's been up and running fine for
> > > > several months now).
> > > > 
> > > > (3) Do we make the switch to UUIDs now?
> > > > > - About UUIDs, everything in fstab is using mdX, which I'd rather
> > > > > not mess with.
> > > > 
> > > > IMHO it would be better to mess with this when the system is less
> > > > critical; not using UUIDs everywhere tends to screw you during
> > > > recovery from hardware failures.
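
For reference, a UUID-based fstab switch would roughly look like the sketch
below; the mount point, filesystem type and UUID are placeholders, not values
taken from colonialone:

  # find the filesystem UUID of an array
  blkid /dev/md3
  # then, in /etc/fstab, replace a device-based line such as
  #   /dev/md3  /srv  ext3  defaults  0  2
  # with its UUID-based equivalent
  #   UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /srv  ext3  defaults  0  2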
> > > > 
> > > > And BTW, totally off-topic, but eth1 on colonialone is now connected
> > > > via crossover ethernet cable to eth1 on savannah (and colonialone is
> > > > no longer on the fsf 10. management network, which I believe we
> > > > confirmed no one cared about)
> > > > 
> > > > (4) We need to change to some technique that will give us RAID1
> > > > redundancy even if one drive dies. I think the safest solution would
> > > > be to not use eSATA, and use 4 1.5TB drives all inside the computer
> > > > in a 1.5TB quad RAID1 array, so all 4 drives would need to fail to
> > > > bring savannah down. The other option would be 2 triple RAID1s using
> > > > eSATA, each with 2 disks inside the computer and the 3rd disk in the
> > > > external enclosure.
> > 
> > On Thu, Oct 29, 2009 at 07:29:50PM +0100, Sylvain Beucler wrote:
> > > Hi,
> > > 
> > > As far as the hardware is concerned, I think it is best that we do
> > > what the FSF sysadmins think is best.
> > > 
> > > We don't have access to the computer, don't really know anything about
> > > what it's made of, and don't understand the eSATA/internal
> > > differences. We're even using Xen, as you do, to ease this kind of
> > > interaction. In short, you're more often than not in a better position
> > > to judge the hardware issues.
> > > 
> > > 
> > > So:
> > > 
> > > If you think it's safer to use 4x1.5TB RAID-1, then let's do that.
> > > 
> > > Only, we need to discuss how to migrate the current data, since
> > > colonialone is already in production.
> > > 
> > > In particular, fixing the DNS issues I reported would help if
> > > temporary relocation is needed.
> > 
> > 
> > I see that there are currently 4x 1.5TB disks.
> > 
> > 
> > sda 1TB   inside
> > sdb 1TB   inside
> > sdc 1.5TB inside?
> > sdd 1.5TB inside?
> > sde 1.5TB external/eSATA?
> > sdf 1.5TB external/eSATA?
> > 
> > 
> > Here's what I started doing:
> > 
> > - recreate 4 partitions on sdc and sde (but 2 of them in an extended
> >   partition)
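
For reference, reproducing that partition layout on the remaining 1.5TB disks
could be done along these lines (a sketch; the sfdisk usage and device names
are assumptions, not the exact commands that were run):

  # dump the partition table of an already-partitioned 1.5TB disk
  sfdisk -d /dev/sdc > sdc.layout
  # replay it onto the other 1.5TB disks
  sfdisk /dev/sdd < sdc.layout
  sfdisk /dev/sde < sdc.layout
  sfdisk /dev/sdf < sdc.layout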
> > 
> > - added sdc and sdd to the current RAID-1 arrays
> > 
> >   mdadm /dev/md0 --add /dev/sdc1
> >   mdadm /dev/md0 --add /dev/sdd1
> >   mdadm /dev/md1 --add /dev/sdc2
> >   mdadm /dev/md1 --add /dev/sdd2
> >   mdadm /dev/md2 --add /dev/sdc5
> >   mdadm /dev/md2 --add /dev/sdd5
> >   mdadm /dev/md3 --add /dev/sdc6
> >   mdadm /dev/md3 --add /dev/sdd6
> >   mdadm /dev/md0 --grow -n 4
> >   mdadm /dev/md1 --grow -n 4
> >   mdadm /dev/md2 --grow -n 4
> >   mdadm /dev/md3 --grow -n 4
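
(Note: without the --grow calls the newly added partitions would only sit as
hot spares; raising the number of raid-devices (-n) to 4 turns them into
active mirror members, which is what triggers the resync visible in
/proc/mdstat below.)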
> > 
> > colonialone:~# cat /proc/mdstat 
> > Personalities : [raid1] 
> > md3 : active raid1 sdd6[4] sdc6[5] sdb4[1] sda4[0]
> >       955128384 blocks [4/2] [UU__]
> >       [>....................]  recovery =  0.0% (43520/955128384) finish=730.1min speed=21760K/sec
> >       
> > md2 : active raid1 sdc5[2] sdd5[3] sdb3[1] sda3[0]
> >       19534976 blocks [4/4] [UUUU]
> >       
> > md1 : active raid1 sdd2[2] sdc2[3] sda2[0] sdb2[1]
> >       2000000 blocks [4/4] [UUUU]
> >       
> > md0 : active raid1 sdd1[2] sdc1[3] sda1[0] sdb1[1]
> >       96256 blocks [4/4] [UUUU]
> > 
> > - install GRUB on sdc and sdd
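
A sketch of that step (assuming grub-install is available on colonialone and
that the array holding /boot now includes sdc1/sdd1, as set up above):

  grub-install /dev/sdc
  grub-install /dev/sdd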
> > 
> > 
> > With this setup, the data is both on the 1TB and the 1.5TB disks.
> > 
> > If you confirm that this is OK, we can:
> > 
> > * extend this to sde and sdf,
> > 
> > * unplug sda+sdb and plug all the 1.5TB disks internally
> > 
> > * reboot while you are at the colo, and ensure that there's no device
> >   renaming mess
> > 
> > * add the #7 partitions in sdc/d/e/f as a new RAID device / LVM
> >   Physical Volume and get the remaining 500GB
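
That last step could roughly look like this; md4, the #7 partition names and
the volume group name VG are assumptions for the sake of the sketch:

  mdadm --create /dev/md4 --level=1 --raid-devices=4 \
        /dev/sdc7 /dev/sdd7 /dev/sde7 /dev/sdf7
  pvcreate /dev/md4
  vgextend VG /dev/md4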
> > 
> > 
> > Can you let me know if this sounds reasonable?
> 
> up!

Seriously, can you tell us whether it's OK to move the RAID arrays from the
1TB to the 1.5TB disks, and plan a disk re-plug soon?

-- 
Sylvain




