[09] WHAT SHOULD I DO IF A SYSTEM CRASHES OR LOCKS UP?

Hopefully this will not happen to you at all, but if you experience
'lock ups' or 'freezes', please follow these steps to help prevent
the loss of your own data.

Also, it is important to note that you do not have a direct connection
to SDF and are most likely hopping through 10 or more networks to
get to SDF. You can use ping and traceroute to measure lag between
your computer and SDF. The lag you experience therefore depends on
your own network path as well as on SDF itself, and it is very
important for you to understand that.

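As a rough sketch of that measurement, the shell functions below run
ping and traceroute against SDF and pull the average round-trip time
out of ping's summary line. The host name sdf.org and the function
names are assumptions made for illustration, so substitute the host
you actually log in to. The parsing assumes the "min/avg/max" summary
line that BSD and Linux versions of ping print.

```shell
# Assumption: sdf.org stands in for whichever SDF host you connect to.
SDF_HOST="${SDF_HOST:-sdf.org}"

# Read ping output on stdin and print the average round-trip time (ms),
# taken from the summary line that contains "min/avg/max".
avg_rtt() {
    awk -F'= ' '/min\/avg\/max/ { split($2, t, "/"); print t[2] }'
}

# Send five probes, report the average RTT, then list the hops in between.
measure_lag() {
    ping -c 5 "$SDF_HOST" | avg_rtt
    traceroute "$SDF_HOST"
}
```

Run measure_lag from your own computer, not from SDF, to see how much
of the delay comes from the path between you and SDF.
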

Typically a lockup will occur when you are trying to access a
file that is resident on the fileserver. For instance, say you
are trying to cat a file and instead of seeing the contents you
get either nothing or a message similar to:

    ol1:/sys: not responding

Be patient: the fileserver will recover shortly and your task
will be completed. You will probably see:

    ol1:/sys: is alive again

which means your request will now actually begin to be processed.

During the hang time, you can use ^T (CTRL-T) to display the
status of your job, for instance:

    load: 2.04 cmd: tail 12966 [select] 0.00u 0.00s 0% 808k

[select] is the current state of process ID 12966, which
is the 'tail' program. If the system is waiting on actual
disk I/O, you'll probably see [biowait]. In cases of a hang
you may see either [nfsrcvlk] (Network File System Received Lock)
or [vnlock] (Virtual Node Lock). The system will usually recover
from these, but a prolonged stay in either state can be telling
of a serious resource problem on the NFS client.

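If you save or paste a ^T status line, the bracketed state can be
picked out mechanically. The sketch below is only an illustration:
the function names wchan_of and describe_wchan are hypothetical, and
the descriptions simply restate what this FAQ says about each state.

```shell
# Print the state (the token inside [...]) from a ^T status line.
wchan_of() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Map a state to the rough meaning this FAQ gives for it.
describe_wchan() {
    case "$1" in
        biowait)         echo "waiting on disk I/O" ;;
        nfsrcvlk|vnlock) echo "possible hang; usually recovers, wait it out" ;;
        *)               echo "not one of the hang states listed in this FAQ" ;;
    esac
}
```

For example, describe_wchan "$(wchan_of "$line")" with the status
line above stored in $line reports that [select] is not one of the
hang states.
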

In the event that the fileserver becomes unavailable, it is
important that you do not become impatient and interrupt, quit
or suspend your jobs (^C, ^\ or ^Z) but rather wait them out.
If you are patient, your chances of losing data will be
significantly reduced. The fileserver usually responds within
a few seconds and rarely takes longer. When the problem lies
with the NFS client instead (vnlock for more than, say, 20
seconds), that particular host will most likely need to be reset.

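The rule of thumb above can be written down as a small shell
function. This is a sketch only: the name advise is hypothetical,
and the 20-second threshold is the approximate figure from the
paragraph above rather than a hard limit.

```shell
# Suggest an action from the ^T state and the number of whole seconds
# the job has been stuck in that state.
advise() {
    if [ "$1" = "vnlock" ] && [ "$2" -gt 20 ]; then
        echo "likely an NFS client problem; the host may need a reset"
    else
        echo "keep waiting; do not interrupt, quit or suspend the job"
    fi
}
```

For instance, advise vnlock 45 points at the client host, while
advise nfsrcvlk 5 says to keep waiting.
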

Some background on this: SDF is pushing NetBSD to its limits, and
we are currently (2003-2004) doing quite a bit of investigation with
the developers of the uvm/vfs/vnode code to help NetBSD scale in
high-usage situations such as the loads we experience on SDF.
Solutions we find will be incorporated into the public code.