..::Eduardo Fraga's HP-UX blog: Crash::..

Friday, January 6, 2012

There are essentially three types of system crashes:

High Priority Machine Check (HPMC): This is normally the result of a piece of hardware causing a Group 1 interrupt, an HPMC. A Group 1 interrupt is the highest priority interrupt the system can generate. Such an interrupt signifies THE MOST serious event has just occurred. The interrupt will be handled by a processor and passed to the operating system for it to process further. When the operating system receives an HPMC, the only thing it can do is to cause the system to crash. This will produce a system crashdump. As an example, a double-bit memory error will cause an HPMC. Many other hardware-related events will cause an HPMC. There is a small chance that an HPMC could be caused by a software error, but the vast majority of HPMCs are caused by hardware problems.
There is also a Low Priority Machine Check (LPMC). An LPMC does not necessarily cause the system to crash. An LPMC may be related to a hardware error that is recoverable, e.g., a single-bit memory error.
Transfer of Control (TOC): If a system hangs, i.e., it does not respond to a ping or on the system console and appears frozen, you may decide to initiate a TOC from the system console by using the TC command from the Command Menu (reached by pressing ctrl-b on the console or via the GSP). If you are using Serviceguard, the cmcld daemon may cause the system to TOC in the event of a cluster reformation. All of these situations are normally associated with some form of software problem (the Serviceguard issue may be related to a hardware problem in the networking, but it was software that initiated the TOC).
PANIC: A PANIC occurs when the kernel detects a situation that makes no logical sense, e.g., kernel data structures becoming corrupted or logical corruption in a software subsystem such as a filesystem trying to delete a file twice (freeing free frag). In such situations, the kernel decides that the safest thing to do is to cause the system to crash. A PANIC is normally associated with a software problem, although it could be an underlying hardware problem (the filesystem problem mentioned above may have been caused by a faulty disk).

In summary, an HPMC is probably a hardware problem, and a TOC or PANIC is probably some form of software problem.

Sunday, August 21, 2011

Nobody is safe from an unexpected crash, so here are some tips to help in this critical situation.

Some important logs:

System log after crash:

/var/adm/syslog/syslog.log

System log before crash:

/var/adm/syslog/OLDsyslog.log

Event log (is there a hardware problem?):

/var/opt/resmon/log/event.log

You can also dump the MP logs to check the other hardware logs.

Look for "panic" entries in the file below; it holds information about shutdowns (who? when?):

/etc/shutdownlog

If the /var/tombstones/ directory exists, it is normally the result of an HPMC, i.e., a piece of hardware causing a Group 1 interrupt.
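As a rough way to walk through the files listed above, here is a minimal sketch in POSIX sh. The helper names `report_paths` and `count_panics` are hypothetical (they are not HP-UX commands); they only wrap `[ -e ]` and `grep`.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.

# report_paths PATH...: print one line per path saying whether it exists.
report_paths() {
    for p in "$@"; do
        if [ -e "$p" ]; then
            printf 'found    %s\n' "$p"
        else
            printf 'missing  %s\n' "$p"
        fi
    done
}

# count_panics FILE: print how many lines of FILE mention "panic" (any case).
# Prints 0 when the file is missing or unreadable.
count_panics() {
    if [ -r "$1" ]; then
        grep -ci 'panic' "$1" || true
    else
        echo 0
    fi
}

# Example calls, using the paths from this post:
#   report_paths /var/adm/syslog/syslog.log /var/adm/syslog/OLDsyslog.log \
#                /var/opt/resmon/log/event.log /etc/shutdownlog
#   count_panics /etc/shutdownlog
```

A non-zero count from `count_panics /etc/shutdownlog` only tells you that a panic was recorded; you still need to read the entries to see when it happened.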

Check the software:

# swlist -l product

# swlist -l bundle

Default crash dump location:

/var/adm/crash

If the crash dump was not saved automatically, you can try the "savecrash" command.

Where's the crash?

If you can't find the crash dump in the default location, you can confirm the path in the file below:

/etc/rc.config.d/savecrash
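The lookup can be sketched as a small POSIX sh function. It assumes the directory is set through a `SAVECRASH_DIR=` line in that file; check the variable name against your own copy, since this post does not show the file's contents. The function name `crash_dir` is hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes the config file uses a SAVECRASH_DIR=... assignment.

# crash_dir [CONFIG]: print the crash directory set in CONFIG,
# falling back to /var/adm/crash when no uncommented setting is found.
crash_dir() {
    conf=${1:-/etc/rc.config.d/savecrash}
    dir=""
    if [ -r "$conf" ]; then
        # Keep the value of the last uncommented assignment, without
        # surrounding quotes or a trailing comment.
        dir=$(sed -n 's/^[[:space:]]*SAVECRASH_DIR=\([^#[:space:]]*\).*/\1/p' "$conf" \
              | tail -n 1 | tr -d '"')
    fi
    echo "${dir:-/var/adm/crash}"
}

# Example call:
#   ls -l "$(crash_dir)"
```

The file is parsed rather than sourced so that running the check does not execute anything the file may contain.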

CRASHINFO - Crash analysis
It can be downloaded from the HP software site. It is free; always use the latest version.

Crashinfo

After downloading it, send it to the server.

The crashinfo.bin file needs execute permission. 777 is not necessary; execute permission for the owner is enough.

# chmod u+x crashinfo.bin

Getting the reports for analysis:

[Disk space] It is recommended that the crash area be as large as the system's physical memory. The system writes warnings to syslog when the space left in /var drops below 500 MB.
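A quick way to check the space before saving or analyzing a dump is sketched below in POSIX sh. It assumes `df -Pk` is available and prints the POSIX single-line format (on HP-UX, `bdf` is the more traditional command). The function names and the 512000 KB (500 MB) threshold are illustrative.

```shell
#!/bin/sh
# Sketch only: function names and the threshold are illustrative.

# free_kb DIR: print the free space, in KB, of the filesystem holding DIR.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# space_status FREE_KB MIN_KB: print "ok" when FREE_KB >= MIN_KB, else "low".
space_status() {
    if [ "$1" -ge "$2" ]; then
        echo ok
    else
        echo low
    fi
}

# Example call (500 MB = 512000 KB):
#   space_status "$(free_kb /var)" 512000
```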

You can check a lot of important things in the output of the commands below:

# ./crashinfo.bin -c > crash_c.out

# ./crashinfo.bin -v > crash_v.out

---------------------------------------
pt/br

Nobody is safe from an unexpected crash. In this post I give some tips for finding the origin of a crash; other methods exist, and I point out the ones I consider important.

Important logs when a crash happens:

System log after the boot:
/var/adm/syslog/syslog.log

System log before the boot:
/var/adm/syslog/OLDsyslog.log

Event log (is there a hardware problem?):
/var/opt/resmon/log/event.log

It is also worth dumping the MP logs to help isolate any problem.

If the /var/tombstones/ directory exists, it is normally the result of a hardware failure, an HPMC.

Important for confirming the crash:
/etc/shutdownlog

Check the installed packages:
# swlist -l product
# swlist -l bundle

Default location of the crash dump:
/var/adm/crash

Where's the crash?
The location of the crash dump files can be set in this file:
/etc/rc.config.d/savecrash

Analyzing the crash: crashinfo can be downloaded from the HP software site. It is free; always try to get the latest version.

Using crashinfo
After downloading it:
Send it to the server.
Change the permissions so that you can execute it; 777 is not necessary.
# chmod u+x crashinfo.bin
Getting the reports for analysis:
It is very important that the area set aside to receive the crash dump has at least 1 GB. When this area gets down to 500 MB, you will receive messages in the machine's syslog indicating low space.
From the reports obtained in the next steps you can analyze the origin of the crash, the amount of free memory at the time of the crash, and other useful data.
# ./crashinfo.bin -c > crash_c.out
# ./crashinfo.bin -v > crash_v.out
