- High Priority Machine Check (HPMC): This is normally the result of a piece of hardware raising a Group 1 interrupt, the highest-priority interrupt the system can generate. Such an interrupt signifies that the most serious kind of event has just occurred. The interrupt is handled by a processor and passed to the operating system for further processing; when the operating system receives an HPMC, the only thing it can do is crash the system, which produces a system crash dump. A double-bit memory error, for example, causes an HPMC, as do many other hardware-related events. There is a small chance that an HPMC is caused by a software error, but the vast majority of HPMCs are caused by hardware problems. There is also a Low Priority Machine Check (LPMC), which does not necessarily crash the system: an LPMC may be related to a recoverable hardware error, e.g., a single-bit memory error.
- Transfer of Control (TOC): If a system hangs (it does not answer a ping and gives no response on the system console), you may decide to initiate a TOC from the system console: press Ctrl-B on the console to reach the GSP, open the Command Menu and issue the TC command. If you are using Serviceguard, the cmcld daemon may also TOC the system during a cluster reformation. All of these situations are normally associated with some form of software problem (the Serviceguard case may be triggered by a hardware problem in the network, but it is still software that initiates the TOC).
- PANIC: A PANIC occurs when the kernel detects a situation that makes no logical sense, e.g., corrupted kernel data structures, or logical corruption in a software subsystem such as a filesystem trying to free a fragment that is already free (the "freeing free frag" panic). In such situations the kernel decides that the safest thing to do is to crash the system. A PANIC is normally associated with a software problem, although an underlying hardware problem is possible (the filesystem corruption mentioned above may have been caused by a faulty disk).
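To make the categories above concrete, here is a small, hypothetical shell helper that maps a crash message (for example a line copied from /etc/shutdownlog or the console) to HPMC, LPMC, TOC or PANIC. The function names and keyword patterns are assumptions for illustration; actual HP-UX message wording varies, so treat the result as a first hint rather than a diagnosis.

```shell
#!/bin/sh
# Hypothetical helper: classify a crash message into one of the crash types.
# The keyword patterns are assumptions; real HP-UX wording may differ.

# Succeeds if phrase $2 appears in text $1 as a whole word, ignoring case,
# so "TOC" matches "TOC initiated" but not "protocol".
has_word() {
    printf '%s\n' "$1" | grep -Eiq "(^|[^[:alnum:]])$2([^[:alnum:]]|$)"
}

# Checks the most serious type first, because one message may mention several.
classify_crash() {
    if has_word "$1" 'HPMC' || has_word "$1" 'high priority machine check'; then
        echo HPMC
    elif has_word "$1" 'LPMC' || has_word "$1" 'low priority machine check'; then
        echo LPMC
    elif has_word "$1" 'TOC' || has_word "$1" 'transfer of control'; then
        echo TOC
    elif has_word "$1" 'panic'; then
        echo PANIC
    else
        echo UNKNOWN
    fi
}
```

For example, classify_crash "reboot after panic: freeing free frag" prints PANIC. Whole-word matching is used so that ordinary words containing "toc" are not reported as a TOC.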
System crash types
Labels: Crash
CRASH
Nobody is immune to an unexpected crash. Here are some tips to help in this critical situation.
Some important logs:
System log since the reboot that followed the crash:
/var/adm/syslog/syslog.log
System log from before the crash (previous boot):
/var/adm/syslog/OLDsyslog.log
Event log (is some hardware failing?):
/var/opt/resmon/log/event.log
You can also dump the MP (Management Processor) logs to check other hardware logs.
Look for "panic" entries in the file below; it records information about each shutdown (who initiated it, and when):
/etc/shutdownlog
If the /var/tombstones/ directory exists, the crash was most likely an HPMC, which is normally the result of a piece of hardware raising a Group 1 interrupt.
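The log checks above can be scripted. The sketch below greps each file for crash-related keywords and prefixes every hit with the file name and line number. The CRASH_LOGS and CRASH_PATTERN variables, the scan_crash_logs function and the keyword list are assumptions; the default file list simply mirrors the paths above, and paths containing spaces are not supported.

```shell
#!/bin/sh
# Sketch: grep crash-related logs for keywords, printing file:line:text.
# CRASH_LOGS and CRASH_PATTERN are assumptions; adjust them to your system.
CRASH_LOGS=${CRASH_LOGS:-"/var/adm/syslog/syslog.log /var/adm/syslog/OLDsyslog.log /var/opt/resmon/log/event.log /etc/shutdownlog"}

# Whole-word panic/HPMC/LPMC/TOC, or the phrase "machine check".
CRASH_PATTERN='(^|[^[:alnum:]])(panic|hpmc|lpmc|toc)([^[:alnum:]]|$)|machine check'

# Returns 0 if any log matched, 1 otherwise.
scan_crash_logs() {
    found=1
    for log in $CRASH_LOGS; do
        if [ ! -r "$log" ]; then
            echo "skip: $log (missing or unreadable)" >&2
            continue
        fi
        # -n adds line numbers, -i ignores case, -E enables extended regexps.
        matches=$(grep -niE "$CRASH_PATTERN" "$log")
        if [ -n "$matches" ]; then
            printf '%s\n' "$matches" | while IFS= read -r line; do
                printf '%s:%s\n' "$log" "$line"
            done
            found=0
        fi
    done
    return $found
}
```

Run scan_crash_logs as root after the reboot; an empty result does not prove there was no crash, it only means none of the assumed keywords appeared.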
Check the installed software:
# swlist -l product
# swlist -l bundle
Default crash dump location:
/var/adm/crash
If the crash dump wasn't saved automatically, you can try running the savecrash command manually.
Where's the crash?
If you can't find the crash dump in the default location, you can confirm the path in the file below:
/etc/rc.config.d/savecrash
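A minimal sketch for reading the dump directory from that file, assuming it sets the path through a SAVECRASH_DIR= line (the variable name is an assumption here; check your own file). The hypothetical crash_dir function takes the last uncommented assignment, strips quotes, and falls back to /var/adm/crash.

```shell
#!/bin/sh
# Sketch: print the directory where savecrash stores dumps.
# Assumes a SAVECRASH_DIR=... line; falls back to the default /var/adm/crash.
crash_dir() {
    conf=${1:-/etc/rc.config.d/savecrash}
    dir=""
    if [ -r "$conf" ]; then
        # Keep only uncommented SAVECRASH_DIR= lines, take the last one,
        # and remove any surrounding double or single quotes.
        dir=$(sed -n 's/^[[:space:]]*SAVECRASH_DIR=//p' "$conf" | tail -n 1 | tr -d "\"'")
    fi
    echo "${dir:-/var/adm/crash}"
}
```

Usage: ls -l "$(crash_dir)" lists the saved dumps. Parsing the file instead of sourcing it avoids executing anything else it contains.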
CRASHINFO - Crash analysis
It can be downloaded free of charge from the HP software site; always use the latest version.
Crashinfo
After downloading it, copy it to the server.
You need to change the permissions to make crashinfo.bin executable (777 is not necessary; execute permission for the owner is enough):
# chmod u+x crashinfo.bin
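Getting the reports for analysis:
[Disk space] It's recommended that the crash area have as much space as the system's physical memory; the system will log warnings to syslog when free space in /var drops below 500 MB.
The reports generated by the commands below cover many important details, such as the cause of the crash and the amount of free memory at the time of the crash: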
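The disk space recommendation can be checked before a crash happens. This sketch compares the free space under a directory with a required amount in KB; the check_crash_space and free_kb names are hypothetical, and it assumes a df that supports the POSIX -P -k output format (on HP-UX, bdf reports the same information). Choosing the threshold, e.g. the physical memory size or the 500 MB warning level, is left to the caller.

```shell
#!/bin/sh
# Sketch: compare free space under a directory with a required amount (KB).
# Relies on POSIX "df -P -k" output: one line per file system, with the
# available 1K blocks in the 4th field.

# Print the available space, in KB, of the file system holding $1.
free_kb() {
    df -P -k "$1" | awk 'NR == 2 { print $4 }'
}

# Usage: check_crash_space DIR REQUIRED_KB
# Returns 0 if there is enough space, 1 if not, 2 if df gave no answer.
check_crash_space() {
    dir=$1
    need_kb=$2
    avail=$(free_kb "$dir")
    if [ -z "$avail" ]; then
        echo "error: cannot read free space for $dir" >&2
        return 2
    fi
    if [ "$avail" -lt "$need_kb" ]; then
        echo "WARNING: $dir has $avail KB free, $need_kb KB recommended"
        return 1
    fi
    echo "OK: $dir has $avail KB free"
}
```

For example, check_crash_space /var/adm/crash 512000 checks for the 500 MB level mentioned above.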
# ./crashinfo.bin -c > crash_c.out
# ./crashinfo.bin -v > crash_v.out
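The two commands above can be wrapped so that a missing execute bit or a failed run is reported instead of silently leaving an empty file. In this sketch, CRASHINFO and OUTDIR are hypothetical override variables and run_crashinfo is an illustrative name; only the -c and -v options and the output file names come from the text. Run it from the same place you would run the commands by hand.

```shell
#!/bin/sh
# Sketch: produce the crash_c.out and crash_v.out reports and check them.
# CRASHINFO (default ./crashinfo.bin) and OUTDIR (default .) are hypothetical.
run_crashinfo() {
    bin=${CRASHINFO:-./crashinfo.bin}
    outdir=${OUTDIR:-.}
    if [ ! -x "$bin" ]; then
        echo "error: $bin is not executable (try: chmod u+x $bin)" >&2
        return 1
    fi
    for opt in c v; do
        report="$outdir/crash_$opt.out"
        if ! "$bin" "-$opt" > "$report"; then
            echo "error: $bin -$opt failed, see $report" >&2
            return 1
        fi
        # Flag an empty report so it is not mistaken for a clean result.
        if [ ! -s "$report" ]; then
            echo "warning: $report is empty" >&2
        fi
        echo "wrote $report"
    done
}
```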
---------------------------------------
pt/br
Nobody is immune to an unexpected crash. In this post I give some tips for tracking down the cause of a crash; there are other methods, but these are the ones I consider important.
Important logs when a crash happens:
System log since the boot that followed the crash:
/var/adm/syslog/syslog.log
System log from before that boot:
/var/adm/syslog/OLDsyslog.log
Event log - is some hardware failing?
/var/opt/resmon/log/event.log
It's also worth dumping the MP logs, so that any hardware problem can be isolated.
If the /var/tombstones/ directory exists, this is usually the result of a hardware failure, an HPMC.
Important for confirming the crash:
/etc/shutdownlog
Check the installed packages:
# swlist -l product
# swlist -l bundle
Default location of the crash files:
/var/adm/crash
Where's the crash?
The location of the crash files can be set in this file:
/etc/rc.config.d/savecrash
Analyzing the crash: crashinfo can be downloaded free of charge from HP's software site; always get the latest version.
Using crashinfo
After downloading it:
Send it to the server.
Change the permissions so that you can execute it; 777 is not necessary.
# chmod u+x crashinfo.bin
Getting the reports for analysis:
It's very important to keep at least 1 GB available in the area reserved for crash dumps; when this area drops to 500 MB, you'll receive messages in the machine's syslog warning about low space.
The reports generated in the next steps let you analyze the cause of the crash, the amount of free memory at the time of the crash, and other useful data.
# ./crashinfo.bin -c > crash_c.out
# ./crashinfo.bin -v > crash_v.out