LinuxDevCenter.com


Linux System Failure Post-Mortem

ksymoops

When the kernel detects an unrecoverable or serious error, it prints a status report to the kernel log file. This report includes the contents of the registers, the contents of the kernel stack, and a trace of the functions that were executing when the fault occurred.



All of this is extremely useful, but it is in machine-readable form, and the addresses vary with the configuration of the individual machine. The raw kernel log alone is therefore rarely enough to determine precisely what went wrong. This is where ksymoops comes in.

ksymoops converts the machine-readable kernel oops report to human-readable text. It relies on a correct System.map file, which is generated as part of the kernel compilation. It also expects klogd to handle loadable modules correctly, if appropriate.
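To make the role of System.map concrete, here is a minimal Python sketch of the kind of address-to-symbol lookup involved. It is not how ksymoops itself is implemented. The map excerpt is illustrative: the sys_init_module address is derived from the sample trace shown later in this article (c0113f8c minus offset 49c), and the two neighboring symbol names are hypothetical placeholders.

```python
import bisect

# Illustrative System.map excerpt in "address type name" form.
# Only the sys_init_module address is derived from the sample oops;
# the hypothetical_* entries are made up for the example.
SAMPLE_MAP = """\
c0113a00 T hypothetical_prev_symbol
c0113af0 T sys_init_module
c0113fc0 T hypothetical_next_symbol
"""


def load_symbols(text):
    """Parse 'address type name' lines into a list sorted by address."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return 'name+offset' for the last symbol at or below address."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return "Before first symbol"
    start, name = symbols[index]
    return "%s+%x" % (name, address - start)


if __name__ == "__main__":
    table = load_symbols(SAMPLE_MAP)
    print(resolve(table, 0xC0113F8C))
```

If the map comes from a different kernel build, the same address lands in a different function, which is why a stale System.map produces misleading names.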

ksymoops requires the "oops text," usually available as Oops.file from the system logger. If this file can't be found, grab the oops text from dmesg, or from the console -- copied by hand, if necessary.

The output of ksymoops is a list of messages that may point to a kernel problem. Where possible, ksymoops converts each address to the name of the function that contains it.

>>EIP; c0113f8c <sys_init_module+49c/4d0>
Trace; c011d3f5 <sys_mremap+295/370>
Trace; c011af5f <do_generic_file_read+5bf/5f0>
Trace; c011afe9 <file_read_actor+59/60>
Trace; c011d2bc <sys_mremap+15c/370>
Trace; c010e80f <do_sigaltstack+ff/1a0>
Trace; c0107c39 <overflow+9/c>
Trace; c0107b30 <tracesys+1c/23>
Trace; 00001000 Before first symbol
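If you want to pull the function names out of output like this automatically, a small parser is enough. The following Python sketch assumes lines shaped like the sample above, reading `<name+offset/size>` as a hexadecimal offset into a function of the given size. The regular expression and the dictionary keys are choices made for this example, not part of ksymoops.

```python
import re

# Matches lines such as "Trace; c011d3f5 <sys_mremap+295/370>".
LINE_RE = re.compile(
    r"^(?P<kind>Trace|>>EIP);\s+(?P<addr>[0-9a-fA-F]+)\s+"
    r"<(?P<func>[^+>]+)\+(?P<offset>[0-9a-fA-F]+)/(?P<size>[0-9a-fA-F]+)>"
)

SAMPLE = """\
>>EIP; c0113f8c <sys_init_module+49c/4d0>
Trace; c011d3f5 <sys_mremap+295/370>
Trace; c011af5f <do_generic_file_read+5bf/5f0>
Trace; c011afe9 <file_read_actor+59/60>
Trace; c011d2bc <sys_mremap+15c/370>
Trace; c010e80f <do_sigaltstack+ff/1a0>
Trace; c0107c39 <overflow+9/c>
Trace; c0107b30 <tracesys+1c/23>
Trace; 00001000 Before first symbol
"""


def parse_ksymoops(text):
    """Return one dict per resolved line, in the order the lines appear."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # unresolved lines, e.g. "Before first symbol"
        entries.append({
            "kind": match.group("kind"),
            "address": int(match.group("addr"), 16),
            "function": match.group("func"),
            "offset": int(match.group("offset"), 16),
            "size": int(match.group("size"), 16),
        })
    return entries


if __name__ == "__main__":
    for entry in parse_ksymoops(SAMPLE):
        print(entry["kind"], entry["function"])
```

The first entry (the `>>EIP` line) names the function that was executing at the fault; the `Trace` entries that follow are the call chain leading to it.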

man ksymoops explains these lines in great detail, but what is important to most system administrators is the list of the function names in which problems occurred. Once you know the key function, and the functions which called it, you can make an educated guess as to the cause of the kernel error.

Be aware that the output of ksymoops is only as good as its input. If the System.map file is wrong, if the loadable modules don't report when they are loaded and unloaded, or if the vmlinux, ksyms, lsmod and object files differ from the ones present when the crash occurred, ksymoops will produce invalid output. Run it as soon as possible after the crash for the most accurate data, and certainly before you change the kernel!

gdb

If you're an experienced C programmer, you might want to debug the kernel itself. Use the ksymoops output to determine where in the kernel the problem is, then use gdb to disassemble the offending function and debug it.

gdb /usr/src/linux/vmlinux
(gdb) disassemble offending_function

Fixing kernel problems

You've figured out that the problem was something you can correct, perhaps a driver or a loadable module. What now?

Install any appropriate patches and check that the driver is correct, then recompile the kernel and add it as a new lilo entry. Test the new kernel. If that doesn't correct the problem, consider reporting it to the linux-kernel list or to the appropriate kernel developer.

Reporting a kernel problem

To report a bug, post the following information to linux-kernel@vger.kernel.org, or to the relevant kernel developer, with the subject "ISSUE: one line summary from [1.]".

  1. One-line summary of the problem:
  2. Full description of the problem/report:
  3. Keywords (i.e., modules, networking, kernel):
  4. Kernel version (from /proc/version):
  5. Output of Oops.. message with symbolic information resolved using ksymoops
  6. A small shell script or example program which triggers the problem (if possible)
  7. Environment
  8. Software (use the ver_linux script from $LINUXHOME/scripts/ver_linux)
  9. Processor information (from /proc/cpuinfo):
  10. Module information (from /proc/modules):
  11. SCSI information (from /proc/scsi/scsi):
  12. Relevant sections of the system log, if any:
  13. Kernel configuration file and symbol map:
  14. Description of hardware:
  15. Other information that might be relevant to the problem (please look in /proc and include all information that you think to be relevant):
  16. Other notes, patches, fixes, workarounds:
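Several items in this checklist come straight from files under /proc, so that part of the report can be gathered mechanically. The Python sketch below is one possible way to do it. The function names and the layout of the output are assumptions made for this example, and everything else in the checklist (description, keywords, ksymoops output, hardware) still has to be written by hand.

```python
from pathlib import Path

# Checklist items that map directly to files under /proc.
PROC_SECTIONS = [
    ("Kernel version (from /proc/version)", "/proc/version"),
    ("Processor information (from /proc/cpuinfo)", "/proc/cpuinfo"),
    ("Module information (from /proc/modules)", "/proc/modules"),
    ("SCSI information (from /proc/scsi/scsi)", "/proc/scsi/scsi"),
]


def read_or_note(path):
    """Return the file's text, or a placeholder if it cannot be read."""
    try:
        return Path(path).read_text(errors="replace").rstrip()
    except OSError as exc:
        return "(could not read {}: {})".format(path, type(exc).__name__)


def build_report(summary, sections=PROC_SECTIONS):
    """Assemble a plain-text report skeleton to be completed by hand."""
    lines = [
        "Subject: ISSUE: " + summary,
        "",
        "One-line summary of the problem:",
        summary,
    ]
    for title, path in sections:
        lines += ["", title + ":", read_or_note(path)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_report("oops in sys_init_module when loading a module"))
```

Files that do not exist on a given machine (for example /proc/scsi/scsi on a system without SCSI support) are noted in the report rather than causing the script to fail.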

Note that the linux-kernel FAQ states that oops data is useless if the machine that produced it has an over-clocked CPU or is running vmmon from VMware. If either applies, remove that factor and try to reproduce the oops before reporting it.

Common hardware failures

If you get repeated, apparently random errors in code, your CPU fan may have died. If you're familiar enough with your equipment, you may be able to hear whether the CPU fan is running -- if not, the simplest test is to open the case and look. If the CPU fan isn't running, shut the machine down and replace the fan -- you may have saved your CPU.

If the CPU fan is running, but you're still getting random errors, suspect the RAM. There are two common ways to test it. One is to swap hardware: remove the suspect stick and run the machine on the remaining sticks, or test the suspect stick in a known-working machine. The other is to recompile a kernel repeatedly. If the compiler dies with signal 11 (SIGSEGV), the RAM is probably bad.
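The repeated-compile test is easy to automate. The Python sketch below runs a build command several times and records the runs that look like they died from signal 11. The marker strings it searches for are assumptions about how the compiler and make report the failure, so adjust them to match what your toolchain actually prints.

```python
import signal
import subprocess

# Assumed phrases that indicate a segfault in build output (lowercase).
SEGV_MARKERS = ("signal 11", "segmentation fault")


def looks_like_sigsegv(returncode, output):
    """True if the command was killed by SIGSEGV or its output says so."""
    if returncode == -signal.SIGSEGV:
        return True  # subprocess reports death by signal N as -N
    lowered = output.lower()
    return any(marker in lowered for marker in SEGV_MARKERS)


def stress_build(command, runs=5):
    """Run command repeatedly; return the attempt numbers that segfaulted."""
    failures = []
    for attempt in range(1, runs + 1):
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        if looks_like_sigsegv(result.returncode, result.stdout):
            failures.append(attempt)
    return failures
```

The command you pass in should perform a full rebuild each time, otherwise later runs finish instantly and test nothing. Failures that appear at different points in different runs suggest hardware, whereas a failure at the same file every time is more likely a software problem.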

The final common cause of hardware failure is bad blocks on the hard drive. Use the program badblocks to test the drive.

Final words

With a little care, and a little luck, you'll get the up-time record in your local LUG -- unless a power outage downs your machine. But I can't help you with that!

Further reading


  • man ksymoops
  • man dmesg
  • man syslogd
  • man klogd
  • man insmod
  • linux-kernel FAQ
  • $LINUXDIR/linux/Documentation/oops-tracing.txt
  • $LINUXDIR/linux/README
  • man gdb
  • info gdb

Jennifer Vesperman is the author of Essential CVS. She writes for the O'Reilly Network, the Linux Documentation Project, and occasionally Linux.Com.




Are there other open-source tools you use to find out what caused a system crash? Please share your experiences with other diagnostic tools.

  • modules
    2002-12-25 21:55:48  anonymous2

    Most of the time, oopses occur because of kernel modules.
    However, the procedure described here is not applicable to that situation.
    If an oops occurs because of a kernel module, we can't use vmlinux or System.map to trace its cause.
    It would be much better if the article had covered this area.
    Arun
    • modules
      2003-01-28 15:43:44  Jennifer Vesperman | O'Reilly Author

      Thank you for that suggestion, I've written it into my ideas book.


      Jenn V.
  • Nice Article
    2002-12-25 12:53:54  arun4linux

    Very nice article for someone who faces oops messages for the first time.
    Very well presented with good flow and simplicity.

    However, I feel that a few more pointers would make this article complete even for experienced kernel hackers.
    e.g.,
    $LINUXDIR/linux/Documentation/nmi_watchdog.txt
    $LINUXDIR/linux/Documentation/serial-console.txt
    Remote debugging
    System freezes
    Using tools like minicom for remote logging, to capture as many of the oops/system messages as possible.
    • Nice Article
      2003-01-28 15:47:08  Jennifer Vesperman | O'Reilly Author

      Thank you.

      The advanced material is out of scope for that particular article, but I did err in not stating that, and in not pointing out resources for further research. My apologies.

      I have written this into my ideas book as well. 'Advanced Post-Mortems', anyone? :)

  • memtest86
    2001-11-08 13:00:30  refactored

    Nine times out of ten it's going to be a hardware glitch.

    If your memory is flaky, no other test is going to make sense anyway. So it's a good place to start testing.

    memtest86 is truly superb. It is GPL'ed, so get it now, before bad things happen, and install it in your lilo setup. Or keep a floppy with it on hand.

    http://www.teresaudio.com/memtest86/
    • memtest86
      2003-01-28 15:49:47  Jennifer Vesperman | O'Reilly Author

      Thank you, that's a good tip. And I concur - most failures I've had were hardware glitches.


      Jenn V.

