portalpostmortems

Information about portalpostmortems

Published on January 3, 2008

Author: Samantha

Source: authorstream.com

Content

Portal Postmortem Using Java Thread Dumps to Diagnose Application Failures:  Portal Postmortem Using Java Thread Dumps to Diagnose Application Failures Joseph Dane University of Hawaii © 2005 University of Hawaii Overview:  Overview Motivation Concepts Examples Resources Why? When? Who?:  Why? When? Who? Why: because you’ve got a flaky application, and you don’t know the cause This is your only choice (except, perhaps …) When: now. Or possibly later. You do need a running JVM If you’ve got a dead JVM, you might have a core Who: java developers requires some familiarity with java threading, locks, etc. in particular but, no guru level knowledge required General Principles:  General Principles The goal is not to solve problems, but to add to the pool of knowledge about the problem Some sysadmin skills required/useful Finding PIDs, sending signals, munging text For some problems (e.g. G.C.), this will all be a complete waste of time For other (fairly rare) problems, this will immediately point out the problem, e.g. deadlocks “Postmortem”, in that we can (must, in fact) wait until a problem presents What can we learn?:  What can we learn? Resource contention network (database, RSS feed, etc.) java locks CPU Not so good for memory leaks Terminology:  Terminology “thread” -- a single execution path through a program “method” -- a java function call “stack” -- the context of the currently called methods/functions “frame” -- a single method call, or a single line in a stack “up/down” -- “up” is toward the TOS, “down” is toward the start of the program “TOS” -- “top of stack”, or the currently executing method “monitor” -- a java lock only one thread can hold a given lock at a time The Stack:  The Stack public void A() { Z(); B(); } public void B() { C(); } public void C() { … } public void Z() {} C B A In words: A calls Z, then B, which calls C. C is at the TOS, and is the currently executing method. Z was called previously by A, but is not currently executing, so is no longer on the stack. The Thread Dump:  The Thread Dump A snapshot of all activity in the program at a given instant All JVMs can produce them, in different ways and formats A text file, in various formats All include a stack trace per live thread. Some include explicit lock information. Generating a thread dump UNIX: send the QUIT signal to the JVM Linux: send to the root of the process tree Mac OS: apparently some flakiness in 1.4 JVM signals TIP: use -XX:+JavaMonitorsInStackTrace on Sun 1.3.1 Windows: CTRL-Break Need to arrange for the service to interact with the desktop. Redirect output to a file. jstack (tiger) Not available for windows? Can also show native frames. Generating the dump is usually safe. A Stack Trace:  A Stack Trace "tcpConnection-8443-0" daemon prio=5 tid=0x005469d0 nid=0x19b2800 in Object.wait() [f1112000..f1112ac0] at java.lang.Object.wait(Native Method) - waiting on <0x650c01f0> (a java.lang.Object) at com.caucho.server.TcpServer.accept(TcpServer.java:648) - locked <0x650c01f0> (a java.lang.Object) at com.caucho.server.TcpConnection.accept(TcpConnection.java:211) at com.caucho.server.TcpConnection.run(TcpConnection.java:132) at java.lang.Thread.run(Thread.java:552) The Thread Dump, Pictorially:  The Thread Dump, Pictorially Time T1 T2 T3 T4 T5 T6 Deadlock example:  Deadlock example Found one Java-level deadlock: ============================= "T2": waiting to lock monitor 0x0180ed00 (object 0x2a563978, a java.lang.String), which is held by "T1" "T1": waiting to lock monitor 0x0180ecdc (object 0x2a5639a8, a java.lang.String), which is held by "T2" Java stack information for the threads listed above: =================================================== "T2": at hawaii.jasig.DL$Locker.run(DL.java:47) - waiting to lock <0x2a563978> (a java.lang.String) - locked <0x2a5639a8> (a java.lang.String) at java.lang.Thread.run(Thread.java:613) "T1": at hawaii.jasig.DL$Locker.run(DL.java:47) - waiting to lock <0x2a5639a8> (a java.lang.String) - locked <0x2a563978> (a java.lang.String) at java.lang.Thread.run(Thread.java:613) The tool:  The tool Far too many threads for human consumption. Written in a good text processing language. Main concept: group related threads together. Simplest definition of “related” lexical identity of stack frame. Keep it Simple Work on the tool ahead of time, verify that it works as expected. Basic process: learn, hypothesize (I.e. guess), test, repeat. Sample tool output:  Sample tool output 1: at com.pipeline.email.engine.ExternalSessionIMAP4.close(ExternalSessionIMAP4.java:1920) 1: at java.lang.Object.clone(Native Method) 1: at org.apache.xml.utils.TreeWalker.startNode(TreeWalker.java:274) 2: at com.pipeline.email.engine.ExternalSessionIMAP4.close(ExternalSessionIMAP4.java:1914) 2: at com.iplanet.server.http.servlet.NSServletSession.writePrivate(Native Method) 3: at org.jasig.portal.UserInstance.renderState(UserInstance.java:282) 5: at com.pipeline.email.engine.ExternalSessionIMAP4.getPersonalNamespace(ExternalSessionIMAP4.java:243) 6: at java.net.PlainSocketImpl.socketAccept(Native Method) 14: at com.pipeline.email.engine.ExternalSessionIMAP4.getJMCacheSubfolders(ExternalSessionIMAP4.java:599) 21: at java.lang.Thread.sleep(Native Method) 47: at java.net.SocketInputStream.socketRead(Native Method) 262: at java.lang.Object.wait(Native Method) total of 365 threads Step 0 -- Preparation:  Step 0 -- Preparation Monitoring Manual, e.g. “top” “ps”, etc. Automatic, e.g. MRTG Learn about normal behavior. Number of threads? Distribution? Response plan Notifications. Who gets called? Gather wide expertise. What might be changed? A new switch or firewall? Has registration started? Analysis plan Access to the source is important, probably essential Organize. New directory for each incident, each dump in a separate file. Keep notes as you work. Step 1 -- Something’s Wrong!:  Step 1 -- Something’s Wrong! Generate the dump. Maybe a couple of dumps, but one is probably OK for now. Maybe clean the dump file. depends on how smart your tool is Check the system’s vital signs. Running out of memory? CPU bound? Idle CPU often means a concurrency problem, or an unavailable resource. Step 2 -- What’s happening?:  Step 2 -- What’s happening? 10/27/2004. production portal becomes non-responsive First, try looking at the tops of the stacks. See anything odd? [EXAMPLE 1] (2004102701) 1: at com.pipeline.grouptools.lib.GTSession.initialize(GTSession.java:203) 2: at java.util.HashMap.put(HashMap.java:324) 6: at java.net.PlainSocketImpl.socketAccept(Native Method) 20: at java.lang.Thread.sleep(Native Method) 47: at java.net.SocketInputStream.socketRead(Native Method) 133: at java.lang.Object.wait(Native Method) 235: at netscape.ldap.util.ConnectionPool.getConnFromPool(ConnectionPool.java:208) total of 444 threads Slide18:  Often the tops are not that useful [EXAMPLE 2] (ex2) 3: at java.net.PlainSocketImpl.socketAccept(Native Method) 6: at java.lang.Thread.sleep(Native Method) 16: at java.net.SocketInputStream.socketRead0(Native Method) 33: at java.lang.Object.wait(Native Method) total of 58 threads Lots of things call Object.wait. Best to ignore this for now. 16 Threads in socketRead0. Try looking down the stack interesting traces ... 16 interesting traces 17 frames down ... 10: at edu.hawaii.portal.channels.jasig.ReallySlowChannel. setRuntimeData(ReallySlowChannel.java:47) total of 10 threads Slide19:  edu.hawaii.portal.channels.jasig.ReallySlowChannel.setRuntimeData Thread 1 Thread 2 Thread 3 Thread 4 Slide20:  Thread 1 Thread 2 Thread 3 Thread 4 edu.hawaii.portal.channels.jasig.ReallySlowChannel.setRuntimeData Reality is usually more complicated …:  Reality is usually more complicated … [EXAMPLE 3] (2004120101) 6: at java.net.PlainSocketImpl.socketAccept(Native Method) 9: at com.example.foo.Bar(Bar.java:559) 9: at com.iplanet.server.http.servlet.NSServletSession.ostream_flush(Native Method) 9: at org.jasig.portal.UserInstance.renderState(UserInstance.java:267) 15: at java.lang.Thread.sleep(Native Method) 36: at org.jasig.portal.UserInstanceManager.getUserInstance(UserInstanceMana ger.java:76) 72: at com.example.foo.Foo.setFoo(Foo.java:596) 99: at java.net.SocketInputStream.socketRead(Native Method) 139: at com.example.foo.Foo.getFoo(Foo.java:534) 142: at com.iplanet.server.http.session.IWSHttpSession.removeAllObjects(IWSHt tpSession.java:705) 232: at java.lang.Object.wait(Native Method) total of 788 threads When TOS isn’t enough:  When TOS isn’t enough Sometimes you need to identify threads which pass through certain methods . “pass through” meaning “currently executing” Call these methods (and threads) “interesting”. Some threads may be blocked in the interesting methods. Some may have made it through to other methods. “interesting” example:  “interesting” example public class Example { private Object aLock = new Object(); public void foo() { synchronized (aLock) { bar(); } } private void bar() { // calls other methods … } } Slide24:  Let’s call removeAllObjects “interesting” and see what we get … interesting traces ... 212 interesting traces interesting TOS ... 70: at com.example.Foo.setUserFoo(Foo.java:596) 142: at com.iplanet.server.http.session.IWSHttpSession.removeAllObjects(IWSHttpSession.java:705) 212 threads made it into or past removeAllObjects. of these, 142 are blocked inside that method (why? dunno) and 70 have made it past, and are now blocked in setUserFoo Time to check the source … Slide25:  _getUserFoo(): synchronized( aLock ) { … } getUserFoo(): synchronized( aLock ) { … } setUserFoo(): synchronized( aLock ) { … } Slide26:  So, let’s find all locations where aLock is obtained Call these methods “interesting”, find all threads passing through these methods. 1: at java.net.SocketInputStream.socketRead(Native Method) 9: at com.example.foo.Foo.getUserFoo(Foo.java:559) 72: at com.example.foo.Foo.setUserFoo(Foo.java:596) 139: at com.example.foo.Foo.getUserFrob(Foo.java:534) total of 221 threads Bingo! One thread in socketRead! Find that thread! Slide27:  nid=0xc9 suspended [0x8dc77000..0x8dc79668] at java.net.SocketInputStream.socketRead(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:85) at oracle.net.ns.Packet.receive(Unknown Source) at oracle.net.ns.NetInputStream.getNextPacket(Unknown Source) at oracle.net.ns.NetInputStream.read(Unknown Source) at oracle.net.ns.NetInputStream.read(Unknown Source) at oracle.net.ns.NetInputStream.read(Unknown Source) at oracle.jdbc.ttc7.MAREngine.unmarshalUB1(MAREngine.java:726) at oracle.jdbc.ttc7.MAREngine.unmarshalSB1(MAREngine.java:698) at oracle.jdbc.ttc7.Oclose.receive(Oclose.java:105) at oracle.jdbc.ttc7.TTC7Protocol.close(TTC7Protocol.java:533) at oracle.jdbc.driver.OracleStatement.close(OracleStatement.java:639) … TIP: Learn to recognize the unexceptional Resources:  Resources jstack, jconsole BEA analysis tool. uses XSLT to produce graphs, analysis http://dev.bea.com/products/wlplatform81/articles/thread_dumps.jsp sample code http://www2.hawaii.edu/~jdane/jasig Finally:  Finally If you haven’t considered this method (as a “real” method), you should. Success rate: 20/50/30 don’t neglect QC

Related presentations


Other presentations created by Samantha

Role of Technology
03. 01. 2008
0 views

Role of Technology

Anglo Saxon Poetry
02. 10. 2007
0 views

Anglo Saxon Poetry

wastewater
09. 11. 2007
0 views

wastewater

coins presentation
06. 12. 2007
0 views

coins presentation

Umbrella2007 Exploiting VLEs
10. 12. 2007
0 views

Umbrella2007 Exploiting VLEs

Duebel Turkey Covered Bonds 06
23. 11. 2007
0 views

Duebel Turkey Covered Bonds 06

class9b
23. 12. 2007
0 views

class9b

PWG Scale Up report deck
04. 01. 2008
0 views

PWG Scale Up report deck

dom casmurro
07. 01. 2008
0 views

dom casmurro

asthma
16. 11. 2007
0 views

asthma

Bank of America 09 17
04. 01. 2008
0 views

Bank of America 09 17

5 Organisational Appraisal
19. 02. 2008
0 views

5 Organisational Appraisal

Koch2003 06b
15. 11. 2007
0 views

Koch2003 06b

kotler02
24. 02. 2008
0 views

kotler02

JP Geant2004
04. 01. 2008
0 views

JP Geant2004

HealthAdvisoryforSch ools
11. 03. 2008
0 views

HealthAdvisoryforSch ools

CLOE VC Vancouver
12. 03. 2008
0 views

CLOE VC Vancouver

germany othereurope russell06
14. 03. 2008
0 views

germany othereurope russell06

carruthers training 2006b
18. 03. 2008
0 views

carruthers training 2006b

2010TourismOrganisin gPlan
21. 03. 2008
0 views

2010TourismOrganisin gPlan

faculty dor workshop
03. 10. 2007
0 views

faculty dor workshop

ees350 atmos consitutents
07. 04. 2008
0 views

ees350 atmos consitutents

nov14 tax rajeeva
30. 03. 2008
0 views

nov14 tax rajeeva

HO2e Macro ch09 2008
13. 04. 2008
0 views

HO2e Macro ch09 2008

cse331 week1
01. 10. 2007
0 views

cse331 week1

prepersianwar
07. 12. 2007
0 views

prepersianwar

18OctPlenary NEPAD PRESENTATION
07. 01. 2008
0 views

18OctPlenary NEPAD PRESENTATION

Vamos a Comer
15. 11. 2007
0 views

Vamos a Comer

Lahav Aarhus nu sep07
29. 12. 2007
0 views

Lahav Aarhus nu sep07

Chapter04 PartIII
19. 11. 2007
0 views

Chapter04 PartIII

APMA
07. 11. 2007
0 views

APMA

CIDS II
29. 12. 2007
0 views

CIDS II

efmc2002 becht takeovers
16. 11. 2007
0 views

efmc2002 becht takeovers

Santos D
20. 11. 2007
0 views

Santos D

M Zaluska
21. 11. 2007
0 views

M Zaluska

esri finished
07. 11. 2007
0 views

esri finished