Sunday, February 25, 2007

Avoiding application suicide by session pool

A recent post on Oracle-L mentioned:
    We had an issue affecting one of our production DBs. The middle tier application for some reason, went crazy spawning processes chewing up the DBs process parameter. DB started throwing errors indicating max process exceeded.


He went on to ask how to wake up PMON to clean up these processes - he'd had to bounce the database to get things cleaned up.

My response was to ask about the root cause, rather than the symptom. The poster may have been solving the wrong problem (or rather, after putting out the fire, he needed to find the cause and stop it happening again).

"Mid tier went crazy spawning processes" is often a symptom of session pool
madness. In such an application, X number of users share a smaller Y number of Oracle sessions. Everything tootles along happily; Users (midtier threads if you like) loop around:

  • get a session from the pool

  • issue one or two SQL

  • commit/rollback

  • give the session back


As long as the users spend less time in Oracle than they do in the rest of the
application (and waiting for user input etc), no problem.

Then something goes wrong; maybe a session sits on a lock that everyone needs; maybe a sequence cache isn't big enough (or is ordeed) and/or you forgot that We Don't Use RAC; maybe you had an SGA problem like ORA-4031.

What happens next:

  • all the Oracle sessions in the pool are busy

  • next midtier thread asks for an Oracle session

  • midtier pool manager says "no problem", launches a new Oracle session and adds it to the pool

  • that session becomes busy

  • and the next thread, and the next thread, and the next thread...


Soon instead of sharing say 100 Oracle sessions across 1000 processing threads,
your mid tier has responded to the blockage by adding 900 new sessions to the
load. That's probably made the problem worse, not better - kind of like slamming your foot on the accelerator when you see brakelights ahead in the fog.

I had exactly this problem performance last year, testing a J2EE app, using OC4J. We hit a 4031 problem (no bind variables in one part of the system) and then fairly immediately the application server did its lemming impersonation as described above.

Things to consider:
1) reduce the upper limit on the session pool size (definitely to below your Oracle processes level!)
2) if possible, slow down the rate of session starts (eg set a delay in the mid-tier session manager)
3) find out what caused the problem in the first case.

The good news is that if you dampen down this suicidal behaviour, you probably have a better chance of diagnosing the root cause next time.

No comments: