Sunday, April 02, 2006

How many binds does it take to save time?

Here's a tale about how something that seems intuitively correct can turn out to be 100% wrong - even though it has been stated as gospel by someone who seems like he should know.


We're building an application that - for the first year or two of operation - will include a legacy system component. The legacy system was originally built on a kind of multiuser ISAM file system; its data now lives in a series of over 100 Oracle schemas (one per site), and the legacy app accesses it through a legacy-to-relational bridge.

The bridge generates single-table SQL to perform file-based actions like find-by-key, next record, previous record and so on. The tables are suitably indexed. However... the bridge (designed to support many different database platforms) issues SQL with no bind variables. So although there is a relatively small number of distinct access paths, every time one is used the statement carries a fresh set of literal values in its predicates, and a hard parse is required.
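To illustrate the difference (the table and column names below are made up for the example; they are not the bridge's actual output):

-- Literal SQL, as the bridge emits it: every new key value is a different
-- statement text, so each execution costs a hard parse.
SELECT payload FROM legacy_records WHERE site_id = 17 AND record_key = 1001;
SELECT payload FROM legacy_records WHERE site_id = 17 AND record_key = 1002;

-- The same access path with bind variables: one statement text, one shared
-- cursor, and soft parses after the first execution.
SELECT payload FROM legacy_records WHERE site_id = :site_id AND record_key = :record_key;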


The quick and easy solution is to have every bridge session start with:

ALTER SESSION SET CURSOR_SHARING='SIMILAR'
'Similar' statements share a parse tree, and the number of hard parses drops. At low volume everything works great. But as you crank up the work rate (we're using LoadRunner), the shared pool eventually gives up under the strain. Although there are relatively few 'families' of similar statements (ie statements which are exactly the same once all literals have been replaced by system-generated bind variables), a single statement family may contain hundreds of versions and use up to 150MB of shared pool.
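One way to watch this build-up is to look at the statement families in the library cache - something along these lines (the LIKE filter is just a rough heuristic for statements rewritten with system-generated binds):

-- Statement families rewritten with system-generated binds (:"SYS_B_n"),
-- with the number of child cursors and the shared pool memory each consumes.
SELECT sql_text, version_count, sharable_mem
FROM   v$sqlarea
WHERE  sql_text LIKE '%SYS_B%'
ORDER  BY sharable_mem DESC;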


So the shared pool becomes fragmented and sooner or later it's ORA-04031 time - unable to allocate memory. Even with 800M allocated to the shared pool, and 10% of it still available, the largest free chunk may be too small to be useful. At one point we got down to a largest free chunk of 23 bytes.
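You can get a rough picture of the fragmentation with queries like the following (the x$ksmsp scan needs SYS access and is itself expensive on a busy instance, so treat this as a one-off diagnostic sketch rather than something to poll):

-- Total free memory in the shared pool (can look healthy even when fragmented).
SELECT pool, name, bytes
FROM   v$sgastat
WHERE  pool = 'shared pool' AND name = 'free memory';

-- Size of the largest free chunk - the number that actually matters for ORA-04031.
SELECT MAX(ksmchsiz) AS largest_free_chunk
FROM   x$ksmsp
WHERE  ksmchcls = 'free';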


A nasty side effect is that these errors slow everything down. The main web application runs on clustered OC4J, using JDBC with session pooling to connect to a RAC back end. When queries start to slow down, the pool is exhausted, so there's a sudden connection storm - which drags the entire system into a spiral of doom.


Now, the system architects aren't complete idiots. They had Oracle consultants on hand to remind them that bind variables are important. They even had access to the code base for the legacy bridge. So an attempt was made to add bind variable support.


It so nearly succeeded - but there was an itsy-bitsy bug in it.


The original bridge developer was asked to help solve the problem. And this is where he dug himself a big hole and jumped right in. A scientist proposes a theory and then conducts experiments designed to disprove it; this person simply stated what he believed:

Bind variables are used to keep certain info constant between calls. This
requires keeping and maintaining bind variable arrays. [...] To do it properly one must maintain a lot of stuff in separate arrays and relate them to the tables and types of calls being used. Rather than altering the session and letting Oracle bind them properly.


He (correctly, if sarcastically) describes what is happening:

I looked at your code briefly and your programmer brilliantly solved all the above problems. He creates a bind array, uses it in whatever the DB call is made, and then DESTROYS the array at the end of the call. It is not being used between calls AT ALL! Each DB call (in other words for every single record you read) creates, uses, then destroys the bind arrays.



And now he leaps to the wrong conclusion:

So technically you are binding, actually you have ADDED overhead and destroyed performance. [...] I am surprised you are not getting unexplained core dumps in a heavy locking situation where it accesses multiple files.


He assumed that:
  • using binds in the legacy bridge would be expensive

  • closing the OCI statement immediately after use would waste any savings

  • using CURSOR_SHARING=SIMILAR in Oracle would (a) work and (b) be more efficient

The team lost heart in the face of this clear statement, and the attempt to use bind variables was abandoned. For many months afterwards the team tried to deal with the 4031 problem by throwing memory at it, or simply ignoring it.


But in fact, his advice was 100% wrong. Here's why:

The cost to the legacy bridge of setting up and using bind variables is barely detectable because:
  • bind variables can easily be added to the statement as it is constructed

  • the memory management is simple (malloc and free)

  • few or no additional network round trips are needed for the bind data - OCI piggybacks most or all of the information it needs onto the execute request

The Oracle server, on the other hand, is flat out. For it to avoid a hard parse it has to:
  • lexically analyse every incoming SQL statement to identify whether and where literals have been used

  • set up bind variables and extract the literal values into them

  • then do a soft parse - or, if no matching cursor exists, a hard parse - using the transformed SQL

  • retain both forms of the SQL at least until the cursor is closed

  • manage all the memory structures in a pool shared by hundreds or thousands of sessions

No contest. Although a small overhead is indeed added at the bridge, a major bottleneck is cleared on the RAC server.
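An easy way to see the server-side effect is to sample the instance-wide parse statistics before and after a load run; with binds in place the hard parse count should stay nearly flat while total parses climb. A rough sketch (these are cumulative counters, so take deltas, ideally on an otherwise quiet system):

SELECT name, value
FROM   v$sysstat
WHERE  name IN ('parse count (total)', 'parse count (hard)');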


Now - nearly a year later - a more experienced team has had another go. Using bind variables solved the 4031 problem pretty much at the first attempt - just as any old Oracle hand would expect. And that bug? It turned out to be a simple one-line coding error, which took a couple of hours of code review to find and fix.


The moral of the story is: don't believe dogma; actually try to prove or disprove a hypothesis properly.


And the answer to the question in the title: if you are continually issuing 'similar' SQL statements, it is worth using bind variables even if you discard the prepared statement after a single use.
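A minimal SQL*Plus sketch of what that means in practice (the table and column names are hypothetical):

-- Declare a bind variable once, then reuse the same statement text with
-- different values. Both executions share one cursor in the library cache;
-- only the first needs a hard parse - even if the client throws its
-- statement handle away after each call.
VARIABLE v_key NUMBER

EXEC :v_key := 1001
SELECT payload FROM legacy_records WHERE record_key = :v_key;

EXEC :v_key := 1002
SELECT payload FROM legacy_records WHERE record_key = :v_key;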
