LinuxDevCenter.com

oreilly.comSafari Books Online.Conferences.

We've expanded our Linux news coverage and improved our search! Search for all things Linux across O'Reilly!

Search
Search Tips

advertisement

Listen Print Discuss Subscribe to Linux Subscribe to Newsletters
Linux & Unix > Excerpts >

Range-Keyed Queries

by Dan Tow, author of SQL Tuning
01/06/2005

In a recent dialogue on the oracle-l@freelists.org mail group, I ran into a good prototype for a recurring class of SQL tuning problems. The coder had two versions of SQL getting dramatically different performance levels for a very frequent single-row query, and wanted to know the reason for the difference, citing logical and physical I/O figures for each alternative. To make a long story short, neither version was near optimal, and the optimal solution came from a strategy I've applied to a recurring class of problems, which I describe below. First, I present the original problem, summarized with his permission from the note by Mark Bobak (thanks, Mark!):

There are two tables involved: ADDS_USERS, which has AUSR_ID as the primary key, and AUTHORIZED_IP_ADDRESSES.

The original query was:

SELECT A.AUSR_LOGIN_SCREEN_NAME FROM ADDS_USERS A,
AUTHORIZED_IP_ADDRESSES B
WHERE A.AUSR_ID = B.AUSR_ID AND :B1 BETWEEN 
B.AIA_IP_ADDRESS_START AND
B.AIA_IP_ADDRESS_END;

Autotrace apparently misrepresented the execution plan, which was the original source of confusion, and the reason for the original note. The trace statistics for the query were correct, however, and are edited here just to show the important statistics:

Statistics
----------------------------------------------------------
       1321  consistent gets
        864  physical reads
          1  rows processed

1321 buffer gets were too many, not to mention the physical I/O, considering the frequency with which this was to be executed.

Therefore, Mark tried re-creating ADDS_USERS ordered by AUSR_ID, to improve AUSR_PK index clustering factor. This didn't help. Creating AUTHORIZED_IP_ADDRESSES as an index-organized table was also useless.

So Mark went back to SQL hacking, and finally came up with the following:

SELECT (select A.AUSR_LOGIN_SCREEN_NAME FROM ADDS_USERS A 
where
a.ausr_id = b.ausr_id) from AUTHORIZED_IP_ADDRESSES B
WHERE :B1 BETWEEN B.AIA_IP_ADDRESS_START AND B.AIA_IP_ADDRESS_END;

which produced the (accurate) plan:

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=110 Bytes=3080)
   1    0   TABLE ACCESS (BY INDEX ROWID) OF 'ADDS_USERS' (Cost=2 Card=1 Bytes=14)
   2    1     INDEX (UNIQUE SCAN) OF 'AUSR_PK' (UNIQUE) (Cost=1 Card=1)
   3    0   INDEX (RANGE SCAN) OF 'AIA_INDX_PR01' (NON-UNIQUE) 
              (Cost=4 Card=3D110 Bytes=3D3080)

Statistics were much improved:

Statistics
----------------------------------------------------------
         67  consistent gets
          0  physical reads
          1  rows processed

Consistent gets were down to 67 from 1321, but I noticed that neither plan delivered a path that maximized efficiency for a single-row query. Here is the approach I recommend for this class of problems.

A class of tables I'll call range-keyed tables are laid out with one row per range of values, where the ranges of values are non-overlapping and (usually) inclusive, meaning they leave no "holes" in the midst of the ranges if the ranges are laid out end-to-end, so to speak. (These ranges must not overlap even at the endpoints, unless one end of the range is defined as non-inclusive.) The most common examples of this involve date ranges, although this particular example came from a case of IP-address ranges. Thus, to point to a row, you say:

:bindvariable between lowcol and highcol

or (where one range's highcol equals the next range's lowcol)

(:bindvariable > lowcol and : bindvariable <= highcol).

I'll show this another way. lowcol and highcol must reflect values belonging to some order-able set, some set capable of being mapped to points on a line, such as a number line or a timeline. Taking the case where one range's highcol value is the next range's lowcol value, the ranges would look something like Figure 1.

Figure 1
Figure 1. Ranges represented by rows 1 through 4

Most commonly, the designer chooses to create an index on (lowcol, highcol), but sometimes the order is reversed, or just one of these is indexed.

The problem is that any such index strategy means that the index range scan to find the single range that meets the query condition is likely to have to read about half of the index, if you ask for a range in the middle of the whole set of ranges. (With date ranges, you can often escape most of the dilemma because applications tend to ask for the most-recent range, or one of the most-recent ranges, anyway, so an index leading with highcol will see a very short range scan, usually. If, by contrast, you index (Low_Date, High_Date), in that order, you will guarantee that most index range scans will read the entire range, since every range will have Low_Date lower than the recent date inside the most-recent date range.)

E. F. Codd discussed (in The Relational Model for Database Management, Version 2, Addison Wesley) a useful set of comparators (greatest-less-than, least-greater-than, greatest-less-than-or-equal-to, and least-greater-than-or-equal-to) that turn out to be ideal for this problem. With these available, you'd just ask for an index range scan with:

highcol least-greater-than-or-equal-to :bindvariable

and the database would find the very first (least) value of highcol >= :bindvariable in an index range scan on a conventional index of highcol, and would then quit, because that's the row you want. The cost, if you look at the I/O, would be no different than an index unique scan! The database design for inclusive, non-overlapping ranges would no longer even require lowcol because lowcol would be implied by the previous range's highcol, and the database would automatically avoid potentially thorny corrupt-data issues with accidentally overlapping ranges!

Sadly, I've never found a database that has implemented this lovely concept. Instead, the best I've found is the following admittedly hacky solution I recommend for use on Oracle:

(Forewarning: the following solution may offend the relational purists among you. Yes, I know it's a hack.) Let's say the index on highcol is called highcolind. If you want to quickly find the row you're after, the query you need is:

SELECT /*+ INDEX(t highcolind) */ <whatever>
  FROM rangetable t
 WHERE highcol >= :bindvariable
   AND lowcol <= :bindvariable
   AND ROWNUM = 1

Here, the database won't keep searching for more ranges that might enclose :bindvariable once it finds the first one, and the range scan on highcolind finds the first one in the first index entry it sees, as long as the database follows the hint! (If it fails to follow the hint, performance will suffer, clearly, but the database will still return the right row, since it will keep looking until it satisfies both conditions on :bindvariable. This will not find multiple ranges that include :bindvariable, though, if the table contains overlapping ranges--I have no easy answer to find multiple ranges if your ranges overlap. If your ranges have holes, it will also have to do the entire range scan from :bindvariable to the end of the range before it will conclude that no range contains the particular value of :bindvariable you're looking with--that's why having no holes in the ranges helps.)

To build this into a more-complex query, you'd do something like:

SELECT /*+ ORDERED <maybe other hints> */ <whatever>
  FROM (SELECT /*+ INDEX(t highcolind) */ *
          FROM rangetable t
         WHERE highcol >= :bindvariable
           AND lowcol <= :bindvariable
           AND ROWNUM = 1) t, taba a, tabb b, ...
WHERE t.fkey = a.pkey
...

If you use this strategy, be sure to watch the endpoint cases. Mark's database design had endpoints that were inclusive, so he used BETWEEN to find the single range desired, and (presumably) the application guaranteed that the next range's value of AIA_IP_ADDRESS_START was the next legal address after the previous range's AIA_IP_ADDRESS_END. If these were Oracle DATE-type columns, the Low_Date for the next range would be one second after the High_Date for the previous range. If, on the other hand, the lowcol value for one range is equal to the highcol value of the previous range (as shown in Figure 1), then you must not use BETWEEN, and must use < or > for the comparison with the endpoint that is non-inclusive. For example, if High_Date is the last included time for one range, and Low_Date for the next range equals High_Date for the previous range, then the range condition (which is non-inclusive on the low end) will look like this:

WHERE High_Date >= :bindvariable
           AND Low_Date < :bindvariable

In Mark's particular case, a new version of the query would be:

SELECT /*+ ORDERED USE_NL(A) */ A.AUSR_LOGIN_SCREEN_NAME 
FROM 
(SELECT /*+ INDEX(B AIA_IP_ADDRESS_END_INDEX) */ * 
   FROM AUTHORIZED_IP_ADDRESSES B
  WHERE :B1 BETWEEN B.AIA_IP_ADDRESS_START AND
                    B.AIA_IP_ADDRESS_END
    AND ROWNUM = 1) IV,
ADDS_USERS A
WHERE A.AUSR_ID = IV.AUSR_ID;

which joins a single row from the range-keyed table to the users table. When Mark implemented this strategy, the fixed version of the query required just five logical I/Os, while the better version of the earlier two versions required 67 logical I/Os. More importantly, the fixed version read just a single row within each of its five logical I/Os, while most of the 67 logical I/Os of the better of the earlier queries were expensive range reads in index leaf blocks, which covered every row entry in the leaf block, burning far more CPU than a single-row read in a logical I/O.

Although I created this trick for use on Oracle, it should work on MySQL; use the clause “LIMIT 1” at the end of the query (or at the subquery inside the FROM clause), in place of the condition “ROWNUM=1”, and use the hint USE INDEX(<Keylist>) (or FORCE INDEX(<Keylist>) ) at the end of the table-reference, and the hint /*! STRAIGHT JOIN */ to force the join order (in place of ORDERED), if necessary. Thus, the generic case (for MySQL versions 4.1+) above becomes:

SELECT /*! STRAIGHT JOIN <maybe other hints> */ <whatever>
  FROM (SELECT *
          FROM rangetable t FORCE INDEX(highcol)
         WHERE highcol >= <bindvariable>
           AND lowcol <= <bindvariable>
           LIMIT 1) t INNER JOIN taba a ON t.fkey = a.pkey    
                      INNER JOIN tabb b ...
WHERE 
...

Similar tricks are likely possible in other open databases, as long as there is a way to force a join order and to stop a query, especially a subquery in a FROM clause, at the first row. I’d like to invite the readers to submit the equivalent versions for PostgreSQL, DB2, Firebird, and others.

SQL Tuning

Related Reading

SQL Tuning
By Dan Tow

Dan Tow is an independent consultant, operating under the banner SingingSQL (www.singingsql.com). His experience solving Oracle-related performance problems goes all the way back to his 1989 hire by Oracle Corporation. He has a Ph.D. in chemical engineering from the University of Wisconsin at Madison.


Return to the Linux DevCenter.


What are your thoughts and contributions to the query?
You must be logged in to the O'Reilly Network to post a talkback.
Post Comment
Full Threads Oldest First

Showing messages 1 through 1 of 1.

  • With Oracle
    2009-07-13 08:56:42  ahmusch [Reply | View]

    There can be some volume problems with only adding AND ROWNUM = 1, specifically with a large data set and not-found condition.

    Assume the following index-organized table:

    SQL> DESC GEO_IP_ORG_LOC;
    Name Null? Type
    --------------------- -------- ----------------
    IP_NUM_START NOT NULL NUMBER(10)
    IP_NUM_END NOT NULL NUMBER(10)
    ISP VARCHAR2(255)
    ORG VARCHAR2(255)
    COUNTRY_CODE CHAR(2)
    REGION_CODE CHAR(2)
    CITY VARCHAR2(255)

    The primary key of this table is (ip_num_end, ip_num_start).

    If one searches this data for a found IP address like so:

    SQL> variable google_ip number;
    SQL> exec :google_ip := ((74*256*256*256) + (125*256*256) + (45*256) + 100);

    PL/SQL procedure successfully completed.

    SQL> variable private_ip number;
    SQL> exec :private_ip := ((10*256*256*256) + (0*256*256) + (0*256) + 1);

    select * from ag2.geo_ip_org_loc a
    where :google_ip between ip_num_start and ip_num_end
    and rownum = 1


    very few -- on the order of single digits -- consistent gets are required if the index is cached.

    But if the data isn't in the data set, performance can go sideways. The same query with a different IP address:

    select * from ag2.geo_ip_org_loc a
    where :private_ip between ip_num_start and ip_num_end
    and rownum = 1


    burns through 95,000+ consistent gets.

    The nature of the problem is that we can't tell Oracle that the ranges are discrete through constraints or other declarations, but we can emulate the "greatest less than" and "least greater than" through SQL.

    Consider the SQL with an additional where clause:

    select * from ag2.geo_ip_org_loc a
    where :google_ip between ip_num_start and ip_num_end
    and rownum = 1
    and ip_num_end = (select /* no_unnest */
    min(b.ip_num_end)
    from ag2.geo_ip_org_loc b
    where b.ip_num_end >= :google_ip
    )


    Again, single digit consistent gets with a found IP address. We also get similar performance in cases where the value we're searching for is in no range.

    The subquery is the least greater than, and the no_unnest hint allows it to only be evaluated when the preceding BETWEEN clause is true, depending on the release of Oracle being executed Some versions of the Oracle optimizer are more aggressive than others about unnesting that subquery.

    You can obtain exemplar data at http://www.maxmind.com/app/geolitecity




Tagged Articles

Be the first to post this article to del.icio.us

Recommended for You

  1. Cover of Even Grues Get Full
    Even Grues Get Full
    Print: $12.95
  2. Cover of Take Control of Permissions in Mac OS X
    Take Control of Permissions in Mac OS X
    Ebook: $10.00
  3. Cover of Applying RCS and SCCS
    Applying RCS and SCCS
    Print: $34.95
  4. Cover of Building Embedded Linux Systems
    Building Embedded Linux Systems
    Print: $49.99
    Ebook: $39.99

Sponsored Resources

  • Inside Lightroom
Advertisement

Sponsored by:

O'Reilly Media

©2010, O'Reilly Media, Inc.
(707) 827-7000 / (800) 998-9938
All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
About O'Reilly
Academic Solutions
Authors
Contacts
Customer Service
Jobs
Newsletters
O'Reilly Labs
Press Room
Privacy Policy
RSS Feeds
Terms of Service
User Groups
Writing for O'Reilly
Content Archive
Business Technology
Computer Technology
Google
Microsoft
Mobile
Network
Operating System
Digital Photography
Programming
Software
Web
Web Design
More O'Reilly Sites
O'Reilly Radar
Ignite
Tools of Change for Publishing
Digital Media
Inside iPhone
makezine.com
craftzine.com
hackszine.com
perl.com
xml.com

Partner Sites
InsideRIA
java.net
O'Reilly Insights on Forbes.com