Better At Oracle

A collection of tutorials, code and tools to help you get better using Oracle


15 August 2011

How do I bind a variable IN list?

On the forums, this question comes up a lot in various forms. It is often along the lines of:

From the web application, a number of users can be selected from a list. Using that list of IDs, I want to select all the details for each user from the users table. How can I do this efficiently?

I call this the "How do I bind an IN list problem", or "How do I bind a variable number of values to a query".

This problem is tricky because the number of users that can be selected is based on a variable number of values for the same input variable. Most database queries are not like that - generally there are a fixed number of variables, and each variable can only have one value.

It is also an interesting problem, as solving it brings up a number of interesting points related to database performance and security, such as bind variables, query reuse, parsing and SQL Injection attacks.

Lots of ORs

The most obvious solution is to build up a list of OR statements, eg:

where (id = :b1
or     id = :b2
or     id = :b3
...
or     id = :bn)

Dynamically creating a query in the middle tier is pretty easy, and this solution will work. Provided you use bind variables for the user input, you are protected from SQL Injection attacks, so security wise it is fine too. A mistake many beginners will make is to take each user input and dynamically add it into the query they are generating, eg:

where (id = 1
or     id = 6
or     id = 78
...
or     id = n)

If you do this, you have just opened a massive security hole in your application leaving it vulnerable to SQL Injection Attacks. Even though it seems like more work, the ONLY safe approach is to generate the query using bind variable placeholders and then bind the user input into the query as it is executed.

Assuming you use bind variables, there is still a bit of a performance problem with this approach.

There are potentially a large number of inputs, and each time the query is executed there will be a different number of them, so the number of different queries to be run is potentially large. This means many different variations of the query will be created in the Oracle Shared Pool, possibly pushing other reusable queries out of the pool. In an extreme case, although unlikely, it could lead to a hard parsing problem in the database.

In an ideal application tier, the application will parse (or prepare) a query one time and then execute it many times - with this approach doing that would be very difficult due to the large number of different queries, and so the database may suffer from excessive soft parsing if the query is executed frequently.

The application can suffer too, as parsing (calling the JDBC Prepare method) a new query involves a round trip to the database, so by parsing and executing each query, two rounds trips are required for each query instead of just one.

All that said, this solution isn't terrible, but there is a better way.

Using the IN operator

For occasions where you need to query a table based on a variable number of inputs, SQL provides the IN operator, eg

where id in (1, 5, 8, ... n)

When developers learn of the IN operator, the first thing many try is to join all the input IDs into a comma separated string, and then bind that into the query, eg:

where id in (:inputs)

Were inputs is a sting like:

1,4,7,78,99

This is a perfect approach security wise and performance wise, except that is will never return the correct data! When you bind the string above into the query, it results in a query that looks like:

where id in ( '1,4,7,78,99' )

Which is totally different to the intended query:

where id in (1,4,7,78,99)

The first query has a single string as its input, and will probably error at run time with an INVALID NUMBER error. The second query has 5 numbers as input.

Faced with this, a developer has two options. Dynamically create a query similar to the one with many OR statements, eg:

where id in (:b1, :b2, :b3, ... :bn)

This is just the many OR statements query in disguise, and suffers from the same problems.

The other wrong approach is to dynamically generate the entire query based on the user input with no bind variables at all, eg:

where id in (1,4,7,78,99)

Unfortunately this is the easiest thing to do in the application tier, but it is a sure way to destroy your application. The database will quickly develop a hard parsing problem and the application is wide open to SQL injection attacks.

If you are not using BIND VARIABLES in your SQL for all USER INPUT you are doing it wrong.

The best solution

Ideally what we want is the ability to bind just one variable to the SQL query, which is a comma separated string of IDs. Then split this single string into a list of values (ie one row per ID), then it could be used to join against the user table.

Using an SQL trick, this is possible, try running the following and pass a comma separated list of IDs as the value for :txt:

SQL11G> SELECT regexp_substr(:txt, '[^,]+', 1, LEVEL) token
        FROM dual
        CONNECT BY LEVEL <= length(:txt) - length(REPLACE(:txt, ',', '')) + 1

TOKEN
-----
1
5
6
8
100

If the value '1,5,6,8,100' is bound as the value of txt, the query will return 5 rows. A combination of REGEXP_SUBSTR and CONNECT BY LEVEL is used to split the input string on comma and output a row for each ID.

Notice there is only a single bind variable, and it can contain any number of values and the query text is the same, so the query is reusable - it can be parsed once in the application and executed many times for many different inputs.

The user input is bound into the query, so security concerns are taken care of - there is no way to perform a SQL Injection attack on this query.

Given this trick, it is trivial to expand it into a query that can select the list of details from the Users table. First create a users table to experiment with:

SQL11G> create table users
        as 
        select rownum id, 'User'||rownum username
        from all_objects;

Then create a query that will accept an input as a comma separated string, and return the details for each user ID in this string:

SQL11G> with id_generator
        as
        (
          SELECT regexp_substr(:txt, '[^,]+', 1, LEVEL) token
          FROM dual
          CONNECT BY LEVEL <= length(:txt) - length(REPLACE(:txt, ',', '')) + 1
        )
        select u.id, u.username
        from users u, id_generator g
        where u.id = g.token;

Limitations

There is one limitation to this solution - the maximum length of the comma separated list is 4000 characters. Any more and Oracle gives the error:

SQL Error: ORA-01460: unimplemented or unreasonable conversion requested

To work around this, you could expand the ID_Generator part of the query using UNION ALL and bind in two separate strings, eg:

SQL11G> SELECT regexp_substr(:txt, '[^,]+', 1, LEVEL) token
        FROM dual
        CONNECT BY LEVEL <= length(:txt) - length(REPLACE(:txt, ',', '')) + 1
        UNION ALL
        SELECT regexp_substr(:txt2, '[^,]+', 1, LEVEL) token
        FROM dual
        CONNECT BY LEVEL <= length(:txt2) - length(REPLACE(:txt2, ',', '')) + 1

This complicates things slightly, and could start to become like using many OR statements if the list of IDs is too long, but 4000 characters should be enough for most applications, unless the IDs are very long numbers.

Another limitation is that the REGEXPSUBSTR function is only available in Oracle 10g and above. If you are using 9i, then an alternative version of the IDGenerator query can make use of instr and substr instead of REGEXP_SUBSTR:

SQL11G> select
        substr(txt,
               instr (txt, ',', 1, level  ) + 1,
               instr (txt, ',', 1, level+1) - instr (txt, ',', 1, level) -1 ) as token
        from (select ','||:txt||',' txt from dual)
        connect by level <= length(:txt)-length(replace(:txt,',',''))+1
blog comments powered by Disqus