When a programmer is asked to implement a complicated business calculation or report, he or she will often break that complicated task into multiple steps. This multi-step approach enables the brain to solve one problem at a time and often results in application code that is easier to read and maintain. This step-by-step methodology can also be applied to DB2 for i queries. Here I'll discuss various approaches that IBM i developers can use for multi-step query implementations. First we'll look at temporary tables, a technique that is commonly used in multi-step queries but has drawbacks. Then we'll investigate some alternative techniques for multi-step querying and discover which ones are likely to have the most favorable effect on query performance.
Multi-Step Querying and Temporary Tables
IBM provides several database features that make it simple for developers to employ the multi-step approach. Historically, the two DB2 features used most often are temporary tables (i.e., physical files) and the ability to write the results of a Query/400 report to an output file using the Report Output Type (OUTTYPE) parameter. Both of these DB2 capabilities allow programmers to easily break a complicated report into a series of simpler steps. For example, the programmer runs a query to join together data from two tables and stores the result of the join query into a temporary work table. A second query is run that joins the temporary table with another table in the database. Figure 1 shows a graphical representation of this process, which, though a highly simplified example, demonstrates the practice often used by IBM i developers.
The multi-step solution depicted in Figure 1 can be characterized as using physical steps. Each time a query or step is performed, the data resulting from that step is physically copied into a table that may or may not be a temporary table. Although using multiple steps is a sound approach, physically making copies of the data is not the best idea from a performance perspective.
This multi-step solution involving physical steps makes a copy of the data in the two tables. Making a copy of the data takes time and consumes system resources. Often data copied into the work table during the first step will be discarded by the time the multi-step process reaches the final step. Moving around data that will not be part of the final result set is not the best usage of system resources. Would you apply this same approach when moving from one home to another? The moving truck is packed with all the contents of your old home. When the truck arrives at the new home 100 miles away, you realize that all the appliances were moved even though they were supposed to be left at your old home. Think of all the time and energy wasted loading and transporting the heavy appliances when they should have never been touched. Multi-step query solutions involving physical steps waste system resources in a similar manner. Using more system resources than necessary prevents the application's performance from scaling as more users are added to the system.
Another downside of implementing a multi-step solution with physical steps is that it limits the ability of the DB2 for i query optimizer to boost the performance of your query request. The DB2 query optimizer has several algorithms and shortcuts at its disposal that can significantly improve query performance. The goal of the query optimizer is to use the minimum number of I/O operations and to use the CPU as efficiently as possible. When a query is CPU bound instead of I/O bound, it runs as quickly as the hardware allows and completes the request sooner.
On the other hand, developers using multi-step processes increase the likelihood that more I/O operations will be needed to produce the same result. Figure 2 contains a simple multi-step query implementation that demonstrates this point.
DECLARE GLOBAL TEMPORARY TABLE temp1 AS
(SELECT shipdate, custname, phone, orderkey, suppkey, supplycost
FROM orders o, customers c
WHERE c.custkey=o.custkey AND discount=0.08) WITH DATA;
DECLARE GLOBAL TEMPORARY TABLE temp2 AS
(SELECT shipdate, custname, phone, orderkey, suppname, supplycost
FROM session.temp1 step1, suppliers s
WHERE s.suppkey=step1.suppkey AND s.country='USA') WITH DATA;
DECLARE GLOBAL TEMPORARY TABLE temp3 AS
(SELECT custname, phone, orderkey, suppname, supplycost
FROM session.temp2, dates
WHERE datekey = shipdate) WITH DATA;
SELECT * FROM session.temp3;
In this case, the developer was asked to deliver a report that needs to pull together data from four different tables. To simplify the task, the developer used temporary tables to join two tables at a time instead of a more complex four-table join. If you haven't seen the DECLARE GLOBAL TEMPORARY TABLE statement before, it's just SQL syntax to create a table in the QTEMP library on IBM i. The first temporary table is used to join the first two tables (orders and customers) and store the results of the join. The definition of the second temporary table then joins a third table, suppliers, to the previously joined tables. The third temporary table repeats this process to pull in the data from the last table, dates. At this point, the application just queries the last temporary table to access data that has been joined together from all four tables.
The downside of the multi-step approach used in Figure 2 is that the temporary tables limit the optimizer's ability to choose the order in which the tables are joined. The query optimizer is only given a chance to see two tables at a time. The optimizer is never given the opportunity to analyze the join conditions on all four tables to determine the most efficient join order from a performance perspective.
Compare this with the trip planning you might do for shopping. Let's say that you have to pick up items from four different stores. The physical step approach would be similar to driving to a single store and then returning home before going to the next store. Instead, most of us would first look at the location of all four stores on a map and find the most efficient route that enables us to visit all four stores before returning home. In a similar fashion, allowing the query optimizer to see the entire query process dramatically increases the chances of a query performing well.
Furthermore, the temporary tables used in this multi-step process will not have as many indexes created over them as the base tables used to populate the temporary table. Anyone with a basic understanding of query optimization knows that a lack of indexes is not good for performance. The DB2 for i query optimizer is dependent on the statistics provided by indexes to get a deeper understanding of the data. This enhanced knowledge increases the optimizer's accuracy in determining the fastest-performing method for the query implementation. In addition, indexes are needed to speed up the execution of the query because they provide a method to quickly search and sort the data in a table.
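For example, permanent indexes over the join and selection columns used in Figure 2 would give the optimizer both the statistics and the fast access methods described above. The index names and column choices below are hypothetical; a minimal sketch:

/* Hypothetical indexes supporting the joins and selection in Figure 2 */
CREATE INDEX orders_custkey_ix
   ON orders (custkey);

CREATE INDEX suppliers_country_ix
   ON suppliers (country);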
At this point, it may appear that this article is headed in the direction of discouraging the usage of multi-step query implementations, which is not the case at all! Keep reading, and you will find that the DB2 for i SQL interface provides syntax that enables you to implement complex queries and reports by using logical steps instead of physical steps.
Temporary Tables and Access Plans
Before covering the SQL syntax that enables logical steps, there's one final performance issue related to the usage of temporary tables that needs to be discussed: the creation of access plans. An access plan (also known as a query plan) must exist before an SQL statement can run. An access plan is a data structure that contains a listing of the objects and methods that the database engine will utilize during the execution of the SQL request. An access plan is similar to the blueprint for a new building. Construction of the new building cannot begin until someone has taken the time to create the blueprint. In the same way, the database engine must have an access plan in place for an SQL request before the execution of the query can begin. (For more information about SQL access plans, see "Understanding Access Plans and Open Data Paths.")
Each unique SQL statement requires its own unique access plan for execution. Let's apply this requirement to the two Select statements shown in the following example:
SELECT col2 FROM table1 ORDER BY col1
SELECT col2 FROM table2 ORDER BY col1
In this example, it is perfectly logical to see why each Select statement requires its own access plan. Even though the two Select statements are almost identical, a separate access plan is required because the statements reference two different tables. Going back to the blueprint analogy, it would not be wise for a builder to use the same blueprint for two houses that are similar. A blueprint should only be shared for constructing houses that are identical, just as an access plan can only be shared for the execution of identical SQL statements.
At this point, you might be wondering why access plan creation is a performance concern with temporary tables. After all, if multiple jobs on the system are running the following Select statement, then all jobs should be able to share and reuse a single query plan—correct?
SELECT col2 FROM session.temp1 ORDER BY col1
This appears to be the correct assumption because the SQL statement text is identical. However, the reality is that this access plan cannot be shared by different jobs running the same query against a temporary table.
When DB2 for i evaluates whether an access plan can be shared, DB2 examines the SQL statement text and the objects referenced by the statement. Although the name and definition of the temporary table are the same in each job, the object address of the temporary table is different because each job has its own unique instance of the temporary table in the QTEMP library. As a result, if you have 50 jobs that run the Select statement in the previous example, each job will have to use system resources to create a query plan to enable the Select statement to be run within the job. The fact that each job has to create its own access plan is where the performance issues start to arise with temporary tables.
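To picture why, consider two jobs running the identical statement text against their own temporary tables. A minimal sketch (the table definition is hypothetical):

/* Both Job A and Job B execute the identical statement text */
DECLARE GLOBAL TEMPORARY TABLE temp1
   (col1 INTEGER, col2 CHAR(10)) WITH REPLACE;

SELECT col2 FROM session.temp1 ORDER BY col1;

/* Job A's SESSION.TEMP1 and Job B's SESSION.TEMP1 are different objects,
   each residing in that job's own QTEMP library, so the access plan built
   in Job A cannot be reused in Job B */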
Let's compare the lack of access plan sharing for temporary tables with the access plan sharing that occurs by revisiting the set of two Select statements shown previously. If these two Select statements are run in 50 different jobs on the system, then DB2 will only be asked twice to create access plans—once for each Select statement. The access plans for each statement will be reused and shared across all 50 jobs. If the two Select statements referenced temporary tables, the DB2 engine would be called 100 times to produce access plans—two access plans for each job. When you compare the costs of 100 calls to the DB2 engine to two calls to the DB2 engine, it should be clear how the usage of temporary tables can start to slow your system's performance.
Now that you understand the performance pitfalls of using temporary tables, let's look at alternatives that deliver better performance. One alternative is to use pseudo-temporary tables. Instead of creating temporary tables in QTEMP, a permanent work table is created in a permanent library on the system. As depicted in Figure 3, each job would store its temporary data in the pseudo-temporary table instead of storing the data in temporary tables in the QTEMP library. The usage of the same permanent table by all the jobs eliminates the excessive access plan creation issue caused by the temporary tables solution.
Figure 4 shows the first step in this alternative solution: the creation of the pseudo-temporary work table. Instead of creating a table in the QTEMP library, the table (pseudo_temptab1) is created in a permanent library named AppLib. In this situation, you may want to consider creating the pseudo-temporary table in a library instead of an SQL-created schema because tables created in an SQL schema are journaled by default. More often than not, data changes to an intermediate work table do not require the recovery protection that journaling provides. From a performance perspective, you will want to carefully evaluate whether your pseudo-temporary table needs to be journaled.
CREATE TABLE applib/pseudo_temptab1
  ( Job_Identifier INTEGER,
    Data_Col1 INTEGER,
    Data_Col2 VARCHAR(50),
    Data_Col3 DECIMAL(9,2) )
/* Optional object */
CREATE SEQUENCE JobID
START WITH 1 INCREMENT BY 1
When creating a pseudo-temporary table that will be shared across jobs, you will need to add one additional column to the table definition to uniquely identify the rows in the table that belong to a given job. The name of that identifier column for this pseudo-temporary table is Job_Identifier. The integer data type is chosen for this example because a DB2 sequence object will be used to generate the meaningless key value for the Job_Identifier column, as shown in Figure 4. Usage of a DB2 sequence object is optional. A more complex solution would involve the application retrieving the job information and using a concatenated value of the IBM i job name, job user, and job number as the job identifier value for the rows belonging to that job in the pseudo-temporary table.
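As a sketch of that alternative, recent IBM i releases provide the QSYS2.JOB_NAME built-in global variable, which returns the qualified job name directly in SQL; the host variable name below is hypothetical, and the Job_Identifier column would then need a character type such as VARCHAR(28) instead of INTEGER:

/* Retrieve the qualified job name (number/user/name) for use as the
   job identifier; assumes the QSYS2.JOB_NAME global variable is available */
VALUES (QSYS2.JOB_NAME) INTO :hostVarJobName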
The pseudo-code in Figure 5 demonstrates the runtime usage of the pseudo-temporary table.
/* Get a unique identifier for this job */
VALUES (NEXT VALUE for JobID) INTO :hostVarID
/* Populate temporary table with data */
INSERT INTO applib/pseudo_temptab1
SELECT :hostVarID, custID, custName, ordAmount
FROM customers, orders
WHERE custID = ordCustID
/* Join data in temporary table to another table */
SELECT t.custID, custName, ordAmount, payment_method_code
FROM applib/pseudo_temptab1 t, customers_payment_method p
WHERE t.custID = p.custID AND
Job_Identifier = :hostVarID
/* Remove rows from temp table for current job*/
DELETE FROM applib/pseudo_temptab1
WHERE Job_Identifier = :hostVarID
When a job is ready to use a pseudo-temporary table, the first step is obtaining an identifier to uniquely identify the rows associated with the job. The VALUES statement generates this unique identifier by requesting the next value from the JobID sequence object. The sequence object simplifies the unique identifier generation process because DB2 automatically takes care of the locking and incrementing logic that is required to generate a unique key value.
With the job identifier value stored in a host variable, the next step is populating the pseudo-temporary table. Notice how the sub-select on the INSERT statement references this host variable. This reference guarantees that all the joined rows written to pseudo_temptab1 will have the same value for the Job_Identifier column. The application then joins customer payment method data with the data stored in the pseudo-temporary table belonging to the current job by specifying the selection criteria of Job_Identifier = :hostVarID. Once the processing of the data stored in the temporary table is completed, the rows for the current job are deleted by specifying the same selection criteria (Job_Identifier = :hostVarID) on the DELETE statement.
Although this pseudo-temporary table solution eliminates the performance overhead of excessive access plan creation, it still has the performance penalty of using physical steps in the solution. A fair amount of system resources will be consumed by the steps that physically copy rows into the pseudo-temporary table and then delete the rows from the pseudo-temporary table once the processing has finished.
SQL Views and CTEs
Because of the performance overhead involved with using physical steps, the best solution is to eliminate the usage of temporary (and pseudo-temporary) tables by using SQL functionality that allows complex reports to be broken into logical steps. SQL views and common table expressions (CTEs) are the DB2 for i SQL features that are most commonly used to break complex requests into a series of simple, logical steps. SQL's derived table support also enables logical step-by-step processing, but most developers find the CTE feature easier to understand and implement.
An SQL view is used to logically define a set of data. When the CREATE VIEW statement is executed, no data is copied into the SQL view object. Creating an SQL view is equivalent to creating a non-keyed logical file. The CREATE VIEW statement execution simply registers the logical data set definition with DB2, so that your application programs can use the logical data set.
Figure 6 shows how an SQL view, named cust_orders here, can be used to logically implement the physical processing steps detailed in Figure 5.
CREATE VIEW cust_orders AS
  (SELECT custID, custName, ordAmount
   FROM customers, orders
   WHERE custID = ordCustID)

SELECT custID, custName, SUM(ordAmount)
  FROM cust_orders
  GROUP BY custID, custName
Normally, an application would not create the view as part of its processing. An application programmer would typically just reference the view by name, as shown on the SELECT statement, knowing that the view had been created previously as part of the database creation. When the SELECT statement that references the view is executed by the application, DB2 essentially combines that SELECT statement with the SELECT statement contained in the view definition, generating an access plan for a single, merged SELECT statement. Using an SQL view to break the query request into multiple logical steps enables the DB2 query optimizer to see all the steps (statements) involved in the query and combine the logical steps into a single SQL request. This logical streamlining by the query optimizer is where your application can receive significant performance gains compared to the physical steps solution that relies on temporary tables.
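Conceptually, the merged request that the optimizer ends up working with resembles the following single statement. DB2 performs this combination internally; the statement is shown only for illustration:

SELECT custID, custName, SUM(ordAmount)
  FROM customers, orders
  WHERE custID = ordCustID
  GROUP BY custID, custName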
Because SQL views create permanent objects in your database, you need to manage the view objects and their source code definitions just as you do the table objects that contain your business data. The more views that are created to break complex queries into logical steps, the more view objects need to be managed. This management issue often leads developers to use the SQL CTE support to implement a complex report with a series of logical steps. (You can find more information about SQL CTEs in "Using the WITH Clause for Better SQL Statements.")
One way to think of a CTE is that it allows you to define a temporary SQL view that exists only for the life of your query. A CTE provides the same logical step-by-step capability as a view, but there's no permanent object that is associated with the CTE. These two SQL features are often used together. An SQL view definition can use one or more CTEs, and a CTE has the ability to reference one or more views.
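For example, a view definition can embed a CTE. The following sketch reuses the suppliers and orders tables from Figure 2; the view name and the CTE name are hypothetical:

CREATE VIEW usa_supplier_costs AS
  WITH usa_suppliers AS
    (SELECT suppkey, suppname
     FROM suppliers
     WHERE country = 'USA')
  SELECT o.orderkey, u.suppname, o.supplycost
  FROM orders o, usa_suppliers u
  WHERE o.suppkey = u.suppkey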
The query in Figure 7 demonstrates how multiple CTEs can be used to logically simulate the physical temporary table steps that were employed in the four-way join example shown in Figure 2.
WITH step1 AS
  (SELECT shipdate, custname, phone, orderkey, suppkey, supplycost
   FROM orders o, customers c
   WHERE c.custkey = o.custkey AND discount=0.08),
step2 AS
  (SELECT shipdate, custname, phone, orderkey, suppname, supplycost
   FROM step1, suppliers s
   WHERE step1.suppkey = s.suppkey AND s.country='USA'),
step3 AS
  (SELECT custname, phone, orderkey, suppname, supplycost
   FROM step2, dates
   WHERE datekey = shipdate)
SELECT * FROM step3
Defining a CTE requires usage of the WITH clause. When a statement contains multiple CTEs, like the example in Figure 7, the WITH clause is specified only once and a comma is used to separate the additional CTE definitions. Declaring the CTE definitions at the top of the query is similar to how an application program often has common routines or procedures defined at the beginning of the source. A CTE can be referenced multiple times in the query, just as a common routine is frequently invoked from different parts of an application. This up-front style of CTE definition improves the readability of the query as well.
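For instance, the following sketch (reusing the orders table from Figure 2) defines the order_costs CTE once and references it twice: once in a second CTE that computes an average and once in the main Select statement.

WITH order_costs AS
  (SELECT suppkey, supplycost
   FROM orders),
avg_cost AS
  (SELECT AVG(supplycost) AS avgcost
   FROM order_costs)
SELECT o.suppkey, o.supplycost
FROM order_costs o, avg_cost a
WHERE o.supplycost > a.avgcost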
In Figure 7, notice how easily the Select statements used to define the temporary tables integrate into the CTE definitions. Each CTE definition references the previous CTE in the same way that each temporary table referenced the data stored in the previous temporary table. The main Select statement retrieves all the columns and rows from the last table expression, mirroring the simple Select statement at the end of the temporary table process.
Comparing the runtime implementation of the CTE logical steps with the temporary table physical steps is where things really get interesting. Figure 8 shows a Visual Explain diagram that represents the access plan built by the DB2 query optimizer for the CTE implementation.
There are two key differences to focus your attention on. First, the plan utilizes no temporary data structures that contain a copy of the database data. As discussed earlier, the temporary table implementation will always have the performance penalty of physically copying data from database tables. This copy process unnecessarily consumes CPU, memory, and disk I/O subsystem resources. At times, the DB2 optimizer might utilize temporary data structures such as hash tables that make a copy of the data. However, these temporary data structures can be populated much faster than a temporary table.
Second, the optimizer's plan uses a join order that is different from the temporary table join order. The temporary table implementation physically forced a join order of orders to customers to suppliers to dates. The Visual Explain output shows that the query optimizer instead chose the join order it determined to be the most efficient. The logical step-by-step CTE implementation allows the query optimizer to see all the query steps. This complete perspective of the query gives the optimizer a chance to apply its advanced algorithms and query rewrite techniques to determine the most efficient implementation. That's a good deal when it comes to performance and scalability.
One Logical Step at a Time
What you should take away from this article is that using a series of simple steps to solve a complex problem is still a valid technique to employ—even when using SQL. A multi-step approach is best accomplished using SQL views and CTEs, which enable queries to be broken into logical steps instead of physical steps. Allowing the DB2 for i query optimizer to see the entire request with the usage of logical steps is essential in keeping the performance of both your queries and systems at a high level.