Sunday, 13 July 2014

Query Optimization

The basic steps in query optimization are as follows.

1.      Query Plan generation
2.      Cost Estimation
a.      Enumerate reasonable alternative plans
b.      Estimate the properties of input using catalog and also estimate the properties of result
c.       Repeat for space
3.      Choose plan
4.      Intimate Catalog Manager
The goal of a query optimizer is to find a good evaluation plan for a given query. Optimizing a relational algebra expression involves two basic steps:
1.      Enumerating alternative plans for evaluating the expression

2.      Estimating the cost of each enumerated plan, and choosing the plan with the least estimated cost.


Query Evaluation Plans

A query evaluation plan consists of an extended relational algebra tree, with additional annotations at each node indicating the access methods to use for each relation and the implementation method (index, sort etc.) to use for each relational operator (selection, projection etc.).

A query is broken into blocks. Each block is definitely a SELECT and FROM pair and optionally WHERE, GROUP BY, HAVING. If GROUP BY is present then it goes with SELECT at the time of evaluation.

E.g. Union query consist of two blocks. Now if each block is having a nested query then each can be further divided into two blocks. Hence there will be four blocks.

A block is taken as a unit for query optimization. The assumption is "Optimizing Individual block optimizes entire query." This is mostly right but not always.

A block can be converted to relational algebra expression and then each operator will be evaluated.

Following points must be considered at the time of generating query evaluation plans.

To obtain a fully specified evaluation plan, we must decide on an implementation for each of the algebra operations involved.


The index information is available in System Catalog. Hence using the relational algebra expression and system catalog, evaluate query and develop alternative plans.



Pipelined Evaluation

When a query is composed of several operators, the result of one operator is sometimes pipelined to another operator without creating a temporary relation to hold the intermediate result. Pipelining the output of an operator into the next operator saves the cost of writing out the intermediate result and reading it back in.

If the output of an operator is saved in a temporary relation for processing by the next operator, we say that the tuples are materialized. E.g. join the relations, write on disk and then go for selection and projection. It is a costly affair. But it is chosen if

1. Same join is used by another query.
2. By choice you have decided to keep data in denormalized form. This is done when this data is needed again and again. So we normalize the data as much as possible and then denormalize it a level down as it is required every now and then.

Consider the following SQL query:

Find names of sailors having rating > 5 and reserved boat with bid = 100

SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid = S.sid
AND R.bid = 100 AND S.rating > 5

This query can be expressed in relational algebra as follows: 

Plan 1

The relational algebra tree for the query is as follows. First we compute the natural join of Reserves and Sailors, then perform the selections, and finally project the sname field. 


We can use a page-oriented simple nested loops join with Reserves as the outer relation and apply selections and projections to each tuple in the result of the join as it is produced; the result of the join before the selections and projections is never stored in its entirety. This query evaluation plan is shown in the following figure.


If 5 buffers are available then

If materialization of the resultant relation is desired (due to any of the reasons mentioned above) then we can use 
  • Page oriented simple nested loops join – 1 buffer will be allocated to R, 1 buffer to S and 1 buffer to R |X| S. But the remaining 2 buffers will not be utilized.
  • Block nested loops join – Here we can allocate 3 buffers to smaller relation S, 1 buffer to relation R and 1 for the resultant relation.
At the end all 5 buffers will be available for selection and projection operation.

If it is to be done on-the-fly then

The steps of execution will be –
  1. Load as many buffers as possible for R and S (different for page oriented and block nested loops join)
  2. Match all tuples of R will all tuples of S and store satisfying tuples in the buffer allocated to resultant relation.
  3. If buffer of resultant relation is full, pause join operation and open the buffer for read operation. 
  4. Apply selection condition and discard attributes which are not required.
  5. When the buffer becomes empty restart join operation.
Plan 2

Consider there are 100 boats, therefore 1/100 of boats qualify for R; and there are 10 ratings, as the condition is >5, therefore 1/2 of the tuples from S will qualify. 

Hence it is better to first discard the tuples which do not satisfy the condition. This will result is less number of tuples qualifying for the Join operation. The relational algebra tree for the query plan 2 is as shown in the figure


Can we create Index?

As the condition is equal to (=) and the attribute bid is a candidate key and is part of the relation R, hence a hash index is appropriate. If it is given that the maximum number of reservations on a boat can be 3 then there is possibility of collisions on R.bid.  Hence in this case clustered hash index is the best solution.
The total cost can be calculate as

R.bis = 100

S.rating>5
Cost of selection with clustered hash index
+
Cost of serial Scan
+
Cost of hash join i.e. 3 * (M + N )

Plan 3

As only related attributes for the query is Sid, sname and rating from relation S, hence we can first discard other attributes and then go for the selection operation.


If at the end, everything need be sorted by sid then it makes sense to use Sort-Merge join.

No comments:

Post a Comment

Your comments are very much valuable for us. Thanks for giving your precious time.