When most people hear the term “Large Database” they assume it is an aggregation of mostly static data that is being mined and searched. But, there are also databases full of scientific calculations and rules. These large databases that are used heavily for calculations pose a specific set of challenges when trying to achieve maximum throughput and multiple users. There have been several projects where we have been brought in to assist or evaluate procedural code because the required or expected performance was not being achieved. Most projects wait until the end of the development cycle to do heavy performance testing or stress testing of a system. This procrastination places a number of constraints on what remedies can be used to address the issues, as it is too late to change the database design, use newer features or even possibly change configuration parameters. To learn a better way to do database development check out my article on SQL with Tea.
With these constraints in place, we are limited to isolated code changes or design changes that do not affect data retrieval from the front-end. We have consistently been able to achieve optimal results after reviewing the procedural code and suggesting or making changes. Database engines are powerful and smart if you can provide only the pertinent information or ask the right question.
Before attacking any procedural code, the first step in performance evaluation is looking at the performance of the existing queries as written. In SQL Server the use of the profiler can help identify poorly performing queries and using the explain plan to see what the optimizer is doing with the queries can be very enlightening. After reviewing the explain plan, we can look for ways to increase performance or decrease the data set size. If you are not familiar with these tools, you should be. You can follow the links to get started with each of the tools.
I wanted to share the 5 top areas, based on our experiences, where small changes can make a large difference. While you are performing your query plan analysis keep these thoughts toward the front.
Limit the Data Reach:
Stored procedures that do heavy processing may need to pull information from a lot of rows in the database before executing. There may even be the need to do interim calculations before a final calculation can be performed. It is important to make sure that you are not touching data that you are not going to need. Not using proper join statements or incomplete filtering (i.e. where clauses) will cause a query to return too much data, causing extra processing. It is generally faster to issue two queries to the DB instead of one query and then exclude data in procedural code. A key point here is that with less data being returned there is a lower chance of locking or even the dreaded deadlock.
Query Based Indexing:
Many software developers know the value of an index on a database table, but are not sure of how to set-up the indexes to get the maximum benefit from them. Over-indexing is just as bad as under-indexing a table and there is no hard or fast rule on how many you should have. Setting up an index is as easy as reviewing your queries to see how you are selecting your data and how you are joining tables, and then using those fields in the proper order. Of course there are other indexing options available that can also help. Three of these are specifying uniqueness, sort order, or the creation of a covering index to eliminate the table access all together. If you are not familiar with this concept or would like to read more about it, Microsoft has a good article about it here.
Using Temporary Tables:
For databases with a high quantity of users and large tables, the use of temporary tables can be a large performance boost. Most databases have Global or Local temporary tables where the scope of the table is different. Temporary tables act just like real tables but usually contain a limited and potentially a pre-modified set of data. Population of a temporary table can be done using a ‘Create Statement’ or using a ‘Select Into’ statement. The temporary table will not have any indexes (at first) but indexes can be added and specialized for how the temporary table will be used. This can eliminate the over-indexing mentioned above. Local temporary tables are user specific which could help eliminate locking issues that might be seen if the real table was used. A word of caution when using temporary tables — make sure you have the needed space in your tempdb (SQL Server) to host the temp tables.
Use of Common Table Expression:
The use of common table expressions (CTE’s) in a stored procedure can make the queries easier to write and read and can make debugging easier when evaluating the data in a data set. There are many differences between CTE’s and temp tables but we can save most of that for another post. The top three differences are that they are not materialized, cannot be indexed, and are only good for the next query after they are defined. CTE’s can be nested and referenced multiple times in the primary query. Once again, make sure you limit your CTE’s to the proper data set and only the columns that you need.
Dynamically Eliminate Steps:
We continually see stored procedures being called that perform no processing due to lack of data (think processing and totaling tax deductions when no deductions have been entered). If there is a good possibility that the data does not exist, a quick check to see if it does can eliminate the overhead of calling a procedure. This overhead would also include any processing that might take place before a data collection query is executed in the called procedures (i.e. variable, cursor, or CTE declaration, temp table creation, etc). The overhead of a data existence select could heavily outweigh the call to the procedure.
If your calculations perform a lot of batch totaling into a parent table, a simple ‘isModified’ column that gets set at the time of data entry, can eliminate recalculating batch header information if the underlying detail records for the batch have not been modified or added to.
While this covers the top 5 performance robbers that we have seen, there are many other reasons why performance could be lacking, or why locking could be happening. Each application is different and has different scaling and performance requirements. The sooner you can identify that an issue exists the sooner you can start to address them and the more options that will be available to you. Test-driven development can help spot performance issues, you can read more about it in my posting SQL with Tea.