Indexing is a way to optimise the performance of a database by minimising the number of disk accesses required when a query is processed. This post show the power of indexing, how to improve database performance, and where we can go wrong with indexing.
Time-series data is the lifeblood of any SCADA system. It is a sequence of data points (tags), and measured at successive time instants. Time-series data is used to monitor machines and processes in any kind of industry, and it is created when a data point (for example from a PLC) is read and time-stamped by the SCADA system. Using a modern SCADA system like Ignition that fully utilises SQL databases offers such such major advantages over traditional SCADA that it can literately redefine the role of SCADA in the organisation.
When you combine Ignitions’s licensing model (unlimited number of tags) with a business need for monitoring machines and processes in a high resolution you easily end up with a huge amount of data. That is not in itself a problem – data storage today comes at a very low cost – but you can easily run into database performance problems.
One way to improve database performance is through the use of indexing. Well-designed indexes are part of database optimisation because they are the key factor in achieving optimal query performance and response time. The most important part is to understand what to index and how the indexing is going to boost the query response time! A proper index can be created only when you know exactly what your query and data access patterns look like.
To learn more about the power of indexing (and its limitations) we created a test environment. The remaining part of this post provides the details of this test, and for the busy, lazy person the following guidelines might suffice:
- Indexes consume extra memory
- Insert, Delete and Update are costly with indexes
- Cardinality (number of distinct values in a column) is important! High cardinality -> better performance
- Use indexes in important queries
- Index to avoid sorting
- Create indexes for uniqueness
- Create indexes for foreign keys
- Consider adding columns for index only access
- The order is important
- Avoid over-indexing tables.
The data that is used for the example queries is for Air Pollution in Seoul, South Korea. The used data is downloaded from Kaggle, which provides public datasets used for science and education projects.
The database used for the test is PostgreSQL. For every execution plan we used DataGrip which is Integrated development environment (IDE) for SQL.
- Measurement_item: 7 columns, 6 rows
- Measurement_station: 5 columns, 25 rows
- Measurement: 5 columns, 3 885 066 rows
ERD – Entity Relationship Diagram
The WHERE clause
The SQL WHERE clause is used to specify a search condition while fetching data, and it thus falls into the core functional domain of an index: finding data quickly. The main thing we can note here is that a poorly written WHERE clause is the first ingredient of a slow query. We will start with the simplest yet most common WHERE clause: the primary key lookup. For this example we will use the table “measurement”. Initially we have not created primary key on this table.
select average_value, instrument_status
where time = ‘2017-11-29 13:00:00.000000’
We got the result of this query in 550 milliseconds with the following execution plan:
A query that is not indexed results in a full table scan, where the database processes each record of the table to find all records meeting the given requirements. Even if the query selects just a few rows from the table, all rows in the entire table will be examined. Let’s try to add a primary key on the table “measurement”.
alter table measurement add constraint measurement_pk unique (time, station_code, item_code);
The PRIMARY KEY constraint uniquely identifies each record in a table. Primary keys must contain UNIQUE values, and cannot contain NULL values. Primary keys are always indexed by default. That means that whenever we create primary key on a table, an index on that column(s) is created.
Now we run the query again, and we get the result in 90 milliseconds with the following execution plan:
The Index Scan means that the previously created index is used to search the data. There are many ways where we can go wrong with indexing. Note that the column order of a concatenated (composite) index has great impact on its usability so it must be chosen carefully.
select average_value, instrument_status
where item_code = 1
The index we created previously is composed of three columns (time, station_code, item_code). In Example 1.2 we are using column item_code and we see that it is in the index. But this query executes in 300 milliseconds with the following plan:
The execution plan reveals that the database does not use the index. Instead it performs a FULL TABLE SCAN. As a result, the database reads the entire table and evaluates every row against the where clause. The key to remember is that indexes only speed up queries if you access a continuous portion of this list. Here we access the third column and the index is useless because item_code is located in completely different section depending on station_code and time. If we get to the point to realise that we need to execute many queries with condition upon only column item_code, then we can add another index only for that column. A perfect query for the index we created would be one, where every column of the index is used for the where clause. Let’s give an example for that!
select average_value, instrument_status from measurement where time > '2017-11-29 13:00:00.000000' and station_code = 118 and item_code = 1;
Now we got the results in 84 milliseconds – a considerable improvement!
An index whose definition contains functions or expressions is a called function-based index. Instead of copying data directly into the index, a function-based index applies the function first and puts the result into the index.
select time, average_value
where ceiling(average_value) = 50
We got the results of this query in around 300 milliseconds and in the execution plan we see that full table scan is used:
As the next step let’s create a function-based index:
create index func_index on measurement(ceiling(average_value))
The execution time is now around 93 milliseconds with the following execution plan.
The JOIN operation
SQL join clause is used to combine rows from two tables, based on a related column between them. In most case it returns records which have matching value in both tables. Before creating indexes, let’s give two examples.
select time, average_value, station_name from measurement m join measurement_station ms on m.station_code = ms.station_code
This query executed in 300 milliseconds.
If your query involves WHERE conditions in addition to the join, the database server might apply the WHERE condition before trying to find the matching rows in the second table.
select time, average_value, station_name from measurement m join measurement_station ms on m.station_code = ms.station_code where time > '2017-11-29 13:00:00';
This query executed in 753 milliseconds.
Let us have a look at SQL JOIN
The nested loops join is the most fundamental join algorithm. It works using two nested queries: the outer to fetch the results from one table and a second query for each row from the driving query to fetch the corresponding data from the other table. Query in Example 2.1 tries to find the rows that match on the station_code column and with the nested-loop join strategy, the database server must compare all the station_code values in one table with all the station_code values in the other. The biggest benefit often comes when you have a clustered index on the joining column in one of the tables. We create index on column station_code in the table measurement_station.
For query in Example 2.1 we get the results in 85 milliseconds (compared to the previous 300 milliseconds).
The hash join algorithm aims for the weak spot of the nested loops join: the many traversals when executing the inner query. The indexing strategy for a hash join is very different from the Nested Loops Join because there is no need to index the join columns. Only indexes for independent where predicated improve hash join performance. In Example 2.2 the TIME filter is the only independent where clause, so we create index on this column.
create index indexed_time on measurement(time)
Now, for query in Example 2.2 we get the results in 111 milliseconds (previously 753 milliseconds), with the following explain plan:
The sort-merge join combines two sorted lists like a zipper! A sort-merge join needs the same indexes as the hash join, that is an index for the independent conditions to read all candidate records in one shot.
Sorting and grouping
Sorting is a very resource intensive operation. An index provides an ordered representation of the indexed data. We can use indexes to avoid the sort operation to satisfy an order by clause.
select time, average_value, station_code
where time > ‘2019-01-01 00:00:00’
order by time, average_value
When executing this query we see that the database uses Index Scan with the index we previously created on the column time. But yet, sorting is used and the results are returned in more than 1 second! That is 1.000 milliseconds and very close to the phenomenon known as boredom!
The reason for this is that the database must, however, perform an explicit sort operation to satisfy the corresponding order by clause. We now just have to expand the index definition to include all the columns used in the order by clause:
create index indexed_time on measurement(time, average_value);
Now we get the results in 82 milliseconds and the database omitted the sort operation:
The performance of the commands, insert, delete and update is for the most part negatively influenced by indexes.
The number of indexes on a table is the most dominant factor for insert performance. The more indexes a table has, the slower the execution becomes. The insert statement is the only operation that cannot directly benefit from indexing because it has no where clause. If there are indexes on the table, the database must make sure the new entry is also found via these indexes. For this reason it has to add the new entry to each and every index on that table.
The delete statement has a where clause that can use all the methods described previously to benefit directly from indexes.
An update statement must relocate the changed index entries to maintain the index order. The database must remove the old entry and add the new one at the new location. The response time is basically the same as for delete and insert statements together.
The index-only scan is one of the most powerful tuning methods of all. It not only avoids accessing the table to evaluate the where clause, but avoids accessing the table completely if the database can find the columns in the index itself.
where item_code = 1 and instrument_status = 1
create index covering_index on measurement(item_code, instrument_status, average_value);
Notice that when creating the index we first put the columns used in the where clause and then the column used in the select clause. Order is important! The index covers the entire query so it is also called covering index.
With the help of the covering index the query returned the result in 78 milliseconds.
Let us change the query in Example 4.1. The only change that are we going to made is in the where clause, the AND operator will be changed to OR operator.
where item_code = 1 or instrument_status = 0
By the execution plan we see that our index is not even used.