
A look at materialized views in PostgreSQL 9.3

I needed to use materialized views for the first time in order to handle performance problems our team encountered while developing a reporting tool for one of our clients.

The database, a MySQL 5.1 instance, already contained several views, and these caused us a lot of performance bottlenecks, especially as the data continued to grow. Because MySQL doesn't have an out-of-the-box materialized view feature, it took me some time to create and maintain materialized views to replace the normal views and get good performance.

When I started learning PostgreSQL a couple of weeks ago, the feature that attracted me most was its support for creating and maintaining materialized views.

Database Views

Most people are familiar with the concept of database views, which create a virtual table from the result of a SELECT query. A view is usually used for the following (a small example follows the list):

  • Security: granting a certain user permission on a subset of a table’s data rather than on the whole table.
  • Hiding the complexity of a query: encapsulating all the complex query parts in the view definition.
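
For example, a minimal sketch of the security use case (the view and role names here are hypothetical; the columns come from the users table shown later in this post):

create view user_contacts as
    select login, email, phone
    from users;

-- the reporting role may read the view but not the underlying users table
grant select on user_contacts to reporting_user;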

But views can be a major performance bottleneck in databases such as MySQL – especially when built upon other views – so the concept of materialized views came to light in order to keep the advantages of views while eliminating the performance troubles they may cause.

What are materialized views and what are their benefits?

The definition of a materialized view (abbreviated as MV) on Wikipedia is:

A materialized view is a database object that contains the results of a query. For example, it may be a local copy of data located remotely, or may be a subset of the rows and/or columns of a table or join result, or may be a summary based on aggregations of a table’s data

So, in short, a materialized view is an object stored in the database (like an actual table) that holds the precomputed result of a SELECT query as its data. Because it is actually stored on disk, queries based on it benefit from the cached results without re-executing the complex underlying query each time you use the materialized view, and, more importantly, you can create indexes on it.

The concept is simple in theory, and an MV can initially be built (as a plain table) using a simple query like this:

create table test_mv as
    select col_name, count(*) as cnt
    from test
    where col1 = 1
    -- ... any more complicated conditions
    group by col_name;

But the most important issue is how to get updated data into the materialized view (how to refresh the MV). A materialized view can be refreshed either completely (by re-executing the underlying query) or incrementally (by executing the query only over the subset of records that changed recently); a hand-rolled complete refresh is sketched below.
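
To make the difference concrete, here is a minimal sketch of a hand-rolled complete refresh for the test_mv table above (DELETE rather than TRUNCATE keeps the swap inside a single transaction on engines that support it; TRUNCATE implicitly commits on MySQL):

begin;
-- throw away the stale contents
delete from test_mv;
-- re-execute the full underlying query
insert into test_mv
    select col_name, count(*) as cnt
    from test
    where col1 = 1
    group by col_name;
commit;

An incremental refresh would instead restrict the query to rows whose modification timestamp is newer than the last refresh; a sketch of that approach appears later in this post.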

Unfortunately MySQL don’t have built in functionality to that materialized view till now, although there are some interesting open source projects like flexviews.

Example database structure

In order to try out materialized views in PostgreSQL, I migrated a subset of my test data to Postgres. Here is the structure:

I have a users table with the following structure:

mt4data=> \d+ users;
                                                    Table "public.users"
       Column       |            Type             |              Modifiers               | Storage  | Stats target | Description
--------------------+-----------------------------+--------------------------------------+----------+--------------+-------------
 login              | numeric                     | not null                             | main     |              |
 GROUP              | character(16)               | not null                             | extended |              |
 zipcode            | character(16)               | not null                             | extended |              |
 address            | character(128)              | not null                             | extended |              |
 phone              | character(32)               | not null                             | extended |              |
 email              | character(48)               | not null                             | extended |              |
 comment            | character(64)               | not null                             | extended |              |
 id                 | character(32)               | not null                             | extended |              |
 agent_account      | numeric                     | not null                             | main     |              | 
 modify_time        | timestamp without time zone | not null                             | plain    |              |
-- Other columns

and here is the structure of the trades table, which holds all trades made by users in the trading system. The sample data I tested on for this article contained about 700,000 rows:

mt4data=> \d+ trades;
                                              Table "public.trades"
      Column      |            Type             |          Modifiers          | Storage  | Stats target | Description
------------------+-----------------------------+-----------------------------+----------+--------------+-------------
 ticket           | numeric                     | not null                    | main     |              |
 login            | numeric                     | not null                    | main     |              |
 symbol           | character(16)               | not null                    | extended |              |
 cmd              | numeric                     | not null                    | main     |              |
 volume           | numeric                     | not null                    | main     |              |
 open_time        | timestamp without time zone | not null                    | plain    |              |
 open_price       | double precision            | not null                    | plain    |              |
 close_time       | timestamp without time zone | not null                    | plain    |              |
 profit           | double precision            | not null                    | plain    |              |
 comment          | character(32)               | not null                    | extended |              |
 modify_time      | timestamp without time zone | not null                    | plain    |              |

Now, I want to generate some reports about a special type of trade, “deposits”. Here is the query:

select trades.TICKET AS ticket,
    trades.LOGIN AS login,
    users."GROUP" AS group_name,
    trades.VOLUME AS volume,
    users.AGENT_ACCOUNT AS agent_account,
    users.ZIPCODE AS zipcode,
    users.STATUS AS status,
    trades.CLOSE_TIME AS close_time,
    trades.PROFIT AS amount,
    trades.COMMENT AS comment,
    trades.MODIFY_TIME
from
    trades, users
where
    (users.LOGIN = trades.LOGIN)
        and (trades.CMD = 6)
        and (trades.PROFIT > 0)
and (
 trades.comment like 'DPST%' or
 trades.comment like 'Bonus%' or
 trades.comment like 'Credit%' or
 trades.comment like 'Deposit%'
)

As you can see, the query used to retrieve this data contains many conditions and can be expensive to execute. Here is the query plan, as shown by EXPLAIN:

 Nested Loop  (cost=4874.69..67486.46 rows=42 width=127)
   ->  Bitmap Heap Scan on trades  (cost=4874.41..67189.44 rows=42 width=72)
         Recheck Cond: (cmd = 6::numeric)
         Filter: ((profit > 0::double precision) AND ((comment ~~ 'DPST%'::text) OR (comment ~~ 'Bonus%'::text) OR (comment ~~ 'Credit%'::text) OR (comment ~~ 'Deposit%'::text)))
         ->  Bitmap Index Scan on i_cmd  (cost=0.00..4874.40 rows=70186 width=0)
               Index Cond: (cmd = 6::numeric)
   ->  Index Scan using login_index on users  (cost=0.28..7.06 rows=1 width=60)
         Index Cond: (login = trades.login)
(8 rows)

The query took about 25 seconds to execute on my data sample, so on the actual data it would take much longer.
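
(Timing figures like this are easy to reproduce: psql reports the elapsed time of every statement once the \timing switch is turned on.)

mt4data=> \timing
Timing is on.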

In Postgres 9.3, I can create the materialized view by issuing:

create materialized view deposits as 
select trades.TICKET AS ticket,
    trades.LOGIN AS login,
    users."GROUP" AS group_name,
    trades.VOLUME AS volume,
    users.AGENT_ACCOUNT AS agent_account,
    users.ZIPCODE AS zipcode,
    users.STATUS AS status,
    trades.CLOSE_TIME AS close_time,
    trades.PROFIT AS amount,
    trades.COMMENT AS comment,
    trades.MODIFY_TIME
from
    trades, users
where
    (users.LOGIN = trades.LOGIN)
        and (trades.CMD = 6)
        and (trades.PROFIT > 0)
and (
 trades.comment like 'DPST%' or
 trades.comment like 'Bonus%' or
 trades.comment like 'Credit%' or
 trades.comment like 'Deposit%'
)

The initial build of my materialized view took almost 20 seconds on my sample data (roughly the same as the initial build on MySQL).

Here I added my required indexes on the new MV:

mt4data=> create index login_deposit_index on deposits  using btree(login);
CREATE INDEX
mt4data=> create unique index ticket_deposit_index on deposits  using btree(ticket);
CREATE INDEX                              
mt4data=> create index close_time_index on deposits  using btree(close_time);
CREATE INDEX

The number of records in the deposits materialized view is:
mt4data=> select count(*) from deposits;
 count
-------
   176
(1 row)

Now I can issue low-cost queries against the materialized view, like this one that gets the number and total amount of deposits per group:
mt4data=> explain select group_name, count(*), sum(amount) from deposits group by group_name;
                           QUERY PLAN
-----------------------------------------------------------------
 HashAggregate  (cost=7.08..7.60 rows=52 width=25)
   ->  Seq Scan on deposits  (cost=0.00..5.76 rows=176 width=25)
(2 rows)

As the EXPLAIN output shows, the estimated cost indicates fast execution of the query.

The biggest advantage of materialized views in PostgreSQL 9.3 is the built-in refresh of the MV data. Prior to this version, the refresh process had to be implemented manually: either by monitoring changes to the underlying tables using triggers (which can add performance overhead in environments with a high write rate), or by writing a custom script or stored procedure that checks for modifications based on timestamps. This can be troublesome and somewhat complicated, depending on your table and query structure; a sketch of the timestamp-based approach follows.
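
For illustration, the timestamp-based variant for the deposits data could look roughly like this. This is only a sketch: it assumes deposits is a plain table rather than a 9.3 MV, and :last_refresh is a placeholder for a timestamp your script would track between runs:

-- remove rows whose source trades have changed since the last refresh
delete from deposits
where ticket in (select ticket from trades where modify_time > :last_refresh);

-- re-insert the fresh versions of those rows
insert into deposits
select trades.TICKET, trades.LOGIN, users."GROUP", trades.VOLUME,
       users.AGENT_ACCOUNT, users.ZIPCODE, users.STATUS,
       trades.CLOSE_TIME, trades.PROFIT, trades.COMMENT, trades.MODIFY_TIME
from trades, users
where users.LOGIN = trades.LOGIN
  and trades.CMD = 6
  and trades.PROFIT > 0
  and (trades.comment like 'DPST%' or trades.comment like 'Bonus%' or
       trades.comment like 'Credit%' or trades.comment like 'Deposit%')
  and trades.MODIFY_TIME > :last_refresh;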

After I modified the underlying trades table by inserting another chunk of data, I refresh the materialized view using the REFRESH MATERIALIZED VIEW command as follows:

mt4data=> refresh materialized view deposits;
REFRESH MATERIALIZED VIEW

Now the materialized view is updated, and I can see the new number of records:
mt4data=> select count(*) from deposits;
 count
-------
   731
(1 row)

The rebuild of the materialized view took about 3-5 seconds on my machine, which is fast compared to a custom procedure I wrote in MySQL to perform similar functionality.

According to the PostgreSQL documentation, the refresh performed on an MV is currently a complete refresh, meaning that the query used in the MV definition is re-executed and the data is filled anew; incremental refresh is expected in future releases (or as a patch), which should make the refresh process much more efficient.
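
(Worth noting on those future releases: PostgreSQL 9.4 later added a CONCURRENTLY option, which still re-runs the full query but lets readers keep querying the MV while the refresh runs. It requires a unique index on the MV, which the ticket_deposit_index created above already satisfies:)

refresh materialized view concurrently deposits;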

Finally,

As you can see in this post, PostgreSQL 9.3 offers a simple and efficient way to handle materialized views, although more performance improvement can be attained once the incremental refresh feature is implemented. I hope MySQL offers a similar feature in the near future, since it is a very useful feature, frequently needed by database developers and administrators.