Numerous web services, including Lob, rely on PostgreSQL for fast and trusted relational data storage. For simple services with a relatively low number of rows in the database, PostgreSQL’s default settings work nicely right out of the box. However, as table sizes grow into the millions of rows, more customized settings are required to ensure optimal performance. As our database grew rapidly and continuously, Postgres performance degraded to the point where we had to adopt aggressive auto-vacuum and auto-analyze settings. In this post I’ll go into detail about our specific problem, the tools we used to diagnose it, and how it was ultimately solved.
Make sure your largest database tables are vacuumed and analyzed frequently by setting stricter table-level auto-vacuum settings. Below is an example that triggers a VACUUM after 5,000 updates or deletes and an ANALYZE after 5,000 inserts, updates, or deletes.
ALTER TABLE table_name SET (autovacuum_vacuum_scale_factor = 0.0);
ALTER TABLE table_name SET (autovacuum_vacuum_threshold = 5000);
ALTER TABLE table_name SET (autovacuum_analyze_scale_factor = 0.0);
ALTER TABLE table_name SET (autovacuum_analyze_threshold = 5000);
At Lob, we’ve built an internal website to track business metrics, assist our customer support team, and track the order status of our Postcard API and our Letter API. Aggregating some of this data, like monthly revenue, requires complex database calls against tables containing millions of rows. We’re accustomed to these queries taking several seconds, and for the most part we’re fine with that, given that the site is not customer facing. Performance is not our primary concern here.
As our tables continued to grow, however, queries began to take 20 seconds or more. The internal website became unusable.
We began to diagnose the cause of the slow queries by running them on the database with different parameters and monitoring execution times. We looked for patterns to relate the types of queries to their latency. An interesting correlation between a date filter and performance surfaced. A query that fetched all rows inserted over a month ago would return in ~1 second, while the same query run on rows from the current month was taking 20+ seconds.
With this discovery, the next step was to figure out why the performance of these queries differed so much. PostgreSQL’s EXPLAIN statement was an essential tool. When Postgres receives a query, the first thing it does is plan how the query will be executed, based on its knowledge of the table structure, size, and indices. Prefixing any query with EXPLAIN will print out this execution plan without actually running it.
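For illustration, here’s a simplified sketch of the kind of query we were debugging; the real queries were more involved, but the tables and columns here are the same ones that appear in the plans below:
-- Prefixing the query with EXPLAIN prints the planner's chosen execution plan
-- without actually executing the query.
EXPLAIN
SELECT date(charges.date_created) AS day, count(*)
FROM charges
JOIN accounts ON charges.account_id = accounts.id
WHERE NOT charges.deleted
  AND NOT accounts.admin
  AND charges.date_created > '2015-03-16'
  AND charges.date_created < '2015-03-24'
GROUP BY date(charges.date_created)
ORDER BY day;
EXPLAIN ANALYZE goes one step further: it actually runs the query and reports real row counts and timings alongside the planner’s estimates.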
When we compared the outputs of EXPLAIN for the fast and slow queries, the problem became immediately evident. When joining two tables on an indexed foreign key, Postgres was performing an efficient Hash Join for the fast-running query and an inefficient Nested Loop for the slow one.
Here’s the output for the fast query:
Sort  (cost=43509.92..43523.14 rows=5291 width=20)
  Sort Key: (date(charges.date_created))
  ->  HashAggregate  (cost=43116.55..43182.69 rows=5291 width=20)
        ->  Hash Join  (cost=186.57..43076.07 rows=5397 width=20)
              Hash Cond: ((charges.account_id)::text = (accounts.id)::text)
              ->  Seq Scan on charges  (cost=0.00..42774.71 rows=5409 width=41)
                    Filter: ((NOT deleted) AND (date_created > '2015-03-16 00:00:00'::timestamp without time zone) AND (date_created < '2015-03-24 00:00:00'::timestamp without time zone))
              ->  Hash  (cost=121.32..121.32 rows=5220 width=21)
                    ->  Seq Scan on accounts  (cost=0.00..121.32 rows=5220 width=21)
                          Filter: (NOT admin)
And here’s the EXPLAIN output for the slow query:
GroupAggregate  (cost=42961.29..42961.32 rows=1 width=20)
  ->  Sort  (cost=42961.29..42961.30 rows=1 width=20)
        Sort Key: (date(charges.date_created)), charges.type
        ->  Nested Loop  (cost=0.00..42961.28 rows=1 width=20)
              Join Filter: ((charges.account_id)::text = (accounts.id)::text)
              ->  Seq Scan on charges  (cost=0.00..42774.71 rows=1 width=41)
                    Filter: ((NOT deleted) AND (date_created > '2015-03-17 00:00:00'::timestamp without time zone) AND (date_created < '2015-03-22 00:00:00'::timestamp without time zone))
              ->  Seq Scan on accounts  (cost=0.00..121.32 rows=5220 width=21)
                    Filter: (NOT admin)
Note that the two queries were identical except for the date ranges used to filter the rows.
As mentioned above, when Postgres builds a query plan, it optimizes based on what it knows about the structure and size of the database. That knowledge comes from statistics gathered when a table is analyzed, and it is not always up to date. Without accurate statistics for a table, the planner can choose suboptimal execution plans. In our case, the statistics did not yet cover the most recently inserted rows: you can see in the slow plan above that the planner expected the Seq Scan on charges to return a single row (rows=1), which made a Nested Loop look cheap. This explained how the same query was fast for older rows, which the statistics described accurately, and slow for the newest ones.
The solution was to VACUUM and ANALYZE the table. Vacuuming reclaims the space occupied by dead rows, and analyzing refreshes the statistics the query planner relies on. We saw an immediate decrease in execution time for our complex queries, and as a result, a much more usable internal website.
VACUUM ANALYZE table_name;
You can check the last time your tables were vacuumed and analyzed with the query below. In our case, we had tables that hadn’t been cleaned up in weeks.
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_all_tables
WHERE schemaname = 'public';
To prevent our tables from continually getting into this state and having to manually VACUUM ANALYZE, we made the auto-vacuum settings for these tables stricter. Postgres runs an autovacuum daemon that regularly vacuums and analyzes tables on its own. By default, a table is auto-vacuumed once the dead rows left behind by updates and deletes exceed 20% of the table plus 50 rows, and auto-analyzed once inserted, updated, or deleted rows exceed 10% of the table plus 50 rows. These defaults work fine for smaller tables, but once a table grows to millions of rows, hundreds of thousands of rows can change before it is vacuumed and analyzed.
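If you want to see what your cluster’s current autovacuum defaults are, they can be read from pg_settings (a generic check, not specific to our setup):
-- Cluster-wide autovacuum settings, including thresholds and scale factors.
SELECT name, setting
FROM pg_settings
WHERE name LIKE 'autovacuum%';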
In our case, we set much more aggressive thresholds for our largest tables, using the commands below. With these settings, a vacuum is triggered after roughly 5,000 rows are updated or deleted, and an analyze after roughly 5,000 rows are inserted, updated, or deleted.
ALTER TABLE table_name
SET (autovacuum_vacuum_scale_factor = 0.0);
ALTER TABLE table_name
SET (autovacuum_vacuum_threshold = 5000);
ALTER TABLE table_name
SET (autovacuum_analyze_scale_factor = 0.0);
ALTER TABLE table_name
SET (autovacuum_analyze_threshold = 5000);
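To confirm the overrides took effect, the per-table storage parameters can be read back from pg_class (again, a generic query; substitute your own table name):
-- Per-table autovacuum overrides are stored as storage parameters in pg_class.reloptions.
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'table_name';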
The threshold to auto-vacuum is calculated by:
vacuum threshold = autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * number of rows in table
Similarly, the threshold to auto-analyze is calculated by:
analyze threshold = autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * number of rows in table
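To make the difference concrete: for a table with 1,000,000 rows, the default settings (threshold 50, scale factor 0.2 for vacuum) give a vacuum threshold of 50 + 0.2 * 1,000,000 = 200,050 dead rows, while our settings (threshold 5,000, scale factor 0.0) give 5,000 + 0.0 * 1,000,000 = 5,000 rows, no matter how large the table grows.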
This is one of the many challenges we’ve tackled at Lob as our traffic and business continue to grow. Since Lob is ushering the analog processes of direct mail into the digital age with our APIs, our engineering team faces a unique and rewarding set of obstacles.
For more information about PostgreSQL vacuum and analyze settings, visit: https://www.postgresql.org/docs/9.2/routine-vacuuming.html.