Postgres for the win! November 4, 2006 | 11:23 am

Had something of a victory this week, and I’ve really got to give kudos to PostgreSQL for it. There’s something magical about taking a piece of technology, pushing it well beyond what you would rationally expect it to be able to handle, and having it not just work, but work well. These sorts of experiences tend to make one a rabid fan of the technology in question. Well, Postgres just did that for me.

Here’s the setup. I got the following problem dumped on me Tuesday. See, there’s this table we wanted to have in the database. Now, each row isn’t that large- we needed a file date (the date the record was created), a symbol (4-8 characters), an expiration date, and two volatility quotients (floats). So we’re only looking at 40-50 bytes per record. But here’s the problem: we’re already generating something like a million records a day, that’s expected to increase, and they want to put six months worth of data, probably more, into the database. So I’m looking at a single table that’ll start with over 100 million rows, and could easily grow to half a billion rows. Worse yet, I need to access the table in at least two different ways- by file date, and by some combination of symbol and expiration date, with the latter being time critical (it’s an interactive application- a human is waiting on this data).
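For concreteness, a row in that table looks roughly like this (the table and column names here, and in the later examples, are my own placeholders rather than the real schema):

    -- Roughly the shape of one row (placeholder names); ~40-50 bytes each.
    CREATE TABLE vol_samples (
        file_date   date        NOT NULL,   -- date the record was created
        symbol      varchar(8)  NOT NULL,   -- 4-8 character symbol
        expiration  date        NOT NULL,
        vol1        real        NOT NULL,   -- the two volatility quotients
        vol2        real        NOT NULL
    );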

Now here’s the punchline- the machine this monstrosity is going on. It’s an HP DL-145 with a 2.2GHz dual-core Opteron and 4G of memory, a nice machine- but only a single 70G 7200RPM SATA drive for storage. Doing this on $300,000 worth of hardware I’d expect- but doing it on $3,000 worth of hardware? On a machine we just had kicking around? What do they think I am, a miracle worker?

Um, yeah. Fortunately for me, I was using PostgreSQL. One miracle, made to order, coming up. Follow me for the details of how I did the (seemingly) impossible.

First, and most importantly, remember that the solution to this sort of problem is to redirect the warp drive plasma conduits through the shield generator and into the transporter. No wait, that’s Star Trek. The most important thing is to partition your table. This is a neat trick Postgres allows to make a host of smaller tables look like one big table. In my case, I can make the single 120 million+ row table look like twenty 6-million-row tables. Better yet, as the table grows towards half a billion records, I can keep subdividing it to keep the per-table size sane.
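Mechanically, partitioning in Postgres 8.1 means table inheritance plus CHECK constraints: a parent table that holds no rows of its own, and a pile of child tables that each hold one slice of the data. Sticking with the placeholder schema above (and jumping ahead a bit to the partitioning key I actually chose), the children look something like this:

    -- Each child inherits the parent's columns; the CHECK constraint
    -- tells the planner which slice of the data lives in it.
    CREATE TABLE vol_samples_a (
        CHECK ( symbol >= 'A' AND symbol < 'B' )
    ) INHERITS (vol_samples);

    CREATE TABLE vol_samples_b (
        CHECK ( symbol >= 'B' AND symbol < 'C' )
    ) INHERITS (vol_samples);

    -- ...and so on, plus a rule or trigger on the parent (as described
    -- in the manual) to route incoming inserts to the right child.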

I have to say the nicest feature of Postgres is its excellent documentation. The online manual is well written, well organized, helpful, and pretty much complete- four attributes I don’t expect of technical manuals, whether open or closed source. In the case of Postgres, it really is the friendly manual. Thirty seconds of googling brought up this page, which includes detailed step-by-step instructions on how to partition a table in Postgres. With complete examples- they didn’t forget or leave out steps. And they explain things well enough that you understand what you’re doing. For the documentation alone I could love this database.

After a little thought and some pointed questions to the future users, I decide to partition the table on the symbol. This is easy, makes for an obvious naming scheme, and actually speeds up the user’s queries. Of the two, I’d rather make the data loads slower to speed up the user queries than the other way around (although it turns out that in this case I get to have my cake and eat it too, which I’ll get to a little later). Specifically, I’m partitioning based on the first character of the symbol. But here’s the cool part- if the user asks for the data for IBM, say, Postgres knows it only needs to look in the table of the I’s. It’s skipping 95% of the data right from the start.
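In 8.1 that trick is called constraint exclusion- it’s off by default, and once it’s on, the planner checks the query against each child’s CHECK constraint and skips the children that can’t possibly match. Roughly:

    -- Let the planner use the CHECK constraints to prune partitions.
    SET constraint_exclusion = on;

    -- This can only match the 'I' child (a vol_samples_i created the same
    -- way as the children above), so the other partitions are never scanned.
    SELECT *
      FROM vol_samples
     WHERE symbol = 'IBM'
       AND expiration = DATE '2006-11-15';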

The other thing I took advantage of to help speed up queries is one of Postgres’ advanced query optimizations, specifically bitmap index scans. See, what Postgres can do is convert an index into a bitmap of which records satisfy a condition. It can then combine these bitmaps with the standard bit operations, like AND and OR. This means that if you have an index on foo and a separate index on bar, Postgres can combine both indexes to optimize queries like foo=x OR bar=y, which can’t be answered from a single index. In addition, single-column indexes are cheaper to maintain than multi-column indexes.
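So I just keep separate single-column indexes on each child table and let the planner combine them. A rough sketch, with placeholder names again:

    -- Two independent single-column indexes on one partition...
    CREATE INDEX vol_samples_a_symbol_idx     ON vol_samples_a (symbol);
    CREATE INDEX vol_samples_a_expiration_idx ON vol_samples_a (expiration);

    -- ...and the planner can still use both for a query like this, by
    -- building a bitmap from each index and combining them (EXPLAIN shows
    -- a BitmapAnd of two Bitmap Index Scans when it picks this plan).
    SELECT *
      FROM vol_samples_a
     WHERE symbol = 'AAPL'
       AND expiration = DATE '2006-11-15';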

Jumping to the end of the story here for a moment, the combination of partitioning the tables and bitmap scans on the indexes means that the common query done by the user (“give me all the samples for IBM with an expiration date of 15 Nov”) completes in well under a second at this point. Better yet, I can keep the table sizes limited by increasing the granularity of the partitioning as the data grows. For 100 million rows, I currently have 20 partitions- as the table grows to half a billion rows, I can increase this (as needed) to 100 partitions and keep the partition size the same.

So with the user query speed well in hand, I turn my attention to the data import speed. This looks to be a problem due to the nature of the input file I have to deal with. It’s not just that it’s a million records- it’s that it’s a million unsorted records. Or rather, they’re sorted by when the trades occurred, which isn’t at all helpful. So MSFT records are followed by IBM records, expiration dates jump all over the place, and so on. If I try to just insert them directly into the main table, what’ll happen is that I’ll insert an IBM record, which means I need to read in the I table and its indexes. Then I insert a MSFT record, which means I need to read in the M table and its indexes, which will basically crowd the I table and its indexes out of memory. So the next time I try to insert another IBM record, I’ll be reading that table back into memory all over again. Inserts will take frimping forever. This is a problem not only because I need to insert a million new records each day, but also because I’ve got six months of back data I need to insert. And bosses going “how soon can we get this up and running?”

I toy briefly with the idea of dropping and recreating the indexes, but a) this is still pretty slow, b) it encodes the structure of the table and the number and names of indexes into the load script, which is bad style and a maintenance nightmare, and c) it turns out not to be necessary. Returning once again to the friendly manual, I devise the following approach for importing the data:

1) Create a temporary table with the same columns as the main table. I don’t even need to encode what these columns are (a maintenance headache) because I can use the LIKE clause of Postgres’ CREATE TABLE command. My create table command is just “CREATE TEMP TABLE temptable (LIKE maintable)”.

2) I use the COPY command to blast data into this temporary table at tens of thousands of rows a second. This is especially nice as the data file I’m reading from is effectively already in CSV format, so very little modification of the data needs to be done- I’m mainly just reading data from the disk and blasting it into the database.

3) I then do an INSERT from a SELECT, with an ORDER BY on the select, to copy the contents of the temp table into the main table. This is just “INSERT INTO maintable SELECT * FROM temptable ORDER BY symbol”. The ORDER BY causes Postgres to sort the inserts for me. So first it inserts all the symbols starting with ‘A’, reading in the A table and its indexes. Then it inserts all the symbols starting with ‘B’, which reads in the B table and flushes out the A table- but that’s OK, as we’re done with the A table anyway. I’m making maximal use of the disk cache this way.

4) When the insert is done, I drop the temp table and exit. (The whole script is sketched below.)
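Put together, the nightly load boils down to something like this (the file path is a placeholder, and it assumes the insert-routing rules or triggers on the parent table are already in place):

    BEGIN;

    -- 1) Temp table with the same columns as the main table.
    CREATE TEMP TABLE temptable (LIKE maintable);

    -- 2) Bulk-load the day's file straight into the temp table.
    --    (COPY reads a server-side file; from psql you'd use \copy instead.)
    COPY temptable FROM '/path/to/todays_data.csv' WITH CSV;

    -- 3) Move it into the partitioned table, sorted so each partition and
    --    its indexes get read in once and stay in the disk cache.
    INSERT INTO maintable SELECT * FROM temptable ORDER BY symbol;

    -- 4) Temp tables vanish at the end of the session anyway, but be tidy.
    DROP TABLE temptable;

    COMMIT;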

Doing the inserts this way is reasonably simple, maintenance friendly, and fast- inserting a million new rows only takes 5-10 minutes. Which means inserting the six months of backlog took me something like 10-20 hours. The merely difficult we do immediately; the impossible you’ll have by the time you get in tomorrow.

At this point, the only complaint I have is that Postgres is doing so much with so little that I’ll never get cool hardware to play with. I’ll be stuck with cheap low-end boxes forever. Sigh.

The next person who claims “Postgres is slow” is going to get an earful from me. Maybe Postgres 7.0 was slow, I don’t know- but I know for damned sure that Postgres 8.1 isn’t. There are, I think, three things that make Postgres seem slow:

1) The default configuration. I joke that the default configuration of Postgres was designed for a 386 with 16 meg of memory. This isn’t exactly true (it’s more like a 486 with 32M of memory), but it makes my point. The default configuration of Postgres is seriously minimal- it’s good for getting the database up and making sure it’s running, but it’s not going to provide decent performance at all. For that, you need to go in and tune it a little bit, telling it how much of the machine’s resources it’s allowed to use. Which is good- you may be running the database on a machine that’s also running other things, and want to limit the resource utilization of the database. But if you’re reading a performance comparison between Postgres and some other database, and it starts with “we tested both databases with their default configuration…” you can stop reading at that point.
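For the record, the knobs I mean are the handful of memory settings in postgresql.conf. Something in this general neighborhood for a mostly-dedicated 4G box (8.1 takes most of these in 8KB pages or in kilobytes, and the exact numbers are ballpark, not gospel):

    # postgresql.conf - rough numbers for a dedicated box with 4G of RAM
    shared_buffers = 50000            # 8KB pages, so ~400MB of buffer cache
    work_mem = 32768                  # KB per sort/hash, per backend
    maintenance_work_mem = 262144     # KB; helps index builds and VACUUM
    effective_cache_size = 300000     # 8KB pages; hint at the OS file cache
    checkpoint_segments = 32          # fewer, bigger checkpoints for bulk loads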

2) Lack of transactional awareness. Transactions change things. Especially with Postgres, where every command that isn’t part of another transaction is its own transaction, and transactions are ACID. So, in Postgres, when you do a single INSERT statement, it is its own transaction, and doesn’t complete until it’s written out to disk. So you can only do as many INSERTs per second as you can do seeks per second. This means you’re only going to get a few hundred INSERTs per second if you’re lucky, if you’re not aware of transactions. Which could easily give you the impression that Postgres is dog-slow. However, once you’re aware of transactions, you can speed things up enormously by doing more per transaction. Grouping 100 inserts into a single transaction makes the inserts about 100x faster, as the cost of the transaction is spread over all 100 inserts- and this still keeps you ANSI SQL compliant. If you’re willing to ditch ANSI SQL and be Postgres specific, you can use things like the COPY command, which is significantly faster yet.
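The difference is literally just where the BEGIN and COMMIT go (the table and values below are made up for illustration):

    -- One implicit transaction per row: every INSERT waits for its own
    -- write to disk, so you get a few hundred rows per second at best.
    INSERT INTO vol_samples VALUES ('2006-11-03', 'IBM', '2006-11-15', 0.23, 0.25);
    INSERT INTO vol_samples VALUES ('2006-11-03', 'IBM', '2006-12-15', 0.21, 0.24);

    -- Grouped: one disk flush pays for all 100 rows, and it's still
    -- perfectly ordinary ANSI SQL.
    BEGIN;
    INSERT INTO vol_samples VALUES ('2006-11-03', 'IBM', '2006-11-15', 0.23, 0.25);
    INSERT INTO vol_samples VALUES ('2006-11-03', 'IBM', '2006-12-15', 0.21, 0.24);
    -- ...98 more inserts...
    COMMIT;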

3) Most performance tests don’t really push the scalability of a database. OK, your database is really fast on tables with 10,000 rows, so what? On modern hardware, just about any database will be fast enough on small tables. Performance becomes important when you’re dealing with large things- either really large tables or lots and lots of connections. If the latter, transactions stop being optional- the database has to be consistent despite the asynchronous actions of the user. And if it’s the former, well- doing well on small tables does not necessarily mean that you do well on large tables. A table with 100 million rows in it is a radically different beast than a table with 10 thousand rows.

I’ve pushed Postgres on performance. And rather than finding it slow, I’ve found it shockingly fast and scalable. The traditional reward for a job well done is usually another, harder job. But I’m not worried. I’m using Postgres. The worst that’ll happen is that I’ll get an excuse to actually get new hardware.

  • Araneae

    “it’s own transaction” should be “its own transaction”

  • davi koscianski vidal

Sorry for resurrecting this post.

    You said (and I quote you again) that “If I had to do it over again, I wouldn’t have partitioned this table. … Basically, I was prematurely optimizing.”.

Currently I have an 85-million-row table that I’m considering partitioning, but only because I don’t see an alternative (I can’t delete the records, but I can “delete” them into another table… like a partition). (As a test, I deleted ~25 million records and saw a performance gain of about 8x.)

    So, my question is: what would you have done?

EDIT: this isn’t the first time I’ve read your post. The first time I read it was last year, when I was considering partitioning this very same table, but we had so few rows that I dismissed it. So, thank you very much for sharing.