PostgreSQL vs. MySQL
Building better databases
By Brian Jepson
To many people, PostgreSQL and MySQL seem like similar, interchangeable databases. Both are quickly gaining popularity. Based on the track records of older versions, there's much debate over PostgreSQL's speed and MySQL's durability. But times have changed, and each database has progressed. On both counts, the two packages are the closest they've ever been, so a developer choosing between them for a Web application doesn't always have a clear winner.
MySQL's claim to fame is that it provides a reasonable set of features, such as built-in SQL functions, that follow the 80/20 rule: It has the 20 percent of SQL capabilities that are needed for 80 percent of database applications. Developers of simple applications can live without the remaining features, such as stored procedures and subqueries, or can work around their absence with creative client-side programming.
PostgreSQL, on the other hand, provides more features than MySQL. These include more SQL functions, server-side procedural languages, and sophisticated methods for date manipulation. PostgreSQL also offers object-relational capabilities and geometric data types. If you're developing an application that has highly complex business rules, PostgreSQL lets you handle business logic on the database server.
The ACID Test
A good way to differentiate databases and test overall quality is to perform an ACID test. ACID is an acronym that describes four properties of a robust database system: atomicity, consistency, isolation, and durability. These properties are scoped to a transaction, which is a unit of work that the programmer defines and that can combine one or more database operations:
1. Atomicity is an all-or-nothing proposition. Suppose you define a transaction that contains an UPDATE, an INSERT, and a DELETE statement. With atomicity, these statements are treated as a single unit, and there are only two possible outcomes: either they all change the database or none of them do. This is important in situations like bank transactions, where transferring money between accounts could result in disaster if the server were to go down after a DELETE statement but before the corresponding INSERT statement.
2. Consistency guarantees that a transaction never leaves your database in a half-finished state. If one part of the transaction fails, all of the pending changes are rolled back, leaving the database as it was before you initiated the transaction. For instance, when you delete a customer record, you should also delete all of that customer's records from associated tables (such as invoices and line items). A properly configured database wouldn't let you delete the customer record, if that meant leaving its invoices, and other associated records stranded.
3. Isolation keeps transactions separated from each other until they're finished. Transaction isolation is generally configurable in a variety of modes. For example, in one mode, a transaction blocks until the other transaction finishes. In a different mode, a transaction sees a snapshot of the database from before the other, still-running transaction started. Suppose a user deletes a customer, and before the customer's invoices are deleted, a second user updates one of those invoices. In a blocking scenario, the second user would have to wait for the first user's deletions to complete before issuing the update. The second user would then find out that the customer had been deleted, which is much better than losing changes without knowing about it.
4. Durability guarantees that the database keeps track of pending changes in such a way that the server can recover from an abnormal termination. Hence, even if the database server is unplugged in the middle of a transaction, it will return to a consistent state when it's restarted. The database handles this by recording changes in a transaction log before applying them to the data files. By virtue of atomicity and consistency (explained above), a partially completed transaction won't survive an abnormal termination. When the database is restarted after such a termination, it examines the transaction log for transactions that were committed but whose changes hadn't yet reached the data files, and reapplies them.
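The all-or-nothing behavior described above can be sketched with Python's sqlite3 module (SQLite is transactional, so it serves as a convenient stand-in here; the accounts table and the transfer logic are illustrative, not from the article):

```python
import sqlite3

# Run in autocommit mode so we can issue BEGIN/COMMIT/ROLLBACK ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])

def transfer(conn, src, dst, amount):
    """Move funds between accounts as one all-or-nothing unit of work."""
    conn.execute("BEGIN")
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        (remaining,) = conn.execute("SELECT balance FROM accounts WHERE id = ?",
                                    (src,)).fetchone()
        if remaining < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, dst))
        conn.execute("COMMIT")    # both updates take effect together
    except Exception:
        conn.execute("ROLLBACK")  # neither update survives
        raise
```

If anything goes wrong between BEGIN and COMMIT, the rollback leaves both balances exactly as they were, which is the two-outcomes guarantee the list above describes.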
PostgreSQL is ACID compliant. The standard table handler for MySQL is not, because it doesn't support consistency, isolation, or durability; it does, however, support atomicity using table locks. Fortunately, there are table handlers available for MySQL that provide various degrees of compliance. As I'll describe later, NuSphere's Gemini table handler is fully compliant. So are the Berkeley DB and InnoDB table handlers included in the most recent versions of MySQL. If you want to use these table handlers, you need to either obtain a binary distribution that has them compiled in, or compile MySQL yourself with the handlers enabled. The MySQL documentation for CREATE TABLE gives instructions on how to use one of these alternative table handlers (at www.mysql.com/documentation).
Because of its limited feature set, MySQL is very fast. You can make it blindingly fast if your application can adapt to things like in-memory tables. As far as durability is concerned, you might lose some data if the plug is pulled in the middle of a transaction. With PostgreSQL, you get many more features and can be confident that your data is safe. But if you use all of those features at once, your application's performance suffers. Fortunately, PostgreSQL has recently made significant performance improvements.
Although PostgreSQL and MySQL are open-source products, there are many commercial distributions of both databases, each with its own features. There are also a number of proprietary databases. For a listing of such databases, see the sidebar, "When Open Source Isn't Enough."
I'm comfortable with the open-source versions because of the level of support I can find in the development community. But then again, I'm not afraid of compiling and installing my own binaries, applying patches by hand, or working with experimental versions of software. Many people aren't that entrenched in the community or would rather call a support number than browse an unregulated message board, in which case the commercial distributions offer an advantage.
No commercial distribution will eliminate your need for a database administrator (DBA). However, a commercially supported database lets your DBA concentrate on what's inside your database, instead of worrying about which combination of C library, kernel, and phase of the moon you need to keep things running smoothly when your database is at its busiest. There are several commercial distributions you might consider: Great Bridge, PostgreSQL Inc., NuSphere, AbriaSoft, and MySQL AB are among the most popular.
Great Bridge offers a commercially supported distribution of PostgreSQL for several Linux distributions. It has a boxed set that includes a CD and manual. You can also download the distribution from Great Bridge's Web site. In addition to its PostgreSQL distribution, Great Bridge offers support and consulting services for PostgreSQL users.
Great Bridge also hosts a PostgreSQL developer site. The site offers free project hosting for open-source projects related to PostgreSQL, including a CVS archive, bug tracking, mailing lists, and other collaborative tools. According to a January 2001 Linux Weekly News interview with Bruce Momjian, vice president of Great Bridge, all of the code that Great Bridge develops is released under an open source license.
PostgreSQL, Inc. was formed in 1999 to provide database hosting and support services for PostgreSQL users. It also develops new features for PostgreSQL and contributes portions of its profits to the PostgreSQL project. In December 2000, the company released an open-source replication server for PostgreSQL.
PostgreSQL, Inc. has left open the possibility that it may develop proprietary software in the future. However, it has made a public commitment that any proprietary software it develops will become open source within two years.
NuSphere sells a distribution of MySQL along with various support options. It also offers training and consulting services. NuSphere MySQL includes an enhanced table handler called Gemini that's based on a database engine from Progress Software, NuSphere's parent company. Gemini tables provide robust transaction support, including a transaction log that automatically recovers updates that were in progress during an abnormal termination.
Although Gemini tables aren't part of the MySQL source distribution, the MySQL documentation states that they'll be incorporated into a future version of the database. Until that time, NuSphere is the sole provider of Gemini tables.
NuSphere's MySQL distribution includes a printed manual, some pocket references, and software for Linux, Solaris SPARC, and Windows. You can also download NuSphere from the NuSphere Web site. In addition to the MySQL database server, its distribution includes the Apache Web server (with SSL support), PHP, Perl, and graphical administration tools.
AbriaSoft offers the Merlin server and desktop product that combines MySQL, Perl, PHP, and Apache. SSL support is available in the server version, which also includes a variety of administration and development packages. Merlin runs on either Windows or Linux, and is available for purchase or download. AbriaSoft offers a variety of support packages, consulting services, and training.
MySQL AB distributes MySQL under the GNU General Public License (GPL), and it also provides commercial training, consulting, and support. Commercial licenses are available for MySQL in case you want to do something with it that the GPL doesn't let you do (such as embed MySQL in non-free software). Although this sounds like it conflicts with the GPL, the copyright holder (in this case, MySQL AB) can make its software available under more than one license.
One of the larger areas in which PostgreSQL and MySQL differ is in the functions you can use in a SQL statement. Because SQL is a standard data query language, you would expect to be able to apply it equally across different databases. Unfortunately, adherence to a standard is one place where SQL databases need the most improvement. They all agree on the fundamentals, such as what a basic SELECT, INSERT, UPDATE, or DELETE statement looks like. However, once you step beyond the basics, implementations start to diverge, differing on syntax as well as on which features are actually supported.
Typically, you can accomplish the same things with any database; it just takes a little work. For instance, date formats can be tricky. Many database systems are somewhat flexible in the date formats that they support, but not all systems agree on the same format. For a comprehensive look at the date issue, read my November 2000 Web Techniques feature article, "Same Time Next Month?"
As I mentioned earlier, PostgreSQL supports a richer SQL dialect than MySQL. One way in which they differ is PostgreSQL's support for SQL subqueries. Developers often use subqueries to perform complex set manipulations. Whereas the SELECT statement lets you generate dynamic data sets based on complex criteria, subqueries take this a step further by letting dynamic data sets interact with each other in more sophisticated ways. For example, suppose you have two tables, one for employees and one for salaries, linked on a key called employee_id. You want the same key in both tables so that you can always find an employee's salary by joining the tables on that key. Now suppose you want to reduce the salaries of your highest paid employees by 75 percent. To do this, you need to retrieve the maximum salary, store the value somewhere, and use it when performing the update, as shown in the accompanying Java code.
This approach isn't great, because the two operations (retrieval and update) aren't combined into an atomic unit. In theory, someone could get a raise between the time you retrieved the top salary and the time you downgraded those people, of whom everyone has been jealous all along. If you use a subquery, you can combine the entire operation into a single statement (see Example 3).
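The article's subquery example isn't reproduced here, but the idea can be sketched with Python's sqlite3 module (SQLite also supports subqueries in UPDATE); the table and column names are assumed for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Illustrative schema: the article names only a salary table and an
# employee_id key; the amount column and the figures are assumptions.
conn.execute("CREATE TABLE salary (employee_id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO salary VALUES (?, ?)",
                 [(1, 40000), (2, 90000), (3, 90000)])

# One atomic statement: find the current maximum and cut those salaries to a
# quarter of what they were (a 75 percent reduction).
conn.execute("""
    UPDATE salary
       SET amount = amount / 4
     WHERE amount = (SELECT MAX(amount) FROM salary)
""")
conn.commit()
```

Because the maximum is computed inside the same statement that performs the update, no concurrent raise can slip in between the two steps.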
Although MySQL can't perform subqueries, its temporary tables can help you mimic subqueries in SELECT statements, because a SELECT statement can join against multiple tables (without subqueries, you can't introduce extra tables into an UPDATE or DELETE). Select the subquery's result into a temporary table, and then perform the outer query as a join against that table. Using temporary tables, MySQL can accomplish the same thing as the PostgreSQL subquery in Example 3, as shown in Example 4. In some instances, you can use local variables to emulate the behavior of subqueries: if your subquery will only return one row, you can use the approach in Example 5 (thanks to email@example.com for this example).
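As a rough sketch of the temporary-table pattern (not the article's Example 4), here it is against SQLite through Python's sqlite3 module; SQLite's temporary tables are likewise private to each connection, and the table, column, and salary figures are assumed for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salary (employee_id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO salary VALUES (?, ?)",
                 [(1, 40000), (2, 90000), (3, 90000)])

# Step 1: materialize the "subquery" as a temporary table.
# (Temporary tables are private to this connection, in SQLite as in MySQL.)
conn.execute("CREATE TEMPORARY TABLE max_salary AS "
             "SELECT MAX(amount) AS amount FROM salary")

# Step 2: run the outer query as an ordinary join against the temporary table.
top_earners = [row[0] for row in conn.execute("""
    SELECT s.employee_id
      FROM salary AS s
      JOIN max_salary AS m ON s.amount = m.amount
     ORDER BY s.employee_id
""")]
print(top_earners)  # [2, 3]

conn.execute("DROP TABLE max_salary")  # clean up when finished
```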
MySQL's temporary tables are safe in multi-user situations because they're private to each connection. That is, if two users each create a temporary table named max_salary, the tables won't collide with each other.
On a related note, although everyone should be concerned about the risks of corruption that ACID compliance addresses, remember that it doesn't protect you from bogus data. What's to keep you from storing invoices for customers who don't exist? And what happens if someone accidentally deletes a customer who has unpaid balances? Your system would probably forget to ever bill them again, and that's certainly not good for business. ACID compliance keeps tables from becoming so corrupted that the database can't read them, but the integrity of the data itself depends on whether you enforce the constraints implied by its relationships (for instance, never letting anyone delete the record of a customer with unpaid balances). Choosing an ACID-compliant database that meets your needs will take you a long way, but it won't do all of the work for you. It does buy you time to think about the harder problems, though, such as how your applications will guard against bogus data.
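That kind of enforcement is usually done with foreign-key constraints. Here is a minimal sketch using SQLite through Python's sqlite3 module; the customer/invoice schema is assumed for illustration, not taken from the article:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE invoice (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    balance INTEGER)""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme')")
conn.execute("INSERT INTO invoice VALUES (100, 1, 250)")

def rejected(sql):
    """True if the statement violates a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# An invoice for a customer that doesn't exist is refused outright ...
print(rejected("INSERT INTO invoice VALUES (101, 99, 10)"))  # True
# ... and so is deleting a customer who still has invoices on file.
print(rejected("DELETE FROM customer WHERE id = 1"))         # True
```

The database refuses both of the "bogus data" scenarios above, but only because the schema declares the relationship; an ACID-compliant engine with no constraints would happily accept them.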
Door One or Door Two?
Both databases have many similarities, so how do you choose which is best? I have experience using both databases as well as recommending the use of one or the other (and dealing with the wrath of people I've steered in the wrong direction).
If you're looking for a database to prop up a Weblog (see "Blog Rolling Competitions" in the January 2001 issue of Web Techniques) or portal, you'll find that many such packages rely on MySQL. It should be possible to port them to PostgreSQL, but if you're looking for a turnkey package, chances are you're not interested in doing too much porting work.
If you're migrating from Oracle, Sybase, or Microsoft SQL Server, I suggest PostgreSQL. Like those databases, PostgreSQL has triggers, stored procedures, and a rich set of built-in functions (including many for date manipulation). Also, PostgreSQL's procedural language, PL/pgSQL, is easy to learn if you're familiar with Oracle's PL/SQL or SQL Server's Transact-SQL.
I use MySQL on my own Web site. MySQL seems to be focused on the needs of Web developers, and PostgreSQL seems oriented to a wider range of application developers. I'm often approached by some of my wife's computer science graduate students who are developing complex database applications (some for the Web, some not). For their projects, I almost always recommend PostgreSQL. In theory, you could adapt your application to either database system, but your life will be easier if you don't go against the grain.
Brian is a writer for O'Reilly and Associates. You can follow some of his activities at www.jepstone.net or send him email at firstname.lastname@example.org.