This is one part actionable advice, one part question for the audience.
Advice: When you are making a new application that requires persistent storage of data, like is the case for most web applications, your default choice should be Postgres
.
sqlite
?sqlite
is a pretty good database, but its data is stored in a single file.
This implies that whatever your application is, it is running on one machine and one machine only. Or at least one shared filesystem.
If you are making a desktop or mobile app, that's perfect. If you are making a website it might not be.
There are many success stories of using sqlite
for a website, but they mostly involve people who set up their own servers and infrastructure. Platforms as a service-s like Heroku, Railway, Render, etc. generally expect you to use a database accessed over network boundary. It's not wrong to give up some of the benefits of those platforms, but do consider if the benefits of sqlite
are worth giving up platform provided automatic database backups and the ability to provision more than one application server.
The official documentation has a good guide with some more specifics.
DynamoDB
, Cassandra
, or MongoDB
?Wherever Rick Houlihan is, I hope he is having a good day.
I watch a lot of conference talks, but his 2018 DynamoDB Deep Dive might be the one I've watched the most. I know very few of you are going to watch an hour-long talk, but you really should. It's a good one.
The thrust of it is that databases that are in the same genre as DynamoDB
- which includes Cassandra
and MongoDB
- are fantastic if - and this is a load bearing if:
This is because this sort of database is basically a giant distributed hash map. The only operations that work without needing to scan the entire database are lookups by partition key and scans that make use of a sort key.
Whatever queries you need to make, you need to encode that knowledge in one of those indexes before you store it. You want to store users and look them up by either first name or last name? Well you best have a sort key that looks like <FIRST NAME>$<LAST NAME>
. Your access patterns should be baked into how you store your data. If your access patterns change significantly, you might need to reprocess all of your data.
It's annoying because, especially with MongoDB
, people come into it having been sold on it being a more "flexible" database. Yes, you don't need to give it a schema. Yes, you can just dump untyped JSON into collections. No, this is not a flexible kind of database. It is an efficient one.
With a relational database you can go from getting all the pets of a person to getting all the owners of a pet by slapping an index or two on your tables. With this genre of NoSQL, that can be a tall order.
Its also not amazing if you need to run analytics queries. Arbitrary questions like "How many users signed up in the last month" can be trivially answered by writing a SQL query, perhaps on a read-replica if you are worried about running an expensive query on the same machine that is dealing with customer traffic. It's just outside the scope of this kind of database. You need to be ETL-ing your data out to handle it.
If you see a college student or fresh grad using MongoDB
stop them. They need help. They have been led astray.
Valkey
?The artist formerly known as Redis
is best known for being an efficient out-of-process cache. You compute something expensive once and slap it in Valkey
so all 5 or so HTTP servers you have don't need to recompute it.
However, you can use it as your primary database. It stores all its data in RAM, so it's pretty fast if you do that.
Obvious problems:
DynamoDB
-likes, you need to make concessions on how you model your data.Datomic
?If you already knew about this one, you get a gold star.
Datomic
is a NoSQL
database, but it is a relational one. The "up-front design" problems aren't there, and it does have some neat properties.
You don't store data in tables. It's all "entity-attribute-value-time" (EAVT) pairs. Instead of a person row with id
, name
, and age
you store 1 :person/name "Beth"
and 1 :person/age 30
. Then your queries work off of "universal" indexes.
You don't need to coordinate with writers when making queries. You query the database "as-of" a given time. New data, even deletions (or as they call them "retractions"), don't actually delete old data.
But there are some significant problems
Clojure
, a relatively niche language, its API sucks.XTDB
?Clojure
people make a lot of databases.
XTDB
is spiritually similar do Datomic
but:
The biggest points against it are:
Okay that's just one point. I'm sure I could think of more, but treat this as a stand-in for any recently developed database. The best predictor something will continue to exist into the future is how long it has existed. COBOL been around for decades, it will likely continue to exist for decades.
If you have persistent storage, you want as long a support term as you can get. You can certainly choose to pick a newer or experimental database for your app but, regardless of technical properties, that's a risky choice. It shouldn't be your default.
Kafka
?Kafka
is an append only log. It can handle TBs of data. It is a very good append only log. It works amazingly well if you want to do event sourcing type stuff with data flowing in from multiple services maintained by multiple teams of humans.
But:
Kafka
services) its another piece of infrastructure you need to monitor.ElasticSearch
?Is searching over data the primary function of your product?
If yes, ElasticSearch
is going to give you some real pros. You will need to ETL your data into it and manage that whole process, but ElasticSearch
is built for searching. It does searching good.
If no, Postgres
will be fine. A sprinkling of ilike
and the built-in full text search is more than enough for most applications. You can always bolt on a dedicated search thing later.
MSSQL
or Oracle DB
?Genuine question you should ask yourself: Are these worth the price tag?
I don't just mean the straight-up cost to license, but also the cost of lock-in. Once your data is in Oracle DB
you are going to be paying Oracle forever. You are going to have to train your coders on its idiosyncrasies, forever. You are going to have to decide between enterprise features and your wallet, forever.
I know its super unlikely that you will contribute a patch to Postgres
, so I won't pretend that there is some magic "power of open source" going on, but I think you should have a very specific need in mind to choose a proprietary DB. If you don't have some killer MSSQL
feature that you simply cannot live without, don't use it.
MySQL
?This is the one that I need some audience help with.
MySQL
is owned by Oracle. There are features locked behind their enterprise editions. To an extent you will have lock-in issues the same as any other DB.
But the free edition MySQL
has also been used in an extremely wide range of things. It's been around for a long time. There are people who know how to work with it.
My problem is that I've only spent ~6 months of my professional career working with it. I genuinely don't know enough to compare it intelligently to Postgres
.
I'm convinced it isn't secretly so much better that I am doing folks a disservice when telling them to use Postgres
, and I do remember reading about how Postgres
generally has better support for enforcing invariants in the DB itself, but I wouldn't mind being schooled a bit here.
import openai
.You're right. I can't think of any downsides. Go for it.