Databases Need Better DX

Relational database systems like Postgres are powerful, but clunky to use. Here are some potential improvements.

In the first half of this year, I spent a fair amount of time hacking together a variety of web application ideas and, in the process, had my first proper introduction to the typical relational database systems (RDBMSs) which underpin almost every such application. After working primarily with Postgres, and interacting with it via the Elixir library Ecto, I've come to feel that there could well be an opportunity to create a more user-centric database system - something better suited to the demands of modern web application development.

Databases are the backbone of almost all production applications. Without a database, an "application" is just code which could run, sometime, somewhere: an abstract definition of business logic. Paired with a database, the code is transformed into a living application that delivers value to its users.

Postgres, and many of the other popular databases in use today, were first created in the late 1980s and early 1990s. I won't argue that this is necessarily a massive problem in and of itself - all these databases have been under rigorous and intense development ever since, so it would be naive to claim "they're old, therefore they suck". The advanced age of these systems does, however, suggest that the technology was not optimally architected for the common workloads of 2021. In 1996, when PostgreSQL was first released under that name, I doubt many people could have anticipated the modern environment of the web - especially the delivery of full-blown applications through the browser, hosted on servers paid for by startups and owned by tech giants like Amazon and Google. Another shift over this period has been the move away from the waterfall methodology of software development towards agile practices. Founders and engineers have worked out the value of rapid and highly iterative design cycles, as famously espoused in the Lean Startup methodology.

In most other areas of the developer's toolkit, we have seen large strides since the late 1980s towards productivity-enhancing tooling and processes which support these newer approaches. Here's a (very incomplete) representative overview:

  • Version control / source control - Git was initially released in 2005, and has quickly ascended to become the dominant system[1]. It certainly wasn't the first source control management (SCM) system, but it was the first that was extremely easy to use, and it proved a huge boon for developers. Creating and merging branches is a cheap and efficient operation in Git, and this was probably its main innovation.
  • CI/CD pipelines - Continuous Integration/Continuous Deployment (CI/CD) consists of a wide variety of practices and commercial tools which allow developers to automate many important software development and "devops" tasks, such as running test suites and building "preview" versions of code before releasing to production. These processes aim to make it as efficient as possible for developers and operations professionals to keep the whole system running smoothly and to identify bugs and other issues as early as possible. Without CI/CD practices, the rapid iterative feedback loops mentioned above would be far harder to implement.
  • Programming languages - Over the last couple of decades, the programming languages and runtimes used to build web applications have evolved considerably. Although PHP still frequently ranks highly by traffic, this seems to be mainly a function of the fact that many large websites, such as Facebook and WordPress, were originally built with it. Few new projects now select PHP as the server-side language. JavaScript has enjoyed a huge renaissance since 2009, when the Node.js server-side runtime was released. This was further bolstered in 2015 with the release of ES6, a version of the JavaScript language which added many new features promoting robustness and expressiveness. In 2012, Microsoft released TypeScript, a superset of JavaScript which adds the safety of static typing to the language; it has rapidly risen to prominence in modern application development, both on the front-end and server-side. JavaScript/TypeScript frameworks like React have democratised complex and highly interactive UI designs, and made these more easily deliverable through the ubiquitous web browser, obviating the need for desktop application downloads. Ruby, with its Rails framework, has also enjoyed huge popularity; Rails counts simplicity and developer productivity amongst its primary design goals.

The important trend running through these examples is the focus on speed and flexibility of development. As mentioned, modern applications are built by startups striving to implement the 'lean' methodology, which emphasises rapid development cycles, experimentation and frequent user testing. Most of the developer's workflow has risen to the challenge of facilitating these goals, but I'd like to argue that there is still considerable room for improvement in the domain of databases.

I'd like to note at the outset that I'm aware most, if not all, of these suggestions are theoretically implementable with third-party extensions, libraries or services that build on a core RDBMS like Postgres. My assertion is that a freshly designed system implementing all of these as first-class considerations in the core product would be both more desirable and more robust.

Data types

The data types supported as first-class citizens in a Postgres database are limited, despite the near-universal requirement to store such things as:

  • images (e.g. for user avatars)
  • documents (e.g. pdf, word, etc)
  • audio
  • video

Any application which deals in these large binary data formats must track them manually. A typical data processing flow would be something like:

  • receive an avatar image from a user
  • generate a unique ID
  • send the image off to a storage system like AWS S3, with filename set to the ID
  • create an entry in a relational database system like Postgres, listing the image ID, the user to whom it relates, and any other metadata

Actually accomplishing this requires custom-built application code, or a library written in your server's language which abstracts away the implementation details. When I had to do things like this for a project I worked on in Elixir, the available library was quite basic, built by a single individual and not well-maintained. (Admittedly, Elixir is a relatively nascent language, generally still lacking in robust libraries.)
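
For illustration, here is a rough sketch of that hand-rolled flow in Elixir, using the ExAws library for S3 and Ecto for the metadata row. The bucket name, module names and schema fields are all hypothetical - this is just the shape of the bookkeeping involved.

defmodule MyApp.Avatars do
  @bucket "my-app-uploads"

  def store(user, image_binary) do
    # 1. generate a unique ID for the file
    id = Ecto.UUID.generate()

    # 2. push the binary to object storage under that ID
    {:ok, _} =
      ExAws.S3.put_object(@bucket, "avatars/#{id}", image_binary)
      |> ExAws.request()

    # 3. record the metadata in Postgres so the application can find it again
    %MyApp.Avatar{id: id, user_id: user.id, content_type: "image/png"}
    |> MyApp.Repo.insert()

    # If the insert fails at this point, the object in S3 is orphaned - keeping
    # the two stores consistent is exactly the bookkeeping argued against above.
  end
end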

But stepping back, this should be a first-class abstraction provided by the RDBMS. Developers should be able to treat their database as just that: a storage repository for their application's data. The tricky implementation details of handling proper ACID transactions when dealing with users' files are not the sort of thing which should be occupying the time of development teams whose primary focus should be domain-specific application code. The best workflow from the developer's perspective would be to simply specify a storage repository for large binary data - an S3 bucket, another cloud provider, or an on-premise disk location - and let the database system handle the details.

Indexing

A very common use case in modern applications is to provide full-text and other heavy-duty indexing capabilities. This allows applications to offer users features like "search-as-you-type" and "faceted search"[2]. One of the most popular implementations for this type of search-engine-style indexing is Elasticsearch. Developers are generally forced to adopt a third-party application to index data which already lives in a primary database system like Postgres. This leads to obvious problems, like keeping the two databases in sync. Furthermore, Elasticsearch is notoriously tricky to use effectively. It is optimised for scaling and works well as a distributed system, but it exposes a myriad of tuning parameters and configuration options, and is yet another separate service to install on every development and production machine. Managed versions of the service can get expensive quickly. Developers have to learn to navigate its labyrinthine API.
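
To make the sync problem concrete, here is roughly what the extra indexing step tends to look like in application code. This sketch assumes the HTTPoison and Jason libraries; the index name and URL are made up.

defmodule MyApp.Search do
  @es_url "http://localhost:9200"

  # Every movie written to Postgres must also be pushed to Elasticsearch by hand.
  def index_movie(%{id: id, title: title}) do
    body = Jason.encode!(%{title: title})

    HTTPoison.put(
      "#{@es_url}/movies/_doc/#{id}",
      body,
      [{"content-type", "application/json"}]
    )
    # If this request fails after the Postgres insert has already committed,
    # the two stores are out of sync - the problem described above.
  end
end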

When I played around with Elasticsearch for one side project, it took a day or two of reading to get off the ground, and then a further few days to get some basic features implemented. I had accidentally opted for a managed service in a different datacenter from the AWS S3 bucket where the underlying data was stored and where the server was running. This was clearly my own mistake, but it made indexing slow, and it made queries slower than necessary. I wanted to run a test environment separately from the production environment, and figured that I could save on duplicated cost and complexity by simply creating a second set of indices with the prefix test_ on each name. I managed to accomplish most of my desiderata by using a third-party Elixir library, again written by a single individual and lacking in features.

For a small startup development team looking to move quickly on prototyping a new project, all of the issues above would simply disappear if the indexing functionality were included with the core database system and treated as a first-class feature catering to very common modern use cases.

Relational schemas

Most useful information stored in an application is in the form of entities which have meaningful relationships to each other. For instance, one could have a list of users, and a list of addresses. The addresses are related to the users in that each user lives at an address. Each user could also be related to a group of other users if, for example, they were allowed to "follow" each other on the site. Each user could have associated posts in a discussion forum application, and a huge list of "upvotes" could be associated with both users and posts.

To be most effective, relational databases need to remain "normalised". This word covers a variety of techniques for restructuring the tables and columns in the database to eliminate data redundancy, thereby avoiding update anomalies and making it far easier to maintain data integrity. Here's an example of an un-normalised database.

Database 1

| id | first_name | last_name | address              | movie_recommendations                                   |
|----|------------|-----------|----------------------|---------------------------------------------------------|
| 1  | Alice      | Smith     | 1 Trumpington Street | The Godfather, Vertigo, Shawshank Redemption            |
| 2  | Bob        | Jones     | 37 Grange Road       | Vertigo, Forrest Gump, The Good, The Bad, and The Ugly  |
| 3  | Clare      | Jones     | 37 Grange Road       | The Matrix, Goldfinger, The Godfather                   |

The problems with this table are:

  • The same address is repeated twice. This could be because Bob and Clare both live at the same address. If we ever wanted to update this address, it would need to be updated in multiple locations at once.
  • There are multiple movies recommended in each row. It is hard to search for who has recommended a specific movie when the data is laid out in this way. Furthermore, the same movies can be recommended in multiple places and therefore, if a change was made to the data for a specific movie (e.g. correcting a spelling mistake), we would have to search everywhere to fix each occurrence of that movie's name.

There are a variety of other problems like these. Fixing them consists of splitting the data into more tables, and using "keys" to relate entries to other entries in other tables. For instance, we could produce the following:

Database 2

| user_key | first_name | last_name | address_key |
|----------|------------|-----------|-------------|
| 1        | Alice      | Smith     | 1           |
| 2        | Bob        | Jones     | 2           |
| 3        | Clare      | Jones     | 2           |

| address_key | address              |
|-------------|----------------------|
| 1           | 1 Trumpington Street |
| 2           | 37 Grange Road       |

| movie_key | movie                           |
|-----------|---------------------------------|
| 1         | The Godfather                   |
| 2         | Vertigo                         |
| 3         | Shawshank Redemption            |
| 4         | Forrest Gump                    |
| 5         | The Good, The Bad, and The Ugly |
| 6         | The Matrix                      |
| 7         | Goldfinger                      |

| user_key | movie_key |
|----------|-----------|
| 1        | 1         |
| 1        | 2         |
| 1        | 3         |
| 2        | 2         |
| 2        | 4         |
| 2        | 5         |
| 3        | 6         |
| 3        | 7         |
| 3        | 1         |

The final table encodes the movie recommendations. At the expense of having to perform some "join" operations in order to reconstruct the relationships, we have restructured the data so that it contains no duplication and encodes relationships through the use of "foreign keys". The process indicated above is a very basic example of the famous problem of database normalisation, about which there exists an extensive theoretical literature. It's well understood and a common consideration in application design, but it still typically requires a lot of up-front developer effort on each new project.
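
To make the cost of those joins concrete, here is roughly what reassembling the "who recommended what" view looks like as an Ecto query over the normalised tables. The table and column names ("users", "movies", "user_movie_recommendations") are my own labels for the example tables above.

import Ecto.Query

# Join users back to their recommended movies via the linking table.
query =
  from u in "users",
    join: r in "user_movie_recommendations",
    on: r.user_id == u.id,
    join: m in "movies",
    on: m.id == r.movie_id,
    select: %{first_name: u.first_name, movie: m.title}

MyApp.Repo.all(query)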

With RDBMSs like Postgres, the user must manually declare each table, state which columns should appear in it, and carefully enforce the normalisation rules by hand. As the application grows and the data model evolves, this may require restructuring tables, moving columns, and therefore implementing actual data migrations in the live production system. All of this could take hours or days.
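
For a sense of what that manual declaration involves, here is a sketch of an Ecto migration creating the normalised tables from the example above (the table and column names are my own choices; a plain script of CREATE TABLE statements would be the equivalent without a library).

defmodule MyApp.Repo.Migrations.CreateNormalisedTables do
  use Ecto.Migration

  def change do
    create table(:addresses) do
      add :address, :string
    end

    create table(:users) do
      add :first_name, :string
      add :last_name, :string
      add :address_id, references(:addresses)
    end

    create table(:movies) do
      add :title, :string
    end

    # The linking table encoding the many-to-many recommendation relationship.
    create table(:user_movie_recommendations) do
      add :user_id, references(:users)
      add :movie_id, references(:movies)
    end

    create unique_index(:user_movie_recommendations, [:user_id, :movie_id])
  end
end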

However, all the information needed to accurately normalise relational data already resides in the programmer's domain knowledge, and it seems to me that this could be encoded in declarative statements which define the relationships between the various types. Many ORM tools (like Ecto in Elixir) allow this sort of definition in the code, but the database is effectively "unaware" of it, except insofar as the programmer has chosen relationships which map well onto the database schema.
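
As an example of that kind of in-code definition, here is roughly how the relationships might be declared as Ecto schemas (module names are hypothetical). Note that the database itself never sees these declarations - only the tables and foreign key columns they imply.

defmodule MyApp.User do
  use Ecto.Schema

  schema "users" do
    field :first_name, :string
    field :last_name, :string

    # The relationships are expressed here, in application code.
    belongs_to :address, MyApp.Address
    many_to_many :movie_recommendations, MyApp.Movie,
      join_through: "user_movie_recommendations"
  end
end

defmodule MyApp.Address do
  use Ecto.Schema

  schema "addresses" do
    field :address, :string
    has_many :users, MyApp.User
  end
end

defmodule MyApp.Movie do
  use Ecto.Schema

  schema "movies" do
    field :title, :string
  end
end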

If a declarative language could be created which could then be automatically parsed into a relational database schema, it would have the following advantages:

  1. Developer time would be freed from tediously translating domain knowledge into CREATE TABLE and ALTER TABLE commands
  2. Databases would keep themselves normalised, and therefore optimised for data integrity, without direct intervention by the programmer (except that the programmer has to at least accurately define the constraints, which seems reasonable enough)
  3. The system could automatically figure out the correct schema and data migrations to perform when the data model changes between commits
  4. Developers would not have to rely on an ORM to implement features like these in their language. The database system would have predictable behaviour in all language environments
  5. A static code analysis system could be created to enforce various consistency constraints in the data model declaration, aiding the programmer to write a robust application (much like static type checking)

Below is a very rough hypothetical example of how such a language might define the relationships shown in the tables above. Please read this as pseudocode - I haven't thought very hard about it yet. If I'm not totally missing some key point, an algorithm should easily be able to recreate the normalised version of the tables above from this declaration. I expect this would extend to more complicated examples without too much fuss.

user {
    first_name: string
    last_name: string
    has_one address: address
    has_many movie_recommendations: movie
}

address {
    address: string
}

movie {
    title: string
}

Infrastructure as Code

One aspect of the rise of CI/CD pipelines and cloud hosting platforms like AWS and Heroku is the trend towards Infrastructure as Code (IaC). Instead of treating the computing infrastructure on which our applications run as a whole separate domain, it's possible to simply declare the resources we wish to use in the code repository itself (usually in a YAML, TOML or JSON file). When the code is pushed to the remote repository and begins its CI/CD pipeline journey, the cloud hosting provider is automatically triggered to read the declaration files and build the infrastructure for us (if it differs from what is currently deployed) before running the new code on that infrastructure.

A system like this provides several advantages:

  • a clear and concise description of the state of the server infrastructure you are using
  • tight coupling of the software with the infrastructure it runs on, eliminating annoying errors where the code doesn't find the services it expects to be running, and the frequent difficulties encountered by developers when trying to move working code off their development machine and into a test or production environment
  • versioning of the infrastructure alongside the codebase itself
  • easy rollbacks

To start developing a new web application, a programmer will typically have to manually install the latest version of Postgres on her machine (upgrading from an earlier version can be surprisingly tedious and error-prone, as I once discovered). She would then have to check that her machine had the right Postgres users set up. Finally, she would have to launch a terminal and run the psql program, issuing one-off commands like CREATE DATABASE my_app along with any further configuration required for that particular database instance.

After writing a few lines of application code, she might wish to run a test suite in a dedicated test environment on her local machine. This would entail replicating the full setup of the database, but with a new name like my_app_test. All of this would then need replicating on the machines of every other developer she collaborates with, as well as on the live CI/CD test environments, preview environments and the main production environment.

I'm aware that a lot of this can be abstracted and handled by a good database library / ORM. We used Ecto when developing some server-side projects in Elixir, and it was very good. Overall though, this feels like a core abstraction of absolutely critical importance in the modern usage of database systems for these types of applications, and it should be a feature that ships with the database system itself. It's important to realise that I'm not really advocating anything more "innovative" than a close focus on optimising developer experience here.
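
For instance, with Ecto most of that per-environment setup collapses into a few lines of configuration, after which mix ecto.create brings the relevant database into existence. A sketch (database names and credentials are illustrative):

# config/dev.exs - the development database
import Config

config :my_app, MyApp.Repo,
  database: "my_app_dev",
  username: "postgres",
  password: "postgres",
  hostname: "localhost"

# config/test.exs - the test database, sandboxed between test runs
import Config

config :my_app, MyApp.Repo,
  database: "my_app_test",
  username: "postgres",
  password: "postgres",
  hostname: "localhost",
  pool: Ecto.Adapters.SQL.Sandbox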

Schema Migrations

Closely related to the last example are schema migrations. As developers work on a project, they may add new tables or columns, change the structure of existing relations, or otherwise alter the state of the database. RDBMSs like Postgres expose a procedural API for doing such things: you have to execute one-off commands like ALTER TABLE ... ADD email_address or DROP TABLE to achieve these changes. This results in the state of your database (and the various versions of it you have running in different environments) being poorly defined and hard to reason about. In modern application development, different branches of the codebase can expect different database structures. For instance, my colleague might have worked on a speculative branch that includes a new table for tracking blog posts. If I try to run this code, it won't work properly on my dev machine unless I've updated my local version of the database to reflect the fact that the code expects to see a posts table.

Again, this problem has been solved by good database libraries like Ecto. They implement a somewhat involved system called 'migrations', which tracks the current state of the database in a special table in the database itself. Then, each time migrations are run (typically on deploy or when the server is launched), Ecto checks which version is needed by that version of the code and, if it differs from the current version, incrementally steps forwards or backwards through each atomic change to the database structure (all of which have been stored in timestamped files in the codebase itself, and are thus versioned alongside the code to which they relate) until the database on the machine matches the database the code expects to be running against.
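
As a concrete example, the ALTER TABLE change mentioned earlier would live in a timestamped migration file along these lines (module and column names are illustrative), applied with mix ecto.migrate and reversed with mix ecto.rollback:

# e.g. priv/repo/migrations/20210601120000_add_email_to_users.exs
defmodule MyApp.Repo.Migrations.AddEmailToUsers do
  use Ecto.Migration

  # `change` is reversible: Ecto knows how to undo it, which is what lets it
  # step the database forwards or backwards between versions of the code.
  def change do
    alter table(:users) do
      add :email_address, :string
    end
  end
end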

This is quite an ingenious system, and allows the "Infrastructure as Code" paradigm to be used throughout the development process. Developers simply declare the state of the database they want in the codebase, and know that the system will ensure the right database is always running at the right time, without having to issue loads of ALTER SQL commands.

Once again though, this seems like such critical functionality that it ought to be a primary feature of the database system itself. Switching from one programming language to another generally necessitates switching to a new database manipulation library, implemented in that language. It's not at all clear why programmers should have to learn a new library's take on the schema migration problem, or have to put up with the lack of a good library in their chosen language. Schema migrations are a critical aspect of productive and accurate software development today, and programmers should be able to trust an optimised, native implementation which is tightly coupled to the database they choose to use, not the programming language they write their server in.

Developer experience

Developer experience (DX) startups have been doing extremely well lately: software developers love to be early adopters of new software tools, so they tend to be eagerly receptive. Some randomly chosen examples from a very long list are: Twilio, Stripe, Atlassian, CircleCI, MongoDB... As I've already alluded to, the key to what I'm suggesting here is that it's specifically the DX of database usage that is lacking, not necessarily the core underlying technologies which power database storage and retrieval. There are several precedents of tech giants building distributed database systems with better abstractions, using MySQL or other common, battle-tested database engines under the hood as the core building block: Facebook's TAO, Twitter's Gizzard and AWS's DynamoDB are a few examples. An approach like this would spare a startup working on improving database DX from having to reinvent the (already very clever) wheels of core database engines, allowing the focus to remain on the new logic for managing data more intelligently in the ways I've proposed here.

Conclusion

Countless hours of development time could be saved by building a system which collocates all the "data storage and retrieval" functionality into a single, straightforward service that focusses on promoting developer productivity first and foremost. Although most database systems are highly sophisticated at the nuts and bolts level of their "database engines", the interface presented to a working developer leaves a lot to be desired. Luckily, that's one of the easiest parts to fix, and could pay big dividends for a team who choose to focus on it.


  1. It's hard to find statistics which meaningfully compare the popularity of VCSs, but most sources seem to think that Git became the predominant system within a decade of its release. Other distributed version control systems like Mercurial (which I haven't personally used), are frequently described as being just as capable as Git, but perhaps didn't catch on to the same degree purely because of network effects. Mercurial was also first released in 2005.
  2. Faceted search is the sort of search system that allows a user to navigate a variety of criteria, often with selection dropdowns or options showing the possible filters. It's the kind of functionality typically found when looking for a hotel room, booking a flight or buying a house, where you are seeking an entity which matches a variety of different criteria at once.