Thinking in Database

January 28th, 2025

Too often, particularly in smaller teams, software developers create database assets themselves with varying degrees of success, while database specialists are reserved for larger systems. This article aims to demystify some of the thinking involved in working with databases.

Changing Databases Is Costly

This isn’t a backtrack on my position in Test Driven Database Development: we can make working with databases lower risk and more efficient, as I’ve described there. But making changes to our database is undoubtedly more costly than making application changes.

Suppose our application is an accounting system with an internal desktop application, integration with shipping label software, customer website and mobile app, and internal management and reporting portal. When we first created the customer table, we added a phone_number field as a simple 20 character string, allowing free text entry.

We then implement a text messaging service. But we get complaints. Mr Jones doesn’t get his text: his number has been input as (UK)07111 111 111 which, of course, isn’t a valid number. Mr Smith doesn’t get his text either: his phone number is in Ireland, but in the absence of an international code, we’ve been texting someone in the UK with the same number, who is also complaining. And then there’s Mr Harris, who has input a work number of 01234 123 456 ext 22, which also doesn’t work.

So, we decide to update our phone number to a better data design:

  • Separate international dialling code, number and optional extension
  • Separate fields for mobile, home and work numbers

However, of course, this isn’t simple:

  • We don’t know whose number is mobile, home or work
  • In simple cases, we can parse valid formats to get the components — but in the cases described, we may struggle

So how do we solve this problem?

I’m not going to cover every possible technique in safely making database changes — check out Ambler and Sadalage’s brilliant book Refactoring Databases for a comprehensive treatment.

However, in our worked example, I’d:

  • Introduce new optional fields alongside the existing
  • Choose and implement an approach to map old to new
  • Modify calling applications to use the new fields, probably alongside the old
  • After a reasonable period — maybe 6 months — deprecate the old

Introducing the new fields is self-explanatory. The tricky bit is mapping old to new and modifying the calling applications to use the new fields.

In some cases, we might have a simple transform we can run on the data. Working the other way, combining phone number components into a single field, this would be a simple UPDATE customer SET new_phone = int_code + ' ' + phone_num + ' ' + ext, for example. But, where we’re parsing free text into structured fields, that becomes much harder.
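
Spelled out as a sketch, using the same hypothetical column names and assuming the combined value lives in a single phone_number field, the only wrinkle is the optional extension, which needs a NULL guard:

-- Collapse the structured fields back into a single string. Concatenating NULL
-- yields NULL in T-SQL, so ISNULL supplies an empty suffix where there is no extension.
UPDATE customer
SET phone_number = int_code + ' ' + phone_num + ISNULL(' ext ' + ext, '');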

If most of our phone numbers are reasonably consistent, we might use the Pareto Principle to come up with a mapping rule for a large proportion. But what about the rest? Manual checking? Outsourcing to a 3rd party?

The solution I’d probably settle with is:

  • Mapping the 80% or so we can do easily (see the sketch after this list)
  • Adding logic into our calling applications so that the next time one of the 20% places an order, we ask them to confirm / fix up their phone number
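
As a minimal sketch of that mapping step, assuming hypothetical mobile_int_code and mobile_number columns and treating only strictly formatted 11-digit UK mobile numbers as the “easy” cases:

-- Map only rows whose phone_number, with spaces stripped, is exactly
-- '07' followed by nine digits; everything else is left untouched.
UPDATE customer
SET mobile_int_code = '44',
    mobile_number = SUBSTRING(REPLACE(phone_number, ' ', ''), 2, 10)  -- drop the leading 0
WHERE REPLACE(phone_number, ' ', '') LIKE '07[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]';

Anything that doesn’t match the pattern simply stays as it is, ready to be confirmed by the customer the next time they order.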

So, this means we’ve got nine discrete steps:

  1. Adding the new fields (sketched after this list, along with step 8)
  2. Implementing an update script which handles most data (and doesn’t fail in a damaging way for the remainder)
  3. Updating calling applications to use new formats for new or updated customers
  4. Updating calling applications to read the old and write the new in a cleanup operation
  5. Remove the cleanup code from calling application — but when? Is 6 months long enough? What about customers who don’t place an order in that period?
  6. Make a business decision on whether and when to, and actually remove, the residual invalid data
  7. Update the calling applications to eliminate the old fields
  8. Actually remove the old fields from the database
  9. Reconsider whether the new fields should be nullable or required.
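
As a sketch of steps 1 and 8, with the same illustrative column names as above, the schema changes themselves are the easy part; it’s everything in between that takes the time:

-- Step 1: add the new fields as nullable, alongside the existing free-text column.
ALTER TABLE customer ADD
    mobile_int_code VARCHAR(4) NULL,
    mobile_number VARCHAR(15) NULL,
    mobile_extension VARCHAR(10) NULL;

-- Step 8, months later, once nothing reads or writes the old column any more:
ALTER TABLE customer DROP COLUMN phone_number;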

This is many times more costly than adding new database artifacts. So, this gives us rule #1 — Changing databases is costly — so it’s worth some extra effort to get our data design (particularly fundamentals) right first time.

Related to this, rule #2 is It’s easier to relax constraints than introduce them.

Cleaning Up After Ourselves

Very often, in the short-to-medium term, doing it right is more costly than just bunging in an extra field and hoping the old stuff gradually falls into disuse. In our example, we might be tempted to bung in a new text_number field — customers who want texting about their order can fill in the new, validated field, and we avoid grappling with our previous poor design choice.

Even if we start off in the data migration and cleanup workflow with the best of intentions, implementing steps 1–4 above, it can be hard to justify spending the time to work on steps 5–9 when our new field is working and business priorities have moved on.

The problem with not doing the cleanup is:

  • Our database gets messier and less manageable, with redundant data
  • Our application code gets messier and harder to maintain, with special case logic around data cleanup or workarounds which are no longer needed
  • Querying and reporting on our data becomes harder, because we have to handle multiple special cases
  • We gradually (or rapidly) lose organisational memory, making cleanup feel riskier over time. “I’m sure we don’t use that field anymore, but taking it out might break the legacy applications…”

So, we need to normalise rule #3 amongst our team, and software managers need to be prepared to advocate for it with the wider business: Doing cleanup of your database pays off more quickly than you think, so plan it early and implement it as soon as you can.

More Costly Database Changes

Let’s think about a more costly database change. Our customer reference. We choose to make the primary key of our customer table a 6 character string. Mr Jones becomes JON001, Mrs Jones becomes JON002 and so on. Which is fine until the thousandth Mr, Mrs or Ms Jones…

There are workarounds: moving to hexadecimal IDs would give us room for another 3,096 Joneses (4,096 three-character combinations rather than 1,000). Or, of course, we could just use sparser combinations, and the thousandth Jones becomes JOO001. But these feel a bit kludgey. We need to change the field.

The problem, of course, is that because this is the primary key, it’s now referenced by 30 other tables. So changing this field is a massive problem.

When it comes to primary keys, YAGNI doesn’t apply. If we think we might have 30,000 customers, we’re going to allow an int32 (capacity a shade over 2bn) rather than an int16 (capacity 32,767) to provide room for growth. If we expect 1m orders per year, an int64 might feel excessive, but we might choose to allocate the larger capacity once we take into account international growth through local distributors using a shared system, abandoned baskets, etc etc.

As well as the size, consider carefully what to use as a primary key. Some leading database luminaries advocate strongly for using “natural keys”: things which naturally and uniquely describe objects, such as a social security number for a person. However, take care with natural keys, because things we think are natural often turn out not to be. For example, most countries assign their citizens a unique social security number or equivalent ID, but not all do, and there are circumstances where such numbers may not be unique. A numeric key or GUID will be the safest bet in most circumstances.
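
As a stripped-down sketch of the safer starting point (column names invented for illustration), we can keep a “natural” candidate around, and unique where it exists, while the key itself stays a plain surrogate integer sized with room to grow:

CREATE TABLE customer (
    customer_id BIGINT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key with room to grow
    national_id VARCHAR(20) NULL,                  -- "natural" candidate: not everyone has one
    full_name NVARCHAR(200) NOT NULL
);

-- Enforce uniqueness only where the natural identifier is actually present.
CREATE UNIQUE INDEX ux_customer_national_id
    ON customer (national_id)
    WHERE national_id IS NOT NULL;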

This gives us rule #4: Try especially hard to get primary keys right, because they’re particularly expensive to change.

Thinking in Sets

When writing software, sometimes we need to act on a single item. But often, we take a collection of items — customers, orders, products — and do things to them individually using iteration. For example, processing sale payments, we might iterate over the collection of yesterday’s sales and check for incoming payments in the bank, updating each sale individually.

We can do something similar in most database procedural languages by using a cursor inside a stored procedure. A cursor allows us to act on each item individually.

However, this isn’t the most efficient way to work with databases. Relational databases are built on the concept of set theory. Without getting too deep into the maths, thinking in sets means we need to write a single rule which can apply to all members of a set — rather than updating each item in an iterative loop, we update all items “in bulk” by giving a rule which applies to the set.

In an iterative example, we might have the following code (which is in SQL Server T-SQL):

DECLARE @SaleID INT, @AmountDue DECIMAL(18,2), @TotalPaid DECIMAL(18,2);  -- explicit scale so pence aren't truncated

DECLARE SalesCursor CURSOR FOR
    SELECT SaleID, AmountDue FROM Sales WHERE PaymentStatus = 'Pending';

OPEN SalesCursor;
FETCH NEXT FROM SalesCursor INTO @SaleID, @AmountDue;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Total up the payments received for this one sale...
    SELECT @TotalPaid = SUM(PaymentAmount) FROM Payments WHERE SaleID = @SaleID;

    -- ...and mark it as paid if they cover the amount due.
    IF @TotalPaid >= @AmountDue
        UPDATE Sales SET PaymentStatus = 'Paid' WHERE SaleID = @SaleID;

    FETCH NEXT FROM SalesCursor INTO @SaleID, @AmountDue;
END;

CLOSE SalesCursor;
DEALLOCATE SalesCursor;

In contrast, if we think in sets, we can simplify this to:

UPDATE Sales
SET PaymentStatus = 'Paid'
WHERE PaymentStatus = 'Pending'  -- as in the cursor version, only pending sales are considered
  AND SaleID IN (
        SELECT SaleID
        FROM Payments
        GROUP BY SaleID
        HAVING SUM(PaymentAmount) >= (SELECT AmountDue
                                      FROM Sales
                                      WHERE Sales.SaleID = Payments.SaleID)
      );

The latter is more concise but, more importantly, will run much more efficiently, because the database engine is optimised for set-based rather than iterative operations.

Set-based operations will be familiar to some developers, for example those working with libraries such as Python’s NumPy, but may feel less natural to most. However, it’s well worth the investment to get your head around them if you’re working with non-trivial databases. Rule #5, therefore: think in sets.

Three-Bears Normalisation

Normalisation is a huge topic I couldn’t hope to do justice to in this article, and is a key tenet of database theory in most Computer Science courses. In short, normalisation is the organisation of data into tables to minimise duplication and redundancy.

So for example, in a denormalised database, each order might have a customer name and address within the order table. Assuming we have returning customers, though (which is a fairly typical aim), this will lead to a lot of duplication. So a logical step in normalising the database would be to split customers into a separate table, with each order linked to one customer and each customer linked to multiple orders.
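
As a rough sketch in T-SQL (names invented for illustration), that split looks something like this:

-- Stripped-down illustration: customer details live in one place...
CREATE TABLE customer (
    customer_id INT IDENTITY(1,1) PRIMARY KEY,
    name NVARCHAR(200) NOT NULL,
    address NVARCHAR(500) NOT NULL
);

-- ...and each order references exactly one customer.
CREATE TABLE orders (
    order_id INT IDENTITY(1,1) PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customer (customer_id),
    order_date DATE NOT NULL
);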

Normalisation passes through six levels or “forms”, from 1st Normal Form to 5th Normal Form. Yes, six forms but only a fifth: the fourth in the sequence is called Boyce-Codd Normal Form, or BCNF, giving 1NF, 2NF, 3NF, BCNF, 4NF and 5NF.

Whilst this is all very well in theory, and I’d strongly advocate anyone working with databases having a good working understanding of normalisation, it’s not quite so simple in practice.

Suppose we normalise customer addresses such that we have cities, counties and countries tables. This means each address lookup requires us to join 4 tables together, which will impact performance and complexity. It also means we either need a definitive list of every city, county and country (which is not as straightforward as it sounds) or to allow customers to input their own city, leading to duplication (Stratford, Stratford-upon-Avon, Stratford-on-Avon) and/or confusion (Stratford, East London vs Stratford, Warwickshire).
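
To make the cost concrete, a single address lookup against those hypothetical cities, counties and countries tables (with address_line and city_id assumed to live on the customer record) becomes a four-table join:

-- Illustrative only: fully normalised addresses mean a four-table join per lookup.
DECLARE @CustomerID INT = 42;  -- example customer

SELECT c.address_line, ci.city_name, co.county_name, cn.country_name
FROM customer c
JOIN cities ci ON ci.city_id = c.city_id
JOIN counties co ON co.county_id = ci.county_id
JOIN countries cn ON cn.country_id = co.country_id
WHERE c.customer_id = @CustomerID;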

So, what we’re aiming for is “three bears normalisation”: not too little, not too much, just the right amount. The right amount will vary with your database and how you’re using your data, so rule #6 is to learn the rules of normalisation well enough that you’re not afraid to break them as needed.

Conclusion

Whilst there’s plenty more I could say on “thinking in database”, 6 rules is plenty for now, and covers the fundamental structure of our database and code.

The best of the rest — which I may follow up in a future article — are validation, using triggers, transactions, tuning and performance, indices and disaster recovery.

We do also need to recognise that sometimes specialism is a good thing. In the scenario I described — an accounting system with multiple applications linking in to a central database — the development teams would be well advised to have a database team or database architect whose primary focus is the database. If thinking in database isn’t your thing, it’s important to know when to call on a specialist.

If you’ve found these guiding principles on thinking in database useful, you might also find my articles on when to centralise business logic in your database, and wrapping process around your database development with test driven development useful.

About the Author

Rob Walters is a software architect with over 2 decades’ experience writing software, maintaining software and leading software projects. An early career in bespoke software development gave Rob broad experience in a variety of projects ranging from web portals to complex AI research applications (long before AI became fashionable!).

The majority of Rob’s work time is devoted to his role as Software Architect at PatronBase — with a primary focus on R&D — and where his primary tools of choice are C# / .NET / WPF, PHP / Laravel, SQL Server, Azure DevOps and a sprinkling of Terraform / OpenTofu thrown in.

A limited amount of Rob’s time is available for consultancy or architecture through Sett, where Rob’s wider team is available for greenfield or legacy bespoke software projects either taking the lead or as part of a larger team.

In addition to software architecture and project management, Rob spends most of his time making, growing or fixing things, and blogs on this as well as topics around neurodiversity and inclusion.

