I was recently going through my notes on SQL Antipatterns and was shocked to realize how relevant this book still is. I'm going to share my summary of the book. I guarantee you'll find at least a few antipatterns in your current database design.
Logical Database Design Antipatterns
Jaywalking
Antipattern: Storing a delimiter-separated string of values in a varchar field instead of creating an intersection table. It might be easier at first, but it makes queries harder, and updating the field is also a pain. The field can also end up with all kinds of consistency errors: for instance, ids are normally numbers, but now you have a varchar, so the DB cannot enforce consistency for you.
Legitimate Uses of the Antipattern: when you never need to query the individual values stored in the varchar (the list gets displayed as-is).
Solution: create an intersection table.
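To make the intersection table concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (products, accounts, contacts) are made up for illustration, and your DB's syntax may differ slightly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Intersection table: one row per (product, account) pair
    -- instead of a comma-separated list such as '12,34' in products.
    CREATE TABLE contacts (
        product_id INTEGER NOT NULL REFERENCES products(product_id),
        account_id INTEGER NOT NULL REFERENCES accounts(account_id),
        PRIMARY KEY (product_id, account_id)
    );
    INSERT INTO products VALUES (1, 'Visual TurboBuilder');
    INSERT INTO accounts VALUES (12, 'alice'), (34, 'bob');
    INSERT INTO contacts VALUES (1, 12), (1, 34);
""")

# Finding the products for an account is a plain join,
# not a pattern match against a delimited string.
products_for_bob = conn.execute(
    """
    SELECT p.name
    FROM products AS p
    JOIN contacts AS c ON c.product_id = p.product_id
    WHERE c.account_id = ?
    """,
    (34,),
).fetchall()

# Removing one contact is a single-row DELETE, no string surgery needed.
conn.execute(
    "DELETE FROM contacts WHERE product_id = ? AND account_id = ?", (1, 12)
)
remaining = conn.execute(
    "SELECT account_id FROM contacts ORDER BY account_id"
).fetchall()
```

Because each id lives in its own typed column, the DB can enforce types and foreign keys on it, which is exactly what the delimited string prevented.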
Naive trees
Antipattern: When you have a multi-level parent -> child relationship and you represent it using an adjacency list (every row stores its parent's id). Fetching the whole tree in one query becomes hard: each join only reaches one more level, so without recursive queries you can only get a fixed number of levels.
Legitimate Uses of the Antipattern: when you're certain that you will only ever query one level (a node's immediate parent or children). The alternatives are harder to implement.
Solutions:
- Path Enumeration (storing the string of ancestors as an attribute of each node, just like filesystem directories). Drawback: it shares the limitations of the Jaywalking antipattern.
- Nested Sets (add nsleft and nsright to each node: the nsleft number is less than the numbers of all the node’s descendants, whereas the nsright number is greater than the numbers of all the node’s descendants. These numbers have no relation to the id values). Reading subtrees becomes easy, but editing the tree becomes hard because inserting or moving a node means recalculating nsleft and nsright for other nodes.
- Closure Table: an extra table that holds all the relations between ancestors and descendants (including indirect relationships and a row where each node references itself). The main drawback is the size of the extra table for big trees, but the other operations are very easy to implement.
Reference: "Joe Celko’s Trees and Hierarchies in SQL for Smarties"
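The closure table option is easier to judge with a small example. This is a sketch under my own assumptions: the names comments, tree_paths and add_child are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comments (comment_id INTEGER PRIMARY KEY, body TEXT NOT NULL);
    CREATE TABLE tree_paths (
        ancestor   INTEGER NOT NULL REFERENCES comments(comment_id),
        descendant INTEGER NOT NULL REFERENCES comments(comment_id),
        PRIMARY KEY (ancestor, descendant)
    );
    INSERT INTO comments VALUES (1, 'root'), (2, 'reply to 1'), (3, 'reply to 2');
    -- Each node references itself, plus one row per (ancestor, descendant) pair.
    INSERT INTO tree_paths VALUES (1, 1), (2, 2), (3, 3), (1, 2), (2, 3), (1, 3);
""")


def add_child(conn, parent_id, comment_id, body):
    """Insert a node and link it to every ancestor of its parent and to itself."""
    conn.execute("INSERT INTO comments VALUES (?, ?)", (comment_id, body))
    conn.execute(
        """
        INSERT INTO tree_paths (ancestor, descendant)
        SELECT ancestor, ? FROM tree_paths WHERE descendant = ?
        UNION ALL
        SELECT ?, ?
        """,
        (comment_id, parent_id, comment_id, comment_id),
    )


add_child(conn, parent_id=3, comment_id=4, body="reply to 3")

# Whole subtree under node 2 in one query, no recursion needed.
descendants_of_2 = [
    row[0]
    for row in conn.execute(
        "SELECT descendant FROM tree_paths WHERE ancestor = ? ORDER BY descendant",
        (2,),
    )
]

# All ancestors of node 4 in one query.
ancestors_of_4 = [
    row[0]
    for row in conn.execute(
        "SELECT ancestor FROM tree_paths WHERE descendant = ? ORDER BY ancestor",
        (4,),
    )
]
```

The price is visible in tree_paths: a chain of four nodes already needs ten rows, which is the table-size drawback mentioned above.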
ID required
Antipattern: Adding a generic id primary key to every table, even when it is not the appropriate primary key for that table (e.g., there is a natural primary key like a social security number, or a compound key would fit). If a table already has a unique column, that column could most probably be the primary key.
Solution: declare a primary key on a field that is easy to index and makes the most sense for that particular table.
Keyless entry
Antipattern: when you don't use the constraints offered by the DB (especially foreign keys).
Legitimate Uses of the Antipattern: never, unless your DB doesn't support it
Solution: always declare constraints
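A short sketch of what a declared foreign key buys you, using Python's sqlite3 with hypothetical accounts and bugs tables. One SQLite-specific detail: foreign key enforcement is off by default there and has to be enabled per connection with PRAGMA foreign_keys = ON. Other databases typically enforce declared constraints out of the box.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE bugs (
        bug_id      INTEGER PRIMARY KEY,
        reported_by INTEGER NOT NULL REFERENCES accounts(account_id),
        summary     TEXT NOT NULL
    );
    INSERT INTO accounts VALUES (1, 'alice');
""")

# A valid reference is accepted.
conn.execute("INSERT INTO bugs VALUES (1, 1, 'crash on save')")

# A reference to a missing account is rejected by the DB itself.
try:
    conn.execute("INSERT INTO bugs VALUES (2, 999, 'orphan row')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting an account that is still referenced is rejected as well.
try:
    conn.execute("DELETE FROM accounts WHERE account_id = 1")
    delete_rejected = False
except sqlite3.IntegrityError:
    delete_rejected = True
```

Without the constraint, both statements would succeed and you would need application code (and periodic cleanup queries) to catch the orphans.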
Mixing data with metadata
Antipattern: Storing a data value in a metadata identifier. For example, by appending the year onto the base table name, we’ve combined a data value with a metadata identifier.
This is the reverse of mixing data with metadata from the Entity-Attribute-Value and Polymorphic Associations antipatterns. In those cases, we stored metadata identifiers (a column name and table name) as string data.
In Multicolumn Attributes and Metadata Tribbles, we’re making a data value into a column name or a table name. If you use any of these antipatterns, you create more problems than you solve.
Entity-attribute-value
Antipattern: when you have a table with the columns: entity_id, attribute_name, attribute_value
It appears when you model OOP inheritance and the subtypes have different fields. If you use it, you lose referential integrity and data type validation.
Legitimate Uses of the Antipattern: you shouldn't use this in a relational DB. Just use a non-relational DB or one of the solutions below.
Solutions:
- Single table inheritance: one table with all the attributes of the children (it's going to be a sparse table)
- Concrete table inheritance: completely independent tables for each subtype
- Class table inheritance (one table for the parent properties and tables with the child-specific properties)
- Semistructured data (one table with the parent properties and an extra blob field with the child-specific attributes). Disadvantages similar to Entity-attribute-value.
Polymorphic associations
Antipattern: When a single table needs to reference multiple parent tables (e.g., comments on features and on bugs). You're using the antipattern if a column stores the name of the parent table that the row belongs to.
Solutions:
- Create intersection tables
- Create a common super table (with only ids) and reference that from your new table. The children will also reference the super table (e.g., comments references issues; bugs references issues; features references issues)
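The common super table can be sketched like this, reusing the issues/bugs/features/comments names from the bullet above. The extra columns (severity, sponsor, body) are assumptions added only to make the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
    -- Super table: holds nothing but the shared id.
    CREATE TABLE issues (issue_id INTEGER PRIMARY KEY);
    CREATE TABLE bugs (
        issue_id INTEGER PRIMARY KEY REFERENCES issues(issue_id),
        severity TEXT NOT NULL
    );
    CREATE TABLE features (
        issue_id INTEGER PRIMARY KEY REFERENCES issues(issue_id),
        sponsor  TEXT NOT NULL
    );
    -- Comments reference the super table, so a real foreign key is possible.
    CREATE TABLE comments (
        comment_id INTEGER PRIMARY KEY,
        issue_id   INTEGER NOT NULL REFERENCES issues(issue_id),
        body       TEXT NOT NULL
    );
    INSERT INTO issues VALUES (1), (2);
    INSERT INTO bugs VALUES (1, 'high');
    INSERT INTO features VALUES (2, 'acme');
    INSERT INTO comments VALUES (10, 1, 'reproduced'), (11, 2, 'nice idea');
""")

# The parent type is derived by joining, not read from a "table name" column.
comments_with_kind = conn.execute(
    """
    SELECT c.body,
           CASE
               WHEN b.issue_id IS NOT NULL THEN 'bug'
               WHEN f.issue_id IS NOT NULL THEN 'feature'
           END AS kind
    FROM comments AS c
    LEFT JOIN bugs AS b ON b.issue_id = c.issue_id
    LEFT JOIN features AS f ON f.issue_id = c.issue_id
    ORDER BY c.comment_id
    """
).fetchall()
```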
Multicolumn attributes
Antipattern: When an entity's attribute can have multiple values, you create multiple columns called attribute1, attribute2, etc.
Solutions:
- Create a dependent table (a table that references the initial table by id and has the attribute as the other column)
- Store each value with the same meaning in a single column.
Metadata tribbles
Antipattern: To support scalability and performance, you clone tables or columns, one per data value, e.g., sales_2013, sales_2014, etc. (tables with the same schema holding different rows). Basically, your database ends up with one table per "some data value".
Legitimate Uses of the Antipattern: when splitting the database sensibly makes database administration tasks easier after the database size passes a certain threshold.
Solutions:
- Using Horizontal Partitioning: rows are separated into partitions. Mostly managed by the DB.
- Using Vertical Partitioning: splitting a table by columns. Databases usually do this for TEXT and BLOB.
Physical Database Design Antipatterns
Rounding errors (when you want to store floating point numbers in the DB)
Antipattern: When you use FLOAT, DOUBLE, or any other related DB type. The problem is how the number is represented internally: binary floating point cannot store many decimal fractions (such as 0.1) exactly, so values get rounded, equality comparisons fail, and aggregated sums drift.
Legitimate Uses of the Antipattern: scientific applications.
Solution: use NUMERIC or DECIMAL.
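A quick way to see the problem for yourself. SQLite has no true fixed-point DECIMAL type, so in this sketch Python's decimal module stands in for what a NUMERIC/DECIMAL column gives you in other databases. That substitution is mine, not the book's.

```python
import sqlite3
from decimal import Decimal

conn = sqlite3.connect(":memory:")

# Binary floating point: 0.1 + 0.2 is not exactly 0.3, so equality fails.
float_equal = conn.execute("SELECT 0.1 + 0.2 = 0.3").fetchone()[0]

# The usual float workaround is comparing within a tolerance.
float_close = conn.execute(
    "SELECT ABS((0.1 + 0.2) - 0.3) < 0.000001"
).fetchone()[0]

# Exact decimal arithmetic: the behavior NUMERIC/DECIMAL columns are meant to give.
decimal_equal = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
```

Here float_equal comes back as 0 (false) while decimal_equal is True, which is the whole argument for storing money and similar values as exact decimals.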
31 flavors
Antipattern: When you want to restrict a column to specific values (e.g., the status of a ticket), you define the allowed values in the table's schema (e.g., with ENUM or a CHECK constraint) or in a trigger.
Legitimate Uses of the Antipattern: when you have very few values that will never change. But you might still get it wrong (e.g., gender).
Solution: create a lookup table.
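A minimal sketch of the lookup table, assuming a hypothetical bug_status table referenced by bugs.status. The point is that the allowed values become data you can query and extend, instead of being buried in the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
    CREATE TABLE bug_status (status TEXT PRIMARY KEY);
    CREATE TABLE bugs (
        bug_id INTEGER PRIMARY KEY,
        status TEXT NOT NULL DEFAULT 'NEW' REFERENCES bug_status(status)
    );
    INSERT INTO bug_status VALUES ('NEW'), ('IN PROGRESS'), ('FIXED');
""")

# The allowed values are ordinary rows, e.g. to fill a dropdown in the UI.
allowed = [
    row[0] for row in conn.execute("SELECT status FROM bug_status ORDER BY status")
]

# Adding a value is an INSERT, not a schema change.
conn.execute("INSERT INTO bug_status VALUES ('DUPLICATE')")
conn.execute("INSERT INTO bugs (bug_id, status) VALUES (1, 'DUPLICATE')")

# Values outside the lookup table are still rejected.
try:
    conn.execute("INSERT INTO bugs (bug_id, status) VALUES (2, 'BOGUS')")
    bogus_rejected = False
except sqlite3.IntegrityError:
    bogus_rejected = True
```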
Phantom files
Antipattern: When you want to store images or other bulky media, you keep the files directly on the filesystem and only reference them from the DB. The files need to be backed up separately, and the DB does not manage them.
Legitimate Uses of the Antipattern: when storing the files in the DB would hurt how your DB behaves (e.g., its size or performance).
Solution: use BLOB data types as needed.
Index shotgun
Antipattern: When you want to optimize the performance of the DB, you throw indexes here and there without thinking too much about it. The result: no indexes, too many indexes, or queries that cannot benefit from the indexes you have.
Legitimate Uses of the Antipattern: none
Solution:
- Measure the query times
- Get the query execution plan (QEP)
- Analyze the QEP and find where an index needs to be added; after creating the index, test again
- Optimize: covering (compound) indexes could speed things up even more; consider in-memory vs. on-disk indexes
- Rebuild: because indexes become fragmented as rows are deleted/created/updated
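The "get the plan, add the index, check again" loop can be sketched with SQLite's EXPLAIN QUERY PLAN. The bugs table, the idx_bugs_status index and the query_plan helper are hypothetical names. Other databases expose the plan through their own commands, and the exact wording of the plan text varies between versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bugs (bug_id INTEGER PRIMARY KEY, status TEXT, summary TEXT)"
)
conn.executemany(
    "INSERT INTO bugs (status, summary) VALUES (?, ?)",
    [("OPEN" if i % 10 == 0 else "FIXED", f"bug {i}") for i in range(1000)],
)


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


sql = "SELECT summary FROM bugs WHERE status = ?"

# Before: no usable index, so the planner scans the whole table.
plan_before = query_plan(conn, sql, ("OPEN",))

# Add the index that the plan suggests is missing, then look at the plan again.
conn.execute("CREATE INDEX idx_bugs_status ON bugs (status)")
plan_after = query_plan(conn, sql, ("OPEN",))
```

Printing plan_before and plan_after side by side is usually enough to see whether the new index is actually picked up. If it is not, the index is just write overhead.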
Query Antipatterns
Fear of the unknown
Antipattern: thinking that null is just another value, or using an ordinary value (such as -1 or an empty string) to represent null. SQL treats null as a special marker, different from zero, false, or an empty string. As a general rule, expressions and comparisons involving null yield null.
Solution: Treat null as a unique value and test for it with IS NULL / IS NOT NULL. Declare columns NOT NULL when it makes sense. Also, consider whether a DEFAULT makes sense.
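A few one-liners show how null behaves in practice. This sketch uses Python's sqlite3 (where SQL NULL comes back as None) and a hypothetical bugs table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Comparisons and arithmetic with NULL yield NULL (None in Python);
# IS NULL and COALESCE are the tools that handle it explicitly.
null_equals_null, null_plus_one, null_is_null, coalesced = conn.execute(
    "SELECT NULL = NULL, NULL + 1, NULL IS NULL, COALESCE(NULL, 'n/a')"
).fetchone()

conn.executescript("""
    CREATE TABLE bugs (bug_id INTEGER PRIMARY KEY, assigned_to INTEGER);
    INSERT INTO bugs VALUES (1, 10), (2, 20), (3, NULL);
""")

# "Not assigned to 10" silently drops the unassigned bug,
# because NULL <> 10 is not true.
not_assigned_to_10 = [
    row[0]
    for row in conn.execute(
        "SELECT bug_id FROM bugs WHERE assigned_to <> 10 ORDER BY bug_id"
    )
]

# The unassigned row has to be asked for explicitly.
including_unassigned = [
    row[0]
    for row in conn.execute(
        "SELECT bug_id FROM bugs "
        "WHERE assigned_to <> 10 OR assigned_to IS NULL ORDER BY bug_id"
    )
]
```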
Ambiguous groups
Antipattern: It typically shows up when you try to get the row with the greatest value per group: you use GROUP BY and also try to read extra information from the matching row, in other words, you reference nongrouped columns (columns not in the GROUP BY expression) in the select list of the query.
Every column in the select list of a query must have a single value per row group. This is called the Single-Value Rule. Columns named in the GROUP BY clause are guaranteed to have exactly one value per group, no matter how many rows the group matches.
Legitimate Uses of the Antipattern: none because most DBs will throw an error
Solutions:
- Don't add an extra column to the select list
- Using a Correlated Subquery (own note: seems pretty hard to implement)
- Using a Derived Table: use the group query as a subquery of a query that selects the columns that you need, joining their results (for example with a left join, see https://stackoverflow.com/questions/7745609/sql-select-only-rows-with-max-value-on-a-column)
- Using an Aggregate Function for Extra Columns -> apply a function to the extra column in the select list so that you ensure that there's only one value coming back (e.g., max or GROUP_CONCAT)
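Here is a sketch of the derived table solution for "latest bug per product". The bugs table and its columns are assumptions for the example. Note that SQLite itself is lenient about nongrouped columns, so the sketch only shows the portable form that follows the Single-Value Rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bugs (
        bug_id        INTEGER PRIMARY KEY,
        product       TEXT NOT NULL,
        date_reported TEXT NOT NULL
    );
    INSERT INTO bugs VALUES
        (1, 'editor', '2023-01-10'),
        (2, 'editor', '2023-03-05'),
        (3, 'viewer', '2023-02-01'),
        (4, 'viewer', '2023-01-15');
""")

# The derived table computes one value per group; joining it back to bugs
# retrieves the other columns of the matching row.
latest_per_product = conn.execute(
    """
    SELECT b.product, b.bug_id, b.date_reported
    FROM bugs AS b
    JOIN (
        SELECT product, MAX(date_reported) AS latest_date
        FROM bugs
        GROUP BY product
    ) AS m
      ON m.product = b.product AND m.latest_date = b.date_reported
    ORDER BY b.product
    """
).fetchall()
```

One caveat with this form: if two bugs of the same product share the latest date, both rows come back, so add a tie-breaker if you need exactly one row per group.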
Random selection
Antipattern: When you must fetch a random sample from one of your tables, you sort data randomly -> ORDER BY RAND(). This operation does a full table scan.
Legitimate Uses of the Antipattern: only when you know for sure that the size of the data that you're randomizing will not be bigger than 50-100 rows.
Solutions:
- Pick a Random Primary Key Value Between 1 and MAX; only works when primary keys are contiguous
- Choose Next Higher Key Value: just like before, except you pick the first existing key at or above the random value (so it works when the keys have gaps). The results will not be evenly distributed
- Get a List of All Key Values. Choose One at Random (do it in the application; does not scale well)
- Choose a Random Row Using an Offset: pick a random number between 0 and total_rows - 1 and use it in an OFFSET clause
- Search the docs of your DB of choice
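The offset approach from the list above could look roughly like this. random_row and the bugs table are hypothetical names, and the ids are deliberately non-contiguous to show that gaps don't matter here.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bugs (bug_id INTEGER PRIMARY KEY, summary TEXT NOT NULL)")
# Keys with gaps, as you would have after deletions.
conn.executemany(
    "INSERT INTO bugs VALUES (?, ?)",
    [(i, f"bug {i}") for i in (1, 2, 5, 8, 13, 21)],
)


def random_row(conn, rng=random):
    """Pick one row uniformly: count the rows, then jump to a random offset."""
    total = conn.execute("SELECT COUNT(*) FROM bugs").fetchone()[0]
    if total == 0:
        return None
    offset = rng.randrange(total)  # integer in [0, total - 1]
    return conn.execute(
        "SELECT bug_id, summary FROM bugs ORDER BY bug_id LIMIT 1 OFFSET ?",
        (offset,),
    ).fetchone()


picked = random_row(conn)
```

Every row has the same chance of being picked, at the cost of two queries. The DB may still have to skip over the rows before the offset, so measure it on large tables.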
Spaghetti query
Antipattern: When you want to achieve everything in only one SQL query (solving a complex problem in one step). The risks:
- You might create a cartesian product without wanting to do that
- It's going to be hard to maintain
Legitimate Uses of the Antipattern: when you're going to use that query as a datasource in a 3rd party app
Solutions:
- Split your initial query into multiple smaller ones
- Use UNION (to combine the results of the smaller queries)
- Writing SQL Automatically—with SQL -> use concat to create a list of queries to run; you can also do this with a script in bash
Readable passwords
Antipattern: Store Password in Plain Text
Solution: hash your passwords with a salt directly in your application.
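As a rough illustration of hashing with a salt in the application, here is a sketch based on PBKDF2 from Python's standard library. The function names and the iteration count are my assumptions, not something the book prescribes. In a real project you would more likely use a maintained password-hashing library and tune the cost factor.

```python
import hashlib
import hmac
import os

# Assumption: an illustrative work factor; tune it for your hardware.
ITERATIONS = 200_000


def hash_password(password):
    """Return (salt, digest); only these two values are stored in the DB."""
    salt = os.urandom(16)  # a fresh random salt per password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Re-hash the candidate with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return hmac.compare_digest(candidate, expected_digest)


salt, stored_digest = hash_password("correct horse battery staple")
```

Since the hashing happens in the application, the plain-text password never appears in SQL statements, query logs, or the table itself.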
SQL injection
Antipattern: when writing dynamic SQL queries, you end up executing unverified input as code.
Solutions:
- Parameterize Dynamic Values (use query parameters)
- Filter Input
- Do your best not to write raw SQL strings by hand. Rely on your data access framework as much as possible
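To see why query parameters matter, compare string concatenation with a parameterized query. This is a sketch with a hypothetical accounts table. The ? placeholder is sqlite3's parameter style, and other drivers use different markers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    INSERT INTO accounts VALUES (1, 'alice'), (2, 'bob');
""")

user_input = "nobody' OR '1'='1"

# Vulnerable: the input is spliced into the SQL text and becomes code,
# turning the condition into: name = 'nobody' OR '1'='1'
unsafe_sql = "SELECT name FROM accounts WHERE name = '" + user_input + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the input travels as a parameter and is only ever treated as a value.
safe_rows = conn.execute(
    "SELECT name FROM accounts WHERE name = ?", (user_input,)
).fetchall()
```

The concatenated query returns every account, while the parameterized one returns nothing, because no account is literally named "nobody' OR '1'='1".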
Diplomatic immunity
Antipattern: when you want to code fast, you skip engineering best practices for the database. You make SQL a second-class citizen, not giving the DB the same importance that the code gets in terms of quality.
Solution:
- Forget about self-documenting code. It's a myth. Use Entity-relationship diagrams, break them down into functionalities and mention triggers and stored procedures
- Use source control
- Have tests that test the interaction with the DB
Magic beans
Antipattern: when designing an MVC application and you don't consider enough how to separate the app logic between the M, V, and C. (Own note: I'll probably need to re-read this chapter from the book)
Solution: Active record is a design pattern that maps objects to DB tables. The Model Has an Active Record -> basically, build services that compose the models. Aim to make your model a domain model, not a database model.
Rules of normalization
Should be considered only in the context of your app. How much do you want to normalize/denormalize? It's a tradeoff. And it would be best if you benchmarked it.
The objectives of normalization:
- To represent facts about the real world in a way that we can understand
- To reduce storing facts redundantly and to prevent anomalous or inconsistent data
- To support integrity constraints
The forms
- First Normal Form: no repeating groups (no multicolumn attributes, no multiple values in one column)
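- Second Normal Form: only relevant for tables with a compound primary key; every non-key column must depend on the whole key, not just on part of it
- Third Normal Form: single responsibility principle for tables (put the column in the table that it belongs to); no non-key column should depend on another non-key column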
- Advanced 3rd form (Boyce-Codd Normal Form): table doesn’t contain any field (other than the primary key) that can determine the value of another field. Example: teacher, subject, student. The subject is always dependent on the teacher.
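- Fourth Normal Form: one table must not store two or more independent many-to-many relationships; give each relationship its own intersection table
- Fifth Normal Form: a table in BCNF that has no compound primary key is already in fifth normal form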
- Sixth normal form: It’s typically used to support a history of changes to attributes (e.g., an audit log).
See no evil
Antipattern: discarding DB error messages or not looking at the raw SQL that gets run when debugging the application.
Solution: Recover from Errors Gracefully -> log potential exceptions every time.
Antipatterns that are not that current anymore?
Implicit Columns
Antipattern: when you want to reduce typing, you leave out the column names instead of listing them explicitly:
- An INSERT without a column list: when you refactor the table, you might break the insert or, even worse, silently write values into the wrong columns
- A SELECT with * will fetch all the columns, including the ones you don't need
Legitimate Uses of the Antipattern: when you're writing queries to try things out
Solution: Name Columns Explicitly
Poor man’s search engine
Antipattern: when you want to do a full-text search, you use pattern-matching predicates (e.g., LIKE, REGEXP).
Solution: don't use SQL. Elasticsearch has made this one easy. If you have to use SQL for simple cases, each vendor already has its own full-text search solution.
Pseudokey neat-freak
Antipattern: when you want contiguous keys in a table (e.g., after deletions in the DB), you try to fill in the gaps. This takes two forms:
- Reusing gaps: instead of allocating a new primary key value using the automatic pseudokey mechanism, you make each new row use the first unused primary key value
- Renumbering existing rows: changing the existing keys so that they are contiguous
Legitimate Uses of the Antipattern: none.
Solution: just let it be. Think of primary key ids as opaque identifiers, like UUIDs. The fact that they are consecutive is just a coincidence of the implementation.