Rap2h

Posted on Jun 28, 2022

SQL tips for beginners; what I learned after 10+ years of occasionally writing queries and schemas as a web dev

Rap2h · 2022-06-28T20:56:27Z

Index nearly everything, index early . Not indexing results in poor performances. Indexing too late can create downtime or inconsistent data. Do not consider removing/avoiding index to preserve disk space, buy more space instead. Remember that index may have an impact on write performance though. PostgreSQL is your best choice , it has a skilled community, including StackOverflow heroes . There is no MySQL, MariaDB, or SQLite user group where I live. ORM are great for simple CRUD operation . Don't use it to query specific stats or any complex query, use raw queries. BTW, for TS/JS devs, Prisma 3 is cool, Sequelize is obsolete. SQL syntax is case insensitive, use lower case everywhere (and underscores), do not name your tables with camelCase. ALSO, YOU DON'T HAVE TO YELL. select id from users is the same as SELECT id FROM users . For string comparison, case sensitivity depends on collation . Never miss a foreign key . Implicit links and relations between tables should be avoided. Follow a pattern when naming tables and columns . Example: users , posts , comments (plural). post_user , comment_post , comment_user for junction tables (singular + alphabetical order). id for primary key, id_user , id_comment , and id_post (singular + id first) for foreign key. Reading a database schema is hard for newcomers. Be super consistent when naming tables. 1 model ≠ 1 table . Unfortunately, when designing a database you have to think at a lower level than "models" and "objects". You will have to deal with technical tables. Use the right data type . Do not create "custom" data types defined by an implicit contract. JSONB is great but don't overuse it. Also, learn to query it it's complex yet powerful . When you iterate a lot on designing a schema, you can use JSONB to avoid modifying the schema over and over. PostgreSQL does not scale automatically. Expect migrations and downtime. CTE is more readable than subqueries , you can read it from the top to the bottom. Also, you can understand the way it was designed. Keep tables schema simple , avoid patterns such as EAV that makes tables hard to understand and read; try to keep most tables as simple as vertical CSV sheets, with one line for each record. Don't be afraid of writing long and verbose queries to address a specific need, it's often more efficient. The order of the words in a query is not the order in which they are evaluated. Know the order of execution : from, where, group by, having, order by, select, limit. 120 tables are way too much , even for big products. You can build a robust application with 25 tables. Thank you for reading 🙏

#webdev #programming #sql #postgres

Index nearly everything, index early. Not indexing results in poor performances. Indexing too late can create downtime or inconsistent data. Do not consider removing/avoiding index to preserve disk space, buy more space instead. Remember that index may have an impact on write performance though.
PostgreSQL is your best choice, it has a skilled community, including StackOverflow heroes. There is no MySQL, MariaDB, or SQLite user group where I live.
ORM are great for simple CRUD operation. Don't use it to query specific stats or any complex query, use raw queries. BTW, for TS/JS devs, Prisma 3 is cool, Sequelize is obsolete.
SQL syntax is case insensitive, use lower case everywhere (and underscores), do not name your tables with camelCase. ALSO, YOU DON'T HAVE TO YELL. select id from users is the same as SELECT id FROM users. For string comparison, case sensitivity depends on collation.
Never miss a foreign key. Implicit links and relations between tables should be avoided.
Follow a pattern when naming tables and columns. Example: users, posts, comments (plural). post_user, comment_post, comment_user for junction tables (singular + alphabetical order). id for primary key, id_user, id_comment, and id_post (singular + id first) for foreign key. Reading a database schema is hard for newcomers. Be super consistent when naming tables.
1 model ≠ 1 table. Unfortunately, when designing a database you have to think at a lower level than "models" and "objects". You will have to deal with technical tables.
Use the right data type. Do not create "custom" data types defined by an implicit contract.
JSONB is great but don't overuse it. Also, learn to query it it's complex yet powerful. When you iterate a lot on designing a schema, you can use JSONB to avoid modifying the schema over and over.
PostgreSQL does not scale automatically. Expect migrations and downtime.
CTE is more readable than subqueries, you can read it from the top to the bottom. Also, you can understand the way it was designed.
Keep tables schema simple, avoid patterns such as EAV that makes tables hard to understand and read; try to keep most tables as simple as vertical CSV sheets, with one line for each record.
Don't be afraid of writing long and verbose queries to address a specific need, it's often more efficient.
The order of the words in a query is not the order in which they are evaluated. Know the order of execution: from, where, group by, having, order by, select, limit.
120 tables are way too much, even for big products. You can build a robust application with 25 tables.

Thank you for reading 🙏

Latest comments (32)

Trae Zeeofor • Oct 14 '24

Newbie here. Thanks for this!

HAP • Jul 8 '22

Index nearly everything, index early

Don't go nuts with this. Too many indexes can slow down writes.

Also:

When things get complex (and they often will) remember that you can create views and perform reads (and sometimes writes) with those views with a model/SQL.
Be careful with cascade actions. This can make a seemingly simple action incredibly long-running. If you have header tables linked to detail tables and your detail tables have millions of records in them or cascade to other tables that do, I encourage you to use soft deletes at the header-level instead. Once that is done, you can determine if and when to actually delete the data.
If you have queries that have to perform multiple aggregations against a large set of data, you may want to investigate partitioning.
Don't be afraid of VACUUM operations. It is often better to have more frequent, shorter vacuum executions vs fewer, longer ones.
For more modern versions of PostgreSQL, investigate your server specs (the server itself, that is: memory CPU, etc) if you have enough, explore increasing some work memory and adding more parallel workers.
Check out using the nice data types PostgreSQL provides such as range types. Those can be used with special indexing to provide complex operations like range overlap and in some cases nearest neighbor, if I remember correctly.
If you are writing SQL that performs a manipulating operation (INSERT, UPDATE, DELETE), you can use the RETURNING clause to return the (full or partial) rows that were affected.
Think of doing you operations in batches where possible. This can be more efficient when processing large sets of data.
When writing your code, commit at the highest level possible. Don't commit in a low-level loop, unless you really like your code to run slow and be I/O intensive.
Don't forget you can cast certain types to others with a double colon (::). Ex:
```
select integer_col::text from table;
```
Understand that sometimes anti-join patterns using EXISTS or NOT EXISTS with a subquery can sometime execute faster that a LEFT/RIGHT JOIN where col is null.

-- 40-year DB developer

Rap2h • Jul 12 '22

Wow! Thank you so much for your answer! Your comment is actually a gold mine and contains original content and really useful thoughts! It should be an actual article by itself.

I agree with all points. The last point in particular saved my dev life more than once: it could be considered a detail but actually, it is a game-changer! (and it's not well-known)

Let me know if you write an article with your contents, I will be glad to share it (or review it before publication if you want)! It could be a "SQL tips for ~~beginners~~ intermediate" :D

Thank you!

danjelo • Jul 6 '22

Great article, agree on many things!

Regarding indexing I would be at bit wary though when to add them. Most DB's has queryplans and such to determine when to use an index, guess it is based on rowcount and cardinality and other things.

Also composite indexes are worth mentioning,
i.e an index of (columnA, columnB) might be used for
WHERE columnA = value AND columnB = value
and also:
WHERE columnA = value

Rap2h • Jul 12 '22

Thank you for your feedback.

I guess you are right about index (you are not alone, many comments are similar, here and on reddit). Thank you for the precision about composite indexes!

Sanam • Jul 5 '22 • Edited

Let me correct some of these,

Use uppercase for the sql syntax (this is standard practice in dba world)
Use 25 tables instead of 120 is conflicting, Database should be normalizes and its always good to split the SOC table wise and play with id. This point basically disagree the database architecture pattern.

Rap2h • Jul 12 '22 • Edited

Thank you for your feedback! (for first point, I guess case is also a matter of opinions: dev.to/redhap/comment/203hn)

Sanam • Sep 22 '22

Yes it is. Everyone have own style of formatting and preferences :)

Sergey Lukin • Jul 2 '22

Agree with all points with a small note on "Don't be afraid of writing long and verbose queries to address a specific need, it's often more efficient.": in my experience, having small one-thing-per-query SELECTs in the code leads to much more robust performance at scale as caching is more explicit and leads to less stress on DB overall. My 2 cents.

Rap2h • Jul 12 '22

Thank you for your feedback!

We may have a different use case! Maybe my use case is more for things like specific stats (let say when a kind of a big introspection in database is required to only display a chart by month for the last 6 months). I agree with you for other cases: sometimes it's better to write a first query, then pass the resulting ids to a second query.

Costin Manda • Jul 2 '22

I have to say that I don't agree with many of your points.

Indexing is a subtle operation that needs understanding of what and why one does it. The space and write performance costs are important enough to invalidate your assumption on indexing everything.
I would rather invite database designers to separate the tables that require reading from those that are heavily written in and construct some mechanism for synchronizing them. Indexing for reading is very different from indexing for writing and table structure is very different for reading and writing as well.

Also... PostgreSQL being the best. What trolling :) I actually loved PostgreSQL 20 years ago, but then I've worked on SQL Server almost exclusively. It's amazing how much stuff they have added every year to it. I expect most other engines have had a similar spectacular growth. Again, here I would say that the engine should fit your requirements.

About SQL being case insensitive in keywords or the styling... I find that irrelevant. There are tools that can change the SQL code style automatically. Basically you can write whatever you want and then shape it to fit... your company's code style. I know people who come into a company where everyone writes in one way and they stubbornly write in another way because "this is the way". They are just annoying, not right.

I do agree with ORMs being a shortcut to simple things and when you want something done right or specific you need to "get your hands dirty".
I personally would go even so far as to say ORMs are harmful. Because even if one follows coding guidelines and abstract the DAL correctly, maintaining code that handles data in multiple ways is difficult and messy. If ORMs are good only for simple things, that means that SQL queries for the same effect would be simple. Therefore there is no need to use one.

Thanks for sharing your viewpoint and inviting a discussion on them.

Rap2h • Jul 12 '22

Thank you for your detailled feedback! I'm glad you don't agree with most of my points since it helps me reviewing my thoughts, so thank you!

As a disclaimer, this article has (at least) two big flaws:

It is mostly opinions (and I'm too assertive in my points)
It is advices for beginners (so when I write "do this" it's not that simple in a real world where each situation is different).

Here is a try to answer all your points!

Indexing is a subtle operation

You are right! This is the most posted comment I received both here and on reddit, so I guess I'm wrong. I wrote this because beginners (in my experience) often forget to index things, so it was something like "hey, don't forget to index obvious things you query all the day".

PostgreSQL being the best. What trolling :)

Yes, sorry about that. I should have write: "as a beginner who don't know what SQL DBMS to choose, PostgreSQL could be a good candidate".

About SQL being case insensitive

I have no strong opinion on this. When I use Datagrip, and when columns and table names have upper case, it does not auto-correct, so it's annoying. Your point is valid: a new dev that would come and says "let's change all our habits because mine are better" would be ... uh... annoying I guess!

I do agree with ORMs being a shortcut to simple things

That's an interesting point!

Again, thank you for your feedback, I hope readers will see it too. BTW, I recommand the feedback by @redhap : dev.to/redhap/comment/203hk which is like yours: a deeper questioning about SQL querying than mine!

Hung Luong • Jul 1 '22

This could be just a personal quirk but I'd really prefer SQL keywords and functions to be uppercase, while table and column names as lowercase as that help me to easily distinguish the boilerplate from actual data

HAP • Jul 8 '22

I always used spacing for that. I find that it's easier (for me) to keep case changes to a minimum.
I prefer writing statements in this fashion:

select col,
       col2, 
       col3
  from table
  join table_2
    on table_2.col = table.col
 where table.col2 = 'value'
   and table.col3 > other_value;

This way I always have my actions on the left and my columns, values, expressions on the right.

But that's just me, possibly. I find it neat and clean and easier to focus on instead of case changes confusing my attention.

Kazaryan Fyodor • Jun 30 '22

Thanks for writing this post. I agree with almost all of the statements, except 3rd one. Could you explain in more detail, why using an ORM are not recommended in big projects? In my opinion, it helps to save code clean and readable, even helps to reuse redundant code compared to raw SQL queries.

Rap2h • Jul 12 '22

It's not for "big projects", it's for complex queries. It's hard to write complex query with query builder, and an optimized query is really sometimes 1K more rapid than 4 queries made to the ORM via its objects or a weird query built by the ORM via its query builder. You should still use your ORM on big projects, but for simple CRUD operations: Users.findById(123) is fine!

AISuperHub • Jun 29 '22

Thank you so much for the sql tips specifically on the PostgreSQL ones. I recently did some research and wrote an article to compare PostgreSQL and MySQL syntax. Sharing here to further improve knowledge.

tipseason.com/postgres-vs-mysql-sy...

Rap2h • Jun 29 '22

Hi @tipseason! Thank you for your feedback and the link to the article! (bookmarked ✨)

AISuperHub • Jun 29 '22

Thanks Raphael.

Arnaud Ambroselli • Jun 29 '22

1 model ≠ 1 table. Unfortunately, when designing a database you have to think at a lower level than "models" and "objects". You will have to deal with technical tables.

Being almost a newbie in SQL, I don’t understand the subtility here… can you tell a bit more what you mean ?

Toomas Jaska • Jul 4 '22

Aah. This threw me off thinking you mean one model ought not to be constrained in 1 table. What you really meant is every table isn't necessarily a representation of a model.

Rap2h • Jun 29 '22 • Edited

Thank you for your feedback! The most obvious case is junction tables (n..m relationship), i.e: users have n groups, groups contain m users, then you need a group_user table (a.k.a a junction table). This kind of table represents a relation between two models, not a model itself.

View full discussion (32 comments)