Are you sure you know what NoSQL is?

Joey Hernández ・ 19 min read

When NoSQL came around it was rapidly apparent that there was more to it than its name suggested.

SQL is just a language but it would commonly be used to refer to solutions that were more than just a difference in language.

Those new to the scene, in their naivety, were just happy they didn't have to learn SQL. After all, no one likes injections.

SQL is merely a language; the implementation of the language interpreter/compiler and the underlying storage engine are two very different things.

In practice, NoSQL would typically not only refer to a different language but also different storage engines. Typically specialised engines that do not come with the consistency and integrity guarantees of typical relational SQL databases.

It is possible to create SQL implementations for many databases that might be considered NoSQL. Some do have SQL implementations, although usually crippled ones, as the engines do not support many of the capabilities SQL expects; in reverse, SQL might not be expressive enough to support all the capabilities of a given storage engine.

The confusion is easily understood. SQL is a high level declarative language that does many things for you. It's easy then to mistake high level engine features, such as transactions, relational integrity checks and the many other features that come with relational databases, as being part of SQL.

There's not only the proper separation of SQL from engines to consider, with engines themselves offering many layers of capabilities and features, but also a lack of awareness that you don't necessarily have to enable or use all of those extra features that might impact performance or scalability.

Perhaps the worst mistake was treating NoSQL a bit like buying a better computer, that is, the same but faster and more extensible (for example, console versus PC).

Reality

  1. NoSQL isn't just replacing one syntax with another. In practice, the change is not skin deep. It also means no ACID, no data integrity guarantees, no relations, etc. If you want to achieve the equivalent, the way you go about it can be far more complicated.
  2. Traditional heavy SQL users will be painfully aware of SQL's limits, and despite that, traditional SQL solutions are surprisingly capable and robust.
  3. The NoSQL solutions people would pursue were often a result of extreme situations.
  4. Many successful companies fronted NoSQL solutions. People made the mistake of assuming they were successful because they used those solutions. In fact, they were able to experiment and do things in a strange way because they were already successful.

Facebook grew out of PHP and MySQL, which many elitists view as the Fisher-Price of web development, and while they might have their heads up their arses, they're not entirely wrong.

In reality I've very rarely seen people get more than a fraction out of traditional database systems which often give you a lot out of the box. When I pick up a legacy codebase, half the time, to be generous, the problem was the developer reaching their limits and not the limits of their database system.

Many developers set themselves up to fail like this. They want to be hot shots fresh from the barrel but it doesn't work like that. I've heard people say the ORM should handle that, but it rarely works out.

The problem isn't that these solutions are necessarily always bad but that every key fits a lock and it takes real practical experience to master pairing solutions with problems. You can only go so far offloading the problem before you have to get your hands dirty.

In many cases where I have rescued legacy codebases, some of the limbs that needed to be amputated to save them were attempts to make use of NoSQL. In virtually all of these cases, the choices mostly make theoretical sense on paper, from an abstract perspective, but from a more down to Earth perspective they don't pan out.

To be fair to NoSQL, be it SQL or not, the hardest part of any legacy system to maintain very quickly becomes the database. Code can be rewritten and replaced. I've seen ridiculous codebases grow to gigabytes, and you can throw most of that away trivially, but not the data, which almost always grows on its own, out of control, to greatly exceed the size of the codebase.

For big companies that can afford it, it's 50/50. Sometimes they release their solutions to their specific problems because they can, and sometimes they release stuff because it's not really that amazing and it's role reversal, dominating from the bottom. If they can't be arsed to maintain something, then open source it, or rather crowd source it.

It's a give and take world so you have to be discerning. It's not a hand out, you're expected to pick up the slack. You might be QA, a guinea pig.

Either way, I've not seen NoSQL solve any problems for the common case.

No SQL. Keep the engine.

As far as I'm concerned, NoSQL simply means no SQL, or more specifically using a more imperative rather than declarative language. This has been a missed movement, because SQL can be a performance problem in itself regardless of the underlying engine.

Declarative describes the result rather than the process. For example, if I ask for a cup of coffee without milk or sugar, that's declarative: I'm concerned with the what rather than the how. If I ask someone to take a cup, grind beans into it and add hot water, that's imperative.

In declarative I explicitly ask for the result I want with the process being implied and in imperative I explicitly ask for the process with the result being implied.

Because SQL is declarative, how it gets a result is not something we usually have control over.

When you issue a declarative command, the engine must work out the imperative steps required to produce the result, giving it an immediate performance disadvantage.

It will try to work out the fastest way to do it but won't always get it right. If you want it to get it right, you may need to give it more time to work out the solution. In some cases it may be impossible for the computer to work out what a human can, even in cases that might appear simple. In others it can spend so much time working out the fastest way that it takes up more time than it saves.

This alone might make the case for taking prepared queries to an extreme: it should be possible to specify the level of optimisation. Unfortunately, prepared query syntax is also often very simple, so it would not work well for queries with certain types of variability.

There are often cases where the fastest and easiest thing would be to tell the machine what to do up front, but you can't do this with SQL. That isn't strictly true: in some cases implementations do allow you to put things in to tell it, or suggest to it, how to do something. These additional statements added for performance are patchy at best.
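
MySQL's index hints are an example of these patchy statements. A minimal sketch, using a hypothetical invoice table and index name:

# FORCE INDEX narrows MySQL's choice to the named index rather than
# letting the optimiser pick. Table and index names are hypothetical.
SELECT SUM(price)
 FROM  invoice FORCE INDEX (idx_user_date)
 WHERE user = 42
  AND  date BETWEEN NOW() - INTERVAL 1 WEEK AND NOW();

Even then, a hint only constrains the choice of index; it doesn't let you dictate the rest of the plan.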

Sometimes SQL implementations leave a lot to be desired which is not necessarily down to declarative. There can be various reasons for this though a strong reason is that SQL is standardised and SQL vendors don't like to deviate from the standard too much to make their own things.

I'm going to focus specifically on MySQL as it's the one I have the most experience with, particularly its deficiencies. Keep in mind that virtually every other SQL implementation I have touched has deficiencies that you have to discover all over again. Some more than others, but with plenty owing to what's available through SQL rather than their internal engines.

There are cases where things are missing for reasons that aren't clear. For example, in many SQL implementations you normally can't issue an atomic UPDATE that also returns the affected rows, even though the update statement must select those rows anyway.
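
PostgreSQL does offer this through its RETURNING clause; in MySQL the closest you can get is a locking read inside a transaction. A sketch with a hypothetical job table:

-- PostgreSQL: update and get the affected rows back in one atomic statement.
UPDATE job SET claimed = 1 WHERE claimed = 0 RETURNING id;

# MySQL: two statements and a transaction to achieve the same effect.
START TRANSACTION;
SELECT id FROM job WHERE claimed = 0 FOR UPDATE;
UPDATE job SET claimed = 1 WHERE claimed = 0;
COMMIT;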

SQL is an interface, a language, a mask that can be removed. You could separate MySQL in two, for example the storage engine and the interface, leaving you with just My and SQL as separate components. It might be a bit more complex than that as there's still a middle layer but you get the picture.

NoSQL isn't only that but what if it were?

Who are you?

If you've spent a lot of time programming and seen a lot of diverse situations, it starts to become clear that there are always two sides to things. For those without great depth or breadth of experience, it may be confusing exactly what problem they have, or would have, and what it really is any particular NoSQL solution solves.

Day after day I see code with terrible SQL, written by people who clearly suck at SQL. They'll say joins are bad, this is bad and the other, when in reality it's them: they can't even manage the declarative part. Sometimes this isn't helped by ORMs and other layers of abstraction that block them from getting to grips with SQL hands on.

Yet there are other times when I try to do something in SQL, something that should be simple and immediately my response is WTF SQL? There comes a point where the problem is no longer the operator but SQL.

This is sometimes apparent when you want certain results, for example, to insert a generated range. It becomes even more apparent when you have enough programming experience to be able to translate a query yourself from declarative to imperative and know what should be possible in terms of results as well as performance.
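
To make the range example concrete: MySQL has no generate_series (PostgreSQL does), so inserting something as simple as the integers 1 to 1000 needs a recursive CTE workaround. A sketch for MySQL 8.0+ with a hypothetical numbers table:

# The default cte_max_recursion_depth is 1000, so larger ranges
# need that variable raised first.
INSERT INTO numbers (n)
WITH RECURSIVE seq (n) AS (
 SELECT 1
 UNION ALL
 SELECT n + 1 FROM seq WHERE n < 1000
)
SELECT n FROM seq;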

A real programmer is, in essence, a person who translates between declarative and imperative.

Consider this simple query:

# Table a has unique index on col.
# Table b has unique index on parent and score.
SELECT        a.col, SUM(b.tries)
 FROM         a
 INNER JOIN   b
  ON          b.parent = a.id
  AND         b.score BETWEEN 5 AND 10
 WHERE        a.col IN(1, 2, 3)
 GROUP BY     a.col;

Skipping a lot of details most people would expect it to do something like this...

Select the rows matching from table a using the index on col and then for each of 
those find the rows from table b with the parent the same as the id of the 
current row and the score between 5 and 10 using the index on parent and 
score.

What happens in reality is exactly that, but excluding the score part of the index. This makes the query slow, though it can be hard to notice there's a problem until you have a lot of data. It uses the indexes two thirds of the way, each step dividing the result set, so it can appear to be fast and optimal. It takes a fair amount of data to notice the difference between (((N / a) / b) / c) and ((N / a) / b), especially as a, b and c tend to increase as N does.

There's nothing wrong with the SQL that has been written apart from some questionable formatting. The problem isn't the SQL but the software that's reading it and executing it. Possibly some mistakes due to being anonymised.

This won't necessarily be obvious from a simple test. Tools such as EXPLAIN and profiling can be difficult to parse or be sure about, as they are very basic summaries. EXPLAIN might drop a hint through what it leaves out: it will list the index it's using and state that it's using it, which is only half true when it's only using part of it.
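
One clue that can help, assuming you know the column sizes involved: EXPLAIN's key_len column reports how many bytes of the chosen index are actually used.

# With a composite index on (parent, score), both 4-byte NOT NULL INTs,
# key_len 4 means only parent is used; key_len 8 means both parts are.
EXPLAIN
SELECT b.parent, SUM(b.tries)
 FROM  b
 WHERE b.parent IN (1, 2, 3)
  AND  b.score BETWEEN 5 AND 10
 GROUP BY b.parent;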

Other methods may involve running the same query without any WHERE on all, or enough, of the data to take a long time, for example minutes. Then run the query that's meant to fully use the indexes. Very approximately, the smaller query should be faster in inverse proportion to the proportion of rows processed. Other things can be done, such as removing a condition, starting from the end, and seeing if that has the expected proportionate impact on speed.
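
A sketch of that methodology against the earlier example:

# Baseline: force a full run through and time it.
SELECT SUM(tries) FROM b;

# If the (parent, score) index is fully used, this should be faster
# roughly in proportion to the fraction of rows it touches.
SELECT SUM(tries) FROM b WHERE parent = 1 AND score BETWEEN 5 AND 10;

# Drop the trailing condition: if the timing barely moves, score was
# never being served by the index in the first place.
SELECT SUM(tries) FROM b WHERE parent = 1;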

I've seen so many people that are just bad at SQL, to the level that nothing's going to work for them, but even if they do learn SQL so that the problem is no longer them, they'll still be left with a problem: SQL.

You can sometimes bypass SQL to prove that it's SQL that's the problem.

# Warning: GROUP_CONCAT will truncate if the list is too long.
SET @ids = (
 SELECT GROUP_CONCAT(id)
 FROM user
 WHERE email_domain = 'example.com'
);
SET @sql = CONCAT(
 'SELECT SUM(price)',
 ' FROM invoice',
 ' WHERE user IN(', @ids,')',
 ' AND date BETWEEN NOW() - INTERVAL 1 WEEK AND NOW()'
);
PREPARE    stmt FROM @sql;
EXECUTE    stmt;
DEALLOCATE PREPARE stmt;

This isn't identical, but it ends up being much faster, even though I have to do the group by in memory along with some other operations in the real world cases I have had. It is faster because it ensures the use of the index (user, date) rather than only (user). It would be even faster if MySQL did that in the first place. The above SQL is a ridiculous way of doing it.
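
For contrast, and ignoring the group by just mentioned, the straightforward version it replaces would look something like this:

# The plain query, which in the cases I hit would not fully
# use the (user, date) index.
SELECT SUM(price)
 FROM  invoice
 WHERE user IN (SELECT id FROM user WHERE email_domain = 'example.com')
  AND  date BETWEEN NOW() - INTERVAL 1 WEEK AND NOW();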

If there's a lot of data in play, the performance difference is extreme. In this situation it might take some time to notice it. After two weeks the original query is twice as slow. After a year it's 52 times slower. This may be a slow creep that won't be immediately obvious.

Another deception is that even with this performance drain, MySQL is relatively mature and fast in the worst case, particularly being written in C. I've seen a lot of people implement and resort to alternatives to MySQL that don't pan out. When you take badly written SQL and turn it into badly written voodoo SQL that's magically, stupendously fast, MySQL often completely wipes the floor with those specialised solutions, which are often sold on being vastly superior to MySQL in terms of performance.

Every engine has its own range of capabilities, deficiencies and limits, though it's quite clear that the problem with MySQL is not the engine, it's the interface.

It's like ripping the manual controls out of a super car and replacing them with an auto-pilot; when you ask it to take you to a destination it only takes the scenic route, and there's nothing you can do about it other than ask in a way that's a different way of saying the same thing.

This is one of many, many things in SQL that really aren't straightforward when it comes to performance. Over time, hints and other things are added to straighten it out, though they are often semi-cryptic, ill explained and still fail to cover everything.

Many people who annoy me with their "JOINs are bad" rants tend to be the same people who can't hack for loops, especially nested ones, but you look at this and it really does make joins look bad. The problem here isn't the join. It's a failure of SQL in a way that I believe is fairly fundamental. SQL is very strongly declarative. Even if someone works out a better way to translate what we ask for into how to do it for this case, it would be a lot less of a headache if we could instead tell it what to do up front.

SQL implementations often have a kind of quasi-imperative flavour, though if you've ever worked with stored procedures, it's one of the most unpleasant languages to work with and is itself quite limited in a number of respects.
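
To illustrate, here is a minimal sketch of an imperative count written in MySQL's stored procedure language, against the hypothetical user table from earlier. Compare it to the one-line declarative equivalent in the comment:

# Imperative version of:
# SELECT COUNT(*) FROM user WHERE email_domain = 'example.com';
DELIMITER //
CREATE PROCEDURE count_domain()
BEGIN
 DECLARE done INT DEFAULT 0;
 DECLARE total INT DEFAULT 0;
 DECLARE dummy INT;
 DECLARE cur CURSOR FOR SELECT 1 FROM user WHERE email_domain = 'example.com';
 DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
 OPEN cur;
 read_loop: LOOP
  FETCH cur INTO dummy;
  IF done THEN LEAVE read_loop; END IF;
  SET total = total + 1;
 END LOOP;
 CLOSE cur;
 SELECT total;
END //
DELIMITER ;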

You can't always intuit it.

While extending the above set of queries I came across a query that needed to look like this:

SELECT pk1, FLOOR(pk2 / 100) * 100 AS sub, COUNT(*)
 FROM `table`
 GROUP BY pk1, sub
 ORDER BY pk1, sub;

The proof of the pudding for this is:

SELECT SUM(pk2) FROM `table` ORDER BY pk1, pk2;

This takes about 7 seconds for a run through, our expected worst case. Adding additional computation per row has very little impact on this, meaning any result significantly distant from it is likely suboptimal.

In this case the engine stores, and can retrieve, the keys pk1 and pk2 in row order. That makes the ORDER BY redundant, and GROUP BY implementations do not need to sort beforehand; they can, however, be marginally more efficient if they know the results are sorted, which is the case here.

From what I could tell, this wasn't being entirely leveraged by the engine. A powerful SQL parser would be able to determine that the operation on pk2 preserves order. In reality, it may be better to be able to tell SQL that this is the case, though even that wouldn't get the full benefit, as it would need to keep checking that the data is in order during the process.

The performance results for this and the explain:

+---------------------------+-----------+
| Status                    | Duration  |
+---------------------------+-----------+
| starting                  |  0.000117 |
| checking permissions      |  0.000007 |
| Opening tables            |  0.000011 |
| init                      |  0.000017 |
| System lock               |  0.000005 |
| optimizing                |  0.000004 |
| statistics                |  0.000022 |
| preparing                 |  0.000010 |
| Creating tmp table        |  0.000068 |
| Sorting result            |  0.000029 |
| executing                 |  0.000002 |
| Sending data              | 17.837853 |
| converting HEAP to ondisk |  0.289247 |
| Sending data              | 36.206385 |
| Creating sort index       |  0.452780 |
| end                       |  0.000009 |
| query end                 |  0.000006 |
| removing tmp table        |  0.000833 |
| query end                 |  0.000005 |
| closing tables            |  0.000006 |
| freeing items             |  0.000126 |
| cleaning up               |  0.000012 |
+---------------------------+-----------+

type | possible_keys | key  | rows     | filtered | Extra
ALL  | PRIMARY       | NULL | 70837841 |   100.00 | Using temporary; Using filesort

This is clearly doing it wrong; specifically, it must be doing more than it needs to, as it's five times too slow. What if we ask MySQL to do even more?

SELECT pk1, FLOOR(pk2 / 100) * 100 AS sub, COUNT(*)
 FROM `table`
 GROUP BY pk1, sub WITH ROLLUP
 HAVING sub IS NOT NULL;

Gives:

+----------------------+-----------+
| Status               | Duration  |
+----------------------+-----------+
| starting             |  0.000081 |
| checking permissions |  0.000006 |
| Opening tables       |  0.000010 |
| init                 |  0.000016 |
| System lock          |  0.000005 |
| optimizing           |  0.000005 |
| statistics           |  0.000011 |
| preparing            |  0.000012 |
| Sorting result       |  0.000003 |
| executing            |  0.000002 |
| Sending data         |  0.000007 |
| Creating sort index  | 29.302380 |
| end                  |  0.000011 |
| query end            |  0.000008 |
| closing tables       |  0.000007 |
| freeing items        |  0.000140 |
| cleaning up          |  0.000015 |
+----------------------+-----------+

type | possible_keys | key  | rows     | filtered | Extra
ALL  | PRIMARY       | NULL | 70837841 |   100.00 | Using filesort

While it's still not the 10 seconds or less we might expect in the obviously optimal case, it's a curious result. The reason for it is beside the point. I don't believe many people would expect that asking for even more work, work that's unneeded and has to be undone after the fact so that it produces the same result, would make the query perform a third faster.

The original query was doing five times more work than it needed even though we didn't ask for more than we needed, but when we explicitly ask it to do more than it needs, it ends up doing less.

This is what we might call voodoo SQL. It looks badly written, and technically is, but produces a better result. It's voodoo because we haven't declaratively asked it to work in a faster way; instead we've asked it for even more results, which should take more time and be less efficient, as we throw the extra results away.

Normally if I saw SQL like that I'd think the person didn't know SQL. Why not just ask for what you want in a straightforward way?

Normally we'd optimise out the WITH ROLLUP together with the HAVING that discards its rows, as the pair should do nothing. It's the same as how (a - b) + b can be just a.

In this case you might think the programmer is deficient until you run the query and time it.

The engine is where it's at but that's nothing to do with SQL.

Many people say map/reduce. These are two common array functions, each taking a callback, that can be used to perform joins, group bys and similar operations. They're the same as for loops except you provide a function instead, deferring loop iteration to the engine and possibly permitting easy parallelism.

I find most such people don't know what they're talking about the same as people who say just use eventual consistency without really knowing what it entails. To them it's just a magic thing that solves the problem and they don't know how.

Map/reduce is at least an example of using a language other than SQL though I don't think it's always as simple as using those two operations.

There is something to be said for the ease of writing SQL, and it doesn't escape the fact that map/reduce is not the whole kit and caboodle that relational databases often offer. It's not strictly fully imperative either, at least in the sense that map and reduce aren't loops that you fully control.

In my current use case, I've replaced a system that's technically map/reduce and is very good on paper or in principle. In practice the map/reduce solution was not viable and was a huge resource hog.

It's not always like that. The underlying engine can potentially outdo MySQL in certain extreme scenarios, though it might not take thousands, millions or even billions of rows but trillions, or an order of magnitude more, for that to pan out, all while catering to restricted use cases.

The solution I replaced had a similar problem. The underlying engine is neither here nor there, but a layer had been placed over it that, while not SQL, effectively achieved something similar and was absolutely atrocious at mapping declarative commands to imperative, even by the standard of SQL.

Specifically, I've been working on dealing with a Cassandra based system which, while it has an SQL-like interface, is really a kind of fake SQL and doesn't need to be used.

This is an example of something where it really doesn't matter whether it's SQL or not. The point is that the engine is different, intended to cater to specific situations, and certain capabilities require you to implement them yourself against an engine that is, in some respects, stripped down and unassuming.

I've seen some very respectable benchmarks and examples of Cassandra doing very well when properly used for its intended and optimal use case. My real world experience of this however is very different.

A system built on top of Cassandra performed horrifically and was easily outdone by a simple SQL implementation. It only consisted of 1GB of data but the performance was awful. That can't be entirely attributed to Cassandra; the application on top also made certain mistakes.

However, even looking at Cassandra in isolation, sometimes poor implementations on top of it were poor in part because of Cassandra's limitations or sacrifices and in many cases I found Cassandra to be absolutely appalling on its own.

I can't say the setup I inherited was great, for example, a single node Cassandra isn't ideal, but even with that in mind it was not sufficient to explain Cassandra's failings. At least some of them may be down to comparing C to Java, and others to Cassandra still being relatively immature.

An example of this: with 1GB of data, virtually any kind of operation took too long. That amount of data is so small that I would not expect even a single node to struggle with it, even if it had to do a full run through.

I had a situation where, to get any sense of the layout of the data, I had to work out a way to pull random rows at an even distribution, a few thousand of them, just to see what was even in the database. Attempting to use its own dump mechanism (similar to mysqldump) was such a nightmare to get working without it bailing after a minute or so that I ended up making my own (a reflection of its immaturity).

We're only talking 1GB of data. Typically when I'm struggling with MySQL it's because the data is in the several gigabytes, and in the worst case it's merely slow; it doesn't as often break everything.

I found the performance of a single node to be sometimes awful. Despite the dataset being only 1GB, on a machine with 32GB of RAM another task I was trying to run kept stopping short with no error message, and the culprit turned out to be Cassandra taking up huge amounts of memory for some inexplicable reason.

I never went out of my way to soup it up, but out of the box it has some very severe pain points even under a light load.

In my case I needed to swap everything out from Cassandra to MySQL. That also meant everything was implemented twice, for example importing data, which is a heavy write load.

MySQL is awful for single inserts, that's a given, and Cassandra could be seen to do moderately well. However, when I reconsidered my requirements and dropped some of them, something interesting happened.

For example, I did not need the data to be immediately available post write, but rather within a reasonable time frame, so I could easily put a write buffer on top of MySQL (technically speaking a write buffer can overlay reads as well, to make data more immediately available). The same was in place for Cassandra; however, it's more limited in the number of writes it can take at once.
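
The crux of such a buffer is simply batching: collect rows as they arrive and flush them periodically as one multi-row statement, which is where MySQL's per-insert overhead disappears. A sketch with a hypothetical table:

# One round trip and one transaction for many rows instead of one each.
INSERT INTO event (user, date, price)
VALUES (1, '2020-01-01 00:00:00', 9.99),
       (2, '2020-01-01 00:00:01', 4.50),
       (3, '2020-01-01 00:00:02', 12.00);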

In this case MySQL started clocking in close to 100K/s inserts on a single node while visibly leaving Cassandra behind in the dust. Although I haven't looked closely, I vaguely remember seeing examples of million-inserts-a-second setups with Cassandra and quickly dismissing them after seeing they depended on over two hundred large nodes, costing about the same per minute as calling a premium sex line.

This might not be the final verdict, as I've not taken everything into consideration or compared it all (parallel operations, network speed, row size, lower time doesn't always mean lower resource use, machine differences, etc.), but it's a huge deviation from what people might anticipate. Fiddle with a few things and suddenly MySQL is running like the clappers, to the point that it becomes difficult to establish a decisive winner any more.

This is a perfect example of something that isn't really NoSQL but gets lumped in anyway. It's another topic. This one is about my write concerns and what I can afford to drop or solve in my own way rather than MySQL's to gain superior performance.

I don't think MySQL is the be all and end all. I am constantly dreaming up designs for better things or finding missing features (though it gets better over time, for example, at last we have things such as being able to skip locked rows). I use it heavily and am annoyed by its limitations, but at the same time, for all of its weaknesses and defects, I am finding it surprisingly hard to beat despite the rapid growth of alternatives and diversification (heck, there are now even two MySQLs).
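
The skip-locked-rows feature mentioned above is MySQL 8.0's SKIP LOCKED. A sketch with a hypothetical task queue table:

# Each worker claims a different pending row instead of blocking on
# rows another transaction has already locked.
START TRANSACTION;
SELECT id FROM task
 WHERE done = 0
 ORDER BY id
 LIMIT 1
 FOR UPDATE SKIP LOCKED;
# ...process the claimed row, mark it done, then...
COMMIT;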

I have also implemented a few of my own highly specialised database systems, usually very simple and in memory, so I have at least a little insight into what might be truly achievable in terms of performance.

For writes, after I take a datadump, the cp command (plus maybe a flush, etc.) is one way to get an approximation of the absolute maximum write throughput achievable per node. It takes a few seconds to copy a datadump (not compressed or in binary format), which comes in at well over a million rows a second, more like around ten million (though not all rows are equal in size, so take that with a grain of salt).

I am confident that, compared to what Netflix has demonstrated, I could do the same at a tenth or less of the cost with a system on top of MySQL, and at a hundredth if replacing AWS (possibly keeping it only for load spikes), which is extremely expensive and not economic for a lot of use cases. I'm confident because that's what I often do, working for small companies with limited budgets and being naturally frugal.

Cassandra versus MySQL is another subject. Despite the great disappointment of Cassandra, there are use cases where I suspect I would have an easier time with Cassandra than with MySQL, at the very least logistically (typically clustering). There were also cases where Cassandra could do things more easily; for example, sampling random rows was in some ways easier with Cassandra than with MySQL.

It's more important to temper expectations: alternatives are not necessarily drop-in replacements, and there's no replacement for in-depth knowledge, understanding and real world metrics.
