CodingBlocks

83. Search Driven Apps

Jun 11 '18

We’re talking databases, indexes, search engines, and why they’re basically microwaves in this episode while Joe wears a polo, Allen’s quick brown fox jumps over whatever, and Michael gives out fake URLs.

Survey Says

Now that you had some time to digest the news, how do you feel about Microsoft's acquisition of GitHub?

Very excited! Looking forward to the awesome things Microsoft will add to GitHub.
I'm concerned but not enough to do anything about it. Yet.
I don't care. At all. Should I?
OMG the sky is falling. Why? How could we let this happen?
Already packed up my code and moved to GitLab.

vote

News

Thank you to everyone that left us a review:
- iTunes – vr6apparatus, The All Ighty Ollar, BonerGs, MatiasWasAlreadyTaken, galusssss
- Stitcher – Vothis, JavaRuns100BillionDevices, SoulSurvivor, Twentyone
The Cynical Developer (James Studdart) has Sean Martz on the show to talk about transitioning from a sys admin to a developer. (Cynical Developer Episode 79)
In the Orlando area on June 21? Come see Joe’s talk on Building Search Driven Apps with Elasticsearch at the Orlando Backend Developer’s Meetup.
- Sign up for Elasticsearch and Go for the Backend at Meetup
- Source code and Slides over on GitHub
Joe invents a new way to listen to podcasts … by topic.
- Search and listen by topic at https://qit.cloud/
- Source code available on GitHub

Search Driven Apps

What’s the problem? Why do we need search driven apps?

Search is a core tenant of modern computing
- Google, Shopping, Cortana, Spotlight, Contacts
- Can’t use the phone, listen to music, buy underwear or watch TV without searching
In some cases, the number of items we’re searching are small, can get away with a simple LIKE % search
But people aren’t very good at searching, and Google has spoiled us
- Auto-Complete helps with typos
  - What if there’s too much data? Or the user doesn’t know what the thing is called?
  - And what if there’s a lot of data?

Some interesting numbers:

Google servers 40k searches … per second (Ardor Seo)
Amazon sells over 500M products (ScrapeHero)
Splunk indexes 100’s of TB per day
5B videos watched on YouTube every day (MerchDope)

Q: How are these services doing it?
A: Lots and lots of joins, and LIKE clauses?

Relational Databases …

Don’t scale horizontally very well
Are designed for ACID instead of BASE
- ACID
  - From Wikipedia:
    - Atomicity – requires that each transaction be “all or nothing”:
    - Consistency – ensures that any transaction will bring the database from one valid state to another.
    - Isolation – ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed sequentially
    - Durability – ensures that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors
- BASE – Basic Availability, Soft State, Eventual Consistency
In regards to the CAP Theorem, Relational DB have an emphasis on Consistency
CAP Theorem
- From Wikipedia:
  - Consistency – every read receives the most recent write or an error
  - Availability – every request receives a (non-error) response – without guarantee that it contains the most recent write
  - Partition Tolerance – the system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
Shredding is inefficient
Writing dynamic ANSI SQL is the pits

DB vs Search Engine…why not just do it all in SQL?

Why not just turn on full text search in SQL Server?
- Full-Text Search (Microsoft Developer Network)
Elasticsearch Search Reference (Elastic Docs)
Ability to search across all indexes, quickly and easily…
- Fast aggregations (facets)
- Ranged aggregations
- Paging
- Scale
- Migrations / Versions
- Simpler query requests

Ever written a SQL query that started out simple

And ended up a monster?
Start with a LIKE
Add some dynamic JOINs
In clauses lead to temp tables
Paging leads to support for multi-page actions …
CTE, Pivoting, OpenQuery

Writing dynamic SQL stinks.

How Inverted Index Searches solve the problem

So … why don’t we use a NoSQL implementation, denormalize the heck out of it, and then index the heck out of it, then map reduce the heck out of it – Well… you just described a basic search engine!

NoSQL DB: Horizontal Scaling
DeNormalize our data
Index everything!
Map/Reduce to search quickly
Declarative language

A little bit about indexes …

Forward index, and …
Inverted index (Wikipedia)
Let’s consider them both at the same time using a book analogy.
- A table of contents is a _forward index_. It tells you where to go in the book for a “document” (aka a chapter).
- The index you’ve seen in books is an _inverted index_. It tells you where to go in the book for all uses of the relevant words.
- Great answer on Stack Overflow.

But What’s a Reverse Index?

Let’s consider a row in a database the “document” we want.
A typical DB index would be a forward index. In other words, when we want row with ID = 123, we can seek right to it.
- Or put another way, a typical DB index acts like a table of contents for the “document” (i.e. row) we want
Busy databases/indexes can suffer from index block contention, especially those indexes for monotonically increasing sequences. Like numerical based primary keys.
- Consider that the index for a table is stored across several “blocks”. Let’s hypothetically say each block contains 100 keys.
- Let’s also consider our table has a integer primary key that sequentially increases. So we have keys like 123, 124, 125, etc.
- So, 0-99 are in the first block, 100-199 are in the second block, 200-299 are in the third block, etc.
- If 100 write requests happen at the same time, it’s possible they could all queue up to write to the same block.
- Best case, two blocks given that scenario
- This is what a reverse index aims to solve.
A reverse index simply reverses the key
- So for key = 123, it would go in the block as 321 (i.e. in our “300” block) and key = 124 would go in another block (in our “400” block) as 421, etc.
This way reads/writes are distributed.
In the case of 100 concurrent write requests, they would be spread across 10 different indexes given this scenario.

Back to Inverted Indexes

Take the sentence “The Quick Brown Fox Jumps Over the Lazy Dog”

Store the entire document
Throw away the “stop” words (like “the”)
Break the sentence down into tokens for a hash table
Each token contains an array of pointers to records
When the user searches for “lazy fox” we can break that sentence into tokens, and return the records that are pointed to by our indexes
Advanced rules apply: synonyms, antonyms, stemming, pluralization, scoring, etc
Example of synonym: “PWA” and “Progressive Web Apps” or “.Net” and “Dot Net” or “JS” and “JavaScript”
Because of the storage mechanism, counting is easy too – we don’t even need to fetch the document, just count pointers

Inverted Index Search Engines

Slow to Write, Fast to Read (Sort of…there’s near real time on many now)
NoSQL before it was called NoSQL
Scale Super Well
Wicked Fast Reads
Support Complex Filters
Declarative syntax

Popular Search Engines

Elasticsearch (and the Elastic Stack)
- Elasticsearch is open source (GitHub)
- Elasticsearch BV (the company behind Elasticsearch) recently open sourced X-Pack (Elastic blog announcement)
  - X-Pack adds additional functionality to Elasticsearch including security, alerting, monitoring, reporting, graph analytics, dedicated APM UIs, and machine learning.
Apache Lucene
Apache Solr (built on Lucene)
AWS CloudSearch
AWS Elasticsearch Service
Azure Search

Resources We Like

https://github.com/codingblocks/search-driven-apps
The Imposter’s Handbook (bigmachine.io)
37 Mind Blowing YouTube Facts, Figures and Statistics – 2018 (MerchDope)
How many Google searches per day on average in 2018? (Ardor Seo)
How Many Products Does Amazon Sell? – January 2018 (ScrapeHero)
ACID (Wikipedia)
CAP Theorem (Wikipedia)
Inverted Index (Wikipedia)
What’s the difference between an inverted index and a plain old index? (Stack Overflow)
Reverse Index (Wikipedia)
Introduction To Reverse Key Indexes: Part 1 (Richard Foote’s Oracle Blog)
How does database indexing work? (Stack Overflow)
Database Index (Wikipedia)
Elasticsearch from the bottom up (Blog, YouTube)
Stack Overflow / Stack Exchange performance statistics (Stack Exchange)

Tip of the Week

Pattern matching with is (Microsoft Docs)
Use git cherry-pick -n to pick multiple commits into a single new commit. (git-cherry-pick Documentation)
Looking for a Python fiddle? Try https://pyfiddle.io/.
Google’s Python Class (Google for Education)
Execute T-SQL from Visual Studio Code using the mssql extension
- Extension available in the Visual Studio Marketplace
- Source code available on GitHub
Use git bisect to find the commit that broke your code. (git-bisect Documentation)
Build, deploy, and manage your next web project with all of the features of an enterprise on Netlify.

Episode source

DEV Community

CodingBlocks

83. Search Driven Apps

Sponsors

Survey Says

News

Search Driven Apps

What’s the problem? Why do we need search driven apps?

Some interesting numbers:

Relational Databases …

DB vs Search Engine…why not just do it all in SQL?

Ever written a SQL query that started out simple

How Inverted Index Searches solve the problem

A little bit about indexes …

But What’s a Reverse Index?

Back to Inverted Indexes

Inverted Index Search Engines

Popular Search Engines

Resources We Like

Tip of the Week

CodingBlocks Follow

83. Search Driven Apps

Sponsors

Survey Says

News

Search Driven Apps

What’s the problem? Why do we need search driven apps?

Some interesting numbers:

Relational Databases …

DB vs Search Engine…why not just do it all in SQL?

Ever written a SQL query that started out simple

How Inverted Index Searches solve the problem

A little bit about indexes …

But What’s a Reverse Index?

Back to Inverted Indexes

Inverted Index Search Engines

Popular Search Engines

Resources We Like

Tip of the Week

CodingBlocks