Let me know if this sounds familiar...
It starts off so innocently...
After reading the Elasticsearch tutorial, I quickly put together a block of code that sends a simple string and gets back a load of useful data.
Deeper down the rabbit hole
Immediately I noticed the location of the "real" data: it's buried way down there in `hits.hits._source`. So I updated the block of code to extract the useful array from the noisy metadata.
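For illustration, here is a trimmed-down sketch of that extraction; the response envelope mirrors Elasticsearch's shape, but the field values are made up:

```python
# A trimmed-down example of the response envelope Elasticsearch returns.
# Document values here are invented for illustration.
response = {
    "took": 5,
    "timed_out": False,
    "hits": {
        "total": 2,
        "hits": [
            {"_index": "books", "_id": "1", "_source": {"title": "Dune"}},
            {"_index": "books", "_id": "2", "_source": {"title": "Dune Messiah"}},
        ],
    },
}

# The useful data is buried in hits.hits[*]._source
results = [hit["_source"] for hit in response["hits"]["hits"]]
print(results)  # [{'title': 'Dune'}, {'title': 'Dune Messiah'}]
```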
As I continued to develop, subtle searching distinctions started appearing and needed to be handled:
- Handling wildcards or special symbols in the user's search
- Do I use `query_string` or `match`?
- `term` vs `terms`
- `must` vs `should`
And even if the correct search is identified, there are other features that should be part of a real application, like aggregations and highlighting, which lead to more concepts like `post_filter` and `.raw` fields. What was originally quite simple is starting to look more like a hairball.
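To make the hairball concrete, here is a sketch of the kind of query body that accumulates; every field name in it (`title`, `authors`, `genre.raw`) is hypothetical:

```python
# Hypothetical "hairball" query body: full-text search plus aggregations,
# a post_filter, and highlighting. All field names are made up.
body = {
    "query": {
        "query_string": {"query": "king*", "fields": ["title", "authors"]}
    },
    "aggregations": {
        # Aggregate on the not-analyzed .raw variant of the field
        "genres": {"terms": {"field": "genre.raw"}}
    },
    # post_filter narrows the hits without changing the aggregation counts
    "post_filter": {"term": {"genre.raw": "Horror"}},
    "highlight": {"fields": {"title": {}}},
}
```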
I know, I'll create a `QueryBuilder`
Eventually, I wanted to get back to the simplicity that existed at the beginning. The best course of action was to create a module that accepts a few parameters, builds the big query, and processes the results for return. It's a good case for modular development, and I can use tests to verify it works correctly.
Now there are some tough questions to answer:
- In order to test this code, does every developer have access to the same Elasticsearch instance as I do? What about the CI server?
- It's easy to set up a test passing parameters to the module (1), but how do I set up a spy at (2) to verify the hairball?
- Setting up the assertion at (4) is even worse. In the rush to quit thinking about it, I might have written a test that looked like this: `expect(response.hits.hits).toBeDefined()`, which is almost as bad as no test at all. And to be a little pedantic, these aren't even unit tests; they're systems integration tests. The scope is too big for practical development.
Taking a step back
The second or third time I went through this process, I started to think about a better approach. The key is the CI server: I don't want Jenkins, Travis, or CircleCI to hit a live instance, and if the data contains PII, the CI server may not be allowed to access it anyway. I want something that looks like Elasticsearch but is static and deterministic.
Putting on my functional programming hat, it looks like this:
A set of params (the circle) produces a pair of shapes: the request and the response. By enumerating the parameters to be tested, I can produce the pair of shapes for each test. That's a good theory, but how do I put it into practice? I need to figure out how to:
- Enumerate the set of parameters
- Store the request/response pairs
- Capture the request that goes with each set of parameters
- Capture the response
- Load the pairs into the tests
- Automate this process
- Detect when Elasticsearch changes
That's a lot, so let's get started:
Enumerating the set of parameters
If time and money were in infinite supply, we could exhaustively test every combination of parameters to ensure all possibilities are covered. Since we don't live in that reality, code coverage metrics are the next best thing. The first step is to create tests that cover the percentage of code you feel comfortable with. Don't worry about good assertions at this point; we are just looking to make sure all code paths are hit.
Preparing to store the request/response pairs
I use the following directory hierarchy to guide where each request/response pair goes:
+- <API root>
   +- <module>
      +- __mocks__
         +- <index-name>
            +- <endpoint>
| Directory | Description | Examples |
|---|---|---|
| API root | The root of the application | `bookstore` |
| module | A REST endpoint | `books`, `orders` |
| `__mocks__` | A convention from Jest I liked | -- |
| index-name | The Elasticsearch index | `sales` |
| endpoint | The Elasticsearch endpoint | `_search`, `_suggest` |
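With this layout, a pair named `authors=King` for the `books` module, `sales` index, and `_search` endpoint (assuming `bookstore` as the API root, and the `_req`/`_resp` suffixes used by the Fake below) lands at:

```
bookstore/books/__mocks__/sales/_search/authors=King_req.json
bookstore/books/__mocks__/sales/_search/authors=King_resp.json
```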
Create a Fake to encapsulate the behavior
With the directory structure defined, I'll create a class that hides some of the gory details. This should be located at the root of the project.
```python
import json
import os


class FakeElasticsearch(object):
    def __init__(self,
                 short_name,
                 subdir,
                 index_name,
                 endpoint='_search'):
        self.short_name = short_name
        self.path = os.path.join(
            os.path.dirname(__file__),
            subdir,
            "__mocks__",
            index_name,
            endpoint
        )

    def build_path(self, suffix):
        return os.path.join(
            self.path, self.short_name + suffix
        )

    def load_request(self):
        file_name = self.build_path('_req.json')
        with open(file_name, 'r') as f:
            return json.load(f)

    def load_response(self):
        file_name = self.build_path('_resp.json')
        with open(file_name, 'r') as f:
            return json.load(f)

    def save_request(self, body):
        file_name = self.build_path('_req.json')
        with open(file_name, 'w') as f:
            json.dump(body, f, indent=2)
```
Capturing the existing request (Method #1 - Server-side)
Next step is finding the code that calls Elasticsearch, and temporarily inserting a few lines that save off the request.
```python
@api_view(['GET'])
def search(request):
    short_name = request.query_params
    ...
    if settings.SAVE_REQUESTS:
        fake = FakeElasticsearch(short_name, 'books', index_name)
        fake.save_request(body)
    resp = es.search(index=index_name, body=body)
    ...
```
Capturing the existing request (Method #2 - Browser-based)
When the code that builds the query and calls Elasticsearch runs inside a browser, capturing the request gets a little more complex. Basically, there are a few options:

1. Use the **Network** tab in developer tools to manually copy the request object and paste it into a JSON file in the correct location. This is a good option to start with, since it requires no extra code, but it quickly gets unwieldy.
2. Use the "slow log" feature of Elasticsearch to capture the queries it is receiving. Then search through the logs and save off the useful requests to a JSON file.
3. Quickly create an API endpoint and post the request there. This has all the advantages of the server-side version but does incur additional development time.
Automatically capture the response
I let `curl` do the grunt work of hitting the Elasticsearch instance and capturing the response. The raw response contains some metadata that will differ between runs, like `took`, so it needs to be stripped out. I used `jq`, but `awk` or `sed` would work just as well.
Also, if you are a beginner with *nix like I am, the construct `${REQ%_req.*}` looks like it does nothing, but it is shell string manipulation and very handy to know.
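For example, with one of the captured file names (the path is made up), `%_req.*` strips the shortest trailing match of `_req.*`:

```shell
REQ="./books/__mocks__/sales/_search/authors=King_req.json"

# ${REQ%_req.*} removes the shortest suffix matching "_req.*"
PREFIX="${REQ%_req.*}"
echo "$PREFIX"               # ./books/__mocks__/sales/_search/authors=King

# ...which makes it easy to derive the matching response file name
echo "${PREFIX}_resp.json"   # ./books/__mocks__/sales/_search/authors=King_resp.json
```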
```shell
URL="http://www.example.org:9200"

for REQ in $(find . -name "*_req.json")
do
    PREFIX="${REQ%_req.*}"
    RESPONSE="${PREFIX}_resp.json"
    INDEX="$(echo "$PREFIX" | cut -d "/" -f4)"
    ENDPOINT="$(echo "$PREFIX" | cut -d "/" -f5)"
    if [ "$ENDPOINT" = "_suggest" ]
    then
        JQ_PROCESS="{sgg}"
    else
        JQ_PROCESS="{hits, aggregations} | with_entries(select(.value != null))"
    fi
    echo "Processing $REQ"
    curl -s "$URL/$INDEX/$ENDPOINT?pretty" -d @"$REQ" | jq "$JQ_PROCESS" > "$RESPONSE"
done
```
Load into the tests
Once the request and response pairs have been captured, they can be integrated into the unit tests using the test fake to stand in for Elasticsearch.
```python
class BookstoreTest(APITestCase):
    ...
    @patch.object(Elasticsearch, 'search')
    def test_authors_search(self, mock_search):
        index_name = 'sales'
        # (1)
        fake = FakeElasticsearch('authors=King', 'books', index_name)
        # (2)
        body = fake.load_request()
        resp = fake.load_response()
        # (5)
        mock_search.return_value = resp
        # (3)
        response = self.client.get(self.end_point, self.baseParams)
        actual = json.loads(response.content)
        self.assertEqual(response.status_code, 200)
        # (4)
        mock_search.assert_called_with(index=index_name, body=body)
        # (6)
        expected = resp['hits']['hits']
        self.assertEqual(expected, actual)
```
Detect when Elasticsearch changes
The final piece of the puzzle comes from Martin Fowler: a contract test. The idea is to run the saved responses against a live instance to make sure the assumed responses are still accurate. If it fails, maybe it's time to regenerate the requests and fix your code!
```python
import unittest

from django.conf import settings
from elasticsearch import Elasticsearch
from nose_parameterized import parameterized

from fake_es import FakeElasticsearch


class TestElasticsearchContract(unittest.TestCase):
    @unittest.skipUnless(settings.LIVE_ELASTIC, "requires a live Elasticsearch")
    def setUp(self):
        self.client = Elasticsearch([settings.ES_HOST])

    @parameterized.expand([
        ['authors=King', 'sales'],
        ['orders=cancelled', 'sales'],
        ['isbn=99999', 'books'],
    ])
    def test_request_response(self, pair_name, index_name):
        # (1)
        fake = FakeElasticsearch(pair_name, 'books', index_name)
        # (2)
        body = fake.load_request()
        expected = fake.load_response()
        # (3)
        actual = self.client.search(index=index_name, body=body)
        # (4)
        # <Elasticsearch returns the results of the query>
        # (5)
        self.assertDictContainsSubset(expected, actual)
```
Finally
It should be noted that the above technique can be applied to any external API and not just Elasticsearch.
Update March 31st, 2018: This blog post is also available as a Google Slides presentation