HTTPX is a modern HTTP client library for Python. Its interface is similar to the old standby Requests, but it supports asynchronous HTTP requests, using Python's asyncio library (or trio). In other words, while your program is waiting for an HTTP request to finish, other work does not need to be blocked.
In Part 1, we built a simple Wikipedia search tool using Python and HTTPX. Even though HTTPX was used, the tool was only synchronous. In other words, each HTTP request was sent sequentially, and each subsequent request started only after the previous one had completed. A lot of waiting in line.
Now, let's do what HTTPX is good for: asynchronous HTTP requests.
async and await
Python's asyncio allows tasks to collaborate. When a task is busy waiting on input/output, it can give other tasks room to do their business.
To designate such a function, precede it with the async keyword. To call such a function, precede the call with the await keyword.
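A minimal sketch, just to show the two keywords on their own (the function name and URL here are made up for illustration, not part of our tool):

import asyncio

import httpx


async def fetch_status(url):
    # `async def` marks this as a coroutine function
    async with httpx.AsyncClient() as client:
        # `await` pauses here, letting the event loop run other tasks
        response = await client.get(url)
    return response.status_code


# a coroutine has to be awaited, or handed to the event loop (more on this below)
print(asyncio.run(fetch_status("https://en.wikipedia.org")))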
We can create another Python module (a file), src/pypedia/asynchronous.py, with the following code that uses async and await. It is nearly the same as the code from Part 1, with a few differences. Feel free to compare the two.
"""Proof-of-concept asynchronous Wikipedia search tool."""
import asyncio
import logging
import time
import httpx
EMAIL = "your_email@provider" # or Github URL or other identifier
USER_AGENT = {"user-agent": f"pypedia/0.1.0 ({EMAIL})"}
logging.basicConfig(filename="asyncpedia.log", filemode="w", level=logging.INFO)
LOG = logging.getLogger("asyncio")
async def search(query, limit=100, client=None):
"""Search Wikipedia, returning a JSON list of pages."""
if client:
close_client = False
else:
client = httpx.AsyncClient()
close_client = True
LOG.info(f"Start query '{query}': {time.strftime('%X')}")
url = "https://en.wikipedia.org/w/rest.php/v1/search/page"
params = {"q": query, "limit": limit}
response = await client.get(url, params=params)
if close_client:
await client.aclose()
LOG.info(f"End query '{query}': {time.strftime('%X')}")
return response
async def list_articles(queries):
"""Execute several Wikipedia searches."""
async with httpx.AsyncClient(headers=USER_AGENT) as client:
tasks = [search(query, client=client) for query in queries]
responses = await asyncio.gather(*tasks)
results = (response.json()["pages"] for response in responses)
return dict(zip(queries, results))
def run():
queries = [
"linksto:Python_(programming_language)",
"incategory:Computer_programming",
"incategory:Programming_languages",
"incategory:Python_(programming_language)",
"incategory:Python_web_frameworks",
"incategory:Python_implementations",
"incategory:Programming_languages_created_in_1991",
"incategory:Computer_programming_stubs",
]
results = asyncio.run(list_articles(queries))
for query, articles in results.items():
print(f"\n*** {query} ***")
for article in articles:
print(f"{article['title']}: {article['excerpt']}")
Note the use of httpx.AsyncClient rather than httpx.Client, in both list_articles() and in search().

In list_articles(), the client is used in a context manager. Because this is asynchronous, the context manager uses async with, not just with.

In search(), if the client is not specified, it is instantiated not with the context manager but with client = httpx.AsyncClient(). When using this method, the responsibility is on us to close the client with await client.aclose(). Bad news if we forget to do this.
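If we wanted extra insurance on that manual-close path, the request could be wrapped in try/finally so the client gets closed even when the request raises an exception. A sketch of that variation (not the version used in our module above):

async def search_safe(query, limit=100):
    # create our own client, and guarantee it is closed no matter what
    client = httpx.AsyncClient(headers=USER_AGENT)
    try:
        url = "https://en.wikipedia.org/w/rest.php/v1/search/page"
        response = await client.get(url, params={"q": query, "limit": limit})
    finally:
        await client.aclose()
    return response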
Our two primary functions have been preceded by the async keyword to indicate that they are async-friendly. In other words, they are willing to share control of the event loop when twiddling their thumbs.

If there were a need to call search() individually, we could do so with await search().
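For instance, a small wrapper like this hypothetical one would do (the function name is made up for illustration):

async def first_titles(query):
    # await a single search() and pull the article titles out of the JSON
    response = await search(query)
    return [page["title"] for page in response.json()["pages"]]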
However, in this case, we need to concurrently run several calls to search().
asyncio.gather()
The list_articles() function calls the awaitable search() function using asyncio.gather(). This will create tasks for the event loop and run them concurrently.

Conveniently, asyncio.gather() returns a list of each task's return values, in the exact order the functions were passed in.
Note: put await before asyncio.gather(), but do not put await before the functions passed to it. The awaiting of each call will be handled by asyncio.gather().
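To make that concrete, here is a small sketch (the function and variable names are just for illustration); note that the two search() calls themselves are not awaited, only the gather():

async def two_searches():
    # results come back in the order the awaitables were passed in,
    # even if the second request happens to finish first
    frameworks, stubs = await asyncio.gather(
        search("incategory:Python_web_frameworks"),
        search("incategory:Computer_programming_stubs"),
    )
    return frameworks.json()["pages"], stubs.json()["pages"]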
Event loop
I have already mentioned the event loop a couple of times. I think of the event loop as the task runner (there should be only one) for asyncio applications. It handles the tasks.
Instantiating the event loop is done from the only non-awaitable function in our script. I named the function run(), coincidentally, and it calls the high-level function asyncio.run().

Put another way, a synchronous function cannot await an asynchronous function. But it can run it with asyncio.run().
This creates a new event loop that then handles the various awaitable tasks, and returns the result of the called awaitable function.
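In miniature, the relationship looks like this (a sketch, not new functionality; main() here is hypothetical):

def main():
    # a plain (synchronous) function cannot `await list_articles(...)`;
    # instead it hands the coroutine to asyncio.run(), which creates the
    # event loop, runs the coroutine to completion, and returns its result
    return asyncio.run(list_articles(["incategory:Python_implementations"]))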
Enable the command runner
Our run() function executes whatever we want to have executed when called as a script. In this case, it creates a list of search terms, then sends the list to list_articles(), then parses and prints the result.

With Poetry, the entry point for a script is defined in pyproject.toml, so we add this to that file. Assuming you already have the synchronous syncpedia defined, that section should now look like this:
[tool.poetry.scripts]
asyncpedia = "pypedia.asynchronous:run"
syncpedia = "pypedia.synchronous:run"
So, the script asyncpedia will call the run function of the asynchronous submodule of the package pypedia. And, as already defined, the script syncpedia will call the run function of the synchronous submodule of the package pypedia.
Try it out:
poetry run asyncpedia
Assuming all works well, titles and excerpts of many Wikipedia articles should scroll by.
Performance benefits of async
Unlike the script from Part 1, the calls to the Wikipedia API now happen concurrently. While one request is waiting for Wikipedia to respond, it can yield control of the event loop to the others. This can be seen in the log file.
$ cat asyncpedia.log
INFO:asyncio:Start query 'linksto:Python_(programming_language)': 06:03:39
INFO:asyncio:Start query 'incategory:Computer_programming': 06:03:39
INFO:asyncio:Start query 'incategory:Programming_languages': 06:03:39
INFO:asyncio:Start query 'incategory:Python_(programming_language)': 06:03:39
INFO:asyncio:Start query 'incategory:Python_web_frameworks': 06:03:39
INFO:asyncio:Start query 'incategory:Python_implementations': 06:03:39
INFO:asyncio:Start query 'incategory:Programming_languages_created_in_1991': 06:03:39
INFO:asyncio:Start query 'incategory:Computer_programming_stubs': 06:03:39
INFO:asyncio:End query 'incategory:Python_implementations': 06:03:39
INFO:asyncio:End query 'incategory:Python_(programming_language)': 06:03:39
INFO:asyncio:End query 'incategory:Programming_languages_created_in_1991': 06:03:39
INFO:asyncio:End query 'incategory:Python_web_frameworks': 06:03:39
INFO:asyncio:End query 'incategory:Computer_programming_stubs': 06:03:39
INFO:asyncio:End query 'incategory:Computer_programming': 06:03:40
INFO:asyncio:End query 'linksto:Python_(programming_language)': 06:03:40
INFO:asyncio:End query 'incategory:Programming_languages': 06:03:40
Note that the start and end entries are no longer strictly sequential (or, perhaps, predictable): all the queries start together, and they finish in whatever order Wikipedia responds.
On my machine, the synchronous version completes in about 7 seconds, while this asynchronous version only takes around 2 seconds to complete.
That is a performance improvement!
Success isn't really success, though, until we have repeatable tests, which we will build in the next article.