Welcome to the first post in LETSQL's tutorial series!
In this blog post, we take a detour from our usual theme of data pipelines to demonstrate how to create and publish a Python package with Poetry, using DataFusion as an example.
Introduction
Harlequin is a TUI client for SQL databases known for its light-weight extensive support for SQL databases. It is a versatile tool for data exploration and analysis workflows. Harlequin provides an interactive SQL editor with features like autocomplete, syntax highlighting, and query history. It also has a results viewer that can display large result sets. However, Harlequin did not have a DataFusion adapter before. Thankfully, it was really easy to add one.
In this post, We'll demonstrate these concepts by building a Harlequin adapter for DataFusion. And, by way of doing so, we will also cover Poetry's essential features, project setup, and the steps to publish your package on PyPI.
To get the most out of this guide, you should have a basic understanding of virtual environments, Python packages and modules, and pip.
Our objectives are to:
- Introduce Poetry and its advantages
- Set up a project using Poetry
- Develop a Harlequin adapter for DataFusion
- Prepare and publish the package to PyPI
By the end, you'll have practical experience with Poetry and an understanding of modern Python package management.
The code implemented in this post is available on GitHub and available in PyPI.
Harlequin
Harlequin is a SQL IDE that runs in the terminal. It provides a powerful and feature-rich alternative to traditional command-line database tools, making it versatile for data exploration and analysis workflows.
Some key things to know about Harlequin:
- Harlequin supports multiple database adapters, connecting you to DuckDB, SQLite, PostgreSQL, MySQL, and more.
- Harlequin provides an interactive SQL editor with features like autocomplete, syntax highlighting, and query history. It also has a results viewer that can display large result sets.
- Harlequin replaces traditional terminal-based database tools with a more powerful and user-friendly interface.
- Harlequin uses adapter plug-ins as a generic interface to any database.
DataFusion
DataFusion is a fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format.
DataFusion offers SQL and Dataframe APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.
It ships with it its own CLI, more information can be found here.
Poetry
Poetry is a modern, feature-rich tool that streamlines dependency management and packaging for Python projects, making development more deterministic and efficient.
From the documentation:
Poetry is a tool for dependency management and packaging in Python. It allows you to declare the libraries your project depends on, and it will manage (install/update) them for you.
Poetry offers a lockfile to ensure repeatable installs and can build your project for distribution.
Creating New Adapters for Harlequin
A Harlequin adapter is a Python package that allows Harlequin to work with a database system.
An adapter is a Python package that declares an entry point in the harlequin.adapters group. That entry point should reference a subclass of the
HarlequinAdapter
abstract base class.
This allows Harlequin to discover installed adapters and instantiate a selected adapter at run-time
In addition to the HarlequinAdapter
class, the package must also provide implementations for HarlequinConnection
, and HarlequinCursor
. A more detailed description can be found on this
guide.
The Harlequin Adapter Template
The first step for developing a Harlequin adapter is to generate a new repo from the existing harlequin-adapter-template
GitHub templates are repositories that serve as starting points for new projects. They provide pre-configured files, structures, and settings that are copied to new repositories, allowing for quick project setup without the overhead of forking.
This feature streamlines the process of creating consistent, well-structured projects based on established patterns.
The harlequin-adapter-template
comes with a poetry.lock
file and a pyproject.toml
file, in addition to some boilerplate code for defining the required classes.
Coding the Adapter
Let's explore the essential files needed for package distribution before we get into the specifics of coding.
Package configuration
The pyproject.toml
file is now the standard for configuring Python packages for publication and other tools. Introduced in PEP 518 and PEP 621, this TOML-formatted file consolidates multiple configuration files into one. It enhances dependency management by making it more robust and standardized.
Poetry, utilizes pyproject.toml
to handle the project's virtual environment, resolve dependencies, and create packages.
The pyproject.toml
of the template is as follows:
[tool.poetry]
name = "harlequin-myadapter"
version = "0.1.0"
description = "A Harlequin adapter for <my favorite database>."
authors = ["Ted Conbeer <tconbeer@users.noreply.github.com>"]
license = "MIT"
readme = "README.md"
packages = [
{ include = "harlequin_myadapter", from = "src" },
]
[tool.poetry.plugins."harlequin.adapter"]
my-adapter = "harlequin_myadapter:MyAdapter"
[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
harlequin = "^1.7"
[tool.poetry.group.dev.dependencies]
ruff = "^0.1.6"
pytest = "^7.4.3"
mypy = "^1.7.0"
pre-commit = "^3.5.0"
importlib_metadata = { version = ">=4.6.0", python = "<3.10.0" }
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
As it can be seen:
The
[tool.poetry]
section of thepyproject.toml
file is where you define the metadata for your Python package, such as the name, version, description, authors, etc.The
[tool.poetry.dependencies]
subsection is where you declare the runtime dependencies your project requires. Runningpoetry add <package>
will automatically update this section.The
[tool.poetry.dev-dependencies]
subsection is where you declare development-only dependencies, like testing frameworks, linters, etc.The
[build-system]
section is used to store build-related data. In this case, it specifies thebuild-backend
as"poetry.core.masonry.api"
. In a narrow sense, the core responsibility of a
build-backend is to build wheels and sdist.
The repository also includes a poetry.lock
file, a Poetry-specific component generated by running poetry install
or poetry update
. This lock file specifies the exact versions of all dependencies and sub-dependencies for your project, ensuring reproducible installations across different environments.
It's crucial to avoid manual edits to the poetry.lock
file, as this can cause inconsistencies and installation issues. Instead, make changes to your pyproject.toml
file and allow Poetry to automatically update the lock file by running poetry lock
.
Getting Poetry
Per Poetry's installation warning
::: {.warning}
Poetry should always be installed in a dedicated virtual environment to isolate it from the rest of your system. It should in no case be installed in the environment of the project that is to be managed by Poetry.
:::
Here we will presume you have access to Poetry by running pipx install poetry
Developing in the virtual environment
With our file structure clarified, let's begin the development process by setting up our environment. Since our project already includes pyproject.toml
and poetry.lock
files, we can initiate our environment using the poetry shell
command.
This command activates the virtual environment linked to the current Poetry project, ensuring all subsequent operations occur within the project's dependency context. If no virtual environment exists, poetry shell
automatically creates and activates one.
poetry shell
detects your current shell and launches a new instance within the virtual environment. As Poetry centralizes virtual environments by default, this command eliminates the need to locate or recall the specific path to the activate script.
To verify which Python environment is currently in use with Poetry, you can use the following commands:
poetry env list --full-path
This will show all the virtual environments associated with your project and indicate which one is currently active.
As an alternative, you can get the full path of only the current environment:
poetry env info -p
With the environment activated, use poetry install
to install the required dependencies. The command works as follows
- If a
poetry.lock
file is present,poetry install
will use the exact versions specified in that file rather than resolving the dependencies dynamically. This ensures consistent, repeatable installations across different environments. i. If you runpoetry install
and it doesn't seem to be progressing, you may need to runexport PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring
in the shell you're installing in - Otherwise, it reads the
pyproject.toml
file in the current project, resolves the dependencies listed there, and installs them. - If no
poetry.lock
file exists,poetry install
will create one after resolving the dependencies, otherwise it will update the existing one.
To complete the environment setup, we need to add the datafusion
library to our dependencies. Execute the following command:
poetry add datafusion
This command updates your pyproject.toml
file with the datafusion
package and installs it. If you don't specify a version, Poetry will automatically select an appropriate one based on available package versions.
Implementing the Interfaces
To create a Harlequin Adapter, you need to implement three interfaces defined as abstract classes in the harlequin.adapter
module.
The first one is the HarlequinAdapter
.
#| eval: false
#| code-fold: false
#| code-summary: implementation of HarlequinAdapter
class DataFusionAdapter(HarlequinAdapter):
def __init__(self, conn_str: Sequence[str], **options: Any) -> None:
self.conn_str = conn_str
self.options = options
def connect(self) -> DataFusionConnection:
conn = DataFusionConnection(self.conn_str, self.options)
return conn
The second one is the HarlequinConnection
, particularly the methods execute
and get_catalog
.
#| eval: false
#| code-fold: false
#| code-summary: implementation of execution of HarlequinConnection
def execute(self, query: str) -> HarlequinCursor | None:
try:
cur = self.conn.sql(query) # type: ignore
if str(cur.logical_plan()) == "EmptyRelation":
return None
except Exception as e:
raise HarlequinQueryError(
msg=str(e),
title="Harlequin encountered an error while executing your query.",
) from e
else:
if cur is not None:
return DataFusionCursor(cur)
else:
return None
For brevity, we've omitted the implementation of the get_catalog
function. You can find the full code in the adapter.py
file within our GitHub repository.
Finally, a HarlequinCursor
implementation must be provided as well:
#| eval: false
#| code-fold: false
#| code-summary: implementation of HarlequinCursor
class DataFusionCursor(HarlequinCursor):
def __init__(self, *args: Any, **kwargs: Any) -> None:
self.cur = args[0]
self._limit: int | None = None
def columns(self) -> list[tuple[str, str]]:
return [
(field.name, _mapping.get(field.type, "?")) for field in self.cur.schema()
]
def set_limit(self, limit: int) -> DataFusionCursor:
self._limit = limit
return self
def fetchall(self) -> AutoBackendType:
try:
if self._limit is None:
return self.cur.to_arrow_table()
else:
return self.cur.limit(self._limit).to_arrow_table()
except Exception as e:
raise HarlequinQueryError(
msg=str(e),
title="Harlequin encountered an error while executing your query.",
) from e
Making the plugin discoverable
Your adapter must register an entry point in the harlequin.adapters
group using the packaging software you use to build your project.
If you use Poetry, you can define the entry point in your pyproject.toml
file:
[tool.poetry.plugins."harlequin.adapter"]
datafusion = "harlequin_datafusion:DataFusionAdapter"
An entry point is a mechanism for code to advertise components it provides to be discovered and used by other code.
Notice that registering a plugin with Poetry is equivalent to the following pyproject.toml
specification for entry points:
[project.entry-points."harlequin.adapter"]
datafusion = "harlequin_datafusion:DataFusionAdapter"
Testing
The template provides a set of pre-configured tests, some of which are applicable to DataFusion while others may not be relevant. One test that's pretty cool checks if the plugin can be discovered, which is crucial for ensuring proper integration:
#| eval: false
#| code-fold: false
if sys.version_info < (3, 10):
from importlib_metadata import entry_points
else:
from importlib.metadata import entry_points
def test_plugin_discovery() -> None:
PLUGIN_NAME = "datafusion"
eps = entry_points(group="harlequin.adapter")
assert eps[PLUGIN_NAME]
adapter_cls = eps[PLUGIN_NAME].load()
assert issubclass(adapter_cls, HarlequinAdapter)
assert adapter_cls == DataFusionAdapter
To make sure the tests are passing, run:
poetry run pytest
The run command executes the given command inside the project’s virtualenv.
Building and Publishing to PyPI
With the tests passing, we're nearly ready to publish our project. Let's enhance our pyproject.toml
file to make our package more discoverable and appealing on PyPI. We'll add key metadata including:
- A link to the GitHub repository
- A path to the README file
- A list of relevant classifiers
These additions will help potential users find and understand our package more easily.
classifiers = [
"Development Status :: 3 - Alpha",
"Intended Audience :: Developers",
"Topic :: Software Development :: User Interfaces",
"Topic :: Database :: Database Engines/Servers",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python :: Implementation :: CPython"
]
readme = "README.md"
repository = "https://github.com/mesejo/datafusion-adapter"
For reference:
- The complete list of classifiers is available on PyPI's website.
- For a detailed guide on writing
pyproject.toml
, check out this resource. - The formal, technical specification for
pyproject.toml
can be found on packaging.python.org.
Building
We're now ready to build our library and verify its functionality by installing it in a clean virtual environment. Let's start with the build process:
poetry build
This command will create distribution packages (both source and wheel) in the dist
directory.
The wheel file should have a name like harlequin_datafusion-0.1.1-py3-none-any.whl
. This follows the standard naming convention:
-
harlequin_datafusion
is the package (or distribution) name -
0.1.1
is the version number -
py3
indicates it's compatible with Python 3 -
none
compatible with any CPU architecture -
any
with no ABI (pure Python)
To test the installation, create a new directory called test_install
. Then, set up a fresh virtual environment with the following command:
python -m venv .venv
To activate the virtual environment on MacOS or Linux:
source .venv/bin/activate
After running this command, you should see the name of your virtual environment (.venv)
prepended to your command prompt, indicating that the virtual environment is now active.
To install the wheel file we just built, use pip as follows:
pip install /path/to/harlequin_datafusion-0.1.1-py3-none-any.whl
Replace /path/to/harlequin_datafusion-0.1.1-py3-none-any.whl
with the actual path to the wheel file you want to install.
If everything works fined, you should see some dependencies installed, and you should be able to do:
harlequin -a datafusion
Congrats! You have built a Python library. Now it is time to share it with the world.
Publishing to PyPI
The best practice before publishing to PyPI is to actually publish to the Test Python Package Index (TestPyPI)
To publish a package to TestPyPI using Poetry, follow these steps:
Create an account at TestPyPI if you haven't already.
Generate an API token on your TestPyPI account page.
-
Register the TestPyPI repository with Poetry by running:
poetry config repositories.test-pypi https://test.pypi.org/legacy/
-
To publish your package, run:
poetry publish -r testpypi --username __token__ --password <token>
Replace <token>
with the actual token value you generated in step 2. To verify the publishing process, use the following command:
python -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple <package-name>
This command uses two key arguments:
-
--index-url
: Directs pip to find your package on TestPyPI. -
--extra-index-url
: Allows pip to fetch any dependencies from the main PyPI repository.
Replace <package-name>
with your specific package name (e.g., harlequin-datafusion
if following this post). For additional details, consult the information provided in this post.
To publish to the actual Python Package Index (PyPI) instead:
Create an account at https://pypi.org/ if you haven't already.
Generate an API token on your PyPI account page.
-
Run:
poetry publish --username __token__ --password <token>
The default repository is PyPI, so there's no need to specify it.
Is worth noting that Poetry only supports the Legacy Upload API when publishing your project.
Automated Publishing on GitHub release
Manually publishing each time is repetitive and error-prone, so to fix this problem, let us create a GitHub Action to
publish each time we create a release.
Here are the key steps to publish a Python package to PyPI using GitHub Actions and Poetry:
Set up PyPI authentication: You must provide your PyPI credentials (the API token) as GitHub secrets so the GitHub Actions workflow can access them. Name these secrets something like
PYPI_TOKEN
.Create a GitHub Actions workflow file: In your project's
.github/workflows
directory, create a new file likepublish.yml
with the following content:
name: Build and publish python package
on:
release:
types: [ published ]
jobs:
publish-package:
runs-on: ubuntu-latest
permissions:
contents: write
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: '3.10'
- name: Install Poetry
uses: snok/install-poetry@v1
- run: poetry config pypi-token.pypi "${{ secrets.PYPI_TOKEN }}"
- name: Publish package
run: poetry publish --build --username __token__
The key is to leverage GitHub Actions to automate the publishing process and use Poetry to manage your package's dependencies and metadata.
Conclusion
Poetry is a user-friendly Python package management tool that simplifies project setup and publication. Its intuitive command-line interface streamlines environment management and dependency installation. It supports plugin development, integrates with other tools, and emphasizes testing for robust code. With straightforward commands for building and publishing packages, Poetry makes it easier for developers to share their work with the Python community.
At LETSQL, we're committed to contributing to the developer community. We hope this blog post serves as a straightforward guide to developing and publishing Python packages, emphasizing best practices and providing valuable resources.
To subscribe to our newsletter, visit letsql.com.
Future Work
As we continue to refine the adapter, we would like to provide better autocompletion and direct reading from files (parquet, csv) as in the DataFusion-cli. This requires a tighter integration with the Rust library without going through the Python bindings.
Your thoughts and feedback are invaluable as we navigate this journey. Share your experiences, questions, or suggestions in the comments below or on our community forum. Let's redefine the boundaries of data science and machine learning integration.
Acknowledgements
Thanks to Dan Lovell and Hussain Sultan for the comments and the thorough review.
Top comments (0)