Amundsen is an advanced data discovery and metadata engine designed to boost the productivity of data analysts, data scientists, and engineers when interacting with data.
It achieves this by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank style search based on usage patterns (e.g., highly queried tables show up earlier than less queried tables).
In simple terms, it's like Google search for data. The project is named after Norwegian explorer Roald Amundsen, the first person to reach the South Pole.
Amundsen is hosted by the LF AI & Data Foundation. It includes three microservices, one data ingestion library, and one common library:
| Component | Description |
|---|---|
| amundsenfrontendlibrary | A Flask application with a React frontend. |
| amundsensearchlibrary | A search service that leverages Elasticsearch for search capabilities. |
| amundsenmetadatalibrary | A metadata service that leverages Neo4j or Apache Atlas as the persistent layer to provide various metadata. |
| amundsendatabuilder | A data ingestion library for building the metadata graph and search index. |
| amundsencommon | A common library that holds code shared among Amundsen's microservices. |
| amundsengremlin | A library that holds code for converting model objects into vertices and edges in Gremlin, used for loading data into an AWS Neptune backend. |
| amundsenrds | Contains ORM models to support a relational database as the metadata backend store in Amundsen. |
Check out their GitHub for more information.
Search for data within your organization with a simple text search. A PageRank-inspired search algorithm recommends results based on names, descriptions, tags, and querying/viewing activity on the table/dashboard.
Build trust in data using automated and curated metadata: descriptions of tables and columns, other frequent users, when the table was last updated, statistics, a preview of the data if permitted, etc. Easily triage by linking the ETL job and code that generated the data.
Update tables and columns with descriptions to reduce unnecessary back-and-forth about which table to use and what a column contains.
See what data your co-workers frequently use, own, or have bookmarked. Learn what the most common queries for a table look like by seeing dashboards built on a given table.
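To make the idea of usage-weighted search a bit more concrete, here's a toy sketch in Python. It is not Amundsen's actual ranking algorithm: the `Table` class, the weights, and the `query_count` signal are all hypothetical, chosen only to show how a text match can be boosted by how often a table is queried.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Table:
    name: str
    description: str
    tags: Tuple[str, ...]
    query_count: int  # hypothetical usage signal, e.g. queries in the last 30 days


def score(table: Table, term: str) -> float:
    """Combine a crude text-match score with a damped usage boost."""
    term = term.lower()
    text_score = 0.0
    if term in table.name.lower():
        text_score += 3.0  # name matches weigh the most (arbitrary weight)
    if any(term in tag.lower() for tag in table.tags):
        text_score += 2.0
    if term in table.description.lower():
        text_score += 1.0
    if text_score == 0.0:
        return 0.0  # usage alone should not surface an unrelated table
    # log1p damps the boost so heavy usage helps without drowning out relevance
    return text_score * (1.0 + math.log1p(table.query_count))


def search(tables: List[Table], term: str) -> List[str]:
    """Return the names of matching tables, best score first."""
    scored = [(score(t, term), t.name) for t in tables]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [name for _, name in ranked]


# Made-up catalog entries, purely for illustration
tables = [
    Table("rides_raw", "Raw ride events", ("rides",), query_count=5),
    Table("rides_daily", "Daily ride aggregates", ("rides", "core"), query_count=900),
    Table("drivers", "Driver profiles", ("people",), query_count=400),
]
print(search(tables, "rides"))  # ['rides_daily', 'rides_raw']
```

Both `rides_*` tables match the term equally well on text, so the heavily queried one comes first, while `drivers` is left out despite its usage because it doesn't match at all.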
Ensure you have at least 3GB of disk space available to Docker. You'll need to install Docker and docker-compose.
Follow these guides to install Docker based on your operating system:
And here's a guide on how to install docker-compose for all systems.
You can check your current Docker version with this command:
docker --version
And to check your docker-compose version, use this command:
docker-compose --version
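If you'd rather script this check, here's a minimal sketch that looks for both tools on your `PATH` and parses the version they report. The helper names (`parse_version`, `installed_version`) are my own, and the regular expression assumes the tool prints a dotted version number somewhere in its `--version` output.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_version(output: str) -> Optional[Tuple[int, ...]]:
    """Pull the first dotted version number out of a `--version` line."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups() if part is not None)


def installed_version(tool: str) -> Optional[Tuple[int, ...]]:
    """Return the version of `tool`, or None if it is missing or fails to run."""
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, "--version"], capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return parse_version(result.stdout)


if __name__ == "__main__":
    for tool in ("docker", "docker-compose"):
        version = installed_version(tool)
        print(f"{tool}: {'not found' if version is None else version}")
```

A `None` result means the tool is either not installed or not visible from the shell you're running in, which is worth checking on WSL2 before going further.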
We'll be using WSL2 for this guide, and we'll start by cloning this repo and its submodules:
git clone --recursive https://github.com/amundsen-io/amundsen.git
Next, enter the cloned directory:
cd amundsen
If this is your first time, make sure you've allocated the necessary memory. The minimum needed for all the containers to run with the loaded sample data is 3GB.
If you're using WSL2, you can check your allocation through .wslconfig. Follow this guide to set your .wslconfig for WSL2.
As an example, here's the .wslconfig I use:
[wsl2]
memory=6GB
processors=2
swap=3GB
If you've made changes to the configuration, restart your PC so they can take effect. If no changes were necessary, proceed to the next step.
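As a quick sanity check, the memory setting can also be read programmatically. The sketch below parses the text of a `.wslconfig` file with Python's `configparser` and converts the `memory` value to gigabytes. The `memory_gb` helper is hypothetical, and it assumes the value is written as a number followed by `GB` or `MB`, as in my example above.

```python
import configparser
import re
from typing import Optional

MIN_MEMORY_GB = 3.0  # minimum quoted in this guide for the sample data


def memory_gb(config_text: str) -> Optional[float]:
    """Return the [wsl2] memory setting in GB, or None if it is not set."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    raw = parser.get("wsl2", "memory", fallback=None)
    if raw is None:
        return None
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(GB|MB)", raw.strip(), re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognised memory value: {raw!r}")
    amount, unit = float(match.group(1)), match.group(2).upper()
    return amount if unit == "GB" else amount / 1024


# The same settings as the example .wslconfig in this post
sample = "[wsl2]\nmemory=6GB\nprocessors=2\nswap=3GB\n"
allocated = memory_gb(sample)
print(allocated is not None and allocated >= MIN_MEMORY_GB)  # True
```

To check your real file, read its text and pass it to `memory_gb`. A result of `None` just means no explicit limit is set, in which case WSL2 falls back to its default allocation.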
For this demo, we'll be using the Neo4j backend. Run the following command:
docker-compose -f docker-amundsen.yml up
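The containers can take a while to come up. Here's a small sketch that polls until the ports used later in this guide (5000 for the UI and 7474 for the Neo4j browser) accept TCP connections. The function names and timeouts are my own choices, and an open port only tells you something is listening, not that the service has finished starting.

```python
import socket
import time
from typing import Iterable, List


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_ports(
    ports: Iterable[int],
    host: str = "localhost",
    deadline_s: float = 300.0,
    poll_s: float = 5.0,
) -> List[int]:
    """Poll until every port accepts connections; return the ports still closed."""
    pending = set(ports)
    stop_at = time.monotonic() + deadline_s
    while pending:
        pending = {p for p in pending if not port_open(host, p)}
        if not pending or time.monotonic() >= stop_at:
            break
        time.sleep(poll_s)
    return sorted(pending)
```

Calling `wait_for_ports([5000, 7474])` returns an empty list once both ports are reachable, or the list of ports that were still closed when the deadline passed.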
In a separate terminal window, change your directory to databuilder (inside the cloned repo) to ingest the provided sample data into Neo4j:
cd databuilder
Install the dependencies in a virtual environment. For this, we'll be using pyenv, a tool for managing multiple Python versions, and its plugin pyenv-virtualenv for managing multiple virtual environments. If you don't have these installed, check out these guides on how to install pyenv and how to create a virtual environment with pyenv.
Check your pyenv versions:
pyenv versions
Activate the environment in the current directory. In my case, my virtual environment is called amundsen_demo:
pyenv local amundsen_demo
Then upgrade pip, the package installer for Python:
pip3 install --upgrade pip
Next, install the Python packages listed in the requirements.txt file, which contains the project's dependencies:
pip3 install -r requirements.txt
Then install the Amundsen Databuilder package by running its setup script:
python3 setup.py install
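Before moving on, it can help to confirm that the install landed in the virtual environment you expect. This is a minimal sketch, assuming the package is importable under the top-level module name `databuilder`. The helper functions are hypothetical and only use the standard library.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv/virtualenv environment."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def is_importable(module_name: str) -> bool:
    """True if the top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
    # "databuilder" is the assumed import name of the Amundsen Databuilder package
    print("databuilder importable:", is_importable("databuilder"))
```

If the second line prints `False`, the most likely cause is that the packages were installed with a different interpreter than the one the virtual environment uses.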
Next, we'll load the sample data into Neo4j and Elasticsearch directly, without using an Airflow DAG (Directed Acyclic Graph), by running the sample data loader script:
python3 example/scripts/sample_data_loader.py
The script consists of several jobs:
| Job | Description |
|---|---|
| run_csv_job | Reads table data from a CSV file, writes the data to another local directory as a CSV file, and then publishes the data to Neo4j, a graph database management system. |
| run_table_column_job | Similar to run_csv_job, but processes a CSV file containing column data instead. |
| create_last_updated_job | Creates a job that gets the current time, converts it into a predefined data model, and publishes it to Neo4j. |
| create_es_publisher_sample_job | Creates a job that extracts data from Neo4j and publishes it to Elasticsearch, a search and analytics engine. |
The script imports necessary modules, sets up configuration, and uses various extractors, loaders, and publishers from the Amundsen Databuilder library to perform the tasks mentioned above.
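The extractor, loader, and publisher split described above can be sketched in plain Python. This is not the Databuilder API: the classes below (`CsvExtractor`, `StagingLoader`, `InMemoryPublisher`) and the CSV columns are hypothetical stand-ins, with an in-memory dictionary playing the role of Neo4j, just to show how a record flows through the three stages.

```python
import csv
import io
from typing import Dict, Iterator, List


class CsvExtractor:
    """Yields one dict per CSV row."""

    def __init__(self, source: io.TextIOBase) -> None:
        self._reader = csv.DictReader(source)

    def extract(self) -> Iterator[Dict[str, str]]:
        yield from self._reader


class StagingLoader:
    """Buffers records in memory (the real jobs stage them as CSV files on disk)."""

    def __init__(self) -> None:
        self.staged: List[Dict[str, str]] = []

    def load(self, record: Dict[str, str]) -> None:
        self.staged.append(record)


class InMemoryPublisher:
    """Stands in for the Neo4j publisher: stores records keyed by table name."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Dict[str, str]] = {}

    def publish(self, records: List[Dict[str, str]]) -> None:
        for record in records:
            self.nodes[record["name"]] = record


def run_job(extractor: CsvExtractor, loader: StagingLoader, publisher: InMemoryPublisher) -> int:
    """Extract every record, stage it, then publish the staged batch."""
    for record in extractor.extract():
        loader.load(record)
    publisher.publish(loader.staged)
    return len(loader.staged)


# Made-up sample rows; the real sample CSVs have more columns
sample_csv = io.StringIO(
    "name,description\n"
    "test_table1,first sample table\n"
    "test_table2,second sample table\n"
)
publisher = InMemoryPublisher()
count = run_job(CsvExtractor(sample_csv), StagingLoader(), publisher)
print(count, sorted(publisher.nodes))  # 2 ['test_table1', 'test_table2']
```

Keeping the three stages separate is what lets the same job shape be reused with a different source or a different target, which is how the Elasticsearch job differs from the Neo4j ones.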
Now you can view the UI at http://localhost:5000 and try searching for test. You should get some results.
You can also perform an exact-match search for a table entity. For instance, search for test_table1 in the table field of the filter, and it'll return the matching records.
To verify the dummy data has been ingested into Neo4j, visit http://localhost:7474/browser/ and run this in the query box:
MATCH (n:Table) RETURN n LIMIT 25
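If you'd like to run a similar check from a script instead of the browser, Neo4j also exposes a transactional HTTP endpoint that accepts Cypher as JSON. The sketch below only builds the request and parses a response. The endpoint path depends on your Neo4j version (for example `/db/data/transaction/commit` on 3.x and `/db/neo4j/tx/commit` on 4.x and later), and the URL and credentials shown are placeholders you'd need to replace with your own.

```python
import base64
import json
import urllib.request


def cypher_request(url: str, statement: str, user: str, password: str) -> urllib.request.Request:
    """Build a POST request carrying one Cypher statement with basic auth."""
    body = json.dumps({"statements": [{"statement": statement}]}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def first_value(response_json: dict) -> int:
    """Read the first value of the first row from a transactional response."""
    return response_json["results"][0]["data"][0]["row"][0]


# Placeholder URL and credentials: adjust both to your own setup
request = cypher_request(
    "http://localhost:7474/db/data/transaction/commit",
    "MATCH (n:Table) RETURN count(n)",
    user="your_user",
    password="your_password",
)
# To send it: json.load(urllib.request.urlopen(request)), then pass the result to first_value()
```

A count greater than zero tells you the same thing as the browser query: the sample tables made it into the graph.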
You can verify the data has been loaded into the metadataservice by visiting:
Finally, don't forget to stop your running multi-container app once you've finished using it:
docker-compose -f docker-amundsen.yml down
- Installation Guide
- My Quick Start Amundsen Demo 2023 [No Sound No Commentary]
- Amundsen: A Data Discovery Platform From Lyft
- Slides: Amundsen: A Data Discovery Platform From Lyft
That concludes our quick start guide to setting up and running a demo of Amundsen. I hope you found this post helpful and informative. If you have any questions or if there's something you'd like to know more about, feel free to drop a comment below or reach out to me directly.
Remember, there's no limit to what you can achieve with the right tools and a little bit of know-how. Keep exploring, keep learning, and as always, keep pushing the boundaries of what's possible with data.
Happy coding, everyone! 🚀