Introduction
Amundsen is an advanced data discovery and metadata engine designed to boost the productivity of data analysts, data scientists, and engineers when interacting with data.
It achieves this by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank style search based on usage patterns (e.g., highly queried tables show up earlier than less queried tables).
In simple terms, it's like Google search for data. The project is named after Norwegian explorer Roald Amundsen, the first person to reach the South Pole.
Amundsen is hosted by the LF AI & Data Foundation. Its core includes three microservices, one data ingestion library, and one common library, alongside two libraries that support additional backends:
Amundsen Libraries | Description |
---|---|
amundsenfrontendlibrary | A Flask application with a React frontend. |
amundsensearchlibrary | A search service that leverages Elasticsearch for search capabilities. |
amundsenmetadatalibrary | A metadata service that leverages Neo4j or Apache Atlas as the persistent layer to provide various metadata. |
amundsendatabuilder | A data ingestion library for building the metadata graph and search index. |
amundsencommon | A common library that holds code shared among Amundsen's microservices. |
amundsengremlin | A library that holds code for converting model objects into Gremlin vertices and edges, used for loading data into an AWS Neptune backend. |
amundsenrds | Contains ORM models that support using a relational database as Amundsen's metadata backend store. |
Check out their GitHub for more information.
How does it work?
Discover trusted data
Search for data within your organization with a simple text search. A PageRank-inspired search algorithm recommends results based on names, descriptions, tags, and querying/viewing activity on the table/dashboard.
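To make the ranking idea concrete, here's a minimal sketch in Python. It is not Amundsen's actual algorithm; the TableDoc class, the field weights, and the log-scaled usage boost are all assumptions chosen purely for illustration, as are the sample table names.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TableDoc:
    """Hypothetical search document; not an Amundsen model."""
    name: str
    description: str = ""
    tags: list = field(default_factory=list)
    query_count: int = 0  # how often the table was queried


def score(doc, term):
    """Combine a text match with a usage boost (weights are illustrative)."""
    term = term.lower()
    text_score = 0.0
    if term in doc.name.lower():
        text_score += 3.0
    if term in doc.description.lower():
        text_score += 1.0
    if any(term in tag.lower() for tag in doc.tags):
        text_score += 2.0
    if text_score == 0.0:
        return 0.0  # no textual match, so usage alone does not surface the table
    # log1p dampens the boost so popularity cannot completely drown out relevance
    return text_score * (1.0 + math.log1p(doc.query_count))


def search(docs, term):
    """Return the matching docs, highest score first."""
    scored = [(score(d, term), d) for d in docs]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [d for s, d in ranked if s > 0]


docs = [
    TableDoc("rides_raw", "raw ride events", ["rides"], query_count=5),
    TableDoc("rides_daily", "daily ride aggregates", ["rides"], query_count=900),
    TableDoc("users", "user accounts", ["core"], query_count=5000),
]
print([d.name for d in search(docs, "rides")])
```

Both rides tables match the term equally well on text, so the heavily queried rides_daily ends up ahead of rides_raw, while users is left out because it doesn't match at all.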
See automated and curated metadata
Build trust in data using automated and curated metadata: descriptions of tables and columns, other frequent users, when the table was last updated, statistics, a preview of the data if permitted, etc. Easily triage by linking the ETL job and code that generated the data.
Share context with coworkers
Update tables and columns with descriptions to reduce unnecessary back-and-forth about which table to use and what a column contains.
Learn from others
See what data fellow co-workers frequently use, own, or have bookmarked. Learn what the most common queries for a table look like by seeing dashboards built on a given table.
Check out Amundsen's website and their documentation for more information.
Installation
Ensure you have at least 3GB of disk space available to Docker. You'll need to install docker and docker-compose.
Follow these guides to install Docker based on your operating system:
And here's a guide on how to install docker-compose for all systems.
You can check your current Docker version with this command:
docker -v
And to check your docker-compose version, use this command:
docker-compose -v
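If you'd rather check both tools in one go, here's a small Python sketch that does the same thing as the two commands above. The tool_version helper is a name I made up for this post; it only relies on the -v flag already shown and on the standard library.

```python
import shutil
import subprocess


def tool_version(command):
    """Return the output of `<command> -v`, or None if the tool is unavailable."""
    if shutil.which(command) is None:
        return None  # not installed, or not on PATH
    result = subprocess.run([command, "-v"], capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return result.stdout.strip()


for tool in ("docker", "docker-compose"):
    version = tool_version(tool)
    print(f"{tool}: {version or 'not found on PATH'}")
```

A None result just means the tool couldn't be found or run, which is a good hint to revisit the installation guides before going further.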
We'll be using WSL2 for this guide, and we'll start by cloning the Amundsen repo and its submodules:
git clone --recursive https://github.com/amundsen-io/amundsen.git
Next, enter the cloned directory:
cd amundsen
If this is your first time, make sure you've allocated the necessary memory. The minimum needed for all the containers to run with the loaded sample data is 3GB.
If you're using WSL2, you can check your allocation through .wslconfig. Follow this guide to set your .wslconfig for WSL2.
As an example, here's the .wslconfig I use:
[wsl2]
memory=6GB
processors=2
swap=3GB
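Since .wslconfig uses an INI-style layout, you can sanity-check the memory setting against the 3GB minimum with a few lines of Python. This is only a sketch: wsl_memory_gb is a hypothetical helper, it assumes the value is written as a whole number followed by GB or MB, and it parses a string rather than guessing where your file lives.

```python
import configparser
import re


def wsl_memory_gb(config_text):
    """Return the [wsl2] memory setting in GB, or None if it is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    value = parser.get("wsl2", "memory", fallback=None)
    if value is None:
        return None
    # Assumes values such as "6GB" or "2048MB"
    match = re.fullmatch(r"(\d+)\s*(GB|MB)", value.strip(), flags=re.IGNORECASE)
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2).upper()
    return amount if unit == "GB" else amount / 1024


sample = "[wsl2]\nmemory=6GB\nprocessors=2\nswap=3GB\n"
memory = wsl_memory_gb(sample)
print(f"memory: {memory} GB, enough for the demo: {memory is not None and memory >= 3}")
```

To check your real file, read its contents into a string first and pass that to wsl_memory_gb instead of sample.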
If you've made changes to the configuration, restart your PC so they can take effect. If no changes were necessary, proceed to the next step.
For this demo, we'll be using the Neo4j backend. Run the following command:
docker-compose -f docker-amundsen.yml up
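The containers take a little while to come up. Here's a rough Python sketch for checking whether the services are accepting connections yet. It uses ports 5000 and 7474 because those are the ones this guide visits later; port_open and wait_for_ports are helper names I made up, and an open port only tells you something is listening, not that the service is fully ready.

```python
import socket
import time


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_ports(ports, host="localhost", attempts=30, delay=2.0):
    """Poll until every port accepts connections; return False if we give up."""
    pending = set(ports)
    for _ in range(attempts):
        pending = {p for p in pending if not port_open(host, p)}
        if not pending:
            return True
        time.sleep(delay)
    return False


# One-off status check for the ports used later in this guide
services = {5000: "frontend UI", 7474: "Neo4j browser"}
for port, name in services.items():
    status = "up" if port_open("localhost", port) else "not reachable yet"
    print(f"{name} (port {port}): {status}")
```

If you want to block until everything is reachable, call wait_for_ports(services) instead of the one-off loop.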
In a separate terminal window, change your directory to databuilder to ingest the provided sample data into Neo4j:
cd databuilder
Install the dependencies in a virtual environment. For this, we'll be using pyenv, a tool for managing multiple Python versions, and its plugin pyenv-virtualenv for managing multiple virtual environments. If you don't have these installed, check out these guides on how to install pyenv and how to create a virtual environment with pyenv.
Check your pyenv versions:
pyenv versions
Activate the environment in the current directory. In my case, my virtual environment is called amundsen_demo:
pyenv local amundsen_demo
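Before installing anything, it's worth confirming that python3 really resolves to the virtual environment. Here's a quick Python sketch; it assumes the environment was created with venv or a recent virtualenv, where sys.prefix differs from sys.base_prefix, and in_virtualenv is just an illustrative helper name.

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter prefix differs from the base install."""
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

If this reports False, the packages in the next steps would be installed into your base Python rather than the demo environment.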
Then upgrade pip, the package installer for Python:
pip3 install --upgrade pip
Next, install the Python packages listed in the requirements.txt file. This file contains a list of dependencies required by a Python project:
pip3 install -r requirements.txt
Then install the Amundsen Databuilder package using its setup script:
python3 setup.py install
We'll then load the sample data into Neo4j and Elasticsearch, without using an Airflow DAG (Directed Acyclic Graph), with this script:
python3 example/scripts/sample_data_loader.py
The script consists of several jobs:
Amundsen Jobs | Description |
---|---|
run_csv_job | Reads table data from a CSV file, writes the data to another local directory as a CSV file, and then publishes the data to Neo4j, a graph database management system. |
run_table_column_job | Similar to run_csv_job, but processes a CSV file containing column data instead. |
create_last_updated_job | Creates a job that gets the current time, converts it into a predefined data model, and publishes it to Neo4j. |
create_es_publisher_sample_job | Creates a job that extracts data from Neo4j and publishes it to Elasticsearch, a search and analytics engine. |
The script imports necessary modules, sets up configuration, and uses various extractors, loaders, and publishers from the Amundsen Databuilder library to perform the tasks mentioned above.
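The extract, load, and publish flow described above can be sketched in plain Python. To be clear, this is not the Databuilder API: CsvExtractor, ListLoader, DictPublisher, and run_job are hypothetical stand-ins, the sample rows are made up, and a dict plays the role of the database so the example runs without Neo4j.

```python
import csv
import io


class CsvExtractor:
    """Hypothetical extractor: hands out one dict per CSV row."""

    def __init__(self, text):
        self._rows = csv.DictReader(io.StringIO(text))

    def extract(self):
        return next(self._rows, None)  # None signals the end of the data


class ListLoader:
    """Hypothetical loader: stages records in memory instead of on disk."""

    def __init__(self):
        self.staged = []

    def load(self, record):
        self.staged.append(record)


class DictPublisher:
    """Hypothetical publisher: copies staged records into a dict standing in for the database."""

    def __init__(self, store):
        self.store = store

    def publish(self, records):
        for record in records:
            self.store[record["name"]] = record


def run_job(extractor, loader, publisher):
    """Pull records one at a time, stage them, then publish the whole batch."""
    record = extractor.extract()
    while record is not None:
        loader.load(record)
        record = extractor.extract()
    publisher.publish(loader.staged)


# Made-up sample rows, not the contents of the real sample CSV files
sample_csv = (
    "name,schema,description\n"
    "test_table1,test_schema,1st test table\n"
    "test_table2,test_schema,2nd test table\n"
)
store = {}
run_job(CsvExtractor(sample_csv), ListLoader(), DictPublisher(store))
print(sorted(store))
```

Keeping the three roles separate is what lets the same job shape cover table data, column data, and the Elasticsearch publishing step with different pieces plugged in.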
Now you can view the UI at http://localhost:5000 and try searching for test. You should get some results.
You can also perform an exact-match search for a table entity. For instance, search for test_table1 in the table field of the filter, and it'll return the matching records.
To verify the dummy data has been ingested into Neo4j, visit http://localhost:7474/browser/ and run this in the query box:
MATCH (n:Table) RETURN n LIMIT 25
You can verify the data has been loaded into the metadataservice by visiting:
Finally, don't forget to stop your running multi-container app after you've finished using it:
docker-compose -f docker-amundsen.yml down
Miscellaneous
- Installation Guide
- My Quick Start Amundsen Demo 2023 [No Sound No Commentary]
- Amundsen: A Data Discovery Platform From Lyft
- Slides: Amundsen: A Data Discovery Platform From Lyft
The End
That concludes our quick start guide to setting up and running a demo of Amundsen. I hope you found this post helpful and informative. If you have any questions or if there's something you'd like to know more about, feel free to drop a comment below or reach out to me directly.
Remember, there's no limit to what you can achieve with the right tools and a little bit of know-how. Keep exploring, keep learning, and as always, keep pushing the boundaries of what's possible with data.
If you want to stay updated with my latest posts and activities, or if you just want to connect, follow me on Beacons:
Happy coding, everyone!