I strongly believe that open source is the future. In the modern software development cycle, there is huge interest in open-source projects. This is because the open-source approach reduces development costs, makes development more flexible, and encourages innovation.
MongoDB: This is an open-source document database that can store both structured and unstructured data, using a JSON-like format for its documents.
Jupyter Notebook: This is one of the most widely used open-source tools and has revolutionized the data science landscape. It makes it easy to create and share documents containing code, equations, and visualizations. More recently, Jupyter Notebook has evolved into JupyterLab, which adds functionality such as a console, a terminal, and a text editor.
PySpark: This is the Python API for Apache Spark, an open-source cluster-computing framework. Spark's core idea is distributed computing: work is split across the nodes of a cluster and processed in parallel.
To start, we will install MongoDB, PySpark, and JupyterLab. These tools can easily be installed with Docker Compose, with every service we will use defined in a single docker-compose.yml file. For the containers to communicate efficiently with each other, we define a custom network; in this case I named it my-network. This network also isolates the containers from external networks. Once the file is in place, navigate to the directory containing it and run docker compose up -d to create and start the containers. That is all for our setup. A sketch of what the compose file might look like is shown below.
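A minimal sketch of such a docker-compose.yml, assuming the official mongo image and the jupyter/pyspark-notebook image; the image tags, ports, and volume paths here are illustrative and should be adjusted to your own setup:

```yaml
version: "3.8"

services:
  mongodb:
    image: mongo:6.0                          # MongoDB service, reachable as "mongodb" on my-network
    container_name: mongodb
    ports:
      - "27017:27017"                         # expose MongoDB to the host for Compass / mongosh
    volumes:
      - mongo-data:/data/db                   # persist the database between restarts
    networks:
      - my-network

  pyspark:
    image: jupyter/pyspark-notebook:latest    # JupyterLab with PySpark preinstalled
    container_name: pyspark
    ports:
      - "8888:8888"                           # JupyterLab web interface
    volumes:
      - ./notebooks:/home/jovyan/work         # mount local notebooks into the container
    networks:
      - my-network

networks:
  my-network:                                 # custom bridge network shared by both services
    driver: bridge

volumes:
  mongo-data:
```

With a file along these lines, docker compose up -d brings up MongoDB on port 27017 and JupyterLab on port 8888.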
Loading data into MongoDB
It is simple to use MongoDB Compass to import and export data to and from a MongoDB collection. MongoDB Compass supports both CSV and JSON file formats. For our illustration, we will use the Electric Vehicle Population dataset found at https://catalog.data.gov/dataset/electric-vehicle-population-data. This is what MongoDB Compass looks like after importing the data.
To verify that the data was imported correctly, we can query it from the terminal of our desktop machine using the MongoDB shell. We will achieve this with the following steps.
1. Select the database: this is done with the use command. For example, I created a database named EV with a collection named data, so I will run use EV in the shell.
2. Query the data: this is done with the find method. In my case, the call is db.data.find(). Put together, the shell session looks like the short sketch after this list.
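A minimal mongosh session, assuming the EV database and data collection named above:

```js
use EV            // switch to the EV database
db.data.find()    // return documents from the data collection
```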
As seen above, we can be sure that the data has been imported correctly into MongoDB. Now we can use Spark to load the data into JupyterLab. The steps are listed below, followed by a code sketch that puts them together.
1. Import the required libraries
2. Create a connection and read the data
3. Check the data type
4. View the data
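A minimal PySpark sketch of these steps. The connector package version, the mongodb hostname (taken from the compose sketch above), and the EV/data database and collection names are assumptions; adjust them to match your environment and Spark version.

```python
from pyspark.sql import SparkSession

# 1. Import the required libraries and create a Spark session.
#    The MongoDB Spark connector package and the "mongodb" hostname are
#    assumptions; they should match your Spark version and compose service name.
spark = (
    SparkSession.builder
    .appName("mongodb-to-pyspark")
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.12:10.2.1")
    .config("spark.mongodb.read.connection.uri", "mongodb://mongodb:27017")
    .getOrCreate()
)

# 2. Create a connection and read the data from the EV database, data collection.
df = (
    spark.read.format("mongodb")
    .option("database", "EV")
    .option("collection", "data")
    .load()
)

# 3. Check the data type: df is a PySpark DataFrame, and printSchema()
#    shows the inferred type of each column.
print(type(df))
df.printSchema()

# 4. View the data (first rows).
df.show(5)
```

If the notebook runs inside the pyspark container on my-network, the hostname mongodb resolves to the MongoDB service; from the host machine you would use localhost:27017 instead.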
As the procedure above shows, creating a data pipeline from MongoDB to PySpark is straightforward and efficient. Happy coding!
Find the complete project at https://github.com/ndurumo254/mongodb