Table of contents
- Common data analysis tools and “the problem”
- Data analysis issue #1: storing and accessing data
- Data analysis issue #2: maximizing analysis speed and computation efficiency (the bigger the dataset, the bigger the problem)
- Data analysis issue #3: storing and maintaining scripts for data analysis
- Data analysis issue #4: easily utilizing new or additional technologies
- Wrapping up
Time-series data is everywhere, and it drives decision-making in every industry. Time-series data collectively represents how a system, process, or behavior changes over time. Understanding these changes helps us to solve complex problems across numerous industries, including observability, financial services, Internet of Things, and even professional football.
Depending on the type of application they’re building, developers end up collecting millions of rows of time-series data (and sometimes millions of rows of data every day or even every hour!). Making sense of this high-volume, high-fidelity data takes a particular set of data analysis skills that aren’t often exercised as part of the classic developer skillset. To perform time-series analysis that goes beyond basic questions, developers and data analysts need specialized tools, and as time-series data grows in prominence, the efficiency of these tools becomes even more important.
Often, data analysts’ work can be boiled down to evaluating, cleaning, transforming, and modeling data. In my experience, I’ve found these actions are necessary for me to gain understanding from data, and I will refer to this as the “data analysis lifecycle” throughout this post.
Excel, R, and Python are arguably some of the most commonly used data analysis tools, and, while they are all fantastic tools, they may not be suited for every job. Speaking from experience, these tools can be especially inefficient for “data munging” at the early stages of the lifecycle; specifically, the evaluating data, cleaning data, and transforming data steps involved in pre-modeling work.
As I’ve worked with larger and more complex datasets, I’ve come to believe that databases built for specific types of data - such as time-series data - are more effective for data analysis.
For background, TimescaleDB is a relational database for time-series data. If your analysis is based on time-series datasets, TimescaleDB can be a great choice not only for its scalability and dependability but also for its relational nature. Because TimescaleDB is packaged as an extension to PostgreSQL, you’ll be able to look at your time-series data alongside your relational data and get even more insight. (I recognize that as a Developer Advocate at Timescale, I might be a little biased 😊…)
In this four-part blog series, I will discuss each of the three data munging steps in the analysis lifecycle in depth and demonstrate how to use TimescaleDB as a powerful tool for your data analysis.
In this introductory post, I'll explore a few of the common frustrations that I experienced with popular data analysis tools, and from there, dive into how I’ve used TimescaleDB to help alleviate each of those pain points.
In future posts we'll look at:
- How TimescaleDB data analysis functionality can replace work commonly performed in Python and pandas
- How TimescaleDB vs. Python and pandas compare (benchmarking a standard data analysis workflow)
- How to use TimescaleDB to conduct an end-to-end, deep-dive data analysis, using real yellow taxi cab data from the New York City Taxi and Limousine Commission (NYC TLC).
If you are interested in trying out TimescaleDB and PostgreSQL functionality right away, sign up for a free 30-day trial or install and manage it on your instances. (You can also learn more by following one of our many tutorials.)
Common data analysis tools and “the problem”
As we’ve discussed, Excel, R, and Python are among the most commonly used tools for data analysis. While they are great tools in their own right, they are not optimized to efficiently perform every step in the analysis process.
In particular, most data scientists (including myself!) struggle with similar issues as the amount of data grows or the same analysis needs to be redone month after month.
Some of these struggles include:
- Data storage and access: Where is the best place to store and maintain my data for analysis?
- Data size and its influence on the analysis: How can I improve efficiency for data munging tasks, especially as data scales?
- Script storage and accessibility: What can I do to improve data munging script storage and maintenance?
- Easily utilizing new technologies: How could I set up my data analysis toolchain to allow for easy transitions to new technologies?
So buckle in, keep your arms and legs in the vehicle at all times, and let’s start looking at these problems!
Data analysis issue #1: storing and accessing data
To do data analysis, you need access to… data.
Managing where that data lives, and how easily you can access it is the preliminary (and often most important) step in the analysis journey. Every time I begin a new data analysis project, this is often where I run into my first dilemma. Regardless of the original data source, I always ask “where is the best place to store and maintain the data as I start working through the data munging process?”
Although it's becoming more common for data analysts to use databases for storing and querying data, it's still not ubiquitous. Too often, raw data is provided in a stream of CSV files or APIs that produce JSON. While this may be manageable for smaller projects, it can quickly become overwhelming to maintain and difficult to manage from project to project.
For example, let’s consider how we might use Python as our data analysis tool of choice.
While using Python for data analysis, I have the option of ingesting data through files/APIs OR a database connection.
If I used files or APIs for querying data during analysis, I often faced questions like:
- Where are the files located? What happens if the URL or parameters change for an API?
- What happens if duplicate files are made? And what if updates are made to one file, and not the other?
- How do I best share these files with colleagues?
- What happens if multiple files depend on one another?
- How do I prevent incorrect data from being added to the wrong column of a CSV (i.e., a decimal where a string should be)?
- What about very large files? What is the ingestion rate for a 10 MB, 100 MB, 1 GB, or 1 TB file?
After running into these initial problems project after project, I knew there had to be a better solution. I knew that I needed a single source of truth for my data – and it started to become clear that a specialized SQL database might be my answer!
Now, let’s consider if I were to connect to TimescaleDB.
By importing my time-series data into TimescaleDB, I can create one source of truth for all of my data. As a result, collaborating with others becomes as simple as sharing access to the database. Any modification to the data munging process within the database means that all users have access to the same changes at the same time, as opposed to parsing through CSV files to verify I have the right version.
Additionally, databases can typically handle much larger data loads than a script written in Python or R. TimescaleDB was built to house, maintain, and query terabytes of data efficiently and cost-effectively (both computationally speaking AND for your wallet). With features like continuous aggregates and native columnar compression, storing and analyzing years of time-series data becomes efficient while the data stays easily accessible.
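To give a rough idea of what that setup involves, here is a minimal sketch for a hypothetical sensor_readings table (the table, column names, and the seven-day compression window are illustrative assumptions of mine, not part of any dataset in this post) 👇
-- hypothetical table used only for illustration
CREATE TABLE sensor_readings (
    ts        TIMESTAMPTZ NOT NULL,
    device_id INTEGER     NOT NULL,
    reading   DOUBLE PRECISION
);

-- turn the plain PostgreSQL table into a TimescaleDB hypertable, partitioned by time
SELECT create_hypertable('sensor_readings', 'ts');

-- enable native columnar compression, segmenting compressed data by device
ALTER TABLE sensor_readings SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id'
);

-- automatically compress chunks once they are older than seven days
SELECT add_compression_policy('sensor_readings', INTERVAL '7 days');
Once the hypertable and compression policy are in place, older data keeps shrinking on disk automatically while remaining queryable with plain SQL.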
In short, managing data over time, especially when it comes from different sources, can be a nightmare to maintain and access efficiently. But, it doesn’t have to be.
Data analysis issue #2: maximizing analysis speed and computation efficiency (the bigger the dataset, the bigger the problem)
Excel, R, and Python are all capable of performing the first three steps of the data analysis “lifecycle”: evaluating, cleaning, and transforming data. However, these technologies are not generally optimized for speed or computational efficiency during the process.
In numerous projects over the years, I’ve found that as the size of my dataset increased, the process of importing, cleaning, and transforming it became more difficult, time-consuming, and, in some cases, impossible. In Python and R, parsing through large amounts of data seemed to take forever, and Excel would simply crash once it hit millions of rows.
Things became especially difficult when I needed to create additional tables for things like aggregates or data transformations: some lines of code could take seconds or, in extreme cases, minutes to run depending on the size of the data, the computer I was using, or the complexity of the analysis.
While seconds or minutes may not seem like a lot, it adds up and amounts to hours or days of lost productivity when you’re performing analysis that needs to be run hundreds or thousands of times a month!
To illustrate, let’s look at a Python example once again.
Say I was working with this IoT dataset taken from Kaggle. The set contains two tables: one specifying energy consumption for a single home in Houston, Texas, and the other documenting weather conditions.
To run through analysis with Python, the first steps in my analysis would be to pull in the data and observe it.
When using Python to do this, I would run code like this 👇
import psycopg2
import pandas as pd
import configparser

## use a config file for database connection information
config = configparser.ConfigParser()
config.read('env.ini')

## establish the database connection
conn = psycopg2.connect(database=config.get('USERINFO', 'DB_NAME'),
                        host=config.get('USERINFO', 'HOST'),
                        user=config.get('USERINFO', 'USER'),
                        password=config.get('USERINFO', 'PASS'),
                        port=config.get('USERINFO', 'PORT'))

## define the queries for selecting data out of our database
query_weather = 'select * from weather'
query_power = 'select * from power_usage'

## create a cursor to extract the data
cursor = conn.cursor()
cursor.execute(query_weather)
weather_data = cursor.fetchall()
cursor.execute(query_power)
power_data = cursor.fetchall()
cursor.close()

## place the data into DataFrames; the column names have to be set manually
weather_df = pd.DataFrame(weather_data, columns=['date','day','temp_max','temp_avg','temp_min','dew_max','dew_avg','dew_min','hum_max','hum_avg','hum_min','wind_max','wind_avg','wind_min','press_max','press_avg','press_min','precipit','day_of_week'])
power_df = pd.DataFrame(power_data, columns=['startdate', 'value_kwh', 'day_of_week', 'notes'])

print(weather_df.head(20))
print(power_df.head(20))
Altogether, this code took 2.718 seconds to run using my 2019 MacBook Pro laptop with 32GB memory.
But what if I run the equivalent queries with SQL directly in the database?
select * from weather;
select * from power_usage;
| startdate | value_kwh | day_of_week | notes |
|---|---|---|---|
| 2016-01-06 01:00:00 | 1 | 2 | weekday |
| 2016-01-06 02:00:00 | 1 | 2 | weekday |
| 2016-01-06 03:00:00 | 1 | 2 | weekday |
| 2016-01-06 04:00:00 | 1 | 2 | weekday |
| 2016-01-06 05:00:00 | 0 | 2 | weekday |
| 2016-01-06 06:00:00 | 0 | 2 | weekday |
| 2016-01-06 07:00:00 | 0 | 2 | weekday |
| 2016-01-06 08:00:00 | 0 | 2 | weekday |
| 2016-01-06 09:00:00 | 0 | 2 | weekday |
| 2016-01-06 10:00:00 | 0 | 2 | weekday |
| 2016-01-06 11:00:00 | 1 | 2 | weekday |
| 2016-01-06 12:00:00 | 0 | 2 | weekday |
| 2016-01-06 13:00:00 | 0 | 2 | weekday |
| 2016-01-06 14:00:00 | 0 | 2 | weekday |
| 2016-01-06 15:00:00 | 0 | 2 | weekday |
| 2016-01-06 16:00:00 | 1 | 2 | weekday |
| 2016-01-06 17:00:00 | 4 | 2 | weekday |
This query only took 0.342 seconds to run, almost 8x faster when compared to the Python script.
This time difference makes a lot of sense when we consider that Python must connect to a database, then run the SQL query, then parse the retrieved data, and then import it into a DataFrame. While almost three seconds is fast, this extra time for processing adds up as the script becomes more complicated and more data munging tasks are added.
Pulling in the data and observing it is only the beginning of my analysis! What happens when I need to perform a transforming task, like aggregating the data?
For this dataset, when we look at the power_usage table (as seen above), kWh readings are recorded every hour. If I want to do daily analysis, I have to aggregate the hourly data into “day buckets”.
If I used Python for this aggregation, I could use something like 👇
# sum power usage by day, bucket by day
## create column for the day
day_col = pd.to_datetime(power_df['startdate']).dt.strftime('%Y-%m-%d')
power_df.insert(0, 'date_day', day_col)
agg_power = power_df.groupby('date_day').agg({'value_kwh' : 'sum', 'day_of_week' : 'unique', 'notes' : 'unique' })
print(agg_power)
...which takes 0.49 seconds to run (this does not include the time for importing our data).
Alternatively, with the TimescaleDB time_bucket() function, I could do this aggregation directly in the database using the following query 👇
select
    time_bucket(interval '1 day', startdate) as day,
    sum(value_kwh),
    day_of_week,
    notes
from power_usage pu
group by day, day_of_week, notes
order by day;
| day | sum | day_of_week | notes |
|---|---|---|---|
| 2016-01-06 00:00:00 | 27 | 2 | weekday |
| 2016-01-07 00:00:00 | 42 | 3 | weekday |
| 2016-01-08 00:00:00 | 51 | 4 | weekday |
| 2016-01-09 00:00:00 | 50 | 5 | weekend |
| 2016-01-10 00:00:00 | 45 | 6 | weekend |
| 2016-01-11 00:00:00 | 22 | 0 | weekday |
| 2016-01-12 00:00:00 | 12 | 1 | weekday |
| 2016-02-06 00:00:00 | 32 | 5 | weekend |
| 2016-02-07 00:00:00 | 62 | 6 | weekend |
| 2016-02-08 00:00:00 | 48 | 0 | weekday |
| 2016-02-09 00:00:00 | 23 | 1 | weekday |
| 2016-02-10 00:00:00 | 24 | 2 | weekday |
...which only takes 0.087 seconds and is over 5x faster than the Python script.
You can start to see a pattern here.
As mentioned above, TimescaleDB was created to efficiently query and store time-series data. But simply querying data only scratches the surface of the possibilities TimescaleDB and PostgreSQL functionality provides.
TimescaleDB and PostgreSQL offer a wide range of tools and functionality that can replace the need for additional tools to evaluate, clean, and transform your data. Some of the TimescaleDB functionality includes continuous aggregates, compression, and hyperfunctions; all of which allow you to do nearly all data munging tasks directly within the database.
When I performed the evaluating, cleaning, and transforming steps of my analysis directly within TimescaleDB, I cut out the need to use additional tools - like Excel, R, or Python - for data munging tasks. I could pull cleaned and transformed data, ready for modeling, directly into Excel, R, or Python.
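As a small sketch of what that in-database evaluation step can look like, the query below profiles the power_usage table used earlier in one pass: per-day row counts, missing readings, and basic statistics (these particular checks are just ones I might reach for first, not a prescribed workflow) 👇
-- evaluate the raw data directly in the database:
-- per-day row counts, missing values, and basic statistics
SELECT
    time_bucket(INTERVAL '1 day', startdate) AS day,
    count(*) AS readings,
    count(*) FILTER (WHERE value_kwh IS NULL) AS missing_readings,
    min(value_kwh) AS min_kwh,
    max(value_kwh) AS max_kwh,
    avg(value_kwh) AS avg_kwh
FROM power_usage
GROUP BY day
ORDER BY day;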
Data analysis issue #3: storing and maintaining scripts for data analysis
Another potential downside of exclusively using Excel, R, or Python for the entire data analysis workflow is that all of the logic for analyzing the data is contained within a script file. Similar to the issue of having many different data sources, maintaining script files can be inconvenient and messy.
Some common issues that I - and many data analysts - run into include:
- Losing files
- Unintentionally creating duplicate files
- Changing or updating some files but not others
- Needing to write and run scripts to access transformed data (see example below)
- Spending time re-running scripts whenever new raw data is added (see example below)
While you can use a code repository to overcome some of these issues, it will not fix the last two.
Let’s consider our Python scenario again.
Say that I used a Python script exclusively for all my data analysis tasks. What happens if I need to export my transformed data to use in a report on energy consumption in Texas?
Likely, I would have to add some code within the script to allow for exporting the data and then run the script again to actually export it. Depending on the content of the script and how long it takes to transform the data, this could be pretty inconvenient and inefficient.
What if I also just got a bunch of new energy usage and weather data? For me to incorporate this new raw data into existing visualizations or reports, I would need to run the script again and make sure that all of my data munging tasks run as expected.
Database functions, like continuous aggregates and materialized views, can create transformed data that can be stored and queried directly from your database without running a script. Additionally, I can create policies for continuous aggregates to regularly keep this transformed data up-to-date any time raw data is modified. Because of these policies, I wouldn't have to worry about running scripts to re-transform data for use, making access to updated data efficient. With TimescaleDB, many of the data munging tasks in the analysis lifecycle that you would normally do within your scripts can be accomplished using built-in TimescaleDB and PostgreSQL functionality.
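To make that concrete, here is a rough sketch of a continuous aggregate and refresh policy for the power_usage data from earlier (this assumes power_usage has been set up as a hypertable; the view name daily_power and the offset/schedule values are illustrative choices of mine, not recommendations) 👇
-- continuously maintained daily roll-up of the hourly readings
CREATE MATERIALIZED VIEW daily_power
WITH (timescaledb.continuous) AS
SELECT
    time_bucket(INTERVAL '1 day', startdate) AS day,
    sum(value_kwh) AS total_kwh
FROM power_usage
GROUP BY day;

-- keep the aggregate up to date on a schedule as new raw data arrives
SELECT add_continuous_aggregate_policy('daily_power',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 day');
With the policy in place, reports and dashboards can query daily_power directly; there is no script to re-run when new readings land.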
Data analysis issue #4: easily utilizing new or additional technologies
Finally, the last step in the data analysis lifecycle: modeling. Whenever I wanted to use a new tool or technology to create a visualization, it was difficult to take my transformed data and use it for modeling or visualizations elsewhere.
Python, R, and Excel are all pretty great for their visualization and modeling capabilities. However, what happens when your company or team wants to adopt a new tool?
In my experience, this often means either adding on another step to the analysis process, or rediscovering how to perform the evaluating, cleaning, and transforming steps within the new technology.
For example, in one of my previous jobs, I was asked to convert a portion of my analysis into Power BI for business analytics purposes. Some of the visualizations my stakeholders wanted required me to access transformed data from my Python script. At the time, I had the option of exporting the data from my Python script or figuring out how to transform the data in Power BI directly. Neither option was ideal, and both were guaranteed to take extra time.
When it comes to adopting new visualization or modeling tools, using a database for evaluating, cleaning, and transforming data can again work in your favor. Most visualization tools - such as Grafana, Metabase, or Power BI - allow users to import data from a database directly.
Since I can do most of my data munging tasks within TimescaleDB, adding or switching tools - such as using Power BI for dashboard capabilities - becomes as simple as connecting to my database, pulling in the munged data, and using the new tool for visualizations and modeling.
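For instance, if the hypothetical daily_power continuous aggregate sketched above existed, the query behind a Grafana or Power BI panel could be as simple as this 👇
-- pull the already-munged daily totals for the last 30 days
SELECT day, total_kwh
FROM daily_power
WHERE day > now() - INTERVAL '30 days'
ORDER BY day;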
Wrapping up
In summary, Excel, R, and Python are all great tools to use for analysis, but may not be the best tools for every job. Case in point: my struggles with time-series data analysis, especially on big datasets.
With TimescaleDB functionality, you can house your data and perform the evaluating, cleaning, and transforming aspects of data analysis, all directly within your database – and solve a lot of common data analysis woes in the process (which I’ve - hopefully! - demonstrated in this post).
In the blog posts to come, I’ll explore TimescaleDB and PostgreSQL functionality compared to Python, benchmark TimescaleDB performance vs. Python and pandas for data munging tasks, and conduct a deep-dive into data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations).
If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, join our community Slack, where you'll find an active community of time-series enthusiasts and various Timescale team members (including me!).
If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can sign up for a free 30-day trial or install TimescaleDB and manage it on your current PostgreSQL instances. We also have a bunch of great tutorials to help get you started.
Until next time!
This post was originally written by Miranda Auhl and published on https://blog.timescale.com on September 9, 2021.