The call for action to respond to the coronavirus COVID-19 pandemic has sent many public groups and private organizations on a quest to find new approaches or solutions to this urgent problem.
However, among the many challenges faced, one stands out: data-driven insights are only as trustworthy as the quality and provenance of the data behind them. In this article, Datopian takes a closer look at some of the data and issues we came across during our own Open Data 2020 COVID-19 hackathon.
As the global community continues to face the coronavirus COVID-19 pandemic, momentum is gathering again around hackathons as a way to contribute quickly towards addressing the crisis. Already, there have been at least 54 reported global hackathon events since mid-March - with Forbes reporting that high-profile events such as the #BuildforCOVID19 Global Online Hackathon and The Global Hack attracted more than 18,000 and 12,000 participants respectively.
At Datopian, we’ve been working with some COVID-19 datasets - and like any data-driven journey, ours had its fair share of surprises along the way. Here are some of our stories and links, which we hope might help you with your next hackathon!
Every data hacker knows that a hackathon begins in the cleaning room - and this one was no different! We spent a fair bit of time cleaning and tidying the COVID-19 datasets so that they are ready to use at your next hackathon!
The Johns Hopkins University Center for Systems Science and Engineering (CSSE) has done an amazing job of collecting data in such a short amount of time and providing it in near real-time. Their repository had around 20,000 stars on GitHub (making it the most popular repo with COVID-19 data) and the count is still growing! But with over 1000 reported and open issues on the main dataset, there were still a few things we could contribute towards the data cleaning process.
So we began with the Johns Hopkins datasets - the main aggregated dataset - and cleaned and normalized the data. Some of this involved tidying up dates (some formats that are common in the US and UK aren’t international standards!) and consolidating several files into normalized time series. We also added some standard metadata, such as column descriptions, and packaged the result as a Data Package, a specification from Frictionless Data that helps format and describe a collection of data. You can even download it in alternative formats (e.g. JSON) from our DataHub: DataHub.io provides a user-friendly interface to showcase our datasets.
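To give a flavour of this kind of cleaning, here is a minimal Python sketch of two of the steps: converting dates to ISO 8601 and reshaping a wide file into a time series. It is not our actual pipeline. It assumes the upstream file is laid out with one column per day and US-style M/D/YY column headers, and the sample rows, the `ID_COLUMNS` tuple and the `to_iso_date` and `unpivot` helpers are all made up for illustration.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample in an assumed wide layout: one column per day,
# with US-style M/D/YY dates as column headers. The numbers are invented.
SAMPLE_WIDE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Italy,43.0,12.0,0,2
Hubei,China,30.97,112.27,10,15
"""

# Columns that identify a location rather than hold a daily value.
ID_COLUMNS = ("Province/State", "Country/Region", "Lat", "Long")


def to_iso_date(us_date):
    """Convert a US-style M/D/YY string such as '1/22/20' to '2020-01-22'."""
    return datetime.strptime(us_date, "%m/%d/%y").date().isoformat()


def unpivot(wide_csv, value_name):
    """Turn one-column-per-day rows into one row per location per day."""
    rows = []
    for record in csv.DictReader(io.StringIO(wide_csv)):
        for column, value in record.items():
            if column in ID_COLUMNS:
                continue  # not a date column
            rows.append({
                "Date": to_iso_date(column),
                "Country/Region": record["Country/Region"],
                # Empty province cells become None instead of ''.
                "Province/State": record["Province/State"] or None,
                value_name: int(value),
            })
    return rows


long_rows = unpivot(SAMPLE_WIDE, "Confirmed")
for row in long_rows:
    print(row)
```

Running the same function over separate files for confirmed cases, deaths and recoveries, and then joining the results on date and location, is one way the several files could be consolidated into a single time series.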
Our dataset includes time-series data tracking the number of people affected by COVID-19 worldwide, including:
- confirmed tested cases of Coronavirus infection;
- the number of people who have reportedly died while sick with Coronavirus; and
- the number of people who have reportedly recovered from it.
Our data is disaggregated by country and region/state, with additional aggregated files by country and worldwide. What you might find useful about our datasets is our commitment to using open source scripts, which allows you to audit or cross-check the data for yourself - or contribute towards improving it!
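As a rough illustration of how the aggregated files relate to the disaggregated one, the sketch below sums region-level rows up to country level and then to a worldwide total. It is written in plain Python rather than with our real scripts; the field names, the `aggregate` helper and the figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical disaggregated rows; the counts are invented for illustration.
ROWS = [
    {"Date": "2020-03-01", "Country/Region": "Australia", "Province/State": "Victoria", "Confirmed": 10},
    {"Date": "2020-03-01", "Country/Region": "Australia", "Province/State": "Queensland", "Confirmed": 5},
    {"Date": "2020-03-01", "Country/Region": "Italy", "Province/State": None, "Confirmed": 100},
    {"Date": "2020-03-02", "Country/Region": "Australia", "Province/State": "Victoria", "Confirmed": 12},
    {"Date": "2020-03-02", "Country/Region": "Australia", "Province/State": "Queensland", "Confirmed": 6},
    {"Date": "2020-03-02", "Country/Region": "Italy", "Province/State": None, "Confirmed": 130},
]


def aggregate(rows, key_fields, value_field="Confirmed"):
    """Sum value_field over all rows that share the same key_fields."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        totals[key] += row[value_field]
    return dict(totals)


# One total per (date, country), then one total per date for the whole world.
by_country = aggregate(ROWS, ("Date", "Country/Region"))
worldwide = aggregate(ROWS, ("Date",))

print(by_country)
print(worldwide)
```

Because the aggregates are derived purely from the disaggregated rows, anyone can re-run a step like this to cross-check the published country and worldwide files.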
Interestingly, this has been our most popular dataset to date - with over 650 stars and developers building their own applications (including dashboards) based on our dataset!
At Datopian, we love to build reliable and auditable data pipelines! On our journey to finding some insights into the COVID-19 datasets, we came across data sources with varying degrees of data quality and reliability - and re-discovered the old adage that a dataset is only as good as its data sources!
Here’s a list of the data sources used to create the main aggregated dataset:
- World Health Organization (WHO): https://www.who.int/
- DXY.cn. Pneumonia. 2020. http://3g.dxy.cn/newh5/view/pneumonia
- BNO News: https://bnonews.com/index.php/2020/02/the-latest-coronavirus-cases/
- National Health Commission of the People’s Republic of China (NHC): http://www.nhc.gov.cn/xcs/yqtb/list_gzbd.shtml
- China CDC (CCDC): http://weekly.chinacdc.cn/news/TrackingtheEpidemic.htm
- Hong Kong Department of Health: https://www.chp.gov.hk/en/features/102465.html
- Macau Government: https://www.ssm.gov.mo/portal/
- Taiwan CDC: https://sites.google.com/cdc.gov.tw/2019ncov/taiwan?authuser=0
- US CDC: https://www.cdc.gov/coronavirus/2019-ncov/index.html
- Government of Canada: https://www.canada.ca/en/public-health/services/diseases/coronavirus.html
- Australia Government Department of Health: https://www.health.gov.au/news/coronavirus-update-at-a-glance
- European Centre for Disease Prevention and Control (ECDC): https://www.ecdc.europa.eu/en/geographical-distribution-2019-ncov-cases
- Ministry of Health Singapore (MOH): https://www.moh.gov.sg/covid-19
- Italy Ministry of Health: http://www.salute.gov.it/nuovocoronavirus
The schematic above shows the sources from which we collect our data using the dataflows library.
Communicating important global issues and developments with data in near real-time has never been more important than now. The current COVID-19 pandemic has already had far-reaching implications for almost all sectors of the economy, and has forced governments to put in place a range of social restrictions.
Whether citizens cooperate with these measures, and how well they can monitor their effectiveness, depends largely on transparent feedback about the progress made. This is particularly true when the situation is changing exponentially and it’s difficult to get a sense of how quickly things are evolving. So data visualization has an important role to play here.
At Datopian, we created this dynamic dashboard, built using React, which shows cumulative confirmed cases, new cases per day and deaths. One of the key features of this visualization is that it lets you select the country of your choice to check on the latest COVID-19 indicators. We also added a table that shows a summary per country and allows users to sort the data.
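The dashboard itself is a front-end application, but the arithmetic behind two of its views is simple enough to sketch in Python. The example below derives new cases per day from a cumulative series and builds a sortable country summary. It is an illustration under stated assumptions, not the dashboard's source code: the country names, the numbers and the `daily_new` helper are hypothetical, and the first day's new cases are assumed to equal that day's cumulative count.

```python
# Hypothetical cumulative confirmed counts per country, one value per day.
CUMULATIVE = {
    "Country A": [1, 3, 6, 10],
    "Country B": [0, 5, 5, 9],
}


def daily_new(cumulative):
    """Difference a cumulative series into new cases per day.

    Assumption: the first day's new cases equal its cumulative count.
    A downward revision in the source data shows up as a negative value.
    """
    new_cases = []
    previous = 0
    for total in cumulative:
        new_cases.append(total - previous)
        previous = total
    return new_cases


# One summary row per country, sorted by total confirmed cases, largest first.
summary = sorted(
    (
        {
            "country": country,
            "confirmed": series[-1],
            "new_today": daily_new(series)[-1],
        }
        for country, series in CUMULATIVE.items()
    ),
    key=lambda row: row["confirmed"],
    reverse=True,
)

for row in summary:
    print(row)
```

Sorting by a different column only requires changing the `key` function, which is essentially what a sortable table does when a user clicks a column header.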
While the COVID-19 pandemic continues to sweep across the globe, data hackers are continuing to give generously of their time and effort. There’s so much we can still do to contribute towards our open data ecosystem, and to learn from these experiences.
If you want to join and add to our efforts, check out our GitHub:
- Datasets: https://github.com/datasets/covid-19
- Dashboard source code: https://github.com/datopian/covid-19
Datopian delivers outstanding solutions that enable your organization to realise your data’s potential. From hosted data portals powered by CKAN to specialised data engineering, from agile data practices to data strategy development, Datopian empowers you to transform data into insight.
© Datopian (CC Attribution-Sharealike (by-sa))