Angha Ramdohokar

Design considerations for large data import

The first thing that comes to mind after hearing the term large data is data that is bigger in volume. It can also be defined as follows:

> Big Data is data that contains greater variety, arriving in increasing volumes and with more velocity. (3Vs)

Traditional relational database systems make it easy to work with data as long as the data is small enough to manage.
However, when the data reaches a significant volume, it becomes very difficult to work with, because it takes a long time, or is sometimes even impossible, to read, write, and process it successfully. I ran into this exact problem in a recent project: we had a large data system that was the source of our data, and we needed to import that data into another system.

Here are the design considerations I followed when working on the large data import.

1. Data extraction

For data extraction, I did two kinds of extraction: a one-time data extraction and data extraction in sync.

One-time data extraction
Here, we needed to extract all the data from the source system once.
There are multiple ways to extract data, depending on the source system.

In my case, the source data was in SQL Server, so I followed these steps:

  • I executed SQL scripts to get only the data I wanted from the source system's tables (a sketch of such a script is shown after this list).
  • I then saved that data to .csv files using the Save As option on the results grid in SQL Server.

Another option is to use the Generate Scripts feature to script out the data for the objects you want to extract.
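For example, a one-time extraction query might look like the sketch below; the table, columns, and filter are hypothetical placeholders for whatever your source system actually holds.

```sql
-- Hypothetical one-time extraction: pull only the columns and rows that need to be migrated.
SELECT o.OrderId,
       o.CustomerId,
       o.OrderDate,
       o.TotalAmount
FROM   dbo.Orders AS o
WHERE  o.OrderDate >= '2020-01-01'   -- restrict to the data we actually want
ORDER  BY o.OrderId;

-- Run the query, then use the Save As option on the results grid to write the output to a .csv file.
```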

Data extraction in sync
Here, we needed to keep the data in sync, i.e. extract the data repeatedly at a specific time interval.

For this, we have a stored procedure that is executed through a web job; the stored procedure contains the logic to extract the data from the source system and save it to a .csv file on the server.

We set that web job to run every five minutes and replace the existing .csv files with new ones.
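A minimal sketch of what such a procedure could look like is below. It assumes the bcp utility is installed on the server and that xp_cmdshell is enabled; the database, table, column, and file path names are all hypothetical.

```sql
-- Sketch only: export rows changed in the last five minutes to a .csv file on the server.
CREATE PROCEDURE dbo.usp_ExportOrdersToCsv
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @cmd NVARCHAR(4000) =
        N'bcp "SELECT OrderId, CustomerId, OrderDate, TotalAmount '
      + N'FROM SourceDb.dbo.Orders '
      + N'WHERE ModifiedAt >= DATEADD(MINUTE, -5, GETUTCDATE())" '
      + N'queryout "D:\exports\orders.csv" -c -t, -T -S localhost';

    -- Requires xp_cmdshell to be enabled on the instance.
    EXEC master..xp_cmdshell @cmd;
END
```

The web job then simply runs `EXEC dbo.usp_ExportOrdersToCsv;` on its five-minute schedule.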

2. Data transformation

Data transformation is where the data is converted from one form into another.
In the first step, we extracted the data and saved it to .csv files. In this step, I changed that data so that it matches the form the target system expects.

I did the following things as part of the data transformation (a SQL sketch of part of this cleanup follows below):

  • Removing bad data
  • Converting date fields
  • Mapping columns to the target table's columns
  • Deleting duplicate data
  • Saving the file in Excel format for the load

Transformation may include other aspects as well, like integrating data from multiple sources into one format and then manipulating it.
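If the extracted rows are first staged in a working table, much of this cleanup can be expressed directly in T-SQL. The sketch below shows hypothetical examples of removing bad rows, converting a date field, and deleting duplicates; the staging table and column names are made up.

```sql
-- Remove obviously bad rows (e.g. missing keys).
DELETE FROM dbo.Staging_Orders
WHERE  OrderId IS NULL OR CustomerId IS NULL;

-- Convert a text date field to the target DATE type; TRY_CONVERT yields NULL for unparseable values.
UPDATE dbo.Staging_Orders
SET    OrderDate = TRY_CONVERT(date, OrderDateText, 103);   -- 103 = dd/mm/yyyy

-- Delete duplicates, keeping one row per OrderId.
WITH Ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY OrderId ORDER BY OrderDate DESC) AS rn
    FROM   dbo.Staging_Orders
)
DELETE FROM Ranked
WHERE  rn > 1;
```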

3. Data load

In the data load, we actually load the data extracted and transformed in steps 1 and 2.

For the data load, my destination system is Azure SQL Server, so I used the SQL Server Import and Export Wizard and followed these steps:

  1. Choose a data source: select the Excel file containing the data
  2. Choose a destination: your destination database
  3. Specify the destination table and select the source tables
  4. Run the wizard

(Screenshot: SQL Server Import and Export Wizard)

You might face issues with bad data if the data was not cleaned properly or if the transformed data is not in the desired format.

After the SQL Server Import and Export Wizard runs successfully, you can see your data imported from the source system into the destination system.

You can also load the data using SQL scripts if you extracted it with the Generate Scripts feature mentioned in step 1; such scripts are essentially plain INSERT statements that can be run directly against the destination database.
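A hypothetical snippet of that kind of script, with made-up table and column names:

```sql
-- Example of the kind of INSERT script the Generate Scripts feature produces.
INSERT INTO dbo.Orders (OrderId, CustomerId, OrderDate, TotalAmount)
VALUES (1, 101, '2021-03-15', 250.00);

INSERT INTO dbo.Orders (OrderId, CustomerId, OrderDate, TotalAmount)
VALUES (2, 102, '2021-03-16', 99.50);
```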

After the data load, we checked the performance of all the pages and endpoints. Looking at those numbers, we tried to fix the performance issues with the following changes (see the sketch after this list):

  1. Adding non-clustered indexes to the required tables.
  2. Selecting only the required columns instead of returning all of them from an endpoint.
  3. Adding offset and limit (pagination) to all the GET endpoints.
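A minimal sketch of the first and third items, with hypothetical table, column, and parameter names:

```sql
-- 1. Non-clustered index on the columns the slow endpoints filter and sort on.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId_OrderDate
    ON dbo.Orders (CustomerId, OrderDate)
    INCLUDE (TotalAmount);

-- 2 & 3. Query behind a GET endpoint: return only the needed columns,
--        one page at a time via OFFSET/FETCH.
DECLARE @Offset INT = 0, @Limit INT = 50;

SELECT OrderId, OrderDate, TotalAmount
FROM   dbo.Orders
WHERE  CustomerId = 101
ORDER  BY OrderDate DESC
OFFSET @Offset ROWS FETCH NEXT @Limit ROWS ONLY;
```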

We also added a spinner that is shown until all the endpoint calls have completed.

These are the performance numbers after all the improvements:

(Screenshot: performance numbers after the improvements)

Conclusion

Nowadays, the number of applications that need to process large amounts of data keeps increasing, so it's essential to understand the problems and design a solution for your specific needs. These design considerations should be helpful in such scenarios.

Happy Reading !!
