GSoC 20: Week 3

Hello everyone,
It's Niraj again, and today I will be sharing my code contributions from the third week of GSoC.

Background

Currently, we have a module named cvedb which contains a function that downloads CVE datasets from the National Vulnerability Database (NVD) and stores them in the user's cache directory (~/.cache). After downloading the whole dataset, it pre-processes it, retrieves the necessary information, and populates a local SQLite database with it.
It also contains other functions for deleting cached data, updating stale data, initializing the database if it's empty, and so on.

What did I do this week?

I made the cvedb module asynchronous this week. We were using multiprocessing to download the data, which isn't necessary: downloading is an IO-bound task, and for IO-bound tasks asyncio is the better fit.

At first I thought that storing the whole datasets was unnecessary, since we have already populated the SQLite database from them. After populating the database, we only use the cached files to check whether a dataset is stale, and for that we compute the SHA sum of the cached dataset and compare it to the latest SHA sum listed in the metadata on the NVD site. We could save a significant amount of space by storing just the SHA sum of each dataset and comparing that instead, but my mentor pointed out that she and others sometimes use the raw data for analysis. So I decided to keep storing the datasets for now; in the future, we may provide an option to disable caching of the whole datasets.
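
For context, that staleness check boils down to hashing the cached file and comparing the digest against the one NVD publishes in its feed metadata. Here is a minimal sketch of the idea; the cache path and helper names are mine for illustration, not the actual cvedb code:

```python
import hashlib
from pathlib import Path

# Hypothetical cache location, mirroring the ~/.cache layout described above.
CACHE_DIR = Path.home() / ".cache" / "cve-bin-tool"

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 hex digest of a cached dataset file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()

def is_stale(dataset: Path, latest_sha: str) -> bool:
    """Stale if the file is missing or its hash no longer matches NVD's metadata."""
    return not dataset.exists() or sha256_of(dataset) != latest_sha.upper()
```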

I have split the class that creates the database (CVEDB) from the class that provides methods for querying it (CVEScanner), for better maintainability. I am now using aiohttp to download the datasets. I also tried using aiosqlite to populate the database, but it was significantly slower than its synchronous counterpart. I suspect this was the result of a lack of overlapping IO tasks. I could increase concurrency by spawning multiple tasks, but that increased memory usage significantly, so the only way to keep memory in check was to update the database synchronously. There is another way, but it requires downloading synchronously and, instead of storing the data on disk, using an in-memory queue as a pipeline to feed the tasks that populate the database.

There are other ways to optimize performance, such as disabling database journaling, batching the inserts into one big transaction and committing it once, or running multiple consumers (tasks that populate the database). I will experiment with these techniques in the future, but for now I am satisfied with the performance of asynchronous downloads via aiohttp and synchronous database updates via the sqlite3 module. I haven't created a PR yet because the upstream build has bugs, and I have to wait until the patch for that gets merged.
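
To give a flavour of the aiohttp side, here is a stripped-down sketch of concurrent downloads. The URL template and year range are illustrative of the NVD yearly JSON feeds; the real module also handles gzip decompression, caching, and error recovery:

```python
import asyncio
import aiohttp

# NVD publishes one JSON feed per year (URL template shown for illustration).
FEED_URLS = [
    f"https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-{year}.json.gz"
    for year in range(2002, 2021)
]

async def fetch(session, url):
    # All downloads share one connection pool instead of one process each.
    async with session.get(url) as response:
        response.raise_for_status()
        return await response.read()

async def download_all():
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in FEED_URLS))

if __name__ == "__main__":
    datasets = asyncio.run(download_all())
```

And the "one big transaction" idea from the list above is essentially the pattern below. The table schema is a placeholder, and the pragmas trade crash safety for speed, which seems acceptable here since the database can always be rebuilt from the cached datasets:

```python
import sqlite3

def bulk_insert(db_path, rows):
    """Insert all rows in one transaction instead of committing per row."""
    con = sqlite3.connect(db_path)
    con.execute("PRAGMA journal_mode = OFF")  # no rollback journal
    con.execute("PRAGMA synchronous = OFF")   # skip the per-commit fsync
    con.execute(
        "CREATE TABLE IF NOT EXISTS cves "
        "(cve_number TEXT PRIMARY KEY, severity TEXT, score REAL)"
    )
    with con:  # a single transaction wraps the whole batch
        con.executemany("INSERT OR REPLACE INTO cves VALUES (?, ?, ?)", rows)
    con.close()

bulk_insert("cve.db", [("CVE-2020-0001", "HIGH", 7.8)])
```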

What am I doing this week?

According to my timeline, I would be completing my work on concurrency by making our last module, scanner, asynchronous, but my mentors want me to start working on the InputEngine since users have been requesting that feature. So I might start with that first. We are going to discuss future plans in a virtual conference tomorrow.

I will update you about my future plans in next week's post, so stay tuned. See you then!
