GSoC 20: Week 1

#python #gsoc #opensource

Hello everyone!

It's Niraj again. Today, I will be sharing my code contributions from the first week of GSoC. If you haven't read my first GSoC post yet, check that out first: GSoC 20: Beginning of a new journey

Background

Currently, we have three types of tests to prove that our checker works as expected:

mapping_test - a test to show that our checker correctly detects the version from a binary file.
package_test - a test to show that our checker works correctly on real-world package distributions.
filename_test - a test to show that our checker correctly detects filename patterns.

The mapping_test and package_test, along with their test data, currently live in test_scanner.py, while the filename_test for each checker can be found in test_checker.py.

What did I do this week?

I completed my work on removing the compiler dependency this week and opened a PR. Until now, we have been compiling C files to create binary files that contain the same version strings as the product a checker targets, so that we can assert that our checker and scanner modules work correctly. We call this the mapping_test. Most of the strings produced by compiling a C file are just compiler output, which we ignore anyway. So why not use struct or plain binary strings instead, which would save both time and space? While experimenting with struct, I found that the binary file it produces is the same as the one we get by simply writing binary strings to a file.
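The struct observation can be checked with a minimal sketch like the one below. It assumes the version string is packed with a fixed-length "s" format; the variable names are only for illustration and are not taken from the PR.

```python
import struct

version_string = b"Bash version 1.14.0"

# Option 1: pack the string with struct, using an "<n>s" format whose
# length matches the string exactly (so no padding is added).
packed = struct.pack(f"{len(version_string)}s", version_string)

# Option 2: use the byte string as it is.
plain = version_string

# Both options yield identical bytes, so struct adds nothing here.
same = packed == plain
```

Since the two byte sequences are identical, writing plain byte strings to the file is the simpler choice.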

To make the basic test suite run quickly, we create "faked" binary files to test the CVE mappings. However, we also want to test real packages to check that the signatures work on real-world data. We currently have a _file_test function (renamed to test_version_in_package in the current PR) that takes a URL, a package name, a module name and a version, downloads the package, and runs the scanner against it. We call this the package_test.

Note: You can find the mapping_test as the test_version_mapping function and the package_test as the test_version_in_package function in test_scanner.py.

Initially, I proposed a file named mapping_test_data.py for the mapping_test, containing a list of dictionaries that each hold the version, the checker name (module) and the version_strings, and a file named package_test_data.py for the package_test, containing a list of tuples of url, package_name, module_name and version. For example:

# mapping_test_data.py
mapping_test_data = [
    {
        "module": "bash",
        "version": "1.14.0",
        "version_strings": ["Bash version 1.14.0"],
    },
    {
        "module": "binutils",
        "version": "2.31.1",
        "version_strings": [
            "Using the --size-sort and --undefined-only options together",
            "libbfd-2.31.1-system.so",
            "Auxiliary filter for shared object symbol table",
        ],
    },
]

# package_test_data.py
import itertools

package_test_data = itertools.chain(
    [
        # Filetests for bash checker
        (
            "https://kojipkgs.fedoraproject.org/packages/bash/4.0/1.fc11/x86_64/",
            "bash-4.0-1.fc11.x86_64.rpm",
            "bash",
            "4.0.0",
        ),
        (
            "http://rpmfind.net/linux/mageia/distrib/4/x86_64/media/core/updates/",
            "bash-4.2-53.1.mga4.x86_64.rpm",
            "bash",
            "4.2.53",
        ),
    ],
    [
        # Filetests for binutils checker
        (
            "http://security.ubuntu.com/ubuntu/pool/main/b/binutils/",
            "binutils_2.26.1-1ubuntu1~16.04.8_amd64.deb",
            "binutils",
            "2.26.1",
        ),
        (
            "http://mirror.centos.org/centos/7/os/x86_64/Packages/",
            "binutils-2.27-43.base.el7.x86_64.rpm",
            "binutils",
            "2.27",
        ),
    ],
)

Although this format is better than creating C files and keeping the test data in test_scanner.py, my mentors pointed out in this week's virtual conference that keeping the test data for all checkers in one file would make it hard to navigate, since the number of checkers will keep growing over time. They asked me to create a separate test_data file for each checker, named after the checker, with two attributes: 1) mapping_test_data, which contains the test data for our mapping test, and 2) package_test_data, which contains the test data for our package test. So I created a separate test_data file for each checker. For example, the test_data file for the bash checker looks like this:

# test_data/bash.py
mapping_test_data = [
    {
        "module": "bash",
        "version": "1.14.0",
        "version_strings": ["Bash version 1.14.0"]
    }
]
package_test_data = [
    {
        "url": "https://kojipkgs.fedoraproject.org/packages/bash/4.0/1.fc11/x86_64/",
        "package_name": "bash-4.0-1.fc11.x86_64.rpm",
        "module": "bash",
        "version": "4.0.0",
    },
    {
        "url": "http://rpmfind.net/linux/mageia/distrib/4/x86_64/media/core/updates/",
        "package_name": "bash-4.2-53.1.mga4.x86_64.rpm",
        "module": "bash",
        "version": "4.2.53",
    },
]
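To make the change of format concrete, here is a small sketch of how an old tuple-style package entry maps onto the new dictionary style. The helper tuple_to_entry and the FIELDS constant are hypothetical names used only for this illustration; they are not part of the project.

```python
# Keys used by the per-checker package_test_data dictionaries shown above.
FIELDS = ("url", "package_name", "module", "version")


def tuple_to_entry(entry):
    """Turn one (url, package_name, module, version) tuple into a dictionary."""
    if len(entry) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(entry)}")
    return dict(zip(FIELDS, entry))


# One entry in the earlier tuple format.
old_style = [
    (
        "https://kojipkgs.fedoraproject.org/packages/bash/4.0/1.fc11/x86_64/",
        "bash-4.0-1.fc11.x86_64.rpm",
        "bash",
        "4.0.0",
    ),
]

# The same entry in the dictionary format.
new_style = [tuple_to_entry(entry) for entry in old_style]
```

Named keys make each entry self-describing, which helps when a test_data file is read on its own.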

We also have to add a new entry to the __all__ list in the __init__.py file of the test_data directory for the checker we are writing tests for, if it doesn't already exist, because I use this list to load the test_data files at runtime. Here is the truncated list:

__all__ = [
    "bash",
    "binutils",
    "bluetoothctl",
    "busybox",
    "bzip2",
    "cups",
    ...
]
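Loading the files named in __all__ at runtime could look roughly like the sketch below. It is not the code from the PR: the load_test_data function and the throwaway demo_test_data package are assumptions made so the example runs on its own, and only importlib from the standard library is used.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def load_test_data(package_name):
    """Collect mapping and package test data from every module in __all__."""
    package = importlib.import_module(package_name)
    mapping_test_data, package_test_data = [], []
    for name in package.__all__:
        module = importlib.import_module(f"{package_name}.{name}")
        mapping_test_data.extend(getattr(module, "mapping_test_data", []))
        package_test_data.extend(getattr(module, "package_test_data", []))
    return mapping_test_data, package_test_data


# Build a throwaway package (hypothetical name) so the sketch is runnable.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_test_data"
pkg.mkdir()
(pkg / "__init__.py").write_text('__all__ = ["bash"]\n')
(pkg / "bash.py").write_text(
    'mapping_test_data = [{"module": "bash", "version": "1.14.0", '
    '"version_strings": ["Bash version 1.14.0"]}]\n'
    "package_test_data = []\n"
)
sys.path.insert(0, str(root))
importlib.invalidate_caches()

mapping, packages = load_test_data("demo_test_data")
```

With this approach, a module that is missing from __all__ is never imported, which is why the new entry has to be added.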

Once this PR is merged, a checker developer will only need to create two files: 1) a checker class file under the checkers directory and 2) a test_data file under the test_data directory. This will spare them the time of navigating the whole test_scanner file (around 2500 lines) just to add test data for the checker they have written.

What am I doing this week?

I am going to make the extractor module asynchronous this week. I have started working on it and have created some functions for it. By the end of the week, I want to have an asynchronous extractor module and an asynchronous test_extractor.

Have I got stuck anywhere?

While working on this issue, I learned that I also have to add some of the binary strings that a compiler normally adds. We use the Linux file utility to check whether the file we are scanning is a binary, and it wasn't flagging the files I generated as binary because they lacked the signatures normally found in one.

After some research, I learned about the magic signature that the Linux file utility uses to identify a binary file, and I added it to the binary files I was creating. It is called the ELF signature. Here is the magic hex signature that can be found at the beginning of most Linux executable files:

b"\x7f\x45\x4c\x46\x02\x01\x01\x03"
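As a rough sketch of how such a fake binary might be written, the example below puts the signature first and then appends the version strings. The write_fake_binary helper, the file name and the null-byte separator are assumptions for illustration, not the project's actual code.

```python
import tempfile
from pathlib import Path

# The signature quoted above: "\x7fELF" followed by four more header bytes.
ELF_SIGNATURE = b"\x7f\x45\x4c\x46\x02\x01\x01\x03"


def write_fake_binary(path, version_strings):
    """Write the ELF signature, then each version string ended by a null byte."""
    with open(path, "wb") as binary_file:
        binary_file.write(ELF_SIGNATURE)
        for version_string in version_strings:
            binary_file.write(version_string.encode("ascii") + b"\x00")


# Hypothetical output location for the demonstration.
path = Path(tempfile.mkdtemp()) / "fake-bash"
write_fake_binary(path, ["Bash version 1.14.0"])
data = path.read_bytes()
```

The file now starts with the same bytes as a real ELF executable while still containing only the strings the checker needs.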
