sc0v0ne

Applying Tests to Jupyter Notebook Functions and Refactoring Old Code

When I was at college I was never a big fan of presenting slides when it came to code; I always liked explaining theory along with practice. In 2022 I had to develop a project on refactoring. Well before that, I had already discovered Jupyter through some earlier work.

Following the guidance I received, I used the book Refactoring: Improving the Design of Existing Code, by Martin Fowler.

Guys, if you could leave a like I would be very grateful. That way I know whether you're enjoying it, and it also helps the article reach other readers. Thank you very much.

Continuing, the project I developed at the time can be found here.

sc0v0ne / replace_query_with_parameter

Replace Query with Parameter

🎓 College: Faculdade Metodista Granbery

👨‍🏫 Teacher: Marco Antônio - Github | Linkedin

📗 Book: Refatoração - Aperfeiçoando o design de códigos existentes - Martin Fowler

FOWLER, Martin. “Replace Query with Parameter” no código. In: REFATORAÇÃO: Aperfeiçoando o design de códigos existentes. 2. ed. [S. l.: s. n.], 2019. cap. 11.

Method

The goal of this method is to remove a query from inside a function, where it creates an unwanted dependency, and end up with a function that always returns the same result, which is called referential transparency.

Code

The code I chose came from an activity given in class by Professor Ricardo. The function receives the dataset's column names as a parameter, the data are checked according to their severity, and the missing values are then replaced according to their type.

The code can be found on the Kaggle platform.

Tools

    pip install ipytest
    pip install

What is the objective of the work?

The teacher assigned each student a refactoring topic. Mine was Replace Query with Parameter. The objective of this method is to remove a query from inside a function, where it creates an unwanted dependency, so that the function always returns the same result for the same arguments, which is called referential transparency.
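To make the idea concrete before looking at the real function, here is a minimal sketch of Replace Query with Parameter. Everything in it (`discount_table`, `ticket_price_with_query`, `ticket_price_with_parameter`) is a hypothetical example and does not come from the project's notebook.

```python
# Hypothetical example, not taken from the article's notebook.
discount_table = {"student": 0.5, "regular": 0.0}


def ticket_price_with_query(base_price, customer):
    # The function queries module-level state, so its result can change
    # even when the arguments stay the same.
    return base_price * (1 - discount_table[customer])


def ticket_price_with_parameter(base_price, discount):
    # The caller performs the query and passes the value in, so the same
    # arguments always produce the same result.
    return base_price * (1 - discount)


before_query = ticket_price_with_query(10.0, "student")
before_param = ticket_price_with_parameter(10.0, 0.5)

# Someone changes the shared state elsewhere in the program.
discount_table["student"] = 0.25

after_query = ticket_price_with_query(10.0, "student")
after_param = ticket_price_with_parameter(10.0, 0.5)

print(before_query, after_query)  # the query version changed its answer
print(before_param, after_param)  # the parameter version did not
```

The second function is referentially transparent: its result depends only on its arguments, which is also what makes it easy to test.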

The work therefore had the following steps to complete:

  1. Design or choose code that the refactoring can be applied to.
  2. Identify the associated bad smell.
  3. Write the test cases and run them.
  4. Apply the refactoring.
  5. Run the test cases again.
  6. Document each step for the presentation.

You will notice a big difference between the old code and the current one. I created a new project on GitHub and tagged the old version of the project. From there I will publish the updates I have planned. I'm finding it really fun to redo this project, because I can see how I've evolved and I can apply the new knowledge I gained. I reworked the entire project and restarted it; you can see in the versions that this is v.1.0.0. From that point on, none of the code is compatible with the old version.

Here are some explanations for each of these objectives, so you can follow along.

  1. The code I chose was a function I had developed in a course the previous term. Looking at this function again, it looks horrible. I'd be lying if I said it doesn't work, but there is a lot in it that can change. First, let's redo the process we followed before.

def treatment_missing_values_original():
    for col in df.columns:
        if col != 'age':
            mode_benign = df[(df['severity'] == 0)][col].mode()[0]
            mode_malignant = df[(df['severity'] == 1)][col].mode()[0]
            df.loc[(df[col].isnull())&(df['severity'] == 0), col] = mode_benign
            df.loc[(df[col].isnull())&(df['severity'] == 1), col] = mode_malignant

        else:
            mean_benign = df[(df['severity'] == 0)][col].mean()
            mean_malignant = df[(df['severity'] == 1)][col].mean()
            df.loc[(df[col].isnull())&(df['severity'] == 0), col] = mean_benign
            df.loc[(df[col].isnull())&(df['severity'] == 1), col] = mean_malignant


treatment_missing_values_original()

  2. There is a bad smell in the code, but I prefer to refactor first and deal with the bad smell afterwards, because the refactoring itself may resolve it. If it doesn't, the code will still stink and we'll fix that too.

  3. I wrote test cases for the code both before and after refactoring. I kept the files separate so you can check them. You can run them separately or all together, as you wish.

  4. I applied the refactoring, but remember not to do it the way I did. Because I am explaining and testing each step so you can see how it works, I purposely cluttered the code. You can see it in the function names: refc_01, refc_02, refc_03 ... This separates each refactoring step and its test, so that one step does not affect the code of another.

  5. I ran the test cases again after refactoring.

  6. I documented each step in this article. I will do my best to be clear about the details, and you will understand even more if you know the Python programming language. Please let me know if you notice anything strange or any incorrect text.


Refactoring

Let's review the function again.

def treatment_missing_values_original():
    for col in df.columns:
        if col != 'age':
            mode_benign = df[(df['severity'] == 0)][col].mode()[0]
            mode_malignant = df[(df['severity'] == 1)][col].mode()[0]
            df.loc[(df[col].isnull())&(df['severity'] == 0), col] = mode_benign
            df.loc[(df[col].isnull())&(df['severity'] == 1), col] = mode_malignant

        else:
            mean_benign = df[(df['severity'] == 0)][col].mean()
            mean_malignant = df[(df['severity'] == 1)][col].mean()
            df.loc[(df[col].isnull())&(df['severity'] == 0), col] = mean_benign
            df.loc[(df[col].isnull())&(df['severity'] == 1), col] = mean_malignant


treatment_missing_values_original()



Let's apply Replace Query with Parameter to this code. Note that the for loop in the function body iterates over df.columns. That is not bad in itself, but it can become a hassle for you as a professional when you revisit this code or need to update it. No parameters are passed to the function. Look at the two pieces involved: df, which is my pandas DataFrame, and the columns attribute I read from it. This can become a headache if the DataFrame changes in the future and the columns are no longer the same. Wouldn't it be more interesting to make this point more generic? So we need to refactor.

First, I want to talk a little about the Python library I used for testing. At the beginning I said that I like to explain theory along with practice, and that is the idea here. Picture the situation: you are presenting your college work, you have prepared a bunch of slides, and each slide has example code that cannot be executed. So you have to open your IDE, put the windows side by side and start presenting. Why not simplify this?

I really like using Visual Studio Code. Without going into details, I use it because of its flexibility with extensions, projects, teamwork and the other incredible options it provides. One of them is support for Jupyter notebooks, so you don't need to run a command in the terminal and have a browser window open where the notebooks run. You can do this in the IDE itself, which is the key point for the presentation.

With notebook cells you can mix text and running code in the middle of the presentation to show your proposal. In the case of this work, where I am showing scripts, it lets me present their quality and results even better, and I can even use the moment to show another example. The subject here is refactoring, and refactoring must have tests, which makes it even more important to cover testing as well.

With that in mind, at the time I looked for a library that would let me test the functions inside a notebook.

testbook

This simply incredible library is as simple to use as Python and pytest themselves, and it gives very good results. I leave the link below so you can access it and leave your star on the repository.

nteract / testbook

🧪 📗 Unit test your Jupyter Notebooks the right way

testbook is a unit testing framework extension for testing code in Jupyter Notebooks.

Previous attempts at unit testing notebooks involved writing the tests in the notebook itself. However, testbook will allow for unit tests to be run against notebooks in separate test files, hence treating .ipynb files as .py files.

testbook helps you set up conventional unit tests for your Jupyter Notebooks.

Here is an example of a unit test written using testbook

Consider the following code cell in a Jupyter Notebook example_notebook.ipynb:

def func(a, b):
   return a + b

You would write a unit test using testbook in a Python file example_test.py as follows:

# example_test.py
from testbook import testbook


@testbook('/path/to/example_notebook.ipynb', execute=True)
def test_func(tb):
   func = tb.get("func")

   assert func(1, 2) == 3

Then…

First of all, I want to show you the test for the function we are going to refactor; notice that it passes. We have to keep that standard in the following steps. Here you can see the library being used on the functions inside our Jupyter notebook. I will explain this initial step in more detail, because the following steps are very similar.

First we need the import:

from testbook import testbook

Next we apply the decorator, passing the path of the notebook where the function or functions are located.

@testbook('./notebook_replace_query_with_parameter.ipynb', execute=True)

If you have already used pytest you will understand the following steps. We now define a function whose name starts with test and give it a parameter, tb. What is tb? The decorator identifies our notebook file, and tb is the instance of that notebook that the test receives.

@testbook('./notebook_replace_query_with_parameter.ipynb', execute=True)
def test_treatment_missing_values_original(tb):

From this instance we can call the ref method, which gives us a reference to our function inside the notebook.

@testbook('./notebook_replace_query_with_parameter.ipynb', execute=True)
def test_treatment_missing_values_original(tb):
    function_notebook = tb.ref("treatment_missing_values_original")


This function initially takes no parameters, so what are we going to do? We will call it once in a way that passes and once in a way that raises an error, so that both forms are covered while we test the refactoring. Our ultimate goal is to end up with the inverse of this test.

In the first check the function must return None, so the assertion passes. The second check must raise an error, and here I want to show something: we can also use features of the pytest library. I import pytest and add a small block to catch the exception. Without going into detail or making the test bigger, I just want to catch the first exception that can occur, which in this case says that the function takes no arguments; in the end we want the inverse of this error. Here is all the code for this test.

from testbook import testbook
import pytest

@testbook('./notebook_replace_query_with_parameter.ipynb', execute=True)
def test_treatment_missing_values_original(tb):
    function_notebook = tb.ref("treatment_missing_values_original")
    group = ['BI_RADS', 'age', 'shape', 'margin', 'density', 'severity']  
    # Called without arguments, the function runs and returns None.
    assert function_notebook() is None
    # Called with an argument, it must fail, because it takes no parameters.
    with pytest.raises(Exception) as e_info:
        function_notebook(group)
    assert "takes 0 positional arguments but 1 was given" in str(e_info.value)

Image created by author
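The two error messages that the tests in this article look for come from Python itself, so they can be reproduced without a notebook. The sketch below is plain Python with hypothetical functions (`no_params`, `one_param`, `error_message`); it only shows where the two message fragments come from and does not go through testbook.

```python
def no_params():
    # Stands in for the function before refactoring: no parameters.
    return None


def one_param(cols):
    # Stands in for the function after refactoring: one required parameter.
    return None


def error_message(func, *args):
    # Call func and return the text of the TypeError it raises, if any.
    try:
        func(*args)
    except TypeError as exc:
        return str(exc)
    return None


too_many = error_message(no_params, ["age", "severity"])
too_few = error_message(one_param)

print(too_many)
print(too_few)
```

Before the refactoring, the call with an argument fails and the call without one succeeds. After it, the situation is inverted, which is the "inverse of this test" mentioned above.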


Procedure

Guys, I want to leave a note about the Jupyter notebook that I will make available in my repository. You will see that I reload the dataset for every refactoring function. I chose this because, when I present it, I want to show that each function sits in its own block and does not depend on data loaded by another. Since the function's purpose is to handle missing data, the dataset enters the function with missing values and leaves with them replaced by the function.

Input

Image created by author

Output

Image created by author
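As a rough illustration of what "enters with missing data and leaves with it replaced" means, here is a minimal sketch on a hypothetical toy DataFrame. The values are made up and are not the mammography dataset; the sketch fills 'age' with the class mean and 'shape' with the class mode, following the same rule as the original function.

```python
import pandas as pd

# Hypothetical toy data, not the dataset used in the article.
toy = pd.DataFrame({
    "age":      [40.0, None, 60.0, 70.0, None],
    "shape":    [1.0, 1.0, None, 4.0, 4.0],
    "severity": [0, 0, 0, 1, 1],
})

filled = toy.copy()
for label in (0, 1):
    in_class = filled["severity"] == label
    # Numeric 'age' gets the mean of its class.
    age_mean = filled.loc[in_class, "age"].mean()
    filled.loc[in_class & filled["age"].isnull(), "age"] = age_mean
    # Categorical 'shape' gets the most frequent value of its class.
    shape_mode = filled.loc[in_class, "shape"].mode()[0]
    filled.loc[in_class & filled["shape"].isnull(), "shape"] = shape_mode

print(toy.isnull().sum().to_dict())     # missing values per column, input
print(filled.isnull().sum().to_dict())  # missing values per column, output
```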


Step 1

First we apply Extract Variable to the query we want to turn into a parameter: the columns are assigned to a local variable, cols.

Code:

def treatment_missing_values_refc_01():
    cols = refactoreDF.columns
    for col in cols:
        if col != 'age':
            mode_benign = refactoreDF[(refactoreDF['severity'] == 0)][col].mode()[0]
            mode_malignant = refactoreDF[(refactoreDF['severity'] == 1)][col].mode()[0]
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 0), col] = mode_benign
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 1), col] = mode_malignant
        else:
            mean_benign = refactoreDF[(refactoreDF['severity'] == 0)][col].mean()
            mean_malignant = refactoreDF[(refactoreDF['severity'] == 1)][col].mean()
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 0), col] = mean_benign
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 1), col] = mean_malignant

Test:

@testbook('./notebook_replace_query_with_parameter.ipynb', execute=True)
def test_treatment_missing_values_refc_01(tb):
    function_notebook = tb.ref("treatment_missing_values_refc_01")
    group = ['BI_RADS', 'age', 'shape', 'margin', 'density', 'severity']
    assert function_notebook() is None
    with pytest.raises(Exception) as e_info:
        function_notebook(group)
    assert "takes 0 positional arguments but 1 was given" in str(e_info.value)

Step 2

Next we apply Extract Function. Following the book, we add a prefix to the function name; which prefix is up to you, and I used the same one as the book. Extracting the function gives us two functions, one of which will soon be deleted.

Code:

def XXXNEWtreatment_missing_values_refc_02(cols):
    for col in cols:
        if col != 'age':
            mode_benign = refactoreDF[(refactoreDF['severity'] == 0)][col].mode()[0]
            mode_malignant = refactoreDF[(refactoreDF['severity'] == 1)][col].mode()[0]
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 0), col] = mode_benign
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 1), col] = mode_malignant
        else:
            mean_benign = refactoreDF[(refactoreDF['severity'] == 0)][col].mean()
            mean_malignant = refactoreDF[(refactoreDF['severity'] == 1)][col].mean()
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 0), col] = mean_benign
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 1), col] = mean_malignant

def XXXNEWtreatment_missing_values_refc_02_refactor(): 
    cols = refactoreDF.columns
    return XXXNEWtreatment_missing_values_refc_02(cols)

XXXNEWtreatment_missing_values_refc_02_refactor()

Test:

@testbook('./notebook_replace_query_with_parameter.ipynb', execute=True)
def test_XXXNEWtreatment_missing_values_refc_02(tb):
    function_notebook = tb.ref("XXXNEWtreatment_missing_values_refc_02")
    group = ['BI_RADS', 'age', 'shape', 'margin', 'density', 'severity']
    assert function_notebook(group) is None
    with pytest.raises(Exception) as e_info:
        function_notebook()
    assert "missing 1 required positional argument" in str(e_info.value)

Step 3

Now we apply Inline Variable: the expression is passed directly in the call, so the cols variable is no longer needed.

Code:

def XXXNEWtreatment_missing_values_refc_03(cols):
    for col in cols:
        if col != 'age':
            mode_benign = refactoreDF[(refactoreDF['severity'] == 0)][col].mode()[0]
            mode_malignant = refactoreDF[(refactoreDF['severity'] == 1)][col].mode()[0]
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 0), col] = mode_benign
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 1), col] = mode_malignant
        else:
            mean_benign = refactoreDF[(refactoreDF['severity'] == 0)][col].mean()
            mean_malignant = refactoreDF[(refactoreDF['severity'] == 1)][col].mean()
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 0), col] = mean_benign
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 1), col] = mean_malignant

# ----------

def XXXNEWtreatment_missing_values_refc_03_refactor(): 
    return XXXNEWtreatment_missing_values_refc_03(refactoreDF.columns)

# ----------

XXXNEWtreatment_missing_values_refc_03_refactor()


Test:

@testbook('./notebook_replace_query_with_parameter.ipynb', execute=True)
def test_XXXNEWtreatment_missing_values_refc_03_refactor(tb):
    function_notebook = tb.ref("XXXNEWtreatment_missing_values_refc_03_refactor")
    group = ['BI_RADS', 'age', 'shape', 'margin', 'density', 'severity']
    assert function_notebook() is None
    with pytest.raises(Exception) as e_info:
        function_notebook(group)
    assert "takes 0 positional arguments but 1 was given" in str(e_info.value)

Step 4

Notice that I no longer use the wrapper function; this step is called Inline Function. We are back to a single function, like the original, but with refactored code.

Code:

def XXXNEWtreatment_missing_values_refc_04(cols):
    for col in cols:
        if col != 'age':
            mode_benign = refactoreDF[(refactoreDF['severity'] == 0)][col].mode()[0]
            mode_malignant = refactoreDF[(refactoreDF['severity'] == 1)][col].mode()[0]
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 0), col] = mode_benign
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 1), col] = mode_malignant
        else:
            mean_benign = refactoreDF[(refactoreDF['severity'] == 0)][col].mean()
            mean_malignant = refactoreDF[(refactoreDF['severity'] == 1)][col].mean()
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 0), col] = mean_benign
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 1), col] = mean_malignant

# -----

#def XXXNEWtreatment_missing_values_refc_03_refactor(): 
#    return XXXNEWtreatment_missing_values_refc_03(refactoreDF.columns)

# -----

XXXNEWtreatment_missing_values_refc_04(refactoreDF.columns)

Test:

@testbook('./notebook_replace_query_with_parameter.ipynb', execute=True)
def test_XXXNEWtreatment_missing_values_refc_04(tb):
    function_notebook = tb.ref("XXXNEWtreatment_missing_values_refc_04")
    group = ['BI_RADS', 'age', 'shape', 'margin', 'density', 'severity']
    assert function_notebook(group) is None
    with pytest.raises(Exception) as e_info:
        function_notebook()
    assert "missing 1 required positional argument" in str(e_info.value)

Step 5

Finally we need to rename the function, removing the temporary prefix we added along the way. Now we have the final refactored function.

Code:

def treatment_missing_values_refc_05(cols):
    for col in cols:
        if col != 'age':
            mode_benign = refactoreDF[(refactoreDF['severity'] == 0)][col].mode()[0]
            mode_malignant = refactoreDF[(refactoreDF['severity'] == 1)][col].mode()[0]
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 0), col] = mode_benign
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 1), col] = mode_malignant
        else:
            mean_benign = refactoreDF[(refactoreDF['severity'] == 0)][col].mean()
            mean_malignant = refactoreDF[(refactoreDF['severity'] == 1)][col].mean()
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 0), col] = mean_benign
            refactoreDF.loc[(refactoreDF[col].isnull())&(refactoreDF['severity'] == 1), col] = mean_malignant

treatment_missing_values_refc_05(refactoreDF.columns)

Test:

@testbook('./notebook_replace_query_with_parameter.ipynb', execute=True)
def test_treatment_missing_values_refc_05(tb):
    function_notebook = tb.ref("treatment_missing_values_refc_05")
    group = ['BI_RADS', 'age', 'shape', 'margin', 'density', 'severity']
    assert function_notebook(group) is None
    with pytest.raises(Exception) as e_info:
        function_notebook()
    assert "missing 1 required positional argument" in str(e_info.value)
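The five steps are easier to see side by side on a tiny function. The sketch below is a hypothetical example in plain Python (the `stock` dictionary and the `available_*` names are assumptions made for illustration). Like the article's code, only the query for the names becomes a parameter, and the data itself is still read from module-level state.

```python
# Hypothetical module-level state, standing in for the notebook's DataFrame.
stock = {"apples": 3, "pears": 0, "plums": 7}


# Starting point: the query stock.keys() is hidden inside the function.
def available_original():
    return [name for name in stock.keys() if stock[name] > 0]


# Step 1, Extract Variable: the query result gets a name.
def available_step_1():
    names = stock.keys()
    return [name for name in names if stock[name] > 0]


# Step 2, Extract Function: the body moves to a new function that receives
# the variable as a parameter; the old function becomes a thin wrapper.
def new_available(names):
    return [name for name in names if stock[name] > 0]


def available_step_2():
    names = stock.keys()
    return new_available(names)


# Step 3, Inline Variable: the query is passed directly in the call.
def available_step_3():
    return new_available(stock.keys())


# Step 4, Inline Function: callers use new_available(stock.keys()) directly,
# so the wrapper can be deleted.
# Step 5, Rename: the temporary name is replaced by the final one.
available = new_available

results = [
    available_original(),
    available_step_1(),
    available_step_2(),
    available_step_3(),
    available(stock.keys()),
]
same_behaviour = all(result == results[0] for result in results)
print(same_behaviour)
```

Every intermediate version returns the same list, which is what the test after each step is meant to confirm.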

All function tests

Running all tests for the notebook.

Image created by author

Bad smell

I read the book and didn't find a bad smell that refers to this refactoring. Of course, strictly speaking, the comments I left in the code are one, but remember that we are learning here.

Final Function

Finally, refactoring the function even further:

  • I added PEP 484 type hints.
  • I removed the DataFrame from inside the function and made it a parameter, so the function can be used with other DataFrames.
  • I let the user choose which columns the function handles.
  • I let the user choose which column is the binary one.
  • I added a parameter for the method to apply. This can be expanded with more options, but for this explanation I left just two.

import pandas as pd


def treatment_binary(
    dataframe: pd.DataFrame,
    cols: list,
    col_binary: str,
    method_func: str,
) -> pd.DataFrame:
    # Work on a copy so the caller's DataFrame is left untouched.
    df_output = dataframe.copy()

    for col in cols:
        for binary in [0, 1]:
            # Statistic of this column within the current class (0 or 1).
            if method_func == 'mean':
                rst = df_output[(df_output[col_binary] == binary)][col].mean()
            elif method_func == 'mode':
                rst = df_output[(df_output[col_binary] == binary)][col].mode()[0]
            else:
                raise Exception('Method incorrect, use mean or mode')
            # Fill only the missing values that belong to this class.
            df_output.loc[(df_output[col].isnull())&(df_output[col_binary] == binary), col] = rst
    return df_output
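One consequence of passing everything in as parameters is that the function can be called on any small frame, outside the notebook. The sketch below restates the function in condensed form under a hypothetical name, `fill_missing_by_class`, so that it runs on its own; the toy data is also made up. It is an illustration only, not the project's code.

```python
import pandas as pd


def fill_missing_by_class(dataframe, cols, col_binary, method_func):
    # Condensed, hypothetical restatement of treatment_binary.
    out = dataframe.copy()
    for col in cols:
        for label in (0, 1):
            in_class = out[col_binary] == label
            if method_func == "mean":
                value = out.loc[in_class, col].mean()
            elif method_func == "mode":
                value = out.loc[in_class, col].mode()[0]
            else:
                raise Exception("Method incorrect, use mean or mode")
            out.loc[in_class & out[col].isnull(), col] = value
    return out


# Hypothetical toy data with one missing age per class.
toy = pd.DataFrame({
    "age": [30.0, None, 50.0, None],
    "severity": [0, 0, 1, 1],
})

result = fill_missing_by_class(toy, ["age"], "severity", "mean")

print(result["age"].tolist())         # filled copy
print(int(toy["age"].isnull().sum()))  # the input still has its gaps
```

Because the function works on a copy, the caller's DataFrame keeps its missing values and only the returned one is filled.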

The tests turned out very different this time: because of how the tool works, we needed to inject the data into the notebook in order to call the function with its parameters.

Test:

@testbook('./notebook_replace_query_with_parameter.ipynb', execute=True)
def test_treatment_missing_values_final_function(tb):
    function_notebook = tb.ref("treatment_binary")
    tb.inject("""
        df = pd.read_csv('data/dataset_mammography.csv')
    """)
    df = tb.ref("df")

    tb.inject("""
        cols_mean = ['margin', 'density', 'BI_RADS', 'shape']
    """)
    cols_mean = tb.ref("cols_mean")

    tb.inject("""
        cols_mode = ['age']
    """)
    cols_mode = tb.ref("cols_mode")

    tb.inject("""
        binary = 'severity'
    """)
    binary = tb.ref("binary")

    tb.inject("""
        method_f = 'mean'
    """)
    method_f = tb.ref("method_f")

    tb.inject("""
        method_f2 = 'mode'
    """)
    method_f2 = tb.ref("method_f2")

    tb.inject("""
        method_f3 = 'Cake'
    """)
    method_f3 = tb.ref("method_f3")


    result = function_notebook(
        df,
        cols_mean,
        binary,
        method_f
        )
    assert result.shape[1] == 6
    assert result.shape[0] > 0

    result = function_notebook(
        df,
        cols_mode,
        binary,
        method_f2
        )
    assert result.shape[1] == 6
    assert result.shape[0] > 0

    with pytest.raises(Exception) as e_info:

        result = function_notebook(
            df,
            cols_mode,
            binary,
            method_f3
        )
    assert "Method incorrect, use mean or mode" in str(e_info.value)


About the author:

A little more about me...

I graduated with a Bachelor's degree in Information Systems, and in college I had contact with different technologies. Along the way I took the Artificial Intelligence course, where I had my first contact with machine learning and Python, and learning about this area became my passion. Today I work with machine learning and deep learning, developing communication software. Along the way I also created a blog where I write posts about the subjects I am studying and share them to help other users.

I'm currently learning TensorFlow and Computer Vision

Fun fact: I love coffee
