Emirhan Akdeniz for SerpApi

Originally published at serpapi.com

Advantages of SERP Data in Ensemble Learning

Automatic Image Classification Generator allows you to perform automated data collection, automation of training, and testing for machine learning models. Empowered by SerpApi’s Google Images Scraper API, the tool bypasses the need for manual data entry, and reduces human error by providing preprocessing functionalities in its workflow.

Last week we discussed how to automatically capture training and testing data for machine learning models from the previously created dataset, and how to write an automation script in Python for gathering information on the best possible machine learning model.

This week we will discuss how to combine multiple binary classification algorithms of the same kind, CNN in our case, to create a rudimentary ensemble learning model, and the advantages of SERP data in ensemble learning.

For more information on the state of the tool, how it was created, and what purposes we have used it for, you may scroll to the bottom of the page.

What is Ensemble Learning in Python?

Ensemble learning refers to machine learning models that combine at least two machine learning algorithms in their predictive model. The key tradeoff for ensemble models is added complexity in exchange for predictive performance beyond what the individual constituent models can deliver. Python has various tools, frameworks, and libraries dedicated to data science and machine learning for creating advanced ensemble learning models.

How do you create an Ensemble in Python?

There are whole libraries dedicated to ensemble learning, such as xgboost. Native methods of machine learning frameworks also exist, like scikit-learn's DecisionTreeClassifier, BaggingClassifier, or GradientBoostingRegressor. I really want to get into them in the future. Being able to create ensemble learning models by just calling one-liners like:
from sklearn.ensemble import VotingClassifier or
from sklearn.ensemble import RandomForestClassifier or
from sklearn.model_selection import train_test_split or
from sklearn.ensemble import AdaBoostClassifier
or simply reusing a previously created logistic regression model with:
from sklearn.linear_model import LogisticRegression
seems like a really easy way to create subsets from sparse training data and employ ensemble learning techniques that combine different models on classification problems.
However, I would like to use simple software engineering techniques to tackle the problem. We want to make this into a client tool with fewer dependencies, so it is wise to do custom work as long as it still serves the purpose.
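For reference, here is a minimal sketch of what that scikit-learn route could look like; it uses the toy iris dataset from sklearn.datasets purely to make the snippet self-contained and is not the approach taken in this post:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data, only to make the example runnable
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hard-voting ensemble: each estimator predicts a class label, the majority wins
voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="hard",
)
voting.fit(X_train, y_train)
print(voting.score(X_test, y_test))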

What is Ensemble Learning, explained with an example?

Before we start, I would like to remind the reader that I am not well versed in the terminology of ensemble learning. There are different kinds of ensemble methods and ensemble models, like bagging, bootstrap aggregation, gradient boosting decision trees, etc. I don't really know what to call my proposed solution of training the classifiers separately and taking the most frequent answer; I'm not a data scientist, but my best guess is that it is hard voting.

In this tutorial, we will train multiple individual CNN models for binary classification, have each of them make a prediction on the class label one by one, and collect these predictions to take the majority vote as the answer. In the coming weeks, if our binary classifiers are overfitting, we might create some sort of soft voting system that takes into account the number of images each binary model was trained on, its validation accuracy, etc., treating it as a regression problem in order to create a probabilistic model. But we need metrics from this model with sturdy classifiers in order to transform it into a regression model.

To put it in simple words, if you have 3 American dog species as class labels, namely American Hairless Terrier, Alaskan Malamute, and American Eskimo Dog, you will need 3 binary classifiers:

american_hairless_terrier_vs_alaskan_malamute
american_hairless_terrier_vs_american_eskimo_dog
alaskan_malamute_vs_american_eskimo_dog

Then we will run the image through these 3 individual models, and take the most frequent answer as the prediction.
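As a minimal sketch of that majority vote (the predictions below are made up for illustration), the counting step could look like this:

from collections import Counter

# Hypothetical predictions collected from the three binary classifiers
predictions = ["Alaskan Malamute", "Alaskan Malamute", "American Eskimo Dog"]

# Hard voting: the most frequent class label wins
final_prediction = Counter(predictions).most_common(1)[0][0]
print(final_prediction)  # Alaskan Malamute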

What are the advantages of Ensemble Learning?

Ensemble learning reduces the occasional bad predictions of a strong learner by contrasting them against the votes of weaker learners. Also, strong learner models can be combined to achieve greater performance. In other words, ensemble learning can be used to achieve what a single machine learning model cannot.

Imagine we have 3 class labels for a classifier, and the test set of that single model gives 51% accuracy. This can be identified as a weak learner. Let's say that if you instead do a binary classification between each pair of labels, you get around 65% accuracy for each individual model. The assumption here is that if we combine the predictions of these models, their weighted average should be higher than the 51% that the single base learner can achieve. You may think of each of them as a decision tree with max_features = 2 and max_depth = 1.
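As a rough back-of-the-envelope check (a sketch that assumes the three binary models make independent errors, which is a strong assumption), a majority vote of three classifiers that are each correct about 65% of the time is right whenever at least two of them agree on the correct answer:

p = 0.65
# Probability that at least 2 out of 3 independent classifiers are correct:
# all three right, or exactly two right
majority_correct = p**3 + 3 * p**2 * (1 - p)
print(round(majority_correct, 3))  # 0.718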

Another advantage is that, in some cases, it is easier to control data points when you have more than one model. For example, if the combined model has one CNN that is overfitting, that CNN can easily be retrained with a new training dataset or new algorithms and then swapped back in. You could even change the method of classification for that part and replace it with Linear Regression, KNN, SVM, SVC, etc. After all, the final goal is to cross-validate the predictions.

What are the advantages of SERP Data in Ensemble Learning?

SERP Data can be used to create specialized training and testing datasets for machine learning purposes. Noise can be reduced by filtering targeted results by size, specification, source, etc. For ensemble learning, individual machine learning models could be optimized with SERP data to serve the combined model better.

In our case, we are able to create a dataset with training items (x_train) and testing items (x_test) for our individual models that consists only of Alaskan Malamute images, American Eskimo Dog images, etc., and only of square images, to make controlling the kernel size easier.

For example, you may acquire a dataset from somewhere like Kaggle, import pandas, and pick the images you desire by using the read_csv method on the label file to specialize your dataset. But with SERP data, you can gather these images without going through long CSV files of labels or utilizing any additional library. You simply specify your search once, and then gather all the data you want.

With SerpApi's Image Scraper APIs, such as SerpApi's Google Images Scraper API, SerpApi's Yandex Images API, or Naver Images API, you can create specialized image datasets with a simple query you can shape in the Playground.
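As an illustration only (this sketch assumes the google-search-results Python package and a valid SerpApi API key; check the Playground and the API documentation for the exact parameters and response fields), fetching image URLs for one label could look roughly like this:

from serpapi import GoogleSearch

search = GoogleSearch({
    "engine": "google_images",
    "q": "Alaskan Malamute imagesize:500x500",
    "api_key": "YOUR_API_KEY"  # placeholder, not a real key
})
results = search.get_dict()

# Collect the original image URLs to download into a training dataset
image_urls = [item["original"] for item in results.get("images_results", [])]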

You may also use other forms of data for machine learning purposes. Visit the Use SERP data to build machine learning models page to get a better look.

Register to SerpApi now to Claim Free Credits.

An Example of Ensemble Learning

We use the following function, based on the standard convolution output-size formula, to calculate the input size of the first fully connected layer in a CNN:

import math

def calculate_fully_connected(layers, size):
    # Apply the convolution/pooling output-size formula layer by layer:
    # out = floor((in + 2*padding - dilation*(kernel_size - 1) - 1) / stride + 1)
    # Layers without these keys (ReLU, Flatten, Linear) leave the size unchanged.
    for layer in layers:
        k = 1
        p = 0
        s = 1
        d = 1
        if "kernel_size" in layer:
            k = layer["kernel_size"]
        if "padding" in layer:
            p = layer["padding"]
        if "stride" in layer:
            s = layer["stride"]
        if "dilation" in layer:
            d = layer["dilation"]
        size = math.floor((size + 2*p - d*(k-1) - 1)/s + 1)

    return size

A reminder again that this calculation assumes the input is a square image.

Let's also define a simple CNN with 2 convolutions:

model = [
    {
        "name": "Conv2d",
        "in_channels": 3,
        "out_channels": 6,
        "kernel_size": 5
    },
    {
        "name": "ReLU",
        "inplace": True
    },
    {
        "name": "MaxPool2d",
        "kernel_size": 2,
        "stride": 2
    },
    {
        "name": "Conv2d",
        "in_channels": 6,
        "out_channels": 16,
        "kernel_size": 5
    },
    {
        "name": "ReLU",
        "inplace": True
    },
    {
        "name": "MaxPool2d",
        "kernel_size": 2,
        "stride": 2
    },
    {
        "name": "Flatten",
        "start_dim": 1
    },
    {
        "name": "Linear",
        "in_features": "change_with_calculated_fn_size",
        "out_features": 120
    },
    {
        "name": "ReLU",
        "inplace": True
    },
    {
        "name": "Linear",
        "in_features": 120,
        "out_features": 84
    },
    {
        "name": "ReLU",
        "inplace": True
    },
    {
        "name": "Linear",
        "in_features": 84,
        "out_features": "n_labels"
    }
]

This is by no means a fully applicable model, but it will help us construct the necessary parts. The accuracy rate of this model will be around 35% to 65%, which is not really good enough for us.

Let's pick necessary parts:

optimizer = "SGD"
lr = 0.1
momentum = 0.9
loss_function = "CrossEntropyLoss"
output_size = 16
image_size = 32

output_size here represents the number of output channels of the last convolutional layer, and image_size is both the width and the height of the image.
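As a quick sanity check (a sketch assuming the model list above and a 32x32 square input), tracing the function by hand gives conv 5x5 -> 28, pool 2x2 -> 14, conv 5x5 -> 10, pool 2x2 -> 5, so the first Linear layer should expect 5 * 5 * 16 = 400 input features:

calculated = calculate_fully_connected(model, image_size)
print(calculated)                             # 5
print(calculated * calculated * output_size)  # 400, in_features for the first Linear layer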

Let's define our labels:

labels = [
    "American Hairless Terrier imagesize:500x500",
    "Alaskan Malamute imagesize:500x500",
    "American Eskimo Dog imagesize:500x500"
]

Let's also create each possible unique pairing using the index of labels:

import itertools

label_combinations = []
index_list = list(range(0, len(labels)))
# Every unique pair of label indices, e.g. (0, 1), (0, 2), (1, 2)
index_list = list(itertools.combinations(index_list, 2))
index_list = list(set(index_list))
for indexes in index_list:
    label_combinations = label_combinations + [[labels[indexes[0]], labels[indexes[1]]]]

For those of you wondering how many items this index_list will have, the formula is:
n_estimators = label_size! / ((label_size - 2)! * 2!)
where n_estimators represents the number of unique pairings (the size of label_combinations).
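A quick way to verify that count (a sketch using Python's standard library, available in Python 3.8+):

from math import comb

label_size = 3
n_estimators = comb(label_size, 2)  # label_size! / ((label_size - 2)! * 2!)
print(n_estimators)  # 3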

Let's start iterating over each pair of labels we want to build a model for, and name the model after that pair:

training_dicts = []
for binary_labels in label_combinations:
    model_name = binary_labels[0].split(" imagesize")[0].replace(" ","_").lower()
    model_name = model_name + "_vs_" + binary_labels[1].split(" imagesize")[0].replace(" ","_").lower()

For each cycle, the model name will look something like:
alaskan_malamute_vs_american_eskimo_dog

Let's calculate the first fully connected layer input size:

    calculated_fc_size = calculate_fully_connected(model,image_size)
    for layer in model:
        if (layer["name"] == "Linear") and (layer["in_features"] == "change_with_calculated_fn_size"):
            model[model.index(layer)]['in_features'] = calculated_fc_size * calculated_fc_size * output_size ## Assuming image shape and kernel are squares
            break

Let's create training dictionaries at each cycle and add them to a list:

training_dict = {
  "model_name": model_name,
  "criterion": {
    "name": loss_function
  },
  "optimizer": {
    "name": optimizer,
    "lr": lr,
    "momentum": momentum
  },
  "batch_size": 4,
  "n_epoch": 10,
  "n_labels": 0,
  "image_ops": [
    {
      "resize": {
        "size": [
          image_size,
          image_size
        ],
        "resample": "Image.ANTIALIAS"
      }
    },
    {
      "convert": {
        "mode": "'RGB'"
      }
    }
  ],
  "transform": {
    "ToTensor": True,
    "Normalize": {
      "mean": [
        0.5,
        0.5,
        0.5
      ],
      "std": [
        0.5,
        0.5,
        0.5
      ]
    }
  },
  "target_transform": {
    "ToTensor": True
  },
  "label_names": binary_labels,
  "model": {
    "name": "",
    "layers": model
  }
}
training_dicts = training_dicts + [training_dict]

Now, for future purposes, let's check for models that are already trained on a specific binary classification:

import os
trained_models = os.listdir("models")

Let's train binary classification models one by one:

import json
import requests
import time

for training_dict in training_dicts:
    if (training_dict['model_name']+".pt") not in trained_models:
        print("---")
        print("Training Model: {}".format(training_dict['model_name']))
        body = json.dumps(training_dict)
        response = requests.post("http://localhost:8000/train", headers = {"Content-Type": "application/json"}, data=body, allow_redirects = True)
        if response.status_code == 200:
            while True:
                response = requests.post("http://localhost:8000/find_attempt/?name={}".format(training_dict["model_name"]), headers = {"Content-Type": "application/json"}, allow_redirects = True)
                if response.json() is not None and response.json()['status'] == "Trained":
                    break
                time.sleep(1)
        testing_dict = training_dict
        testing_dict['limit'] = 100
        body = json.dumps(testing_dict)
        response = requests.post("http://localhost:8000/test", headers = {"Content-Type": "application/json"}, data=body, allow_redirects = True)
        if response.status_code == 200:
            while True:
                response = requests.post("http://localhost:8000/find_attempt/?name={}".format(training_dict["model_name"]), headers = {"Content-Type": "application/json"}, allow_redirects = True)
                if response.json() is not None and response.json()['status'] == "Complete":
                    break
                time.sleep(1)
        print("Accuracy: {}".format(response.json()['accuracy']))
        print("---")
    else:
        print("---")
        print("Skipping Already Existing Model: {}".format(training_dict['model_name']))
        print("---")

Notice that if a model exists, we skip the training process.

At this point, we have an image called malamute_example.jpg inside the examples folder. Let's shape it into a form that the model can recognize:

import numpy as np
import torch
from PIL import Image
from torchvision import transforms

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
img = Image.open("examples/malamute_example.jpg")
transform = transforms.Compose([transforms.Resize((32,32), antialias=True), transforms.ToTensor(), transforms.Normalize(mean=(0.5,0.5,0.5), std=(0.5,0.5,0.5))])
img = transform(img)
img = [img.numpy()]
img = np.asarray(img, dtype='float64')
img = torch.from_numpy(img).float()
img = img.to(device)

Now that we have the image as a tensor, let's call each binary classification model one by one, and collect the predictions in a list:

predictions = []
for example_dict in training_dicts:
    model_path = "models/" + example_dict['model_name'] + ".pt"
    example_dict['n_labels'] = 2
    label_names = example_dict['label_names']
    example_dict = TrainCommands(**example_dict)
    model = CustomModel(example_dict)
    model.load_state_dict(torch.load(model_path))
    if torch.cuda.is_available():
        model.cuda()
        pred = model(img).to(device)[0]
    else:
        pred = model(img)[0]
    prediction = label_names[pred.argmax()]
    predictions = predictions + [prediction]

Notice that since we are using one-hot vectors for label tensors, we can take the index of the maximum in the prediction tensor, and then use it to get the answer from the label list.
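For instance (a toy sketch with made-up numbers and the imagesize: suffix omitted from the labels), a prediction tensor of [0.2, 1.7] would map to index 1 and therefore to the second label:

import torch

label_names = ["Alaskan Malamute", "American Eskimo Dog"]
pred = torch.tensor([0.2, 1.7])  # made-up output from one binary model
print(label_names[pred.argmax()])  # American Eskimo Dog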

Finally, let's find the most frequent prediction and serve it as the final answer, using a little string manipulation to strip the imagesize: filter from the label:

if " imagesize:" in final_prediction:
    final_prediction = final_prediction.split(" imagesize")[0]

print("Prediction is {}".format(final_prediction))

Here's the output when you run it with models that have already been trained:

---
Skipping Already Existing Model: american_hairless_terrier_vs_alaskan_malamute
---
---
Skipping Already Existing Model: american_hairless_terrier_vs_american_eskimo_dog
---
---
Skipping Already Existing Model: alaskan_malamute_vs_american_eskimo_dog
---
Prediction is Alaskan Malamute

The prediction worked this time, but that alone is not a good indicator that it will work in the long run. The CNN model has to be adjusted further with last week's script, and then replaced here.

Full Code

Conclusion

I am grateful to the reader for their attention and to the Brilliant People of SerpApi for their support. In the coming weeks, we will discuss how to optimize individual binary classification models, how to store all these models in one file, and how to minimize the training process to make it into a command line tool.
