The Open Coder
How to Build GPT with NumPy in a Day

I had been trying to figure out how to build my own homebrew version of ChatGPT for ages, scrolling through countless GitHub repos and personal projects and watching YouTube tutorials, without any luck.

Fortunately, I found an interesting tutorial that actually delivers! Special kudos to Jay Mody, who wrote “Create GPT in 60 Lines of NumPy”. You can read his article for a more detailed walkthrough. Below is my attempt at a clear, concise step-by-step guide to this working method, which I have personally tested.

What is a Language Model?

A language model is a statistical model that uses machine learning algorithms to predict the likelihood of a sequence of words. The idea behind language models is to train the model on a large corpus of text data, so it can learn the probability of a word occurring in a particular context.
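To make that concrete, here is a toy illustration (my own addition, not part of the tutorial): a bigram model that estimates the probability of the next word from counts over a tiny made-up corpus.

from collections import Counter, defaultdict

# Hypothetical toy corpus
corpus = "the cat sat on the mat the cat ran".split()

# Count how often each word follows each context word
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

# P(next | prev) = count(prev, next) / count(prev, *)
def next_word_probs(prev):
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

print(next_word_probs("the"))  # {'cat': 0.667, 'mat': 0.333} (approx.)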

Language models are used in various applications, including speech recognition, machine translation, and text generation. In the case of text generation, the language model is used to generate coherent and semantically meaningful sentences based on a given prompt.
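Continuing the toy bigram example above, text generation just means repeatedly sampling the next word from those conditional probabilities, starting from a prompt:

import numpy as np

# Sample a short continuation from the bigram model above
rng = np.random.default_rng(0)
word = "the"  # the prompt
generated = [word]
for _ in range(5):
    probs = next_word_probs(word)
    if not probs:  # dead end: this word never appears with a successor
        break
    word = rng.choice(list(probs), p=list(probs.values()))
    generated.append(word)
print(" ".join(generated))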

The Role of NumPy in Building a Language Model

NumPy is a library in Python that is used for scientific computing and data analysis. It provides tools for working with arrays and matrices, making it an ideal choice for building a language model. In our case, we’ll be using NumPy to process and store the large amounts of text data required for training the model.
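As a small illustration (again my own, not from the original tutorial): once words are mapped to integer IDs, a step we'll cover below, the whole corpus fits in one compact NumPy array, which can then be sliced into fixed-length context windows for training:

import numpy as np

# Hypothetical token IDs for a corpus, stored as one compact array
token_ids = np.array([5, 12, 7, 3, 12, 9, 1, 4], dtype=np.int32)

# Slice into (context, next-token) training pairs with a window of 3
window = 3
contexts = np.stack([token_ids[i:i + window] for i in range(len(token_ids) - window)])
targets = token_ids[window:]
print(contexts.shape, targets.shape)  # (5, 3) (5,)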

Data Preprocessing

The first step in building a language model is to preprocess the text data. This involves cleaning the data, converting it into a numerical format, and dividing it into training and test sets.

To clean the data, we’ll remove punctuation, convert the text to lowercase, and remove any words that appear too frequently or too infrequently in the corpus. These words are typically considered noise and can harm the performance of the model.
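Here's a minimal sketch of what that cleaning step could look like (the thresholds min_count and max_freq are arbitrary placeholders, not values from the original tutorial):

import re
from collections import Counter

def clean(text, min_count=2, max_freq=0.5):
    # Lowercase and strip punctuation
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    # Drop words that are too rare or too common
    counts = Counter(words)
    total = len(words)
    return [w for w in words
            if counts[w] >= min_count and counts[w] / total <= max_freq]

print(clean("The cat sat. The cat ran!"))  # ['the', 'cat', 'the', 'cat']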

Once the data is cleaned, we’ll convert it into a numerical format by creating a vocabulary of all the unique words in the corpus. Each word in the vocabulary is assigned a unique integer, which will be used to represent the word in the model.
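Building that vocabulary takes only a couple of lines. This sketch assumes words is the cleaned token list from the previous step:

# Assign each unique word a unique integer ID
vocab = {word: idx for idx, word in enumerate(sorted(set(words)))}

# Encode the corpus as a sequence of integer IDs
encoded = [vocab[word] for word in words]
print(vocab)    # e.g. {'cat': 0, 'the': 1}
print(encoded)  # e.g. [1, 0, 1, 0]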

Next, we’ll divide the cleaned and numerical data into training and test sets. The training set will be used to train the model, while the test set will be used to evaluate its performance.

Here’s some sample code for data preprocessing (note that it assumes the text has already been converted to a numeric matrix, one row per example):

import numpy as np

# Load the dataset (a numeric matrix, one row per example)
data = np.loadtxt("data.txt")

# Split the dataset into input features and target variable
X = data[:, :-1]
y = data[:, -1]

# Normalize the input data to zero mean and unit variance
mean = np.mean(X, axis=0)
std = np.std(X, axis=0)
X = (X - mean) / std

# Convert the target variable to one-hot encoding
n_classes = int(np.max(y)) + 1
y = np.eye(n_classes)[y.astype(int)]

Model Training

Once the data is preprocessed, it's time to start training the model. We'll use a simple feedforward neural network. (The real GPT uses a transformer architecture, which Jay Mody's post walks through in detail, but a plain feedforward network is enough to demonstrate the training mechanics.)

The first step in training the model is to convert the input text into a numerical format by converting each word into its corresponding integer representation. This numerical data is then fed into the neural network, where it is processed through multiple layers to generate a prediction.

The prediction is compared to the actual output, and the difference between the two is used to update the model’s weights. This process is repeated multiple times over the training data until the model has converged and can no longer improve its performance.

Here’s some sample code for Model Training:

import numpy as np

# Network architecture: input features -> hidden layers -> output classes
hidden_units = [32, 16]
layer_sizes = [X.shape[1]] + hidden_units + [y.shape[1]]
n_layers = len(layer_sizes) - 1  # number of weight matrices

# Define the activation function and its derivative
# (expressed in terms of the activation's output value)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(a):
    return a * (1 - a)

# Initialize weights (small random values) and biases (zeros)
weights = []
biases = []
for i in range(n_layers):
    weights.append(np.random.randn(layer_sizes[i], layer_sizes[i + 1]) * 0.1)
    biases.append(np.zeros((1, layer_sizes[i + 1])))

# Define the learning rate
learning_rate = 0.1

# Train the model
for epoch in range(10000):
    # Forward propagation
    activations = [X]
    for i in range(n_layers):
        activations.append(sigmoid(activations[i].dot(weights[i]) + biases[i]))

    # Backward propagation
    error = y - activations[-1]
    deltas = [error * sigmoid_derivative(activations[-1])]
    for i in range(n_layers - 1, 0, -1):
        deltas.append(deltas[-1].dot(weights[i].T) * sigmoid_derivative(activations[i]))
    deltas.reverse()

    # Update weights and biases (the deltas already point downhill
    # because error = y - output, so we add the update)
    for i in range(n_layers):
        weights[i] += activations[i].T.dot(deltas[i]) * learning_rate
        biases[i] += np.sum(deltas[i], axis=0, keepdims=True) * learning_rate

# Print the final weights and biases
print(weights)
print(biases)

This code demonstrates the basic steps involved in training a model with NumPy. You can modify and optimize it to suit your specific requirements and improve performance.
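One piece the snippets above don't show is inference. Here is a minimal sketch of a predict helper (my own addition, not from the original tutorial) that runs a forward pass with the trained weights and biases; we'll use it in the next section:

# Run a forward pass with the trained parameters and return the
# index of the highest-probability output for each example
def predict(inputs):
    a = inputs
    for W, b in zip(weights, biases):
        a = sigmoid(a.dot(W) + b)
    return a.argmax(axis=1)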

Testing the Model

Once the model has been trained, the next step is to evaluate its performance. This can be done by comparing its output with the actual output of the test data. A commonly used evaluation metric for classification models like this one is accuracy: the percentage of test instances the model predicts correctly.

To evaluate the performance of our ChatGPT model, we can split the dataset into two parts, one for training and the other for testing. We will use the training data to train our model and then use the test data to evaluate its performance.

from sklearn.model_selection import train_test_split

# Recover integer class labels from the one-hot targets for scikit-learn
labels = y.argmax(axis=1)
train_data, test_data, train_labels, test_labels = train_test_split(
    X, labels, test_size=0.2, random_state=42)

Once the data has been split into training and test sets, we can evaluate the performance of the model by making predictions on the test data and comparing the predictions with the actual outputs.

from sklearn.metrics import accuracy_score

# Evaluate the network's predictions against the held-out labels,
# using the predict helper defined above
predictions = predict(test_data)
print("Accuracy:", accuracy_score(test_labels, predictions))

The accuracy score of our ChatGPT model will give us an idea of how well it has learned the relationships between the input data and the output labels. If the accuracy score is high, it means that the model has learned the relationships effectively and can make accurate predictions on unseen data.

Improving the Model

The accuracy score of the model can be further improved by tuning the hyperparameters or by using a different model architecture. For example, increasing the number of hidden layers or the number of neurons in each layer can help improve the performance of the model.
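As one hypothetical way to explore such options, scikit-learn's GridSearchCV can sweep a few hidden-layer configurations, here using MLPClassifier as a stand-in for our hand-rolled NumPy network:

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Try a few hidden-layer configurations and pick the best by cross-validation
param_grid = {"hidden_layer_sizes": [(32,), (64, 32), (32, 16)]}
search = GridSearchCV(MLPClassifier(max_iter=1000), param_grid, cv=3)
search.fit(train_data, train_labels)
print("Best configuration:", search.best_params_)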

Another way to improve the performance of the model is to pre-process the data before feeding it into the model. This can include normalizing the data, removing irrelevant information, or transforming the data into a more suitable format.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Normalize the features using statistics from the training split only
scaler = StandardScaler()
train_data = scaler.fit_transform(train_data)
test_data = scaler.transform(test_data)

# Fit a logistic regression baseline on the scaled data
model = LogisticRegression(max_iter=1000)
model.fit(train_data, train_labels)
predictions = model.predict(test_data)
print("Accuracy:", accuracy_score(test_labels, predictions))

In the above code, we have used StandardScaler to normalize the training and test data, which typically improves the accuracy of the model.

Conclusion

In this article, we have seen how to build a simple, ChatGPT-inspired model using the NumPy library. We started by loading the data and preprocessing it, then trained a small feedforward network (plus a logistic regression baseline) on it. We then evaluated the model by making predictions on the test data and comparing them with the actual outputs. Finally, we saw how to improve performance by tuning the hyperparameters and preprocessing the data. With this knowledge, you can now create your own model and use it for a range of NLP applications.

Don't forget to check out Jay Mody's awesome work, “Create GPT in 60 Lines of NumPy”, for the full steps and a deeper architectural explanation!
