## DEV Community # Extending GraphSAGE to Solve Sudoku.

In this article, I will illustrate how a Graphical Neural Network (GNN) can used to solve Sudoku puzzles with an impressive accuracy.

A Sudoku puzzle consists of a 9x9 grid that can be further separated into 9 adjacent subsquares of size 3x3. The puzzles are partially filled with numbers ranging from 1-9, and the task at hand is to fill the rest of the grid, such that each row, column, and subsquare has every number ranging from 1-9.

More generally, a Sudoku puzzle is a popularized game version of the graph-colouring problem in which each cell is connected to all the cells in its row, column, and 3x3 subsquare. The "colours" that the player can use are the numbers 1-9.

Convolutional Neural Networks (CNN) have been used before to solve these puzzles; however, this inherent graphical structure leads us to believe that these solutions could be improved with a Graphical Neural Network (GNN). The GNN we use extends the GraphSAGE algorithm to process the graph-like structure, but also incorporates the best features from the CNN linked above.

# Construction

## Preprocessing Data

We used Kaggle's dataset of 1 million Sudoku games, which we then split into a train and test set by 80% and 20% respectively. The dataset contains the unsolved and solved puzzle represented by two strings of 81 characters ranging from 0-9, where 0 depicts unfilled squares. We normalized the unsolved data by dividing each of the values that have domain [0,9] by 9 and subtracting by 0.5 so that values were between -0.5 and 0.5 to enhance performance, as neural networks tend to generalize better on zero-centered normalized data.

## Convolutional Neural Network

We re-implemented the CNN by Verma. The architecture is shown in the figure below.

## Graphical Neural Network

Each Sudoku puzzle was defined as a graph with 81 nodes representing each cell in the grid with only one feature - its normalized value from the unsolved puzzle. This was a node classification problem, where each node belongs to one of nine classes (numbers). Each node had 8 + 8 + 8 - 4 = 20 + 1 = 21 edges between the other nodes in the same column (8), row (8), and subsquare (8) with 4 nodes overlapping, and the additional edge represented a self-loop.

The same architecture as described above was implemented with a SAGEConv layer replacing each Conv2d layer. Each SAGEConv layer is described by:

$\bm{h_v^k} = \sigma (\bm{W_k} \sum_{u \in N(v)} \frac{\bm{h_u^{k-1}}}{|N(v)|} + \bm{B_k}\bm{h_v^{k-1}})$

where $k = 1, ..., k-1$ and $\bm{W_k}, \bm{B_k}$ are trainable parameters. The first term aggregates the messages passed on by neighbouring nodes simply by summing the messages and normalizing it by the number of neighbouring nodes. The second term acts as an activation for the self-loop edge, as it is a function of the previous node embedding. The equation then represents an approximate localized spectral convolution as in the CNN, and the aggregate mean function can be seen as a "skip connection" from previous layers in GraphSAGE (link to paper).

The architectures for both neural nets is shown in the figure below:

# Evaluation

When evaluating our model, we experimented with two different forms of evaluation:

## Cell Accuracy

For this approach, we first took the output of our model with shape [BS, 9, 81] and calculated the argmax over dimension 1. Since we used Cross Entropy Loss during training, this will correlate to the most likely label for this cell as predicted by our model. This gave us a prediction tensor of shape [BS, 81]. After this, we compared these predictions with the solved puzzles and looked at the the amount of nodes that were classified correctly.

## Puzzle Accuracy

For the second approach, we adapted the method humans tend to use -- rather than filling all squares at once, we fill the cells one-at-a-time based on how confident the model is with that cell's prediction. As implemented by Verma, we took the softmax over the 1st dimension of our output, as to normalize the likelihoods between nodes. After this, we could compare which cells the model was more confident with predicting. We then took the prediction with the highest probability of being correct and filled it in. We repeat this step until all squares are filled.

# Results and Discussion

## Hyperparameters

After running both of our architectures with various learning rates and number of hidden channels. We found that, for both models, the optimal results occurred with a learning rate near 0.001. Learning rates of order $10^{-2}$ lead to frequent oscillations and a large training loss, and rates on the order of $10^{-4}$ led to a computationally inefficient model. We found that, the number of hidden channels in our networks had a large impact on the computational time, but a small impact on the loss of our models. For this reason, we standardized the use 64 hidden channels in all of our models.

## Experiments

Using the two evaluation metrics noted above, we are able to experiment with two different aspects of the solution to these puzzles. First, we analyzed at which model performs better at predicting the entire solution at once. After this, we compared each model's ability to predict the puzzle by solving for one square at a time. All experiments were trained with 800,000 puzzles and were tested on 5000 puzzles. ### Experiment 1: Predicting Entire Solution

Referring to Table 1, we can see that the GNN implementation has much smaller loss values for both the training and test loss. We believe that this is a result of the architecture having more information about the structure of the problem during training. This loss difference results in a near 10% accuracy difference when we calculate the amount of cells that are correctly predicted after the first model output.

This is a significant result when it comes to the extendability of our network. If we consider an arbitrary large puzzle of size n x n, re-evaluating each example puzzle $n^2$ times, is not an ideal way of solving the problem. The performance of the GNN vs. CNN when it comes to the initial prediction is promising with regards to analyzing future architectures that are able solve this problem perfectly within one pass of the model.

### Experiment 2: Predicting One Square at a Time

In the second experiment, the CNN performed far better than the newly implemented GNN. We evaluated 5000 Sudoku puzzles by filling up the unfilled squares one-by-one by the method above and the model ended up evaluating 4999/5000 perfectly. This is a final accuracy of 99.99%. This is comparable to the accuracy found by Verma. It is important to note that we only classified a predicted puzzle as correct when all cells were filled correctly. This is contrary to our evaluation metric in Experiment 1.

Suspiciously, however, the prediction capabilities of our GNN were much worse when we performed this experiment. We found that while filling up the first half of the puzzle, the GNN performed comparatively to the CNN. However, as the puzzle became increasingly filled, the output values of our model became smaller and smaller. For this reason, it came to a point that, when we normalized with the softmax, the model essentially randomly guessed which value to predict. After predicting only a few incorrect nodes, the model was unable to perform competitively.

While this is a limitation of our model, we do not believe that it makes it unsuccessful. The future of solving CSPs will require models that can make strong predictions with limited computational power. For this reason, when the size of the problem space becomes large, the fill one-by-one approach will likely become obsolete. Only models that are able to compute solutions in one pass will thrive.

# Conclusion

In conclusion, we were able to successfully extend upon a promising CNN architecture by taking into account the graphical structure of Sudoku puzzles. The specific graphical embedding algorithm we used was GraphSAGE, and the architecture performed exceptionally well at predicting the puzzle solution in one pass. One limitation of our architecture is its inability to solve the puzzles perfectly in a one-by-one fashion. While this limitation hinders our architecture's performance in this particular problem space, we believe that the advantages of our architecture over the CNN better extend to larger problems: like an n x n Sudoku puzzle. Unfortunately, in order to test this hypothesis, a new dataset is required.