Introduction
This is the second article in my "toward understanding DNN (deep neural network) well" series. In this article I again explore the effect of the number of layers and units, using the iris dataset.
github repository: comparison_of_dnn
Note that this is not a "guide"; it is a memo from a beginner to beginners. If you have any comments, suggestions, or questions whilst reading this article, please let me know in the comments below.
Iris dataset
The iris dataset is very famous, and most people will not need an explanation of it. I will still take a brief look because I am a beginner.
We can load this dataset with the sklearn.datasets.load_iris() function. It is a multi-class classification dataset. It contains 150 samples, and each sample has the following four features.
- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)
There are three classes, and each class has the same number of samples (50). As most of us know, there are no missing values, but since this is like a tutorial article, I check for them anyway.
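Input:
from sklearn import datasets
# Load the iris data as a pandas DataFrame (four feature columns plus the target column).
iris_df = datasets.load_iris(as_frame=True)["frame"]
# Show column types and non-null counts to check for missing values.
iris_df.info()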
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal length (cm) 150 non-null float64
1 sepal width (cm) 150 non-null float64
2 petal length (cm) 150 non-null float64
3 petal width (cm) 150 non-null float64
4 target 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
Cool. There are no missing values. Next, I check the basic statistics.
Input
iris_df.describe().drop(["count"])
Output
      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)    target
mean           5.843333          3.057333           3.758000          1.199333  1.000000
std            0.828066          0.435866           1.765298          0.762238  0.819232
min            4.300000          2.000000           1.000000          0.100000  0.000000
25%            5.100000          2.800000           1.600000          0.300000  0.000000
50%            5.800000          3.000000           4.350000          1.300000  1.000000
75%            6.400000          3.300000           5.100000          1.800000  2.000000
max            7.900000          4.400000           6.900000          2.500000  2.000000
Of course, I am interested in analysing the data, but I do not yet have the skills to do so. I will analyse the data someday.
Comparison
For the sake of simplicity, I assume the following conditions.
- Everything about the model is fixed except the number of layers and the number of units in each layer.
- No data preprocessing is performed.
- The seed is fixed.
Most of those conditions can be changed or removed. All you have to do is change config_iris.yaml. The yaml file has the following lines.
mlflow:
  experiment_name: iris
  run_name: default
dataset:
  eval_size: 0.25
  test_size: 0.25
  train_size: 0.75
  shuffle: True
dnn:
  n_layers: 3
  n_units_list:
    - 8
    - 4
    - 3
  activation_function_list:
    - relu
    - relu
    - softmax
  seed: 57
dnn_train:
  epochs: 30
  batch_size: 4
  patience: 5
The following changes build a model that has five layers (four hidden dense layers plus one output layer), where each hidden layer has 8 units and uses the relu function as its activation function.
dnn:
  n_layers: 5
  n_units_list:
    - 8
    - 8
    - 8
    - 8
    - 3
  activation_function_list:
    - relu
    - relu
    - relu
    - relu
    - softmax
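When editing these fields by hand, it is easy to leave the three of them inconsistent with each other. The following is a minimal sketch of a sanity check, not code from the repository: the helper name check_dnn_config is hypothetical, and the dictionary simply mirrors the dnn block above as it might look after the yaml file is loaded.

```python
def check_dnn_config(dnn_config, n_classes=3):
    """Return a list of problems found in a dnn config (hypothetical helper)."""
    problems = []
    n_layers = dnn_config["n_layers"]
    n_units_list = dnn_config["n_units_list"]
    activations = dnn_config["activation_function_list"]

    # Each layer needs exactly one unit count and one activation function.
    if len(n_units_list) != n_layers:
        problems.append(
            f"n_units_list has {len(n_units_list)} entries, expected {n_layers}"
        )
    if len(activations) != n_layers:
        problems.append(
            f"activation_function_list has {len(activations)} entries, expected {n_layers}"
        )
    # The output layer should have one unit per class (three for iris).
    if n_units_list and n_units_list[-1] != n_classes:
        problems.append(f"the last layer has {n_units_list[-1]} units, expected {n_classes}")
    return problems


# The five-layer settings from the article, written as a plain dictionary.
five_layer_config = {
    "n_layers": 5,
    "n_units_list": [8, 8, 8, 8, 3],
    "activation_function_list": ["relu", "relu", "relu", "relu", "softmax"],
}

print(check_dnn_config(five_layer_config))
```

An empty list means the three fields agree with each other. Whether the repository performs a similar check itself is not something this sketch assumes.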
Note that some of the model's settings are hard-coded, so you have to edit the code to change them. For example, the model's loss function is cross entropy, which is calculated by keras.losses.SparseCategoricalCrossentropy() and is specified in iris_dnn.py:
https://github.com/ksk0629/comparison_of_dnn/blob/8498a7d15ed6a4447f13f9f277e214f4821f46a1/src/iris_dnn.py#L28-L30
result
First, I summarise all results. The losses and accuracy are as follows.
#layers | #parameters | training loss | evaluation loss | test loss | test accuracy |
---|---|---|---|---|---|
2 | 35 | 0.166 | 0.136 | 0.157 | 0.947 |
2 | 67 | 0.086 | 0.022 | 0.039 | 0.974 |
2 | 131 | 0.086 | 0.033 | 0.043 | 1.0 |
2 | 259 | 0.09 | 0.024 | 0.047 | 0.974 |
3 | 263 | 0.104 | 0.018 | 0.069 | 0.974 |
4 | 260 | 0.123 | 0.05 | 0.115 | 0.947 |
5 | 261 | 0.089 | 0.089 | 0.075 | 0.974 |
6 | 255 | 0.138 | 0.043 | 0.119 | 0.947 |
7 | 263 | 0.091 | 0.023 | 0.047 | 0.974 |
8 | 261 | 1.099 | 1.099 | 1.099 | 0.316 |
9 | 259 | 1.099 | 1.099 | 1.099 | 0.316 |
The amount of test data is 38 and the dataset has 12 data belonging to class 0, 13 data belonging to class 1, and 13 data belonging to class 2.
I performed 11 experiments to explore the following two things.
- effect of the number of parameters
- effect of the number of layers
The first to fourth experiments address the first point, and the fourth to eleventh experiments address the second.
The results show the following.
- The model that has two layers and 67 parameters is the best one in the sense of the test loss value.
- The model that has two layers and 131 parameters is the best one in the sense of the test accuracy.
- The models that have 8 layers and 9 layers are the worst ones.
This is a bit surprising to me because I expected the best model to have more layers and parameters than these. It is possibly due to the distribution of the test data, which might be too small to evaluate the performance reliably. But at least under the above conditions, the two-layer models are the best ones. It possibly means that the other models overfitted.
As mentioned later, the vanishing gradient problem occurred in the eight- and nine-layer experiments. That is, eight layers are too many to train well, at least with the iris data under the above conditions.
Except for the models that suffered from the vanishing gradient problem and the best one in the sense of test accuracy, all of the models correctly classified 36 or 37 test samples. Interestingly, one of the misclassified samples is the same across these models. It possibly implies that the distribution of the test data is not great, which means there is a difference between the training data and the test data.
Furthermore, most of the models correctly classified most of the data, which means a DNN is very effective on the iris data even when the model structure is very simple.
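The accuracies in the table can be turned back into counts of correctly classified test samples, since the test set has 38 samples. The snippet below is just that arithmetic; the function name correct_count is made up for this illustration.

```python
N_TEST = 38  # number of test samples stated in the article


def correct_count(accuracy, n_samples=N_TEST):
    """Convert a rounded accuracy back into a number of correct predictions."""
    return round(accuracy * n_samples)


# The four distinct test accuracies that appear in the results table.
for accuracy in (0.947, 0.974, 1.0, 0.316):
    print(accuracy, "->", correct_count(accuracy), "of", N_TEST)
```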
two layers with 35 parameters
The structure is as follows.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 4) 20
dense_1 (Dense) (None, 3) 15
=================================================================
Total params: 35
Trainable params: 35
Non-trainable params: 0
_________________________________________________________________
The final indices are as follows.
- training loss: 0.166
- evaluation loss: 0.136
- test loss: 0.157
- test accuracy: 0.947
The model classified 36 of the 38 test samples correctly. It looks great and it actually works great. At least for the iris data, a DNN is a very powerful tool even when the model has a very simple structure.
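The parameter counts in the model summary above can be reproduced by hand: a dense layer with a bias term has (number of inputs × number of units + number of units) parameters, and the iris data has four input features. Here is a small sketch of that calculation; dense_param_counts is a name I made up for this example and is not part of Keras or the repository.

```python
def dense_param_counts(n_inputs, n_units_list):
    """Parameters per dense layer: weights (inputs * units) plus biases (units)."""
    counts = []
    fan_in = n_inputs
    for n_units in n_units_list:
        counts.append(fan_in * n_units + n_units)
        fan_in = n_units  # this layer's output feeds the next layer
    return counts


N_FEATURES = 4  # sepal length, sepal width, petal length, petal width

two_layer = dense_param_counts(N_FEATURES, [4, 3])
print(two_layer, sum(two_layer))

three_layer = dense_param_counts(N_FEATURES, [16, 9, 3])
print(three_layer, sum(three_layer))
```

The same function gives the totals for the other structures in this article, which makes it easy to choose unit counts that keep the deeper models close to 260 parameters.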
two layers with 67 parameters
The structure is as follows.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 8) 40
dense_1 (Dense) (None, 3) 27
=================================================================
Total params: 67
Trainable params: 67
Non-trainable params: 0
_________________________________________________________________
The final indices are as follows.
- training loss: 0.086
- evaluation loss: 0.022
- test loss: 0.039
- test accuracy: 0.974
This model correctly classified 37 test samples.
two layers with 131 parameters
The structure is as follows.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 16) 80
dense_1 (Dense) (None, 3) 51
=================================================================
Total params: 131
Trainable params: 131
Non-trainable params: 0
_________________________________________________________________
The final indices are as follows.
- training loss: 0.086
- evaluation loss: 0.033
- test loss: 0.043
- test accuracy: 1.0
This model correctly classified all test data.
two layers with 259 parameters
The structure is as follows.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 32) 160
dense_1 (Dense) (None, 3) 99
=================================================================
Total params: 259
Trainable params: 259
Non-trainable params: 0
_________________________________________________________________
The final indices are as follows.
- training loss: 0.09
- evaluation loss: 0.024
- test loss: 0.047
- test accuracy: 0.974
This model correctly classified 37 test samples.
three layers with 263 parameters
The structure is as follows.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 16) 80
dense_1 (Dense) (None, 9) 153
dense_2 (Dense) (None, 3) 30
=================================================================
Total params: 263
Trainable params: 263
Non-trainable params: 0
_________________________________________________________________
The final indices are as follows.
- training loss: 0.104
- evaluation loss: 0.018
- test loss: 0.069
- test accuracy: 0.974
This model also correctly classified 37 test samples.
four layers with 260 parameters
The structure is as follows.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 14) 70
dense_1 (Dense) (None, 9) 135
dense_2 (Dense) (None, 4) 40
dense_3 (Dense) (None, 3) 15
=================================================================
Total params: 260
Trainable params: 260
Non-trainable params: 0
_________________________________________________________________
The final indices are as follows.
- training loss: 0.123
- evaluation loss: 0.05
- test loss: 0.115
- test accuracy: 0.947
This model correctly classified 36 test samples.
five layers with 261 parameters
The structure is as follows.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 12) 60
dense_1 (Dense) (None, 8) 104
dense_2 (Dense) (None, 6) 54
dense_3 (Dense) (None, 4) 28
dense_4 (Dense) (None, 3) 15
=================================================================
Total params: 261
Trainable params: 261
Non-trainable params: 0
_________________________________________________________________
The final indices are as follows.
- training loss: 0.089
- evaluation loss: 0.089
- test loss: 0.075
- test accuracy: 0.974
This model correctly classified 37 test samples.
six layers with 255 parameters
The structure is as follows.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 10) 50
dense_1 (Dense) (None, 8) 88
dense_2 (Dense) (None, 6) 54
dense_3 (Dense) (None, 4) 28
dense_4 (Dense) (None, 4) 20
dense_5 (Dense) (None, 3) 15
=================================================================
Total params: 255
Trainable params: 255
Non-trainable params: 0
_________________________________________________________________
The final indices are as follows.
- training loss: 0.138
- evaluation loss: 0.043
- test loss: 0.119
- test accuracy: 0.947
This model correctly classified 36 test samples.
seven layers with 263 parameters
The structure is as follows.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 10) 50
dense_1 (Dense) (None, 6) 66
dense_2 (Dense) (None, 6) 42
dense_3 (Dense) (None, 6) 42
dense_4 (Dense) (None, 4) 28
dense_5 (Dense) (None, 4) 20
dense_6 (Dense) (None, 3) 15
=================================================================
Total params: 263
Trainable params: 263
Non-trainable params: 0
_________________________________________________________________
The final indices are as follows.
- training loss: 0.091
- evaluation loss: 0.023
- test loss: 0.047
- test accuracy: 0.974
This model correctly classified 37 test samples.
eight layers with 261 parameters
The structure is as follows.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 8) 40
dense_1 (Dense) (None, 6) 54
dense_2 (Dense) (None, 6) 42
dense_3 (Dense) (None, 6) 42
dense_4 (Dense) (None, 4) 28
dense_5 (Dense) (None, 4) 20
dense_6 (Dense) (None, 4) 20
dense_7 (Dense) (None, 3) 15
=================================================================
Total params: 261
Trainable params: 261
Non-trainable params: 0
________________________________________________________________
The final indices are as follows.
- training loss: 1.099
- evaluation loss: 1.099
- test loss: 1.099
- test accuracy: 0.316
The vanishing gradient problem occurred whilst training. In fact, the training loss converged almost immediately.
It implies that 8 layers are too many to train, at least with the iris data.
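The two numbers 1.099 and 0.316 are worth a second look. They are exactly what cross entropy and accuracy give when a three-class model outputs the same probabilities (one third each) for every input and always ends up predicting class 0, which has 12 of the 38 test samples. The snippet below only checks this arithmetic; it is not taken from the repository, and it does not by itself prove what happened inside the network.

```python
import math

# Cross entropy when every class gets probability 1/3, whatever the true label is.
n_classes = 3
uniform_loss = -math.log(1.0 / n_classes)
print(round(uniform_loss, 3))  # 1.099

# Accuracy when the model always predicts class 0 on the test data.
test_class_counts = {0: 12, 1: 13, 2: 13}  # class counts stated in the article
n_test = sum(test_class_counts.values())
constant_accuracy = test_class_counts[0] / n_test
print(round(constant_accuracy, 3))  # 0.316
```

So the results are consistent with a model whose output no longer depends on its input, which fits with the network not having learnt anything.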
nine layers with 259 parameters
The structure is as follows.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 8) 40
dense_1 (Dense) (None, 6) 54
dense_2 (Dense) (None, 6) 42
dense_3 (Dense) (None, 4) 28
dense_4 (Dense) (None, 4) 20
dense_5 (Dense) (None, 4) 20
dense_6 (Dense) (None, 4) 20
dense_7 (Dense) (None, 4) 20
dense_8 (Dense) (None, 3) 15
=================================================================
Total params: 259
Trainable params: 259
Non-trainable params: 0
_________________________________________________________________
The final indices are as follows.
- training loss: 1.099
- evaluation loss: 1.099
- test loss: 1.099
- test accuracy: 0.316
The vanishing gradient problem occurred here too. I had already observed this problem in the eight-layer experiment. This experiment was just to check whether it was really due to the number of layers, and the vanishing gradient problem did occur again.
Conclusion
I explored the effect of the number of layers and the number of parameters with the iris dataset. As a result, I found that the two-layer models are the best ones in the sense of the test loss value and the test accuracy, though it might be due to the small test size. The eight- and nine-layer models learnt nothing because the vanishing gradient problem occurred. It implies that eight or more layers are too many to train under these conditions.
As mentioned in the result section, most of the models misclassified the same sample, which is as follows.
sepal length (cm) 6.3
sepal width (cm) 2.5
petal length (cm) 4.9
petal width (cm) 1.5
target 1.0
Name: 72, dtype: float64
I guess it is important to check whether or not this sample is an outlier.
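As a very rough first check, the sample can be compared with the basic statistics printed earlier in this article by computing a z-score for each feature. This is only a sketch under a strong assumption: it uses the mean and standard deviation of the whole dataset, whereas per-class statistics (which this article does not show) would say more about whether the sample is unusual for class 1.

```python
# (mean, std) of each feature over all 150 samples, copied from the describe() output.
overall_stats = {
    "sepal length (cm)": (5.843333, 0.828066),
    "sepal width (cm)": (3.057333, 0.435866),
    "petal length (cm)": (3.758000, 1.765298),
    "petal width (cm)": (1.199333, 0.762238),
}

# The sample (index 72) that most of the models misclassified.
sample_72 = {
    "sepal length (cm)": 6.3,
    "sepal width (cm)": 2.5,
    "petal length (cm)": 4.9,
    "petal width (cm)": 1.5,
}

# z-score: how many standard deviations the value is away from the overall mean.
z_scores = {
    name: (sample_72[name] - mean) / std
    for name, (mean, std) in overall_stats.items()
}

for name, z in z_scores.items():
    print(f"{name}: {z:+.2f}")
```

None of the four features is more than two standard deviations from the overall mean, so by this crude measure the sample does not look like an outlier of the dataset as a whole. It may still be an unusual member of its own class, which this check cannot show.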
All of the experiments were performed with the seed set to 57. It would be interesting to change the seed and perform the same experiments. Note that the seed also affects how the iris data is split into training, evaluation, and test data. To keep the same test data, you need to change the load_splitted_dataset_with_eval() function in custom_dataset.py:
https://github.com/ksk0629/comparison_of_dnn/blob/8498a7d15ed6a4447f13f9f277e214f4821f46a1/src/custom_dataset.py#L75-L110