Introduction
Recently, we organized a mini-series on Machine Learning, consisting of three 1-hour talks by two of our engineers. The topics covered were k-means clustering, evolutionary algorithms, and neural networks. Below we give a short description of these seminars, and we also provide links to the source code and to the videos of the talks.
First Talk: K-Means by André Miranda
For our first talk, we decided to focus on unsupervised machine learning algorithms, more specifically clustering. We used the k-means clustering method to cluster the football players present in the Kaggle FIFA 2019 dataset. This dataset includes 86 attributes per player, for example Nationality, Sprint Speed, and so on.
To simplify our task, we kept only 25 numeric features and filtered the 193 best players, so that some logic would be visible in the resulting clusters. We used a Python Notebook to explain the process during the talk, relying on the NumPy and Pandas libraries for data handling and on scikit-learn for modeling. We configured our model to group the players into 5 different clusters using the k-means algorithm, which finds each cluster's centroid by computing the mean feature values of the players assigned to that group. In the end, we displayed the players belonging to each group and checked that players in similar field positions were indeed being grouped together by the algorithm! We also tried to assign a previously computed cluster to players not yet seen by the model, and the results were also good, with each new player being assigned to the cluster whose members had similar characteristics.
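As a rough sketch of that pipeline, the scikit-learn code could look like this; the file name and column names are illustrative assumptions, not the actual notebook code (see the linked source for the real version):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Load the dataset (hypothetical file name and column names).
players = pd.read_csv("players.csv")
numeric_features = ["SprintSpeed", "Finishing", "ShortPassing"]  # 25 in the talk

# Keep the numeric features of the 193 best players by overall rating.
players_top = players.nlargest(193, "Overall")
X = players_top[numeric_features]

# Scale the features so no single attribute dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Fit k-means with 5 clusters; each centroid is the mean of its group.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_scaled)
players_top = players_top.assign(cluster=kmeans.labels_)

# Display the players belonging to each group ("Name" is an assumed column).
print(players_top.groupby("cluster")["Name"].apply(list))

# Assign an unseen player to the nearest existing centroid.
new_player = X_scaled[:1]          # placeholder for a new sample
print(kmeans.predict(new_player))  # cluster index of the closest centroid
```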
Second Talk: Evolutionary Algorithms by Paulo Branco
In the second talk, we presented an evolutionary algorithm (EA) to show how genetic algorithms (GAs) can be used together with machine learning (ML) algorithms. GAs are inspired by Darwin's theory of natural selection: only the fittest individuals in an environment survive, and these pass their genetic information on to subsequent generations. In the same way, a GA uses selection, reproduction, and mutation to find the fittest individuals, i.e. the best solutions to a given problem.
GAs are often used to find optimal or near-optimal solutions to highly complex problems, and they present the following advantages and disadvantages:
Advantages
a) They can be faster and more efficient than exhaustive search over large search spaces;
b) They can be used to optimize discrete and continuous functions with multiple objectives;
c) They can provide a list of solutions.
Disadvantages
a) They are not well suited to simple problems, where deterministic methods are usually more efficient;
b) They may have performance issues when complex fitness functions are used;
c) Being stochastic, they are not guaranteed to find the optimal solution.
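To make the selection-reproduction-mutation loop concrete, here is a minimal sketch with illustrative operators and parameters (our assumptions, not the exact code from the talk):

```python
import random

def evolve(population, fitness, generations=50, mutation_rate=0.1):
    """Generic genetic algorithm loop: selection, reproduction, mutation."""
    for _ in range(generations):
        # Selection: keep the fittest half of the population.
        population.sort(key=fitness, reverse=True)
        survivors = population[: len(population) // 2]

        # Reproduction: single-point crossover between random parent pairs.
        children = []
        while len(survivors) + len(children) < len(population):
            a, b = random.sample(survivors, 2)
            point = random.randrange(1, len(a))
            children.append(a[:point] + b[point:])

        # Mutation: randomly flip genes in the offspring (binary genome assumed).
        for child in children:
            for i in range(len(child)):
                if random.random() < mutation_rate:
                    child[i] = 1 - child[i]

        population = survivors + children
    return max(population, key=fitness)
```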
We prepared a practical exercise to show how the GA concept can be used in an ML problem. We reused the football players example from our first talk, where the k-means algorithm assigned football players to clusters based on their individual field characteristics. In the first example, the features used to train the model were selected using common sense. In this exercise, however, we used a GA to choose a near-optimal set of features based on their fitness, with the silhouette coefficient as the fitness function.
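Concretely, each individual can be a binary mask over the candidate features, and its fitness the silhouette coefficient of the clustering trained on the selected columns. A minimal sketch, reusing evolve() and the X_scaled / numeric_features names assumed in the first snippet:

```python
import random
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def fitness(mask):
    """Silhouette coefficient of a 5-cluster k-means model trained on the
    subset of columns selected by the binary mask."""
    selected = [i for i, keep in enumerate(mask) if keep]
    if not selected:
        return -1.0  # an empty feature set is invalid
    X_subset = X_scaled[:, selected]
    labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X_subset)
    return silhouette_score(X_subset, labels)

# Random initial population of feature masks, then evolve towards better fitness.
population = [[random.randint(0, 1) for _ in numeric_features] for _ in range(20)]
best_mask = evolve(population, fitness)
```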
We compared the fitness of the model trained with the hand-picked set of features from the first example (0.22948566998763453) against the one obtained using the GA (0.40374152548280884): an increase of 75.93%. This means that the set of features selected by our GA trained the k-means model better.
Third Talk: Neural Networks by André Miranda
In our third talk, we introduced the topic of Deep Learning, applied to image classification. We approached the well-known problem of handwritten digit classification using the MNIST dataset. Specifically, our main goal was to classify images of handwritten digits into the numbers 0 to 9.
With 70k samples in our dataset, we used 60k to train a simple yet powerful Convolutional Neural Network. In the Python Notebook used during the talk, we showed all the steps using the Keras framework. Keras provides tools for loading built-in datasets, data transformation, data modeling, metrics, and much more!
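As a small illustration of those tools, loading and preparing MNIST takes only a few lines. A sketch of the usual steps, assuming the tensorflow.keras API rather than the exact notebook code:

```python
from tensorflow import keras

# Load the built-in MNIST dataset: 60k training and 10k test images.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Scale pixel values to [0, 1] and add a channel axis for the convolutions.
x_train = x_train.astype("float32")[..., None] / 255.0
x_test = x_test.astype("float32")[..., None] / 255.0
```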
Our ConvNet architecture was as follows:
1. 2 pairs of Convolution + MaxPooling layers.
2. Convolution layer.
3. Flatten layer, to reshape the data into one dimension.
4. A Dense layer.
5. Final Dense layer with ten final nodes, the same as the number of classes in our problem.
(Kind of a black art, right?)
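In Keras, that architecture could look roughly like the following sketch; the filter counts and layer sizes are our illustrative assumptions, not necessarily the values used in the talk:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    # 1. Two pairs of Convolution + MaxPooling layers.
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    # 2. A final convolution layer.
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    # 3. Flatten the feature maps into one dimension.
    layers.Flatten(),
    # 4. A dense layer.
    layers.Dense(64, activation="relu"),
    # 5. Ten output nodes, one per digit, with softmax probabilities.
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```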
Each of our final nodes gives us a measure of probability of the sample being one of the digits, so in the end we simply pick the node where this measure is highest. After our validation step, we concluded that 7 training epochs was the best way to train the model, and in the end we reached an accuracy of 99%. We also successfully classified our own handwritten digits. A great result for so few lines of code!
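Putting it together, training for the 7 epochs found during validation and reading off the most probable digit could look like this (a sketch, assuming the model and data from the previous snippets):

```python
import numpy as np

# Train on the 60k training samples and evaluate on the held-out 10k test images.
model.fit(x_train, y_train, epochs=7, validation_split=0.1)
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"test accuracy: {test_acc:.4f}")

# Each output node is a probability; the prediction is the highest one.
probabilities = model.predict(x_test[:1])
print("predicted digit:", np.argmax(probabilities, axis=1)[0])
```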
Conclusion
This mini-series on machine learning was a great opportunity to expose our software engineers to practical uses of common tools and methods in AI. And of course, it was a lot of fun!
Written by André Miranda, Dimitris Mostrous and Paulo Branco / Software Developers at Cleverti