Simple Nearest Neighbors Algorithm
Data scientists have many options for tackling a prediction problem: linear regression, logistic regression, random forests, neural networks, and many more. The type of model you use depends on the type of problem you have. I want to share a simple algorithm that works well for simple classification problems and can also be applied to regression problems: k-nearest neighbors.
The nearest neighbors algorithm is quite simple. It finds the k training data points nearest to a new point and lets them “vote” on which class to predict. Let’s look at an example to visualize this. The Iris dataset is a popular and simple dataset to use. First, let’s plot each Iris flower in the training data by sepal width and sepal length, as shown below. The color represents the class.
For this problem I wrote a simple nearest neighbors algorithm that can be found here: https://github.com/PalmerTurley34/K-Nearest-Neighbors
Now we’ll add a new point whose flower class we want to predict.
My algorithm is a simple, brute-force algorithm, which means it calculates the distance from the new point to every data point in the training data to find the nearest ones. We only really care about the k nearest points, though. In this case, k is 5.
Once we have the 5 nearest points, the algorithm predicts a class based on them.
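To make the brute-force step concrete, here is a minimal sketch in plain Python. It is not the code from the repository linked above. The function name k_nearest, the Euclidean distance metric, and the toy measurements are all assumptions made for illustration.

```python
import math

def k_nearest(train_points, query, k=5):
    """Return the indices of the k training points closest to `query`.

    Brute force: measure the distance from the query to every training
    point, sort by distance, and keep the first k. Euclidean distance is
    an assumption here; other metrics would work the same way.
    """
    distances = [(math.dist(point, query), index)
                 for index, point in enumerate(train_points)]
    distances.sort()  # smallest distance first
    return [index for _, index in distances[:k]]

# Made-up (sepal length, sepal width) pairs, not real Iris rows.
train_points = [(5.1, 3.5), (4.9, 3.0), (6.4, 3.2),
                (6.9, 3.1), (5.5, 2.3), (6.5, 2.8)]
query = (6.0, 3.0)

nearest = k_nearest(train_points, query, k=3)
print(nearest)
```

Sorting every distance is the simplest approach and is fine for a small dataset like Iris. For large training sets, the cost of comparing against every point is what tree-based or approximate neighbor searches try to avoid.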
We can see that of the 5 nearest data points, 3 of them are red and 2 of them are green. So the prediction would be red, or Iris versicolor, because that is the class that got the most “votes”. And that’s all there is to it.
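The vote itself can be sketched in a few lines. The function name majority_vote is hypothetical, and the labels are just the colors from the example above. How ties are broken is also an assumption: this version returns whichever tied label appears first among the neighbors.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the label that appears most often among the neighbors.

    On a tie, Counter.most_common keeps first-seen order, so the tied
    label encountered first wins. That tie rule is an assumption, not
    something the article specifies.
    """
    return Counter(neighbor_labels).most_common(1)[0][0]

# The 5 nearest neighbors from the example: 3 red, 2 green.
neighbor_labels = ["red", "green", "red", "green", "red"]
prediction = majority_vote(neighbor_labels)
print(prediction)
```

Choosing an odd k, as in this example, avoids ties when there are only two competing classes.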
There are plenty of ways to improve the algorithm, such as weighting the neighbors by distance, but this is the simple version. The k-nearest neighbors model is simple yet powerful for many classification problems, and it is worth researching how to use it in your data science work.
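As a rough illustration of the distance weighting mentioned above, here is one possible sketch. The function name weighted_vote, the inverse-distance weight 1 / (distance + eps), and the example distances are all assumptions. Other weighting schemes are equally valid.

```python
from collections import defaultdict

def weighted_vote(neighbors, eps=1e-9):
    """Pick a label from (distance, label) pairs, weighting by closeness.

    Each neighbor contributes 1 / (distance + eps) to its label's score,
    so closer neighbors count for more. `eps` only guards against
    division by zero when a neighbor sits exactly on the query point.
    """
    scores = defaultdict(float)
    for distance, label in neighbors:
        scores[label] += 1.0 / (distance + eps)
    return max(scores, key=scores.get)

# Made-up distances: two very close green points, three farther red ones.
neighbors = [(0.1, "green"), (0.2, "green"),
             (0.9, "red"), (1.0, "red"), (1.1, "red")]
print(weighted_vote(neighbors))
```

With a plain majority vote these five neighbors would predict red, 3 votes to 2. With inverse-distance weights the two much closer green points outweigh the three red ones, so the weighted prediction is green.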