A Bit of Theory First – Random Forest Classifier in 5 Steps
A random forest classifier is a supervised machine learning (ML) algorithm: it learns to predict a target from examples whose outcomes are already known. It is also an ensemble algorithm, because it is built from many repetitions of a core algorithm, the decision tree. The forest collects the class predicted by each of its decision trees and takes the most common one as its final prediction. Check this classification to familiarize yourself with different ML classifiers.
The basic step-by-step random forest implementation looks like this:
- Pick the right data. At this stage, you select features that you believe will be relevant for making predictions about the target variable. You will also need a history of values for the target variable itself. You can be less thorough in the beginning, because the algorithm will show you the importance of the features in the end. Once you know the importance, you will be able to come up with ways to make a better model.
- Perform a train-test split of the dataset and feature engineering. Luckily, Python libraries such as scikit-learn already provide functions for this, so you can do it semi-automatically. No endless lines of code needed.
- Create your model. One of the best choices is Python’s sklearn.ensemble module, from which you can import RandomForestClassifier.
- Understand your model and its performance. Check out the accuracy and other performance metrics. Play with the number of decision trees to track the changes in your model’s performance.
- Check out the important features. This is the true bonus of the whole algorithm. Together with the predictions, the final model shows you the importance of each predictor. When you know which features matter most for predicting the target variable, you can select the subset of the most meaningful features and build an even better model with them! You need to be careful with this approach, though, and later we will explain why.
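The five steps above can be sketched in a few lines with scikit-learn. This is only a minimal illustration under stated assumptions: the data is synthetic (generated with make_classification, not the exam table), the feature names are placeholders, and the parameter values are arbitrary choices rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 1: pick the data. Here it is a synthetic stand-in with 6 predictors.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

# Step 2: split into a training part and a held-out testing part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Step 3: create and train the model.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Step 4: check the performance on data the model has not seen.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")

# Step 5: list the predictors from most to least important.
importances = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in importances:
    print(f"{name}: {score:.3f}")
```

The importance scores are relative to one another, so they are best read as a ranking rather than as absolute measurements.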
For a more visual representation of this plain theory, watch this video:
Your Math Exam Dataset For the Random Forest Model
The theory and terms are very important, but we can skip to the good part – predicting your future exam score. If you have been studying for a while now at school or college, you must have already passed or failed some Math exams. No offense, we already have the dataset of your assumed past results below.
In the table, you can see the properties of each past exam, and the past values for exam outcome (or future target variable). All the other columns, such as season, teacher, or number of questions act as our predictors.
So, what is our next step to finally find out whether you are going to pass your next exam, let’s say ‘Quadratic Graphs’, or not?
Imagine analyzing the table by asking yourself questions about the past exams. It would look like this:
This is exactly what a decision tree, the basic unit of a random forest, looks like. You can tell that it is very intuitive and easily interpretable. Before we proceed with the task, let’s clarify the differences between these two.
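In code, a decision tree is nothing more than a chain of nested questions. The sketch below is hand-written for illustration: the questions, the threshold of 20, and the field names (has_problems, num_questions, season) are all invented and are not learned from the exam table.

```python
def predict_exam(exam: dict) -> str:
    """Walk a hand-written decision tree: each `if` is one question."""
    # Question 1: did the exam include word problems?
    if exam["has_problems"]:
        # Question 2: was it a long exam?
        if exam["num_questions"] > 20:
            return "fail"
        return "pass"
    # Question 2 on the other branch: when did the exam take place?
    if exam["season"] == "winter":
        return "fail"
    return "pass"


# A hypothetical upcoming exam.
next_exam = {"season": "summer", "num_questions": 15, "has_problems": True}
print(predict_exam(next_exam))  # pass
```

A trained decision tree has the same shape; the difference is that the algorithm chooses the questions and thresholds from the data instead of a person writing them down.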
One Last Time – Decision Tree Vs. Random Forest
As we said before, a random forest is made up of a smaller or larger number of separate decision trees. This means that the decision tree is the primary unit. The difference comes down to two main aspects:
- Execution speed. A single decision tree is certainly the faster classification method, but if you are looking for accuracy and predictive power, the random forest classifier wins by combining numerous decision trees. It collects the ‘votes’ from each generated tree and picks the final answer by majority. The more features you put into your dataset, the longer it takes to build every tree and collect the votes. The slower speed is offset by another useful characteristic of the algorithm, though. The deeper a single decision tree grows, the easier it is to overfit, and combining many trees in a random forest reduces that risk.
- Readability. When you look at a decision tree, the visualization lets you draw quick or fairly concrete conclusions. You can follow the branches and their decisions, reaching the final decision explicitly and logically. Even better, when you have collected the dataset yourself, you can still remember how and why each answer was ‘yes’ or ‘no’. A random forest, however, is too complex to take in by eye. Its limited interpretability is probably its most prominent drawback in comparison to the decision tree.
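The majority principle mentioned above fits in a few lines. This is a plain-Python sketch with invented votes, not the internals of any particular library; scikit-learn's RandomForestClassifier, for instance, averages the trees' predicted class probabilities instead of counting hard votes.

```python
from collections import Counter


def majority_vote(votes):
    """Return the class label that most trees voted for."""
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Hypothetical votes from five trees for one upcoming exam.
tree_votes = ["pass", "fail", "pass", "pass", "fail"]
print(majority_vote(tree_votes))  # pass (3 votes against 2)
```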
Generating Predictions with the Random Forest Classifier
As you can see, the decision tree approach is very intuitive. Our decision tree is built from the training dataset: we use the data we already possess to teach the model which exams you usually pass or fail. It is very limited, though, because it only covers the few questions we asked.
Imagine you want to ask more specific questions, such as ‘How much time did you take to prepare for the past exam?’, ‘What was your attendance on the topic during the semester?’, and more. To answer all of them, we will need more features and more trees, and we will very likely come to different conclusions about your possible exam success.
Simply put, the more trees we have, the more reliable our predictions tend to become (with diminishing returns after a point), and the more obvious it gets that we won’t be able to solve the problem with one or two decision trees.
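One way to see the effect of the number of trees is to train forests of different sizes and compare their cross-validated accuracy. The sketch below assumes synthetic data from make_classification and an arbitrary set of forest sizes; the actual numbers will differ from dataset to dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, not the exam table.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

scores = {}
for n_trees in (1, 5, 25, 100):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    # Mean accuracy over 5 cross-validation folds.
    scores[n_trees] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{n_trees:>3} trees: accuracy {scores[n_trees]:.3f}")
```

Typically the score improves quickly at first and then levels off, while training time keeps growing with every added tree.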
Below you can see the visualization of the random forest, where multiple trees are generated.
Another characteristic that makes our algorithm so useful is, in fact, its randomness. The algorithm is random in two main ways:
- Each tree is assigned a random subset of the training data (typically a bootstrap sample, drawn with replacement).
- Each tree is assigned a random subset of predictors. This means that one decision tree, for example, will consider who the teacher was and what time/season the exam took place, and another decision tree will look at the number of questions, the assigned teacher, and whether the problems were included or not. (Many implementations, including scikit-learn’s, draw this random subset anew at each split of a tree rather than once per tree.)
This makes our decision trees largely independent of one another, so the errors of individual trees tend to be outvoted by the rest.
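The two kinds of randomness can be illustrated without any ML library. The sketch below follows the per-tree description given above; the function name draw_tree_inputs, the predictor names, and the choice of sampling rows with replacement are assumptions made for illustration.

```python
import random

# Hypothetical predictor names for the exam table.
predictors = ["season", "teacher", "num_questions", "has_problems"]


def draw_tree_inputs(rng, n_rows, predictors, n_predictors):
    """Pick the rows and predictors that one tree gets to see."""
    # Randomness 1: sample row indices with replacement (a bootstrap sample),
    # so some rows repeat and others are left out.
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Randomness 2: sample a subset of predictors without replacement.
    cols = rng.sample(predictors, n_predictors)
    return rows, cols


rng = random.Random(7)  # fixed seed so the run is repeatable
for tree_id in range(3):
    rows, cols = draw_tree_inputs(rng, n_rows=10, predictors=predictors, n_predictors=2)
    print(f"tree {tree_id}: rows {sorted(rows)}, predictors {cols}")
```

Because every tree sees different rows and different predictors, the trees make different mistakes, which is what makes combining their votes worthwhile.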
So, how can we finally make a prediction about the future exam?
To test the predictive power of our model, we need another dataset built from exam data whose outcomes we already know but which the model has not seen during training. This will be our testing dataset. Running the model on the testing dataset shows whether it can already guess our failed/passed exams with high accuracy.
Once you have achieved an accuracy score that satisfies you, you can use your model with more confidence to predict the yet unknown results.
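Put together, the workflow could look like the sketch below. Everything about the data is an assumption: the exam records are randomly generated, the column names are invented, and the pass/fail rule is made up so that the model has a pattern to learn. It is not the table from this article.

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Invented past exams with an invented pass/fail rule.
rng = random.Random(0)
exams, outcomes = [], []
for _ in range(200):
    exam = {
        "season": rng.choice(["winter", "summer"]),
        "teacher": rng.choice(["A", "B"]),
        "num_questions": rng.randint(5, 30),
        "has_problems": rng.choice([0, 1]),
    }
    passed = exam["num_questions"] <= 20 or (
        exam["teacher"] == "A" and exam["season"] == "summer"
    )
    exams.append(exam)
    outcomes.append("pass" if passed else "fail")

# Turn text columns into 0/1 indicator columns; numbers are kept as they are.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(exams)

# Hold back a quarter of the known exams as the testing dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, outcomes, test_size=0.25, random_state=0, stratify=outcomes
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy on the testing dataset: {accuracy:.2f}")

# Predict the outcome of a hypothetical upcoming exam.
next_exam = {"season": "summer", "teacher": "A", "num_questions": 18, "has_problems": 1}
prediction = model.predict(vectorizer.transform([next_exam]))[0]
print(f"Predicted outcome: {prediction}")
```

The accuracy printed here only says how well the model recovers the made-up rule; with real exam results it would depend on how much the chosen predictors actually tell you about the outcome.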