Today we are going to capture the important points about the Random Forest algorithm. In my previous blog, we learnt about the Decision Tree algorithm.
What is an Ensemble?
An ensemble is a collection of models that is used to make predictions, instead of an individual model.
For an ensemble to work, each model in it should comply with the following conditions:
o Each model should be diverse. This means the individual models make their predictions independently of each other.
o Each model should be acceptable. This means each model performs at least better than a random model.
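To see why these two conditions matter, consider a toy calculation (the numbers are purely illustrative): three independent models, each correct 70% of the time, combined by majority vote.

```python
from math import comb

# Three diverse (independent) models, each 70% accurate: "acceptable",
# i.e. better than the 50% accuracy of a random binary classifier.
p, n = 0.7, 3

# Probability that the majority vote (2 or more of the 3) is correct.
ensemble_accuracy = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                        for k in range(2, n + 1))
print(f"{ensemble_accuracy:.3f}")  # 0.784, better than any individual model
```

The majority is wrong only when at least two models err at the same time, which is unlikely when the models are both diverse and acceptable.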
A random forest is an ensemble made using a combination of numerous decision trees.
Steps to follow
Random forests are created using a special ensemble method called bagging. Bagging stands for bootstrap aggregation.
Bootstrapping means creating bootstrap samples from a given data set. A bootstrap sample is created by sampling the rows of a given data set uniformly and with replacement.
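As a quick illustration, here is a minimal sketch of bootstrapping with NumPy (the toy arrays are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# A toy data set: N = 10 observations (rows), M = 3 features (columns).
X = rng.normal(size=(10, 3))
y = rng.integers(0, 2, size=10)

# A bootstrap sample: draw N row indices uniformly and with replacement,
# so some rows appear more than once and some do not appear at all.
idx = rng.integers(0, len(X), size=len(X))
X_boot, y_boot = X[idx], y[idx]

print(sorted(idx))  # duplicate indices show the sampling was done with replacement
```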
1. Create a bootstrap sample from the training set.
2. Construct a decision tree using the bootstrap sample. While splitting a node of the tree, consider only a random subset of features. Every time a node has to be split, a different random subset of features is considered.
3. Repeat steps 1 and 2 n times to construct n trees in the forest. Remember, each tree is constructed independently, so it is possible to construct the trees in parallel.
4. While predicting a test case, each tree predicts individually, and the final prediction is given by the majority vote of all the trees.
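Putting steps 1 to 4 together, here is a minimal from-scratch sketch that uses scikit-learn's DecisionTreeClassifier as the base learner (the function names, the √M feature subset, and the default of 100 trees are choices made for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=100, seed=0):
    """Steps 1 to 3: grow n_trees trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))          # step 1: bootstrap sample
        tree = DecisionTreeClassifier(max_features="sqrt")  # step 2: random feature subset per split
        tree.fit(X[idx], y[idx])
        forest.append(tree)  # each tree is independent, so this loop could run in parallel
    return forest

def predict_forest(forest, X):
    """Step 4: majority vote across the individual tree predictions."""
    votes = np.array([tree.predict(X) for tree in forest])  # shape: (n_trees, n_samples)
    majority = []
    for column in votes.T:  # one column of votes per test case
        labels, counts = np.unique(column, return_counts=True)
        majority.append(labels[counts.argmax()])
    return np.array(majority)
```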
• A random forest is more stable than any single decision tree because the results get averaged out.
• A random forest is less susceptible to the curse of dimensionality, since only a subset of features is used to split each node.
• You can parallelise the training of a forest, since each tree is constructed independently.
• You can calculate the OOB (out-of-bag) error using the training set:
  o The OOB error provides a good estimate of the performance of a forest on unseen data.
  o Hence, there is no need to split the data into training and validation sets; you can use all of it to train the forest.
  o The final OOB error is obtained by computing the error on each observation, using only the trees that did not see that observation in their bootstrap sample, and then aggregating these errors.
  o It turns out that the OOB error is about as good an estimate as a cross-validation error.
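In scikit-learn, the OOB estimate is available directly via the oob_score option; here is a minimal sketch (the data set is just a stand-in):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores each observation using only the trees whose
# bootstrap samples did not contain it; n_jobs=-1 trains the trees in
# parallel, which is possible because they are built independently.
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            n_jobs=-1, random_state=0)
rf.fit(X, y)
print(f"OOB accuracy estimate: {rf.oob_score_:.3f}")
```

Note that no separate validation split was made; the OOB mechanism uses the training set itself.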
Time to build a forest
To construct a forest of S trees on a data set that has M features and N observations, the time taken will depend on the following factors:
• The number of trees: the time taken is directly proportional to the number of trees, but this duration can be reduced by building the trees in parallel.
• The size of the bootstrap sample: generally, the size of a bootstrap sample is 30-70% of N. The smaller the sample, the less time it takes to build the forest.
• The size of the feature subset considered while splitting a node: generally, this is taken as √M for classification and M/3 for regression.
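These three factors map directly onto parameters of scikit-learn's RandomForestClassifier; the sketch below shows the mapping (the specific values are illustrative, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,     # S, the number of trees: time grows linearly with this
    max_samples=0.5,      # bootstrap sample size as a fraction of N (here 50%)
    max_features="sqrt",  # size of the feature subset per split (√M for classification)
    n_jobs=-1,            # build the trees in parallel to offset the cost of more trees
    random_state=0,
)
```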