Algorithm
Random Forest
Description
A robust ensemble meta-estimator that fits a number of decision tree classifiers or regressors on bootstrapped sub-samples of the dataset and averages their predictions to improve accuracy and control over-fitting. The implementation uses `std::vector` and raw pointers to manage the forest of trees, with random feature selection at each node split.
$$ \hat{y}(x) = \frac{1}{T} \sum_{t=1}^T f_t(x) $$
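The averaging formula above can be sketched directly in C++. This is a minimal illustration, not the project's actual API: the trees are stood in for by `std::function` objects so the aggregation step is self-contained.

```cpp
#include <functional>
#include <vector>

// Ensemble prediction: average the outputs of T fitted trees on a sample x.
// Each "tree" here is a stand-in std::function; in the real forest these
// would be traversals of the trained tree structures.
double ensemble_predict(const std::vector<std::function<double(double)>>& trees,
                        double x) {
    double sum = 0.0;
    for (const auto& f : trees)
        sum += f(x);
    return sum / static_cast<double>(trees.size());
}
```

With three trees predicting 1.0, 2.0, and 3.0 for some input, the ensemble returns their mean, 2.0.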
Algorithm Workflow
START
Initialize the forest with `ntrees` empty trees.
BOOTSTRAP
For each tree t=1..T, generate a random sample of size N from X with replacement.
BUILD
Grow a decision tree on the bootstrapped data.
FEATURE SUBSET
At each node, select a random subset of features of size
`mtry`.
BEST SPLIT
Find the best split among the `mtry` features only.
GROW
Continue splitting until `max_depth` is reached or a node becomes pure (no pruning).
STORE
Save the root pointer of the tree in the forest container.
PREDICT
For a new sample, traverse all T trees.
AGGREGATE
Compute the average (Regression) or Majority Vote (Classification) of
all tree predictions.
END
Return the final ensemble prediction.
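The two randomized steps of the workflow, BOOTSTRAP and FEATURE SUBSET, can be sketched as small helpers. This is a hedged illustration using the standard `<random>` facilities; the helper names `bootstrap_indices` and `feature_subset` are assumptions for the sketch, not functions from `RandomForest.cpp`.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// BOOTSTRAP: draw N row indices uniformly at random, with replacement.
std::vector<int> bootstrap_indices(int n, std::mt19937& rng) {
    std::uniform_int_distribution<int> pick(0, n - 1);
    std::vector<int> idx(n);
    for (int& i : idx)
        i = pick(rng);
    return idx;
}

// FEATURE SUBSET: choose mtry distinct feature indices for one node split.
std::vector<int> feature_subset(int n_features, int mtry, std::mt19937& rng) {
    std::vector<int> all(n_features);
    std::iota(all.begin(), all.end(), 0);   // 0, 1, ..., n_features-1
    std::shuffle(all.begin(), all.end(), rng);
    all.resize(mtry);                       // keep the first mtry of the shuffle
    return all;
}
```

Sampling with replacement leaves roughly 37% of rows out of each bootstrap sample, which is what makes the trees decorrelated; drawing a fresh feature subset at every node (not once per tree) decorrelates them further.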
Implementation Details
Implemented in `RandomForest.cpp`.
// Training: grow ntrees independent trees on bootstrapped samples.
std::vector<Node*> trees(ntrees);
for (int i = 0; i < ntrees; ++i) {
    // Bootstrap sampling: N rows drawn from (X, y) with replacement
    // into bootstrap_X / bootstrap_y (helpers elided here).
    // Random feature selection: build_tree picks mtry features per split.
    trees[i] = build_tree(bootstrap_X, bootstrap_y, mtry);
}

// Prediction (regression): average all tree outputs for sample X.
double preds = 0.0;
for (Node* t : trees)
    preds += predict_tree(t, X);
preds /= ntrees;
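The snippet above shows the regression path (averaging). For classification, the AGGREGATE step is a majority vote instead; a minimal sketch, assuming each tree has already emitted an integer class label:

```cpp
#include <map>
#include <vector>

// AGGREGATE (classification): majority vote over per-tree class labels.
int majority_vote(const std::vector<int>& tree_votes) {
    std::map<int, int> counts;
    for (int v : tree_votes)
        ++counts[v];
    int best_label = tree_votes.front();
    int best_count = 0;
    for (const auto& [label, count] : counts) {
        if (count > best_count) {
            best_count = count;
            best_label = label;
        }
    }
    return best_label;
}
```

For votes {0, 1, 1, 2, 1} the winning label is 1. Ties are broken by the smallest label here, since `std::map` iterates in ascending key order; a production implementation would want an explicit tie-breaking policy.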
Complexity & Optimization
Time Complexity
Training: O(T * N * log(N) * mtry), i.e. T independent trees, each of depth O(log N), evaluating mtry candidate features per split.
Space Complexity
O(T * nodes per tree), since every node of every tree is retained for prediction.
Optimizations
Ensemble variance reduction.
Limitations
Single-threaded build; the trees are independent, so training is embarrassingly parallel, but this implementation does not exploit that.
Use Cases
Robust classification/regression.