Algorithm
Random Forest
Description
A robust ensemble meta-estimator that fits a number of decision tree classifiers or regressors on bootstrapped sub-samples of the dataset and averages their predictions to improve accuracy and control over-fitting. The implementation uses `std::vector` and raw pointers to manage the forest of trees, with random feature selection at each node split.
$$ \hat{y}(x) = \frac{1}{T} \sum_{t=1}^T f_t(x) $$
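The averaging formula above can be sketched directly in C++. This is a minimal illustration, not the project's actual API: the trees are stood in for by `std::function` objects so the aggregation step is self-contained.

```cpp
#include <functional>
#include <vector>

// Ensemble prediction: average the outputs of T fitted trees on a sample x.
// Each "tree" here is a stand-in std::function; in the real forest these
// would be traversals of the trained tree structures.
double ensemble_predict(const std::vector<std::function<double(double)>>& trees,
                        double x) {
    double sum = 0.0;
    for (const auto& f : trees)
        sum += f(x);
    return sum / static_cast<double>(trees.size());
}
```

With three trees predicting 1.0, 2.0, and 3.0 for some input, the ensemble returns their mean, 2.0.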
Algorithm Workflow
START
Initialize the forest with `ntrees` empty trees.
BOOTSTRAP
For each tree t=1..T, generate a random sample of size N from X with replacement.
BUILD
Grow a decision tree on the bootstrapped data.
FEATURE SUBSET
At each node, select a random subset of features of size
`mtry`.
BEST SPLIT
Find the best split among the `mtry` features only.
GROW
Continue splitting until `max_depth` is reached or a node becomes pure (no pruning).
STORE
Save the root pointer of the tree in the forest container.
PREDICT
For a new sample, traverse all T trees.
AGGREGATE
Compute the average (Regression) or Majority Vote (Classification) of
all tree predictions.
END
Return the final ensemble prediction.
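The two randomized steps of the workflow, BOOTSTRAP and FEATURE SUBSET, can be sketched as small helpers. This is a hedged illustration using the standard `<random>` facilities; the helper names `bootstrap_indices` and `feature_subset` are assumptions for the sketch, not functions from `RandomForest.cpp`.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// BOOTSTRAP: draw N row indices uniformly at random, with replacement.
std::vector<int> bootstrap_indices(int n, std::mt19937& rng) {
    std::uniform_int_distribution<int> pick(0, n - 1);
    std::vector<int> idx(n);
    for (int& i : idx)
        i = pick(rng);
    return idx;
}

// FEATURE SUBSET: choose mtry distinct feature indices for one node split.
std::vector<int> feature_subset(int n_features, int mtry, std::mt19937& rng) {
    std::vector<int> all(n_features);
    std::iota(all.begin(), all.end(), 0);   // 0, 1, ..., n_features-1
    std::shuffle(all.begin(), all.end(), rng);
    all.resize(mtry);                       // keep the first mtry of the shuffle
    return all;
}
```

Sampling with replacement leaves roughly 37% of rows out of each bootstrap sample, which is what makes the trees decorrelated; drawing a fresh feature subset at every node (not once per tree) decorrelates them further.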
Implementation Details
Implemented in `RandomForest.cpp`.
// Training: grow ntrees independent trees on bootstrapped samples.
std::vector<Node*> trees(ntrees);
for (int i = 0; i < ntrees; ++i) {
    // Bootstrap sampling: N rows drawn from (X, y) with replacement
    // into bootstrap_X / bootstrap_y (helpers elided here).
    // Random feature selection: build_tree picks mtry features per split.
    trees[i] = build_tree(bootstrap_X, bootstrap_y, mtry);
}

// Prediction (regression): average all tree outputs for sample X.
double preds = 0.0;
for (Node* t : trees)
    preds += predict_tree(t, X);
preds /= ntrees;
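The snippet above shows the regression path (averaging). For classification, the AGGREGATE step is a majority vote instead; a minimal sketch, assuming each tree has already emitted an integer class label:

```cpp
#include <map>
#include <vector>

// AGGREGATE (classification): majority vote over per-tree class labels.
int majority_vote(const std::vector<int>& tree_votes) {
    std::map<int, int> counts;
    for (int v : tree_votes)
        ++counts[v];
    int best_label = tree_votes.front();
    int best_count = 0;
    for (const auto& [label, count] : counts) {
        if (count > best_count) {
            best_count = count;
            best_label = label;
        }
    }
    return best_label;
}
```

For votes {0, 1, 1, 2, 1} the winning label is 1. Ties are broken by the smallest label here, since `std::map` iterates in ascending key order; a production implementation would want an explicit tie-breaking policy.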
Complexity & Optimization
Time Complexity
Training: O(T * N * log(N) * mtry), i.e. T independent trees, each of depth O(log N), evaluating mtry candidate features per split.
Space Complexity
O(T * nodes per tree), since every node of every tree is retained for prediction.
Optimizations
Ensemble variance reduction.
Limitations
Single-threaded build; the trees are independent, so training is embarrassingly parallel, but this implementation does not exploit that.
Use Cases
Robust classification/regression.