Algorithm

Decision Tree Regressor

Description

A high-performance, recursive partitioning algorithm that constructs a regression tree by exhaustively searching for optimal splits. Unlike standard implementations, vectorForgeML's Decision Tree is built directly in C++ using raw pointers for graph construction, ensuring minimal memory overhead and cache-friendly traversal. It supports deep trees and handles continuous features with high precision. The model predicts the target value by traversing the tree from root to leaf, where each node represents a decision rule based on a single feature threshold.

$$ \min_{j,s} \left[ \sum_{x_i \in R_1(j,s)} (y_i - \bar{y}_{R_1})^2 + \sum_{x_i \in R_2(j,s)} (y_i - \bar{y}_{R_2})^2 \right] $$
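The criterion above can be evaluated directly for a single candidate pair (j, s). The sketch below computes the two-region sum of squared errors for one feature column and threshold; the names `sse` and `splitCost` are illustrative, not part of the library's API:

```cpp
#include <vector>
#include <cstddef>

// Sum of squared deviations from the mean for one region R(j, s).
double sse(const std::vector<double>& y) {
    if (y.empty()) return 0.0;
    double mean = 0.0;
    for (double v : y) mean += v;
    mean /= y.size();
    double s = 0.0;
    for (double v : y) s += (v - mean) * (v - mean);
    return s;
}

// Cost of splitting feature column x at threshold t:
// SSE of the left region (x <= t) plus SSE of the right region (x > t),
// exactly the bracketed quantity in the formula above.
double splitCost(const std::vector<double>& x,
                 const std::vector<double>& y, double t) {
    std::vector<double> yl, yr;
    for (std::size_t i = 0; i < x.size(); ++i)
        (x[i] <= t ? yl : yr).push_back(y[i]);
    return sse(yl) + sse(yr);
}
```

For y = {1, 1, 5, 5} split cleanly between its two groups, the cost drops to zero, which is why the minimizer picks that threshold.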

Algorithm Workflow

START Initialize the tree building process with the full dataset (X, y) at the root node.
CHECK Verify that the current depth < max_depth and that the variance of the target y > 0.
SEARCH Iterate through ALL features (j=1..p) and ALL unique thresholds to find the optimal split.
CRITERION Calculate the cost of every potential split as the weighted child MSE: Cost = (n_L/n)*MSE_L + (n_R/n)*MSE_R (minimizing this cost maximizes the MSE reduction).
SPLIT Partition the data into Left (L) and Right (R) subsets based on the best feature/threshold pair.
RECURSE Recursively call the build function on L and R subsets.
LEAF Create a leaf node if stopping criteria are met. Store the mean value $\bar{y}$ of the node's samples.
OPTIMIZE Use index-based views (no deep copies) to pass data subsets to child nodes.
END Return the root pointer of the constructed tree structure.
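The SEARCH and CRITERION steps above amount to a double loop over features and observed thresholds. A minimal sketch, assuming X is stored column-major as `X[j][i]`; `Split` and `bestSplit` are illustrative names, not the library's internals:

```cpp
#include <vector>
#include <limits>
#include <cstddef>

struct Split {
    int feature = -1;        // -1 signals "no valid split found"
    double threshold = 0.0;
    double cost = std::numeric_limits<double>::infinity();
};

// Exhaustive search over all features and all observed thresholds,
// minimizing SSE_L + SSE_R (equivalent to minimizing the weighted MSE).
Split bestSplit(const std::vector<std::vector<double>>& X,  // column-major: X[j][i]
                const std::vector<double>& y) {
    Split best;
    const std::size_t n = y.size();
    for (std::size_t j = 0; j < X.size(); ++j) {
        for (std::size_t k = 0; k < n; ++k) {
            const double t = X[j][k];
            // Accumulate sums and counts for left (x <= t) and right (x > t).
            double sl = 0, sr = 0;
            std::size_t nl = 0, nr = 0;
            for (std::size_t i = 0; i < n; ++i)
                if (X[j][i] <= t) { sl += y[i]; ++nl; } else { sr += y[i]; ++nr; }
            if (nl == 0 || nr == 0) continue;  // degenerate split, skip
            const double ml = sl / nl, mr = sr / nr;
            double cost = 0;
            for (std::size_t i = 0; i < n; ++i) {
                const double d = (X[j][i] <= t) ? y[i] - ml : y[i] - mr;
                cost += d * d;
            }
            if (cost < best.cost) {
                best.feature = static_cast<int>(j);
                best.threshold = t;
                best.cost = cost;
            }
        }
    }
    return best;
}
```

A `feature` of -1 in the result corresponds to the `bf == -1` stopping check in the implementation fragment below: no threshold produced two non-empty regions.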

Implementation Details

Implemented in `DecisionTree.cpp`. The tree is grown by recursively partitioning the data; the fragment shows the core of the build routine.

Node* build(NumericMatrix X, NumericVector y, int depth, int maxd) {
    Node* node = new Node();
    // ... exhaustive search for the best feature index bf and threshold,
    //     minimizing the weighted child MSE (bf stays -1 if no valid split) ...
    if (depth >= maxd || bf == -1) {
        node->value = s / n;  // leaf node: store mean of y (s = sum of y, n = sample count)
        return node;
    }
    // XL/yL and XR/yR are the left/right partitions induced by the best split
    node->left  = build(XL, yL, depth + 1, maxd);
    node->right = build(XR, yR, depth + 1, maxd);
    return node;
}
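Prediction, described earlier as a root-to-leaf traversal, is the cheap half of the algorithm. A sketch under an assumed node layout consistent with the fragment above (this `Node` definition and `predict` are illustrative, not the library's actual header):

```cpp
#include <vector>

// Assumed node layout matching the build fragment (illustrative only).
struct Node {
    int feature = -1;        // split feature index; unused in leaves
    double threshold = 0.0;  // split threshold; unused in leaves
    double value = 0.0;      // stored mean of y, used when this node is a leaf
    Node* left = nullptr;
    Node* right = nullptr;
};

// Walk one root-to-leaf path; cost is O(depth), matching the O(D)
// prediction complexity stated below.
double predict(const Node* node, const std::vector<double>& x) {
    while (node->left && node->right)
        node = (x[node->feature] <= node->threshold) ? node->left : node->right;
    return node->value;
}
```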

Complexity & Optimization

Time Complexity

Training: O(N * P * D), where N is the number of samples, P the number of features, and D the maximum depth (assuming each level scans its samples once per feature). Prediction: O(D) per sample.

Space Complexity

O(Nodes), i.e. linear in the number of tree nodes; a full binary tree of depth D has at most 2^(D+1) - 1 nodes.

Optimizations

Recursive implementation with index passing: child calls receive index-based views of the parent's rows rather than deep copies of the data subsets.
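The index-passing idea replaces physical copies of X and y with vectors of row indices. A minimal illustration, assuming column-major storage `X[j][i]`; `partitionIndices` is a hypothetical name, not the library's actual routine:

```cpp
#include <vector>
#include <cstddef>

// Partition a set of row indices by (X[feature][i] <= threshold) without
// copying any rows of X or y: children receive index views of the parent's
// data, so each recursion level allocates only O(n) indices, not O(n * p) data.
void partitionIndices(const std::vector<std::vector<double>>& X,  // column-major: X[j][i]
                      const std::vector<std::size_t>& idx,
                      std::size_t feature, double threshold,
                      std::vector<std::size_t>& leftIdx,
                      std::vector<std::size_t>& rightIdx) {
    for (std::size_t i : idx)
        (X[feature][i] <= threshold ? leftIdx : rightIdx).push_back(i);
}
```

The recursive build then calls itself with `leftIdx` and `rightIdx` while X and y stay put, which is also what keeps traversal cache-friendly.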

Limitations

Regression only.

Use Cases

Non-linear regression.