Algorithm
K-Nearest Neighbors
Description
A non-parametric, lazy learning algorithm that builds no explicit model during training; it simply stores the training instances. Prediction is performed by searching for the k most similar instances in the feature space. Our implementation is optimized for dense vectors, using efficient Euclidean distance calculations and C++ standard library partial sorting to find the top-k neighbors without sorting the entire dataset.
$$ d(\mathbf{p}, \mathbf{q}) = \sqrt{ \sum_{i=1}^n (q_i - p_i)^2 } $$
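Because the square root is monotonic, neighbors can be ranked by the squared distance alone, which is what the implementation compares. A minimal sketch of that computation (the function name is illustrative, not the `KNN.cpp` interface):

```cpp
#include <vector>

// Squared Euclidean distance between two dense vectors of equal length.
// Skipping sqrt preserves the neighbor ordering and saves a libm call per pair.
double squared_distance(const std::vector<double>& p, const std::vector<double>& q) {
    double sum = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        double diff = q[i] - p[i];
        sum += diff * diff;
    }
    return sum;
}
```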
Algorithm Workflow
TRAINING
Store the entire dataset $X_{train}$ and $y_{train}$ in memory.
QUERY
For each new sample $x_{q}$ to predict:
DISTANCE
Compute the squared Euclidean distance $d(x_q, x_i)$ for all $i = 1..N$.
SEARCH
Use `std::partial_sort` to find the $k$ smallest distances in $O(N \log k)$ time.
RETRIEVE
Get the labels of these $k$ nearest neighbors.
VOTE
For Classification, assign the most frequent class (Mode).
AVERAGE
For Regression, assign the mean value of the neighbors.
END
Return the predicted value.
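The query steps above can be sketched end to end for the classification case. This is an illustrative sketch, not the actual `KNN.cpp` interface; the names `predict_class`, `X`, `y`, and `xq` are assumptions:

```cpp
#include <algorithm>
#include <map>
#include <numeric>
#include <vector>

// DISTANCE -> SEARCH -> RETRIEVE -> VOTE, as in the workflow above.
int predict_class(const std::vector<std::vector<double>>& X,
                  const std::vector<int>& y,
                  const std::vector<double>& xq, int k) {
    const int N = static_cast<int>(X.size());
    // DISTANCE: squared Euclidean distance to every training sample.
    std::vector<double> dists(N);
    for (int i = 0; i < N; ++i) {
        double s = 0.0;
        for (std::size_t j = 0; j < xq.size(); ++j) {
            double d = X[i][j] - xq[j];
            s += d * d;
        }
        dists[i] = s;
    }
    // SEARCH: reorder indices so the first k refer to the smallest distances.
    std::vector<int> idx(N);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return dists[a] < dists[b]; });
    // RETRIEVE + VOTE: most frequent label among the k nearest neighbors.
    std::map<int, int> votes;
    for (int i = 0; i < k; ++i) ++votes[y[idx[i]]];
    int best = -1, best_count = 0;
    for (const auto& [label, count] : votes)
        if (count > best_count) { best = label; best_count = count; }
    return best;
}
```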
Implementation Details
Implemented in `KNN.cpp`.
// Indices 0..N-1 (pre-filled with std::iota) are reordered so the first k
// refer to the smallest squared distances, in O(N log k) time.
std::partial_sort(indices.begin(), indices.begin() + k, indices.end(),
                  [&](int i, int j) { return dists[i] < dists[j]; });
// Classification: tally the labels of the k nearest neighbors.
for (int i = 0; i < k; ++i) ++votes[y[indices[i]]];
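For regression, the vote above is replaced by a mean over the neighbor targets. A hedged sketch under the same setup, where `indices` is assumed already partially sorted as in the snippet and `targets` holds the real-valued training labels (names are illustrative, not from `KNN.cpp`):

```cpp
#include <vector>

// Regression: average the targets of the k nearest neighbors.
// `indices` must have its first k entries ordered by ascending distance.
double average_neighbors(const std::vector<int>& indices,
                         const std::vector<double>& targets, int k) {
    double sum = 0.0;
    for (int i = 0; i < k; ++i) sum += targets[indices[i]];
    return sum / k;
}
```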
Complexity & Optimization
Time Complexity
Training: O(1) (the data is only stored). Prediction: O(N * P + N log k) per query, for N training samples and P features.
Space Complexity
O(N * P): the full training set is kept in memory.
Optimizations
Squared distances (the square root is monotonic and can be skipped for ranking) and `std::partial_sort` to avoid a full O(N log N) sort.
Limitations
Prediction cost grows linearly with the training set size; accuracy degrades in high dimensions (curse of dimensionality) and with unscaled features.
Use Cases
Pattern recognition, recommendation, and simple baselines for classification and regression tasks.