k-Nearest Neighbors algorithms

In this blog post, I am going to introduce one of the most intuitive algorithms in the field of Supervised Learning[1], the k-Nearest Neighbors algorithm (kNN).

The original k-Nearest Neighbors algorithm

The kNN algorithm is very intuitive. Indeed, with the assumption that items close together in the dataset are typically similar, kNN infers the output of new sample by first constructing the distance score with every sample in the training dataset. From there, it creates a ‘neighbor zone’ by selecting samples that are ‘near’ the candidate one, and does the supervised tasks based on samples lie inside that zone. The task could be either classification or regression.

Let’s start with the basic kNN algorithm. Let $L = \{(y_i, x_i), i=1, \ldots, N\}$ be our training dataset with $N$ samples belong to $c$ classes, where $y_i \in \{1, \ldots, c\}$ is the class of one sample, and $x_i \in \mathbb{R}^{1\times d}$ denotes the corresponding feature vector that describes the characteristics of that sample. Furthermore, it is necessary to define the suitable distance metric, since it drives the algorithm to select neighbors and make prediction later on. The distance metric $d$, is a mapping $d: X\times X\xrightarrow{}\mathbb{R}^{+}\cup\{0\}$ over a vector space $X \in \mathbb{R}^{d}$, where the following conditions are satisfied $\forall x_i, x_j, x_k \in X$:

  • $d(x_i, x_j) \geq 0$
  • $d(x_i, x_j) = d(x_j, x_i)$
  • $d(x_i, x_j) \leq d(x_i, x_k) + d(x_j, x_k)$
  • $d(x_i, x_j) = 0 \iff x_i = x_j$

In the following steps to describe the k-Nearest Neighbors algorithm, the Euclidean distance will be used as the distance metric $d$.

For any new instance $x^{\prime}$:

  • Find $\{(y_j, x_j)\} \in S_k$ where $S_k$ is the set of $k$ samples that are closest to $x^\prime$
  • The way to define the nearest neighbors is based on distance metric $d$ (Note that we are using Euclidean distance).

$$ \begin{aligned} d_{Euclidean}(x_i, x_j) = \Bigg(\sum_{s=1}^{p}|x_{is} - x_{js}|^{2}\Bigg)^{\frac{1}{2}} \end{aligned} $$

  • The classifier $h$ is defined as:
    $$\ \begin{aligned} h(x^\prime) = \arg\max_{r} \Bigg(\sum_{i=1}^k I(y_i = r)\Bigg) \end{aligned} $$
    where $I(.)$ is the unit function. Note that for the regression problem, the function $h(x^\prime)$ will just an average of all response values $y$ from neighbor samples.

Weighted k-Nearest Neighbors

In the kNN algorithm, we weight all neighbors equally. It may affect the inference steps, especially when the neighbor zone becomes bigger and bigger. To strengthen the effect of ‘close’ neighbors than others, the weighted scheme of k-Nearest Neighbors is applied.

Weighted k-Nearest Neighbors is based on the idea that, within $S_k$, such observations that are closer to $x^\prime$, should get a higher weight than the further neighbors. Now it is necessary to note some properties of any weighting schemes $K$ on any distance metric $d$:

  • $K(a) \geq 0, \forall a \in R^+\cup\{0\}$
  • $\arg\max_{a} K(a) = 0$
  • $K(a)$ decreases monotonously for $d\xrightarrow{} \pm\infty$

For any new instance $x^\prime$:

  • We find $\{(y_j, x_j)\} \in S_k$ where $S_k$ is the set of $k$ samples that are closest to $x^\prime$
  • The $(k+1)$th neighbor is used for standardization of the $k$ smallest distance: $$ \begin{aligned} d_{standardized}(x_i, x^\prime) = \frac{d(x_i, x^\prime)}{d(x_{k+1}, x^\prime)} \end{aligned} $$
  • We transform the standardized distance $d_{\text{standardized}}$ with any kernel function $K$ into weights $w_i = K(d_{standardized}(x_i, x^\prime))$.
  • The classifier $\hat{h}$ is defined as:
    $$ \begin{aligned} \hat{h}(x^\prime) = \arg\max_{r} \Bigg(\sum_{i=1}^kw_i I(y_i = r)\Bigg) \end{aligned} $$

The pros and cons of kNN, and further topics

The kNN and weighted kNN do not rely on any specific assumption on the distribution of the data, so it is quite easy to apply it on many problems as the baseline model. Furthermore, kNN (and its family) is very intuitive for understanding and implementing, which again make it a worthy try-it-first approach for many supervised problems.

Despite those facts, kNN still has challenges in some aspects: It is computationally expensive – especially when the dataset size becomes huge. Another challenge is choosing the ‘correct’ distance metric that best follows the assumption for using this algorithm: items close together in the data set should be typically similar. Lastly, the curse of dimensionality heavily affects the distance metric. Beyer et al.[2] proves that, under some preconditions, in high dimension space, all points converge to the same distance from the query point. In this case, the concept of ‘nearest neighbors’ is no longer meaningful.