Anytime Algorithm For Machine Learning

Definition

In computer science, most algorithms run to completion: they provide a single answer after performing some fixed amount of computation. In machine learning, however, models can take a long time to train and to produce predictions, and the user may wish to terminate the algorithm before it finishes. That is where anytime algorithms come in: an anytime algorithm can return a valid solution even if it is interrupted before it ends, and the longer it keeps running, the better the solution it returns.

What makes anytime algorithms unique is their ability to return many possible outcomes for any given input. An anytime algorithm uses well-defined quality measures to monitor progress in problem solving and to distribute computing resources. While this may sound like dynamic programming, the difference is that an anytime algorithm is fine-tuned through random adjustments rather than sequential ones.

Figure 1: The expected performance of an anytime algorithm

Algorithm prerequisites

Initialize: While some algorithms start with immediate guesses, anytime algorithms take a more calculated approach and have a start-up period before making any guesses.

Growth direction: How the quality of the program’s output, or result, varies as a function of run time.

Growth rate: The amount of improvement with each step. Does it change constantly, or unpredictably?

End condition: The amount of run time needed to reach the final result.

Case studies

  • Clustering with time series

In clustering, we must compute the distance or similarity between every pair of time series, and there are often a large number of measurements. Many studies have shown that dynamic time warping (DTW) is more robust than other distance measures, but its O(N²) complexity makes the computation expensive.

An anytime algorithm makes this much easier. Below is pseudocode for an anytime clustering algorithm.

Algorithm [Clusters] = AnytimeClustering(Dataset)
    aDTW = BuildApproDistMatrix(Dataset)
    Clusters = Clustering(aDTW, Dataset)
    Disp("Setup is done, interruption is possible")
    O = OrderToUpdateDTW(Dataset)
    For i = 1:Length(O)
        aDTW(O(i)) = DTW(Dataset, O(i))
        if UserInterruptIsTrue()
            Clusters = Clustering(aDTW, Dataset)
            if UserTerminateIsTrue()
                return
            endif
        endif
    endfor
    Clusters = Clustering(aDTW, Dataset)

First, we build an approximate distance matrix for the dataset.

Second, we cluster on that approximation to get a first, rough result.

Third, we use a heuristic to choose the order in which approximate entries are replaced by exact DTW distances; after each update, re-clustering gives a better result than before.

Finally, if the algorithm is allowed to keep running, it replaces every entry and produces the final, exact result.
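
To make these steps concrete, here is a minimal, runnable Python sketch of the same idea. It is not the implementation behind the pseudocode above: the approximate distances are plain Euclidean distances between the raw series, entries are refined in simple pair order rather than by a heuristic, a time budget stands in for user interruption, and names such as anytime_clustering and time_budget_s are illustrative.

import time
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw(a, b):
    """Exact DTW distance between two 1-D series, O(len(a) * len(b))."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def approx_dist(a, b):
    """Cheap stand-in for DTW, used to build the initial distance matrix."""
    return float(np.linalg.norm(a - b))

def cluster(dist_matrix, k):
    """Hierarchical clustering from a precomputed distance matrix."""
    condensed = squareform(dist_matrix, checks=False)
    return fcluster(linkage(condensed, method="average"), k, criterion="maxclust")

def anytime_clustering(series, k, time_budget_s=1.0):
    n = len(series)
    D = np.zeros((n, n))
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    for i, j in pairs:                                # step 1: approximate matrix
        D[i, j] = D[j, i] = approx_dist(series[i], series[j])
    labels = cluster(D, k)                            # step 2: rough result, available immediately
    start = time.time()
    for i, j in pairs:                                # step 3: refine entries with exact DTW
        D[i, j] = D[j, i] = dtw(series[i], series[j])
        if time.time() - start > time_budget_s:       # stand-in for a user interrupt
            return cluster(D, k)
    return cluster(D, k)                              # step 4: all entries are exact now

# Example: 20 random walks of length 50, clustered into 3 groups.
rng = np.random.default_rng(0)
walks = [np.cumsum(rng.standard_normal(50)) for _ in range(20)]
print(anytime_clustering(walks, k=3, time_budget_s=0.5))

Interrupting earlier only means that fewer entries of the matrix are exact DTW distances; a valid clustering is always available.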

  • Learning optimal Bayesian networks

Another common example of an anytime algorithm is learning an optimal Bayesian network. This problem is NP-hard: if there are too many variables or events, an exact algorithm will fail due to limited time and memory. That is why the anytime weighted A* algorithm is often applied to find an optimal result.

Figure 2: Bayesian networks of 4 variables

The weighted A* algorithm minimizes the cost function

f(n) = g(n) + ε * h(n)

where n is the current node, g(n) is the cost of the path from the start node to n, h(n) is a heuristic that estimates the cost of the cheapest path from n to the goal node, and ε ≥ 1 is a weight that is gradually lowered toward 1. The anytime algorithm built on this cost function can be stopped and resumed at any time, returning the best solution found so far.
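
To illustrate the anytime behaviour (not the Bayesian-network search itself), here is a minimal restart-based sketch in Python: weighted A* is rerun with ε lowered step by step toward 1, the best solution found so far is kept, and the loop stops whenever a deadline is reached. The toy grid problem and all names (anytime_weighted_astar, deadline_s, eps0) are illustrative assumptions.

import heapq
import time

def weighted_astar(start, goal, neighbors, h, eps):
    """One weighted A* pass minimizing f(n) = g(n) + eps * h(n)."""
    open_heap = [(eps * h(start), 0.0, start, [start])]
    best_g = {start: 0.0}
    while open_heap:
        f, g, node, path = heapq.heappop(open_heap)
        if node == goal:
            return g, path
        for nxt, step_cost in neighbors(node):
            ng = g + step_cost
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                heapq.heappush(open_heap, (ng + eps * h(nxt), ng, nxt, path + [nxt]))
    return float("inf"), None

def anytime_weighted_astar(start, goal, neighbors, h, deadline_s=1.0, eps0=3.0):
    """Rerun weighted A* with eps lowered toward 1; keep the best solution found so far."""
    best_cost, best_path, eps = float("inf"), None, eps0
    start_t = time.time()
    while time.time() - start_t < deadline_s:
        cost, path = weighted_astar(start, goal, neighbors, h, eps)
        if cost < best_cost:
            best_cost, best_path = cost, path   # solution quality can only improve
        if eps == 1.0:                          # eps = 1 is plain A*: the result is optimal
            break
        eps = max(1.0, eps - 0.5)
    return best_cost, best_path

# Toy usage: shortest path on a 10x10 grid with unit step costs.
def grid_neighbors(p):
    x, y = p
    return [((x + dx, y + dy), 1.0)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < 10 and 0 <= y + dy < 10]

manhattan = lambda p: abs(9 - p[0]) + abs(9 - p[1])
print(anytime_weighted_astar((0, 0), (9, 9), grid_neighbors, manhattan))

A full anytime weighted A* such as ARA* reuses search effort between iterations instead of restarting from scratch, but the user-visible behaviour is the same: a usable answer appears early and improves the longer the search is allowed to run.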

k-Nearest Neighbors algorithms

In this blog post, I am going to introduce one of the most intuitive algorithms in the field of Supervised Learning[1], the k-Nearest Neighbors algorithm (kNN).

The original k-Nearest Neighbors algorithm

The kNN algorithm is very intuitive. Under the assumption that items close together in the dataset are typically similar, kNN infers the output of a new sample by first computing a distance score between it and every sample in the training dataset. From there, it creates a ‘neighbor zone’ by selecting the samples that are ‘near’ the candidate one, and performs the supervised task based on the samples that lie inside that zone. The task can be either classification or regression.

Let’s start with the basic kNN algorithm. Let $L = \{(y_i, x_i), i=1, \ldots, N\}$ be our training dataset with $N$ samples belonging to $c$ classes, where $y_i \in \{1, \ldots, c\}$ is the class of a sample and $x_i \in \mathbb{R}^{1\times d}$ denotes the corresponding feature vector describing the characteristics of that sample. Furthermore, it is necessary to define a suitable distance metric, since it drives the algorithm to select neighbors and make predictions later on. A distance metric $d$ is a mapping $d: X\times X\xrightarrow{}\mathbb{R}^{+}\cup\{0\}$ over a vector space $X \subseteq \mathbb{R}^{d}$ such that the following conditions are satisfied $\forall x_i, x_j, x_k \in X$:

  • $d(x_i, x_j) \geq 0$
  • $d(x_i, x_j) = d(x_j, x_i)$
  • $d(x_i, x_j) \leq d(x_i, x_k) + d(x_j, x_k)$
  • $d(x_i, x_j) = 0 \iff x_i = x_j$

In the following steps to describe the k-Nearest Neighbors algorithm, the Euclidean distance will be used as the distance metric $d$.

For any new instance $x^{\prime}$:

  • Find the set $S_k$ of the $k$ training samples $(y_j, x_j)$ that are closest to $x^\prime$.
  • Nearest neighbors are defined with respect to the distance metric $d$ (note that we are using the Euclidean distance):

$$ \begin{aligned} d_{Euclidean}(x_i, x_j) = \Bigg(\sum_{s=1}^{d}|x_{is} - x_{js}|^{2}\Bigg)^{\frac{1}{2}} \end{aligned} $$

  • The classifier $h$ is defined as:
    $$\ \begin{aligned} h(x^\prime) = \arg\max_{r} \Bigg(\sum_{i=1}^k I(y_i = r)\Bigg) \end{aligned} $$
    where $I(\cdot)$ is the indicator function. Note that for a regression problem, $h(x^\prime)$ is simply the average of the response values $y$ of the neighbor samples. A minimal implementation sketch follows below.
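
Here is a minimal sketch of this classifier in Python, assuming the training data are stored as NumPy arrays; the function name knn_predict and its defaults are illustrative, not a particular library’s API.

import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Plain kNN classification with Euclidean distance and majority vote."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # distance to every training sample
    neighbors = np.argsort(dists)[:k]                 # indices of the k closest samples
    votes = y_train[neighbors]
    classes, counts = np.unique(votes, return_counts=True)
    return classes[np.argmax(counts)]                 # majority class among the neighbors

For regression, the majority vote would simply be replaced by votes.mean().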


Weighted k-Nearest Neighbors

In the basic kNN algorithm, we weigh all neighbors equally. This can affect the inference step, especially as the neighbor zone grows larger. To strengthen the effect of ‘close’ neighbors relative to more distant ones, a weighted scheme of k-Nearest Neighbors is applied.

Weighted k-Nearest Neighbors is based on the idea that, within $S_k$, observations closer to $x^\prime$ should get a higher weight than the more distant neighbors. It is necessary to note some properties that any weighting scheme (kernel) $K$ applied to a distance metric $d$ should satisfy:

  • $K(a) \geq 0, \forall a \in \mathbb{R}^+\cup\{0\}$
  • $\arg\max_{a} K(a) = 0$, i.e. $K$ attains its maximum at $a = 0$
  • $K(a)$ decreases monotonically as $a\xrightarrow{} \infty$

For any new instance $x^\prime$:

  • We find the set $S_k$ of the $k$ training samples $(y_j, x_j)$ that are closest to $x^\prime$.
  • The $(k+1)$-th neighbor is used for standardization of the $k$ smallest distances: $$ \begin{aligned} d_{standardized}(x_i, x^\prime) = \frac{d(x_i, x^\prime)}{d(x_{k+1}, x^\prime)} \end{aligned} $$
  • We transform the standardized distance $d_{\text{standardized}}$ with any kernel function $K$ into weights $w_i = K(d_{standardized}(x_i, x^\prime))$.
  • The classifier $\hat{h}$ is defined as follows (a short implementation sketch is given after this list):
    $$ \begin{aligned} \hat{h}(x^\prime) = \arg\max_{r} \Bigg(\sum_{i=1}^kw_i I(y_i = r)\Bigg) \end{aligned} $$
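
Continuing the NumPy sketch above, a minimal version of the weighted scheme might look as follows; the triangular kernel used as the default is just one kernel that satisfies the three properties listed earlier, and all names are again illustrative.

import numpy as np

def weighted_knn_predict(X_train, y_train, x_new, k=5,
                         kernel=lambda a: np.maximum(1.0 - a, 0.0)):
    """Weighted kNN: standardize the k smallest distances by the (k+1)-th, weight with a kernel."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # assumes len(X_train) > k
    order = np.argsort(dists)
    neighbors, ref = order[:k], dists[order[k]]       # (k+1)-th distance used for standardization
    weights = kernel(dists[neighbors] / ref)          # w_i = K(d_standardized)
    scores = {}
    for w, label in zip(weights, y_train[neighbors]):
        scores[label] = scores.get(label, 0.0) + w    # weighted vote per class
    return max(scores, key=scores.get)

With this kernel, the closest neighbors receive weights near 1, while a neighbor almost as far away as the $(k+1)$-th one contributes almost nothing.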

The pros and cons of kNN, and further topics

The kNN and weighted kNN algorithms do not rely on any specific assumption about the distribution of the data, so they are quite easy to apply to many problems as a baseline model. Furthermore, kNN (and its family) is very intuitive to understand and implement, which again makes it a worthy try-it-first approach for many supervised problems.

Despite those strengths, kNN still has challenges in some aspects. It is computationally expensive, especially when the dataset becomes huge. Another challenge is choosing the ‘correct’ distance metric, the one that best matches the assumption behind the algorithm: items close together in the dataset should be typically similar. Lastly, the curse of dimensionality heavily affects the distance metric. Beyer et al.[2] prove that, under some preconditions, in high-dimensional space all points converge to almost the same distance from the query point; in this case, the concept of ‘nearest neighbors’ is no longer meaningful.
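
A quick numerical illustration of this effect (uniform random points in the unit hypercube; the exact numbers depend on the random seed and are not taken from Beyer et al.):

import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    X = rng.random((1000, dim))          # 1000 random points in the unit hypercube
    q = rng.random(dim)                  # a random query point
    d = np.linalg.norm(X - q, axis=1)    # Euclidean distances to the query
    print(f"dim={dim:4d}  farthest/nearest ratio = {d.max() / d.min():.2f}")

As the dimension grows, the ratio of the farthest to the nearest distance shrinks toward 1, which is precisely the sense in which ‘nearest’ stops being meaningful.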

Efficient Algorithms: An overview

Motivation

What makes computers useful to us is primarily their ability to solve problems. The procedure by which a computer solves a problem is an algorithm. With the growing number of algorithms available for solving data-related problems, there is an increasing demand for data scientists to understand algorithm performance well enough to choose the right algorithm for their problem.

Having a general perception of the efficiency of an algorithm helps to shape the thought process for creating or choosing better algorithms. With this intention in mind, I would like to create a series of posts discussing what makes a good algorithm in practice, or, for short, an efficient algorithm. This article is the first step of the journey.

Define Efficiency

An algorithm is considered efficient if its resource consumption, also known as computational cost, is at or below some acceptable level. Roughly speaking, ‘acceptable’ means: it will run in a reasonable amount of time or space on an available computer, typically as a function of the size of the input.

There are many ways in which the resources used by an algorithm can be measured: the two most common measures are speed and memory usage. In the next two sections, we will look at how the efficiency of an algorithm is measured from two different perspectives: that of theoreticians and that of practitioners.

Theoreticians' perspective

Theoreticians are interested in measuring the efficiency of an algorithm without actually having to run it on particular machines and inputs. The key idea is that they do not consider the runtime of the algorithm on any particular input. Rather, they look at what are known as asymptotic runtimes; in other words, they look at how the runtime scales with the input size n as n gets larger. Does the runtime scale proportionally to n, or to n squared, or perhaps exponentially in n? These rates of growth are so different that, as long as n is sufficiently large, constant factors coming from other measures (such as temporary or long-term disk usage) are relatively small and can be neglected.
Fig 1: An illustration of time complexity using asymptotic notation for different functions
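
To make “how the runtime scales with n” precise, the standard tool is Big-O notation, stated here for reference:

$$ f(n) = O(g(n)) \iff \exists\, c > 0,\ \exists\, n_0 \geq 0 : \ f(n) \leq c \cdot g(n) \quad \forall n \geq n_0 $$

For example, an algorithm that takes $3n^2 + 10n$ steps is $O(n^2)$: for large $n$ the $n^2$ term dominates, and the constant 3 is exactly the kind of machine-dependent factor that theoreticians ignore.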

Practitioner's perspective

While certainly useful, the asymptotic runtime of an algorithm doesn’t tell the whole story. Some algorithms have good asymptotic runtime, but constants that are so huge that they effectively can’t be used. Ever. Lipton calls them Galactic Algorithms. A galactic algorithm is an algorithm that is wonderful in its asymptotic behavior but is never used to compute anything.
Fig 2: A fun exchange between a theoretician and a practitioner

In practice, other factors can affect the efficiency of an algorithm, such as requirements for accuracy and/or reliability. As detailed below, how an algorithm is implemented can also have a significant effect on actual efficiency, though many aspects of this relate to optimization issues.

Implementation issues can also affect efficiency: the choice of programming language, how the algorithm is coded, the choice of compiler for a particular language, the compilation options used, or even the operating system. In many cases, a language implemented by an interpreter may be much slower than a language implemented by a compiler.
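
As a small, machine-dependent illustration of that last point, the snippet below (a sketch whose timings will vary from machine to machine) compares the same O(n) sum of squares written as an interpreted Python loop and as a call into NumPy's compiled routines:

import timeit
import numpy as np

data = list(range(1_000_000))
arr = np.arange(1_000_000, dtype=np.int64)

# Same asymptotic cost, different implementations of a sum of squares.
t_loop = timeit.timeit(lambda: sum(x * x for x in data), number=10)
t_vec = timeit.timeit(lambda: int(np.dot(arr, arr)), number=10)
print(f"interpreted loop: {t_loop:.3f}s   compiled (NumPy): {t_vec:.3f}s")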