Definition

In computer science, most of algorithms run to completion: they provide a single answer after performing some fixed amount of computation. But nowadays, in machine learning generation, the models take long time to train and predict the result, the user may wish to terminate the algorithm prior to completion. That’s where anytime algorithm came in, an algorithm that can return a valid solution even if it is interrupted before it ends. The longer it keeps running, the better solution the user get.

What makes anytime algorithms unique is their ability to return many possible outcomes for any given input. An anytime algorithm uses many well defined quality measures to monitor progress in problem solving and distributed computing resources. While this may sound like dynamic programming, the difference is that it is fine-tuned through random adjustments, rather than sequential.

Figure 1: The expected performance of anytime algorithm

Algorithm prerequisites

Initialize: While some algorithms start with immediate guesses, anytime algorithm take more calculated approach and have a start – up period before making any guesses

Growth direction: How the quality of the program’s “output” or result, varies as run time

Growth rate: Amount of increase with each step. Does it change constantly or unpredictably

End condition: The amount of runtime needed

Case studies

• Clustering with time series

In clustering, we must compute distance or similarity between pairs of time series and we also have a lot of measurements. As many researches proved that dynamic time warping (DTW) is more robust than others, but the complexity O(N2) of DTW make computation more trouble.

With anytime algorithm, the life is easier than ever. Below is pseudocode for anytime clustering algorithm.

Algorithm [Clusters] = AnytimeClustering(Dataset)
3.  Disp("Setup is done, interruption is possible")
4.  O = OrderToUpdateDTW(Dataset)
5.  For i = 1:Length(O)
7.     if UserInterruptIsTrue()
9.        if UserTerminateIsTrue()
10.          return
11.       endif
12.    endif
13. endfor
14. Clusters = Clustering(aDTW, Dataset)

Firstly, we need to approximate the distance matrix of dataset.

Secondly, we do cluster to get the very basic result.

Thirdly, we make some heuristic to find the optimal solution to update the distance matrix, after each updating, we get a better result than before.

Lastly, if the algorithm keeps running longer, it will finish the task and we can get the final optimal output.

• Learning optimal Bayesian networks

Another common example for anytime algorithm is searching optimal output in Bayesian networks. As we know that Bayesian networks is NP – hard. If we have too many variables or events, the algorithm will fail due to limited time and memory. That’s why anytime weighted A* algorithm is often applied to find an optimal result.

Figure 2: Bayesian Networks of 4 variables

Weighted A* algorithm use this cost function to minimize

f(n) = g(n) + ε * h(n)

where n is the last node on the graph, g(n) is the cost of the path from the start node to n and h(n) is a heuristic that estimates the cost of the cheapest path from n to the last node, ε gradually lowers to 1. The anytime algorithm bases on this cost function to stop and continue at any time.

k-Nearest Neighbors algorithms

In this blog post, I am going to introduce one of the most intuitive algorithms in the field of Supervised Learning[1], the k-Nearest Neighbors algorithm (kNN).

The original k-Nearest Neighbors algorithm

The kNN algorithm is very intuitive. Indeed, with the assumption that items close together in the dataset are typically similar, kNN infers the output of new sample by first constructing the distance score with every sample in the training dataset. From there, it creates a ‘neighbor zone’ by selecting samples that are ‘near’ the candidate one, and does the supervised tasks based on samples lie inside that zone. The task could be either classification or regression.

Let’s start with the basic kNN algorithm. Let be our training dataset with samples belong to classes, where is the class of one sample, and denotes the corresponding feature vector that describes the characteristics of that sample. Furthermore, it is necessary to define the suitable distance metric, since it drives the algorithm to select neighbors and make prediction later on. The distance metric , is a mapping over a vector space , where the following conditions are satisfied :

In the following steps to describe the k-Nearest Neighbors algorithm, the Euclidean distance will be used as the distance metric .

For any new instance :

• Find where is the set of samples that are closest to
• The way to define the nearest neighbors is based on distance metric (Note that we are using Euclidean distance).

• The classifier is defined as:

where is the unit function. Note that for the regression problem, the function will just an average of all response values from neighbor samples.

Please also check Vietnam AI Lab

Weighted k-Nearest Neighbors

In the kNN algorithm, we weight all neighbors equally. It may affect the inference steps, especially when the neighbor zone becomes bigger and bigger. To strengthen the effect of ‘close’ neighbors than others, the weighted scheme of k-Nearest Neighbors is applied.

Weighted k-Nearest Neighbors is based on the idea that, within , such observations that are closer to , should get a higher weight than the further neighbors. Now it is necessary to note some properties of any weighting schemes on any distance metric :

• decreases monotonously for

For any new instance :

• We find where is the set of samples that are closest to
• The th neighbor is used for standardization of the smallest distance:
• We transform the standardized distance with any kernel function into weights .
• The classifier is defined as:

The pros and cons of kNN, and further topics

The kNN and weighted kNN do not rely on any specific assumption on the distribution of the data, so it is quite easy to apply it on many problems as the baseline model. Furthermore, kNN (and its family) is very intuitive for understanding and implementing, which again make it a worthy try-it-first approach for many supervised problems.

Despite those facts, kNN still has challenges in some aspects: It is computationally expensive – especially when the dataset size becomes huge. Another challenge is choosing the ‘correct’ distance metric that best follows the assumption for using this algorithm: items close together in the data set should be typically similar. Lastly, the curse of dimensionality heavily affects the distance metric. Beyer et al.[2] proves that, under some preconditions, in high dimension space, all points converge to the same distance from the query point. In this case, the concept of ‘nearest neighbors’ is no longer meaningful.

Motivation

What makes computers useful for us is primarily the ability to solve problems. The procedure in which computers solve a problem is an algorithm.  In the recent context of increasing number of algorithms available for solving data-related problems, there is increasing demand for higher level of understanding of algorithm’s performance in order for data scientists to choose the right algorithms for their problems.

Having a general perception for efficiency of an algorithm would help shaping the thought process for creating or choosing better algorithms. With this intention in mind, I would like to create a series of posts to discuss about what makes a good algorithm in practice, or in short, efficient algorithm. And this article is the first step of the journey.

Define Efficiency

An algorithm is considered efficient if its resource consumption, also known as computational cost, is at or below some acceptable level. Roughly speaking, ‘acceptable’ means: it will run in a reasonable amount of time or space on an available computer, typically as a function of the size of the input.

There are many ways in which the resources used by an algorithm can be measured: the two most common measures are speed and memory usage. In the next 2 sections, we will be looking at the two different perspectives for measuring efficiency of an algorithm from theoretician and practitioners.

Theoreticians perspective

Theoreticians are interested in measuring efficiency of algorithm without actually have to run it in several machines and input size. The key idea is that they are not going to actually consider the runtimes of the algorithm on any particular input. Rather, they look at what are known as asymptotic runtimes. Or in other words, they look at how the runtime scale with input size (n) as n gets larger. Does the output scale proportional to n, or proportional to n squared, or maybe exponential in n? In fact, these rate of growth are so different that as long as n is sufficiently large, constant multiples that come from other measures like temporary disk usage, long-term disk usage would be relatively small and neglected.

Practitioner perspective

While certainly useful, the asymptotic runtime of an algorithm doesn’t tell the whole story. There are algorithms that have good asymptotic runtime, but constants that are so huge that they effectively can’t be used. Ever. Lipton calls them Galactic Algorithms. A galactic algorithm is an algorithm that is wonderful in its asymptotic behavior, but is never used to actual compute anything.

In practice, there are other factors which can affect the efficiency of an algorithm, such as requirements for accuracy and/or reliability. As detailed below, the way in which an algorithm is implemented can also have a significant effect on actual efficiency, though many aspects of this relate to optimization issues.

Implementation issues can also have an effect on efficiency, such as the choice of programming language, or the way in which the algorithm is actually coded, or the choice of a compiler for a particular language, or the compilation options used, or even the operating system being used. In many cases a language implemented by an interpreter may be much slower than a language implemented by a compiler.