Introduction to Feature Engineering

Introduction

In any modeling process, three core concepts will always exist:

  • Data.
  • Features.
  • Type of model and its corresponding parameters.

Features are a measurable representation of the data: they are the format in which the data is processed by the model, so creating features from data is an unavoidable part of any modeling process. Moreover, beyond enhancing data quality and performing model selection, building better features is another way to improve model performance. At the same time, domain knowledge about the data shapes the feature creation process, either by imposing requirements or by suggesting the direction to take. Knowing how to create features, how to do it well, and how to bring domain knowledge into the process is therefore a valuable skill for reaching good results faster.

This leads us to the main purpose of this blog post: an introduction to feature engineering. Hopefully it gives a sense of what the practice involves and provides a starting point for anyone who would like to study feature engineering further.

What is feature engineering?

Feature engineering can be described as the process of transforming information in the data into features that effectively enhance the model’s performance, possibly with domain knowledge as an aid.

Because newly created features are not always useful, and a large number of features can easily lead to overfitting or the curse of dimensionality, feature engineering goes hand in hand with feature selection, so that only the contributing features are kept for the model. Beyond that, regularization or kernel methods can also help to limit the effective growth in the number of features.
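As a small illustration of pairing feature creation with feature selection, here is a minimal sketch using scikit-learn’s univariate selection; the synthetic dataset and the choice of k = 5 are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic dataset: 20 candidate features, only 5 of them informative.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Keep only the k features that score best against the target,
# discarding candidates that do not contribute.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (200, 5)
```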

Feature engineering process

Since the modeling process is iterative, the feature engineering process is iterative as well. It can be summarized like this: after data gathering and preprocessing, data analysis and assumptions serve as the basis for an initial feature set; then, after testing it through the model’s results, a decision is made on whether further feature engineering is needed.

The steps of a feature engineering process are:

  • Look at the data to get a direction for which features to start with.
  • Create the features.
  • Evaluate how effective the created features are for the result.
  • If the result is still unsatisfactory, reassess and set a new direction for feature engineering.

Mindset and consideration for feature engineering

Mindset

Feature engineering is not just a collection of methods for creating features; its principles have also evolved into best practices and intuition for the feature creation process, helping to avoid mistakes and boost performance as a whole.

As data and use cases are situational, good feature engineering practice is mostly achieved through trial and analysis. There is no fully systematic way to do it, but there are underlying reasons why practitioners tend to do things a certain way. The feature engineering mindset is to actively run experiments, or study past precedents, to look for deeper principles and, in the end, shape the intuition needed to carry out the work.

Consideration

From “Feature Engineering and Selection: A Practical Approach for Predictive Models” (Kuhn & Johnson, 2019), some points are worth noting:

  • Overfitting is a concern when failed feature engineering introduces features that are relevant to the current dataset but bear no relationship to the outcomes once new data is included.
  • To find links between the predictors and the outcomes to be predicted, either supervised or unsupervised data analysis can be applied.
  • Given the “No Free Lunch” theorem, trying several types of models to see which one works best is the best course of action.
  • Other concepts related to this topic: the model vs. the modeling process, model bias and variance, experience-driven vs. empirically driven modeling, and big data.

They also walk through an example that illustrates these points:

  • A trial-and-error process is needed to find the best combination.
  • The interaction between models and features is complex and somewhat unpredictable; however, the effect of the feature set can be more significant than the effect of choosing a different model.
  • Given the same set of well-chosen features, strong performance can be achieved regardless of the model type.

An introduction to feature engineering techniques, following an example

Different data types call for corresponding feature engineering techniques. This part of the blog post introduces techniques by linking them to a data type, so that when you come across that data type, the related techniques can be identified immediately.

Basic time-related features for machine learning models

Introduction

For data that carries time-related information, time features can be created to add more information for the models.

Since handling time series in machine learning is a broad topic, this article only aims to introduce basic ways to create time features for these models.

Types of data expected for this application

Transactional data, or anything similar to it, is expected to be the most common type for this application. Other kinds of data with timestamp information for each data point should also work well with this approach to some extent.

Considerations before attempting: the need to analyze the problem and its scope

Data with a time element can be presented as a time series: a set of data points describing an entity, ordered by an index of time. One defining aspect of a time series is that each observation is expected to depend on the previous ones, with later values correlated to earlier ones. In such cases, using time series models for forecasting is the straightforward way to use this data. Another way is to use feature engineering to transform the data into features that can feed supervised machine learning models, which is the focus of this article.

Whether to use a time series model or adapt a machine learning model depends on the situation. In some cases, domain knowledge or business requirements will influence this decision. It is better to analyze the problem first to see whether one or both types of models are needed.

Regardless of domain knowledge or business requirements, the decision should always consider the efficiency the approach will bring, in terms of both accuracy and computational cost.

Basic methods

A first preprocessing step for an initial set of time features: extracting time information from timestamps

The most straightforward thing to do is to extract basic time units, for instance hour, date, month, and year, into separate features. Another kind of information that can be extracted is the characteristic of the time: which part of the day it falls in (morning, afternoon), whether it is a weekend, whether it is a holiday, and so on.
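As a minimal sketch of this extraction step, assuming the data sits in a pandas DataFrame with a timestamp column (all column names here are illustrative):

```python
import pandas as pd

# Hypothetical transaction records, one timestamp per data point.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-01-04 09:15", "2021-01-09 14:30", "2021-01-10 20:45",
    ]),
    "amount": [12.5, 30.0, 7.25],
})

# Basic time units, each in its own feature column.
df["hour"] = df["timestamp"].dt.hour
df["day"] = df["timestamp"].dt.day
df["month"] = df["timestamp"].dt.month
df["year"] = df["timestamp"].dt.year

# Characteristics of the time: part of day, weekend flag.
df["is_morning"] = df["timestamp"].dt.hour.between(6, 11)
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5  # 5 = Sat, 6 = Sun
```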

For some business requirements or domains, these initial features are already needed on their own, to see whether the observed values follow those factors. For example, suppose the data records the timestamps of customers visiting a shop and their purchases. The business needs to know at which hours, dates, or months a customer tends to come and buy, so that follow-up actions can be taken to increase sales.

Aggregate techniques

For time data, the best-known and most commonly used technique is to aggregate features by taking statistics (variance, max, min, etc.) of the set of values grouped by the desired time unit: hours, days, months, and so on.

Apart from that, a time window can be defined and aggregates computed by rolling or expanding that window; a short sketch follows the list below.

  • Rolling: the window has a fixed size; to predict a value for the data point at a given time, features are computed by aggregating backward over the number of time steps covered by the window.
  • Expanding: from the data point, the window covers the whole record of past time steps.
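
Here is a minimal sketch of both window types with pandas; the window size of 3 and the daily frequency are arbitrary choices for illustration:

```python
import pandas as pd

# Hypothetical daily values for a single entity, indexed by time.
s = pd.Series(
    [5.0, 7.0, 6.0, 9.0, 8.0, 10.0],
    index=pd.date_range("2021-01-01", periods=6, freq="D"),
    name="amount",
)

# Rolling: fixed window of 3 time steps, aggregated backward.
rolling_mean = s.rolling(window=3).mean()

# Expanding: the window grows to cover the whole past record.
expanding_mean = s.expanding().mean()

# Shift by one step so each feature uses only strictly past values
# when predicting the data point at the current time.
features = pd.DataFrame({
    "rolling_mean_3": rolling_mean.shift(1),
    "expanding_mean": expanding_mean.shift(1),
})
```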

There are also two ways to use aggregation:

  • Aggregating to create new features for the current data points. In this first case, the model is considered to include the time series characteristic, meaning a given moment is likely related to other moments in the recent past.
  • Aggregating to create a new set of data points, with a corresponding new set of features, from the current ones. In this second case, the number of data points the model considers changes, and each new data point summarizes the information of a subset of the initial data points. As a result, the objects the model works on may shift, as mentioned in the considerations above. If the data records only one entity, in other words it contains a single time series, the new data points summarize the other features’ values within the chosen time unit. If the data set instead contains several observed entities, each new data point summarizes the information of one observed entity. A sketch of this second case follows the list.
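
A minimal sketch of the second case, assuming transaction-style data with several customers; grouping by customer and month produces one new summarized data point per pair:

```python
import pandas as pd

# Hypothetical transaction log covering several observed entities.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2021-01-03 10:00", "2021-01-17 11:30",
        "2021-01-05 09:00", "2021-01-05 18:00", "2021-02-01 12:00",
    ]),
    "amount": [20.0, 35.0, 5.0, 12.0, 40.0],
})

# One new data point per (customer, month), summarizing the
# original rows that fall into that group.
summary = (
    df.groupby(["customer_id", df["timestamp"].dt.to_period("M")])["amount"]
      .agg(["count", "sum", "mean"])
      .reset_index()
)
print(summary)
```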

How to decide on the objects of focus and the approach is situational, but for a fresh problem and fresh data with no specific requirements or prior domain knowledge, it is better to consider all of them for the model and run feature selection to see whether the created time features are of any value.

Dealing with hours of the day: circular data

For some needs, the focus is on a specific time of day; detecting fraudulent transactions is a good example. To find something like the most frequent time at which a behavior is performed, the arithmetic mean can be misleading and is not a good representation. The important point is that the hour of the day is circular data: it should be represented on a circular axis, with values between 0 and 2π. For a better representation of the mean, using the von Mises distribution to obtain a periodic mean is a suitable approach in this situation (Mishtert, 2019).
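As a minimal sketch, scipy’s circular mean can serve as a practical stand-in here (it is the maximum-likelihood mean direction of a von Mises fit); the hours below are made up to show the effect:

```python
import numpy as np
from scipy.stats import circmean

# Hypothetical hours at which a behavior occurred, clustered
# around midnight.
hours = np.array([21.0, 23.0, 0.0, 1.0])

# The arithmetic mean lands nowhere near the actual cluster.
print(hours.mean())                     # 11.25

# Mapping hours onto a circle (24 h = 2*pi) gives a sensible mean.
print(circmean(hours, high=24, low=0))  # ~23.3, just before midnight
```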

Validation for the model

Before building the model, a validation set needs to be selected from the data. Usually, to avoid overfitting, the data is randomly shuffled and then divided into a training set and a validation set. In this situation, however, shuffling should not be done: it risks putting past data in the validation set and future data in the training set, in other words using the future to predict the past.
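A minimal sketch of an order-preserving split with scikit-learn’s TimeSeriesSplit, assuming the rows are already sorted chronologically:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical feature matrix and target, rows sorted by time.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Each validation fold lies strictly in the future relative to
# its training fold; nothing is shuffled.
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in tscv.split(X):
    print("train:", train_idx, "validate:", val_idx)
```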