Introduction
Every modeling process involves three core concepts:
- Data.
- Features.
- Type of model and its corresponding parameters.
Features are a measurable representation of the data, and they are the form in which the model consumes it. Creating features from data is therefore a required step in any modeling process. Better features are also one way to improve model performance, alongside improving data quality and choosing a better model. Domain knowledge about the data shapes feature creation as well, either by imposing requirements or by suggesting directions to explore. Knowing how to create features, how to do it well, and how to bring domain knowledge into the process is therefore a skill that makes good results easier to reach.
This leads to the main purpose of this blog post: an introduction to feature engineering. Hopefully it gives a sense of how the process works and a starting point for anyone who wants to study feature engineering further.
What is feature engineering?
Feature engineering is the process of transforming the information in data into features that improve the model’s performance, optionally with the help of domain knowledge.
Created features are not always useful, and a large number of features can easily lead to overfitting or the curse of dimensionality. Feature engineering therefore goes hand in hand with feature selection, which keeps only the features that contribute to the model. Regularization and kernel methods can also help limit the growth in the number of features.
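As a rough illustration of feature selection, here is a minimal sketch of a filter-style approach: keep only the features whose absolute correlation with the target exceeds a cutoff. The synthetic data, the feature names, the `select_features` helper, and the 0.3 threshold are all assumptions made for this example, not a recommended setting.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_features(features, target, threshold=0.3):
    """Keep features whose |correlation| with the target exceeds the threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) > threshold
    ]


# Synthetic data (assumption): one informative feature plus five pure-noise features.
rng = random.Random(0)
n = 200
signal = [rng.gauss(0, 1) for _ in range(n)]
target = [2 * s + rng.gauss(0, 0.5) for s in signal]

features = {"signal": signal}
for i in range(5):
    features[f"noise_{i}"] = [rng.gauss(0, 1) for _ in range(n)]

selected = select_features(features, target)
print(selected)
```

A correlation filter only catches linear, one-feature-at-a-time relationships, so in practice it is usually a first pass rather than the whole selection step.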
Feature engineering process
Modeling is an iterative process, and so is feature engineering. In summary: after data gathering and preprocessing, data analysis and assumptions serve as the basis for an initial feature set; the model’s results on that set then inform the decision on whether further feature engineering is needed.
The steps of a feature engineering process are:
- Look at the data to decide which features to start with.
- Create the features.
- Evaluate how much the features contribute to the result.
- If the result is still unsatisfactory, review it and set a new direction for feature engineering.
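The steps above can be sketched as a small loop. This is only an illustration under stated assumptions: the data is synthetic with a logarithmic relationship, the model is a one-variable least-squares line, and the helper names (`fit_line`, `validation_mse`) are made up for this example.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def validation_mse(feature, xs_train, ys_train, xs_valid, ys_valid):
    """Fit on the training split using feature(x); score on the validation split."""
    slope, intercept = fit_line([feature(x) for x in xs_train], ys_train)
    errors = [
        (slope * feature(x) + intercept - y) ** 2
        for x, y in zip(xs_valid, ys_valid)
    ]
    return sum(errors) / len(errors)


# Step 1: look at the data. Here (assumption) y grows with log(x), not with x.
rng = random.Random(0)
xs = [1 + 999 * i / 99 for i in range(100)]
ys = [3 * math.log(x) + rng.gauss(0, 0.1) for x in xs]
xs_train, ys_train = xs[0::2], ys[0::2]
xs_valid, ys_valid = xs[1::2], ys[1::2]

# Step 2: create candidate features.
candidates = {"raw_x": lambda x: x, "log_x": math.log}

# Step 3: evaluate each candidate on held-out data.
scores = {
    name: validation_mse(f, xs_train, ys_train, xs_valid, ys_valid)
    for name, f in candidates.items()
}

# Step 4: decide whether to keep going; here we simply pick the lower error.
best = min(scores, key=scores.get)
print(scores, best)
```

In a real project the fourth step is a judgment call: if the best score is still not good enough, the residuals of the current model are a natural place to look for the next feature idea.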
Mindset and consideration for feature engineering
Mindset
Feature engineering covers more than the methods for creating features. Its principles have also grown into a set of best practices and intuitions that help avoid mistakes and improve overall performance.
Because data and use cases are situational, the best feature engineering practices are mostly found through trial and analysis. There is no systematic recipe, but there are underlying reasons why practitioners tend to do things a certain way. The mindset is to actively run experiments or study past precedents in search of deeper principles, and over time to build the intuition needed to do the work.
Consideration
From “Feature Engineering and Selection: A Practical Approach for Predictive Models” (Kuhn & Johnson, 2019), some points worth noting:
- Overfitting is also a concern in feature engineering: a poorly engineered feature can appear relevant in the current dataset yet have no relationship with the outcome once new data is included.
- Supervised or unsupervised data analysis can be used to find the link between the predictors and the outcome being predicted.
- Given the “No Free Lunch” theorem, the best course of action is to try several types of models and see which one works best.
- Other concepts related to this topic: model vs. modeling process, model bias and variance, experience-driven vs. empirically driven modeling, and big data.
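The first point in the list, a feature that looks relevant in the current dataset but not on new data, can be reproduced with pure noise. The sketch below is an assumption-laden toy, not taken from the book: it generates 200 random candidate features for a 20-row dataset, picks the one most correlated with a random target, and then checks that same feature on fresh data.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
n_train, n_new, n_candidates = 20, 2000, 200

# Target and all candidate features are independent noise:
# by construction, no feature is truly predictive.
y_train = [rng.gauss(0, 1) for _ in range(n_train)]
train_features = [
    [rng.gauss(0, 1) for _ in range(n_train)] for _ in range(n_candidates)
]

# "Feature engineering" by picking whatever correlates best on the small dataset.
train_corrs = [abs(pearson(f, y_train)) for f in train_features]
best_idx = max(range(n_candidates), key=lambda i: train_corrs[i])
train_r = train_corrs[best_idx]

# New data for the chosen feature and the target (still independent noise).
new_feature = [rng.gauss(0, 1) for _ in range(n_new)]
y_new = [rng.gauss(0, 1) for _ in range(n_new)]
new_r = abs(pearson(new_feature, y_new))

print(f"|r| on current data: {train_r:.2f}, |r| on new data: {new_r:.2f}")
```

The apparent relationship on the small dataset is a product of searching over many candidates, which is why engineered features should be judged on data that was not used to find them.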
They also work through another example that illustrates these points:
- A trial-and-error process is needed to find the best combination of features and model.
- The interaction between models and features is complex and unpredictable. However, the effect of the feature set may be more significant than the effect of the model.
- With the right set of features, different types of models can reach similarly strong performance.
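A toy grid of feature sets against model types gives a feel for these points. This is a hedged sketch, not the book's example: the data is synthetic with a quadratic relationship, and the two "models" (a least-squares line and a 3-nearest-neighbor average) are deliberately simple stand-ins written for this illustration.

```python
import random


def fit_linear(fx, ys):
    """One-variable least squares; returns a predict(value) function."""
    n = len(fx)
    mean_x = sum(fx) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in fx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(fx, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda v: slope * v + intercept


def fit_knn(fx, ys, k=3):
    """k-nearest-neighbor regression on one feature; returns predict(value)."""
    pairs = list(zip(fx, ys))

    def predict(v):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - v))[:k]
        return sum(y for _, y in nearest) / k

    return predict


# Synthetic data (assumption): y depends on x squared.
rng = random.Random(0)
xs = [-3 + 6 * i / 99 for i in range(100)]
ys = [x * x + rng.gauss(0, 0.1) for x in xs]
xs_train, ys_train = xs[0::2], ys[0::2]
xs_valid, ys_valid = xs[1::2], ys[1::2]

feature_sets = {"raw_x": lambda x: x, "x_squared": lambda x: x * x}
models = {"linear": fit_linear, "knn": fit_knn}

# Validation MSE for every (feature set, model) combination.
results = {}
for feat_name, feat in feature_sets.items():
    train_f = [feat(x) for x in xs_train]
    for model_name, fit in models.items():
        predict = fit(train_f, ys_train)
        errors = [(predict(feat(x)) - y) ** 2 for x, y in zip(xs_valid, ys_valid)]
        results[(feat_name, model_name)] = sum(errors) / len(errors)

for key, mse in sorted(results.items()):
    print(key, round(mse, 4))
```

On this toy data the linear model depends heavily on the engineered feature while the nearest-neighbor model does not, and once the feature matches the underlying relationship the gap between the two models shrinks. Real datasets will not be this clean, so the grid still has to be run rather than assumed.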
Introduction to feature engineering techniques by following an example
Different data types call for different feature engineering techniques. This part of the blog post introduces the techniques by data type, so that the relevant techniques can be identified quickly whenever a given data type comes up.