- Feature Engineering Made Easy
- Sinan Ozdemir, Divya Susarla
Evaluating unsupervised learning algorithms
This is a bit trickier. Because unsupervised learning is not concerned with predictions, we cannot directly evaluate performance based on how well the model can predict a value. That being said, if we are performing a cluster analysis, such as in the previous marketing segmentation example, then we will usually utilize the silhouette coefficient (a measure of cluster separation and cohesion that ranges from -1 to 1, with higher values indicating better-defined clusters) and some human-driven analysis to decide if a feature engineering procedure has improved model performance or if we are merely wasting our time.
Here is an example of using Python and scikit-learn to import and calculate the silhouette coefficient for some fake data:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Fake tabular data with three well-separated clusters
attributes, _ = make_blobs(n_samples=100, centers=3, random_state=0)
# Labels outputted from a clustering algorithm (KMeans here)
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(attributes)
silhouette_score(attributes, cluster_labels)
We will spend much more time on unsupervised learning later on in this book as it becomes more relevant. Most of our examples will revolve around predictive analytics/supervised learning.
It is important to remember that the reason we are standardizing algorithms and metrics is so that we may showcase the power of feature engineering and so that you may repeat our procedures with success. In practice, you may well be optimizing for something other than accuracy (a true positive rate, for example) and may wish to use decision trees instead of logistic regression. This is not only fine but encouraged. Remember, though, to follow the steps for evaluating a feature engineering procedure and to compare baseline and post-engineering performance.
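The baseline-versus-post-engineering comparison described above can be sketched as follows. This is a minimal illustration, not the book's own procedure: the dataset is synthetic, and the appended squared term merely stands in for a real feature engineering step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic classification data for illustration only
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Baseline: cross-validated accuracy before any feature engineering
baseline = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# "Post-engineering": append a squared term as a stand-in for a real
# feature engineering procedure, then re-evaluate with the same setup
X_engineered = np.hstack([X, X[:, :1] ** 2])
post = cross_val_score(LogisticRegression(max_iter=1000), X_engineered, y, cv=5).mean()

print(baseline, post)  # compare the two cross-validated accuracies
```

The key point is that both numbers come from the same model, metric, and cross-validation scheme; only the features differ.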
It is possible that you are not reading this book for the purpose of improving machine learning performance. Feature engineering is also useful in other domains, such as hypothesis testing and general statistics. In a few examples in this book, we will look at feature engineering and data transformations as applied to the statistical significance of various statistical tests. We will explore metrics such as R² and p-values in order to make judgments about how our procedures are helping.
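As a taste of that statistical angle, here is a hedged sketch of using a correlation coefficient and its p-value to judge a transformation; the data and the log transform are invented for illustration. `scipy.stats.pearsonr` returns the correlation and its p-value.

```python
import numpy as np
from scipy.stats import pearsonr

# Invented data: the response depends on log(raw), plus a little noise
rng = np.random.default_rng(0)
raw = rng.uniform(1, 10, size=200)
response = np.log(raw) + rng.normal(scale=0.1, size=200)

# Correlation (and p-value) of the raw feature vs. the transformed feature
r_raw, p_raw = pearsonr(raw, response)
r_log, p_log = pearsonr(np.log(raw), response)  # log transform as the "engineering"

print(r_raw, r_log)
```

Here the log-transformed feature should correlate more strongly with the response than the raw one, quantifying the benefit of the transformation.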
In general, we will quantify the benefits of feature engineering in the context of three categories:
- Supervised learning: Otherwise known as predictive analytics
  - Regression analysis—predicting a quantitative variable:
    - Will utilize MSE as our primary metric of measurement
  - Classification analysis—predicting a qualitative variable:
    - Will utilize accuracy as our primary metric of measurement
- Unsupervised learning: Clustering—the assigning of meta-attributes as denoted by the behavior of data:
  - Will utilize the silhouette coefficient as our primary metric of measurement
- Statistical testing: Using correlation coefficients, t-tests, chi-squared tests, and others to evaluate and quantify the usefulness of our raw and transformed data
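The two supervised metrics named above are both one-line calls in scikit-learn (the silhouette coefficient was shown earlier). The toy predictions below are invented purely to show the calls:

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# MSE for a regression task: mean of squared (true - predicted) differences
mse = mean_squared_error([3.0, 2.5, 4.0], [2.8, 2.7, 3.6])  # 0.08

# Accuracy for a classification task: fraction of correct labels
acc = accuracy_score([0, 1, 1, 0], [0, 1, 0, 0])  # 0.75

print(mse, acc)
```

Lower is better for MSE; higher is better for accuracy and for the silhouette coefficient.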
In the following few sections, we will look at what will be covered throughout this book.