官术网_书友最值得收藏!

  • Hands-On Data Science with R
  • Vitor Bianchi Lanzetta Nataraj Dasgupta Ricardo Anjoleto Farias
  • 551字
  • 2021-06-10 19:12:25

Predictive analytics (machine learning)

In popular media and literature, predictive analytics is known by various names. The terms are used interchangeably and often depend on personal preferences and interpretations. The terms predictive analytics, machine learning, and statistical learning are technically synonymous, and refer to the field of applying algorithms in machine learning to the data.

The algorithm could be as simple as a line-of-best-fit, which you may have already used in Excel, also known as linear regression. Or it could be a complex deep learning model that implements multiple hidden layers and inputs. In both cases, the mere fact that a statistical model, an algorithm was applied to generate a prediction qualifies the usage as a practice of machine learning.

In general, creating a machine learning model involves a series of steps such as the sequence:

  1. Cleanse and curate the dataset to extract the cohort on which the model will be built.
  2. Analyze the data using descriptive statistics, for example, distributions and visualizations.
  3. Feature engineering, preprocessing, and other steps necessary to add or remove variables/predictors.
  4. Split the data into a train and test set (for example, set aside 80% of the data for training and the remaining 20% for testing your model).
  5. Select appropriate machine learning models and create the model using cross validation.
  6. Select the final model after assessing the performance across models on a given (one or more) cost metric. Note that the model could be an ensemble, that is, a combination of more than one model.
  7. Perform predictions on the test dataset.
  8. Deliver the final model.

The most commonly used languages for machine learning today are R and Python. In Python, the most popular package for machine learning is scikit-learn (http://scikit-learn.org), while in R, there are multiple packages, such as random forest, Gradient Boosting Machine (GBM), kernlab, Support Vector Machines (SVMs), and others.

Although Python's scikit-learn is extremely versatile and elaborate, and in fact the preferred language in production settings, the ease of use and diversity of packages in R gives it an advantage in terms of early adoption and use for machine learning exercises.

The Comprehensive R Archive Network ( CRAN) has a task view page titled  CRAN Task View: Machine Learning & Statistical Learning ( https://cran.r-project.org/web/views/MachineLearning.html) that summarizes some of the key packages in use today for machine learning using R.

Popular machine learning tools such as TensorFlow from Google (https://www.tensorflow.org), XGBoost (http://xgboost.readthedocs.io/en/latest/), and H2O (https://www.h2o.ai) have also released packages that act as a wrapper to the underlying machine learning algorithms implemented in the respective tools.

It is a common misconception that machine learning is just about creating models. While that is indeed the end goal, there is a subtle yet fundamental difference between a model and a good model. With the functions available today, it is relatively easy for anyone to create a model by simply running a couple of lines of code. A good model has business value, while a model built without the rigor of formal machine learning principles is practically unusable for all intents and purposes. A key requirement of a good machine learning model is the judicious use of domain expertise to evaluate results, identify errors, analyze them, and further refine using the insights that subject matter experts can provide. This is where domain knowledge plays a crucial and indispensable role.

主站蜘蛛池模板: 莲花县| 安新县| 兴化市| 大名县| 晋城| 延吉市| 会理县| 墨竹工卡县| 墨竹工卡县| 车险| 买车| 和硕县| 麻阳| 轮台县| 石家庄市| 兴和县| 阿鲁科尔沁旗| 清镇市| 陆川县| 六盘水市| 嘉定区| 兴隆县| 古交市| 高雄县| 麻江县| 慈利县| 临江市| 吴川市| 高邑县| 邵阳县| 彰化市| 邹平县| 绥化市| 织金县| 南陵县| 新乡县| 新龙县| 磴口县| 崇仁县| 富平县| 丽水市|