Chapter 2. Data Preparation for Spark ML

Machine learning professionals and data scientists often spend 70% to 80% of their time preparing data for their machine learning projects. Data preparation can be very hard work, but it is necessary and extremely important because it affects everything that follows. In this chapter, we will therefore cover the data preparation steps needed for our machine learning work, which typically run from data access, data cleaning, and dataset joining through to feature development, so that our datasets are ready for developing ML models on Spark. Specifically, we will discuss the following six data preparation tasks and then end the chapter with a discussion of repeatability and automation:

  • Accessing and loading datasets
    • Publicly available datasets for ML
    • Loading datasets into Spark easily
    • Exploring and visualizing data with Spark
  • Data cleaning
    • Dealing with missing cases and incompleteness
    • Data cleaning on Spark
    • Data cleaning made easy
  • Identity matching
    • Dealing with identity issues
    • Data matching on Spark
    • Data matching made better
  • Data reorganizing
    • Data reorganizing tasks
    • Data reorganizing on Spark
    • Data reorganizing made easy
  • Joining data
    • Spark SQL to join datasets
    • Joining data with Spark SQL
    • Joining data made easy
  • Feature extraction
    • Feature extraction challenges
    • Feature extraction on Spark
    • Feature extraction made easy
  • Repeatability and automation
    • Dataset preprocessing workflows
    • Spark pipelines for preprocessing
    • Dataset preprocessing automation