Programming with Data

It takes a lot of time and effort to deliver data in a format that is ready for its end use. Take the example of an online gaming site that wants to post the high score for each of its games every month. To make this data available, the site's developers would need to set up a database to store all of the scores. In addition, they would need a system to retrieve the top scores from that database every month and display them to the end users.

For the users of our hypothetical gaming site, getting this month's high scores is fairly straightforward. This is because finding out what the high scores are is a rather general use case. A lot of people will want that specific data in that specific form, so it makes sense to develop a system to deliver the monthly high scores.

Unlike the users of our hypothetical gaming site, data programmers have very specialized use cases for the data that they work with. A data journalist following politics may want to visualize trends in government spending over the last few years. A machine learning engineer working in the medical industry may want to develop an algorithm to predict a patient's likelihood of returning to the hospital after a visit. A statistician working for the board of education may want to investigate the correlation between attendance and test scores. In the gaming site example, a data analyst may want to investigate how the distribution of scores changes with the time of day.

A short side note on terminology
Data science as an all-encompassing term can be a bit elusive. Because it is such a new field, the definition of a data scientist can change depending on who you ask. To be more general, the term data programmer will be used in this book to refer to anyone who will find data wrangling useful in their work.

Drawing insight from data requires that all of the information needed is in a format that you can work with. Organizations that produce data (for example, governments, schools, hospitals, and web applications) can't anticipate the exact information that any given data programmer might need for their work. There are too many possible scenarios to make tailoring the data to each one worthwhile. Data is therefore generally made available in its raw format. Sometimes this is enough to work with, but usually it is not. Here are some common reasons why raw data falls short:

  • There may be extra steps involved in getting the data
  • The information needed may be spread across multiple sources
  • Datasets may be too large to work with in their original format
  • There may be far more fields or information in a particular dataset than needed
  • Datasets may have misspellings, missing fields, mixed formats, incorrect entries, outliers, and so on
  • Datasets may be structured or formatted in a way that is not compatible with a particular application

Because of this, it is often the responsibility of the data programmer to perform the following functions:

  • Discover and gather the data that is needed (getting data)
  • Merge data from different sources if necessary (merging data)
  • Fix flaws in the data entries (cleaning data)
  • Extract the necessary data and put it in the proper structure (shaping data)
  • Store it in the proper format for further use (storing data)
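To make these five functions more concrete, here is a minimal sketch in Python that applies each of them, in order, to a tiny made-up dataset based on the gaming site example. Everything in it is an assumption made for illustration: the sample records, the field names (`player_id`, `score`, `played_at`, `name`, `country`), and the function names are hypothetical, and a real project would likely pull data from files, APIs, or databases rather than from strings in the script.

```python
import csv
import io
import json

# Hypothetical raw data (assumption): stands in for files, APIs, or databases.
RAW_SCORES = """player_id,score,played_at
1, 1200 ,2017-03-01T09:15:00
2,,2017-03-01T21:40:00
3,980,2017-03-02T14:05:00
"""

RAW_PLAYERS = """player_id,name,country
1,alice,US
2,bob,CA
3,carol,UK
"""


def get_data(text):
    """Getting data: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_data(scores, players):
    """Merging data: attach player details to each score using player_id."""
    players_by_id = {p["player_id"]: p for p in players}
    return [{**players_by_id.get(s["player_id"], {}), **s} for s in scores]


def clean_data(rows):
    """Cleaning data: trim whitespace, drop rows with no score, use integers."""
    cleaned = []
    for row in rows:
        score = row["score"].strip()
        if not score:
            continue  # skip entries where the score is missing
        cleaned.append({**row, "score": int(score)})
    return cleaned


def shape_data(rows):
    """Shaping data: keep only the fields needed for the analysis."""
    return [{"name": r.get("name"), "score": r["score"]} for r in rows]


def store_data(rows):
    """Storing data: serialize the result as JSON text for further use."""
    return json.dumps(rows, indent=2)


def wrangle():
    """Run the five steps in order and return the stored result."""
    scores = get_data(RAW_SCORES)
    players = get_data(RAW_PLAYERS)
    merged = merge_data(scores, players)
    cleaned = clean_data(merged)
    shaped = shape_data(cleaned)
    return store_data(shaped)


if __name__ == "__main__":
    print(wrangle())
```

Each step is kept in its own small function so that it can be inspected or replaced independently. In practice the steps are rarely this tidy, and the order of cleaning and merging in particular may vary from one dataset to the next.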

This perspective gives some context for the relevance and importance of data wrangling. Data wrangling is sometimes seen as the grunt work of the data programmer, but it is nevertheless an integral part of drawing insights from data. This book will guide you through the skills, most common tools, and best practices for data wrangling. In the following section, I will break down the tasks involved in data wrangling and provide a broad overview of the rest of the book. I will discuss the following steps in detail and provide some examples:

  • Getting data
  • Cleaning data
  • Merging and shaping data
  • Storing data

Following the high-level overview, I will briefly discuss Python and R, the tools used in this book to conduct data wrangling. 
