- Python: Advanced Predictive Analytics
- Ashish Kumar, Joseph Babcock
Chapter 3. Data Wrangling
By now, I assume you are at ease with importing datasets from various sources and exploring their look and feel. Handling missing values, creating dummy variables, and plotting are tasks an analyst (predictive modeller) performs on almost every dataset to make it model-worthy, so an aspiring analyst would do well to master them too.
Next on the list of skills to master in order to juggle data like a pro is data wrangling: put simply, a fancy term for the slicing and dicing of data. If you compare the entire predictive modelling process to a complex surgery performed on a patient, then the preliminary examination with a stethoscope and the diagnostic checks are the data cleaning and exploration process, zeroing in on the ailing area and deciding which part of the body to operate on is data wrangling, and performing the surgery itself is the modelling process.

Any surgeon can vouch that zeroing in on the specific body part is the most critical piece of the puzzle to solve before one can get to the root of the ailment. The same is true of data wrangling. The data is not always in one place or in one table; the information you need for your model may be scattered across different datasets. What does one do in such cases? Nor does one always need the entire dataset: often one needs only a column, a few rows, or some combination of rows and columns. How to perform all this jugglery is the crux of this chapter. Beyond that, the chapter aims to equip the reader with all the props needed on their journey into predictive modelling.
By the end of the chapter, the reader should be comfortable with the following tasks (a combined pandas sketch of all five follows the list):
- Subsetting a dataset: Slicing and dicing data, selecting a few rows and columns based on certain conditions, much like filtering in Excel
- Generating random numbers: Random numbers are an important tool for running simulations and creating dummy data frames
- Aggregating data: A technique for grouping data by the categories of a categorical variable
- Sampling data: This is very important before venturing into the actual modelling; dividing a dataset into training and testing data is essential
- Merging/appending/concatenating datasets: The answer to the problem that arises when the data required for modelling is scattered across different datasets
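Since the book's examples are written in Python with pandas, here is a minimal, self-contained sketch of these five tasks. The data frame, the column names (`city`, `sales`, `price`), and the `regions` lookup table are all hypothetical, invented purely for illustration; they are not datasets used later in the chapter.

```python
import numpy as np
import pandas as pd

# A hypothetical dummy data frame built from random numbers; all
# column names and values are invented for illustration.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'city': rng.choice(['Delhi', 'Mumbai', 'Pune'], size=10),
    'sales': rng.integers(100, 1000, size=10),
    'price': rng.normal(50, 10, size=10),
})

# Subsetting: pick a few columns and filter rows on a condition,
# much like a filter in Excel
subset = df.loc[df['sales'] > 400, ['city', 'sales']]

# Aggregating: group the data by the categories of a categorical variable
by_city = df.groupby('city')['sales'].agg(['mean', 'sum'])

# Sampling: divide the dataset into training and testing data (80/20)
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

# Merging: combine information scattered across two datasets on a key
regions = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Pune'],
                        'region': ['North', 'West', 'West']})
merged = df.merge(regions, on='city', how='left')

# Appending/concatenating: stack the training and testing rows back together
recombined = pd.concat([train, test])
```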
We will be using a variety of public datasets in this chapter. Another good way of demonstrating these concepts is with dummy datasets created using random numbers; in fact, random numbers are used heavily for this purpose, so we will be using a mix of both.
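As one illustration of the point above, here is a short sketch of how a dummy data frame might be built from random draws. It uses NumPy's `Generator` API; the seed and the column names are arbitrary choices made for this sketch, not conventions from the book.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)        # seeded so the 'data' is reproducible

dummy = pd.DataFrame({
    'u': rng.random(5),               # uniform draws on [0, 1)
    'z': rng.normal(0, 1, 5),         # standard-normal draws
    'die': rng.integers(1, 7, 5),     # integers in [1, 6], like dice rolls
})
print(dummy)
```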
Let us now kick-start the chapter by learning about subsetting a dataset. As the chapter unfolds, you will realize how ubiquitous and indispensable this operation is.