
Data cleaning

Data cleaning is a fundamental step in making sure we can produce good results at the end. It is task-specific: the cleaning required for audio data differs from what is needed for images, text, or time series data.

We need to check whether any data is missing and, if so, decide how to deal with it. For example, when an instance is missing a few variables, we can fill each gap with the average of that variable, fill it with a value that the input cannot assume (such as -1 if the variable lies between 0 and 1), or disregard the instance if we have a lot of data.
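The three options for missing data can be sketched with NumPy as follows. This is only an illustration: the array `X` is made-up data, and it assumes that missing values are encoded as `NaN` and that every variable lies between 0 and 1.

```python
import numpy as np

# Hypothetical data: 4 instances, 2 variables in [0, 1]; NaN marks a missing value
X = np.array([
    [0.2, 0.8],
    [np.nan, 0.4],
    [0.6, np.nan],
    [0.4, 0.6],
])

missing = np.isnan(X)

# Option 1: fill each gap with the average of that variable (column),
# computed over the values that are present
col_means = np.nanmean(X, axis=0)
X_mean = np.where(missing, col_means, X)

# Option 2: fill with a value the input cannot assume (-1 is outside [0, 1])
X_sentinel = np.where(missing, -1.0, X)

# Option 3: disregard every instance (row) that has at least one missing value
X_dropped = X[~missing.any(axis=1)]

print(X_mean)
print(X_sentinel)
print(X_dropped)
```

Which option is appropriate depends on the task: mean filling keeps every instance but reduces the variance of the variable, a sentinel value lets a model treat missingness as a signal, and dropping rows is the simplest choice when data is plentiful.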

It is also good to check whether the data respects the limits of the quantities we are measuring. For example, a temperature in Celsius cannot be lower than -273.15 degrees (absolute zero); if a value is lower than that, we know straight away that the data point is unreliable.
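A range check of this kind might look like the following sketch. The readings in `temps_c` are made-up values, and the only physical fact used is that absolute zero is -273.15 degrees Celsius.

```python
import numpy as np

# Absolute zero in degrees Celsius: no valid reading can be below this
ABSOLUTE_ZERO_C = -273.15

# Hypothetical temperature readings; -300.0 is physically impossible
temps_c = np.array([21.5, -5.0, -300.0, 36.6])

# Boolean mask marking the readings that respect the physical limit
valid = temps_c >= ABSOLUTE_ZERO_C

# Keep the reliable readings and report the indices of the unreliable ones
clean_temps = temps_c[valid]
bad_indices = np.flatnonzero(~valid)

print(clean_temps)
print(bad_indices)
```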

Other checks include the format, the data types, and the variance in the dataset.
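As a rough illustration of these checks, the sketch below inspects the shape, the data type, and the per-variable variance of a small made-up array. The expected layout (a two-dimensional array of floats) is an assumption made for the example, not a requirement of any particular library.

```python
import numpy as np

# Hypothetical dataset: 3 instances, 3 variables; the second variable never changes
X = np.array([
    [1.0, 5.0, 0.3],
    [2.0, 5.0, 0.1],
    [3.0, 5.0, 0.2],
])

# Format: we assume one row per instance and one column per variable
has_expected_format = X.ndim == 2

# Data type: numeric models usually expect floating-point inputs
is_float = np.issubdtype(X.dtype, np.floating)

# Variance: a variable with zero variance is constant and carries no information
variances = X.var(axis=0)
constant_columns = np.flatnonzero(np.isclose(variances, 0.0))

print(has_expected_format, is_float)
print(variances)
print(constant_columns)
```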

It's possible to load some clean data directly from scikit-learn, which provides datasets for all sorts of tasks. For example, if we want to load some image data, we can use the following Python code:

from sklearn.datasets import fetch_lfw_people

# Keep only people with at least 70 images and resize each image by a factor of 0.4
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

This data is known as Labeled Faces in the Wild, a dataset for face recognition.
