Data cleansing

Data cleansing is the process of identifying and fixing corrupt or inaccurate records in a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data, and then replacing, modifying, or deleting the affected data. Data entry and acquisition are inherently prone to errors, both simple and complex. Much effort goes into this frontend process, but errors remain common in large datasets. In big data management, data cleansing is very important for the following reasons:

  • The main data is usually spread across different legacy systems, including spreadsheets, text files, and web pages, so it has to be reconciled into a consistent form before it can be used together
  • By ensuring that the data is as accurate as possible, an organization can maintain good relationships with its customers and improve its own efficiency
  • Correct and complete data provides better insights into the process that the data concerns

There are libraries for Python (pandas) and R (dplyr) that can help with this process. In addition, there are dedicated data preparation tools, including commercial services such as Trifacta and Paxata, and the open source OpenRefine.
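As a rough illustration of the replace, modify, and delete steps described above, here is a minimal sketch using pandas. The `raw` table, its column names, the `clean_customers` helper, and the 0 to 120 age range are all hypothetical and chosen only for this example; real cleansing rules depend on the dataset.

```python
import pandas as pd

# Hypothetical customer records with typical defects: a duplicate row,
# inconsistent casing and whitespace, a missing name, and an impossible age.
raw = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 3, 4],
        "name": [" Alice ", "bob", "bob", "Carol", None],
        "age": [34, 27, 27, -5, 41],
    }
)


def clean_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of df; the rules below are illustrative only."""
    out = df.copy()
    # Modify: normalize whitespace and capitalization in names.
    out["name"] = out["name"].str.strip().str.title()
    # Replace: treat ages outside an assumed plausible range as missing.
    out["age"] = out["age"].where(out["age"].between(0, 120))
    # Delete: drop exact duplicate records.
    out = out.drop_duplicates()
    # Delete: drop incomplete records that have no name.
    out = out.dropna(subset=["name"])
    return out.reset_index(drop=True)


cleaned = clean_customers(raw)
print(cleaned)
```

The function works on a copy, so the original records stay available for auditing. Whether an implausible value should be replaced with a missing marker, corrected, or deleted together with its record is a judgment call that depends on how the data will be used.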