- Hands-On Data Science with R
- Vitor Bianchi Lanzetta, Nataraj Dasgupta, Ricardo Anjoleto Farias
When large datasets are involved, the data science practitioner may spend a considerable amount of time cleansing and curating the data. In fact, it is not uncommon for data scientists to spend the majority of their time preparing data for analysis: a commonly cited rule of thumb is that roughly 80% of a project goes into data management and only the remaining 20% into the actual analysis.
While this may sound overly general, the growth of big data, that is, large-scale datasets usually in the range of terabytes, has meant that considerable time and effort are needed to extract data before the actual analysis takes place. Real-world data is seldom perfect. Issues with real-world data range from missing variables to incorrect entries and other deficiencies, and the sheer size of the datasets poses a formidable challenge of its own.
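To make this concrete, the short R sketch below shows a typical first pass over a small, made-up data frame: counting missing values and flagging implausible entries. The column names and the cut-off used here are purely illustrative.

```r
# A minimal sketch on a made-up data frame illustrating two common
# real-world defects: missing values and implausible entries.
df <- data.frame(
  age    = c(34, NA, 28, 131, 45),      # 131 is an implausible entry
  income = c(52000, 48000, NA, 61000, 57000)
)

# Count missing values per column
colSums(is.na(df))

# Keep only complete rows
complete <- df[complete.cases(df), ]

# Flag implausible ages for review rather than silently dropping them
suspect <- df[!is.na(df$age) & df$age > 120, ]
```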
Technologies such as Hadoop, Spark, and NoSQL databases have addressed the needs of the data science community for managing and curating terabytes, if not petabytes, of information. These tools typically form the first step in the overall data science process, preceding the application of algorithms to the data in languages such as R and Python.
Hence, as a first step, the data scientist should generally be capable of working with datasets using contemporary tools for large-scale data management. For instance, if the data resides in a Hadoop cluster, the practitioner must be able and willing to do the work necessary to retrieve and curate the data from the source systems, as sketched in the example below.
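As an illustration, the following sketch uses the sparklyr package to pull a summarised extract out of a Spark cluster into R. It assumes a working Spark/Hadoop installation reachable from the R session; the master setting, the table name "transactions", and the HDFS path are hypothetical placeholders.

```r
library(sparklyr)
library(dplyr)

# Connect to the cluster (use master = "local" for standalone testing)
sc <- spark_connect(master = "yarn")

# Read a CSV stored in HDFS into a Spark DataFrame
tx <- spark_read_csv(sc, name = "transactions",
                     path = "hdfs:///data/transactions.csv")

# Push filtering and aggregation down to the cluster, then pull a
# manageable summary back into the R session
summary_tbl <- tx %>%
  filter(amount > 0) %>%
  group_by(region) %>%
  summarise(total = sum(amount, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)
```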
Second, once the data has been retrieved and curated, the data scientist should be aware of the algorithm's computational requirements and determine whether the system has the resources to execute it efficiently. For instance, if an algorithm can take advantage of multi-core computing, the practitioner must use the appropriate packages and functions to leverage it. This can mean the difference between getting results in an hour and requiring an entire day.
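For example, the base R parallel package lets a task that splits into independent pieces run across several cores. The sketch below is a minimal illustration; fit_one() is a hypothetical stand-in for an expensive per-chunk computation.

```r
library(parallel)

# Hypothetical per-chunk workload; in practice this might fit a model
# or score a partition of the data
fit_one <- function(seed) {
  set.seed(seed)
  x <- rnorm(1e5)
  mean(x)            # stand-in for an expensive computation
}

# Leave one core free for the operating system
n_cores <- max(1, detectCores() - 1)

cl <- makeCluster(n_cores)                 # works on Windows and Unix
results <- parLapply(cl, 1:100, fit_one)
stopCluster(cl)

# On Unix-alikes, a fork-based alternative:
# results <- mclapply(1:100, fit_one, mc.cores = n_cores)
```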
Last, but not least, the creation of machine learning models requires programming in one or more languages. This in itself demands a level of knowledge and skill in applying algorithms and in using appropriate data structures and other computer science concepts.
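As a simple illustration of the programming involved, the sketch below fits a logistic regression on R's built-in mtcars dataset; the choice of algorithm and predictors is illustrative only, not a recommendation.

```r
# Predict whether a car has a manual transmission (am) from its
# weight (wt) and horsepower (hp)
model <- glm(am ~ wt + hp, data = mtcars, family = binomial)

summary(model)

# Predicted probabilities for the first few cars
head(predict(model, type = "response"))
```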