- Mastering Machine Learning with Spark 2.x
- Alex Tellez, Max Pumperla, Michal Malohlava
- 2021-07-02 18:46:04
A day in the life of a data scientist
This will probably come as a shock to some of you: being a data scientist is more than reading academic papers, researching new tools, and building models until the wee hours of the morning, fueled by espresso. In fact, this is only a small percentage of the time that a data scientist gets to truly play (the espresso part, however, is 100% true for everyone)! Most of the day is spent in meetings, gaining a better understanding of the business problem(s), crunching the data to learn its limitations (take heart, this book will expose you to a ton of different feature engineering and feature extraction tasks), and working out how best to present the findings to non-data-sciencey people. This is where the true sausage-making process takes place, and the best data scientists are the ones who relish this process because they are gaining more understanding of the requirements and benchmarks for success. In fact, we could literally write a whole new book describing this process from top to tail!
So, what (and who) is involved in asking questions about data? Sometimes, it is a process of saving data into a relational database and running SQL queries to find insights into the data: "For the millions of users that bought this particular product, what are the top 3 OTHER products they also bought?" Other times, the question is more complex, such as, "Given the review of a movie, is this a positive or negative review?" This book is mainly focused on complex questions like the latter. Answering these types of questions is where businesses really get the most impact from their big data projects, and it is also where we see a proliferation of emerging technologies that look to make this question-and-answer process easier and more capable.
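To make the simpler, SQL-style question concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `purchases` table, its columns, the sample rows, and the `top_co_purchased` helper are all hypothetical, invented purely for illustration; a real purchase log would be far larger and would likely live in a different database.

```python
import sqlite3

# Hypothetical purchase log of (user_id, product) pairs. All rows are made up.
purchases = [
    (1, "laptop"), (1, "mouse"), (1, "bag"),
    (2, "laptop"), (2, "mouse"),
    (3, "laptop"), (3, "mouse"), (3, "dock"),
    (4, "laptop"), (4, "bag"),
    (5, "mouse"), (5, "dock"),
]


def top_co_purchased(rows, product, limit=3):
    """Return the products most often bought by users who also bought `product`."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE purchases (user_id INTEGER, product TEXT)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)

    # Self-join: pair every purchase of the target product with the other
    # purchases made by the same user, then count distinct buyers per product.
    query = """
        SELECT other.product, COUNT(DISTINCT other.user_id) AS buyers
        FROM purchases AS target
        JOIN purchases AS other
          ON other.user_id = target.user_id
         AND other.product <> target.product
        WHERE target.product = ?
        GROUP BY other.product
        ORDER BY buyers DESC, other.product ASC
        LIMIT ?
    """
    result = conn.execute(query, (product, limit)).fetchall()
    conn.close()
    return result


top3 = top_co_purchased(purchases, "laptop")
print(top3)
```

A single self-join with a `GROUP BY` answers this kind of question. The movie review question, by contrast, has no such query: it requires a model that learns from labeled examples.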
Some of the most popular open source tools that look to help answer data questions include R, Python, Julia, and Octave, all of which perform reasonably well with small (X < 100 GB) datasets. At this point, it's worth stopping to point out a clear distinction between big and small data. Our general rule of thumb in the office goes as follows:
If you can open your dataset using Excel, you are working with small data.