A day in the life of a data scientist

This will probably come as a shock to some of you: being a data scientist is more than reading academic papers, researching new tools, and building models until the wee hours of the morning, fueled by espresso. In fact, that is only a small percentage of the time that a data scientist gets to truly play (the espresso part, however, is 100% true for everyone)! Most of the day is spent in meetings, gaining a better understanding of the business problem(s), crunching the data to learn its limitations (take heart, this book will expose you to a ton of different feature engineering and feature extraction tasks), and working out how best to present the findings to non-data-sciencey people. This is where the true sausage-making process takes place, and the best data scientists are the ones who relish this process because they gain a deeper understanding of the requirements and benchmarks for success. In fact, we could write a whole new book describing this process from top to tail!

So, what (and who) is involved in asking questions about data? Sometimes, it is a process of saving data into a relational database and running SQL queries to find insights into the data: "For the millions of users who bought this particular product, what are the top three other products they also bought?" Other times, the question is more complex, such as, "Given the review of a movie, is this a positive or negative review?" This book is mainly focused on complex questions like the latter. Answering these types of questions is where businesses really get the most impact from their big data projects, and it is also where we see a proliferation of emerging technologies that aim to make this question-and-answer process easier and more capable.
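To make the first, simpler kind of question concrete, here is a minimal sketch of the "top three other products" query using Python's built-in `sqlite3` module. The `purchases` table, its `user_id` and `product_id` columns, the helper names, and the sample rows are all hypothetical and chosen only for illustration; a real schema would almost certainly look different.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical purchases table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (user_id INTEGER, product_id TEXT)")
    # Made-up sample data: (user, product) pairs.
    rows = [
        (1, "A"), (1, "B"), (1, "C"),
        (2, "A"), (2, "B"),
        (3, "A"), (3, "B"), (3, "D"),
        (4, "A"), (4, "C"),
        (5, "E"),
    ]
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
    return conn


def top_co_purchased(conn, product_id, limit=3):
    """Return (product, buyer_count) pairs for the products most often
    bought by users who also bought `product_id`."""
    query = """
        SELECT other.product_id, COUNT(DISTINCT other.user_id) AS buyers
        FROM purchases AS target
        JOIN purchases AS other
          ON other.user_id = target.user_id
         AND other.product_id <> target.product_id
        WHERE target.product_id = ?
        GROUP BY other.product_id
        ORDER BY buyers DESC, other.product_id ASC
        LIMIT ?
    """
    return conn.execute(query, (product_id, limit)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(top_co_purchased(conn, "A"))
```

The self-join pairs every purchase of the target product with the same user's other purchases, and the `GROUP BY` then counts distinct buyers per co-purchased product. A question like the movie-review one has no comparably direct query, which is why it falls into the more complex category.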

Some of the most popular open source tools that help answer data questions include R, Python, Julia, and Octave, all of which perform reasonably well with small (X < 100 GB) datasets. At this point, it's worth stopping to point out a clear distinction between big and small data. Our general rule of thumb in the office goes as follows:

If you can open your dataset using Excel, you are working with small data.
