
Building datasets

Data scientists often need hundreds of thousands of data points to build, train, and test machine learning models. In some cases, this data is already pre-packaged and ready for consumption. Most of the time, however, data scientists need to venture out on their own and build a custom dataset. This is often done by building a web scraper to collect raw data from various sources of interest, then refining it so it can be processed later on. These web scrapers also need to periodically collect fresh data to update their predictive models with the most relevant information.

A common use case that data scientists run into is determining how people feel about a specific subject, known as sentiment analysis. Through this process, a company could look for discussions surrounding one of its products, or its overall presence, and gather a general consensus. To do this, the model must be trained on what constitutes a positive comment and a negative comment, which can require thousands of individual comments to make a well-balanced training set. Building a web scraper to collect comments from relevant forums, reviews, and social media sites would be helpful in constructing such a dataset.
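To make the idea concrete, here is a minimal sketch in Go of the last step of such a pipeline: taking comments a scraper has already collected, pairing each with a sentiment label, and serializing them as CSV rows for a training set. The `labeledComment` type, the `writeDataset` function, and the sample comments are all hypothetical names invented for this illustration, not part of any particular library.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// labeledComment pairs a raw scraped comment with a sentiment
// label supplied by an annotator; both fields are illustrative.
type labeledComment struct {
	text  string
	label string // "positive" or "negative"
}

// writeDataset serializes labeled comments as CSV so the result
// can be fed into a model-training pipeline later on.
func writeDataset(comments []labeledComment) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write([]string{"text", "label"}); err != nil {
		return "", err
	}
	for _, c := range comments {
		if err := w.Write([]string{c.text, c.label}); err != nil {
			return "", err
		}
	}
	w.Flush()
	return buf.String(), w.Error()
}

func main() {
	// Hypothetical comments a scraper might have collected.
	sample := []labeledComment{
		{"Love this product, works great", "positive"},
		{"Broke after two days", "negative"},
	}
	out, err := writeDataset(sample)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

In a real pipeline, the `sample` slice would be populated by the scraper itself, and the labels would come from manual annotation or an existing labeled corpus rather than being hardcoded.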

These are just a few examples of web scrapers that drive large businesses such as Google, Mozenda, and Cheapflights.com. There are also companies that will scrape the web for whatever available data you need, for a fee. To run scrapers at such a large scale, you would need to use a language that is fast, scalable, and easy to maintain.
