
Introduction

Data is everywhere, logging is cheap, and analysis is inevitable. This chapter rests on one fundamental idea: gathering useful data. After building a large collection of usable text, which we call the corpus, we must learn to represent this content in code. The focus will be first on obtaining data and then on enumerating ways of representing it.

Gathering data is arguably as important as analyzing it when the goal is to extrapolate results and form valid, generalizable claims. It is a scientific pursuit; therefore, great care must and will be taken to ensure unbiased and representative sampling. We recommend following along closely in this chapter because the remainder of the book depends on having a source of data to work with. Without data, there isn't much to analyze, so we should carefully observe the techniques laid out to build our own formidable corpus.

The first recipe enumerates various online sources from which to start gathering data. The next few recipes deal with using local data in different file formats. We then learn how to download data from the Internet using our Haskell code. Finally, we finish this chapter with a couple of recipes on using databases in Haskell.
