Introduction

Many interesting analysis techniques can be used on a large corpus of words. Whether it be examining the structure of a sentence or the content of a book, these recipes will introduce us to some useful tools.

When manipulating strings for data analysis, two of the most common operations are substring search and edit distance computation. Since numbers are often found in a corpus of text, this chapter will start by showing how to represent numbers in an arbitrary base as a string. We will then cover a couple of string-searching algorithms before focusing on extracting text to study not only the words but also how the words are used together.

Many practical applications can be built from the simple set of tools provided in this chapter. For example, in the last recipe, we will demonstrate a way to correct spelling mistakes. How we use these algorithms is entirely up to our creativity, but having them at our disposal is an excellent start.