Transforming Text into Data Structures

Text data poses a unique challenge: it has no direct numerical representation, and computers only understand numbers. Representing text using numbers is therefore a problem we must solve before any modeling can begin. At the same time, it is an opportunity to invent and try out approaches that represent text so that as much information as possible is captured in the process. In this chapter, we will look at how text and math interface. Let's take baby steps toward transforming text data into mathematical data structures that will provide insights into how to represent text using numbers and, consequently, build Natural Language Processing (NLP) models.

Pause for a moment here and think about how you would try to solve this problem.

By the end of this chapter, we will be better equipped to handle text data, having learned techniques such as count vectorization and term frequency-inverse document frequency (TF-IDF) vectorization, among others.

Before we discuss approaches such as count vectors and TF-IDF vectors in this chapter, and further approaches such as Word2vec in later chapters, we need to understand two fundamental concepts that underlie every language: syntax and semantics. Syntax defines the grammatical structure of a language, that is, the set of rules that govern it. It can be thought of as a set of guiding principles that define how words can be placed next to one another to form phrases or sentences. However, syntactically correct sentences may not be meaningful. Semantics is the part that takes care of meaning: it defines how to put words together so that they actually make sense when organized according to the available syntactical rules.

In this chapter, we will primarily focus on the syntactical aspects, where we use information such as how many times a word occurred in a document or in a set of documents as potential features to represent documents. Let's see how these approaches pan out in solving the representation problem we have.
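To make the idea of using word occurrence counts as features concrete, here is a minimal sketch in Python. It is only an illustration under simplifying assumptions: the two sample sentences and the `count_words` helper are hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import Counter


def count_words(document: str) -> Counter:
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(document.lower().split())


# Two tiny, made-up documents used purely for illustration.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Shared vocabulary: every distinct word across all documents, sorted
# so that each word gets a fixed position in the vectors.
vocabulary = sorted({word for doc in documents for word in doc.lower().split()})

# One vector per document: how often each vocabulary word occurs in it.
# A Counter returns 0 for words that do not appear.
vectors = [[count_words(doc)[word] for word in vocabulary] for doc in documents]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a list of numbers of the same length, with one position per vocabulary word. Word order is discarded in this representation; only the counts are kept.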

The following topics will be covered in this chapter:

  • Understanding vectors and matrices
  • Exploring the Bag-of-Words (BoW) architecture
  • TF-IDF vectors
  • Distance/similarity calculation between document vectors
  • One-hot vectorization
  • Building a basic chatbot