Types of Data

To deal with data effectively, we need to understand the various forms in which it exists. There are two main ways to categorize data: by structure and by content, as explained in the upcoming sections.

Categorizing Data Based on Structure

On the basis of structure, data can be divided into three categories, namely structured, semi-structured, and unstructured, as shown in the following diagram:

Figure 2.1: Categorization of data based on structure

These three categories are explained in detail here:

  • Structured Data: This is the most organized form of data. It is represented in tabular formats such as Excel files and Comma-Separated Values (CSV) files. The following figure shows what structured data usually looks like:

Figure 2.2: Structured data

  • Semi-Structured Data: This type of data is not presented in a tabular structure, but it can be represented in a tabular format after transformation. Here, information is usually stored between tags following a definite pattern. XML and HTML files can be referred to as semi-structured data. The following figure shows how semi-structured data can appear:

Figure 2.3: Semi-structured data

  • Unstructured Data: This type of data is the most difficult to deal with. Machine learning algorithms find it difficult to process unstructured data without some loss of information. Text corpora and images are examples of unstructured data. The following figure shows what unstructured data looks like:

Figure 2.4: Unstructured data
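
To make the idea of transforming semi-structured data into a tabular format concrete, here is a minimal sketch in Python that uses the standard library's xml.etree.ElementTree module. The XML snippet, the tag names (employees, employee, name, age), and the helper xml_to_rows are hypothetical and serve only as an illustration. The sketch also assumes that every record is a flat element whose children hold plain text:

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured input: values stored between tags
# that follow a definite pattern.
XML_TEXT = """
<employees>
  <employee><name>Asha</name><age>31</age></employee>
  <employee><name>Ben</name><age>45</age></employee>
</employees>
"""


def xml_to_rows(xml_text, record_tag):
    """Flatten each <record_tag> element into a dict of child tag -> text."""
    root = ET.fromstring(xml_text.strip())
    return [
        {child.tag: (child.text or "").strip() for child in record}
        for record in root.iter(record_tag)
    ]


# Each dict is one row of the table; its keys are the column names.
rows = xml_to_rows(XML_TEXT, "employee")
for row in rows:
    print(row)
```

Each dictionary corresponds to one row of a table, with the tag names acting as column headers. Real XML or HTML files are often nested more deeply or have missing fields, so a transformation like this usually needs extra handling in practice.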

Categorizing Data Based on Content

On the basis of content, data can be divided into four categories, as shown in the following figure:

Figure 2.5: Categorization of data based on content

Let's look at each category here:

  • Text Data: This refers to text corpora consisting of written sentences. This type of data can only be read. An example would be the text corpus of a book.
  • Image Data: This refers to pictures that are used to communicate messages. This type of data can only be seen.
  • Audio Data: This refers to recordings of someone's voice, music, and so on. This type of data can only be heard.
  • Video Data: A continuous series of images coupled with audio forms a video. This type of data can be seen as well as heard.

We have learned about the different types of data as well as their categorization on the basis of structure and content. When dealing with unstructured data, it is necessary to clean it first. In the next section, we will look at some pre-processing steps for cleaning data.
