Feature engineering is the process of deriving new features from existing ones; these derived features are useful because they tend to explain the variability in the data effectively. One application of feature engineering is calculating how similar different pieces of text are. There are various ways of measuring the similarity between two texts; the most popular are cosine similarity and Jaccard similarity. Let's learn about each of them:
Cosine similarity: The cosine similarity between two texts is the cosine of the angle between their vector representations. The rows of BoW and TF-IDF matrices can be regarded as such vector representations of the texts.
Jaccard similarity: This is the number of terms common to both text documents divided by the total number of unique terms present across them; in other words, the size of the intersection of their term sets divided by the size of their union.
Let's understand this with the help of an example. Suppose there are two texts:
Text 1: I like detective Byomkesh Bakshi.
Text 2: Byomkesh Bakshi is not a detective, he is a truth seeker.
The common terms are "Byomkesh," "Bakshi," and "detective."
The number of common terms in the texts is three.
The unique terms present across both texts are "I," "like," "detective," "Byomkesh," "Bakshi," "is," "not," "a," "he," "truth," and "seeker."
The number of unique terms is eleven.
Therefore, the Jaccard similarity is 3/11 ≈ 0.27.
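To make this concrete, here is a minimal sketch that computes both measures for the two texts; it assumes scikit-learn is available, and the simple punctuation stripping shown is an illustrative choice rather than a full tokenization pipeline:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

text1 = 'I like detective Byomkesh Bakshi.'
text2 = 'Byomkesh Bakshi is not a detective, he is a truth seeker.'

# Cosine similarity: the cosine of the angle between the TF-IDF vectors
vectors = TfidfVectorizer().fit_transform([text1, text2])
print(cosine_similarity(vectors[0], vectors[1]))

# Jaccard similarity: intersection over union of the two word sets
words1 = set(text1.lower().replace('.', '').replace(',', '').split())
words2 = set(text2.lower().replace('.', '').replace(',', '').split())
print(len(words1 & words2) / len(words1 | words2))
The Jaccard value printed is 3/11 ≈ 0.27, matching the calculation above.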
To get a better understanding of text similarity, we will solve an exercise in the next section.
Unlike numeric data, text data can be represented visually in only a few ways. The most popular of these is the word cloud: a visualization of a text corpus in which the size of each token (word) reflects the number of times it occurs. Let's go through an exercise to understand this better.
Exercise 27: Word Clouds
In this exercise, we will visualize the first 10 articles from sklearn's fetch_20newsgroups text dataset using a word cloud. Follow these steps to implement this exercise:
Open a Jupyter notebook.
Import the necessary libraries and dataset. Add the following code to do this:
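The following is a minimal sketch of this step; it assumes the wordcloud library is installed, and the WordCloud parameters shown are illustrative choices:
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from wordcloud import WordCloud

# Fetch the dataset and join the first 10 articles into a single string
newsgroups = fetch_20newsgroups(subset='train')
text = ' '.join(newsgroups.data[:10])

# Generate the word cloud and display it
cloud = WordCloud(background_color='white').generate(text)
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()
The code generates the following output: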
Figure 2.51: Word cloud representation of the first 10 articles
In the next section, we will explore other visualizations, such as dependency parse trees and named entities.
Other Visualizations
Apart from word clouds, there are various other ways of visualizing texts. Some of the most popular ways are listed here:
Visualizing sentences using a dependency parse tree: Generally, the phrases constituting a sentence depend on each other. We depict these dependencies by using a tree structure known as a dependency parse tree. For instance, in the sentence "God helps those who help themselves," the word "helps" is the head of two other words: "God" (the one who helps, its subject) and "those" (the ones who are helped, its object).
Visualizing named entities in a text corpus: In this case, we extract the named entities from texts and highlight them by using different colors.
Let's go through the following exercise to understand this better.
Exercise 28: Other Visualizations (Dependency Parse Trees and Named Entities)
In this exercise, we will look at two other visualization methods: dependency parse trees and named entity highlighting. Follow these steps to implement this exercise:
Open a Jupyter notebook.
Insert a new cell and add the following code to import the necessary libraries:
import spacy
from spacy import displacy
# Load the small English model (install it first with: python -m spacy download en_core_web_sm)
import en_core_web_sm
nlp = en_core_web_sm.load()
Now we'll depict the sentence "God helps those who help themselves" using a dependency parse tree. Add the following code to implement this:
# Parse the sentence and render its dependency tree inline in the notebook
doc = nlp('God helps those who help themselves.')
displacy.render(doc, style='dep', jupyter=True)
The code generates the following output:
Figure 2.52: Dependency parse tree
Now we will visualize the named entities of the text corpus. Add the following code to implement this:
text = 'Once upon a time there lived a saint named Ramakrishna Paramahansa. \
His chief disciple Narendranath Dutta also known as Swami Vivekananda \
is the founder of Ramakrishna Mission and Ramakrishna Math.'
# Parse the text and highlight its named entities in color
doc2 = nlp(text)
displacy.render(doc2, style='ent', jupyter=True)
The code generates the following output:
Figure 2.53: Named entities
Now that you have learned about visualizations, in the next section, we will solve an activity based on them to gain an even better understanding.
Activity 4: Text Visualization
In this activity, we will create a word cloud of the 50 most frequent words in a dataset. The dataset we will use consists of random sentences that have not been cleaned. First, we need to clean the text and create a unique set of frequently occurring words.
Note
The text_corpus.txt dataset used in this activity can be found at this location: https://bit.ly/2HQ2luS.
Follow these steps to implement this activity; a minimal sketch of the whole pipeline appears after the list:
Import the necessary libraries.
Fetch the dataset.
Perform the pre-processing steps, such as text cleaning, tokenization, stop-word removal, lemmatization, and stemming, on the fetched data.
Create a set of unique words along with their frequencies for the 50 most frequently occurring words.
Create a word cloud for these top 50 words.
Justify the word cloud by comparing it with the word frequencies you calculated.
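The following sketch illustrates one way to implement these steps; the nltk-based pre-processing shown here is an assumption (stemming is omitted for brevity), and the WordCloud parameters are illustrative:
import re
from collections import Counter
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud

nltk.download('stopwords')
nltk.download('wordnet')

# Fetch and clean the dataset: keep letters only and lowercase everything
text = open('text_corpus.txt').read()
text = re.sub('[^a-zA-Z]', ' ', text).lower()

# Tokenize, remove stop words, and lemmatize
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
tokens = [lemmatizer.lemmatize(w) for w in text.split() if w not in stop_words]

# Keep the 50 most frequent words along with their frequencies
top_50 = dict(Counter(tokens).most_common(50))
print(top_50)

# Create a word cloud from these top 50 words
cloud = WordCloud(background_color='white').generate_from_frequencies(top_50)
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()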
Note
The solution for this activity can be found on page 266.