
Loading the data

I have always liked The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle. Let's download the book and save it locally:

url = 'http://www.gutenberg.org/ebooks/1661.txt.utf-8'
file_name = 'sherlock.txt'

Let's actually download the file. You only need to do this once, but this download utility can be used whenever you are downloading other datasets, too:

import urllib.request
# Download the file from `url` and save it locally under `file_name`:
with urllib.request.urlopen(url) as response:
    with open(file_name, 'wb') as out_file:
        data = response.read() # a `bytes` object
        out_file.write(data)
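Since the download only needs to happen once, the snippet above can be wrapped in a small helper that skips the request when the file is already on disk. This is a minimal sketch, not part of urllib itself; the name download_once and its return value are assumptions made for illustration:

```python
import os
import urllib.request


def download_once(url, file_name):
    """Download `url` to `file_name` unless the file already exists.

    `download_once` is a hypothetical helper name used for illustration.
    Returns True if a download happened, False if the local file was reused.
    """
    if os.path.exists(file_name):
        return False  # already downloaded earlier, so skip the request
    with urllib.request.urlopen(url) as response:
        data = response.read()  # a `bytes` object
    with open(file_name, 'wb') as out_file:
        out_file.write(data)
    return True
```

Calling download_once(url, file_name) a second time should then leave the existing sherlock.txt untouched.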

Moving on, let's check that the file is in place, using shell syntax inside our Jupyter notebook. This ability to run basic shell commands from a notebook, on both Windows and Linux, is really useful:

!ls *.txt

The preceding command returns the following output:

sherlock.txt

The file contains header and footer information from Project Gutenberg. We are not interested in this, and will discard the copyright and other legal notices. This is what we want to do:

  1. Open the file.
  2. Delete the header and footer information.
  3. Save the new file as sherlock_clean.txt.

I opened the text file and found that I need to remove the first 33 lines. Let's do that using shell commands, which also work on Windows inside a Jupyter notebook as long as the usual Unix tools (such as sed) are installed. You remember this now, don't you? Marching on:

!sed -i 1,33d sherlock.txt

I used the sed syntax. The -i flag tells sed to edit the file in place rather than print the result. 1,33d instructs sed to delete lines 1 to 33.
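If sed is not available in your environment, the same deletion can be done in plain Python. The following is a minimal sketch, assuming the file fits comfortably in memory; drop_first_lines is a hypothetical helper name, not a standard library function:

```python
def drop_first_lines(path, n, encoding='utf-8'):
    """Remove the first `n` lines of the file at `path`, in place.

    Hypothetical helper that mirrors `sed -i 1,<n>d <path>`.
    """
    # newline='' keeps the original line endings (for example \r\n) untouched
    with open(path, 'r', encoding=encoding, newline='') as f:
        lines = f.readlines()
    with open(path, 'w', encoding=encoding, newline='') as f:
        f.writelines(lines[n:])
```

Like the sed command, drop_first_lines('sherlock.txt', 33) modifies the file itself, so it should only be run once on a freshly downloaded copy.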

Let's double-check this. We expect the book to now begin with the iconic book title/cover:

!head -5 sherlock.txt

This shows the first five lines of the book. They are as we expect:

THE ADVENTURES OF SHERLOCK HOLMES


by


SIR ARTHUR CONAN DOYLE
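The same check can be made without the shell. Here is a small sketch of a head-like function in Python; the name head is chosen for illustration only:

```python
from itertools import islice


def head(path, n=5, encoding='utf-8'):
    """Return the first `n` lines of a text file, without line endings."""
    with open(path, 'r', encoding=encoding) as f:
        # islice stops after n lines, so the whole file is never read
        return [line.rstrip('\r\n') for line in islice(f, n)]
```

For example, print('\n'.join(head('sherlock.txt'))) should show the same opening lines as the shell command.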

What do I see?

Before I move on to text cleaning for any NLP task, I would like to spend a few seconds taking a quick glance at the data itself. I noted down some of the things I spotted in the following list. Of course, a keener eye will be able to see a lot more than I did:

  • Dates are written in a mixed format, such as twentieth of March, 1888, and so are times, such as three o'clock.
  • The text is wrapped at around 70 columns, so lines are rarely much longer than 70 characters.
  • There are a lot of proper nouns. These include names such as Atkinson and Trepoff, in addition to locations such as Trincomalee and Baker Street.
  • The index is in Roman numerals such as I and IV, and not 1 and 4.
  • There is a lot of dialogue, such as You have carte blanche, with no narrative around it. This storytelling style switches freely from being narrative to dialogue-driven.
  • The grammar and vocabulary are slightly unusual because of the era in which Doyle wrote.
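Some of these observations can be checked programmatically once the book is available as a Python string. The sketch below measures line lengths and collects lines that start with a Roman numeral followed by a period. The function names and the heading pattern are assumptions made for illustration, and the pattern will also match ordinary lines that happen to start with something like "I.":

```python
import re

# Assumed heading shape: optional spaces, a Roman numeral, then a period,
# for example "I. A SCANDAL IN BOHEMIA" or "IV."
ROMAN_HEADING = re.compile(r'^\s*([IVXLC]+)\.')


def line_length_summary(text):
    """Return the number of lines and the length of the longest line."""
    lengths = [len(line) for line in text.splitlines()]
    return {'lines': len(lengths), 'longest': max(lengths, default=0)}


def roman_headings(text):
    """Collect the Roman numerals that open a line, in order of appearance."""
    found = []
    for line in text.splitlines():
        match = ROMAN_HEADING.match(line)
        if match:
            found.append(match.group(1))
    return found
```

If the wrapping observation holds, the longest value reported for the book should be close to 70.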

These subjective observations are helpful in understanding the nature of your text and its edge cases. Let's move on and load the book into Python for processing:

# let's get this data into Python

# note that I add an encoding='utf-8' parameter to preserve information
with open(file_name, 'r', encoding='utf-8') as f:
    text = f.read()

print(text[:5])

This returns the first five characters:

THE A

Let's quickly verify that the data has been loaded into a useful data type. To check the type and size of what we loaded, use the following command:

print(f'The file is loaded as datatype: {type(text)} and has {len(text)} characters in it')

The preceding command returns the following output:

The file is loaded as datatype: <class 'str'> and has 581204 characters in it

There is a major improvement between Python 2.7 and Python 3.6 in how strings are handled: they are now all Unicode by default.

In Python 3, str objects are Unicode strings, which makes NLP on non-English text more convenient.

Here is a small relevant example to highlight the differences between the two:

from collections import Counter
Counter('Möbelstück')

In Python 2: Counter({'\xc3': 2, 'b': 1, 'e': 1, 'c': 1, 'k': 1, 'M': 1, 'l': 1, 's': 1, 't': 1, '\xb6': 1, '\xbc': 1})
In Python 3: Counter({'M': 1, 'ö': 1, 'b': 1, 'e': 1, 'l': 1, 's': 1, 't': 1, 'ü': 1, 'c': 1, 'k': 1})
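The Python 2 result comes from counting raw bytes rather than characters. A similar effect can be reproduced in Python 3 by encoding the string to UTF-8 first. This is a small sketch; the variable names are illustrative, and the word is written with escape sequences only so that the example does not depend on the encoding of the source file:

```python
from collections import Counter

word = 'M\u00f6belst\u00fcck'    # 'Möbelstück' with precomposed ö and ü
encoded = word.encode('utf-8')   # `bytes`: ö and ü take two bytes each

char_counts = Counter(word)      # counts Unicode characters
byte_counts = Counter(encoded)   # counts raw bytes (keys are ints in Python 3)

print(len(word), len(encoded))                     # 10 12
print(char_counts['\u00f6'], byte_counts[0xC3])    # 1 2
```

The byte 0xC3 appears twice because it is the first byte of both ö and ü in UTF-8, which is exactly the '\xc3': 2 entry in the Python 2 output.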