
How to do it…

In the next steps, we will convert a corpus of text data into numerical form, amenable to machine learning algorithms:

  1. First, import a textual dataset:
with open("anonops_short.txt", encoding="utf8") as f:
    anonops_chat_logs = f.readlines()
  2. Next, count the words in the text using the hashing vectorizer, and then apply tf-idf weighting:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

my_vector = HashingVectorizer(input="content", ngram_range=(1, 2))
X_train_counts = my_vector.fit_transform(anonops_chat_logs)
tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
  3. The result is a sparse matrix in which each row is a vector representing one of the texts:
X_train_tf

<180830x1048576 sparse matrix of type '<class 'numpy.float64'>'
	with 3158166 stored elements in Compressed Sparse Row format>

print(X_train_tf)

Printing the matrix displays each stored (row, column) coordinate alongside its tf-idf weight.
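To illustrate what these tf-idf vectors are good for, the following is a minimal, self-contained sketch (the toy sentences are invented for illustration and are not part of the chat-log dataset) that runs the same hashing-plus-tf-idf pipeline on three short documents and compares them with cosine similarity. Two of the documents share vocabulary, so their vectors end up closer together than either is to the unrelated third document:

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: documents 0 and 1 overlap heavily; document 2 is unrelated.
docs = [
    "the server was attacked last night",
    "the server came under attack last night",
    "I like pancakes for breakfast",
]

# Same pipeline as in the recipe: hash word and bigram counts into a
# fixed-size space (2**20 columns by default), then reweight with tf-idf.
vectorizer = HashingVectorizer(input="content", ngram_range=(1, 2))
counts = vectorizer.fit_transform(docs)
tfidf = TfidfTransformer(use_idf=True).fit_transform(counts)

# Cosine similarity between every pair of document vectors.
sims = cosine_similarity(tfidf)

# The overlapping documents are more similar to each other than to the
# unrelated one.
assert sims[0, 1] > sims[0, 2]
```

Because the hashing vectorizer maps n-grams to columns with a fixed hash function rather than a learned vocabulary, the same pipeline can vectorize new, unseen text without refitting, which is convenient when chat logs keep arriving.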