
How to do it...

We'll code up the strategy defined previously as follows (please refer to the Categorizing news articles into topics.ipynb file in GitHub while implementing the code):

  1. Import the dataset:
from keras.datasets import reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

In the preceding code snippet, we loaded the Reuters dataset that ships with Keras. Additionally, we consider only the 10,000 most frequent words in the dataset.

  2. Inspect the dataset:
train_data[0]

A sample of the loaded training dataset is as follows:

Note that the numbers in the preceding output represent the indices of the words present in the text.

  3. We can retrieve the word-to-index mapping as follows:
word_index = reuters.get_word_index()
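With this mapping in hand, an encoded text can be decoded back into words by reversing the dictionary. The following is a minimal sketch using a toy `word_index` in place of the real mapping returned by `reuters.get_word_index()`; note that in the actual Keras dataset, the indices stored in `train_data` are offset by 3, as indices 0 to 2 are reserved for the padding, start-of-sequence, and unknown tokens:

```python
# Hypothetical word-to-index mapping standing in for reuters.get_word_index()
toy_word_index = {"the": 1, "market": 2, "rose": 3}

# Reverse the mapping so that words can be looked up by their integer ID
reverse_word_index = {index: word for word, index in toy_word_index.items()}

encoded = [1, 2, 3]  # a toy encoded text
decoded = " ".join(reverse_word_index.get(i, "?") for i in encoded)
print(decoded)  # the market rose
```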
  4. Vectorize the input. We will convert the text into a vector in the following way:
    • One-hot-encode the input words, resulting in a total of 10,000 columns in the input dataset.
    • If a word is present in the given text, the column corresponding to that word's index will have a value of one, and every other column will have a value of zero.
    • Repeat the preceding step for all the unique words in a text. If a text has two unique words, a total of two columns will have a value of one, and every other column will have a value of zero:
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # Initialize a zero matrix of shape (number of texts, vocabulary size)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set the columns corresponding to the word indices in this text to one
        results[i, sequence] = 1.
    return results

In the preceding function, we initialized a zero matrix and set its entries to one at the index positions given by each input sequence.

In the following code, we convert the train and test sequences of word IDs into one-hot-encoded vectors:

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
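The behavior of the vectorization function can be verified on a toy input; this sketch uses a vocabulary of 5 words rather than 10,000 so the result is easy to read. Note that a word appearing twice in a text still produces a single one in its column:

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # Initialize a zero matrix of shape (number of texts, vocabulary size)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set the columns corresponding to the word indices in this text to one
        results[i, sequence] = 1.
    return results

# Two toy "texts", each a list of word indices; index 1 repeats in the second
toy_texts = [[0, 3], [1, 1, 2]]
vec = vectorize_sequences(toy_texts, dimension=5)
print(vec)
# [[1. 0. 0. 1. 0.]
#  [0. 1. 1. 0. 0.]]
```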
  5. One-hot-encode the output:
from keras.utils.np_utils import to_categorical
one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)

The preceding code converts each output label into a vector that is 46 in length, where one of the 46 values is one and the rest are zero, depending on the label's index value.
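To make the transformation above concrete, here is a minimal NumPy equivalent of `to_categorical` (a sketch, assuming integer labels in the range 0 to num_classes - 1), shown with 3 classes instead of 46:

```python
import numpy as np

def one_hot(labels, num_classes):
    # Initialize a zero matrix of shape (number of labels, number of classes)
    out = np.zeros((len(labels), num_classes))
    # Set the column matching each label's index to one
    out[np.arange(len(labels)), labels] = 1.
    return out

print(one_hot([0, 2, 1], 3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```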

  6. Define the model and compile it:
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(10000,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(46, activation='softmax'))
model.summary()
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Note that while compiling, we defined loss as categorical_crossentropy as the output in this case is categorical (multiple classes in output).
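For a single sample, categorical cross-entropy reduces to the negative log of the probability the model assigns to the true class. The following sketch computes it by hand with NumPy for one hypothetical prediction over 3 classes:

```python
import numpy as np

y_true = np.array([0., 1., 0.])     # one-hot label: the true class is class 1
y_pred = np.array([0.1, 0.7, 0.2])  # hypothetical softmax output

# Only the true class's term survives the sum, giving -log(0.7)
loss = -np.sum(y_true * np.log(y_pred))
print(round(loss, 4))  # 0.3567
```

The loss shrinks toward zero as the probability assigned to the true class approaches one, which is exactly what training pushes the network to do.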

  7. Fit the model:
history = model.fit(x_train, one_hot_train_labels, epochs=20, batch_size=512, validation_data=(x_test, one_hot_test_labels))

The preceding code results in a model that classifies the input text into the correct topic with approximately 80% accuracy.
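Once trained, accuracy on the test set can be obtained from the model's softmax outputs by picking the most probable class per sample and comparing it against the true labels. This sketch demonstrates the computation on hypothetical predictions standing in for `model.predict(x_test)`:

```python
import numpy as np

# Hypothetical softmax predictions for 3 samples over 3 classes
probs = np.array([[0.1, 0.8, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.2, 0.6]])
labels = np.array([1, 0, 2])  # hypothetical true integer labels

predicted = np.argmax(probs, axis=1)        # most probable class per sample
accuracy = np.mean(predicted == labels)     # fraction of correct predictions
print(accuracy)  # 1.0
```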
