
Getting ready

To build intuition for performing text analysis, let's consider the Reuters dataset, where each news article is classified into one of 46 possible topics.

We will adopt the following strategy to perform our analysis:

  • Given that a dataset could contain thousands of unique words, we will shortlist the words to consider.
  • For this exercise, we shall consider the 10,000 most frequent words.
  • An alternative approach would be to consider the words that cumulatively constitute 80% of all word occurrences in the dataset; this ensures that rare words are excluded.
  • Once the words are shortlisted, we shall one-hot-encode each article based on which of the frequent words it contains.
  • Similarly, we shall one-hot-encode the output label.
  • Each input is now a 10,000-dimensional vector, and each output is a 46-dimensional vector.
  • We will divide the dataset into train and test datasets. In code, you will notice that we use the built-in reuters dataset in Keras, which can shortlist the top n most frequent words and split the data into train and test sets for us.
  • We will map the input to the output with a hidden layer in between.
  • We will apply softmax at the output layer to obtain the probability of the input belonging to each of the 46 classes.
  • Given that there are multiple possible output classes, we shall employ the categorical cross-entropy loss function.
  • We shall compile and fit the model, and measure its accuracy on the test dataset.
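The steps above can be sketched as follows. This is a minimal sketch using the Keras API; the hidden-layer size (64 units), optimizer, batch size, and number of epochs are illustrative choices, not values prescribed by this recipe:

```python
import numpy as np
from tensorflow.keras.datasets import reuters
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.utils import to_categorical

# Load the built-in Reuters dataset, keeping only the 10,000 most frequent words;
# Keras also splits the articles into train and test sets for us.
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=10000)

def one_hot_articles(sequences, dimension=10000):
    # Encode each article as a 10,000-dimensional vector: position j is 1
    # if word index j appears in the article, and 0 otherwise.
    results = np.zeros((len(sequences), dimension))
    for i, seq in enumerate(sequences):
        results[i, seq] = 1.0
    return results

x_train = one_hot_articles(x_train)
x_test = one_hot_articles(x_test)

# One-hot encode the 46 topic labels.
y_train = to_categorical(y_train, num_classes=46)
y_test = to_categorical(y_test, num_classes=46)

# One hidden layer between input and output; softmax yields the
# probability of the article belonging to each of the 46 classes.
model = Sequential([
    Input(shape=(10000,)),
    Dense(64, activation='relu'),   # hidden-layer size is an illustrative choice
    Dense(46, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=128, validation_split=0.1)

# Measure accuracy on the test dataset.
loss, accuracy = model.evaluate(x_test, y_test)
print(f'Test accuracy: {accuracy:.3f}')
```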