Time for action – counting frequent words by filtering

On this occasion, you have some plain text files, and you want to know what they say. You don't want to read them, so you decide to count how many times each word appears in the text, and look at the most frequent ones to get an idea of what the files are about. The first of our two tutorials on filtering is about counting the words in the file.

Note

Before starting, you'll need at least one text file to play with. The text file used in this tutorial is named smcng10.txt, and is available for you to download from Packt Publishing's website, www.packtpub.com.

Let's work.

Tip

This section and the following sections have many steps, so feel free to preview the data from time to time. That way, you can check that you are on the right track and understand what filtering is about as you progress in the design of your transformation.

  1. Create a new transformation.
  2. By using a Text file input step, read your file. The trick here is to set as the Separator a character you are not expecting in the file, such as |. By doing so, each whole line will be recognized as a single field. Configure the Fields tab by defining a single String field named line.
  3. This particular file has a long header describing its content and origin. We are not interested in those lines, so in the Content tab, type 378 as Header, which is the number of lines that precede the content we're interested in.
  4. From the Transform category of steps, drag to the canvas a Split field to rows step, and create a hop from the Text file input step to this one.
  5. Configure the step as follows:
  6. With this last step selected, do a preview. Your preview window should look as follows:
  7. Close the preview window.
  8. Add a Select values step to remove the line field.
    Note

    It's not mandatory to remove this field, but as it will not be used any longer, removing it will make future previews clearer.

  9. Expand the Flow category of steps, and drag a Filter rows step to the work area.
  10. Create a hop from the last step to the Filter rows step.
  11. Edit the Filter rows step by double-clicking on it.
  12. Click on the <field> textbox to the left of the = sign. The list of fields appears. Select word.
  13. Click on the = sign. A list of operations appears. Select IS NOT NULL.
  14. The window looks like the following screenshot:
  15. Click on OK.
  16. From the Transform category of steps, drag a Sort rows step to the canvas.
  17. Create a hop from the Filter rows step to the Sort rows step. When asked for the kind of hop, select Main output of step, as shown in the following screenshot:
  18. Use the last step to sort the rows by word (ascending).
  19. From the Statistics category, drag and drop a Group by step onto the canvas, and add it to the stream after the Sort rows step.
  20. Configure the grids in the Group by configuration window, as shown in the following screenshot:
  21. With the Group by step selected, do a preview. You will see this:

What just happened?

You read a regular plain file, and counted the words appearing in it.

The first thing you did was read the plain file, and split the lines so that every word became a new row in the dataset. For example, as a consequence of splitting the line:

subsidence; comparison with the Portillo chain.

The following rows were generated:

Thus, a new field named word became the basis for your transformation, and therefore you removed the line field.
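
The splitting described above can be sketched outside the tool. The following Python is only an illustration of the idea, not how the Split field to rows step is implemented. The function name split_field_to_rows and the single-space delimiter are assumptions made for this example; the real delimiter is whatever you set in the step's configuration window.

```python
def split_field_to_rows(lines, delimiter=" "):
    """Yield one row per word, keeping the original line next to it.

    The delimiter default is an assumption for this sketch.
    """
    for line in lines:
        for word in line.split(delimiter):
            yield {"line": line, "word": word}


sample = ["subsidence; comparison with the Portillo chain."]
rows = list(split_field_to_rows(sample))
words = [row["word"] for row in rows]
print(words)
```

Each generated row still carries the line field, which is why dropping it afterwards makes the previews easier to read.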

Next, you discarded the rows with null words. You did it by using a filter with the condition word IS NOT NULL.
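
To see where empty words can come from, here is a small Python sketch of the same filter. It assumes that an empty string counts as null, and that words are separated by a single space; filter_not_null is a hypothetical helper name used only for this example.

```python
def filter_not_null(words):
    """Keep only words that are neither None nor the empty string."""
    return [w for w in words if w is not None and w != ""]


# An empty line splits into one empty word, and so does a double space.
raw = "".split(" ") + "two  spaces".split(" ")
kept = filter_not_null(raw)
print(raw)
print(kept)
```

Without the filter, those empty words would show up later as a group of their own in the word count.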

Then, you counted the words by using the Group by step you learned in the previous tutorial. Doing it this way, you got a preliminary list of the words in the file, and the number of occurrences of each word.
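
The sort-then-group pattern can be sketched in Python with itertools.groupby, which, like the approach used here, only merges equal values that sit next to each other and therefore needs sorted input. This is an illustrative sketch, not the Group by step's implementation; count_words is a hypothetical name.

```python
from itertools import groupby


def count_words(words):
    """Sort the words, then count each run of identical words."""
    ordered = sorted(words)  # plays the role of the Sort rows step
    return [(word, sum(1 for _ in group)) for word, group in groupby(ordered)]


counts = count_words(["the", "chain", "the", "Portillo", "the"])
for word, occurrences in counts:
    print(word, occurrences)
```

Note that the sort is case sensitive in this sketch, so Portillo comes before chain, and The and the would be counted as different words.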