官术网_书友最值得收藏!

  • Using OpenRefine
  • Ruben Verborgh Max De Wilde
  • 493字
  • 2021-08-06 16:57:12

Recipe 3 – exploring your data

In this recipe, you will get to know your data by scanning the different zones giving access to the total number of rows/records, the various display options, the column headers and menus, and the actual cell contents.

Once your dataset has been loaded, you will access the main interface of OpenRefine as shown in the following screenshot:

Four zones are seen on this screen; let's go through them from top to bottom, numbered as 1 to 4 in the preceding screenshot:

  1. Total number of rows: If you did not forget to specify that quotation marks are to be ignored (see Recipe 2 – creating a new project), you should see a total of 75814 rows from the Powerhouse file. When data are filtered on a given criterion, this bar will display something like 123 matching rows (75814 total).
  2. Display options: Try to alternate between rows and records by clicking on either word. True, not much will change, except that you may now read 75814 records in zone 1. The number of rows is always equal to the number of records in a new project, but they will evolve independently from now on. This zone will also let you choose whether to display 5, 10, 25, or 50 rows/records on a page, and it also provides the right way to navigate from page to page.
  3. Column headers and menus: You will find here the first row that was parsed as column headers when the project was created. In the Powerhouse dataset, the columns read Record ID, Object Title, Registration Number, and so on (if you deselected the Parse next 1 line as column headers option box, you will see Column 1, Column 2, and so on instead). The leftmost column is always called All and is divided in three subcolumns containing stars (to mark good records, for instance), flags (to mark bad records, for instance), and IDs. Starred and flagged rows can easily be faceted, as we will see in Chapter 2, Analyzing and Fixing Data. Every column also has a menu (see the following screenshot) that can be accessed by clicking on the small dropdown to the left of the column header.
  4. Cell contents: This option shows the main area displaying the actual values of the cells.

Before starting to profile and clean your data, it is important to get to know them well and to be at ease with OpenRefine: have a look at each column (using the horizontal scrollbar) to verify that the column headers have been parsed correctly, that the cell types were rightly guessed, and so on. Change the number of rows displayed per page to 50 and go through a few pages to check that the values are consistent (ideally, you should already have done so during preview before creating your project). When you feel that you are sufficiently familiar with the interface, you can consider moving along to the next recipe.

主站蜘蛛池模板: 滕州市| 泸溪县| 富源县| 本溪| 新乡市| 聂拉木县| 郁南县| 华亭县| 什邡市| 奎屯市| 德令哈市| 阿城市| 宁德市| 临漳县| 吉隆县| 肃南| 永德县| 新田县| 旌德县| 金华市| 马关县| 黎城县| 旺苍县| 阿尔山市| 通山县| 富源县| 金昌市| 龙江县| 垦利县| 南昌市| 砚山县| 洛浦县| 扎兰屯市| 普格县| 张掖市| 广灵县| 保靖县| 略阳县| 牙克石市| 丰县| 邵武市|