
Capturing table rows from an HTML page

Mining Hypertext Markup Language (HTML) often comes down to identifying and parsing only its structured segments. Not all text in an HTML file is useful, so we find ourselves focusing on a specific subset. For instance, HTML tables and lists provide a strong and commonly used structure for extracting data, whereas a paragraph in an article may be too unstructured and complicated to process.

In this recipe, we will find a table on a web page and gather all rows to be used in the program.

Getting ready

We will be extracting the values from an HTML table, so start by creating an input.html file containing a course listing table.

The HTML behind this table is as follows:

$ cat input.html

<!DOCTYPE html>
<html>
    <body>
        <h1>Course Listing</h1>
        <table>
            <tr>
                <th>Course</th>
                <th>Time</th>
                <th>Capacity</th>
            </tr>
            <tr>
                <td>CS 1501</td>
                <td>17:00</td>
                <td>60</td>
            </tr>
            <tr>
                <td>MATH 7600</td>
                <td>14:00</td>
                <td>25</td>
            </tr>
            <tr>
                <td>PHIL 1000</td>
                <td>9:30</td>
                <td>120</td>
            </tr>
        </table>
    </body>
</html>

If not already installed, use Cabal to set up the HXT library and the split library, as shown in the following command lines:

$ cabal install hxt
$ cabal install split

How to do it...

  1. We will need the hxt package for XML manipulation and the chunksOf function from the split package, as presented in the following code snippet:
    import Text.XML.HXT.Core
    import Data.List.Split (chunksOf)
  2. Define and implement main to read the input.html file:
    main :: IO ()
    main = do
      input <- readFile "input.html"
  3. Feed the HTML data into readString, setting withParseHTML to yes and optionally turning off warnings. Extract all the td tags and obtain their text content, as shown in the following code:
      texts <- runX $ readString 
               [withParseHTML yes, withWarnings no] input 
        //> hasName "td"
        //> getText
  4. The data is now usable as a list of strings. It can be converted into a list of lists, similar to how the data was presented in the previous CSV recipe, as shown in the following code (a more robust way to group the cells by row is sketched after this list):
      let rows = chunksOf 3 texts
      print $ findBiggest rows
  5. By folding through the data, identify the course with the largest capacity using the following code snippet (a safer variant of toInt is sketched after this list):
    -- Keep the row whose capacity column is largest.
    findBiggest :: [[String]] -> [String]
    findBiggest [] = []
    findBiggest items = foldl1
                        (\a x -> if capacity x > capacity a
                                 then x else a) items

    -- The capacity is the third cell; malformed rows rank last.
    capacity :: [String] -> Int
    capacity [_, _, c] = toInt c
    capacity _ = -1

    toInt :: String -> Int
    toInt = read
  6. Running the code will display the course with the largest capacity as follows:
    $ runhaskell Main.hs
    
    {"PHIL 1000", "9:30", "120"}
    
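The chunksOf 3 call works here because every row has exactly three cells, but it hardcodes the column count. As a more robust alternative, the cells can be grouped by their enclosing tr elements, so the row structure comes from the markup itself. The following is a minimal sketch of that idea; groupRows is a hypothetical helper, not part of the recipe:

import Text.XML.HXT.Core

-- For each <tr>, collect the text of its <td> children into one list.
groupRows :: String -> IO [[String]]
groupRows input = runX $
  readString [withParseHTML yes, withWarnings no] input
  //> hasName "tr"
  >>> listA (getChildren >>> hasName "td" /> getText)

The header row contains only th cells, so it comes back as an empty list; dropping it with filter (not . null) leaves exactly the data rows.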
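Note also that toInt = read throws a runtime exception if a capacity cell is not a number. A defensive variant, sketched here with readMaybe from the standard Text.Read module, falls back to -1 so that malformed rows never win the fold:

import Text.Read (readMaybe)
import Data.Maybe (fromMaybe)

-- Parse an Int, treating unreadable input as -1 instead of crashing.
toIntSafe :: String -> Int
toIntSafe = fromMaybe (-1) . readMaybe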

How it works...

This is very similar to XML parsing, except that we adjust the options of readString to [withParseHTML yes, withWarnings no]. The withParseHTML yes option makes HXT use its lenient HTML parser, which error-corrects the missing close tags and loose nesting common on real-world pages, while withWarnings no suppresses the warnings such pages would otherwise trigger. The //> operator then searches arbitrarily deep in the resulting tree, so every td element is found regardless of how the table is nested.
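
To see that leniency in action, the following self-contained sketch (not part of the recipe) feeds the parser a fragment whose close tags are deliberately omitted; HXT's error correction should still recover both cells:

import Text.XML.HXT.Core

main :: IO ()
main = do
  -- </td> and </tr> are missing; the HTML parser infers them
  -- where a strict XML parser would fail.
  cells <- runX $ readString [withParseHTML yes, withWarnings no]
                    "<table><tr><td>a<td>b</table>"
             //> hasName "td" //> getText
  print cells  -- expected output: ["a","b"]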
