Validating records by matching regular expressions

A regular expression is a compact language for describing patterns in strings. Our Haskell code can test a piece of text against a regular expression and tell us whether or not the text matches the rules the expression describes. Regular expression matching is useful for validating text or for identifying a pattern in it.

In this recipe, we will read a corpus of English text and pick out candidate full names from a sea of words. A full name usually consists of two words that each start with a capital letter. We use this heuristic to extract the likely names from an article.

Getting ready

Create an input.txt file with some text. In this example, we use a snippet from a New York Times article on dinosaurs (http://www.nytimes.com/2013/12/17/science/earth/outsider-challenges-papers-on-growth-of-dinosaurs.html):

Other co-authors of Dr. Erickson's include Mark Norell, chairman of paleontology at the American Museum of Natural History; Philip Currie, a professor of dinosaur paleobiology at the University of Alberta; and Peter Makovicky, associate curator of paleontology at the Field Museum in Chicago.

How to do it...

Create a new file, which we will call Main.hs, and perform the following steps:

  1. Import the regular expression library:
    import Text.Regex.Posix ((=~))
  2. Match a string against a regular expression to detect words that look like names:
    looksLikeName :: String -> Bool
    looksLikeName str = str =~ "^[A-Z][a-z]{1,30}$" :: Bool
  3. Create functions that remove unnecessary punctuation and special symbols. We will use the same functions defined in the previous recipe entitled Ignoring punctuation and specific characters:
    punctuations = [ '!', '"', '#', '$', '%'
                   , '(', ')', '.', ',', '?']
    removePunctuation = filter (`notElem` punctuations)
            
    specialSymbols = ['/', '-']
    replaceSpecialSymbols = map $ 
                             (\c -> if c `elem`  specialSymbols
                                    then ' ' else c)
  4. Pair adjacent words together and form a list of possible full names:
    createTuples (x:y:xs) = (x ++ " " ++ y) : 
                              createTuples (y:xs)
    createTuples _ = [] 
  5. Retrieve the input and find possible names from a corpus of text:
    main :: IO ()
    main = do
    
      input <- readFile "input.txt"
      let cleanInput =
            (removePunctuation . replaceSpecialSymbols) input
    
      let wordPairs = createTuples $ words cleanInput
    
      let possibleNames =
            filter (all looksLikeName . words) wordPairs
    
      print possibleNames
  6. The resulting output after running the code is as follows:
    $ runhaskell Main.hs
    
    ["Dr Erickson","Mark Norell","American Museum","Natural History","History Philip","Philip Currie","Peter Makovicky","Field Museum"]
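
The cleaning and pairing steps above can also be sketched as one pure function, which makes them easy to try in GHCi without an input file. This is only an illustrative sketch: `candidateNames` and `looksLikeNameNoRegex` are hypothetical names that do not appear in the recipe, and the hand-written predicate is a stand-in for the regular expression (assuming ASCII letters) so that the example depends on nothing beyond base.

```haskell
import Data.Char (isAsciiLower, isAsciiUpper)

-- Hypothetical stand-in for the pattern ^[A-Z][a-z]{1,30}$ (an assumption,
-- used so this sketch needs only base): one ASCII capital letter followed
-- by 1 to 30 ASCII lowercase letters.
looksLikeNameNoRegex :: String -> Bool
looksLikeNameNoRegex (c:cs) =
  isAsciiUpper c && not (null cs) && length cs <= 30 && all isAsciiLower cs
looksLikeNameNoRegex [] = False

-- The same helpers as in steps 3 and 4, with explicit type signatures.
punctuations :: String
punctuations = "!\"#$%().,?"

removePunctuation :: String -> String
removePunctuation = filter (`notElem` punctuations)

specialSymbols :: String
specialSymbols = "/-"

replaceSpecialSymbols :: String -> String
replaceSpecialSymbols = map (\c -> if c `elem` specialSymbols then ' ' else c)

-- Join each word with its right-hand neighbour.
createTuples :: [String] -> [String]
createTuples (x:y:xs) = (x ++ " " ++ y) : createTuples (y:xs)
createTuples _        = []

-- Clean the text, pair adjacent words, and keep the pairs in which
-- both words pass the name check.
candidateNames :: String -> [String]
candidateNames =
  filter (all looksLikeNameNoRegex . words)
    . createTuples
    . words
    . removePunctuation
    . replaceSpecialSymbols
```

For instance, `candidateNames "include Mark Norell, chairman of paleontology"` evaluates to `["Mark Norell"]`: the comma is stripped first, and only the pair in which both words are capitalized survives the filter.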
    

How it works...

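The =~ operator takes a string and a regular expression and returns a result whose type depends on the context. Here we annotate the result as Bool, so it simply reports whether the string matches. In this recipe, the ^[A-Z][a-z]{1,30}$ regular expression matches words that start with a capital letter, continue with lowercase letters only, and are between 2 and 31 letters long.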
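
To illustrate how the requested result type changes what =~ returns, here is a small sketch that assumes the regex-posix package is installed. The helper names (`namePattern`, `isMatch`, `firstMatch`, `splitAtMatch`) and the looser `[A-Z][a-z]+` pattern are made up for this example and are not part of the recipe.

```haskell
import Text.Regex.Posix ((=~))

-- The pattern used in the recipe.
namePattern :: String
namePattern = "^[A-Z][a-z]{1,30}$"

-- As Bool: does the whole word fit the pattern?
isMatch :: String -> Bool
isMatch str = str =~ namePattern

-- As String: the text of the first match ("" when nothing matches).
-- The unanchored pattern here is an assumption for illustration only.
firstMatch :: String -> String
firstMatch str = str =~ "[A-Z][a-z]+"

-- As a triple: (text before the match, the match, text after the match).
splitAtMatch :: String -> (String, String, String)
splitAtMatch str = str =~ "[A-Z][a-z]+"
```

With these definitions, `isMatch "Norell"` is `True`, while `isMatch "A"` is `False` because the pattern needs at least one lowercase letter after the capital. `firstMatch "professor Philip Currie"` gives `"Philip"`, and `splitAtMatch "at the Field Museum"` gives `("at the ","Field"," Museum")`.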

To judge the usefulness of the algorithm presented in this recipe, we introduce two metrics of relevance: precision and recall. Precision is the percentage of retrieved data that is relevant. Recall is the percentage of relevant data that is retrieved.

Out of a total of 45 words in the input.txt file, four correct names are produced and a total of eight candidates are retrieved. The heuristic therefore has a precision of 50 percent and a recall of 100 percent. This is not bad at all for a simple regular expression trick.
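
The arithmetic behind these two figures can be written down as a short sketch. The `retrieved` list is the output shown in the recipe. The `relevant` list is an assumption about which four candidates the text counts as correct names, and the function names are chosen for this example.

```haskell
import Data.List (intersect)

-- The eight candidates printed by the recipe.
retrieved :: [String]
retrieved =
  [ "Dr Erickson", "Mark Norell", "American Museum", "Natural History"
  , "History Philip", "Philip Currie", "Peter Makovicky", "Field Museum"
  ]

-- Assumption: the four candidates that are actually people's names.
relevant :: [String]
relevant = ["Dr Erickson", "Mark Norell", "Philip Currie", "Peter Makovicky"]

-- Fraction of retrieved items that are relevant.
-- (Not meaningful when the retrieved list is empty.)
precision :: [String] -> [String] -> Double
precision ret rel =
  fromIntegral (length (ret `intersect` rel)) / fromIntegral (length ret)

-- Fraction of relevant items that were retrieved.
-- (Not meaningful when the relevant list is empty.)
recall :: [String] -> [String] -> Double
recall ret rel =
  fromIntegral (length (ret `intersect` rel)) / fromIntegral (length rel)
```

Here `precision retrieved relevant` is 4 / 8 = 0.5 and `recall retrieved relevant` is 4 / 4 = 1.0, which matches the 50 percent and 100 percent quoted above.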

See also

Instead of running regular expressions on a string, we can pass the string through a lexical analyzer. The next recipe, entitled Lexing and parsing an e-mail address, covers this in detail.
