
Ignoring punctuation and specific characters

In natural language processing, uninformative tokens are often filtered out before analysis. Stop words are the usual example, and punctuation and other special characters are frequently treated the same way: when computing word frequencies or extracting sentiment data from a corpus, they usually need to be ignored. This recipe demonstrates how to remove such characters from a body of text.

How to do it...

There are no imports necessary. Create a new file, which we will call Main.hs, and perform the following steps:

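  1. Implement main and define a string called quote. A backslash at the end of a line paired with another at the start of the next line forms a string gap, which lets a string literal span several lines. The gap itself contributes no characters, so each line keeps its own trailing space before the backslash:
    main :: IO ()
    main = do
      let quote = "Deep Blue plays very good chess-so what? \
        \Does that tell you something about how we play chess? \
        \No. Does it tell you about how Kasparov envisions, \
        \understands a chessboard? (Douglas Hofstadter)"
      putStrLn $ (removePunctuation . replaceSpecialSymbols) quote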
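  2. Remove all punctuation marks, and replace each special symbol with a space:
    punctuations :: [Char]
    punctuations = [ '!', '"', '#', '$', '%'
                   , '(', ')', '.', ',', '?']

    -- Drop every character that appears in punctuations.
    removePunctuation :: String -> String
    removePunctuation = filter (`notElem` punctuations)

    specialSymbols :: [Char]
    specialSymbols = ['/', '-']

    -- Turn each special symbol into a space so adjacent words stay separate.
    replaceSpecialSymbols :: String -> String
    replaceSpecialSymbols = map
      (\c -> if c `elem` specialSymbols then ' ' else c)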
  3. Running the code shows that the punctuation has been removed and the special symbols have been replaced with spaces, leaving text that is easier to process:
    $ runhaskell Main.hs
    Deep Blue plays very good chess so what Does that tell you something about how we play chess No Does it tell you about how Kasparov envisions understands a chessboard Douglas Hofstadter
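
As an illustration of why this cleanup helps with the word-frequency use case mentioned in the introduction, here is a minimal sketch that counts words after cleaning. It is not part of the original recipe: the names clean and wordFrequencies are hypothetical, the lowercasing step is an added assumption, and it relies on Data.Map.Strict from the containers package, which normally ships with GHC.

```haskell
import qualified Data.Map.Strict as Map
import Data.Char (toLower)

-- Same character sets as in the recipe, written as string literals.
punctuations, specialSymbols :: String
punctuations   = "!\"#$%().,?"
specialSymbols = "/-"

-- Special symbols become spaces first, then punctuation is dropped.
clean :: String -> String
clean = filter (`notElem` punctuations)
      . map (\c -> if c `elem` specialSymbols then ' ' else c)

-- Count case-insensitive word occurrences in the cleaned text.
-- (Lowercasing is an assumption of this sketch, not of the recipe.)
wordFrequencies :: String -> Map.Map String Int
wordFrequencies =
  Map.fromListWith (+) . map (\w -> (map toLower w, 1)) . words . clean

main :: IO ()
main = print . Map.toList $ wordFrequencies "So what? No. So-what!"
-- prints [("no",1),("so",2),("what",2)]
```

Without the cleaning step, "what?" and "what!" would be counted as two different words, and "So-what!" as a third.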
    

There's more...

For more control, we can install MissingH, a utility library with many helpful functions for working with strings:

$ cabal install MissingH

Its Data.String.Utils module provides a replace function that takes three arguments and produces a result as follows:

Prelude> import Data.String.Utils
Prelude Data.String.Utils> replace "hello" "goodbye" "hello world!"
"goodbye world!"

It replaces every occurrence of the first string with the second string, searching within the third argument. We can also compose multiple replace functions:

Prelude Data.String.Utils> ((replace "," "") . (replace "!" "")) "hello, world!"
"hello world"

By folding the composition (.) function over a list of these replace functions, we can generalize the replace function to an arbitrary list of tokens:

Prelude Data.String.Utils> (foldr (.) id $ map (flip replace "") [",", "!"]) "hello, world!"
"hello world"

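To see what the fold is doing, it may help to look at it in isolation. The following sketch uses only base, and the name composeAll is hypothetical. It shows that foldr (.) id [f, g, h] expands to f . (g . (h . id)), so the last function in the list is applied first and an empty list yields the identity function.

```haskell
-- Compose a list of functions into one; the rightmost function runs first.
composeAll :: [a -> a] -> a -> a
composeAll = foldr (.) id

main :: IO ()
main = do
  -- Expands to (+ 1) . ((* 2) . (subtract 3 . id)):
  -- (10 - 3) * 2 + 1 = 15
  print $ composeAll [(+ 1), (* 2), subtract 3] (10 :: Int)
  -- With no functions to compose, the input comes back unchanged.
  putStrLn $ composeAll [] "unchanged"
```

For the replace functions in this recipe the order does not affect the result, because each one deletes a different single-character token.
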
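The list of punctuation marks can now be arbitrarily long. We can modify our recipe to use these more general functions, which requires importing replace at the top of Main.hs:

import Data.String.Utils (replace)

removePunctuation :: String -> String
removePunctuation = foldr (.) id $ map (flip replace "")
        ["!", "\"", "#", "$", "%", "(", ")", ".", ",", "?"]

replaceSpecialSymbols :: String -> String
replaceSpecialSymbols = foldr (.) id $ map (flip replace " ")
        ["/", "-"]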
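
If installing MissingH is not an option, the same idea can be tried with a small hand-written stand-in. The sketch below is an assumption-laden illustration, not MissingH's implementation: replaceAll is a hypothetical name, it uses stripPrefix from Data.List, and its handling of an empty search string (returning the input unchanged) is a choice made here rather than documented library behavior.

```haskell
import Data.List (stripPrefix)

-- Hypothetical stand-in for MissingH's replace: substitute every
-- non-overlapping occurrence of `old` with `new`, scanning left to right.
replaceAll :: String -> String -> String -> String
replaceAll "" _ s = s  -- assumption: an empty pattern leaves the input as is
replaceAll old new s = go s
  where
    go [] = []
    go str@(c:cs) = case stripPrefix old str of
      Just rest -> new ++ go rest  -- match found: emit replacement, skip it
      Nothing   -> c : go cs       -- no match here: keep char, move on

-- Same folding pattern as above, using the stand-in.
removePunctuation :: String -> String
removePunctuation = foldr (.) id $ map (flip replaceAll "") [",", "!", "?"]

main :: IO ()
main = do
  putStrLn $ replaceAll "hello" "goodbye" "hello world!"  -- goodbye world!
  putStrLn $ removePunctuation "hello, world!"            -- hello world
```

For single-character tokens like these, the filter-based version from the main recipe remains simpler; a string-level replace starts to pay off when the tokens to remove are longer than one character.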