官术网_书友最值得收藏!

Parsing custom data formats

If you work with data for long enough, you'll eventually come across data that you can't find a library for, and you'll need to write your own parser. Some formats might be simple enough for regular expressions, but if you need to balance syntactic structures in the input, or handle anything too complicated, you're probably better off creating a custom parser. Sometimes, custom parsers can be slower than regular expressions for very large inputs, but sometimes they're still your best option.

Clojure, and most functional languages, are great for parsing, and many have parser-combinator libraries that make writing parsers extremely simple.

For this recipe, as an example of a data format that needs parsing, we'll work with some FASTA data (http://en.wikipedia.org/wiki/FASTA_format). FASTA is a file format that's used in bioinformatics to exchange nucleotide and peptide sequences. Of course, there are parsers already for this, but it's a simple, yet non-trivial format, which makes a good example case for this recipe.

The first line of FASTA data starts with a > character followed by a unique identifier. This line often contains other information about the specimen described, the database it came from, and more. After this line comes one or more lines that list the sequence information. A more detailed explanation of the FASTA format is available at http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml. This is how an example FASTA record looks:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPY
IGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIP
FHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRS
VPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPY
TIIGQMASILYFSIILAFLPIAGXIENY

We'll use the parse-ez library (https://github.com/protoflex/parse-ez) to build the parser.

Getting ready

We need to make sure that parse-ez is listed in our Leiningen project.clj file:

(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [parse-ez "0.3.6"]])

We also need to make it available to our script or REPL:

(use 'protoflex.parse)

How to do it…

To define a parser, we just define functions that parse the different parts of the input and then combine them to parse larger structures:

  1. It would be useful to have a way to parse two things and throw away the results of the second. This function will do that:
    (defn <| [l r]
      (let [l-output (l)]
        (r)
        l-output))
  2. Also, we'll define a parser for the end of a line. It matches either a carriage return or a new line:
    (defn nl []
      (chr-in #{\newline \return}))
  3. Let's start putting the pieces together. The first function parses the sequence definition line by accepting a > character, followed by anything up to the end of the line:
    (defn defline []
      (chr \>)
      (<| #(read-to-re #"[\n\r]+") nl))
  4. We parse a sequence of amino acid or nucleic acid codes by defining a parser for a single code and then building on that to create a parser for a line of code:
    (defn acid-code []
      (chr-in #{\A \B \C \D \E \F \G \H \I \K \L \M
                \N \P \Q \R \S \T \U \V \W \X \Y \Z
                \- \*}))
    
    (defn acid-code-line []
      (<| #(multi+ acid-code) #(attempt nl)))
  5. Next, we combine these parsers into one that parses an entire FASTA record and populates a map with our data. Moreover, we define a combinator that parses multiple FASTA records:
    (defn fasta []
      (ws?)
      (let [dl (defline)
            gls (apply str (flatten
                             (multi+ acid-code-line)))]
        {:defline dl, :gene-seq gls}))
    (defn multi-fasta []
      (<| #(multi+ fasta)
          ws?))
  6. Finally, we create a wrapper function that passes our parser to parse-ez and configures the parsing engine the way we need it:
    (defn parse-fasta [input]
      (parse multi-fasta input
             :eof false :auto-trim false))

Now, we can use this function to parse the example record, provided at the beginning of this recipe:

user=> parse-fasta test-data)
{:defline
 "gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]",
 :gene-seq
"LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGXIENY"}

How it works…

At the most abstract level, parsers are functions. They take a string as input and return a data structure. More complex, advanced parsers are built by combining simpler elements.

The <| function is a good example of this. It does not parse anything by itself. However, this function makes it possible to combine two other parsers in a useful way: it parses both parts and throws away the results of the second.

The acid-code function is an example of how to create a parser from a basic component. It matches any of the characters in the set.

acid-code-line then combines the acid-code parser. It has to match one or more acid-code characters, optionally followed by a newline. It uses the <| combinator to throw away the newline and return the sequence of acid-codes.

This entire parser is built up in this way, by composing complex structures from simple parts. While this is a very basic parser, it's possible to create quite complex parsers in this way, leveraging the full power of Clojure while keeping the code readable and maintainable.

主站蜘蛛池模板: 无棣县| 柘荣县| 资中县| 海林市| 漳浦县| 巴中市| 孟村| 临沂市| 桐庐县| 太湖县| 多伦县| 赤峰市| 绥滨县| 宁国市| 维西| 上思县| 什邡市| 蕲春县| 江都市| 辽宁省| 格尔木市| 崇州市| 英山县| 锡林浩特市| 麻阳| 大田县| 久治县| 抚顺县| 安溪县| 洛阳市| 广元市| 东光县| 崇州市| 察隅县| 庆城县| 饶河县| 胶南市| 仁怀市| 军事| 灌南县| 淮滨县|