- Clojure Data Analysis Cookbook(Second Edition)
- Eric Rochester
- 555字
- 2021-08-06 19:26:09
Parsing custom data formats
If you work with data for long enough, you'll eventually come across data that you can't find a library for, and you'll need to write your own parser. Some formats might be simple enough for regular expressions, but if you need to balance syntactic structures in the input, or handle anything too complicated, you're probably better off creating a custom parser. Sometimes, custom parsers can be slower than regular expressions for very large inputs, but sometimes they're still your best option.
Clojure, and most functional languages, are great for parsing, and many have parser-combinator libraries that make writing parsers extremely simple.
For this recipe, as an example of a data format that needs parsing, we'll work with some FASTA data (http://en.wikipedia.org/wiki/FASTA_format). FASTA is a file format that's used in bioinformatics to exchange nucleotide and peptide sequences. Of course, there are parsers already for this, but it's a simple, yet non-trivial format, which makes a good example case for this recipe.
The first line of FASTA data starts with a >
character followed by a unique identifier. This line often contains other information about the specimen described, the database it came from, and more. After this line comes one or more lines that list the sequence information. A more detailed explanation of the FASTA format is available at http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml. This is how an example FASTA record looks:
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus] LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPY IGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIP FHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRS VPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPY TIIGQMASILYFSIILAFLPIAGXIENY
We'll use the parse-ez library (https://github.com/protoflex/parse-ez) to build the parser.
Getting ready
We need to make sure that parse-ez
is listed in our Leiningen project.clj
file:
(defproject cleaning-data "0.1.0-SNAPSHOT" :dependencies [[org.clojure/clojure "1.6.0"] [parse-ez "0.3.6"]])
We also need to make it available to our script or REPL:
(use 'protoflex.parse)
How to do it…
To define a parser, we just define functions that parse the different parts of the input and then combine them to parse larger structures:
- It would be useful to have a way to parse two things and throw away the results of the second. This function will do that:
(defn <| [l r] (let [l-output (l)] (r) l-output))
- Also, we'll define a parser for the end of a line. It matches either a carriage return or a new line:
(defn nl [] (chr-in #{\newline \return}))
- Let's start putting the pieces together. The first function parses the sequence definition line by accepting a
>
character, followed by anything up to the end of the line:(defn defline [] (chr \>) (<| #(read-to-re #"[\n\r]+") nl))
- We parse a sequence of amino acid or nucleic acid codes by defining a parser for a single code and then building on that to create a parser for a line of code:
(defn acid-code [] (chr-in #{\A \B \C \D \E \F \G \H \I \K \L \M \N \P \Q \R \S \T \U \V \W \X \Y \Z \- \*})) (defn acid-code-line [] (<| #(multi+ acid-code) #(attempt nl)))
- Next, we combine these parsers into one that parses an entire FASTA record and populates a map with our data. Moreover, we define a combinator that parses multiple FASTA records:
(defn fasta [] (ws?) (let [dl (defline) gls (apply str (flatten (multi+ acid-code-line)))] {:defline dl, :gene-seq gls})) (defn multi-fasta [] (<| #(multi+ fasta) ws?))
- Finally, we create a wrapper function that passes our parser to parse-ez and configures the parsing engine the way we need it:
(defn parse-fasta [input] (parse multi-fasta input :eof false :auto-trim false))
Now, we can use this function to parse the example record, provided at the beginning of this recipe:
user=> parse-fasta test-data) {:defline "gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]", :gene-seq "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGXIENY"}
How it works…
At the most abstract level, parsers are functions. They take a string as input and return a data structure. More complex, advanced parsers are built by combining simpler elements.
The <|
function is a good example of this. It does not parse anything by itself. However, this function makes it possible to combine two other parsers in a useful way: it parses both parts and throws away the results of the second.
The acid-code
function is an example of how to create a parser from a basic component. It matches any of the characters in the set.
acid-code-line
then combines the acid-code
parser. It has to match one or more acid-code
characters, optionally followed by a newline
. It uses the <|
combinator to throw away the newline
and return the sequence of acid-codes
.
This entire parser is built up in this way, by composing complex structures from simple parts. While this is a very basic parser, it's possible to create quite complex parsers in this way, leveraging the full power of Clojure while keeping the code readable and maintainable.
- Mastering Entity Framework Core 2.0
- C語言程序設計案例教程(第2版)
- Objective-C應用開發全程實錄
- 構建移動網站與APP:HTML 5移動開發入門與實戰(跨平臺移動開發叢書)
- jQuery從入門到精通 (軟件開發視頻大講堂)
- 硅谷Python工程師面試指南:數據結構、算法與系統設計
- SQL 經典實例
- Android項目實戰:手機安全衛士開發案例解析
- “笨辦法”學C語言
- Raspberry Pi Robotic Blueprints
- Mastering Docker
- 分布式架構原理與實踐
- 從零開始學Selenium自動化測試:基于Python:視頻教學版
- 奔跑吧 Linux內核
- Building UIs with Wijmo