
Lazily processing very large data sets

One of the nice features of Clojure is that most of its sequence-processing functions are lazy. This allows us to handle very large datasets with very little effort. However, when laziness is combined with reading from files and other I/O, there are several things you need to watch out for.

In this recipe, we'll take a look at several ways to safely and lazily read a CSV file. By default, the clojure.data.csv/read-csv function is lazy, so how do we maintain this feature while still closing the file at the right time?

Getting ready

We'll use a project.clj file that includes a dependency on the Clojure CSV library:

(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [org.clojure/data.csv "0.1.2"]])

We need to load the libraries that we're going to use into the REPL:

(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

How to do it…

We'll try several solutions and consider their strengths and weaknesses:

  1. Let's start with the most straightforward way:
    (defn lazy-read-bad-1 [csv-file]
      (with-open [in-file (io/reader csv-file)]
        (csv/read-csv in-file)))
    user=> (lazy-read-bad-1 "data/small-sample.csv")
    IOException Stream closed  java.io.BufferedReader.ensureOpen (BufferedReader.java:97)

    Oops! At the point where the function returns the lazy sequence, it hasn't read any data yet. However, when exiting the with-open form, the file is automatically closed. What happened?

    First, the file is opened and passed to read-csv, which returns a lazy sequence. The lazy sequence is returned from with-open, which closes the file. Finally, the REPL tries to print out this lazy sequence. Now, read-csv tries to pull data from the file. However, at this point the file is closed, so the IOException is raised.

    This is a pretty common problem for the first draft of a function. It especially seems to bite me whenever I'm doing database reads, for some reason.

  2. So, in order to fix this, we'll just force all of the lines to be read:
    (defn lazy-read-bad-2 [csv-file]
      (with-open [in-file (io/reader csv-file)]
        (doall
          (csv/read-csv in-file))))

    This will return the data, but everything gets loaded into memory. Now we have safety, but no laziness.

  3. Here's how we can get both:
    (defn lazy-read-ok [csv-file]
      (with-open [in-file (io/reader csv-file)]
        (frequencies
          (map #(nth % 2) (csv/read-csv in-file)))))

    This is one way to do it. Now, we've moved the processing of the data into the function that reads it. This works, but it has a poor separation of concerns: it both reads and processes the data, and we really should break these into two functions.

  4. Let's try it one more time:
    (defn lazy-read-csv [csv-file]
      (let [in-file (io/reader csv-file)
            csv-seq (csv/read-csv in-file)
            ;; Wrap the lazy sequence so that the reader is closed as
            ;; soon as the wrapped sequence runs out of data. The .close
            ;; call returns nil, which ends the lazy sequence.
            lazy (fn lazy [wrapped]
                   (lazy-seq
                     (if-let [s (seq wrapped)]
                       (cons (first s) (lazy (rest s)))
                       (.close in-file))))]
        (lazy csv-seq)))

This works! Let's talk about why.
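As a quick end-to-end check, here is how lazy-read-csv might be used. The CSV content and filename are just illustrative; the function definition from step 4 is repeated so that this snippet runs on its own:

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

;; The function from step 4, repeated for completeness.
(defn lazy-read-csv [csv-file]
  (let [in-file (io/reader csv-file)
        csv-seq (csv/read-csv in-file)
        lazy (fn lazy [wrapped]
               (lazy-seq
                 (if-let [s (seq wrapped)]
                   (cons (first s) (lazy (rest s)))
                   (.close in-file))))]
    (lazy csv-seq)))

;; Write a throwaway three-row CSV file to read back.
(spit "sample.csv" "a,b,c\nd,e,f\ng,h,c\n")

;; Tallying the third column realizes the whole sequence, which closes
;; the underlying reader when the data runs out.
(frequencies (map #(nth % 2) (lazy-read-csv "sample.csv")))
;; → {"c" 2, "f" 1}
```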

How it works…

The last version of the function, lazy-read-csv, works because it takes the lazy sequence that csv/read-csv produces and wraps it in another sequence that closes the input file when there is no more data coming out of the CSV file. This is complicated because we're working with two levels of input: reading from the file and reading CSV. When the higher-level task (reading CSV) is completed, it triggers an operation on the lower level (reading the file). This allows you to read files that don't fit into memory and process their data on the fly.
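This wrapping trick isn't specific to CSV files. As a sketch (seq-with-cleanup is a made-up name for illustration), the same pattern can close over any cleanup action and run it once the wrapped sequence is exhausted:

```clojure
;; Wrap any lazy sequence so that a cleanup thunk runs once the wrapped
;; sequence runs out of elements. The else branch must yield nil so that
;; the lazy sequence terminates cleanly (just as .close returns nil in
;; lazy-read-csv above).
(defn seq-with-cleanup [xs cleanup]
  (lazy-seq
    (if-let [s (seq xs)]
      (cons (first s) (seq-with-cleanup (rest s) cleanup))
      (do (cleanup) nil))))

;; For example, print a message when the last element has been consumed:
(doall (seq-with-cleanup [1 2 3] #(println "all done")))
```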

However, with this function, we again have a nice, simple interface that we can present to callers while keeping the complexity hidden.

Unfortunately, this still has one glaring problem: if we don't read the entire file (say we're only interested in the first 100 lines), the file handle won't get closed. For use cases in which only part of the file will be read, lazy-read-ok is probably the best option.
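Another way around the partial-read problem (a sketch, not part of the recipe; process-csv is a hypothetical name) is to invert the interface: instead of returning a lazy sequence, take a function from the caller and apply it inside with-open, so the file is closed no matter how much of it gets read:

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

;; Run f over the lazy CSV rows while the file is still open. f must
;; realize everything it needs (e.g. with doall) before returning,
;; because the reader is closed as soon as process-csv exits.
(defn process-csv [csv-file f]
  (with-open [in-file (io/reader csv-file)]
    (f (csv/read-csv in-file))))

;; Write a throwaway CSV file so this snippet is runnable on its own.
(spit "sample.csv" "a,b\nc,d\ne,f\n")

;; Reading only the first two rows no longer leaks a file handle:
(process-csv "sample.csv" #(doall (take 2 %)))
;; → (["a" "b"] ["c" "d"])
```

This keeps laziness inside the dynamic extent of with-open, at the cost of a slightly less convenient calling convention.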
