- Clojure Data Analysis Cookbook(Second Edition)
- Eric Rochester
- 2021-08-06 19:26:09
Lazily processing very large data sets
One of Clojure's strengths is that most of its sequence-processing functions are lazy, which lets us handle very large datasets with little effort. However, when laziness is combined with reading from files and other I/O, there are several things that you need to watch out for.
In this recipe, we'll take a look at several ways to safely and lazily read a CSV file. By default, clojure.data.csv/read-csv is lazy, so how do you maintain this feature while closing the file at the right time?
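As a quick illustration of the laziness this recipe relies on, here is a minimal sketch that counts how many elements are actually computed. The names computed, expensive, and squares are made up for this example and are not part of the recipe.

```clojure
;; Hypothetical counter: how many times `expensive` has been called.
(def computed (atom 0))

(defn expensive
  "Squares x and records that a computation took place."
  [x]
  (swap! computed inc)
  (* x x))

;; (iterate inc 0) is an infinite, unchunked sequence, so mapping over it
;; does no work until elements are requested.
(def squares (map expensive (iterate inc 0)))

(println @computed)                ; nothing has been computed yet

;; Realizing five elements computes only those five.
(println (doall (take 5 squares)))
(println @computed)
```

The same property is what makes it possible to walk through a CSV file that is larger than memory: rows are only parsed when something asks for them.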
Getting ready
We'll use a project.clj file that includes a dependency on the Clojure CSV library:
(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [org.clojure/data.csv "0.1.2"]])
We need to load the libraries that we're going to use into the REPL:
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])
How to do it…
We'll try several solutions and consider their strengths and weaknesses:
- Let's start with the most straightforward way:
(defn lazy-read-bad-1 [csv-file]
  (with-open [in-file (io/reader csv-file)]
    (csv/read-csv in-file)))

user=> (lazy-read-bad-1 "data/small-sample.csv")
IOException Stream closed  java.io.BufferedReader.ensureOpen (BufferedReader.java:97)
Oops! At the point where the function returns the lazy sequence, it hasn't read any data yet. However, when exiting the with-open form, the file is automatically closed. What happened?
First, the file is opened and passed to read-csv, which returns a lazy sequence. The lazy sequence is returned from with-open, which closes the file. Finally, the REPL tries to print out this lazy sequence. Now, read-csv tries to pull data from the file. However, at this point the file is closed, so the IOException is raised.
This is a pretty common problem for the first draft of a function. It especially seems to bite me whenever I'm doing database reads, for some reason.
- So, in order to fix this, we'll just force all of the lines to be read:
(defn lazy-read-bad-2 [csv-file]
  (with-open [in-file (io/reader csv-file)]
    (doall
      (csv/read-csv in-file))))
This returns data, but everything gets loaded into memory. Now, we have safety but no laziness.
- Here's how we can get both:
(defn lazy-read-ok [csv-file]
  (with-open [in-file (io/reader csv-file)]
    (frequencies
      (map #(nth % 2) (csv/read-csv in-file)))))
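This is one way to do it. Because frequencies consumes the whole sequence before with-open exits, the rows are read while the file is still open, and only the resulting counts are kept. We've moved what we're going to do to the data into the function that reads it. This works, but it has a poor separation of concerns. It is both reading and processing the data, and we really should break these into two functions.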
- Let's try it one more time:
(defn lazy-read-csv [csv-file]
  (let [in-file (io/reader csv-file)
        csv-seq (csv/read-csv in-file)
        lazy (fn lazy [wrapped]
               (lazy-seq
                 (if-let [s (seq wrapped)]
                   (cons (first s) (lazy (rest s)))
                   (.close in-file))))]
    (lazy csv-seq)))
This works! Let's talk about why.
How it works…
The last version of the function, lazy-read-csv, works because it takes the lazy sequence that csv/read-csv produces and wraps it in another sequence that closes the input file when there is no more data coming out of the CSV file. This is complicated because we're working with two levels of input: reading from the file and reading CSV. When the higher-level task (reading CSV) is completed, it triggers an operation on the lower level (closing the file). This allows you to read files that don't fit into memory and process their data on the fly.
With lazy-read-csv, we again have a nice, simple interface that we can present to callers while keeping the complexity hidden.
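The ordering of reads and closes described above can be traced with a small sketch. It assumes stand-ins rather than real files: tracked-resource, read-rows, close-when-done, and the events log are hypothetical names invented here, and the fake resource only records that it was closed instead of throwing the IOException a real reader would.

```clojure
;; Log of what happened, in order.
(def events (atom []))

(defn tracked-resource
  "Hypothetical stand-in for a file reader: it only records when it is closed."
  []
  (reify java.io.Closeable
    (close [_] (swap! events conj :closed))))

(defn read-rows
  "Stand-in for a lazy reader such as csv/read-csv: logs each row as it is read."
  [rows]
  (lazy-seq
    (when-let [s (seq rows)]
      (swap! events conj [:read (first s)])
      (cons (first s) (read-rows (rest s))))))

(defn close-when-done
  "Wraps coll so that resource is closed once coll is exhausted."
  [^java.io.Closeable resource coll]
  (lazy-seq
    (if-let [s (seq coll)]
      (cons (first s) (close-when-done resource (rest s)))
      (.close resource))))   ; .close returns nil, which ends the sequence

;; with-open closes the resource before anything is read.
(reset! events [])
(doall (with-open [r (tracked-resource)]
         (read-rows ["a" "b"])))
(def early-events @events)    ; [:closed [:read "a"] [:read "b"]]

;; The wrapper closes the resource only after the last row.
(reset! events [])
(doall (close-when-done (tracked-resource) (read-rows ["a" "b"])))
(def wrapped-events @events)  ; [[:read "a"] [:read "b"] :closed]
```

With a real reader, the first ordering is exactly the case that raises "Stream closed", because the reads are attempted after the close.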
Unfortunately, lazy-read-csv still has one glaring problem: if we're not going to read the entire file (say we're only interested in the first 100 lines or something), the file handle won't get closed. For the use cases in which only a part of the file will be read, lazy-read-ok is probably the best option.
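The partial-read problem can be sketched in the same spirit. Again, tracked-resource, close-when-done, and the closed? flag are hypothetical names used only for illustration, and (range 100) stands in for the rows of a file.

```clojure
;; Flag that records whether the fake resource has been closed.
(def closed? (atom false))

(defn tracked-resource
  "Hypothetical stand-in for a file reader."
  []
  (reify java.io.Closeable
    (close [_] (reset! closed? true))))

(defn close-when-done
  "Wraps coll so that resource is closed once coll is exhausted."
  [^java.io.Closeable resource coll]
  (lazy-seq
    (if-let [s (seq coll)]
      (cons (first s) (close-when-done resource (rest s)))
      (.close resource))))

;; Partial read through the wrapper: the end of the sequence is never
;; reached, so the resource is never closed.
(reset! closed? false)
(def first-two
  (doall (take 2 (close-when-done (tracked-resource) (range 100)))))
(def closed-after-wrapper @closed?)      ; false

;; Same partial read done eagerly inside with-open, in the style of
;; lazy-read-ok: the resource is closed as soon as the form exits.
(reset! closed? false)
(def first-two-safe
  (with-open [r (tracked-resource)]
    (doall (take 2 (range 100)))))
(def closed-after-with-open @closed?)    ; true
```

Both approaches return the same two elements; the difference is only in whether the handle is released.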