- Clojure Data Analysis Cookbook(Second Edition)
- Eric Rochester
- 2021-08-06 19:26:08
Parsing dates and times
One difficult issue when normalizing and cleaning up data is how to deal with time. People enter dates and times in a bewildering variety of formats; some of them are ambiguous, and some of them are vague. However, we have to do our best to interpret them and normalize them into a standard format.
In this recipe, we'll define a function that attempts to parse a date into a standard string format. We'll use the clj-time Clojure library, which is a wrapper around the Joda Java library (http://joda-time.sourceforge.net/).
Getting ready
First, we need to declare our dependencies in the Leiningen project.clj file:
(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clj-time "0.9.0-beta1"]])
Then, we need to load these dependencies into our script or REPL. We'll exclude extend and second from clj-time.core to keep them from clashing with clojure.core/extend and clojure.core/second:
(use '[clj-time.core :exclude (extend second)]
     '[clj-time.format])
How to do it…
In order to solve this problem, we'll specify a sequence of date/time formats and walk through them. The first that doesn't throw an exception will be the one that we'll use.
- Here's a list of formats that you can try:
(def ^:dynamic *default-formats*
  [:date
   :date-hour-minute
   :date-hour-minute-second
   :date-hour-minute-second-ms
   :date-time
   :date-time-no-ms
   :rfc822
   "YYYY-MM-dd HH:mm"
   "YYYY-MM-dd HH:mm:ss"
   "dd/MM/YYYY"
   "YYYY/MM/dd"
   "d MMM YYYY"])
- Notice that some of these are keywords and some are strings. Each needs to be handled differently. We'll define a protocol with the method ->formatter, which attempts to convert each type to a date formatter, and extend the protocol to both of the types that appear in the format list:
(defprotocol ToFormatter
  (->formatter [fmt]))
(extend-protocol ToFormatter
  java.lang.String
  (->formatter [fmt] (formatter fmt))
  clojure.lang.Keyword
  (->formatter [fmt] (formatters fmt)))
- Next, parse-or-nil will take a format and a date string, attempt to parse the date string, and return nil if there are any errors:
(defn parse-or-nil [fmt date-str]
  (try
    (parse (->formatter fmt) date-str)
    (catch Exception ex
      nil)))
- With these in place, here is normalize-datetime. We just attempt to parse a date string with all of the formats, filter out any nil values, and return the first non-nil result. Because map returns a lazy sequence, nothing is parsed until first asks for a result, and formats beyond the first match need not be tried (Clojure realizes vector-backed sequences in chunks, so with a short list like ours every format may still be attempted):
(defn normalize-datetime [date-str]
  (first
    (remove nil?
            (map #(parse-or-nil % date-str)
                 *default-formats*))))
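One caveat on the laziness in the last step, sketched here in plain Clojure with no clj-time involved: map over a vector yields a chunked sequence, so first can trigger up to 32 attempts at once. The names try-format, calls, and formats below are hypothetical stand-ins for parse-or-nil and the format list, and exist only to count attempts. The use of some is one possible alternative, not the recipe's own method.

```clojure
;; Hypothetical stand-in for parse-or-nil: counts how often it is called
;; and "succeeds" only for the :match format.
(def calls (atom 0))

(defn try-format [fmt date-str]
  (swap! calls inc)
  (when (= fmt :match) date-str))

(def formats [:match :a :b :c])

;; The recipe's shape: map over a vector is chunked, so all four
;; formats are attempted even though the first one succeeds.
(def via-map
  (first (remove nil? (map #(try-format % "2012-09-12") formats))))
(def map-calls @calls)

;; Alternative: some walks one element at a time and stops at the
;; first non-nil result.
(reset! calls 0)
(def via-some
  (some #(try-format % "2012-09-12") formats))
(def some-calls @calls)

(println map-calls some-calls) ; prints: 4 1
```

With a dozen formats the extra attempts are cheap, so the recipe's version is fine in practice; the difference only matters when each attempt is expensive. Both shapes return the same result, and normalize-datetime as defined in the recipe is what the rest of this section uses.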
Now we can try this out:
user=> (normalize-datetime "2012-09-12")
#<DateTime 2012-09-12T00:00:00.000Z>
user=> (normalize-datetime "2012/09/12")
#<DateTime 2012-09-12T00:00:00.000Z>
user=> (normalize-datetime "28 Sep 2012")
#<DateTime 2012-09-28T00:00:00.000Z>
user=> (normalize-datetime "2012-09-28 13:45")
#<DateTime 2012-09-28T13:45:00.000Z>
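The results above are Joda DateTime objects rather than strings. If a standard string is what's needed, clj-time.format/unparse can render one. The following is a minimal sketch, assuming the same clj-time dependency; the namespace aliases t and f and the vars dt, iso-string, and date-string are chosen here for illustration only.

```clojure
(require '[clj-time.core :as t]
         '[clj-time.format :as f])

;; A DateTime in UTC, equivalent to what parsing "2012-09-28 13:45" returns.
(def dt (t/date-time 2012 9 28 13 45))

;; Render it with two of clj-time's built-in ISO 8601 formatters.
(def iso-string  (f/unparse (f/formatters :date-time) dt))
(def date-string (f/unparse (f/formatters :date) dt))

(println iso-string)  ; prints: 2012-09-28T13:45:00.000Z
(println date-string) ; prints: 2012-09-28
```

Picking a single output formatter for the whole dataset keeps the normalized values comparable as plain strings.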
There's more…
This approach to parsing dates has a number of problems. For example, because some date formats are ambiguous, the first match might not be the correct one.
However, trying out a list of formats is probably about the best we can do. Knowing something about our data allows us to prioritize the list appropriately, and we can augment it with ad hoc formats as we run across new data. We might also need to normalize data from different sources (for instance, U.S. date formats versus the rest of the world) before we merge the data together.
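To make the ambiguity and the effect of prioritizing concrete, here is a small sketch that parses the same string with two differently ordered format lists. The names parse-with, us-formats, and intl-formats are hypothetical and not part of the recipe; the sketch assumes the same clj-time dependency and uses namespace aliases instead of use.

```clojure
(require '[clj-time.core :as t]
         '[clj-time.format :as f])

;; Hypothetical helper: try each pattern in order and return the first
;; DateTime that parses, or nil if none does.
(defn parse-with [patterns date-str]
  (some (fn [pattern]
          (try
            (f/parse (f/formatter pattern) date-str)
            (catch Exception _ nil)))
        patterns))

;; Two priority orders for the same pair of ambiguous patterns.
(def us-formats   ["MM/dd/YYYY" "dd/MM/YYYY"])
(def intl-formats ["dd/MM/YYYY" "MM/dd/YYYY"])

(def as-us   (parse-with us-formats "01/02/2012"))
(def as-intl (parse-with intl-formats "01/02/2012"))

(println (t/month as-us) (t/day as-us))     ; prints: 1 2
(println (t/month as-intl) (t/day as-intl)) ; prints: 2 1

;; An unambiguous value still parses under either order, because the
;; pattern that would give month 13 is rejected and the next one is tried.
(def fallback (parse-with us-formats "13/02/2012"))
(println (t/month fallback) (t/day fallback)) ; prints: 2 13
```

Since *default-formats* in the recipe is declared ^:dynamic, a similar per-source ordering could presumably be supplied with binding around calls to normalize-datetime.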