
Parsing dates and times

One difficult issue when normalizing and cleaning up data is how to deal with time. People enter dates and times in a bewildering variety of formats; some of them are ambiguous, and some of them are vague. However, we have to do our best to interpret them and normalize them into a standard format.

In this recipe, we'll define a function that attempts to parse a date string into a standard form: a DateTime object. We'll use the clj-time Clojure library, which is a wrapper around the Joda-Time Java library (http://joda-time.sourceforge.net/).

Getting ready

First, we need to declare our dependencies in the Leiningen project.clj file:

(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clj-time "0.9.0-beta1"]])

Then, we need to load these dependencies into our script or REPL. We'll exclude extend and second from clj-time.core to keep them from clashing with clojure.core/extend and clojure.core/second:

(use '[clj-time.core :exclude (extend second)]
     '[clj-time.format])

How to do it…

In order to solve this problem, we'll specify a sequence of date/time formats and walk through them. The first that doesn't throw an exception will be the one that we'll use.

  1. Here's a list of formats that you can try:
    (def ^:dynamic *default-formats*
      [:date
       :date-hour-minute
       :date-hour-minute-second
       :date-hour-minute-second-ms
       :date-time
       :date-time-no-ms
       :rfc822
       "YYYY-MM-dd HH:mm"
       "YYYY-MM-dd HH:mm:ss"
       "dd/MM/YYYY"
       "YYYY/MM/dd"
       "d MMM YYYY"])
  2. Notice that some of these are keywords and some are strings. Each needs to be handled differently: a keyword names one of clj-time's predefined formatters, while a string is a custom format pattern. We'll define a protocol with the method ->formatter, which converts a value to a date formatter, and then extend the protocol to both of the types that appear in the format list:
    (defprotocol ToFormatter
      (->formatter [fmt]))
    
    (extend-protocol ToFormatter
      java.lang.String
      (->formatter [fmt] (formatter fmt))
      clojure.lang.Keyword
      (->formatter [fmt] (formatters fmt)))
  3. Next, parse-or-nil will take a format and a date string, attempt to parse the date string, and return nil if there are any errors:
    (defn parse-or-nil [fmt date-str]
      (try
        (parse (->formatter fmt) date-str)
        (catch Exception ex
          nil)))
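  4. With these in place, here is normalize-datetime. We attempt to parse the date string with each of the formats, remove any nil values, and return the first result that is left. Because map and remove return lazy sequences, the formats are tried only on demand. Note, though, that Clojure realizes a sequence over a vector in chunks of up to 32 elements, so with a short list like ours every format may be tried before first returns: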
    (defn normalize-datetime [date-str]
      (first
        (remove nil?
                (map #(parse-or-nil % date-str)
                     *default-formats*))))
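How much work the lazy sequence in step 4 actually saves depends on chunking: map over a vector realizes up to 32 elements at a time. The following is a minimal sketch in plain Clojure, with no clj-time involved. The names tried and try-format are hypothetical stand-ins for parse-or-nil; they only record which formats were attempted.

```clojure
;; Hypothetical stand-in for parse-or-nil: it records every format it is
;; asked to try and "succeeds" only for :b.
(def tried (atom []))

(defn try-format [fmt]
  (swap! tried conj fmt)
  (when (= fmt :b) fmt))

;; Same shape as normalize-datetime: a lazy map over a vector.
(def lazy-result (first (remove nil? (map try-format [:a :b :c :d]))))
(def lazy-tried @tried)   ; [:a :b :c :d], the whole chunk was realized

;; some walks the collection one element at a time and stops at the
;; first truthy result.
(reset! tried [])
(def some-result (some try-format [:a :b :c :d]))
(def some-tried @tried)   ; [:a :b]
```

If stopping at the first successful format matters, for example because the list is long, one option under these assumptions would be to write the body of normalize-datetime as (some #(parse-or-nil % date-str) *default-formats*). Both versions return the same value.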

Now we can try this out:

user=> (normalize-datetime "2012-09-12")
#<DateTime 2012-09-12T00:00:00.000Z>
user=> (normalize-datetime "2012/09/12")
#<DateTime 2012-09-12T00:00:00.000Z>
user=> (normalize-datetime "28 Sep 2012")
#<DateTime 2012-09-28T00:00:00.000Z>
user=> (normalize-datetime "2012-09-28 13:45")
#<DateTime 2012-09-28T13:45:00.000Z>
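The values returned above are DateTime objects. If a normalized string is needed instead, for example to write the cleaned data back out, clj-time's unparse function can render a DateTime with any formatter. The sketch below assumes ISO 8601 output is what we want; the helper name datetime->iso-string is hypothetical and not part of the recipe.

```clojure
;; Assumes clj-time is on the classpath, as declared in project.clj.
(require '[clj-time.core :as t]
         '[clj-time.format :as f])

;; Hypothetical helper: render a DateTime as an ISO 8601 string, passing
;; nil through so that failed parses stay nil.
(defn datetime->iso-string [dt]
  (when dt
    (f/unparse (f/formatters :date-time) dt)))

(datetime->iso-string (t/date-time 2012 9 28 13 45))
;; => "2012-09-28T13:45:00.000Z"
```

It could be composed with the recipe's function as (datetime->iso-string (normalize-datetime "28 Sep 2012")).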

There's more…

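This approach to parsing dates has a number of problems. For example, because some date formats are ambiguous, the first match might not be the correct one: with the list above, 01/02/2012 matches dd/MM/YYYY and is read as 1 February 2012, even if the source meant January 2.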

However, trying out a list of formats is probably about the best we can do. Knowing something about our data allows us to prioritize the list appropriately, and we can augment it with ad hoc formats as we run across new data. We might also need to normalize data from different sources (for instance, U.S. date formats versus the rest of the world) before we merge the data together.
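One way to prioritize formats per data source is to pass the format list in explicitly. The following is a sketch, not the recipe's own code: parse-with-formats, us-first, and rest-of-world-first are hypothetical names, and for brevity the function accepts only string patterns rather than the keyword formatters handled by ->formatter.

```clojure
;; Assumes clj-time is on the classpath, as declared in project.clj.
(require '[clj-time.format :as f])

;; Hypothetical variant of normalize-datetime that takes the format list
;; as an argument. Returns the first successful parse, or nil.
(defn parse-with-formats [formats date-str]
  (some (fn [fmt]
          (try
            (f/parse (f/formatter fmt) date-str)
            (catch Exception _ nil)))
        formats))

;; The same two patterns, ordered for the source the data came from.
(def us-first ["MM/dd/YYYY" "dd/MM/YYYY"])
(def rest-of-world-first ["dd/MM/YYYY" "MM/dd/YYYY"])

(def us-date (parse-with-formats us-first "01/02/2012"))
;; read as 2 January 2012
(def world-date (parse-with-formats rest-of-world-first "01/02/2012"))
;; read as 1 February 2012
```

Ordering only resolves the ambiguous cases. A string such as 13/02/2012 cannot be a U.S. date, so even with us-first it falls through to dd/MM/YYYY and is read as 13 February 2012.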
