Maintaining consistency with synonym maps

One common problem with data is inconsistency. Sometimes a value is capitalized and sometimes it is not; sometimes it is abbreviated and sometimes it is spelled out in full. At times, there is a misspelling.

When the field is an open domain, such as words in free text, the problem can be quite difficult. However, when the data is drawn from a limited vocabulary (US state names, in our example), there's a simple trick that can help. Full state names are common, but standard two-letter postal codes are also often used. A mapping from common forms or mistakes to a normalized form is an easy way to fix variants in a field.

Getting ready

For the project.clj file, we'll use a very simple configuration:

(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])

We just need to make sure that the clojure.string/upper-case function is available to us:

(require '[clojure.string :refer [upper-case]])

How to do it…

  1. For this recipe, we'll define the synonym map and a function that uses it, and then we'll see both in action. First, we'll define the mapping to a normalized form. Not every state is listed here, but the pattern should be clear:
    (def state-synonyms
      {"ALABAMA" "AL",
       "ALASKA" "AK",
       "ARIZONA" "AZ",
       …
       "WISCONSIN" "WI",
       "WYOMING" "WY"})
  2. We'll wrap it in a function that uppercases the input before looking it up in the mapping, as shown here:
    (defn normalize-state [state]
      (let [uc-state (upper-case state)]
        (state-synonyms uc-state uc-state)))
  3. Then, we just call normalize-state with the strings we want to fix:
    user=> (map normalize-state
            ["Alabama" "OR" "Va" "Fla"])
    ("AL" "OR" "VA" "FL")

How it works…

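The only wrinkle here is that we have to normalize the input a little by uppercasing it before we apply the synonym map. Otherwise, we'd need an entry for every way in which the input could be capitalized. Calling the map as a function with a second argument, as in (state-synonyms uc-state uc-state), supplies a default for missing keys, so input that is already a postal code, such as "OR" or "VA", passes through unchanged. An abbreviation such as "Fla" is normalized to "FL" only if the map has an entry for it; the output in step 3 implies that the elided part of the listing contains a "FLA" entry.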
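
As a minimal sketch of the lookup-with-default behavior, the following uses a deliberately tiny, hypothetical map (sample-synonyms) and function (normalize-sample). Neither name comes from the recipe, and the "FLA" entry is an assumed example of a variant form rather than part of the original listing:

```clojure
(require '[clojure.string :as string])

;; Hypothetical, deliberately tiny map: two full names plus one
;; assumed abbreviation variant. Illustrative only.
(def sample-synonyms
  {"ALABAMA" "AL"
   "FLORIDA" "FL"
   "FLA"     "FL"})

(defn normalize-sample
  "Uppercases state, then looks it up, falling back to the
  uppercased input when the map has no entry for it."
  [state]
  (let [uc-state (string/upper-case state)]
    ;; A map called as a function with two arguments returns the
    ;; second argument when the key is missing.
    (sample-synonyms uc-state uc-state)))

(map normalize-sample ["Alabama" "fla" "OR" "Calif"])
;; => ("AL" "FL" "OR" "CALIF")
```

Note that "Calif" comes back as "CALIF": an unrecognized variant is only uppercased, not corrected, so each variant we want to fix needs its own entry in the map.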

See also

  • The Fixing spelling errors recipe later in this chapter