- Clojure Data Analysis Cookbook (Second Edition)
- Eric Rochester
- 2021-08-06 19:26:08
Maintaining consistency with synonym maps
One common problem with data is inconsistency. Sometimes a value is capitalized, and sometimes it is not. Sometimes it is abbreviated, and sometimes it is spelled out in full. At times, it is simply misspelled.
When the field is an open domain, such as words in a free-text field, the problem can be quite difficult. However, when the data represents a limited vocabulary (such as US state names, for our example here), there's a simple trick that can help. While it's common to use full state names, standard postal codes are also often used. A mapping from common forms or mistakes to a normalized form is an easy way to fix variants in a field.
Getting ready
For the project.clj file, we'll use a very simple configuration:
(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])
We just need to make sure that the clojure.string/upper-case function is available to us:
(use '[clojure.string :only (upper-case)])
How to do it…
- For this recipe, we'll define the synonym map and a function that uses it, and then we'll see it in action. First, we'll define the mapping to a normalized form. I will not list all of the states here, but you should get the idea:
(def state-synonyms {"ALABAMA" "AL", "ALASKA" "AK", "ARIZONA" "AZ", … "WISCONSIN" "WI", "WYOMING" "WY"})
- We'll wrap it in a function that uppercases the input before querying the mapping, as shown here:
(defn normalize-state [state]
  (let [uc-state (upper-case state)]
    (state-synonyms uc-state uc-state)))
- Then, we just call normalize-state with the strings we want to fix:
user=> (map normalize-state ["Alabama" "OR" "Va" "Fla"])
("AL" "OR" "VA" "FL")
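In practice, the value to fix usually sits in one field of a larger record. The following is a minimal sketch of applying the function to such a field. It assumes rows are plain maps with a :state key; the rows, the normalize-rows helper, and the shortened synonym map (including the "FLA" entry) are made up for illustration and are not part of the recipe.

```clojure
(require '[clojure.string :as string])

;; Shortened, illustrative synonym map. The "FLA" entry is an assumption
;; about what the full map contains.
(def state-synonyms
  {"ALABAMA" "AL"
   "ALASKA"  "AK"
   "FLA"     "FL"})

(defn normalize-state [state]
  (let [uc-state (string/upper-case state)]
    ;; Fall back to the uppercased input when there is no synonym.
    (state-synonyms uc-state uc-state)))

;; Hypothetical input rows.
(def rows
  [{:name "Ann" :state "Alabama"}
   {:name "Bob" :state "fla"}
   {:name "Cy"  :state "OR"}])

(defn normalize-rows
  "Returns rows with the :state field normalized (hypothetical helper)."
  [rows]
  (map #(update-in % [:state] normalize-state) rows))

(map :state (normalize-rows rows))
;; => ("AL" "FL" "OR")
```

update-in is used here rather than update because update is not available in Clojure 1.6, the version named in project.clj.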
How it works…
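The only wrinkle here is that we have to normalize the input a little by making sure that it's uppercased before we look it up in the synonym map. Otherwise, we'd need an entry for every way in which the input could be capitalized. Passing uc-state as the second argument to the map lookup makes it the default value, so any input without an entry, such as one that is already a postal code, is returned uppercased but otherwise unchanged.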
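The lookup behavior can be seen in isolation at the REPL. This is a small sketch that uses a two-entry map named demo-synonyms, which is invented here for illustration; the "FLA" entry is likewise an assumption.

```clojure
(require '[clojure.string :as string])

;; Tiny illustrative map, not the recipe's full mapping.
(def demo-synonyms {"ALABAMA" "AL", "FLA" "FL"})

;; A Clojure map can be called as a function of its keys.
(demo-synonyms "ALABAMA")                        ; => "AL"

;; Keys are case-sensitive, so the raw input misses.
(demo-synonyms "Alabama")                        ; => nil

;; Uppercasing first makes every capitalization hit the same key.
(demo-synonyms (string/upper-case "Alabama"))    ; => "AL"

;; A second argument is returned when the key is absent, so values that
;; have no synonym pass through instead of becoming nil.
(demo-synonyms "OR" "OR")                        ; => "OR"
```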
See also
- The Fixing spelling errors recipe later in this chapter