- Mastering Scala Machine Learning
- Alex Kozlov
- 2021-07-02 16:33:20
Summarization of a numeric field
Let's look at the numeric data, even though most of the columns in the dataset are either categorical or complex. The traditional way to summarize numeric data is a five-number summary: the median (or mean), the interquartile range, and the minimum and maximum. I'll leave the computation of the median and interquartile range until the Spark DataFrame is introduced, as it makes these computations extremely easy, but we can compute the mean, min, and max in Scala by just applying the corresponding operators:
scala> import scala.sys.process._
import scala.sys.process._

scala> val nums = ( "gzcat chapter01/data/clickstream/clickstream_sample.tsv.gz" #| "cut -f 6" ).lineStream
nums: Stream[String] = Stream(0, ?)

scala> val m = nums.map(_.toDouble).min
m: Double = 0.0

scala> val m = nums.map(_.toDouble).sum/nums.size
m: Double = 3.6883642764024662

scala> val m = nums.map(_.toDouble).max
m: Double = 33.0
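For readers who want the median and interquartile range before the Spark DataFrame is introduced, here is a minimal plain-Scala sketch. It is not the book's method: the names NumericSummary, Summary, and summarize are hypothetical, the whole sample is assumed to fit in memory, and quantiles use linear interpolation between neighboring sorted values, which is only one of several common conventions.

```scala
object NumericSummary {
  // Hypothetical container for the summary values; not part of the book's code.
  final case class Summary(min: Double, q1: Double, median: Double,
                           q3: Double, max: Double, mean: Double) {
    def iqr: Double = q3 - q1 // interquartile range
  }

  // Quantile of an already sorted, non-empty array, using linear
  // interpolation between the two nearest ranks (an assumed convention).
  private def quantile(sorted: Array[Double], p: Double): Double = {
    val pos = p * (sorted.length - 1)
    val lo = math.floor(pos).toInt
    val hi = math.ceil(pos).toInt
    sorted(lo) + (pos - lo) * (sorted(hi) - sorted(lo))
  }

  def summarize(values: Seq[Double]): Summary = {
    require(values.nonEmpty, "cannot summarize an empty sample")
    val sorted = values.toArray     // copy, so the input is left untouched
    java.util.Arrays.sort(sorted)   // in-place sort of the copy
    Summary(
      min = sorted.head,
      q1 = quantile(sorted, 0.25),
      median = quantile(sorted, 0.5),
      q3 = quantile(sorted, 0.75),
      max = sorted.last,
      mean = sorted.sum / sorted.length
    )
  }
}
```

With the stream from the session above, this could be called as NumericSummary.summarize(nums.map(_.toDouble)). Sorting materializes every value, so the sketch suits samples rather than full datasets.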
Grepping across multiple fields
Sometimes one needs to get an idea of how a certain value looks across multiple fields; the most common are IP/MAC addresses, dates, and formatted messages. For example, if I want to see all IP addresses mentioned throughout a file or a document, I need to replace the cut command in the previous example with grep -o -E [1-9][0-9]{0,2}(?:\\.[1-9][0-9]{0,2}){3}, where the -o option instructs grep to print only the matching parts. A more precise regex for the IP address would be grep -o -E (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?), but it is about 50% slower on my laptop, and the original one works in most practical cases. I'll leave it as an exercise to run this command on the sample file provided with the book.
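The same search can also be sketched inside Scala with scala.util.matching.Regex instead of shelling out to grep. This is an illustration under assumptions, not the book's approach: the object IpGrep and its members are hypothetical names, and the two patterns are the ones quoted in the text, written as raw strings so the dot needs only a single backslash.

```scala
import scala.util.matching.Regex

object IpGrep {
  // Simpler pattern from the text: four dot-separated groups of one to
  // three digits, each group starting with 1-9.
  val SimpleIp: Regex = """[1-9][0-9]{0,2}(?:\.[1-9][0-9]{0,2}){3}""".r

  // Stricter pattern from the text: every octet is limited to 0-255.
  val StrictIp: Regex =
    """(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)""".r

  // Returns only the matching parts of each line, like grep -o.
  def findAll(lines: Iterator[String], pattern: Regex = SimpleIp): Iterator[String] =
    lines.flatMap(line => pattern.findAllIn(line))
}
```

For instance, IpGrep.findAll(scala.io.Source.fromFile(path).getLines()) would scan a plain-text file, where path is a placeholder for a file of your own. Because every group in the simpler pattern must start with 1-9, it skips addresses that contain a zero octet, such as 10.0.0.1; the stricter pattern accepts them.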