
Summarization of a numeric field

Let's look at the numeric data, even though most of the columns in the dataset are either categorical or complex. The traditional way to summarize numeric data is a five-number summary: the minimum, the lower quartile, the median, the upper quartile, and the maximum, where the two quartiles bound the interquartile range. The mean is often reported alongside it. I'll leave the computation of the median and the interquartile range until the Spark DataFrame is introduced, as it makes these computations extremely easy, but we can compute the mean, min, and max in Scala by just applying the corresponding operators:

scala> import scala.sys.process._
import scala.sys.process._
scala> val nums = ( "gzcat chapter01/data/clickstream/clickstream_sample.tsv.gz" #| "cut -f 6" ).lineStream
nums: Stream[String] = Stream(0, ?)
scala> val minValue = nums.map(_.toDouble).min
minValue: Double = 0.0
scala> val mean = nums.map(_.toDouble).sum/nums.size
mean: Double = 3.6883642764024662
scala> val maxValue = nums.map(_.toDouble).max
maxValue: Double = 33.0
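
The session above walks the stream once per statistic and converts every value to Double each time. As a minimal sketch of an alternative, the three statistics can be accumulated in a single pass with a fold. The names NumericSummary, Summary, and summarize are hypothetical and not part of the book's code, and silently skipping unparsable values is an assumption made here for illustration.

```scala
import scala.util.Try

object NumericSummary {
  // Running totals collected in one pass over the data (illustrative names).
  final case class Summary(count: Long, sum: Double, min: Double, max: Double) {
    // The mean is undefined for empty input, so NaN is returned in that case.
    def mean: Double = if (count == 0L) Double.NaN else sum / count
  }

  // Starting point of the fold: no values seen yet.
  private val empty =
    Summary(0L, 0.0, Double.PositiveInfinity, Double.NegativeInfinity)

  // Parses each string as a Double, skips values that fail to parse
  // (an assumption of this sketch), and folds the rest into a Summary.
  def summarize(values: Iterator[String]): Summary =
    values
      .flatMap(s => Try(s.trim.toDouble).toOption)
      .foldLeft(empty) { (acc, x) =>
        Summary(acc.count + 1, acc.sum + x, math.min(acc.min, x), math.max(acc.max, x))
      }
}
```

With the nums stream from the session above, this could be called as NumericSummary.summarize(nums.iterator), after which the min, max, and mean fields hold the three statistics.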

Grepping across multiple fields

Sometimes one needs to get an idea of how a certain kind of value looks across multiple fields; the most common cases are IP/MAC addresses, dates, and formatted messages. For example, if I want to see all IP addresses mentioned throughout a file or a document, I need to replace the cut command in the previous example with grep -o -E [1-9][0-9]{0,2}(\.[1-9][0-9]{0,2}){3}, where the -o option instructs grep to print only the matching parts (inside a Scala string literal, the backslash has to be doubled, as in \\.). A more precise regex for the IP address is grep -o -E ((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?), but it is about 50% slower on my laptop, and the simpler one works in most practical cases. I'll leave it as an exercise to run this command on the sample file provided with the book.
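
The same kind of extraction can also be sketched directly in Scala with scala.util.matching.Regex, without shelling out to grep. This is only an illustration: the names IpGrep and findIps are hypothetical, and the pattern is the stricter 0-255 octet form written in Java regex syntax. Like grep -o, it is unanchored, so it reports the matching part of a longer digit run rather than rejecting it.

```scala
object IpGrep {
  // One octet in the range 0-255 (Java regex syntax, non-capturing group).
  private val octet = "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

  // Three octets, each followed by a dot, then a final octet.
  val ipPattern = s"(?:$octet\\.){3}$octet".r

  // Returns every IP-like substring found in the lines, in order of appearance,
  // regardless of which field it occurs in.
  def findIps(lines: Iterator[String]): Iterator[String] =
    lines.flatMap(line => ipPattern.findAllIn(line))
}
```

For example, IpGrep.findIps(Iterator("src=192.168.1.10 dst=10.0.0.1")).toList yields the two addresses in that line. Because the pattern is applied line by line, it could be fed from a process stream like the one used for the numeric field.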
