Visualizing data with Apache Zeppelin

Typically, we will generate many graphs to verify our hunches about the data. A lot of these quick-and-dirty graphs used during exploratory data analysis (EDA) are ultimately discarded. Exploratory data visualization is critical for data analysis and modeling. However, we often skip exploratory visualization with large data because it is hard. For instance, browsers typically cannot handle millions of data points. Hence, we have to summarize, sample, or model our data before we can effectively visualize it.

Traditionally, BI tools provided extensive aggregation and pivoting features to visualize data. However, these tools typically used nightly jobs to summarize large volumes of data. The summarized data was subsequently downloaded and visualized on practitioners' workstations. Spark can eliminate many of these batch jobs to support interactive data visualization.

In this section, we will explore some basic data visualization techniques using Apache Zeppelin. Apache Zeppelin is a web-based tool that supports interactive data analysis and visualization. It supports several language interpreters and comes with built-in Spark integration. Hence, it is quick and easy to get started with exploratory data analysis using Apache Zeppelin:

  1. You can download Apache Zeppelin from https://zeppelin.apache.org/. Unzip the package on your hard drive and start Zeppelin using the following command:
      Aurobindos-MacBook-Pro-2:zeppelin-0.6.2-bin-all aurobindosarkar$ bin/zeppelin-daemon.sh start
  2. You should see the following message:
      Zeppelin start                                           [  OK  ]
  3. You should now be able to see the Zeppelin home page at http://localhost:8080/.
  4. Click on the Create new note link and specify a path and name for your notebook.
  5. In the next step, we paste the same code as at the beginning of this chapter to create a DataFrame for our sample Dataset (an illustrative sketch of the code for steps 5 through 8 follows this list).
  6. We can then execute typical DataFrame operations.
  7. Next, we create a table from our DataFrame and execute some SQL on it. The results of the SQL statements can be charted by clicking on the required chart type. Here, we create bar charts as an illustrative example of summarizing and visualizing data.
  8. We can also create a scatter plot.

You can also read the coordinate values of each of the plotted points.
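The DataFrame-creation code referred to in step 5 is not reproduced in this excerpt, so the following Scala paragraph is a minimal sketch of steps 5 through 7, assuming a small hypothetical dataset of customers; the column names, the sample values, and the customers view name are all illustrative:

      import spark.implicits._

      // Hypothetical sample data standing in for the chapter's Dataset (step 5).
      val df = Seq(
        ("Alice", 25, 1500.0),
        ("Bob", 31, 2300.5),
        ("Carol", 42, 870.25),
        ("Dave", 25, 4100.0)
      ).toDF("name", "age", "balance")

      // Typical DataFrame operations (step 6).
      df.printSchema()
      df.filter($"age" > 30).show()
      df.groupBy("age").count().show()

      // Register a temporary view so that %sql paragraphs can query it (step 7).
      df.createOrReplaceTempView("customers")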
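With the customers view registered, each query runs in its own %sql paragraph, and the toolbar above the result switches the rendering between a table, bar chart, scatter plot, and other chart types. Two illustrative queries for the bar chart (step 7) and scatter plot (step 8), again against the hypothetical view:

      %sql
      -- Bar chart: number of customers per age.
      SELECT age, count(*) AS cnt FROM customers GROUP BY age ORDER BY age

      %sql
      -- Scatter plot: age vs. balance; pointing at a plotted point
      -- displays its coordinate values.
      SELECT age, balance FROM customers

Each %sql block above is a separate notebook paragraph.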

  9. Additionally, we can create a textbox that accepts input values to make the experience interactive. For example, a textbox can accept different values for the age parameter, with the bar chart updating accordingly.
  10. Similarly, we can also create drop-down lists from which the user can select the appropriate option.

And the table of values or chart is updated automatically. Sketches of both dynamic forms follow.
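Zeppelin implements the textbox as a dynamic form: embedding ${formName=defaultValue} in a %sql paragraph renders an input field whose current value is substituted into the query on each run. A minimal sketch against the hypothetical customers view from above (the form name age and its default are illustrative):

      %sql
      -- Textbox form: 'age' defaults to 25; entering a new value re-runs
      -- the query, and the bar chart is redrawn accordingly.
      SELECT age, count(*) AS cnt FROM customers WHERE age = ${age=25} GROUP BY age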
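The drop-down list uses the select variant of the same dynamic form syntax, ${formName=defaultValue,option1|option2|...}. Again a sketch, with hypothetical options:

      %sql
      -- Drop-down form: the user picks a maximum age from the listed options,
      -- and the table or chart below is updated on selection.
      SELECT name, age FROM customers WHERE age <= ${maxAge=30,30|40|50}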

We will explore more advanced visualizations using Spark SQL and SparkR in Chapter 8, Using Spark SQL with SparkR. In the next section, we will explore the methods used to generate samples from our data.
