
Visualizing the types of data

Visualizing and communicating data is incredibly important, especially in young companies that are making data-driven decisions for the first time, or in companies where data scientists are seen as the people who help others make data-driven decisions. Communicating means describing your findings, or the way a technique works, to both technical and non-technical audiences. Different types of data call for different representations. For categorical values, the most suitable representations are these:

  • Bar charts
  • Pie charts
  • Pareto diagrams 
  • Frequency distribution tables

A bar chart visually represents the values stored in a frequency distribution table, with each bar representing one categorical value. A bar chart is also the basis for a Pareto diagram, which adds the relative and cumulative frequency of the categorical values:

Bar chart representing the relative and cumulative frequency for the categorical values

If we add the cumulative frequency to the bar chart, we get a Pareto diagram of the same data:

Pareto diagram representing the relative and cumulative frequency for the categorical values
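
The numbers behind a Pareto diagram can be derived with a short T-SQL query. The following is a minimal sketch, not the book's own script: the sample categories ('A', 'B', 'C') and the names observations, freq, relative_frequency, and cumulative_frequency are made up for illustration. It counts each category, divides by the total to get the relative frequency, and accumulates the counts from the most frequent category downward to get the cumulative frequency.

```sql
-- Hypothetical sample: one row per observation of a categorical value.
WITH observations AS (
    SELECT category
    FROM (VALUES ('A'), ('A'), ('A'), ('B'), ('B'), ('C')) AS v(category)
),
freq AS (
    -- Frequency distribution table: one row per category.
    SELECT category, COUNT(*) AS absolute_frequency
    FROM observations
    GROUP BY category
)
SELECT
    category,
    absolute_frequency,
    -- Share of the total, as a fraction between 0 and 1.
    CAST(absolute_frequency AS decimal(9, 4))
        / SUM(absolute_frequency) OVER () AS relative_frequency,
    -- Running share, accumulated from the most frequent category downward.
    CAST(SUM(absolute_frequency) OVER (
             ORDER BY absolute_frequency DESC, category
             ROWS UNBOUNDED PRECEDING) AS decimal(9, 4))
        / SUM(absolute_frequency) OVER () AS cumulative_frequency
FROM freq
ORDER BY absolute_frequency DESC, category;
```

The bars of the Pareto diagram come from absolute_frequency (or relative_frequency), and the line comes from cumulative_frequency, which reaches 1 on the last row. Adding category as a second sort key only keeps the order stable when two categories have the same count.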

Another very useful type of visualization for categorical data is the pie chart. A pie chart displays each categorical value as a percentage of the total. In statistics, this is called the relative frequency: the frequency of a category divided by the total frequency of all categories. This type of visual is commonly used to represent market share:

Pie chart representing the market share for Volkswagen
All the values are imaginary and are used for demonstration purposes only; these numbers don't represent the real market share of the different Volkswagen brands, either around the world or in any city.

For numeric data, the ideal start would be a frequency distribution table, which will contain ordered or unordered values. Numeric data is very frequently displayed with histograms or scatter plots. When using intervals, the rule of thumb is to use 5 to 20 intervals, to have a meaningful representation of the data.

Let's create a table with 20 discrete data points, which we'll display visually. To create the table, we can use the following T-SQL script:

CREATE TABLE [dbo].[dataset](
[datapoint] [int] NOT NULL
) ON [PRIMARY]

To insert new values into the table, let's use the script:

INSERT [dbo].[dataset] ([datapoint]) VALUES (7)
INSERT [dbo].[dataset] ([datapoint]) VALUES (28)
INSERT [dbo].[dataset] ([datapoint]) VALUES (50)
-- ...and so on, with more values, until the table holds 20 values in total

The table will include numbers in the range of 0 to 300, and the content of the table can be retrieved with this:

SELECT * FROM [dbo].[dataset]
ORDER BY datapoint

To visualize a dataset of discrete values, we'll need to build a histogram. The histogram will have six intervals, and the interval length can be calculated as (largest value - smallest value) / number of intervals. When we build the frequency distribution table and the intervals for the histogram, we'll end up with the following results:
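
The same frequency distribution can also be computed directly in T-SQL. The query below is a minimal sketch against the [dbo].[dataset] table created earlier; the variable @intervals and the names bounds, binned, interval_length, and interval_number are assumptions introduced for illustration. It applies the interval-length formula above and assigns every data point to one of six equal-width intervals.

```sql
-- Assumption: six equal-width intervals, as described in the text.
DECLARE @intervals int = 6;

WITH bounds AS (
    SELECT MIN(datapoint) AS min_value,
           MAX(datapoint) AS max_value,
           -- Interval length = (largest value - smallest value) / number of intervals
           (MAX(datapoint) - MIN(datapoint))
               / CAST(@intervals AS decimal(9, 2)) AS interval_length
    FROM [dbo].[dataset]
),
binned AS (
    SELECT CASE
               -- The largest value belongs to the last interval, not to a seventh one.
               WHEN d.datapoint = b.max_value THEN @intervals
               ELSE CAST(FLOOR((d.datapoint - b.min_value) / b.interval_length) AS int) + 1
           END AS interval_number
    FROM [dbo].[dataset] AS d
    CROSS JOIN bounds AS b
)
-- Absolute frequency per interval: the heights of the histogram bars.
SELECT interval_number, COUNT(*) AS absolute_frequency
FROM binned
GROUP BY interval_number
ORDER BY interval_number;
```

Intervals that contain no data points don't appear in the result, so a chart built from this output should treat missing interval numbers as zero-height bars. The sketch also assumes the table holds at least two distinct values; otherwise the interval length is zero.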

A histogram based on the absolute frequency of the discrete values looks like this:
