Visualizing the types of data

Visualizing and communicating data is incredibly important, especially at young companies that are making data-driven decisions for the first time, or at companies where data scientists are viewed as people who help others make data-driven decisions. Communicating means describing your findings, or the way techniques work, to both technical and non-technical audiences. Different types of data call for different representations. For categorical values, the most suitable representations are these:

Bar charts
Pie charts
Pareto diagrams
Frequency distribution tables

A bar chart visually represents the values stored in a frequency distribution table, with each bar representing one categorical value. A bar chart is also the basis for a Pareto diagram, which includes the relative and cumulative frequency for the categorical values:

Bar chart representing the relative and cumulative frequency for the categorical values

If we add the cumulative frequency to the bar chart, we get a Pareto diagram of the same data:

Pareto diagram representing the relative and cumulative frequency for the categorical values
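The numbers behind a bar chart and a Pareto diagram can be sketched in T-SQL. This is a minimal illustration, not the book's own script: the table variable @observations and its sample categories are hypothetical and exist only to make the example self-contained. The query builds a frequency distribution table and adds the relative and cumulative frequency, ordered from the most to the least frequent category, as a Pareto diagram usually is.

```sql
-- Hypothetical sample data: one row per observed categorical value.
DECLARE @observations TABLE ([category] varchar(20) NOT NULL);

INSERT INTO @observations ([category])
VALUES ('A'), ('A'), ('A'), ('A'), ('B'), ('B'), ('B'), ('C'), ('C'), ('D');

WITH frequency AS (
    -- Frequency distribution table: absolute frequency per category.
    SELECT [category], COUNT(*) AS absolute_frequency
    FROM @observations
    GROUP BY [category]
)
SELECT
    [category],
    absolute_frequency,
    -- Share of the total: the bar height in a relative-frequency bar chart.
    CAST(100.0 * absolute_frequency
         / SUM(absolute_frequency) OVER () AS decimal(5, 2)) AS relative_frequency_pct,
    -- Running total of the shares: the line added in a Pareto diagram.
    CAST(100.0 * SUM(absolute_frequency) OVER (
                     ORDER BY absolute_frequency DESC, [category]
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
         / SUM(absolute_frequency) OVER () AS decimal(5, 2)) AS cumulative_frequency_pct
FROM frequency
ORDER BY absolute_frequency DESC, [category];
```

With the sample rows above, the relative frequencies are 40, 30, 20, and 10 percent, and the cumulative frequency reaches 100 percent at the last category.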

Another very useful type of visualization for categorical data is the pie chart. Pie charts display each categorical value as a percentage of the total. In statistics, this is called the relative frequency: the frequency of a category divided by the total frequency of all categories. This type of visual is commonly used for market-share representations:

Pie chart representing the market share for Volkswagen

All the values are imaginary and are used just for demonstration purposes; these numbers don't represent the real market share of the different Volkswagen brands around the world, or in any city.

For numeric data, the ideal start would be a frequency distribution table, which will contain ordered or unordered values. Numeric data is very frequently displayed with histograms or scatter plots. When using intervals, the rule of thumb is to use 5 to 20 intervals, to have a meaningful representation of the data.

Let's create a table with 20 discrete data points, which we'll display visually. To create the table, we can use the following T-SQL script:

CREATE TABLE [dbo].[dataset](
 [datapoint] [int] NOT NULL
) ON [PRIMARY]

To insert new values into the table, let's use the script:

INSERT [dbo].[dataset] ([datapoint]) VALUES (7)
INSERT [dbo].[dataset] ([datapoint]) VALUES (28)
INSERT [dbo].[dataset] ([datapoint]) VALUES (50)
-- ...and so on, with more values, until the table holds 20 values in total

The table will include numbers in the range of 0 to 300, and the content of the table can be retrieved with this:

SELECT * FROM [dbo].[dataset]
ORDER BY datapoint

To visualize a dataset of discrete values, we'll need to build a histogram. The histogram will have six intervals, and the interval length can be calculated as (largest value - smallest value) / number of intervals. When we build the frequency distribution table and the intervals for the histogram, we'll end up with the following results:
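The interval length and the frequency distribution table described here could be computed directly in T-SQL along the following lines. This is a sketch under stated assumptions rather than the book's own script: it assumes [dbo].[dataset] is populated and holds at least two distinct values (otherwise the interval length is zero), and the names @intervals, bounds, and binned are introduced only for illustration. Intervals that contain no data points do not appear in the result.

```sql
DECLARE @intervals int = 6;  -- number of histogram intervals, as in the text

WITH bounds AS (
    SELECT MIN([datapoint]) AS min_value,
           MAX([datapoint]) AS max_value,
           -- interval length = (largest value - smallest value) / number of intervals
           (MAX([datapoint]) - MIN([datapoint])) * 1.0 / @intervals AS interval_length
    FROM [dbo].[dataset]
),
binned AS (
    SELECT b.min_value,
           b.interval_length,
           CASE
               -- The largest value sits on the upper edge, so it joins the last interval.
               WHEN d.[datapoint] = b.max_value THEN @intervals
               ELSE CAST(FLOOR((d.[datapoint] - b.min_value) / b.interval_length) AS int) + 1
           END AS interval_no
    FROM [dbo].[dataset] AS d
    CROSS JOIN bounds AS b
)
SELECT interval_no,
       CAST(min_value + (interval_no - 1) * interval_length AS decimal(10, 2)) AS interval_start,
       CAST(min_value + interval_no * interval_length AS decimal(10, 2)) AS interval_end,
       COUNT(*) AS absolute_frequency  -- the bar height in the histogram
FROM binned
GROUP BY interval_no, min_value, interval_length
ORDER BY interval_no;
```

Each interval includes its start and excludes its end, except the last one, which also includes the largest value.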

A histogram based on the absolute frequency of the discrete values will look like this:

Hands-On Data Science with SQL Server 2017
