BUSINESS ANALYTICS
Finding Relationships among Variables
Introduction
The primary interest in data analysis is usually in relationships between
variables.
◦ The most useful numerical summary measure is correlation.
◦ The most useful graph is a scatterplot.
◦ To break down a numerical variable by a categorical variable, it is useful to
create side-by-side box plots.
◦ Excel’s® pivot table breaks down one variable by others so that all sorts of
relationships can be uncovered very quickly.
The diagram in the file Data Analysis Taxonomy.xlsx (next slide too)
gives you the big picture of which analyses are appropriate for which
data types and which tools are best for performing the various analyses.


– –
Categorical variables
(Section 2.3)
Counts of categories
Charts of counts
Purpose of analysis
Describe individual
variables
(Chapter 2)
Numerical variables
(Section 2.4)
Time series
Find relationships
between variables
(Chapter 3)
Categorical vs
categorical
(Sections 3.2, 3.5)
Tables of joint counts
(cross tabs or pivot
tables)
Charts of joint counts
Categorical vs numerical
(Sections 3.3, 3.5)
Summary measures by
category
Side by side boxplots
Cross sectional data
Summary measures
(mean, median,
standard deviation,
quartiles, etc.)
Histograms, box plots
Time series charts for
patterns Trend lines
Pivot tables
Numerical vs numerical
(Sections 3.4, 3.5)
Scatterplots
Correlations (and
covariances)
Pivot tables
Trend lines (regression)
Relationships Among
Categorical Variables
The most meaningful way to examine relationships between two
categorical variables is with counts and corresponding charts of the
counts.
◦ You can find counts of the categories of either variable separately, as well as
counts of the joint categories of the two variables.
◦ Corresponding percentages of totals and charts help tell the story.
It is customary to display all such counts in a table called a crosstabs (for
crosstabulations). This is also sometimes called a contingency table.
Example 1:
Smoking Drinking.xlsx (slide 1 of 2)
Objective: To use a crosstabs to explore the
relationship between smoking and drinking.
Solution: Data set lists the smoking and
drinking habits of 8761 adults.
Categories have been coded “N,” “O,” “H,” “S,”
and “D” for “Non,” “Occasional,” “Heavy,”
“Smoker,” and “Drinker.”
Example 1:
Drinking and Smoking.xlsx (slide 2 of 2)
To make the crosstabs, first enter the category heads in Excel and then use the COUNTIFS function to populate the table with the counts of the joint categories in the table.
After that, add up the sums across rows and down columns to get the totals.
The counts should then be expressed as percentages of the row and percentages of the column.
Relationships between categorical variables and a numeric variable are examined.
Data analysis’s comparison challenge is one of the most difficult problems to solve in the field. If you want to compare a numerical measure across two or more subpopulations, you’ll run into this problem.
Men and women are divided into two groups, and the numerical measure is their salaries, for example.
The subpopulations correspond to different regions of the country, and the numerical measure represents the cost of living in each region.
◦ The subpopulations correspond to different days of the week, and the numerical measure is the number of consumers who visit a particular fast-food restaurant on each given day of the week.
Formats that are stacked and unstacked
There are two different data types to choose from: stacked and unstacked.
In cases where there are two “long” variables, such as Gender and Salary, the data is stacked. Essentially, the concept is that the incomes of men are layered on top of the salaries of women.
When it comes to the vast majority of circumstances, this is the format you will see.
◦ Data in unstacked format will occasionally be seen when there are two “short” variables, such as Male Salary and Female Salary, which are both present in the dataset.
StatTools is capable of dealing with either format and can convert from stacked to unstacked or vice versa as required by the user.
Data that has been stacked and unstacked
Data that has been stacked and data that has been unstacked
Extra.xlsx is an example of a baseball salary spreadsheet (slide 1 of 2)
Objective: This course will teach you how to use StatTools to break down baseball salary based on a variety of categorical characteristics.
Solution: In addition to the same 2011 baseball data studied previously, this data set contains a number of additional category factors.
One-Variable Summary is a summary measure that may be created by selecting it from the Summary Statistics dropdown list.
After that, select Stacked from the Format drop-down menu. Then select the Cat variable that you want to categorize by and the Val variable that you want to summarize by using the options provided.
Extra.xlsx is an example of a baseball salary spreadsheet (slide 2 of 2)
Using the BoxWhisker Plot option from the Summary Graphs dropdown list and filling in the subsequent dialog box, you may create boxplots side by side.
To choose a Cat variable and a Val variable, use the Stacked format from the Format drop-down menu.
Among Numerical Variables, There Are Relationships
A new style of chart, called a scatterplot, as well as two new summary measures, correlation and covariance, are employed in order to investigate correlations between numerical variables.
These measurements can be used to any variables that are represented numerically in the display.
However, they are only acceptable for variables that are truly numerical in nature, and not for category variables that have been numerically coded.
Scatterplots
In a scatterplot, each point represents the values of an observation for two specified variables, and the scatterplot is composed of a scatter of points.
Smoking Drinking.xlsx (slide 2 of 2)
To create the crosstabs, enter the
category headings in Excel and use the
COUNTIFS function to fill the table with
counts of joint categories.
Next, sum across rows and down
columns to get totals.
Then express the counts as
percentages of row and percentages of
column.
Relationships Among Categorical
Variables and a Numerical Variable
The comparison problem is one of the most important problems in data
analysis. It occurs whenever you want to compare a numerical measure
across two or more subpopulations.
◦ Examples:
◦ The subpopulations are males and females, and the numerical measure is salary.
◦ The subpopulations are different regions of the country, and the numerical measure is the cost
of living.
◦ The subpopulations are different days of the week, and the numerical measure is the number of
customers going to a particular fast-food chain.
Stacked and Unstacked
Formats
There are two possible data formats, stacked and unstacked.
◦ The data are stacked if there are two “long” variables, such as Gender and
Salary. The idea is that the male salaries are stacked in with the female
salaries.
◦ This is the format you will see in the vast majority of situations.
◦ You will occasionally see data in unstacked format, when there are two
“short” variables, such as Male Salary and Female Salary.
StatTools is capable of dealing with either format and can convert from
stacked to unstacked or vice versa.
Stacked and Unstacked Data
Stacked Data Unstacked Data
Example 2:
Baseball Salaries 2011 Extra.xlsx (slide 1 of 2)
Objective: To learn methods in StatTools for breaking down baseball
salaries by various categorical variables.
Solution: Data set contains the same 2011 baseball data examined
previously, as well as several extra categorical variables.
Create summary measures by selecting One-Variable Summary from
the Summary Statistics dropdown list.
Next, click the Format button and choose Stacked. Then choose the Cat
variable you want to categorize by and the Val variable you want to
summarize.
Example 2:
Baseball Salaries 2011 Extra.xlsx (slide 2 of 2)
Create side-by-side
boxplots, by selecting BoxWhisker Plot from the
Summary Graphs
dropdown list and filling in
the resulting dialog box.
Select the Stacked format
so that you can choose a
Cat variable and a Val
variable.
Relationships Among Numerical
Variables
To study relationships among numerical variables, a new type of chart,
called a scatterplot, and two new summary measures, correlation and
covariance, are used.
These measures can be applied to any variables that are displayed
numerically.
However, they are appropriate only for truly numerical variables, not for
categorical variables that have been coded numerically.
Scatterplots
A scatterplot is a scatter of points, where each point denotes the values
of an observation for two selected variables.
◦ It is a graphical method for detecting relationships between two numerical
variables.
◦ The two variables are often labeled generically as X and Y, so a scatterplot is
sometimes called an X-Y chart.
◦ The purpose of a scatterplot is to make a relationship (or the lack of it)
apparent.
Example 3:
GolfStats.xlsx (slide 1 of 2)
Objective: To use scatterplots to search for relationships in the golf data.
Solution: Data set includes an observation (stats) for each of the top 200
earners on the PGA Tour.
In StatTools, designate a StatTools data set for a particular year.
Next, select Scatterplot from the Summary Graphs dropdown list and then
select at least one X variable and at least one Y variable.
Example 3:
GolfStats.xlsx (slide 2 of 2)
Trend Lines in Scatterplots
Once you have a scatterplot, Excel enables you to superimpose one of
several trend lines on the scatterplot.
◦ A trend line is a line or curve that “fits” the scatter as well as possible.
◦ This could be a straight line, or it could be one of several types of curves.
To do this, right-click on any point in the chart, select Add Trendline, and
fill out the resulting dialog box.
Scatterplot with Trend Line
and Equation Superimposed
Correlation and Covariance
(slide 1 of 4)
Correlation and covariance measure the strength and direction of a
linear relationship between two numerical variables.
◦ The relationship is “strong” if the points in a scatterplot cluster tightly
around some straight line.
◦ If this straight line rises from left to right, the relationship is positive and the measures will be
positive numbers.
◦ If it falls from left to right, the relationship is negative and the measures will be negative
numbers.
◦ The two numerical variables must be “paired” variables.
◦ They must have the same number of observations, and the values for any observation should be
naturally paired.
Correlation and Covariance
(slide 2 of 4)
Covariance is essentially an average of products of deviations from
means.
Excel has a built-in COVAR function, and StatTools also calculates
covariances automatically.
Covariance has a serious limitation as a descriptive measure because it
is very sensitive to the units in which X and Y are measured.
Correlation and Covariance
(slide 3 of 4)
Correlation is a unitless quantity that is unaffected by the measurement
scale.
The correlation is always between -1 and +1.
◦ The closer it is to either of these two extremes, the closer the points in a
scatterplot are to a straight line.
Excel has a built-in CORREL function, and StatTools also calculates
correlations automatically.
Correlation and Covariance
(slide 4 of 4)
Three important points about scatterplots, correlations, and
covariances:
◦ A correlation is a single-number summary of a scatterplot. It never conveys
as much information as the full scatterplot.
◦ You are usually on the lookout for large correlations, those near -1 or +1.
◦ Do not even try to interpret covariances numerically except possibly to check
whether they are positive or negative. For interpretive purposes,
concentrate on correlations.
Example 3 (Continued)
GolfStats.xlsx (slide 1 of 2)
Objective: To use correlations to understand relationships in the golf data.
Solution: In StatTools, create a table of correlations by selecting Correlation
and Covariance from the Summary Statistics dropdown list.
Fill in the resulting dialog box and check Correlations.
Example 3 (Continued)
GolfStats.xlsx (slide 2 of 2)
You can learn more about a correlation by creating the corresponding
scatterplot.
Pivot Tables
The pivot table is an Excel tool that allows you to break data down by
categories.
Sometimes pivot tables are used to display tables of counts, often called
crosstabs or contingency tables.
However, crosstabs typically list only counts, whereas pivot tables can
list counts, sums, averages, and other summary measures.
Example 4:
Elecmart Sales.xlsx (slide 1 of 2)
Objective: To use pivot tables to break down the customer order data by a
number of categorical variables.
Solution: Data set contains data on 400 customer orders during several
months for Elecmart company.
Create a pivot table by clicking the PivotTable button on the Insert ribbon.
Example 4:
Elecmart Sales.xlsx (slide 2 of 2)
Hiding Categories (Filtering)
You can filter out any items in a pivot table that you don’t want to see.
◦ Click the Row Labels dropdown arrow of the active field and check the items you
want to filter on.
◦ A pivot table with hidden categories is shown below.
Sorting on Values or
Categories
It is easy to sort in a pivot table, either by the numbers in the Values
area or by the labels in a Rows or Columns field.
◦ To sort by the numbers in the Values area, right-click any number and
select Sort.
◦ To sort on the labels of a Rows or Columns field, right-click any of the
categories and select Sort.
◦ You can also click the dropdown arrow for the field and get the
dialog box that allows both sorting and filtering.
Changing Locations of Fields
(Pivoting)
You can choose where to place variables in a pivot table.
◦ For example, to place the Region variable in the Columns area, drag the
Region button from the Rows area of the PivotTable Fields pane to the
Columns area.
Changing Field Settings
You can change various settings in the Field Settings dialog box.
◦ To get to this dialog box:
◦ Click the Field Setting button on the Analyze/Options ribbon.
◦ OR right-click any of the pivot table cells and select the Field Settings item.
◦ The pivot table with Value Field Settings changed to Average is shown below.
Pivot Charts
It is easy to accompany pivot tables with pivot charts.
◦ These charts adapt automatically to the underlying pivot table.
◦ To create a pivot chart, click anywhere inside the pivot table, select the PivotChart
button on the Analyze/Options ribbon, and select a chart type.
Multiple Variables in the
Values Area
More than a single variable can be placed in the Values area.
Also, a given variable in the Values area can be summarized by more than one
summarizing function.
Summarizing by Count
The variable in the Values area can be summarized by the Count function.
◦ This is useful when you want to know, for example, how many of the orders were
placed by females in the South.
◦ Right-click any number in the pivot table, select Value Field Settings, and select the
Count function.
Grouping
Categories in a Rows or Columns variable can be grouped.
Suppose you want to summarize Sum of Total Cost by Date.
◦ Starting with a blank pivot table, check both Date and Total Cost in the PivotTable
Fields pane.
◦ Then right-click any date and select Group.
Other Pivot Table Features
Showing/hiding subtotals and grand totals (check the Layout options on the
Design ribbon)
Dealing with blank rows, that is, categories with no data (right-click any
number, choose PivotTable Options, and check the options on the Layout &
Format tab)
Displaying the data behind a given number in a pivot table (double-click any
number in the Values area to get a new worksheet)
Formatting a pivot table with various styles (check the style options on the
Design ribbon)
Moving or renaming pivot tables (check the PivotTable and Action groups on
the Analyze/Options ribbon)
Refreshing pivot tables as the underlying data changes (check the Refresh
dropdown list on the Analyze/Options ribbon)
Creating pivot table formulas for calculated fields or calculated items (check
the Formulas dropdown list on the Analyze/Options ribbon)
Example 5:
Lasagna Triers.xlsx (slide 1 of 2)
Objective: To use pivot tables to explore which demographic variables help to
distinguish lasagna triers from nontriers.
Solution: Data set contains data on over 800 potential customers being
tracked by a frozen lasagna company.
Set up a pivot table that shows counts of triers and nontriers for different
categories of the variables.
Example 5:
Lasagna Triers.xlsx (slide 2 of 2)
Pivot Table and Pivot Chart for Examining the Effect of Gender
Slicers and Timelines
In Excel 2010, Microsoft added slicers—lists of the distinct values of any
variable, which you can then filter on.
◦ You add a slicer from the Analyze/Options ribbon under PivotTable
Tools.
In Excel 2013, a Timeline feature was added. A Timeline is like a slicer,
but it is specifically for filtering on a date variable.
Pivot Table with Slicers and a
Timeline

Published by
Write
View all posts