FoSSA: Fundamentals of Statistical Software & Analysis

Learning Outcomes

By the end of this session, students will be able to:

  • Continue practicing basic software commands
  • Learn how to explore the dataset, identifying the different types of variable stored
  • Calculate the different measures of location and spread
  • Plot frequency distributions and histograms

You can download a copy of the slides here: 1.3: Descriptive Statistics

Video A1.3 – Data Distributions and Descriptive Statistics (20 minutes)

A1.3a Practical: Stata – Summarising different types of variables

Remembering the skills you learned in the last section, please do the following:

  • Set your working directory using the command ‘cd <filepath>’
  • Open your dataset in Stata
  • Look at your dataset using ‘browse’
  • Get to know your variables using the commands ‘describe’, ‘codebook’, or ‘inspect’

Question 1.3a.i: After completing the steps above, can you classify the following variables as Categorical, Binary, Ordinal, or Continuous?

    • Prior CVD
    • SBP
    • Frailty 
    • Death 

Question 1.3a.ii: What are the measures of location (mean, median and mode) and the measures of spread (range, interquartile range and standard deviation) for the cholesterol levels of participants in the dataset (use the overall cholesterol ‘chol’ variable)?

A1.3a. Answers

Answer A1.3a.i: Prior CVD is a binary variable as it only has two values: 0 and 1; the same is true of the ‘death’ variable. SBP is a continuous variable. Frailty is an ordinal variable, with categories ascending from 1 to 5. We do not have any purely ‘categorical’ (unordered) variables in this dataset, such as ‘ethnicity’. All the grouped variables in this dataset (like age_grp and bmi_grp4) have an order to their categories, so they are ‘ordinal’.

Answer A1.3a.ii:

Mean = 5.51

Median = 5.47

Mode = 5.07

Standard deviation = 1.01

Range  = 8.53

Interquartile range = 1.30

Answers above produced using code below:

tab chol, sort

tabstat chol, stat(n mean median sd iqr range)

 

A1.3b PRACTICAL: Stata – Bar charts and histograms

Bar charts

Bar charts are a useful way of comparing groups by a particular characteristic. We can tell Stata what summary statistic we wish to include in the bar chart, for example, the frequency within each category of a variable, or the mean of one variable within each level of another categorical variable.

For categorical variables, it can be useful to look at frequencies within each level. To do this, we use the ‘graph bar’ command and include ‘(count),’ followed by ‘over ([variable name])’. The following code will present a bar chart comparing the frequencies within each age group category:

graph bar (count), over (age_grp)

To look at percentages within each category:

graph bar (percent), over (age_grp)

  • Explore the above command with some variables within your dataset.

We can also look at summary statistics of a continuous variable within each level of another, categorical variable. For example, the following code will produce a bar chart that presents the mean of vitamin D serum levels within each age group category:

graph bar vitd, over (age_grp)

To present the median of vitamin D by age group, you simply include (median) after the command ‘graph bar’:

graph bar (median) vitd, over (age_grp)

It is also possible to present the bar chart with multiple categorical variables. The following code will produce a bar chart presenting the mean vitamin D by age groups and history of cardiovascular disease.

graph bar vitd, over (age_grp) over (prior_cvd)

To add a title to the y-axis, we can use the following code:

graph bar (mean) vitd, over(age_grp) ytitle(Mean vitamin D concentration)

To remove labels or change the size, you can use the following code:

graph bar (mean) vitd, over(age_grp) ytitle(Mean vitamin D concentration) ylabel(, nolabels)

graph bar (mean) vitd, over(age_grp) ytitle(Mean vitamin D concentration) ylabel(, labels labsize(small))

If comparing multiple variables on one chart, it can be useful to change the colour of the bars. To do this, add the option ‘bar(1, fcolor([insert colour]))’:

graph bar (mean) vitd, bar (1,fcolor(black)) ytitle(Mean vitamin D concentration)

  • Try producing a number of different bar charts and play around with changing different features.

Histograms

When you want to look at the distribution of a variable, rather than comparing characteristics, you can use a histogram. A histogram can be produced for a continuous or categorical variable, as long as it is measured on an interval scale. Type ‘histogram [variable name]’.

histogram sbp

If the variable is not continuous, type ‘, discrete’ afterwards:

histogram bmi, discrete

A histogram is often used to check whether a variable is normally distributed. To add a normal distribution curve to the histogram, use the following code:

histogram bmi, discrete normal

To adjust the number of bins, include ‘, bin ([number of bins])’

histogram sbp, bin (20)

histogram sbp, bin (10)

You can also add a title and labels to the x-axis:

histogram bmi, discrete normal title("Body Mass Index")

histogram bmi_grp4, discrete normal title("Body Mass Index") xlabel(1 "Underweight" 2 "Normal weight" 3 "Overweight" 4 "Obese")

It is also possible to show the percentage or frequency on a histogram. To do this, amend the code at the end of the histogram command.

histogram bmi, discrete percent

histogram bmi, discrete frequency

Grouping continuous data

There are different ways you can group continuous data to create a categorical variable. To do this, first generate a duplicate variable, so that you are not altering the original.

1. ‘xtile’

If you want to create a new variable with percentiles, the ‘xtile’ command is useful. For example, if you wish to produce deciles of systolic blood pressure:

xtile sbp10=sbp, nquantiles(10)

Or quartiles of systolic blood pressure:

xtile sbp4=sbp, nquantiles(4)

2. ‘cut’

If you want to create a variable with specific categories, you can use the ‘egen’ command with the ‘cut()’ function. The code below is an example of creating a new categorical systolic blood pressure variable. The new variable categories are <90 = low sbp; 90-<120 = normal sbp; 120-<130 = elevated sbp; ≥130 = high sbp.

egen sbp_cat=cut(sbp), at(0,90, 120, 130, 231)

Note that the maximum systolic blood pressure recorded in this population is 230; the final cut-off of 231 therefore ensures that all recorded values are included.

3. ‘recode’

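The ‘recode()’ function works in a similar way to ‘cut()’ above, with two differences: each listed value is the upper bound of a category and is included in that category (so 90 means sbp ≤ 90), and the new variable stores that upper bound as the category value. Each example in this section creates a variable called sbp_cat, so drop the earlier version (‘drop sbp_cat’) or choose a new name before running the next one.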

gen sbp_cat=recode(sbp, 90, 120, 130, 231)

4. ‘autocode’

The ‘autocode()’ function creates evenly spaced categories of a continuous variable:

gen [new var name]=autocode([original var name], [number of categories], [minimum], [maximum])

To create a new systolic blood pressure categorical variable, with 4 evenly spaced categories between 0 and 230:

gen sbp_cat=autocode(sbp, 4, 0, 230)

You can use the tab and tabstat commands to check that your new categorical variables include the correct categories. Use the ‘label’ commands to label the variable and the categories in your new variable.

  • Questions A1.3b:
  1. Which type of variable can you plot with a bar chart? When should you use a histogram?
  2. Plot a histogram of total cholesterol and describe the distribution.
  3. Can you change the number of bins used to plot the histogram? What is the effect of changing the number of bins?
  4. Split total cholesterol into groups and make a bar chart of the number of participants in each cholesterol group. Can you give this graph a title? Can you label the y axis and change the colour of the bars in the chart?
Answers

Answer A1.3b.i:

A bar chart can be used to compare the frequency and percentage of participants within each level of a categorical variable. Bar charts can also be used to look at summary statistics of continuous variables, but only within levels of a categorical variable.

Histograms should be used to look at the distribution of data.

Answer A1.3b.ii:

histogram chol

(normally distributed)

Answer A1.3b.iii:

histogram chol

With too few bins it becomes difficult to identify the distribution of the data

histogram chol, bin(3)

Answer A1.3b.iv:

tab chol

* irecode() returns 0 if chol<=5, 1 if 5<chol<=7.5, and 2 if chol>7.5 (value labels need integer codes)

gen chol_cat=irecode(chol, 5, 7.5)

label var chol_cat "Categories of cholesterol"

label define chol_cat 0 "Normal" 1 "High" 2 "Very high"

label values chol_cat chol_cat

tab chol_cat

graph bar (count), over(chol_cat) bar(1, fcolor(black)) ytitle(Frequency)

A1.3a PRACTICAL: R – Summarising different types of variables

There are many different ways to generate basic descriptive statistics for your dataset in R, some of which have already been covered.

Here, we will cover several basic functions. Remember that there is always more information to be found online, and many good resources exist only a Google search away!

Summary

The summary function can be used to generate the minimum, 1st quartile, median, mean, 3rd quartile, maximum, and number of NA datapoints for all numeric variables within a dataset. See the following:

summary(object.name)

summary(whitehall.data)

Note that this produces summary statistics for the entire dataset. To focus in on a specific column variable, we can use the $ operator as previously:

summary(whitehall.data$bmi)

It is useful here to check that your minimum, maximum and mean are reasonable values. Try to generate some summary statistics for systolic blood pressure (SBP). Are these figures reasonable?

You can also obtain summary statistics that are differentiated by another variable, such as ‘age group’, using the ‘by’ function, as follows:

by(object.name, object.name$variable, summary)

by(whitehall.data, whitehall.data$age_grp, summary)

Does this generate any interesting information, or do any differences become immediately apparent?
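If you only want one statistic per group rather than the full summary, base R offers other routes. The sketch below uses a small made-up data frame called toy (a stand-in for whitehall.data, not the real data) and shows tapply() and aggregate() as possible alternatives to by():

```r
# Toy data frame standing in for whitehall.data (values are made up)
toy <- data.frame(
  age_grp = c(1, 1, 2, 2, 3, 3),
  sbp     = c(118, 126, 131, NA, 140, 152)
)

# Mean SBP within each age group, ignoring missing values
mean_by_group <- tapply(toy$sbp, toy$age_grp, mean, na.rm = TRUE)
print(mean_by_group)

# aggregate() gives the same information as a data frame;
# the formula interface drops rows with missing values by default
agg <- aggregate(sbp ~ age_grp, data = toy, FUN = mean)
print(agg)
```

The same pattern should carry over to the course data by swapping toy for whitehall.data, assuming the column names match.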

Measures of Central Tendency

a) Mean

To calculate the mean, we use the following function.

mean(object.name$variable)

mean(whitehall.data$sbp)

The above function may seem complete, but we know from our earlier investigation that there are some NA values within this dataset, and mean() returns NA if any value is missing. We must therefore tell R to calculate the mean without these values, using na.rm = TRUE. You can run both versions to see the difference.

mean(object.name$variable, na.rm = TRUE)

mean(whitehall.data$sbp, na.rm = TRUE)

b) Median

We can compute median using a similar function:

median(object.name$variable, na.rm = TRUE)

median(whitehall.data$sbp, na.rm=TRUE)

c) Mode

install.packages("DescTools")

library(DescTools)

Mode(x, na.rm = FALSE)

*If there are missing values, you can set “na.rm = TRUE”

Mode(whitehall.data$bmi_grp4, na.rm=TRUE)

[1] 3

attr(,"freq")

[1] 2091

The most common category of BMI group is ‘3’, with N=2091.
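If the DescTools package is not available on your machine, the mode can also be found with base R alone. The following is a minimal sketch on a made-up vector (bmi_grp is a hypothetical name, not a column from the course data); it counts each value with table() and keeps the most frequent one:

```r
# Made-up BMI group codes; NA marks a missing value
bmi_grp <- c(2, 3, 3, 1, 3, 2, NA, 4, 3)

# table() drops NA by default and counts each distinct value
counts <- table(bmi_grp)
counts_vec <- as.vector(counts)

# Keep the value(s) with the highest count (there can be ties)
mode_value <- as.numeric(names(counts)[counts_vec == max(counts_vec)])
mode_freq  <- max(counts_vec)

print(mode_value)
print(mode_freq)
```

If several values tie for the highest count, mode_value will contain all of them.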

d) Quantiles

The quantile function can be used to calculate the median, and first and third quartiles, by using a second argument to define the percentage range ‘probs’ as a decimal figure between 0 and 1.

Remember that the first quartile = 25% of values, and the third = 75%. In this way, we can also use the quantile function to compute any percentile cut-off for our data.

quantile(object.name$variable, probs = , na.rm=TRUE)

quantile(whitehall.data$sbp, probs=0.25, na.rm=TRUE)
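The probs argument also accepts a vector, so several cut-offs can be requested in one call. A short sketch on made-up values (the vector sbp here is illustrative only):

```r
# Made-up systolic blood pressure values with one missing entry
sbp <- c(110, 120, 125, 130, 135, 140, 150, NA)

# Request the first quartile, median and third quartile together
q <- quantile(sbp, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
print(q)

# The result is a named vector, so individual quantiles can be picked out
iqr_from_quantiles <- q[["75%"]] - q[["25%"]]
print(iqr_from_quantiles)
```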

e) IQR

To calculate IQR, we can simply use the following:

IQR(object.name$variable, na.rm = TRUE)

IQR(whitehall.data$sbp, na.rm = TRUE)

See if you can figure out how to calculate IQR using the quantile function!

quantile(whitehall.data$sbp, probs=0.75, na.rm=TRUE) - quantile(whitehall.data$sbp, probs=0.25, na.rm=TRUE)

f) Range

To calculate the range, we simply subtract the minimum value from the maximum value, as shown:

max(object.name$variable, na.rm = TRUE) - min(object.name$variable, na.rm = TRUE)

max(whitehall.data$sbp, na.rm = TRUE) - min(whitehall.data$sbp, na.rm = TRUE)
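Base R also has a range() function, which returns the minimum and maximum together; wrapping it in diff() gives the width in one step. A small sketch on made-up cholesterol values (chol here is an illustrative vector, not the course variable):

```r
# Made-up cholesterol values with one missing entry
chol <- c(4.2, 5.5, 6.1, NA, 7.3)

# Without na.rm = TRUE a single NA makes the result NA
print(max(chol))

# range() returns c(minimum, maximum); diff() subtracts the first from the second
limits <- range(chol, na.rm = TRUE)
spread <- diff(limits)
print(limits)
print(spread)
```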

g) SD and Variance

To calculate standard deviation and variance, you can use the following simple functions:

sd(object.name$variable)

var(object.name$variable)
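To see what these two functions compute, the sketch below reproduces them by hand on a made-up vector x. Both sd() and var() use the sample formula, dividing by n - 1, and the standard deviation is the square root of the variance:

```r
# Made-up values for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

# Compare with the built-in functions
print(c(manual_var, var(x)))
print(c(manual_sd, sd(x)))
```

As with mean(), add na.rm = TRUE to sd() and var() when the variable contains missing values.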

  • Question A1.3a.i: After completing the steps above, can you classify the following variables as Categorical, Binary, Ordinal, Continuous? 
    • Prior CVD
    • SBP
    • Frailty
    • Death 
  • Question A1.3a.ii: What are the measures of central tendency (mean, median and mode) and the measures of spread (range, interquartile range and standard deviation) for the measured cholesterol of participants in our dataset  (using the chol variable)? 
Answer

Answer A1.3a.i:

  • Prior CVD: binary
  • SBP: continuous
  • Frailty: ordinal
  • Death: binary

To answer this question, we can draw on what we learned about the structure of the data in the earlier practical exercise, A1.2b.

Answer A1.3a.ii: 

To answer this question, we need to apply the knowledge we gained in exercise A1.3a.

For the measures of central tendency:

1) Mean

As earlier described, calculating mean and median is relatively simple. We just have to make sure to tell R how to treat missing values with our na.rm argument.

mean(whitehall.data$chol, na.rm=TRUE)

[1] 5.510199

2) Median

median(whitehall.data$chol, na.rm=TRUE)

[1] 5.47

3) Mode

Mode(whitehall.data$chol, na.rm = TRUE)

[1] 5.07

attr(,"freq")

[1] 35

4) Range

As above, to calculate the range we will subtract the minimum value from the maximum value.

max(whitehall.data$chol, na.rm=TRUE) - min(whitehall.data$chol, na.rm=TRUE)

[1] 8.53

5) IQR

As above, there are two ways to calculate IQR. Either method will give you the same answer, as below.

IQR(whitehall.data$chol, na.rm=TRUE)

quantile(whitehall.data$chol, probs=0.75, na.rm=TRUE) - quantile(whitehall.data$chol, probs=0.25, na.rm=TRUE)

[1] 1.3

6) SD

Calculating standard deviation can again be performed with a simple function.

sd(whitehall.data$chol, na.rm=TRUE)

[1] 1.00712

A1.3b PRACTICAL: R – Bar charts and histograms

R offers several powerful tools to make easily customisable visualisations of data.

Bar Charts

To make a simple bar plot, we can use the barplot() function, and then add arguments to represent the data in different ways. The first step is to install several useful packages which will assist with our graphing, and then to load them with library(). These open-source packages contain several additional tools that we can use.

install.packages("tidyverse")

install.packages("RColorBrewer")

library(tidyverse)

library(RColorBrewer)

We then generate a table, which we will assign as the object ‘bmi.counts’. It is this object we ask R to generate a barplot for.

bmi.counts <- table (whitehall.data$bmi)

Within the barplot function, we can label the x and y axes using ‘xlab=’ and ‘ylab=’, title the graph using ‘main=’, and change the colour of the bars using ‘col=’. Here, heat.colors(20) generates a vector of 20 contiguous colours; this function is part of base R, while the RColorBrewer package offers further palettes.

barplot(bmi.counts, xlab = "BMI", ylab = "Number of People", main = "BMI of Whitehall Participants", col=heat.colors(20))

There is also opportunity to create stacked bar graphs in R. We can change the orientation by using the argument ‘horiz = TRUE’

Histograms

We can make a simple histogram using the ‘hist()’ function in R. This function takes in a vector of values for which the histogram will be plotted. As with before, we will modify the histogram to add axis labels and a title.

hist(whitehall.data$bmi, xlab = "BMI", ylab = "Number of People", main = "BMI of Whitehall Participants", col=heat.colors(12), density=100)

Grouping Continuous Data

Grouping continuous data is simply an extension of the skills we used to earlier create our binary variable. There are many reasons why you would want to categorise a continuous variable, and cut-offs need to be defined for different categories. BMI is a skewed variable and different parts of the distribution of BMI may have different relationships with diseases that we investigate.

Our first step is to create a new, empty column, which we will title ‘bmi.grouped’:

whitehall.data$bmi.grouped <- NA

Our subsequent code will act to populate this new column we’ve created in our data frame.

Our goal is to get R to assign values to this new column, which will be drawn from the existing BMI data recorded for those rows. Remember that each row is a collection (vector) of data that represents a different individual.

We will code a BMI of less than 18.5 as 0, 18.5 to less than 25 as 1, 25 to less than 30 as 2, and 30 or greater as 3. This can be achieved with the simple operators below:

whitehall.data$bmi.grouped [whitehall.data$bmi<18.5]<-0

whitehall.data$bmi.grouped [whitehall.data$bmi >=18.5 & whitehall.data$bmi<25]<-1

whitehall.data$bmi.grouped [whitehall.data$bmi >=25 & whitehall.data$bmi <30]<-2

whitehall.data$bmi.grouped [whitehall.data$bmi >=30]<-3

It is important to also check that this variable is in the right format, using class().

class(whitehall.data$bmi.grouped)

[1] "numeric"

R views our new variable as ‘numeric’. Clearly, however, it is an ordered categorical variable, so we have to ensure that R treats it as a factor. While doing this, we can also label our groups using the labels = argument of factor(), remembering to put each category title in quotation marks. This can be completed as follows:

whitehall.data$bmi.grouped <- factor(whitehall.data$bmi.grouped, labels=c("<18.5", "18.5-24.9", "25-29.9", ">=30"))
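The grouping and labelling can also be done in a single step with base R's cut() function. This is an alternative sketch rather than the course's method, shown on a made-up vector bmi; right = FALSE makes each interval include its lower bound and exclude its upper bound, matching the cut-offs described above:

```r
# Made-up BMI values, including boundary cases and a missing value
bmi <- c(17.0, 18.5, 22.3, 25.0, 29.9, 30.0, 41.2, NA)

# Intervals are [-Inf, 18.5), [18.5, 25), [25, 30), [30, Inf)
bmi_grouped <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("<18.5", "18.5-24.9", "25-29.9", ">=30"),
  right  = FALSE
)

# cut() returns a factor directly; missing BMI stays missing
print(class(bmi_grouped))
print(table(bmi_grouped, useNA = "ifany"))
```

Tabulating the result against the original values is a quick way to confirm that boundary values such as 18.5 and 30 land in the intended group.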

  • Question A1.3b.i: Which type of variable can you plot with a bar chart? When should you use a histogram?
  • Question A1.3b.ii: Plot the bar chart that counts the number of participants in each BMI group and save it. Can you give this graph a title? Can you label the y axis and change the colour of the bars in the chart?
  • Question A1.3b.iii: Plot a histogram of SBP and describe the distribution.
  • Question A1.3b.iv: Regroup SBP in a different way, and decide which grouping best represents the data.
  • Question A1.3b.v: Can you change the number of bins used to plot the histogram of SBP? What is the effect of changing the number of bins?
Question A1.3b Answers

Answer A1.3b.i :

Categorical variables (such as our newly created BMI categories) can be visually represented using a bar graph.

Histograms should be employed for continuous numerical variables. In our current dataset, variables that could be appropriately represented using a histogram include systolic blood pressure, blood cholesterol, and LDL levels.

Answer A1.3b.ii:

To generate this bar graph, we will follow the same process as we did before we had grouped BMI into the new variable. Note that we’ve changed the number of colours on our contiguous spectrum to 4, to reflect the new number of categories.

bmi.grouped.graph <- table (whitehall.data$bmi.grouped)

barplot(bmi.grouped.graph, horiz = TRUE, xlab = "Number of People", ylab = "BMI", main = "BMI of Whitehall Participants", col=heat.colors(4))

Answer A1.3b.iii:

As above, the code for this histogram is as follows:

hist(whitehall.data$sbp, xlab = "SBP", ylab = "Number of People", main = "Histogram of SBP of Whitehall Participants", col=heat.colors(12), density=100)

There appears to be a small right skew (positive skew) to this distribution but it is approximately normally distributed.

Answer A1.3b.iv:

We will group SBP according to the American College of Cardiology/American Heart Association guidelines for hypertension management, which class systolic blood pressure as normotensive (<120), elevated (120 to <130), stage 1 hypertension (130 to <140) or stage 2 hypertension (≥140). The process for coding this new variable is nearly identical to the BMI categorisation. Have a go yourself!

whitehall.data$sbp.grouped<- NA

whitehall.data$sbp.grouped [whitehall.data$sbp<120]<-1

whitehall.data$sbp.grouped [whitehall.data$sbp >=120 & whitehall.data$sbp <130]<-2

whitehall.data$sbp.grouped [whitehall.data$sbp >=130 & whitehall.data$sbp <140]<-3

whitehall.data$sbp.grouped [whitehall.data$sbp >=140]<-4
class(whitehall.data$sbp.grouped)

whitehall.data$sbp.grouped <- factor(whitehall.data$sbp.grouped, labels=c("Normotensive", "Elevated", "Stage 1 Hypertension", "Stage 2 Hypertension"))

Answer A1.3b.v:

Normally, R automatically chooses the bins of the histogram. The default bins may not, however, give an appropriate or sufficient visualisation of the data. We can change the number of bins by adding the breaks = argument to our hist() call. Note that R treats a single number given to breaks as a suggestion, so the number of bins drawn may differ slightly.

hist(whitehall.data$sbp, xlab = "SBP", ylab = "Number of People", main = "Histogram of SBP of Whitehall Participants", col=heat.colors(12), density=100, breaks = 1000)

If we try two extremes, we end up with two quite different graphs. With many narrow bins, as above, the distribution and skew are easier to assess.

We could say that this appears to be a more ‘normal’ distribution, even though the underlying data haven’t changed!

hist(whitehall.data$sbp, xlab = "SBP", ylab = "Number of People", main = "Histogram of SBP of Whitehall Participants", col=heat.colors(12), density=100, breaks = 10)

When represented with larger bins, however, the distribution does not seem to be quite as normally distributed. Some of the information at lower levels of SBP and in the far right tail is obscured.
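The effect of the bin count can also be inspected numerically. The sketch below uses simulated values (rnorm() with an arbitrary mean and standard deviation, not the Whitehall data) and calls hist() with plot = FALSE, which returns the bin boundaries and counts without drawing anything:

```r
# Simulated SBP-like values; the mean and sd are arbitrary choices for illustration
set.seed(1)
sbp <- rnorm(500, mean = 130, sd = 17)

# plot = FALSE returns the histogram object instead of drawing it
few  <- hist(sbp, breaks = 5,  plot = FALSE)
many <- hist(sbp, breaks = 50, plot = FALSE)

# Number of bins actually used (breaks is treated as a suggestion)
print(length(few$counts))
print(length(many$counts))

# Every observation is counted exactly once whichever binning is used
print(sum(few$counts))
print(sum(many$counts))
```

Comparing few$counts with many$counts shows how wide bins merge neighbouring values and can hide detail in the tails.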

A1.3a PRACTICAL: SPSS – Summarising different types of variables

Open the FoSSA Whitehall data set in SPSS.

Check that you have set up and classified all of your variables correctly in the previous practical.

We are now going to run some descriptive statistics on one of the variables.

In SPSS you can run a whole range of descriptives at once using the ‘Explore’ function.

Go to the menu bar at the top of the window and select

Analyze >> Descriptive Statistics >> Explore

When the Explore window opens, move the variable you are interested in into the ‘Dependent List’ box by selecting it and then clicking on the blue arrow next to the box.

At the bottom of the box in the ‘Display’ section, select ‘Statistics’, as we do not want to explore plots of the data at this stage.

Click OK to run the analysis.

Use this method to find the measures of central tendency (mean and median) and the measures of spread (range, interquartile range and standard deviation) for the cholesterol levels of participants in the dataset (use the overall cholesterol ‘chol’ variable).

Answer

Once you have run the analysis the answer tables pop up in a separate output window. You will see something that looks like this.

From this you can extract the descriptive statistics you were asked for in the question

Mean = 5.51

Median = 5.47

Standard deviation = 1.01

Range = 8.53

Interquartile range = 1.30

There are other options under the Descriptive Statistics heading from the Analyze menu which you can investigate to see what other statistics you can run on your data.

A1.3b PRACTICAL: SPSS – Bar charts and histograms

Bar Charts

Bar charts are a useful way of comparing groups by a particular characteristic. The most straightforward way to do this in SPSS is to use the Chart Builder function.

We can tell SPSS what summary statistic we wish to include in the bar chart, for example, the frequency within each category of a variable, or the mean of one variable within each level of another categorical variable.

Select

Graphs >> Chart Builder

A warning on ‘Define Variable Properties’ will pop up. You should have properly categorised all of your variables within the previous sections. So you can just press ‘OK’ to move on to creating a chart.

You will then see the ‘Chart Builder’ window open. Select ‘Bar’ from the ‘Gallery’ menu on the bottom left, then drag and drop the type of bar chart you are interested in into your previews window in the centre.

In this example we will select ‘Simple Bar’.

You can drag and drop your variables into the chart. Remember we want a categorical variable on the x axis, to define the groups, and then a continuous variable on the y axis, to define the height of the bar.

The ‘Element Properties’ tab on the right-hand side allows you to edit the elements of the chart. Once you select a simple bar chart the default selection is ‘Bar1’

The ‘Statistics’ box allows you to define which statistic to display. ‘Count’ will display the number (frequency) of individuals in a category and does not require a continuous variable on the y axis. ‘Mean’ will show the mean of each category as the height of the bar, ‘Median’ will show the median of each category, and so on. You can also select the box which says ‘Display error bars’ and decide what you would like these error bars to show.

You can also select to edit the properties of the x and y axes. This allows you to customise scales or write axis labels at this stage.

If you click on ‘Chart Appearance’, you can define the colours of your bars at this stage as well.

If you forget to do any of these formatting steps in the set-up, don’t worry. SPSS allows you to edit your graph in the output window as well. All you need to do is double click on it and a new editing window will open.

Histograms

When you want to look at the distribution of a variable, rather than comparing characteristics, you can use a histogram. A histogram can be produced for a continuous or categorical variable, as long as it is measured on an interval scale.

In SPSS you just need to open the Chart Builder, select ‘Histogram’ from the menu on the bottom left and then drag and drop the ‘Simple Histogram’ into the preview window.

Once you have done this you will see the option in the element properties tab to add a normal curve, which can help with assessing the distribution of the data. You can also click on ‘Set Parameters’ which allows you to change the number or range of the bins (groups) in your histogram.

Grouping continuous data

You can group continuous data to create a new categorical variable in SPSS in the following way

Select

Transform >> Recode into Different Variables

Move your existing continuous variable into the centre box, then name your new variable in the Output Variable section on the right, then press ‘Change’.

Then select ‘Old and New Values’. In the Old Value section use the ‘range’ functions to define the range of values you want to code into each category. Then select a number for your category in the New Value section. Then press ‘Add’ and your change will appear in the Old --> New box on the right.

Once you have set up all of your recoded variables press continue and then OK, then your new variable will appear in your data and variable views. You can now allocate a measurement type and labels for each of the values (low, medium, high for example) in the variable view as you did when you initially set up your data.

Now try to answer the following questions

  1. Which type of variable can you plot with a bar chart? When should you use a histogram?
  2. Plot a histogram of total cholesterol and describe the distribution.
  3. Can you change the number of bins used to plot the histogram? What is the effect of changing the number of bins?
  4. Split total cholesterol into groups and make a bar chart of the number of participants in each cholesterol group. Can you give this graph a title? Can you label the y axis and change the colour of the bars in the chart?
Answer
  1. A bar chart can be used to compare the frequency and percentage of participants within each level of a categorical variable. Bar charts can also be used to look at summary statistics of continuous variables, but only within levels of a categorical variable. Histograms should be used to look at the distribution of data.

  2. The histogram indicates that the data are normally distributed. Note how the bars approximately follow the normal curve which has been added.

  3. With too few bins it becomes difficult to identify the distribution of the data.

Example with bins set to 10

Example with bins set to 5

  4. Example using 0-5 mmol/l = Normal, 5.01-7.5 mmol/l = High, 7.51+ mmol/l = Very high.

You can change the bars to any colour you wish within the Chart Builder or the editing window.

13 Comments
Juliet

I am getting an error message for this command. What might be the problem?
histogram bmi_grp4, discrete normal title(“Body Mass Index”) xlabel (1 “Underweight” 2 “Normal weight” 3 “Overweight” 4 “Obese”)

JAWERIYA

Hi, the slides that are available for download are different than the ones in the video.

Grace

Am experiencing the same issue with installing the color brewer option

Grace

I have also tried..install.packages(DescTools) but also in vain

Grace

Yes

Grace
install.packages("RDCOMClient", repos="http://www.omegahat.net/R")
Grace

I am having a challenge calculating the mode due to the version of R am using, 4.3.1. I have upgraded it to 4.3.2 but DescTools still isn’t being installed. Please assist.

Grace

mode(Whitehall_fossa$chol,na.rm=TRUE)

Alhassane

I think, I just watched one of the precise and concise summary of descriptive statistics ever. Thank to the Graph and Oxford team. ❤️

Mamadou Saidou Alareny

Great explanation!
