FoSSA: Fundamentals of Statistical Software & Analysis

A1.6: Transforming Variables

In section A1.3 you learnt about data types and data distributions. Many of the tests that we use for continuous variables assume that the data follow a Gaussian or normal distribution. You have learnt how to create a histogram and use this to assess if the data follow the appropriate distribution. But what do we do if they don’t?

In these situations it is often acceptable to ‘transform’ your variable: apply a mathematical function to each observed value, and then analyse the transformed values. This is a perfectly valid and widely accepted approach to dealing with skewed data, but you need to take care to use it appropriately. For example, if you are comparing means between two or more groups, you must perform the same transformation on all groups in the analysis so that the means remain comparable. You must also make sure that you interpret the results of any statistical tests on transformed data appropriately; it is often best to transform any means and mean differences back into the original units.
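As a minimal R sketch of back-transforming a result into the original units (using hypothetical vectors `a` and `b`, not the course data):

```r
# Hypothetical right-skewed measurements in two groups
a <- c(1.2, 1.5, 2.1, 3.8, 9.6)
b <- c(1.1, 1.3, 1.9, 2.4, 6.2)

# Analyse on the log scale
log_a <- log(a)
log_b <- log(b)
diff_log_means <- mean(log_a) - mean(log_b)

# Back-transform with exp(): the exp of a log-scale mean is a
# geometric mean, and the exp of a difference in log-scale means
# is a ratio of geometric means
exp(mean(log_a))    # geometric mean of group a
exp(diff_log_means) # ratio of the geometric means of a and b
```

Note that the back-transformed quantities are geometric means and their ratio, not arithmetic means and a difference, and should be described as such.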

The table below shows the most useful transformations to deal with different deviations from the normal distribution.

Form of data              Transformation
Slightly right-skewed     Square root
Moderately right-skewed   Logarithmic
Very right-skewed         Reciprocal
Left-skewed               Square
Counts                    Square root
Proportions               Arcsine square root
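In R, each transformation in the table is a one-line function call. A sketch using hypothetical vectors `x` (positive values) and `p` (proportions), not the course data:

```r
x <- c(0.5, 1, 2, 4, 8, 16)    # hypothetical positive, right-skewed values
p <- c(0.05, 0.20, 0.50, 0.95) # hypothetical proportions (between 0 and 1)

sqrt(x)       # square root: slight right skew (also counts)
log(x)        # natural log: moderate right skew (values must be > 0)
1 / x         # reciprocal: severe right skew (values must be non-zero)
x^2           # square: left skew
asin(sqrt(p)) # arcsine square root: proportions
```

Note the domain restrictions: the log requires strictly positive values and the reciprocal requires non-zero values, so check your variable's range before transforming.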

If you have decided to use a transformation it is best to plot your raw data (histogram or Q-Q plot depending on your preference), create a new variable with the transformed data, and then plot the transformed variable in the same way to check that the transformation has had the desired effect.

Sometimes variable transformations do not help. For example, if the measures of central tendency are all at the very extreme values of a variable then it is unlikely that any transformation will be useful. In these cases, non-parametric tests may be preferred. We will cover when and how to use non-parametric tests in Module B3.

A1.6 PRACTICAL: R

In this example we are going to take the natural log (ln) of the variable HDL-C in the Whitehall FoSSA data set.

The natural log is the log to the base of the constant e (also known as Euler’s number), which is approximately 2.718281828459.

If you are interested in why this number is so important, a quick Google of the term Euler’s number should suffice, but you do not need to know why we use this value in order to run this analysis. (NB: Euler’s number is not to be confused with Euler’s constant, which is a different thing entirely!)

Firstly, plot and inspect a histogram of the variable hdlc. Adding a normal curve might help to see what we are looking at.
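One way to overlay a normal curve on a histogram in base R is to plot the histogram on the density scale and add the curve with matching mean and standard deviation. A sketch with simulated data (substitute `Whitehall_fossa$hdlc` for `x` in the practical):

```r
set.seed(1)
x <- rexp(500) # simulated right-skewed data for illustration

m <- mean(x)
s <- sd(x)

# freq = FALSE plots densities, so the normal curve is on the same scale
hist(x, breaks = 20, freq = FALSE)
curve(dnorm(x, mean = m, sd = s), add = TRUE)
```

The mean and standard deviation are computed before the `curve()` call because inside `curve()` the symbol `x` refers to the plotting grid, not your data.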

To perform a transformation in R you write:

data$variable_log <- log(data$variable)

Use this function to log transform the ‘hdlc’ variable. Now create a new histogram of your transformed variable. How does this compare to your original? Does this transformation help your data? What might you do differently?

Answer
Whitehall_fossa$hdl_log <- log(Whitehall_fossa$hdlc)

Then to look at the histogram we write:

hist(Whitehall_fossa$hdl_log, breaks=20)

You can see that the data are closer to following the normal curve in this histogram compared to the previous. In the first histogram the most common values (taller bars) are slightly outside of the normal curve to the left of the mean, and there is a more gradual slope on the right-hand side of the mean (peak of the normal curve). This means there are more values in the bins to the right of the mean than to the left of the mean. This is a (very) slight right skew.

In the histogram of the transformed data the highest points are closer to the mean, so this could be classed as a slight improvement, but we are now seeing some of the most common values slightly to the right of the mean, and a more gradual slope off to the left, so a very minimal left skew. This wouldn’t worry us if we saw a skew this slight in the raw data, but in transformed data it is an ‘over-correction’, and indicates that the log transformation might not be the best option for this variable. Referring back to the table at the top of the page, we would select the square root transformation here instead.

In reality, the original hdlc variable was quite close to normality, and the sample size was large enough that we would not be worried, but it is a good candidate to experiment on.

A1.6 PRACTICAL: Stata

In this example we are going to take the natural log (ln) of the variable HDL-C in the Whitehall FoSSA data set.

The natural log is the log to the base of the constant e (also known as Euler’s number), which is approximately 2.718281828459.

If you are interested in why this number is so important, a quick Google of the term Euler’s number should suffice, but you do not need to know why we use this value in order to run this analysis. (NB: Euler’s number is not to be confused with Euler’s constant, which is a different thing entirely!)

Firstly, plot and inspect a histogram of the variable hdlc. Adding a normal curve might help to see what we are looking at.
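In Stata, the histogram command has a normal option that overlays a normal curve on the plot, e.g.:

histogram hdlc, bin(20) normal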

To perform a transformation, type “help math functions” and you will see all the potential functions you can apply using the command “generate”. The setup is:

generate new_variable = function(old_variable)

If we want to log transform the variable ‘hdlc’, we would type:

gen log_hdl = ln(hdlc)

Use this function to log transform the ‘hdlc’ variable. Now create a new histogram of your transformed variable. How does this compare to your original? Does this transformation help your data? What might you do differently?

Answer

The resultant histogram of your log transformed data should look like this:

You can see that the data are closer to following the normal curve in this histogram compared to the previous. In the first histogram the most common values (taller bars) are slightly outside of the normal curve to the left of the mean, and there is a more gradual slope on the right-hand side of the mean (peak of the normal curve). This means there are more values in the bins to the right of the mean than to the left of the mean. This is a (very) slight right skew.

In the histogram of the transformed data the highest points are closer to the mean, so this could be classed as a slight improvement, but we are now seeing some of the most common values slightly to the right of the mean, and a more gradual slope off to the left, so a very minimal left skew. This wouldn’t worry us if we saw a skew this slight in the raw data, but in transformed data it is an ‘over-correction’, and indicates that the log transformation might not be the best option for this variable. Referring back to the table at the top of the page, we would select the square root transformation here instead.

In reality, the original hdlc variable was quite close to normality, and the sample size was large enough that we would not be worried, but it is a good candidate to experiment on.

A1.6 PRACTICAL: SPSS

In this example we are going to take the natural log (ln) of the variable HDL-C in the Whitehall FoSSA data set.

The natural log is the log to the base of the constant e (also known as Euler’s number), which is approximately 2.718281828459.

If you are interested in why this number is so important, a quick Google of the term Euler’s number should suffice, but you do not need to know why we use this value in order to run this analysis. (NB: Euler’s number is not to be confused with Euler’s constant, which is a different thing entirely!)

Firstly, plot and inspect a histogram of the variable hdlc. Adding a normal curve might help to see what we are looking at.

To run the transformation, select:

Transform >> Compute Variable

Put the name of the new variable you want (log_hdlc) in the ‘Target Variable’ box.

Select the function Ln from the box on the bottom right titled ‘Function and Special Variables’ and move it up to the ‘Numeric Expression’ box. To save you scrolling through all of the functions, Ln is grouped within ‘Arithmetic’, so selecting that in the ‘Function group’ box directly above reduces the number of choices in the list.

Select the name of your original variable (hdlc) from the left hand side, and move it across into the Numeric Expression box to replace the question mark. On some occasions the question mark does not disappear when you move the variable into the expression; if this happens, just delete it, as leaving it in will stop SPSS from creating the new variable.

Press OK to run, and your new variable will appear in the data and variable views.
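Equivalently, the same transformation can be run from an SPSS syntax window with the COMPUTE command:

COMPUTE log_hdlc = LN(hdlc).
EXECUTE.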

Now create a new histogram of your transformed variable. How does this compare to your original? Does this transformation help your data? What might you do differently?

Answer

The resultant histogram of your log transformed data should look like this.

You can see that the data are closer to following the normal curve in this histogram compared to the previous. In the first histogram the most common values (taller bars) are slightly outside of the normal curve to the left of the mean, and there is a more gradual slope on the right-hand side of the mean (peak of the normal curve). This means there are more values in the bins to the right of the mean than to the left of the mean. This is a (very) slight right skew.

In the histogram of the transformed data the highest points are closer to the mean, so this could be classed as a slight improvement, but we are now seeing some of the most common values slightly to the right of the mean, and a more gradual slope off to the left, so a very minimal left skew. This wouldn’t worry us if we saw a skew this slight in the raw data, but in transformed data it is an ‘over-correction’, and indicates that the log transformation might not be the best option for this variable. Referring back to the table at the top of the page, we would select the square root transformation here instead.

In reality, the original hdlc variable was quite close to normality, and the sample size was large enough that we would not be worried, but it is a good candidate to experiment on.
