Higher Applications of Mathematics: R Studio Workbook

clellandmaths

1 About this guide

This guide will help you complete the practical work for the statistics component of your course. You’ll learn how to use R Studio to load data, calculate key statistics, create visualizations, and perform common statistical tests.

Using software is an essential skill, but the most important part is understanding what you are doing, why you are doing it, and what your results mean.

2 Loading a Dataset to R Studio

The first steps in R Studio are always the same, no matter what analysis you want to run.

Because we are running this in the browser, your teacher has already loaded the data for you! You don’t need to set a working directory.

Note on Running Code: In this workbook, clicking the “Run” button will execute all the code in the box at the same time. Also, because we don’t have an “Environment” tab to view our files, you should always add head(your_file) to the bottom of your box if you want to see a visual output to prove your data loaded!

Enter the code to load your data: In the empty code block below, type the following three lines of code exactly as they appear:

coffee <- read.csv("coffee_shop.csv")
attach(coffee)
head(coffee)

To run your code, click the Run button.
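For anyone trying this outside the browser version, here is a minimal sketch of what loading a file involves. The file name tiny_coffee.csv and its three rows are made up for illustration (they are not the real coffee_shop.csv), and the sketch writes the file to a temporary folder first so that it can run anywhere.

```r
# Hypothetical three-row stand-in for coffee_shop.csv (all values are made up)
csv_lines <- c(
  "CustomerID,Age,Gender,VisitFrequency,Spent",
  "1,34,F,Weekly,12.50",
  "2,19,M,Daily,21.00",
  "3,52,F,Monthly,6.75"
)
path <- file.path(tempdir(), "tiny_coffee.csv")
writeLines(csv_lines, path)

# read.csv() turns the file into a data frame; head() shows the first rows
coffee <- read.csv(path)
head(coffee)

# str() lists every column with its type, a quick check that the load worked
str(coffee)

# A column can also be reached without attach(), using the $ operator
mean_age <- mean(coffee$Age)
mean_age
```

Using coffee$Age instead of attach() is a little more typing, but it always makes clear which dataset a variable comes from.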

3 Descriptive Statistics

Let’s inspect the coffee_shop.csv data. It contains data for a sample of coffee shop customers:

Variable	Data Type	Description
CustomerID	Numerical (Discrete)	A unique ID for each customer
Age	Numerical (Discrete)	Age of the customer in years
Gender	Categorical	Customer’s gender (M/F)
VisitFrequency	Categorical	How often they visit (Daily, Weekly, Monthly)
Spent	Numerical (Continuous)	Average amount spent per week

Now, let’s calculate some basic descriptive statistics for the numerical variables. Type the following commands one at a time into the empty block below and run them:

mean(Age)
median(Age)
sd(Age)

Task: What does the summary command give you? Type summary(Age) into the box below and run it.

You should see the Min (minimum), 1st Qu. (first quartile), Median, Mean, 3rd Qu. (third quartile), and Max (maximum) values. This is a very useful command!
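To see how the quartiles in the summary output relate to the interquartile range, here is a small sketch. The ages vector is made-up data for illustration, not taken from coffee_shop.csv.

```r
# Made-up ages for illustration only
ages <- c(19, 23, 27, 31, 34, 40, 45, 52, 60)

summary(ages)                       # Min, 1st Qu., Median, Mean, 3rd Qu., Max

q <- quantile(ages, c(0.25, 0.75))  # the two quartiles on their own
q

# The interquartile range is the gap between the two quartiles
iqr_by_hand <- unname(q[2] - q[1])
iqr_by_hand

IQR(ages)                           # the built-in function gives the same value
```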

Now, try finding the Interquartile Range (IQR) for the amount Spent. Type IQR(Spent) below:

Challenge: Can you work out how to calculate the semi-interquartile range (SIQR)? (Hint: It’s just the IQR divided by 2). Type your calculation below:

3.1 🏋️ Exercise 1

🧹 Clear Old Data

Why do we do this? In R, if you attach a new dataset that has the same column names (like ‘Age’) as one you attached earlier, both stay attached and the newer columns mask the older ones, so it is easy to lose track of which data a command like mean(Age) is using. We usually use the detach() command to fix this, but it can be hard to remember exactly which file you used last! To make life easier, we have created a custom command just for this workbook. Typing reset() will automatically find and wipe any old data for you!
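The sketch below shows the standard detach() approach in plain R. The two small data frames (shop and homes) are hypothetical, and the workbook's own reset() command is not used here because its internals are not shown in this guide.

```r
# Two made-up datasets that both contain a column called Age
shop  <- data.frame(Age = c(25, 31, 47))    # customer ages
homes <- data.frame(Age = c(5, 60, 112))    # house ages

attach(shop)
attach(homes)   # R prints a message saying Age from shop is now masked

# Age now refers to the most recently attached dataset (homes)
mean_while_attached <- mean(Age)
mean_while_attached

# detach() removes a dataset from the search path again
detach(homes)
detach(shop)

# With nothing attached, name the dataset explicitly
mean(shop$Age)
```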

Type reset() into the box below and run it:

Now, load the file house_prices.csv. Give it a sensible name (e.g., houses), attach the variable names, and look at the first 6 lines. Type the following:

houses <- read.csv("house_prices.csv")
attach(houses)
head(houses)

Use the code block below to calculate statistics for the Age of the houses. Type:

mean(Age)
sd(Age)
summary(Age)
IQR(Age)

Now calculate the same set of descriptive statistics for the Price of the houses.

4 📦 Boxplots

In this activity, we’re going to use the gym_members.csv file.

🧹 Clear Old Data

Remember, we usually use detach() to clear data, but in this workbook we use our custom shortcut. Clear the previous data by typing reset() and clicking Run:

Now, import the new file and inspect the variables by typing the following three commands:

gym <- read.csv("gym_members.csv")
attach(gym)
head(gym)

Now, let’s create a simple boxplot for the Age of gym members. Type the following into the box: boxplot(Age, main="Age of Gym Members", horizontal=TRUE)

4.1 Customizing Your Boxplot

We can easily add more detail, like axis labels and colour. Type the following: boxplot(Age, main="Age of Gym Members", horizontal=TRUE, xlab="Years", col="lightblue")

4.2 ⚖️ Creating Comparative Boxplots

Boxplots are most powerful when comparing two or more groups. Let’s compare the ages of male and female members.

We use the tilde symbol ~ to do this. The formula is Numerical_Variable ~ Categorical_Variable.

Type the following code: boxplot(Age~Gender, main="Age of Gym Members", horizontal=TRUE, xlab="Years", ylab="Gender", col="lightblue")
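The same Numerical_Variable ~ Categorical_Variable formula works in other R functions too. As a sketch, the example below uses a made-up six-row gym data frame (not the real gym_members.csv) to get the median age of each group, which is the middle line of each box in a comparative boxplot.

```r
# Made-up gym data for illustration only
gym <- data.frame(
  Age    = c(22, 35, 41, 19, 28, 53),
  Gender = c("F", "M", "F", "M", "F", "M")
)

# Same formula as the boxplot: numerical variable ~ categorical variable.
# data = gym tells R where to find the columns, so attach() is not needed.
medians <- aggregate(Age ~ Gender, data = gym, FUN = median)
medians

# plot = FALSE returns the numbers behind each box instead of drawing them;
# each column of $stats is one group's five-number summary
boxplot(Age ~ Gender, data = gym, plot = FALSE)$stats
```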

💡 Quick Question: Look at the plot. What do the small circles indicate?

Answer: These are outliers. They are data points that fall well outside the range of the other data in that group. By default, R draws any point lying more than 1.5 times the length of the box (roughly the IQR) beyond either end of the box as an individual circle.
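Here is a sketch of how that rule picks out the circles, using made-up ages. It relies on fivenum() and boxplot.stats(), which are part of base R. Note that the box ends are "hinges", which are usually very close to the quartiles but not always identical to them.

```r
# Made-up ages for illustration only
ages <- c(18, 21, 22, 24, 25, 27, 29, 30, 33, 68)

five <- fivenum(ages)            # min, lower hinge, median, upper hinge, max
box_length <- five[4] - five[2]  # length of the box

# Anything beyond these two fences is drawn as a separate circle
lower_fence <- five[2] - 1.5 * box_length
upper_fence <- five[4] + 1.5 * box_length

flagged <- ages[ages < lower_fence | ages > upper_fence]
flagged

# boxplot.stats() reports the points that boxplot() would plot individually
boxplot.stats(ages)$out
```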

4.3 Handling Categorical Data

Look back at your gym_members.csv file. What type of data is Gender? It’s categorical. We can’t find a “mean” gender, but we can count how many members are in each category.

In R Studio, use the table command to get a simple frequency count. Type table(Gender) below:

4.4 🏋️ Exercise 2

For this exercise, you’ll need the student_data.csv file.

🧹 Clear Old Data

Remember, we usually use detach() to clear data, but in this workbook we use our custom shortcut. Clear the previous data by typing reset() and clicking Run:

Load student_data.csv, attach it, and check the variable names by typing the following three lines:

students <- read.csv("student_data.csv")
attach(students)
head(students)

Produce summary data for Height and StudyHours.

Produce a boxplot for StudyHours.

Produce a comparative boxplot to compare the Height of students based on their smoking status (Smokes).

5 📋 Tables in R Studio

You can create frequency tables in R Studio or Excel. For simple counts, R Studio is very fast.

Let’s re-load our coffee_shop.csv dataset.

🧹 Clear Old Data

Remember, we usually use detach() to clear data, but in this workbook we use our custom shortcut. Clear the previous data by typing reset() and clicking Run:

Now, load, attach, and inspect the coffee data. Type the following:

coffee <- read.csv("coffee_shop.csv")
attach(coffee)
head(coffee)

To create a simple frequency table counting the VisitFrequency, type table(VisitFrequency):

To create a contingency table (two-way table) showing visit frequency by gender, type table(Gender, VisitFrequency):

5.1 Tables with Proportions and Percentages

To calculate proportions in a table, type prop.table(table(VisitFrequency)):

To show percentages, multiply by 100. Type prop.table(table(VisitFrequency)) * 100:
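The sketch below goes one step further on made-up survey answers (not the real coffee data): it rounds the percentages and uses the margin argument of prop.table() to get percentages within each row of a two-way table.

```r
# Made-up survey answers for illustration only
gender <- c("F", "M", "F", "M", "F", "M", "F", "F")
visits <- c("Daily", "Weekly", "Weekly", "Monthly",
            "Daily", "Weekly", "Weekly", "Daily")

counts  <- table(visits)
percent <- round(prop.table(counts) * 100, 1)   # one decimal place
percent

# Two-way table: margin = 1 makes each row (each gender) add up to 100%
two_way     <- table(gender, visits)
row_percent <- round(prop.table(two_way, margin = 1) * 100, 1)
row_percent
```

Without the margin argument, prop.table() divides every cell by the overall total, so the whole table adds up to 100% rather than each row.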

6 📊 Histograms

Histograms are an excellent way to quickly check the distribution of numerical data (for example, to see if it’s normally distributed, i.e., “bell-shaped”).

Let’s use the student_data.csv file again.

🧹 Clear Old Data

Remember, we usually use detach() to clear data, but in this workbook we use our custom shortcut. Clear the previous data by typing reset() and clicking Run:

Now, give the new data a name, attach the variables, and look at the first 6 rows:

students <- read.csv("student_data.csv")
attach(students)
head(students)

Now, enter hist(Height) to generate the histogram:

6.1 🏋️ Exercise 3

Using the student data you just loaded, generate histograms for StudyHours and Pulse.

Describe the shape of the distribution for each. Do they appear broadly normally distributed?
Now, try to generate a histogram for Smokes. Type hist(Smokes) below:

Explain why you got an error message. (Hint: What type of data is Smokes? What kind of data do histograms require?)

7 📈 Scattergraphs

Scattergraphs (or scatterplots) are used to visualize the relationship, if any, between two numerical variables.

Let’s use the student dataset to see if there is a relationship between study hours and pulse rate. Type the following code: plot(StudyHours, Pulse, main="Study Hours vs. Pulse Rate")

Looking at the plot, we can see a negative linear relationship. As study hours increase, the pulse rate tends to decrease.

7.1 🏋️ Exercise 4

For this exercise, we will use the fast_food_data.csv file.

🧹 Clear Old Data

Remember, we usually use detach() to clear data, but in this workbook we use our custom shortcut. Clear the previous data by typing reset() and clicking Run:

Now, type the following commands to load the new data, attach the variable names, and check the first six lines:

fastfood <- read.csv("fast_food_data.csv")
attach(fastfood)
head(fastfood)

What relationship would you expect between Fat and Calories? Produce a scatterplot to check this.

Would you expect a relationship between Sodium (salt) and Sugar content? Produce a scatterplot to check.

8 🧮 Correlation and Linear Regression

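The scatterplot shows us the direction and rough shape of a relationship. The correlation coefficient (r) measures the strength of the linear relationship, from -1 (perfect negative) to +1 (perfect positive), with values near 0 meaning little or no linear relationship.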

8.1 Correlation

Let’s switch back to our student_data.csv example.

🧹 Clear Old Data

Remember, we usually use detach() to clear data, but in this workbook we use our custom shortcut. Clear the previous data by typing reset() and clicking Run:

Now, re-load the student data, attach it, and check the variables:

students <- read.csv("student_data.csv")
attach(students)
head(students)

Always start with a scattergraph to visually check that a linear relationship exists. Calculate the correlation coefficient (r) by typing cor(StudyHours, Pulse):

This shows a moderate negative correlation.

Test for statistical significance by typing cor.test(StudyHours, Pulse):

Interpreting the output:
* The p-value is 2.74e-09. This is scientific notation for \(2.74 \times 10^{-9}\), which is a very small number.
* Since p < 0.05, we can reject the null hypothesis. The result is statistically significant.
* We can conclude there is a statistically significant, moderate, negative linear relationship between study hours and pulse rate.
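If the scientific notation is hard to read, the individual numbers can be pulled out of the test result. This is a sketch on made-up values (hours and pulse are hypothetical, not the student_data.csv columns), so the numbers it prints will differ from the ones quoted above.

```r
# Made-up values for illustration only
hours <- c(2, 4, 5, 7, 8, 10, 12, 14)
pulse <- c(88, 84, 85, 78, 76, 74, 70, 66)

result <- cor.test(hours, pulse)

r <- unname(result$estimate)    # the correlation coefficient
p <- result$p.value             # the p-value as an ordinary number

r
format(p, scientific = FALSE)   # written out without the e-notation
p < 0.05                        # TRUE means significant at the 5% level
```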

8.2 Linear Regression

Now that we know a significant relationship exists, we can apply a linear regression model, or “line of best fit”.

Generate the model by typing lm(Pulse ~ StudyHours):

R gives us the (Intercept) c and the gradient (StudyHours) m.
The line of best fit is: y = -1.896x + 89.684
Or: Pulse = -1.896 × StudyHours + 89.684
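As a sketch of where m and c come from, the example below stores a model and reads the two numbers with coef(). The hours and pulse values are made up for illustration, so the equation printed will not match the one above.

```r
# Made-up values for illustration only
hours <- c(2, 4, 5, 7, 8, 10, 12, 14)
pulse <- c(88, 84, 85, 78, 76, 74, 70, 66)

model <- lm(pulse ~ hours)   # dependent variable ~ independent variable
co <- coef(model)
co

c_intercept <- unname(co["(Intercept)"])   # c in y = mx + c
m_gradient  <- unname(co["hours"])         # m in y = mx + c

# Print the line of best fit in words
cat("pulse =", round(m_gradient, 3), "x hours +", round(c_intercept, 3), "\n")
```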

Fit the line to the graph by typing the following two lines of code:

plot(StudyHours, Pulse, main="Study Hours vs. Pulse Rate")
abline(lm(Pulse ~ StudyHours), col="red")

Check the ‘goodness of fit’ (R²) by typing summary(lm(Pulse ~ StudyHours)):

Our R² value is 0.3859 (or 38.6%).
This tells us that 38.6% of the variation in students’ pulse rates can be explained by the variation in their study hours.
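For a regression with a single explanatory variable, R² is simply the correlation coefficient squared. The sketch below checks this on made-up values (hours and pulse are hypothetical) and shows how to read R² directly from the summary.

```r
# Made-up values for illustration only
hours <- c(2, 4, 5, 7, 8, 10, 12, 14)
pulse <- c(88, 84, 85, 78, 76, 74, 70, 66)

model <- lm(pulse ~ hours)

r_squared <- summary(model)$r.squared   # "Multiple R-squared" in the output
r         <- cor(hours, pulse)

r_squared
r^2          # the same number, obtained by squaring r
```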

8.3 🎯 Making Predictions

We can use our model to predict a pulse rate for a given number of study hours.

Let’s predict the pulse rate for a student who studies 10 hours per week. Type the following: predict(lm(Pulse~StudyHours), newdata=data.frame(StudyHours=10), interval="pred")

Interpretation:
* fit: The single best-fit prediction is 70.7
* lwr/upr: This is the 95% prediction interval. We are 95% confident that an individual student who studies 10 hours will have a pulse rate between 61.4 and 80.1
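The fit value is just the line of best fit evaluated at the chosen x value. The sketch below confirms this on made-up values (hypothetical hours and pulse, so the numbers differ from those above). It also spells out "prediction" in full, which "pred" abbreviates, and compares it with the narrower confidence interval for the mean.

```r
# Made-up values for illustration only
hours <- c(2, 4, 5, 7, 8, 10, 12, 14)
pulse <- c(88, 84, 85, 78, 76, 74, 70, 66)

model <- lm(pulse ~ hours)
new_student <- data.frame(hours = 10)   # column name must match the model

pred <- predict(model, newdata = new_student, interval = "prediction")
pred

# The same fit, worked out by hand from y = mx + c
by_hand <- coef(model)[["(Intercept)"]] + coef(model)[["hours"]] * 10
by_hand

# Interval for the average pulse of all students studying 10 hours:
# always narrower than the interval for one individual student
conf <- predict(model, newdata = new_student, interval = "confidence")
conf
```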

8.4 🏋️ Exercise 5

(Use the fast_food_data.csv file for this exercise)

🧹 Clear Old Data

Remember, we usually use detach() to clear data, but in this workbook we use our custom shortcut. Clear the previous data by typing reset() and clicking Run:

Now load and inspect the fast food data. Type the following:

fastfood <- read.csv("fast_food_data.csv")
attach(fastfood)
head(fastfood)

Generate a labelled scattergraph showing Fat on the x-axis and Calories on the y-axis.

Calculate the correlation coefficient (r). Comment on the direction, strength, and statistical significance.

Apply a linear regression line to your graph and write down the equation.

Calculate the coefficient of determination (R²) and comment on what this tells us.

a) Predict the calorie value for a menu item with 34g of fat.
b) Predict the calorie value for a menu item with 65g of fat.

Comment on which prediction is more likely to be accurate and why. (Hint: This is about extrapolation.)

9 🔬 t-tests

A t-test is a significance test used to compare the means of two samples of numerical data.

There are two main types:
1. Paired t-test: Used when the two samples are related (e.g., before/after measurements).
2. Independent t-test: Used when the two samples are separate and unrelated.

9.1 Paired t-test Example

A group of 15 participants joined a diet program. We measured weight before and after.

🧹 Clear Old Data

Remember, we usually use detach() to clear data, but in this workbook we use our custom shortcut. Clear the previous data by typing reset() and clicking Run:

Now, load the diet data and run the test. Type the following commands into the block below:

diet <- read.csv("diet_study.csv")
attach(diet)
head(diet)
t.test(WeightBefore, WeightAfter, paired=TRUE)

Interpreting the output:
* p-value: 3.71e-07 is much less than 0.05
* Conclusion: Result is statistically significant. We reject the null hypothesis.
* Mean weight loss was 4.7 kg
* 95% CI: [3.5, 5.8] kg (entirely positive, confirming significant weight loss)
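A paired t-test works on the differences within each pair. The sketch below illustrates this with made-up weights for six hypothetical participants (not the diet_study.csv data): the paired test gives the same result as a one-sample t-test on the before-minus-after differences.

```r
# Made-up weights in kg for illustration only
before <- c(82.1, 90.4, 77.8, 95.0, 88.3, 79.6)
after  <- c(78.0, 86.9, 75.1, 89.2, 84.0, 76.8)

paired   <- t.test(before, after, paired = TRUE)
on_diffs <- t.test(before - after)   # tests whether the mean difference is 0

paired$p.value
on_diffs$p.value        # identical to the paired p-value

mean(before - after)    # the mean weight loss reported by the paired test
paired$conf.int         # 95% confidence interval for the mean weight loss
```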

9.2 🏋️ Exercise 6

A teacher wants to compare exam results from two classes using different teaching methods.

🧹 Clear Old Data

Remember, we usually use detach() to clear data, but in this workbook we use our custom shortcut. Clear the previous data by typing reset() and clicking Run:

Load exam_scores.csv, attach it, and view the variables. Type:

exams <- read.csv("exam_scores.csv")
attach(exams)
head(exams)

Write suitable null and experimental hypotheses.
Should this be paired or independent? Why?
Run the correct t-test.

Comment on the findings.

10 ⚖️ z-tests (Test of Two Proportions)

We use a z-test when our data is categorical and we want to compare proportions of two groups.

Example: Comparing university plans between two schools:
* School A: 30 out of 50 students said “Yes”
* School B: 38 out of 60 students said “Yes”

For this test, we don’t need a CSV file. Just type the following code: prop.test(x=c(30, 38), n=c(50, 60))

Interpretation:
* Proportions: 60% (School A) vs 63.3% (School B)
* p-value: 0.7801 > 0.05
* Conclusion: Not significant. We fail to reject the null hypothesis.
* 95% CI includes zero, confirming no significant difference
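The output of prop.test() shows a statistic called X-squared rather than z. As a sketch of how the two are linked, the example below uses made-up counts (not the school data above) and switches off the continuity correction that prop.test() applies by default, so that X-squared equals the square of the z statistic worked out by hand.

```r
# Made-up counts for illustration only
yes   <- c(45, 30)   # number answering "Yes" in each group
asked <- c(80, 75)   # number asked in each group

# z statistic for two proportions, using the pooled proportion
p_hat  <- yes / asked
pooled <- sum(yes) / sum(asked)
z <- (p_hat[1] - p_hat[2]) /
  sqrt(pooled * (1 - pooled) * (1 / asked[1] + 1 / asked[2]))

uncorrected <- prop.test(x = yes, n = asked, correct = FALSE)
z^2
unname(uncorrected$statistic)   # X-squared, the same value as z squared

# The default (corrected) test is slightly more cautious
default <- prop.test(x = yes, n = asked)
default$p.value
uncorrected$p.value
```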

10.1 🏋️ Exercise 7

Two driving instructors are comparing pass rates:
* Instructor 1: 25 out of 35 pupils passed
* Instructor 2: 22 out of 37 pupils passed

Is there a significant difference? Run a z-test and interpret the results.

11 📝 Past Paper Practice

This section contains genuine SQA exam-style questions. In your assessment, you will be given a dataset and asked to perform statistical analysis, state your findings, and draw conclusions in context.

11.1 2026 Paper: Question 3 (Tyres)

A tyre manufacturer has collected data to investigate the relationship between the stopping distance (metres) and the tread depth (millimetres) of a new type of tyre. You must refer to the spreadsheet file Q3 Tyres.csv for the data.

🧹 Clear Old Data

Remember, we usually use detach() to clear data, but in this workbook we use our custom shortcut. Clear the previous data by typing reset() and clicking Run:

Setup: Now load the file Q3 Tyres.csv, attach it, and show the first 6 lines to check the variable names. Type your code below:

(a) Construct a scatter plot of stopping distance on tread depth. (Hint: Stopping distance is the dependent ‘y’ variable, so it goes second in your plot command).

(b) (i) Find the correlation coefficient between stopping distance and tread depth. (ii) Interpret the correlation coefficient.

Write your interpretation (b ii) down on a piece of paper, just like in the real exam!

(c) Find the equation of the regression line of stopping distance on tread depth.

(d) Estimate the stopping distance for a tyre with a tread depth of 6.2 millimetres.

11.1.1 Check Your Answers


(a) Scatter Plot
Code: plot(Depth, Distance)

(b) Correlation
Code: cor.test(Depth, Distance)
(i) Answer: -0.986 (or -0.99)
(ii) Interpretation: There is a strong negative linear relationship between Tread Depth and Stopping Distance. As tread depth decreases, stopping distance increases.

(c) Regression Line
Code: lm(Distance ~ Depth)
Answer: Stopping Distance = 43.944 – 2.376 × Tread Depth

(d) Estimate
Code: predict(lm(Distance~Depth), newdata=data.frame(Depth=6.2), interval="pred")
Answer: 29.2 metres

📺 Video Solution: Mr Clelland walks through this exact question (the 2026 Q3 Tyres Solution) on YouTube.

12 ❓ R Studio Troubleshooting

R Studio can be fussy, especially when you’re starting. The number one rule is: DON’T PANIC! 99% of all errors are small typos.

12.1 😟 Problem: “I can’t load my file!”

You get an error like ‘cannot open file’ or ‘No such file or directory’.

Note: Because we are using WebR in this interactive document, your teacher has already pre-loaded the datasets for you! You do not need to set a working directory.

If you are still getting file loading errors, try these solutions:
* Check the File Name: R is case-sensitive. “student_data.csv” ≠ “Student_Data.csv”
* Add the .csv Extension: The file name in your code must include .csv
* Check Your Quotes: The file name must be inside straight quotes "", not curly ones

12.2 😟 Problem: “My file is loaded, but mean(Age) doesn’t work!”

You get ‘object not found’ errors.

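💡 Solutions:
* Did you attach() the data?
* Did you attach() the right thing? The name inside attach() must match the name you gave the data when you loaded it
* Check variable spelling with head(): names are case-sensitive, so Age and age are different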

12.3 😟 Problem: “My plot or test command isn’t working!”

💡 Solutions:
* Comma vs. Tilde (~): Use ~ for comparisons (like boxplots/models), comma for correlations and scatterplots.
* Check Data Types: You can’t calculate the mean of categorical data
* Check Variable Order: lm(Y ~ X) puts the dependent variable first, but plot(X, Y) puts it second

13 📚 R Code Cheat Sheet

Here’s a one-page summary of the most common commands:

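13.1 1. Loading and Viewing Data

* reset(): clear any old data (custom command for this workbook)
* coffee <- read.csv("coffee_shop.csv"): load a CSV file and give it a name
* attach(coffee): make the variable names available
* head(coffee): show the first 6 rows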

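13.2 2. Descriptive Statistics

* mean(Age), median(Age), sd(Age): mean, median and standard deviation
* summary(Age): minimum, quartiles, median, mean and maximum
* IQR(Age): interquartile range
* IQR(Age) / 2: semi-interquartile range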

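13.3 3. Tables

* table(VisitFrequency): frequency table
* table(Gender, VisitFrequency): contingency (two-way) table
* prop.table(table(VisitFrequency)): proportions
* prop.table(table(VisitFrequency)) * 100: percentages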

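13.4 4. Visualizations

* boxplot(Age, main="Age of Gym Members", horizontal=TRUE, xlab="Years", col="lightblue"): boxplot
* boxplot(Age~Gender, horizontal=TRUE): comparative boxplot
* hist(Height): histogram
* plot(StudyHours, Pulse, main="Study Hours vs. Pulse Rate"): scattergraph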

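13.5 5. Correlation and Regression

* cor(StudyHours, Pulse): correlation coefficient (r)
* cor.test(StudyHours, Pulse): test the correlation for significance
* lm(Pulse ~ StudyHours): intercept and gradient of the line of best fit
* abline(lm(Pulse ~ StudyHours), col="red"): add the line to a scattergraph
* summary(lm(Pulse ~ StudyHours)): includes R²
* predict(lm(Pulse~StudyHours), newdata=data.frame(StudyHours=10), interval="pred"): prediction with a 95% prediction interval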

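13.6 6. Hypothesis Tests

* t.test(WeightBefore, WeightAfter, paired=TRUE): paired t-test
* prop.test(x=c(30, 38), n=c(50, 60)): z-test (test of two proportions)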

14 📌 Important Notes

Data Files: Always use the .csv data files supplied with this guide so that your outputs match the ones shown.
Statistical Significance: p < 0.05 means the result is statistically significant at the 5% level
Correlation ≠ Causation: Even strong correlations don’t prove cause and effect
Check Assumptions: Always visualize data before running tests

Good luck with your statistical analysis!