DEV Community

Afroza Nowshin

Learning Statistics with R, Part-1

Before starting to learn machine learning and data analysis, it is crucial to learn the basics of statistics. When I work in a data engineering pipeline, the first thing I do is look at the statistical properties of the given dataset. While there are built-in functions for this in Python and R, we tend to skip over what these functions do behind the scenes. Today, I will explain how to do statistical data analysis with R.

Let's say you are a statistics noob. Today is Monday and you want to know what to wear: cotton if the max temperature is above 25°C, and silk or linen if the max temperature is below 25°C. Can statistics help you?

Well, weather and climate forecasting relies on analyzing historical data and finding patterns with various machine learning algorithms. But to understand the basics of statistics, we need to collect some data, like the following:

Day Max Temperature (°C)
2024-04-28 22
2024-04-29 24
2024-04-30 26
2024-05-01 27
2024-05-02 23
2024-05-03 24
2024-05-04 24
2024-05-05 25
2024-05-06 21
2024-05-07 20

For making predictions and doing statistical analysis, you will need to collect far more data than this. Still, your first order of business is to make a data frame out of this table. In R, you do it this way:

# Create data frame
temperature_data <- data.frame(
  Day = c("2024-04-28", "2024-04-29", "2024-04-30", "2024-05-01", "2024-05-02", "2024-05-03", "2024-05-04", "2024-05-05", "2024-05-06", "2024-05-07"),
  MaxTemperature = c(22, 24, 26, 27, 23, 24, 24, 25, 21, 20)
)
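Before computing anything, it is worth a quick sanity check of the new data frame. Base R's str() and summary() show the column types and a five-number summary at a glance:

```r
# Recreate the data frame from the table above
temperature_data <- data.frame(
  Day = c("2024-04-28", "2024-04-29", "2024-04-30", "2024-05-01", "2024-05-02",
          "2024-05-03", "2024-05-04", "2024-05-05", "2024-05-06", "2024-05-07"),
  MaxTemperature = c(22, 24, 26, 27, 23, 24, 24, 25, 21, 20)
)

# Column names, types, and a preview of the values
str(temperature_data)

# Minimum, quartiles, median, mean, and maximum of each numeric column
summary(temperature_data)
```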


With this data frame, we are good to go for some statistics. First, we will check the 'distribution' of the data. A simple way to visualize a distribution is a histogram: the data range is divided into 'bins' of equal width, and the height of each bin corresponds to the number of data points that fall inside that bin.

Temperature Range (°C) Frequency
20-21 2
21-22 1
22-23 1
23-24 3
24-25 1
25-26 1
26-27 1
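To see what happens behind the scenes, the same counts can be reproduced by hand with cut() and table(). This sketch assumes integer break points from 20 to 27, matching the table above:

```r
temperatures <- c(22, 24, 26, 27, 23, 24, 24, 25, 21, 20)

# Build right-closed intervals like hist() uses by default,
# with the lowest value included in the first bin
bins <- cut(temperatures, breaks = 20:27, include.lowest = TRUE, right = TRUE)

# Count how many observations fall into each interval
table(bins)
```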

R will build the required data intervals and create the bins for you with its hist() function.

temperatures <- c(22, 24, 26, 27, 23, 24, 24, 25, 21, 20)
# Plotting the histogram
hist(temperatures, breaks=7, col="blue", main="Histogram of Maximum Temperatures",
     xlab="Temperature (°C)", ylab="Frequency", ylim=c(0,3))


The output will look like the following:

histogram

Notice that the number of bins is not strictly equal to the number you set for the breaks parameter. This parameter is only a suggestion: R passes it to pretty() to pick "nice" break points, so the actual number of intervals may be equal to, more than, or fewer than the value of breaks.
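You can check this without drawing anything: hist() with plot = FALSE returns the break points R actually chose and the resulting bin counts, so you can compare them with the value you requested:

```r
temperatures <- c(22, 24, 26, 27, 23, 24, 24, 25, 21, 20)

# Suggest 7 breaks, but only compute the histogram object
h <- hist(temperatures, breaks = 7, plot = FALSE)

# The break points R actually chose, and the count in each bin
h$breaks
h$counts
```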

Before learning the R commands and functions that describe the statistical properties of the data, we need to understand why we want these properties in the first place. Let us arrange the data points from the lowest to the highest:

20, 21, 22, 23, 24, 24, 24, 25, 26, 27

The average or mean value is 23.6, and the middle value or median is calculated by averaging the 5th and 6th values, which in our case is 24. Because our dataset has an even number of values, we take the average of the middle two; if the dataset had an odd number of values, we would simply take the middle value as the median. Another, less used property is the mode, the most frequently occurring value in a dataset (here, 24). These three properties give us an overview of the data; for a large, evenly distributed dataset, the plot looks like a bell curve with the mean at its center. Think of the mean as the blades at the center of a blender, with the data spinning around them.
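Base R has no built-in function for the statistical mode (confusingly, mode() reports a value's storage type instead), so a small helper is needed. A minimal sketch, with stat_mode being our own name for it:

```r
temperatures <- c(22, 24, 26, 27, 23, 24, 24, 25, 21, 20)

# Return the most frequently occurring value(s) in a vector
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

mean(temperatures)       # 23.6
median(temperatures)     # 24
stat_mode(temperatures)  # 24, which occurs three times
```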

bell curve

This is called a 'Normal Distribution'. In a normal distribution, the mean and median are equal; in the picture this common center is denoted by 0. Most data points sit in the hump of the curve: about 68% of them lie within one standard deviation of the mean. The following code will draw a density plot of our temperatures:

plot(density(temperatures), type = 'l', col = "darkred", lwd = 2, 
     main = "Temperature Density Plot", xlab = "Temperature (°C)", ylab = "Density")


density

Since our data points are few and not 'ideal', the curve looks distorted. That is okay: real-life datasets rarely have exactly equal mean and median values. R has built-in functions to check these values:

# Calculate mean
mean_temp <- mean(temperatures)

# Calculate median
median_temp <- median(temperatures)


The median is an important property of a dataset. To find the outliers of a dataset, you need to find the Interquartile Range. The plain range is simply the difference between the minimum and the maximum data points; what is the interquartile range, then?

We know that in the sorted dataset the median is 24, which is also called the 2nd Quartile (Q2) or 50th percentile. This quartile splits the dataset so that 50% of the data points are lower than this point and 50% are higher. After this split, the median of the lower half is called the 1st Quartile (Q1) or 25th percentile. Similarly, the median of the upper half is the 3rd Quartile (Q3) or 75th percentile. These three quartiles split the distribution into four equal portions, each containing 25% of the entire dataset. The interquartile range, or IQR, is the difference between Q3 and Q1. The graphical representation of these properties is called a Boxplot, or Box and Whiskers plot.
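To make the split concrete, here is a sketch that computes the quartiles as medians of the lower and upper halves. Note that this "median-of-halves" method can differ slightly from the interpolation that R's quantile() uses by default:

```r
temperatures <- sort(c(22, 24, 26, 27, 23, 24, 24, 25, 21, 20))

n  <- length(temperatures)       # 10 values
q2 <- median(temperatures)       # overall median: 24

# Split into lower and upper halves (even n, so a clean 5/5 split)
lower_half <- temperatures[1:(n / 2)]
upper_half <- temperatures[(n / 2 + 1):n]

q1  <- median(lower_half)        # median of 20, 21, 22, 23, 24 -> 22
q3  <- median(upper_half)        # median of 24, 24, 25, 26, 27 -> 25
iqr <- q3 - q1                   # 25 - 22 = 3

c(Q1 = q1, Q2 = q2, Q3 = q3, IQR = iqr)
```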

boxplot

The code to draw boxplot is:

# Define the temperature data
temperatures <- c(22, 24, 26, 27, 23, 24, 24, 25, 21, 20)

# Calculate the quartiles
quartiles <- quantile(temperatures)

# Create a boxplot
boxplot(temperatures, main="Boxplot of Temperatures",
        ylab="Temperature (degrees)",
        col="lightblue", border="darkblue")

# Adjust label positions to avoid overlap
# The offset can be adjusted for better alignment
offset <- 0.7

# Add quartile labels
text(x = 1, y = quartiles[2] - offset, labels = paste("Q1 =", quartiles[2]), col="red", pos=3)
text(x = 1, y = quartiles[3], labels = paste("Q2 (Median) =", quartiles[3]), col="blue", pos=3)
text(x = 1, y = quartiles[4] + offset, labels = paste("Q3 =", quartiles[4]), col="red", pos=3)

To find the outliers, you need to define "bounds" beyond which you will discard data points. The standard rule places the bounds 1.5 times the IQR beyond Q1 and Q3:

# Compute the quartiles and the interquartile range first
Q1 <- quantile(temperatures, 0.25)
Q3 <- quantile(temperatures, 0.75)
iqr <- Q3 - Q1   # equivalently, IQR(temperatures)

# Define lower and upper bounds to identify outliers
lower_bound <- Q1 - 1.5 * iqr
upper_bound <- Q3 + 1.5 * iqr

# Identifying outliers
outliers <- temperatures[temperatures < lower_bound | temperatures > upper_bound]

# Print outliers
print(outliers)

The output is an empty vector, numeric(0), since there are no outliers in our dataset. In real life, you will have datasets that do contain outliers. As part of the data cleaning process, we identify the outliers and then replace them with the median (Q2) value. Why?

Extreme outliers heavily skew the mean of a dataset, but not the median. Let's look at the following data points:

10,20,21,22,23,24,1000

The mean of this dataset is 160, heavily influenced by the outlier 1000, while the median is only 22. How do we reduce this effect? Just replace the outlier value with the median, in our case 22.
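In R, this replacement takes only a few lines. A sketch using the 1.5 × IQR rule on the small example above; note that with this rule the low value 10 gets flagged along with 1000:

```r
values <- c(10, 20, 21, 22, 23, 24, 1000)

# Quartiles and IQR (quantile() uses linear interpolation by default)
q1  <- quantile(values, 0.25)
q3  <- quantile(values, 0.75)
iqr <- q3 - q1

lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Flag values outside the bounds and replace them with the median
is_outlier <- values < lower | values > upper
values[is_outlier] <- median(values)

mean(values)  # far closer to the rest of the data than the original 160
```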

In the next post, we will explore some more concepts and R functions to finally answer the question: what to wear tomorrow?
