Compute the Coefficient of Determination (R²) in C


Introduction

In this lab, we will learn how to compute the Coefficient of Determination (R²) in C. The lab covers three steps:

First, we will compute predicted y values from a simple linear regression model with a given slope and intercept. Then, we will compute R² from the residual and total variation of the data. Finally, we will read data from a file, fit the regression line by least squares, and print the R² value with an interpretation.

This lab provides a practical approach to understanding the concept of the Coefficient of Determination and its implementation in C programming, which is a valuable skill for statistical data analysis.



Compute Predicted y Using Regression

In this step, we'll learn how to compute predicted y values using linear regression in C. We'll create a program that calculates the predicted values based on a simple linear regression model.

First, let's create a C file for our regression calculation:

cd ~/project
nano regression_prediction.c

Now, enter the following code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// Function to compute predicted y values
void computePredictedY(double *x, double *y, int n, double slope, double intercept, double *predicted_y) {
    for (int i = 0; i < n; i++) {
        predicted_y[i] = slope * x[i] + intercept;
    }
}

int main() {
    // Sample data points
    double x[] = {1.0, 2.0, 3.0, 4.0, 5.0};
    double y[] = {2.0, 4.0, 5.0, 4.0, 5.0};
    int n = sizeof(x) / sizeof(x[0]);

    // Predefined slope and intercept (for demonstration)
    double slope = 0.6;
    double intercept = 1.5;

    // Array to store predicted y values
    double predicted_y[n];

    // Compute predicted y values
    computePredictedY(x, y, n, slope, intercept, predicted_y);

    // Print original and predicted y values
    printf("Original vs Predicted Y Values:\n");
    for (int i = 0; i < n; i++) {
        printf("X: %.1f, Original Y: %.1f, Predicted Y: %.1f\n",
               x[i], y[i], predicted_y[i]);
    }

    return 0;
}

Compile the program:

gcc -o regression_prediction regression_prediction.c -lm

Run the program:

./regression_prediction

Example output:

Original vs Predicted Y Values:
X: 1.0, Original Y: 2.0, Predicted Y: 2.1
X: 2.0, Original Y: 4.0, Predicted Y: 2.7
X: 3.0, Original Y: 5.0, Predicted Y: 3.3
X: 4.0, Original Y: 4.0, Predicted Y: 3.9
X: 5.0, Original Y: 5.0, Predicted Y: 4.5

Let's break down the key components of this code:

  1. The computePredictedY() function calculates predicted y values using the linear regression equation y = mx + b, where m is the slope and b is the intercept
  2. The slope (0.6) and intercept (1.5) are predefined for demonstration; they are not fitted to the data
  3. The program prints both original and predicted y values for comparison

Compute R² Using Explained/Total Variation

In this step, we'll extend our previous regression program to calculate the Coefficient of Determination (R²), which measures how well the regression model fits the data.
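
For reference, the quantity computed in this step can be written as follows, where ŷ_i = m·x_i + b is the predicted value for point i and ȳ is the mean of the actual y values. The labels SS_res and SS_tot are conventional names for the two sums, not identifiers from the lab code.

```latex
\[
R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}, \qquad
SS_{\text{res}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
SS_{\text{tot}} = \sum_{i=1}^{n} (y_i - \bar{y})^2
\]
```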

First, let's create a new C file for the R² calculation:

cd ~/project
nano r_squared_calculation.c

Enter the following comprehensive code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// Function to calculate mean of an array
double calculateMean(double *arr, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += arr[i];
    }
    return sum / n;
}

// Function to compute R-squared
double computeRSquared(double *x, double *y, int n, double slope, double intercept) {
    double total_variation = 0.0;     // SS_tot: spread of y around its mean
    double residual_variation = 0.0;  // SS_res: spread of y around the predictions

    // Calculate mean of actual y values
    double y_mean = calculateMean(y, n);

    // Compute variations
    for (int i = 0; i < n; i++) {
        // Predicted y value
        double predicted_y = slope * x[i] + intercept;

        // Total variation (squared distance of actual y from the mean)
        total_variation += pow(y[i] - y_mean, 2);

        // Residual (unexplained) variation (squared distance of actual y from predicted y)
        residual_variation += pow(y[i] - predicted_y, 2);
    }

    // R-squared = 1 - SS_res / SS_tot
    return 1 - (residual_variation / total_variation);
}

int main() {
    // Sample data points
    double x[] = {1.0, 2.0, 3.0, 4.0, 5.0};
    double y[] = {2.0, 4.0, 5.0, 4.0, 5.0};
    int n = sizeof(x) / sizeof(x[0]);

    // Predefined slope and intercept (for demonstration)
    double slope = 0.6;
    double intercept = 1.5;

    // Compute and print R-squared
    double r_squared = computeRSquared(x, y, n, slope, intercept);

    printf("Regression Analysis Results:\n");
    printf("Slope: %.2f\n", slope);
    printf("Intercept: %.2f\n", intercept);
    printf("R-squared (R²): %.4f\n", r_squared);

    return 0;
}

Compile the program:

gcc -o r_squared_calculation r_squared_calculation.c -lm

Run the program:

./r_squared_calculation

Example output:

Regression Analysis Results:
Slope: 0.60
Intercept: 1.50
R-squared (R²): 0.1917

Key components of R² calculation:

  1. calculateMean() computes the average of an array
  2. computeRSquared() calculates R² using the formula: 1 - (Residual Variation / Total Variation)
  3. Total Variation measures the spread of actual y values around their mean
  4. Residual Variation (the unexplained part) measures the squared distance between the actual y values and the predicted values
  5. For a least-squares line, R² ranges from 0 to 1, with higher values indicating a better model fit; for an arbitrary line such as the predefined one used here, R² can be low or even negative

Print R² Value

In this final step, we'll create a comprehensive program that reads data from a file, calculates the regression parameters, and prints the R² value with detailed interpretation.

First, create a sample data file:

cd ~/project
nano regression_data.txt

Add sample regression data:

1.0 2.0
2.0 4.0
3.0 5.0
4.0 4.0
5.0 5.0
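
As a hand check for these five points, the least-squares formulas for the slope m and intercept b (the same ones the program in this step implements) work out as follows:

```latex
\[
n = 5,\quad \sum x_i = 15,\quad \sum y_i = 20,\quad \sum x_i y_i = 66,\quad \sum x_i^2 = 55
\]
\[
m = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}
  = \frac{5 \cdot 66 - 15 \cdot 20}{5 \cdot 55 - 15^2} = \frac{30}{50} = 0.6
\]
\[
b = \frac{\sum y_i - m \sum x_i}{n} = \frac{20 - 0.6 \cdot 15}{5} = 2.2
\]
```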

Now, create the final R² calculation program:

nano r_squared_print.c

Enter the following code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// Function to calculate linear regression parameters
void calculateRegressionParameters(double *x, double *y, int n,
                                   double *slope, double *intercept) {
    double sum_x = 0, sum_y = 0, sum_xy = 0, sum_x_squared = 0;

    for (int i = 0; i < n; i++) {
        sum_x += x[i];
        sum_y += y[i];
        sum_xy += x[i] * y[i];
        sum_x_squared += x[i] * x[i];
    }

    // Calculate slope and intercept using least squares method
    *slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x_squared - sum_x * sum_x);
    *intercept = (sum_y - *slope * sum_x) / n;
}

// Function to compute R-squared
double computeRSquared(double *x, double *y, int n, double slope, double intercept) {
    double total_variation = 0.0;     // SS_tot
    double residual_variation = 0.0;  // SS_res
    double y_mean = 0.0;

    // Calculate mean of y
    for (int i = 0; i < n; i++) {
        y_mean += y[i];
    }
    y_mean /= n;

    // Compute variations
    for (int i = 0; i < n; i++) {
        double predicted_y = slope * x[i] + intercept;
        // Total variation: squared distance of actual y from the mean
        total_variation += pow(y[i] - y_mean, 2);
        // Residual (unexplained) variation: squared distance of actual y from predicted y
        residual_variation += pow(y[i] - predicted_y, 2);
    }

    // R-squared = 1 - SS_res / SS_tot
    return 1 - (residual_variation / total_variation);
}

// Function to interpret R-squared value
void interpretRSquared(double r_squared) {
    printf("\nR² Interpretation:\n");
    if (r_squared < 0.3) {
        printf("Weak model fit: The model explains less than 30%% of the variance.\n");
    } else if (r_squared < 0.5) {
        printf("Moderate model fit: The model explains 30-50%% of the variance.\n");
    } else if (r_squared < 0.7) {
        printf("Good model fit: The model explains 50-70%% of the variance.\n");
    } else {
        printf("Excellent model fit: The model explains over 70%% of the variance.\n");
    }
}

int main() {
    FILE *file;
    int n = 0, max_lines = 100;
    double x[100], y[100];
    double slope, intercept, r_squared;

    // Open the data file
    file = fopen("regression_data.txt", "r");
    if (file == NULL) {
        printf("Error opening file!\n");
        return 1;
    }

    // Read data from file
    while (fscanf(file, "%lf %lf", &x[n], &y[n]) == 2) {
        n++;
        if (n >= max_lines) break;
    }
    fclose(file);

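    // At least two data points are needed to fit a line
    if (n < 2) {
        printf("Not enough data points (need at least 2).\n");
        return 1;
    }

    // Calculate regression parameters
    calculateRegressionParameters(x, y, n, &slope, &intercept);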

    // Compute R-squared
    r_squared = computeRSquared(x, y, n, slope, intercept);

    // Print results
    printf("Regression Analysis Results:\n");
    printf("Number of Data Points: %d\n", n);
    printf("Slope: %.4f\n", slope);
    printf("Intercept: %.4f\n", intercept);
    printf("R-squared (R²): %.4f\n", r_squared);

    // Interpret R-squared
    interpretRSquared(r_squared);

    return 0;
}

Compile the program:

gcc -o r_squared_print r_squared_print.c -lm

Run the program:

./r_squared_print

Example output:

Regression Analysis Results:
Number of Data Points: 5
Slope: 0.6000
Intercept: 2.2000
R-squared (R²): 0.6000

R² Interpretation:
Good model fit: The model explains 50-70% of the variance.

Key points:

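  1. The program reads (x, y) pairs from an external file, up to 100 data points
  2. calculateRegressionParameters() fits the slope and intercept using the least squares method
  3. computeRSquared() computes R² for the fitted line
  4. interpretRSquared() prints a verbal interpretation; its thresholds (0.3, 0.5, 0.7) are rules of thumb chosen for this lab, not a statistical standard
  5. R² reports the fraction of the variance in y that the fitted line accounts for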

Summary

In this lab, we learned how to compute the Coefficient of Determination (R²) in C. We first computed predicted y values from a simple linear regression model with a given slope and intercept and printed them next to the original values. We then computed R² as 1 minus the ratio of residual variation to total variation. Finally, we read data points from a file, fitted the regression line with the least squares method, and printed the resulting R² value together with a short interpretation.
