Compute the Pearson Correlation Coefficient in C


Introduction

In this lab, we will learn how to compute the Pearson correlation coefficient in C. The lab proceeds in three steps: reading paired (x, y) data from the user, computing the sums that the correlation formula needs, and printing the resulting coefficient along with a short interpretation. Each step builds on the same C source file.



Read Paired (x,y) Data

In this step, we will learn how to read paired (x,y) data for calculating the Pearson correlation coefficient in C. We'll create a program that allows users to input paired numerical data and store it for further analysis.

First, let's create a C source file for our data input functionality:

cd ~/project
nano correlation_input.c

Now, add the following code to the file:

#include <stdio.h>
#define MAX_POINTS 100

int main() {
    double x[MAX_POINTS], y[MAX_POINTS];
    int n, i;

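    printf("Enter the number of data points (max %d): ", MAX_POINTS);
    // Reject non-numeric input and counts that would overflow the arrays
    if (scanf("%d", &n) != 1 || n < 1 || n > MAX_POINTS) {
        printf("Invalid number of data points.\n");
        return 1;
    }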

    printf("Enter x and y coordinates:\n");
    for (i = 0; i < n; i++) {
        printf("Point %d (x y): ", i + 1);
        scanf("%lf %lf", &x[i], &y[i]);
    }

    printf("\nData Points Entered:\n");
    for (i = 0; i < n; i++) {
        printf("Point %d: (%.2f, %.2f)\n", i + 1, x[i], y[i]);
    }

    return 0;
}

Compile the program:

gcc -o correlation_input correlation_input.c

Run the program and enter some sample data:

./correlation_input

Example output:

Enter the number of data points (max 100): 5
Enter x and y coordinates:
Point 1 (x y): 1 2
Point 2 (x y): 2 4
Point 3 (x y): 3 5
Point 4 (x y): 4 4
Point 5 (x y): 5 5

Data Points Entered:
Point 1: (1.00, 2.00)
Point 2: (2.00, 4.00)
Point 3: (3.00, 5.00)
Point 4: (4.00, 4.00)
Point 5: (5.00, 5.00)

Let's break down the code:

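  1. We define a maximum number of data points (MAX_POINTS), which fixes the size of the x and y arrays. Entering a larger count than this would write past the end of the arrays, so the count must stay within the limit.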
  2. The program prompts the user to enter the number of data points.
  3. It then allows the user to input x and y coordinates for each point.
  4. Finally, it prints out the entered data points to confirm input.
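
The input step above relies on scanf() succeeding. As a minimal sketch of how the same input could be validated, the helpers below parse a count and an "x y" pair from a line of text and report failure instead of leaving variables unset. The names parse_count and parse_point are hypothetical and not part of the lab's program; the sketch assumes each line has already been read into a string, for example with fgets().

```c
#include <stdio.h>

/* Hypothetical helper: parse the number of data points from a text line.
   Returns 1 if the line holds an integer in the range 1..max, else 0. */
int parse_count(const char *line, int max, int *n) {
    /* sscanf returns the number of fields converted; we need exactly one */
    return sscanf(line, "%d", n) == 1 && *n >= 1 && *n <= max;
}

/* Hypothetical helper: parse one "x y" pair from a text line.
   Returns 1 if both numbers were converted, else 0. */
int parse_point(const char *line, double *x, double *y) {
    return sscanf(line, "%lf %lf", x, y) == 2;
}
```

A caller could read each line with fgets(), pass it to the matching helper, and prompt again when the helper returns 0.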

Compute Sums and Use Formula for Correlation

In this step, we will extend our previous program to compute the necessary sums for calculating the Pearson correlation coefficient. We'll modify the correlation_input.c file to include calculations for the correlation formula.
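
For reference, the code in this step implements the computational (sum-based) form of the Pearson formula. In the notation below, which is ours rather than the lab's, n is the number of points and each sum runs over all n points:

```latex
r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}
         {\sqrt{\left(n \sum x_i^2 - \left(\sum x_i\right)^2\right)
                \left(n \sum y_i^2 - \left(\sum y_i\right)^2\right)}}
```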

Open the previous file:

cd ~/project
nano correlation_input.c

Update the code with the following implementation:

#include <stdio.h>
#include <math.h>
#define MAX_POINTS 100

double calculatePearsonCorrelation(double x[], double y[], int n) {
    double sum_x = 0, sum_y = 0, sum_xy = 0;
    double sum_x_squared = 0, sum_y_squared = 0;

    // Compute necessary sums
    for (int i = 0; i < n; i++) {
        sum_x += x[i];
        sum_y += y[i];
        sum_xy += x[i] * y[i];
        sum_x_squared += x[i] * x[i];
        sum_y_squared += y[i] * y[i];
    }

    // Pearson correlation coefficient formula
    double numerator = n * sum_xy - sum_x * sum_y;
    double denominator = sqrt((n * sum_x_squared - sum_x * sum_x) *
                               (n * sum_y_squared - sum_y * sum_y));

    return numerator / denominator;
}

int main() {
    double x[MAX_POINTS], y[MAX_POINTS];
    int n, i;

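    printf("Enter the number of data points (max %d): ", MAX_POINTS);
    // Need at least 2 points, and no more than the arrays can hold
    if (scanf("%d", &n) != 1 || n < 2 || n > MAX_POINTS) {
        printf("Invalid number of data points.\n");
        return 1;
    }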

    printf("Enter x and y coordinates:\n");
    for (i = 0; i < n; i++) {
        printf("Point %d (x y): ", i + 1);
        scanf("%lf %lf", &x[i], &y[i]);
    }

    double correlation = calculatePearsonCorrelation(x, y, n);

    printf("\nData Points Entered:\n");
    for (i = 0; i < n; i++) {
        printf("Point %d: (%.2f, %.2f)\n", i + 1, x[i], y[i]);
    }

    printf("\nPearson Correlation Coefficient: %.4f\n", correlation);

    return 0;
}

Compile the program, linking the math library with -lm (needed for sqrt()):

gcc -o correlation_input correlation_input.c -lm

Run the program with sample data:

./correlation_input

Example output:

Enter the number of data points (max 100): 5
Enter x and y coordinates:
Point 1 (x y): 1 2
Point 2 (x y): 2 4
Point 3 (x y): 3 5
Point 4 (x y): 4 4
Point 5 (x y): 5 5

Data Points Entered:
Point 1: (1.00, 2.00)
Point 2: (2.00, 4.00)
Point 3: (3.00, 5.00)
Point 4: (4.00, 4.00)
Point 5: (5.00, 5.00)

Pearson Correlation Coefficient: 0.7746
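
The sums for the five sample points can be worked out by hand, which gives a way to check the program's result:

```latex
\begin{aligned}
n &= 5,\quad \sum x = 15,\quad \sum y = 20,\quad \sum xy = 66,\quad
     \sum x^2 = 55,\quad \sum y^2 = 86 \\
r &= \frac{5 \cdot 66 - 15 \cdot 20}
          {\sqrt{(5 \cdot 55 - 15^2)(5 \cdot 86 - 20^2)}}
   = \frac{30}{\sqrt{50 \cdot 30}}
   = \frac{30}{\sqrt{1500}} \approx 0.7746
\end{aligned}
```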

Key points about the Pearson correlation calculation:

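  1. We compute the necessary sums: x, y, xy, x², y²
  2. We apply the Pearson correlation coefficient formula to those sums
  3. We use sqrt() from math.h for the denominator
  4. The function returns a coefficient between -1 and 1. The result is undefined (division by zero) when all x values or all y values are identical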

Print the Correlation Coefficient

In this final step, we'll extend the program to label the strength and direction of the Pearson correlation coefficient and to present the output more clearly.

Open the previous file:

cd ~/project
nano correlation_input.c

Update the code with the following implementation:

#include <stdio.h>
#include <math.h>
#define MAX_POINTS 100

double calculatePearsonCorrelation(double x[], double y[], int n) {
    double sum_x = 0, sum_y = 0, sum_xy = 0;
    double sum_x_squared = 0, sum_y_squared = 0;

    for (int i = 0; i < n; i++) {
        sum_x += x[i];
        sum_y += y[i];
        sum_xy += x[i] * y[i];
        sum_x_squared += x[i] * x[i];
        sum_y_squared += y[i] * y[i];
    }

    double numerator = n * sum_xy - sum_x * sum_y;
    double denominator = sqrt((n * sum_x_squared - sum_x * sum_x) *
                               (n * sum_y_squared - sum_y * sum_y));

    return numerator / denominator;
}

void interpretCorrelation(double correlation) {
    printf("\nCorrelation Coefficient Interpretation:\n");
    printf("Correlation Value: %.4f\n", correlation);

    if (correlation > 0.8) {
        printf("Strong Positive Correlation\n");
    } else if (correlation > 0.5) {
        printf("Moderate Positive Correlation\n");
    } else if (correlation > 0.3) {
        printf("Weak Positive Correlation\n");
    } else if (correlation > -0.3) {
        printf("No Linear Correlation\n");
    } else if (correlation > -0.5) {
        printf("Weak Negative Correlation\n");
    } else if (correlation > -0.8) {
        printf("Moderate Negative Correlation\n");
    } else {
        printf("Strong Negative Correlation\n");
    }
}

int main() {
    double x[MAX_POINTS], y[MAX_POINTS];
    int n, i;

    printf("Pearson Correlation Coefficient Calculator\n");
    printf("----------------------------------------\n");
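    printf("Enter the number of data points (max %d): ", MAX_POINTS);
    // Need at least 2 points, and no more than the arrays can hold
    if (scanf("%d", &n) != 1 || n < 2 || n > MAX_POINTS) {
        printf("Invalid number of data points.\n");
        return 1;
    }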

    printf("Enter x and y coordinates:\n");
    for (i = 0; i < n; i++) {
        printf("Point %d (x y): ", i + 1);
        scanf("%lf %lf", &x[i], &y[i]);
    }

    double correlation = calculatePearsonCorrelation(x, y, n);

    printf("\nData Points Entered:\n");
    for (i = 0; i < n; i++) {
        printf("Point %d: (%.2f, %.2f)\n", i + 1, x[i], y[i]);
    }

    interpretCorrelation(correlation);

    return 0;
}

Compile the program:

gcc -o correlation_calculator correlation_input.c -lm

Run the program with sample data:

./correlation_calculator

Example output:

Pearson Correlation Coefficient Calculator
----------------------------------------
Enter the number of data points (max 100): 5
Enter x and y coordinates:
Point 1 (x y): 1 2
Point 2 (x y): 2 4
Point 3 (x y): 3 5
Point 4 (x y): 4 4
Point 5 (x y): 5 5

Data Points Entered:
Point 1: (1.00, 2.00)
Point 2: (2.00, 4.00)
Point 3: (3.00, 5.00)
Point 4: (4.00, 4.00)
Point 5: (5.00, 5.00)

Correlation Coefficient Interpretation:
Correlation Value: 0.7746
Moderate Positive Correlation

Key improvements:

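  1. Added an interpretCorrelation() function
  2. It prints the coefficient followed by a one-line label for its strength and direction
  3. The label is chosen by comparing the coefficient against cut-offs of 0.3, 0.5, and 0.8, which are rules of thumb rather than a fixed standard
  4. Enhanced the user interface with a title and clearer output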
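
The categorization can also be written as a function that returns the label instead of printing it, which makes it easier to test. The sketch below is a variant, not the lab's code: correlation_label is a hypothetical name, it reuses the lab's 0.3, 0.5, and 0.8 cut-offs, and it applies them to the absolute value so that positive and negative coefficients are treated symmetrically (the original if/else chain gives a different label at exactly -0.3, -0.5, and -0.8). Returning "Undefined" for NaN is also an addition.

```c
#include <math.h>

/* Hypothetical helper: map a correlation coefficient to a text label. */
const char *correlation_label(double r) {
    /* NaN arises when the coefficient is undefined (e.g. constant x or y) */
    if (isnan(r)) {
        return "Undefined";
    }

    double magnitude = fabs(r);
    if (magnitude <= 0.3) {
        return "No Linear Correlation";
    }
    if (magnitude > 0.8) {
        return r > 0 ? "Strong Positive Correlation"
                     : "Strong Negative Correlation";
    }
    if (magnitude > 0.5) {
        return r > 0 ? "Moderate Positive Correlation"
                     : "Moderate Negative Correlation";
    }
    return r > 0 ? "Weak Positive Correlation"
                 : "Weak Negative Correlation";
}
```

A caller would print the result with something like printf("%s\n", correlation_label(correlation)).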

Summary

In this lab, we wrote a C program that reads paired (x, y) data, computes the sums required by the Pearson formula, and prints the correlation coefficient together with a label describing its strength and direction. By following the same three steps, you can implement the Pearson correlation calculation in your own C programs.
