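Implementing a Confusion Matrix in Python

Introduction

In this project, you will learn how to implement a confusion matrix, a basic tool for evaluating the performance of classification models. A confusion matrix breaks a model's predictions down by class, which helps you identify areas for improvement and understand the model's strengths and weaknesses.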

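🎯 Tasks

In this project, you will learn:

How to implement the confusion_matrix function to compute the confusion matrix for a classification problem
How to test and refine the confusion_matrix function so that it handles edge cases and becomes more robust
How to document the confusion_matrix function so that it is user-friendly and easy to understand
How to integrate the confusion_matrix function into a larger machine learning project and use it to evaluate classification models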

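🏆 Achievements

After completing this project, you will be able to:

Compute and interpret the confusion matrix for a classification problem
Apply techniques for handling edge cases and making a function more robust
Follow best practices for documenting code and making it user-friendly
Apply the confusion matrix in the context of a larger machine learning project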

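Implement the Confusion Matrix Function

In this step, you will implement the confusion_matrix function in the confusion_matrix.py file. This function computes the confusion matrix for a classification problem.

The confusion_matrix function takes three inputs:

labels: A list of labels representing the different classes.
preds: A list of predictions, where each prediction is a list of probabilities corresponding to the classes in the labels list.
ground_truth: A list of ground truth labels.

The function must return the confusion matrix as a two-dimensional list, where each inner list represents one row of the matrix.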
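To make the relationship between labels and preds concrete, here is a minimal sketch of how one probability list could be mapped to a predicted label. The helper name predicted_label and the probability values are illustrative assumptions and are not part of the project files.

```python
from typing import List


def predicted_label(labels: List[str], pred: List[float]) -> str:
    """Return the label with the highest probability (hypothetical helper)."""
    # Position i in pred is the probability of labels[i],
    # so the index of the largest value identifies the predicted class.
    best_index = max(range(len(pred)), key=lambda i: pred[i])
    return labels[best_index]


labels = ["Python", "Java", "C++"]
pred = [0.1, 0.7, 0.2]  # made-up probabilities for illustration only

print(predicted_label(labels, pred))  # Java
```

The sketch uses only the standard library; the starter code reaches the same index with np.argmax.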

Below is the starter code for the confusion_matrix function.

from typing import List

import numpy as np


def confusion_matrix(
    labels: List, preds: List[List[float]], ground_truth: List
) -> List[List[int]]:
    """
    Compute the confusion matrix for a classification problem.

    The function takes a list of labels, a list of predictions (each as a list of probabilities
    for each class), and a list of ground truth labels, and returns a confusion matrix.
    The confusion matrix is a square matrix where entry (i, j) is the number of times class i
    was predicted when the true class was j.

    Parameters:
    labels (List): A list of labels representing the different classes.
    preds (List[List[float]]): A list of predictions where each prediction is a list of
                               probabilities corresponding to the classes in the labels list.
    ground_truth (List): A list of ground truth labels.

    Returns:
    List[List[int]]: The confusion matrix represented as a list of lists where each list
                     represents a row in the matrix.
    """
    ## This creates a square matrix with dimensions equal to the number of classes, initializing all elements to zero. Each row and column corresponds to a class label.
    matrix = [[0 for _ in range(len(labels))] for _ in range(len(labels))]

    ## This loop pairs each prediction with its corresponding ground truth label and processes them one by one.
    for pred, truth in zip(preds, ground_truth):
        ## Uses NumPy to find the index of the highest probability in the prediction list, which corresponds to the predicted class.
        pred_index = np.argmax(pred)
        ## Finds the index of the true class label in the `labels` list.
        truth_index = labels.index(truth)
        ## This line increments the cell at the intersection of the predicted class row and the true class column in the confusion matrix, effectively counting the occurrence of this specific prediction-truth pair.
        matrix[pred_index][truth_index] += 1

    ## After processing all predictions, the function returns the computed confusion matrix.
    return matrix

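The confusion_matrix function implements the logic for computing the confusion matrix of a classification problem: for each prediction, it takes the class with the highest probability as the predicted class and increments the cell whose row is the predicted class and whose column is the true class.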
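The starter code does not check its inputs. For example, zip silently stops at the shorter of preds and ground_truth, and labels.index raises a bare ValueError when a ground truth label is missing from labels. As one possible way to approach the edge-case handling mentioned in the tasks, the following sketch validates the inputs up front. The function name validate_inputs and the choice of ValueError are assumptions made for illustration, and the uniqueness check assumes the labels are hashable.

```python
from typing import List


def validate_inputs(
    labels: List, preds: List[List[float]], ground_truth: List
) -> None:
    """Raise ValueError if the inputs cannot form a valid confusion matrix.

    Hypothetical helper for illustration; not part of the starter code.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    # Duplicate labels would make labels.index() ambiguous.
    if len(set(labels)) != len(labels):
        raise ValueError("labels must be unique")
    # zip() would otherwise silently drop the unmatched tail.
    if len(preds) != len(ground_truth):
        raise ValueError(
            f"got {len(preds)} predictions but {len(ground_truth)} ground truth labels"
        )
    for i, (pred, truth) in enumerate(zip(preds, ground_truth)):
        # Each prediction needs exactly one probability per class.
        if len(pred) != len(labels):
            raise ValueError(
                f"prediction {i} has {len(pred)} values, expected {len(labels)}"
            )
        if truth not in labels:
            raise ValueError(f"ground truth {truth!r} at position {i} is not in labels")
```

Calling validate_inputs(labels, preds, ground_truth) as the first statement of confusion_matrix would surface these problems with a descriptive message before any counting starts.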

Test the Confusion Matrix Function

In this step, you will test the confusion_matrix function with the provided example.

Add the following code to the confusion_matrix.py file.

if __name__ == "__main__":
    labels = ["Python", "Java", "C++"]
    preds = [
        [0.66528198, 0.21971853, 0.11499949],
        [0.34275858, 0.05847305, 0.59876836],
        [0.47650585, 0.26353373, 0.25996042],
        [0.76153846, 0.15384615, 0.08461538],
        [0.04691943, 0.9478673, 0.00521327],
    ]
    ground_truth = ["Python", "C++", "Java", "C++", "Java"]
    matrix = confusion_matrix(labels, preds, ground_truth)
    print(matrix)

Run the confusion_matrix.py file to execute the example.
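Tracing the example by hand, the five predictions resolve to Python, C++, Python, Python, and Java, so with rows as predicted classes and columns as true classes the result should be [[1, 1, 1], [0, 1, 0], [0, 0, 1]]. As a rough sketch of how such a matrix can be interpreted, the code below derives accuracy and per-class precision and recall from it. The function names accuracy and precision_recall are hypothetical, and returning 0.0 for an empty row or column is an assumed convention.

```python
from typing import List, Tuple


def accuracy(matrix: List[List[int]]) -> float:
    """Fraction of samples on the diagonal (hypothetical helper)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def precision_recall(matrix: List[List[int]], i: int) -> Tuple[float, float]:
    """Precision and recall for class i, with rows = predicted, columns = true."""
    predicted_as_i = sum(matrix[i])              # row sum: everything predicted as i
    actually_i = sum(row[i] for row in matrix)   # column sum: everything truly i
    precision = matrix[i][i] / predicted_as_i if predicted_as_i else 0.0
    recall = matrix[i][i] / actually_i if actually_i else 0.0
    return precision, recall


# Matrix obtained by tracing the example above by hand.
matrix = [[1, 1, 1], [0, 1, 0], [0, 0, 1]]
labels = ["Python", "Java", "C++"]

print(accuracy(matrix))  # 0.6
for i, label in enumerate(labels):
    print(label, precision_recall(matrix, i))
```

In this example the model predicts Python three times but is right only once, so Python has low precision, while Java and C++ are each recognized only half of the time they occur.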