# Introduction

An imbalanced dataset is one in which the classes contain very different numbers of samples. For example, in a binary classification problem with 999 positive samples and only 1 negative sample in the training set, the model pays almost no penalty for always predicting the positive class, so its training accuracy can reach 99.9%. Once deployed, however, the model turns out to be unable to predict the negative class at all: as the saying goes, "mighty offline, stupid online."

This situation is common in the medical field, for example in binary classification for a certain type of cancer. The training set consists of the features of all patients in a hospital, together with a label indicating whether each patient has that cancer. Because the local incidence rate of the cancer is very low, the dataset is imbalanced, and the model tends to predict that a patient does not have the cancer. When the same model is applied in another region with a high incidence rate, the accuracy of the model's predictions will be very low.

Common ways to deal with imbalanced datasets include:

1. Designing loss functions that penalize the model more heavily when it misclassifies minority-class samples, such as the commonly used Focal Loss.
2. Upsampling the minority-class samples to increase their frequency within a batch of training data.
3. Downsampling the majority-class samples to reduce their frequency within a batch of training data.

In this challenge, you will work with such an imbalanced dataset. The task is to implement upsampling and downsampling so that the class distribution within a batch is as balanced as possible. This helps the model predict minority-class samples accurately and improves its overall accuracy. The sketches below illustrate the loss-based approach (1) and the sampling-based approach (2 and 3).
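As a concrete illustration of approach 1, here is a minimal PyTorch sketch of binary Focal Loss. The function name `binary_focal_loss` is our own, and the defaults `alpha=0.25` and `gamma=2.0` are the common choices from the original Focal Loss paper, not values the challenge prescribes.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal Loss for binary classification: down-weights easy,
    well-classified samples so training focuses on hard ones,
    which are often the minority class."""
    # Per-sample binary cross-entropy, left unreduced.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # p_t: the predicted probability of the true class for each sample.
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)
    # alpha_t re-balances the two classes; (1 - p_t)^gamma is the focusing term.
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# Usage: raw logits from the model, float targets in {0.0, 1.0}.
logits = torch.tensor([2.0, -1.5, 0.3])
targets = torch.tensor([1.0, 0.0, 1.0])
print(binary_focal_loss(logits, targets))
```

With `gamma = 0` and `alpha = 0.5` this reduces to (half of) ordinary binary cross-entropy; larger `gamma` shrinks the loss on confident predictions faster.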
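For approaches 2 and 3, here is a minimal NumPy sketch of drawing a class-balanced batch: classes with too few samples are drawn with replacement (upsampling), while classes with plenty of samples contribute only a subset (downsampling). `balanced_batch_indices` is a hypothetical helper written for this illustration, not the API the challenge requires.

```python
import numpy as np

rng = np.random.default_rng(0)

def balanced_batch_indices(labels, batch_size):
    """Return sample indices for one batch in which every class
    appears (roughly) equally often."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    per_class = batch_size // len(classes)
    idx = []
    for c in classes:
        pool = np.flatnonzero(labels == c)
        # Drawing with replacement is needed only when the class
        # has fewer samples than its quota (i.e. it gets upsampled).
        replace = len(pool) < per_class
        idx.append(rng.choice(pool, size=per_class, replace=replace))
    # Shuffle so the batch is not ordered by class.
    return rng.permutation(np.concatenate(idx))

# Example: the 999-vs-1 dataset from above (label 1 positive, 0 negative).
labels = np.array([1] * 999 + [0])
batch = balanced_batch_indices(labels, batch_size=32)
print(np.bincount(labels[batch]))  # roughly equal counts per class
```

Note that the single negative sample is repeated many times within the batch; upsampling balances the gradient signal but adds no new information, which is why it is often combined with data augmentation or a loss such as Focal Loss.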