Mastering Pandas DataFrame Duplicated Method

Introduction

In this lab, we will learn about the duplicated() method in the Pandas library for Python. The duplicated() method is used to find duplicate rows in a DataFrame.

Import the necessary libraries

First, we need to import the pandas library as pd.

import pandas as pd

Create a DataFrame

Next, let's create a DataFrame to work with. The following example DataFrame deliberately contains repeated rows:

df = pd.DataFrame({'Name': ['Navya', 'Vindya', 'Navya', 'Vindya', 'Sinchana', 'Sinchana'],
                   'Skills': ['Python', 'Java', 'Python', 'Java', 'Java', 'Java']})

Find duplicated rows

To find duplicated rows in the DataFrame, we can use the duplicated() method. By default, it considers all columns for identifying duplicates. It returns a boolean Series, indexed like the DataFrame, where True marks a row that repeats an earlier row and False marks every other row, including the first occurrence of a repeated row.

duplicates = df.duplicated()
print(duplicates)
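
The boolean Series is most useful as a mask. The following is a minimal sketch, assuming the example DataFrame from this lab; the variable names mask, duplicate_rows, n_duplicates, and unique_rows are illustrative only.

```python
import pandas as pd

# Same example DataFrame as in the lab.
df = pd.DataFrame({'Name': ['Navya', 'Vindya', 'Navya', 'Vindya', 'Sinchana', 'Sinchana'],
                   'Skills': ['Python', 'Java', 'Python', 'Java', 'Java', 'Java']})

# True for every row that repeats an earlier row (default keep='first').
mask = df.duplicated()

# Boolean indexing selects only the rows flagged as duplicates.
duplicate_rows = df[mask]

# True counts as 1, so summing the mask counts the flagged rows.
n_duplicates = int(mask.sum())

# Inverting the mask keeps the first occurrence of each row.
unique_rows = df[~mask]

print(duplicate_rows)
print("Number of duplicate rows:", n_duplicates)
print(unique_rows)
```

With this data, rows 2, 3, and 5 are flagged, so n_duplicates is 3 and unique_rows holds rows 0, 1, and 4.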

Specify columns for identifying duplicates

If we want to consider only certain columns for identifying duplicates, we can pass the column label(s) to the subset parameter of the duplicated() method.

duplicates_subset = df.duplicated(subset=['Skills'])
print(duplicates_subset)
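
To see how the choice of subset changes the result, the sketch below compares two subsets on the lab's example DataFrame. The names by_skills and by_name are illustrative only; subset accepts either a single column label or a list of labels.

```python
import pandas as pd

# Same example DataFrame as in the lab.
df = pd.DataFrame({'Name': ['Navya', 'Vindya', 'Navya', 'Vindya', 'Sinchana', 'Sinchana'],
                   'Skills': ['Python', 'Java', 'Python', 'Java', 'Java', 'Java']})

# Compare rows on the 'Skills' column only: every repeated skill is flagged.
by_skills = df.duplicated(subset=['Skills'])

# Compare rows on the 'Name' column only (a single label also works).
by_name = df.duplicated(subset='Name')

print(by_skills.tolist())  # [False, False, True, True, True, True]
print(by_name.tolist())    # [False, False, True, True, False, True]
```

Row 4 ('Sinchana', 'Java') shows the difference: it is a duplicate when only Skills is compared, because 'Java' already appears in row 1, but not when only Name is compared.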

Specify duplicate marking

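The keep parameter of the duplicated() method determines which occurrences are marked as duplicates. By default, it is set to 'first', which marks all duplicates as True except for the first occurrence. Setting it to 'last' marks all duplicates as True except for the last occurrence, and setting it to False marks every row that has a duplicate as True, including the first and last occurrences.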

duplicates_keep_last = df.duplicated(keep='last')
print(duplicates_keep_last)

duplicates_keep_false = df.duplicated(keep=False)
print(duplicates_keep_false)
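
The three keep options are easier to compare side by side. This is a small sketch on the lab's example DataFrame; the comparison DataFrame and its column names are assumptions made for illustration.

```python
import pandas as pd

# Same example DataFrame as in the lab.
df = pd.DataFrame({'Name': ['Navya', 'Vindya', 'Navya', 'Vindya', 'Sinchana', 'Sinchana'],
                   'Skills': ['Python', 'Java', 'Python', 'Java', 'Java', 'Java']})

# One column per keep option, aligned on the original row index.
comparison = pd.DataFrame({
    'first': df.duplicated(keep='first'),  # first occurrence stays False
    'last': df.duplicated(keep='last'),    # last occurrence stays False
    'none': df.duplicated(keep=False),     # every repeated row is True
})

print(pd.concat([df, comparison], axis=1))
```

Every row in this example has at least one copy, so the keep=False column is True for all six rows, while 'first' and 'last' each leave one occurrence per group unmarked.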

Summary

In this lab, we learned how to use the duplicated() method in Pandas to find duplicate rows in a DataFrame. We saw how to identify duplicates based on certain columns, specify duplicate marking, and obtain a boolean Series representing duplicate rows. The duplicated() method is a useful tool for data cleaning and identifying duplicated data.