Remove Duplicates from a Spreadsheet Based on All or a Subset of Columns | Google Sheets/Excel | Python/Colab | Data Solution for Beginners

Efficient and ready-to-execute, for slow-loading sheets in Python / Google Colab

Goal of this tutorial: efficient and simple removal of duplicates from spreadsheet data!

In this tutorial and video, I show you how to remove duplicates like the ones in this data, in a fast and actionable way. This is especially useful for bigger data, where loading the Google Sheet or Excel file is slow.

A video tutorial on this topic, complementing this blog post, is available here:

Step1: Initial Settings

This first block loads the needed libraries and connects the code to Google Sheets, where the input data lives in this example.

For reading data from Google Sheets, refer to this tutorial: https://winswithdata.com/?p=12
For reading data from Google Drive, refer to this tutorial: https://winswithdata.com/?p=1

Connect this Colab code to your Google Drive
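As a minimal sketch of what this first block can look like, assuming the sheet is read with the `gspread` library from inside Colab: the function names `rows_to_dataframe` and `read_sheet_in_colab`, and the `sheet_name` argument, are placeholders I made up for illustration, not names from the original notebook. The Google imports sit inside the function because they only work in a Colab session.

```python
import pandas as pd


def rows_to_dataframe(rows):
    """Turn a list of rows (first row is the header) into a DataFrame."""
    header, *body = rows
    return pd.DataFrame(body, columns=header)


def read_sheet_in_colab(sheet_name, worksheet_index=0):
    """Authenticate inside Google Colab and read one worksheet.

    sheet_name is a placeholder for the title of your own Google Sheet.
    """
    # Local imports: these packages are only usable inside Colab.
    from google.colab import auth
    from google.auth import default
    import gspread

    auth.authenticate_user()  # opens the Google sign-in prompt
    creds, _ = default()
    client = gspread.authorize(creds)
    worksheet = client.open(sheet_name).get_worksheet(worksheet_index)
    # get_all_values() returns every cell as a string, row by row.
    return rows_to_dataframe(worksheet.get_all_values())
```

In Colab you would then call something like `df = read_sheet_in_colab("my sheet title")`. Note that every value arrives as text, so numeric columns may need converting afterwards.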

Let's take a look at the data and some examples of duplication in it.

Example of duplicated rows in the data (this is a fake dataset)
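If you want to spot the duplicated rows in code before removing anything, here is a small sketch using pandas. The tiny DataFrame and its column names are invented for illustration and are not the dataset from the video.

```python
import pandas as pd

# Fake data; the column names are assumptions for illustration only.
df = pd.DataFrame({
    "first_name": ["Ann", "Bob", "Ann", "Cara"],
    "last_name": ["Lee", "Ray", "Lee", "Diaz"],
    "id": [1, 2, 1, 3],
})

# keep=False flags every row that belongs to a duplicated group,
# so you see the original and its repeats side by side.
dup_mask = df.duplicated(keep=False)
print(df[dup_mask])

# With the default keep="first", only the later repeats are flagged.
n_repeats = int(df.duplicated().sum())
print("rows that are repeats:", n_repeats)
```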

Filter or slice the data if needed

Step2: If needed, filter or slice the data to focus on the sample you want. The linked video tutorial provides more information on this.

Refer to this tutorial for more explanation on filtering data: https://winswithdata.com/?p=34

Define a condition to select a slice or filter a subset of the data if needed
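A filter condition of this kind can be sketched with a boolean mask in pandas. The columns `city` and `year` and the values used here are made up for the example; swap in whatever condition fits your own sheet.

```python
import pandas as pd

# Fake data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "first_name": ["Ann", "Bob", "Ann", "Cara"],
    "city": ["Oslo", "Rome", "Oslo", "Rome"],
    "year": [2020, 2021, 2020, 2022],
})

# Combine conditions with & (and) or | (or); each part needs parentheses.
condition = (df["city"] == "Oslo") & (df["year"] >= 2020)

# .copy() gives an independent DataFrame to work on in the next steps.
df_slice = df[condition].copy()
print(df_slice)
```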

Removing Duplicates

Step3: Remove duplicates, scenario 1: rows that are duplicated across all columns.
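A minimal sketch of this scenario, assuming the data is already in a pandas DataFrame (the sample rows and column names are invented): calling `drop_duplicates()` with no arguments compares all columns and keeps the first copy of each fully identical row.

```python
import pandas as pd

# Fake data; the column names are assumptions for illustration only.
df = pd.DataFrame({
    "first_name": ["Ann", "Bob", "Ann", "Ann"],
    "last_name": ["Lee", "Ray", "Lee", "Lee"],
    "id": [1, 2, 1, 1],
    "city": ["Oslo", "Rome", "Oslo", "Bergen"],
})

# Row 2 repeats row 0 in every column, so it is dropped.
# Row 3 differs in "city", so it is kept in this scenario.
df_unique = df.drop_duplicates().reset_index(drop=True)
print(df_unique)
```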

Remove duplicates based on multiple selected data columns in Python

Step4: Remove duplicates based on a subset of columns rather than all of them. Let's first list all the columns so we can choose.

In this example, if rows repeat the same first name, last name, and id, the repeated rows are deleted.
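Here is a short sketch of the subset scenario with pandas. The column names `first_name`, `last_name` and `id` are my guess at how such columns might be labelled, and `keep="first"` (keep the first occurrence) is an assumption about which copy you want to retain.

```python
import pandas as pd

# Fake data; the column names are assumptions for illustration only.
df = pd.DataFrame({
    "first_name": ["Ann", "Bob", "Ann", "Ann"],
    "last_name": ["Lee", "Ray", "Lee", "Lee"],
    "id": [1, 2, 1, 1],
    "city": ["Oslo", "Rome", "Oslo", "Bergen"],
})

# Look at all available columns before choosing the key columns.
print(df.columns.tolist())

# Rows count as duplicates when they match on these columns only.
key_columns = ["first_name", "last_name", "id"]
df_unique = df.drop_duplicates(subset=key_columns, keep="first")
df_unique = df_unique.reset_index(drop=True)
print(df_unique)
```

With `keep="last"` the final occurrence would be kept instead, and `keep=False` would drop every row that has a duplicate.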



Step5: Last step, saving the data in Google Sheets!
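One possible way to write the cleaned data back, sketched under the assumption that you use `gspread` inside Colab as in Step1. The function names `dataframe_to_rows` and `save_to_sheet_in_colab` and the `sheet_name` argument are placeholders of mine, not part of the original notebook.

```python
import pandas as pd


def dataframe_to_rows(df):
    """Header row followed by data rows, missing values as empty strings."""
    cleaned = df.astype(object).where(df.notna(), "")
    return [df.columns.tolist()] + cleaned.values.tolist()


def save_to_sheet_in_colab(df, sheet_name, worksheet_index=0):
    """Overwrite one worksheet with the contents of df (Colab only).

    sheet_name is a placeholder for the title of your own Google Sheet.
    """
    # Local imports: these packages are only usable inside Colab.
    from google.colab import auth
    from google.auth import default
    import gspread

    auth.authenticate_user()
    creds, _ = default()
    client = gspread.authorize(creds)
    worksheet = client.open(sheet_name).get_worksheet(worksheet_index)
    worksheet.clear()  # remove old contents so no stale rows are left behind
    # Keyword arguments are used because, as far as I know, the positional
    # order of update() differs between gspread versions.
    worksheet.update(range_name="A1", values=dataframe_to_rows(df))
```

Since this overwrites the worksheet, you may prefer to point it at a separate output sheet and keep the original data untouched.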
