Articles – TruEducate Learning Academy

Introduction to Python Programming Language

May 21, 2023 by vinai@brandrich.com

What is Python Programming Language?

Python is a high-level, interpreted, and general-purpose programming language. It was designed to emphasize code readability and simplicity, allowing developers to express concepts in fewer lines of code compared to other programming languages. Python’s design philosophy emphasizes code readability and a clean syntax, making it easy to understand and write.

Here are some key features of Python:

Readability: Python’s syntax is designed to be clear and readable, using indentation and whitespace to delimit code blocks instead of using braces like many other programming languages. This feature contributes to the overall readability of Python code.

Interpreted: Python is an interpreted language, which means that code is executed line by line, without the need for explicit compilation. This allows for rapid development and easier debugging since errors are often reported as they occur.

Dynamically Typed: Python is dynamically typed, meaning that variable types are determined automatically during runtime. Developers do not need to explicitly declare variable types, making Python code more flexible and reducing the amount of code needed.

Cross-platform: Python is available on multiple platforms, including Windows, macOS, and various Unix-based systems. This allows developers to write code once and run it on different platforms without modification.

Vast Standard Library: Python comes with an extensive standard library that provides a wide range of modules and functions for tasks such as file I/O, networking, regular expressions, and more. This allows developers to accomplish many common programming tasks without having to rely on external libraries.

Rich Ecosystem: Python has a vibrant ecosystem with a vast collection of third-party libraries and frameworks. These libraries expand the functionality of Python and enable developers to tackle specialized tasks such as web development, data analysis, machine learning, and more.

Python’s versatility and ease of use have contributed to its widespread adoption across various domains, from web development and scientific computing to machine learning and automation. It is considered a beginner-friendly language, making it an excellent choice for those new to programming, while also being powerful enough for experienced developers to build complex applications.

What is the Python programming language used for?

Python is a versatile and widely used programming language that serves multiple purposes. It was created by Guido van Rossum and first released in 1991. Here are some common uses of Python:

Web Development: Python offers various frameworks like Django and Flask, which enable developers to build scalable and robust web applications. It provides tools for handling tasks like URL routing, database interaction, and template rendering.

Data Analysis and Scientific Computing: Python has become one of the leading languages in the field of data analysis and scientific computing. Libraries such as NumPy, Pandas, and SciPy provide powerful tools for numerical computing, data manipulation, and statistical analysis. Additionally, Python’s visualization libraries like Matplotlib and Seaborn help in creating appealing and informative visualizations.

Machine Learning and Artificial Intelligence: Python has gained significant popularity in the field of machine learning and AI due to its extensive libraries and frameworks. Libraries like TensorFlow, Keras, and PyTorch provide the necessary tools for developing and training machine learning models. Python’s simplicity and readability make it an excellent choice for experimenting and prototyping in the AI domain.

Scripting and Automation: Python is often used for scripting and automation tasks, thanks to its clear syntax and ease of use. It can automate repetitive tasks, file handling, and system administration, making it useful in areas such as system scripting, network programming, and web scraping.

Desktop GUI Applications: Python offers frameworks like PyQt and Tkinter that allow developers to create cross-platform desktop applications with graphical user interfaces (GUI). These frameworks provide tools for creating windows, buttons, menus, and other GUI elements.

Game Development: Python, along with libraries such as Pygame, provides a platform for creating simple to moderately complex games. Pygame offers functionalities for graphics, sound, and user input, making it a suitable choice for developing 2D games.

Prototyping and Rapid Development: Python’s simplicity, readability, and vast collection of libraries make it ideal for prototyping and rapid development. It allows developers to quickly test ideas, iterate on them, and develop proof-of-concept applications efficiently.

These are just a few examples of what Python can be used for. Its versatility, ease of use, and extensive ecosystem of libraries have made it one of the most popular programming languages across various domains.

Is Python easy to learn or hard?

Python is often regarded as an easy programming language to learn, especially for beginners. There are several reasons why Python is considered beginner-friendly:

Readability: Python’s syntax is designed to be clear and readable, resembling English-like statements. The use of indentation instead of braces for code blocks and the avoidance of complex symbols make Python code easier to understand.

Simple and Concise Syntax: Python emphasizes simplicity and reduces the amount of code needed to express concepts. This reduces the cognitive load on beginners and allows them to focus on understanding programming concepts rather than dealing with complex syntax.

Large Community and Learning Resources: Python has a large and active community, which means there are numerous learning resources available, including tutorials, documentation, forums, and online communities. This makes it easier for beginners to find help and guidance when they encounter difficulties.

Versatility: Python can be used for various purposes, such as web development, data analysis, machine learning, and more. This versatility allows beginners to explore different domains and find their areas of interest.

Abundance of Libraries and Frameworks: Python has a vast ecosystem of libraries and frameworks that provide pre-built solutions for common tasks. Beginners can leverage these resources to accelerate their learning process and build functional applications without starting from scratch.

However, it’s worth noting that while Python is considered easy to learn, mastering any programming language requires practice, effort, and a solid understanding of fundamental programming concepts. As you progress in your programming journey, you may encounter more advanced topics and challenges that require deeper knowledge and experience.

Ultimately, Python’s simplicity, readability, and extensive learning resources make it an excellent choice for beginners, allowing them to grasp programming concepts quickly and start building useful applications.

Is Python language similar to C++?

Python and C++ are both programming languages, but they have distinct differences in terms of syntax, design philosophy, and use cases. Here are some points of comparison between Python and C++:

Syntax: Python and C++ have significantly different syntaxes. Python uses whitespace and indentation to delimit code blocks, while C++ uses braces ({}) for this purpose. Python’s syntax aims to prioritize readability and simplicity, while C++ syntax allows for more low-level control and explicitness.

Design Philosophy: Python focuses on code readability and simplicity. It emphasizes writing clean and concise code. C++, on the other hand, prioritizes performance and efficiency. It allows for manual memory management and provides more control over hardware resources.

Typing: Python is dynamically typed, meaning that variable types are determined at runtime. In contrast, C++ is statically typed, requiring variable types to be declared explicitly during compilation. Static typing in C++ allows for more efficient code execution but may require more effort from the programmer.

Memory Management: Python features automatic memory management through garbage collection, relieving developers from manually managing memory allocation and deallocation. In C++, memory management is explicit and requires manual allocation and deallocation of memory through concepts like pointers and the “new” and “delete” operators. This manual memory management can lead to more efficient memory usage but also introduces potential risks if not handled correctly.

Performance: C++ is generally considered to have better performance than Python. C++ programs can be compiled to machine code, enabling them to run more efficiently. Python, being an interpreted language, may have performance limitations in certain scenarios. However, Python offers various tools and libraries that can leverage compiled code to boost performance in critical sections.

Use Cases: Python is often used for web development, data analysis, scientific computing, machine learning, and automation tasks. It excels in rapid prototyping and scripting. C++ is commonly used in systems programming, game development, embedded systems, high-performance computing, and other areas that require fine-grained control over hardware resources.

While Python and C++ have their own strengths and areas of application, it is worth noting that they are not mutually exclusive. It is common to see projects that combine both languages, utilizing Python for higher-level tasks and C++ for performance-critical components or library bindings. Understanding the differences between Python and C++ can help you choose the most appropriate language for a given project.

Why is Python so popular?

Python has gained immense popularity for a variety of reasons. Here are some key factors contributing to Python’s popularity:

Readability and Simplicity: Python’s syntax is designed to be highly readable and easy to understand. Its clear and concise code structure, combined with the use of whitespace and indentation for code blocks, makes Python code more approachable and less daunting for beginners and experienced programmers alike. This focus on simplicity and readability reduces the barrier to entry and enhances the productivity of developers.

Versatility and Flexibility: Python is a versatile language that can be used across various domains and applications. Whether it’s web development, data analysis, scientific computing, machine learning, artificial intelligence, automation, or scripting, Python provides extensive libraries and frameworks that cater to different needs. Its ability to integrate with other languages and platforms also contributes to its versatility.

Large and Active Community: Python boasts a vibrant and supportive community of developers, educators, and enthusiasts. This community contributes to the growth and development of the language by creating libraries, frameworks, and tools, as well as providing extensive documentation and resources. The active community ensures that Python remains up to date, provides solutions to common problems, and offers support to newcomers.

Abundance of Libraries and Frameworks: Python has a vast ecosystem of libraries and frameworks that cover a wide range of functionalities and domains. These libraries, such as NumPy, Pandas, TensorFlow, Django, and Flask, among many others, provide ready-to-use solutions for common tasks, allowing developers to build applications more efficiently. The availability of these resources saves time and effort in development and contributes to Python’s popularity.

Easy Integration and Interoperability: Python can seamlessly integrate with other languages and platforms, making it a convenient choice for software development. It supports interoperability with languages like C, C++, Java, and .NET, allowing developers to leverage existing code and libraries. Python’s ability to interface with different systems and technologies makes it a preferred language for building complex applications and integrating with existing software ecosystems.

Rapid Prototyping and Development: Python’s simplicity, expressiveness, and large standard library enable developers to prototype ideas and develop applications quickly. The language allows developers to focus on the core logic of their applications rather than getting bogged down by intricate syntax or low-level details. This rapid development capability makes Python an attractive choice for startups, small projects, and agile development environments.

Education and Learning: Python is widely adopted in educational settings, including schools, universities, and online learning platforms. Its simplicity and readability make it an excellent language for teaching programming concepts, making it accessible to beginners. Python’s popularity in education translates into a large pool of developers who are well-versed in the language and contribute to its growth and adoption.

These factors, along with Python’s continuous evolution and support from the Python Software Foundation (PSF), have propelled its popularity and widespread adoption across diverse industries and domains.

Where is Python used in real life?

Python is widely used in various industries and applications in real life. Here are some common areas where Python is used:

Web Development: Python is frequently used for web development, powering popular frameworks like Django and Flask. These frameworks enable developers to build robust, scalable, and feature-rich web applications.

Data Analysis and Science: Python has become the language of choice for data analysis and scientific computing. Libraries such as NumPy, Pandas, and SciPy provide powerful tools for data manipulation, analysis, and visualization. Additionally, Jupyter Notebooks, an interactive coding environment, is commonly used for data exploration and sharing.

Machine Learning and Artificial Intelligence: Python has a strong presence in machine learning and AI. Libraries like TensorFlow, Keras, and PyTorch offer powerful capabilities for building and training machine learning models. Python’s simplicity and rich ecosystem make it a popular choice for researchers and developers in this field.

Automation and Scripting: Python’s easy-to-read syntax and versatility make it an excellent choice for automation tasks and scripting. From automating repetitive tasks to creating complex workflows, Python can be used to streamline processes and improve efficiency.

Scientific Research: Python is widely used in scientific research across various domains, including physics, biology, astronomy, and more. It provides researchers with tools for data analysis, simulations, modeling, and visualization.

Finance and Trading: Python is extensively used in the finance industry for tasks like algorithmic trading, risk management, and data analysis. Libraries like Pandas and NumPy, combined with tools such as Jupyter Notebooks, enable analysts and traders to process large amounts of financial data efficiently.

Game Development: Python is employed in game development, both for indie games and large-scale productions. Libraries such as Pygame offer functionality for creating 2D games, and engines like Unity support Python scripting for game logic.

Web Scraping and Automation: Python’s versatility and libraries like Beautiful Soup and Scrapy make it an excellent choice for web scraping tasks. It enables developers to extract data from websites and automate repetitive data collection processes.

System Administration: Python is used for system administration tasks, such as writing scripts for managing servers, automating deployment processes, and configuring network devices.

Education: Python is widely adopted as an introductory programming language in educational settings due to its simplicity and readability. It is used to teach programming concepts and fundamentals to beginners.

These are just a few examples of Python’s applications, but its versatility allows it to be used in various other fields and domains as well.

How to Learn Python Programming?

Learning Python programming can be an exciting and rewarding journey. Here’s a step-by-step guide to help you get started:

Set Clear Goals: Determine why you want to learn Python. Understanding your goals will help you stay focused and motivated throughout the learning process.

Learn the Basics: Start with the fundamentals of Python, including variables, data types, operators, control flow statements (if-else, loops), functions, and input/output. Online tutorials, interactive coding platforms, or beginner-friendly books can be great resources for learning the basics.

Practice with Exercises: Reinforce your understanding of Python concepts by solving coding exercises. Websites like LeetCode, HackerRank, and Codecademy offer coding challenges and exercises tailored for Python learners.

Dive into Python Libraries: Explore popular Python libraries such as NumPy, Pandas, and Matplotlib for data analysis and visualization. Read documentation, follow tutorials, and work on small projects to gain hands-on experience with these libraries.

Build Projects: Projects provide practical experience and help solidify your knowledge. Start with small projects, such as building a calculator or a simple text-based game. As you progress, challenge yourself with more complex projects aligned with your interests.

Join a Community: Engage with the Python community to connect with fellow learners and experienced developers. Participate in forums, attend local meetups, and join online communities like Reddit’s r/learnpython or the Python Discord server. You can learn from others, ask questions, and share your projects.

Read Books and Online Resources: Supplement your learning with Python books and online resources. Some popular books include “Python Crash Course” by Eric Matthes, “Automate the Boring Stuff with Python” by Al Sweigart, and “Fluent Python” by Luciano Ramalho. Online platforms like Real Python, Python.org, and freeCodeCamp offer tutorials, articles, and documentation.

Work on Real-World Problems: Challenge yourself with real-world problems or mini-projects that align with your interests. For example, you can create a web scraper, automate a repetitive task, or build a personal website using Python and web frameworks like Django or Flask.

Learn from Code Examples and Open-Source Projects: Study and analyze code examples and open-source projects on platforms like GitHub. Understanding how experienced developers structure their code and solve problems can enhance your programming skills.

Stay Updated and Continuously Learn: Python and its ecosystem evolve over time, so it’s essential to stay updated. Follow Python-related blogs, subscribe to newsletters, and explore new libraries and frameworks to expand your knowledge.

Remember, practice and consistency are key to mastering Python programming. Be patient, embrace challenges, and enjoy the learning process. Happy coding!

Replacing Strings in Files with Python

May 20, 2023 by Vinai Prakash

String manipulation is a common task in programming, and Python provides powerful tools to handle it efficiently. In this article, we will explore various methods to replace strings in any text file using the Python programming language. We will cover different scenarios and provide multiple examples to demonstrate each approach. By the end, you will clearly understand how to replace strings in files programmatically, whether you need to update a single file or process multiple files simultaneously. You can even replace special characters in files with the new string using these techniques. Let’s dive in and discover the techniques to accomplish this task effortlessly.

I. Reading and Writing Files in Python:

Before we delve into string replacement, it’s essential to understand how to read and write files in Python programming language. These fundamental operations will lay the foundation for manipulating strings within files using any Python program.

Python provides the built-in open() function to access files. Here’s an example that opens an input file in read mode and prints its contents:

# open the input file as a read only file.

with open('filename.txt', 'r') as file:

    contents = file.read()

    print(contents)

To write data to a new file, we can open it in write mode using the ‘w’ parameter. The following example demonstrates how to replace the entire contents of a file with new data:

# old file contents are overwritten by the new string. This works even if you have a huge file of data.

with open('filename.txt', 'w') as file:

    file.write("New content goes here!") #the target file will get the replacement string.

Now that we have the basics covered, let’s explore different methods to replace specific strings within files. This is by far the best way or best method to replace any text with a new string, and create a new file for the output. Enough theory. Let’s look at a practical Python Code example. Here’s the sample code, in Python.

II. String Replacement Techniques for regular expression pattern:

Using the replace() Method:

The replace() method is a convenient way to replace occurrences of a specific string within another string. By combining it with file read and write operations, we can replace targeted strings in a file. Here’s an example in the following script in Python code:

with open('filename.txt', 'r') as file:
    contents = file.read()
    modified_contents = contents.replace('old_string', 'new_string')

    with open('filename.txt', 'w') as file:
        file.write(modified_contents)

Now you can open the original file, and see that all occurrences of the substring have been replaced with the specified string. The old string should be nowhere to be seen. With a small script in Python, we can do wonders.

Utilizing Regular Expression Pattern Matching Technique:

Python’s re module enables powerful pattern matching using regular expressions. We can leverage regular expressions to perform advanced string replacements. Consider the following example, where we replace multiple occurrences of a word with another replacement string. Create a new Python file. And write the following Python tutorial code.

import re

with open('filename.txt', 'r') as file:

    contents = file.read()

    modified_contents = re.sub(r'\bword\b', 'replacement', contents)

    with open('filename.txt', 'w') as file:

        file.write(modified_contents)

With this method, you can keep calling it to replace multiple text strings in the original file. Make sure to create a backup file before starting the string substitutions. The entire file is scanned, and simple replacements are done in no time. These Python coding techniques are best for complete beginner or even experienced users of Python script. It may be best to create a Python function to do this if you have to replace multiple occurrences of substrings in a list of files.

Replacing Strings in Multiple Files:

Often, we need to replace strings in multiple files simultaneously. We can employ a loop that iterates over a list of filenames to accomplish this. Here’s an example that replaces a string in multiple files:

import glob

files = glob.glob('path/to/files/*.txt')

target_string = 'old_string'

replacement_string = 'new_string'

for file in files:

with open(file, 'r') as f:

contents = f.read()

modified_contents = contents.replace(target_string, replacement_string)

with open(file, 'w') as f:

f.write(modified_contents) # creation of output file

Modifying Files in Place:

If the file size is small, we can read the entire contents into memory, perform the string replacement, and write it back to the same file. However, for larger files, it’s more efficient to read and write line by line. Here’s an example that replaces strings in a file line by line:

input_file = 'filename.txt'
output_file = 'filename_modified.txt'

target_string = 'old_string'
replacement_string = 'new_string'

with open(input_file, 'r') as file_in, open(output_file, 'w') as file_out:
for line in file_in:
    modified_line = line.replace(target_string, replacement_string)
file_out.write(modified_line)

Conclusion:

Python programming language provides several techniques to replace strings in files efficiently. Whether you need to replace a single occurrence or process multiple files, the methods presented in this article will equip you with the necessary knowledge to achieve your goal. Simply put in the search string, and then replace it with the replacement string.

Remember to handle file operations carefully, considering factors such as file size, memory usage, and performance requirements. With the flexibility and power of Python, you can confidently manipulate strings in files and streamline your data processing tasks. Happy coding!

How To Open Excel Files in Python: Multiple Methods Explored

May 20, 2023 by vinai@brandrich.com

How To Open Excel Files in Python: Multiple Methods Explored

Python programming language provides various libraries and modules to work with Microsoft Excel files, making it effortless to extract, manipulate, and analyze data stored in any Excel sheet. In this step-by-step guide, we will explore multiple ways to open Excel files using different libraries in Python. By the end of this article, you’ll have a solid understanding of the various approaches available, allowing you to choose the most suitable one for your needs.

We then show you step by step examples of different ways we can use to process the Excel data, and remove duplicates, row by row. This way, you will a solid understanding of how to open an Excel workbook in Python, and then how to process the data to de-duplicate it. After all, we can only process clean data in Python… so these are essential steps before any data analysis in Python.

Introduction to Excel File Handling in Python:

Excel files, commonly known as spreadsheets, are widely used for data storage and analysis. Python offers several libraries that facilitate reading and writing Excel files. In this article, we will explore four popular libraries: pandas, openpyxl, xlrd and xlwt, and pyxlsb. These 4 Python packages are most versatile and easy to work with. They are best for beginners writing their first Python script.

Method 1 of Opening Excel File in Python: Using the pandas Library:

The Python pandas library is a powerful tool for data manipulation and analysis. It provides an easy-to-use interface to work with an Excel spreadsheet file. To open an Excel file using a pandas dataframe, follow these steps:

a. Installing pandas module and openpyxl:

pip install pandas openpyxl

b. Importing the necessary modules:

import pandas as pd

c. Opening an Excel file:

df = pd.read_excel('file.xlsx')

d. Accessing the data:

print(df.head())

As you can see in the above Python code, the pandas Python library is a pretty easy to use library. It picks up the data on the first sheet. We can also use the sheet name to refer to any specific sheet too.

Method 2 of Opening Excel File in Python: Using the openpyxl Library:

The openpyxl library provides a wide range of functionalities to read, write, and manipulate Excel files. To open an Excel file using openpyxl, follow these steps:

# a. Installing openpyxl
pip install openpyxl

# b. Importing the necessary modules
import openpyxl

# c. Opening an Excel file:
workbook = openpyxl.load_workbook('file.xlsx') 
worksheet = workbook.active

# d. Accessing the data:
for row in worksheet.iter_rows(values_only=True):
    print(row)

Method 3 of Opening Excel File in Python: Using the xlrd and xlwt Libraries:

The xlrd and xlwt libraries are widely used for reading and writing Excel files in Python. While xlrd allows you to read data from Excel files, xlwt enables you to create and modify Excel files. To open an Excel file using xlrd and xlwt, follow these steps:

# a. Installing xlrd and xlwt:
pip install xlrd xlwt

# b. Importing the necessary modules:
import xlrd

# c. Opening an Excel file with an xlsx extension:
workbook = xlrd.open_workbook('file.xls')
worksheet = workbook.sheet_by_index(0)

# d. Accessing the data:
for row in range(worksheet.nrows):
    print(worksheet.row_values(row))

Method 4 of Opening Excel File in Python: Using the pyxlsb Library:

The pyxlsb library is specifically designed for handling binary Excel files (.xlsb format). It provides a fast and efficient way to extract data from such files. To open an Excel file using pyxlsb, study the steps in the following code.

# a. Installing pyxlsb:
pip install pyxlsb

# b. Importing the necessary modules:
from pyxlsb import open_workbook

#c. Opening an Excel file with the xlsb extension:
with open_workbook('file.xlsb') as wb:
    with wb.get_sheet(1) as sheet:
         for row in sheet.rows():
             values = [item.v for item in row]
             print(values)

In these 4 earlier examples, we explored four different methods to open and read Excel files in Python. Each method utilizes a specific library, such as pandas, openpyxl, xlrd and xlwt, and pyxlsb, to achieve the desired outcome. The code examples provided should help you get started with opening Excel files and extracting data using the most appropriate method for your needs. Remember to install the required Python libraries and modules and adapt the code samples to your specific file names and data requirements. With these tools at your disposal, you can easily work with Microsoft Excel files in Python and unlock a world of data analysis possibilities.

Next, we look at detailed examples of opening an Excel file in Python, and then using the data in the Excel sheet, we check if the name value is duplicated or not in a specific column. We then create another Excel file, which contains the de-duplicated data only. These methods work on Excel Sheets in any Microsoft Excel workbook, and they also work with any CSV file.

Keep in mind that you may have to modify the file path of the spreadshet file. And if the data is on different sheets, you can process the data on each Excel sheet, one by one, in a loop too.

Here are two detailed examples that demonstrate opening an Excel file in Python, processing the rows step by step, and removing duplicate names from the data. The examples will use the pandas library for data manipulation and the openpyxl library for writing the de-duplicated data into another Excel file. You can use this code in your Python module diretly.

Example 1: Using pandas data frame to Remove Duplicate Names from Excel Data

import pandas as pd

# Step 1: Open Excel file first. This function opens the first Excel sheet.
df = pd.read_excel('input_file.xlsx')

# Step 2: Processing the rows step by step
unique_names = set()
deduplicated_data = []

for index, row in df.iterrows():
    name = row['Name']
    if name not in unique_names:
        unique_names.add(name)
        deduplicated_data.append(row)

# Step 3: Creating a new DataFrame with the de-duplicated data
deduplicated_df = pd.DataFrame(deduplicated_data)

# Step 4: Writing the de-duplicated data to a new Excel file
deduplicated_df.to_excel('output_file.xlsx', index=False)

Explanation:

We start by importing the necessary module, pandas, which will be used for data manipulation.
We open the input Excel file using the read_excel function from pandas, and store the data in a DataFrame called df.
We initialize an empty set called unique_names to store unique names encountered in the data, and an empty list called deduplicated_data to store the de-duplicated rows.
We iterate over each row in the DataFrame using the iterrows() function. For each row, we extract the name from the ‘Name’ column.
We check if the name is already present in the unique_names set. If it is not present, we add it to the set and append the entire row to the deduplicated_data list.
After processing all the rows, we create a new DataFrame called deduplicated_df using the deduplicated_data.
Finally, we write the de-duplicated data from deduplicated_df to a new Excel file called ‘output_file.xlsx’ using the to_excel function. The index=False parameter ensures that the index column is not included in the output file.

Example 2: Using openpyxl to Remove Duplicate Names from Excel Data

import openpyxl

# Step 1: Opening the Excel file
workbook = openpyxl.load_workbook('input_file.xlsx')
worksheet = workbook.active

# Step 2: Processing the rows step by step
unique_names = set()
deduplicated_rows = []

for row in worksheet.iter_rows(values_only=True):
    name = row[0]  # Assuming the name is in the first column (column index 0)
    if name not in unique_names:
        unique_names.add(name)
        deduplicated_rows.append(row)

# Step 3: Creating a new Workbook and Worksheet to store the de-duplicated data
output_workbook = openpyxl.Workbook()
output_worksheet = output_workbook.active

# Step 4: Writing the de-duplicated data to the new Worksheet
for row in deduplicated_rows:
    output_worksheet.append(row)

# Step 5: Saving the new Workbook as an Excel file
output_workbook.save('output_file.xlsx')

Explanation:

We import the openpyxl module, which will be used for working with Excel files.
We open the input Excel file using the load_workbook function from openpyxl and store it in the workbook object. We then access the active worksheet using workbook.active and store it in the worksheet object.
We initialize an empty set called unique_names to store unique names encountered in the data and an empty list called deduplicated_rows to store the de-duplicated rows.
We iterate over each row in the worksheet using the iter_rows function with values_only=True to retrieve the cell values directly. For each row, we extract the name from the first column (assuming it is in the first column, represented by column index 0).
We check if the name is already present in the unique_names set. If it is not present, we add it to the set and append the entire row to the deduplicated_rows list.
After processing all the rows, we create a new Workbook using openpyxl.Workbook() and access the active worksheet using output_workbook.active, which is where we will store the de-duplicated data.
We iterate over each row in the deduplicated_rows list and append it to the output worksheet using the append method.
Finally, we save the new Workbook as an Excel file called ‘output_file.xlsx’ using the save method.
Note: In both examples, make sure to replace ‘input_file.xlsx’ with the actual name of your input Excel file. The de-duplicated data will be written to a new Excel file called ‘output_file.xlsx’.

The above code checks for duplicates in a single cell in any file name. However, the whole row may be duplicated too. And we must learn how to handle this too.

What happens if the whole row is duplicated in the Excel spreadsheet?

We can check for the entire row contents too. Some specific rows may be exact duplicates. And we want to remove it. Here are modified examples where we check if the entire row of data is duplicated and remove it based on the entire row’s duplication.

Example 1: Using pandas to Remove Duplicate Rows from Excel Data

import pandas as pd

# Step 1: Opening the Excel file
df = pd.read_excel('input_file.xlsx')

# Step 2: Removing duplicate rows based on the entire row's duplication
deduplicated_df = df.drop_duplicates()

# Step 3: Writing the de-duplicated data to a new Excel file
deduplicated_df.to_excel('output_file.xlsx', index=False)

Explanation:

We start by importing the necessary module, pandas, which will be used for data manipulation.
We open the input Excel file using the read_excel function from pandas, and store the data in a DataFrame called df.
We use the drop_duplicates function on the DataFrame df to remove duplicate rows based on the entire row’s duplication. This function keeps only the first occurrence of each unique row and drops subsequent duplicates.
The result, deduplicated_df, is a new DataFrame containing the de-duplicated data.
Finally, we write the de-duplicated data from deduplicated_df to a new Excel file called ‘output_file.xlsx’ using the to_excel function. The index=False parameter ensures that the index column is not included in the output file.

Example 2: Using openpyxl to Remove Duplicate Rows from Excel Data

import openpyxl

# Step 1: Opening the Excel file
workbook = openpyxl.load_workbook('input_file.xlsx')
worksheet = workbook.active

# Step 2: Removing duplicate rows based on the entire row's duplication
unique_rows = set()
deduplicated_rows = []

for row in worksheet.iter_rows(values_only=True):
    if row not in unique_rows:
        unique_rows.add(row)
        deduplicated_rows.append(row)

# Step 3: Creating a new Workbook and Worksheet to store the de-duplicated data
output_workbook = openpyxl.Workbook()
output_worksheet = output_workbook.active

# Step 4: Writing the de-duplicated data to the new Worksheet
for row in deduplicated_rows:
    output_worksheet.append(row)

# Step 5: Saving the new Workbook as an Excel file
output_workbook.save('output_file.xlsx')

Explanation:

We import the openpyxl module, which will be used for working with Excel files.
We open the input Excel file using the load_workbook function from openpyxl and store it in the workbook object. We then access the active worksheet using workbook.active and store it in the worksheet object.
We initialize an empty set called unique_rows to store unique rows encountered in the data and an empty list called deduplicated_rows to store the de-duplicated rows.
We iterate over each row in the worksheet using the iter_rows function with values_only=True to retrieve the cell values directly. For each row, we check if the entire row is already present in the unique_rows set. If it is not present, we add it to the set and append the entire row to the deduplicated_rows list.
After processing all the rows, we create a new Workbook using openpyxl.Workbook() and access the active worksheet using output_workbook.active, which is where we will store the de-duplicated data.
We iterate over each row in the deduplicated_rows list and append it to the output worksheet using the append method.
Finally, we save the new Workbook as an Excel file called ‘output_file.xlsx’ using the save method.
Note: In both examples, make sure to replace ‘input_file.xlsx’ with the actual name of your input Excel file. The de-duplicated data will be written to a new Excel file called ‘output_file.xlsx’.

Conclusion:

With so many different examples of opening Excel files in Python, and then reading them to process the data, and eliminating duplicates shows you the power of Python programming. In just a few lines, we can open any Excel file, process it, and clean it. Then begins the interesting work of analyzing the data, which we will explore in another article.

Hope you enjoyed the best information I have compiled from my years of working on Python coding.It will help you to process any Excel spreadsheet document, irrespective of the total number of rows. And in these methods, the file extension can be .xls or .xlsx. We just assume the Microsoft Excel file to be present in the current working directory.

Do write a comment and let me know what you liked, and if this information helped you.

Cheers,
Vinai

How To Find Duplicates in a Python List

May 20, 2023 by vinai@brandrich.com

Finding duplicates in a Python List and Removing duplicates from a Python list variable are quite common tasks. And that’s because Python Lists are prone to collecting duplicates in them. Checking if there are duplicates in a list variable is a common task for Python programmers.

Fortunately, it is relatively easy to check for duplicates in Python. And once you spot them, you can do several action items

List Duplicate Values only
Remove Duplicates Values and create a new list without any duplicates
Change the current list by removing only the duplicates, essentially deduplicating the existing list.
Just evaluate the list for duplicates, and report if there are duplicates in this list.
Count the duplicates in the list.

But before we delve deeper into these tasks, it is better to quickly understand what Lists are and why duplicates can exist in Python lists.

I also want you to know about the Set data type in the Python programming language. Once you know their unique points and differences, you will better appreciate the methods used to identify and remove duplicates from a Python list.

And if you need help with Python, Join our Python Programming Fundamentals Training Course.

What is a List in Python?

A list in Python is like an array. It is a collection of objects stored in a single variable. A list is changeable. You can add or remove elements from Python lists. A list can be sorted too. But by default, a list is not sorted.

A Python list can also contain duplicates, and it can also contain multiple elements of different data types. This way, you can store integers, floating point numbers, positive or negatives, strings, and even boolean values in a list.

Python lists can also contain other lists within it and can grow to any size. But lists are considered slower in accessing elements, as compared to Tuples. So some methods are more suited for small lists, and others are better for large lists. It largely depends on the list size.

You define a list by enclosing the elements in square brackets. Each element is separated by commas within the list.

What is a Set in Python?

A Set is another data type available in Python. Here also, you can store multiple items in a Set. But a set differs from a Python list in that a Set can not contain duplicates.

You can define a Set with curly braces, as compared to a list, which is defined by using square brackets.

A Set in Python is not ordered or indexed. It is possible that every time you access a particular index from a set, you get a different value.

Once you have created a Set in Python, you can add elements but can’t change the existing elements.

Now that you have a basic list comprehension and Set datatype understanding in Python, we will explore the identification and removal of duplicates in Python Lists.

Multiple Ways To Check if duplicates exist in a Python List

The length of the List & length of the Set are different
Check each element in set. if yes, dup; if not, append.
Check for list.count() for each element.

We will be using Python 3 as the language. So as long as you have any version of the Python 3 compiler, you are good to go.

Method 1: Using the length of a list to identify if it contains duplicate elements.

Let’s write the Python program to check this.

# this input list contains duplicates
mylist = [5, 3, 5, 2, 1, 6, 6, 4] # 5 & 6 are duplicate numbers.

# find the length of the list
print(len(mylist))
8

# create a set from the list
myset = set(mylist)

# find the length of the Python set variable myset
print(len(myset))
6

# create a set from the list
myset = set(mylist)

# find the length of the Python set variable myset
print(len(myset))
6

As you can see, the length of the mylist variable is 8, and the myset length is 6.

# create a set from the list
myset = set(mylist)

# find the length of the Python set variable myset
print(len(myset))

Output:

Here’s the final Python program – the full code can be copied and pasted into a Python program and used to check if identical items exist in a list or not.

# this input list contains duplicates
mylist = [5, 3, 5, 2, 1, 6, 6, 4] # 5 & 6 are duplicate numbers.

# find the length of the list
print(len(mylist))

# create a set from the list
myset = set(mylist)

# find the length of the Python set variable myset
print(len(myset))

# compare the length and print if the list contains duplicates
if len(mylist) != len(myset):
    print("duplicates found in the list")
else:
    print("No duplicates found in the list")

Output:

8
6
duplicates found in the list

Alternatively, we can create a function that will check if duplicate items exist, and will return a True or a False to alert us of duplicates.

Here is the complete function to check if duplicates exist in Python list

def is_duplicate(anylist):
    if type(anylist) != 'list':
        return("Error. Passed parameter is Not a list")
    if len(anylist) != len(set(anylist)):
        return True
    else:
        return False

mylist = [5, 3, 5, 2, 1, 6, 6, 4] # you can see some repeated number in the list.
if is_duplicate(mylist):
    print("duplicates found in list")
else:
    print("no duplicates found in list")

The output of this Python code is:

duplicates found in list

Method 2: Listing Duplicates in a List & Listing Unique Values – Sorted

In this method, we will create different lists for different use – one to have the duplicate keys or repeated values, and different lists for the unique keys. A few lines of code can do magic in a Python program.

# the given list contains duplicates
mylist = [5, 3, 5, 2, 1, 6, 6, 4] # the original list of integers with duplicates

newlist = [] # empty list to hold unique elements from the list
duplist = [] # empty list to hold the duplicate elements from the list
for i in mylist:
    if i not in newlist:
        newlist.append(i)
    else:
        duplist.append(i) # this method catches the first duplicate entries, and appends them to the list

# The next step is to print the duplicate entries, and the unique entries
print("List of duplicates", duplist)
print("Unique Item List", newlist) # prints the final list of unique items

Output:

List of duplicates [5, 6]
Unique Item List [5, 3, 2, 1, 6, 4]

And if you want to sort the list items after removing the duplicates, you can use the inbuilt function called sort on the list of numbers.

# sorting the list
newlist.sort() # the sort method sorts all the values
print("The sorted list", newlist) # this prints the sorted list

Output:

The sorted list [1, 2, 3, 4, 5, 6]
Join our Python Programming Fundamentals Training Course.

Join the Most Popular Python Programming Training Course

Method 3: Listing only Duplicate values with the Count Method

This method iterates over each element of the entire list and checks if the count of each element is greater than 1. If yes, that item is added to a set. If you remember, a set cannot contain any duplicates by design. In the following code, for items that exist more than once, only those repeated elements are added to the set.

# the mylist variable represents a duplicate list.
mylist = [5, 3, 5, 2, 1, 6, 6, 4] # the original input list with repeated elements.

dup = {x for x in mylist if mylist.count(x) > 1}
print(dup)

#To count the number of list elements that were duplicated, you can run
print(len(dup))

Output:

{5, 6}
2

Keep in mind that the listed duplicate values might have existed once, or eve

The fastest way to Remove Duplicates From Python Lists

One of the fastest ways to remove duplicates is to create a set from the list variable. All this can be done in just a single Python statement. This is the fastest method, which is more suited for large lists.

Here’s the final code in Python – probably the best way…

# this list contains duplicate number 5 & 6
mylist = [5, 3, 5, 2, 1, 6, 6, 4]
myunique = set(mylist) # prints the final list without any duplicates
print(myunique)

Output:

{1, 2, 3, 4, 5, 6}

How to Avoid Duplicates in a Python List

The first thing you must consider is – Why am I using a list in Python?

Because it can collect duplicates. If you are clear that duplicates don’t exist in whatever you are collecting or storing, then don’t use a list. Instead, a better way is to use a Set. A set is built to reject duplicates, which is a better solution. You should explore sets a bit more to gain a better set comprehension. It can be a real-time saver as this is a more efficient way.

If you don’t care about the order, then just using set(mylist) will do the job of removing any duplicates. I use this, even in the worst case scenario where the incoming entire list is a dirty list of multiple duplicate elements.

Alternatively, if you really must use a list because of the things you can do with a list data type, then do a simple check before you add any element.

For example, you can sort a list, but not a Set in Python. It can be useful for large lists.

So before you add any new element in a list, just do a quick check for the existence of the value. If the element exists, then don’t store it. Simple!

The methods discussed above work on any list of elements. So if you want to find duplicate strings or duplicate integers or duplicate floating numbers or any kind of duplicate objects, you can use these Python programs.

Hope the different ways to find duplicates, list them, and finally remove duplicate elements altogether from any Python list using simple programs and methods will come in handy for your processing and list comprehension.

Beginner in Python?

Join the Beginner’s Python Programming Course.,

Or Join the Python For Data Analysis training course if you know the basics and want to join the more advanced level of Python Data Analysis techniques.

New Course Now Available

August 4, 2017December 16, 2016 by TruEducate Super Admin

Two new courses have been launched:

1. PMP exam Preparation Program – with over 35 hours of video lessons, to help project managers achieve the PMP certification by clearing the PMP exam quickly.

2. Analyze Data Using Excel Pivot Tables – This excellent course teaches ways to quickly analyze data using Excel Pivot Tables. Can be used with any version of Excel – 2007, 2010, 2013 or 2016. Simple and effective techniques to find the hidden gems in your data and glean new insights. *** HIGHLY RECOMMENDED ***

What is Python Programming Language?

Here are some key features of Python:

What is the Python programming language used for?

Is Python easy to learn or hard?

Is Python language similar to C++?

Why is Python so popular?

Where is Python used in real life?

How to Learn Python Programming?

I. Reading and Writing Files in Python:

II. String Replacement Techniques for regular expression pattern:

Using the replace() Method:

Utilizing Regular Expression Pattern Matching Technique:

Replacing Strings in Multiple Files:

Modifying Files in Place:

Conclusion:

Table of Contents:

Introduction to Excel File Handling in Python:

Method 1 of Opening Excel File in Python: Using the pandas Library:

Method 2 of Opening Excel File in Python: Using the openpyxl Library:

Method 3 of Opening Excel File in Python: Using the xlrd and xlwt Libraries:

Method 4 of Opening Excel File in Python: Using the pyxlsb Library:

Example 1: Using pandas data frame to Remove Duplicate Names from Excel Data

Example 2: Using openpyxl to Remove Duplicate Names from Excel Data

What happens if the whole row is duplicated in the Excel spreadsheet?

Example 1: Using pandas to Remove Duplicate Rows from Excel Data

Example 2: Using openpyxl to Remove Duplicate Rows from Excel Data

Conclusion:

What is a List in Python?

What is a Set in Python?

Multiple Ways To Check if duplicates exist in a Python List

Method 1: Using the length of a list to identify if it contains duplicate elements.

Method 2: Listing Duplicates in a List & Listing Unique Values – Sorted

Join the Most Popular Python Programming Training Course

Method 3: Listing only Duplicate values with the Count Method

The fastest way to Remove Duplicates From Python Lists

How to Avoid Duplicates in a Python List

Beginner in Python?