How To Open Excel Files in Python: Multiple Methods Explored

How To Open Excel Files in Python: Multiple Methods Explored

Python programming language provides various libraries and modules to work with Microsoft Excel files, making it effortless to extract, manipulate, and analyze data stored in any Excel sheet. In this step-by-step guide, we will explore multiple ways to open Excel files using different libraries in Python. By the end of this article, you’ll have a solid understanding of the various approaches available, allowing you to choose the most suitable one for your needs.

We then show you step by step examples of different ways we can use to process the Excel data, and remove duplicates, row by row. This way, you will a solid understanding of how to open an Excel workbook in Python, and then how to process the data to de-duplicate it. After all, we can only process clean data in Python… so these are essential steps before any data analysis in Python.

Table of Contents:

  • Introduction to Excel File Handling in Python
  • Method 1: Using the pandas Library
  • Method 2: Using the openpyxl Library
  • Method 3: Using the xlrd and xlwt Libraries
  • Method 4: Using the pyxlsb Library
  • Conclusion

Introduction to Excel File Handling in Python:

Excel files, commonly known as spreadsheets, are widely used for data storage and analysis. Python offers several libraries that facilitate reading and writing Excel files. In this article, we will explore four popular libraries: pandas, openpyxl, xlrd and xlwt, and pyxlsb. These 4 Python packages are most versatile and easy to work with. They are best for beginners writing their first Python script.

Method 1 of Opening Excel File in Python: Using the pandas Library:

The Python pandas library is a powerful tool for data manipulation and analysis. It provides an easy-to-use interface to work with an Excel spreadsheet file. To open an Excel file using a pandas dataframe, follow these steps:

a. Installing pandas module and openpyxl:

pip install pandas openpyxl

b. Importing the necessary modules:

import pandas as pd

c. Opening an Excel file:

df = pd.read_excel('file.xlsx')

d. Accessing the data:

print(df.head())

As you can see in the above Python code, the pandas Python library is a pretty easy to use library. It picks up the data on the first sheet. We can also use the sheet name to refer to any specific sheet too.

Method 2 of Opening Excel File in Python: Using the openpyxl Library:

The openpyxl library provides a wide range of functionalities to read, write, and manipulate Excel files. To open an Excel file using openpyxl, follow these steps:

# a. Installing openpyxl
pip install openpyxl
# b. Importing the necessary modules
import openpyxl
# c. Opening an Excel file:
workbook = openpyxl.load_workbook('file.xlsx') 
worksheet = workbook.active
# d. Accessing the data:
for row in worksheet.iter_rows(values_only=True):
    print(row)

Method 3 of Opening Excel File in Python: Using the xlrd and xlwt Libraries:

The xlrd and xlwt libraries are widely used for reading and writing Excel files in Python. While xlrd allows you to read data from Excel files, xlwt enables you to create and modify Excel files. To open an Excel file using xlrd and xlwt, follow these steps:

# a. Installing xlrd and xlwt:
pip install xlrd xlwt
# b. Importing the necessary modules:
import xlrd
# c. Opening an Excel file with an xlsx extension:
workbook = xlrd.open_workbook('file.xls')
worksheet = workbook.sheet_by_index(0)
# d. Accessing the data:
for row in range(worksheet.nrows):
    print(worksheet.row_values(row))

Method 4 of Opening Excel File in Python: Using the pyxlsb Library:

The pyxlsb library is specifically designed for handling binary Excel files (.xlsb format). It provides a fast and efficient way to extract data from such files. To open an Excel file using pyxlsb, study the steps in the following code.

# a. Installing pyxlsb:
pip install pyxlsb
# b. Importing the necessary modules:
from pyxlsb import open_workbook
#c. Opening an Excel file with the xlsb extension:
with open_workbook('file.xlsb') as wb:
    with wb.get_sheet(1) as sheet:
         for row in sheet.rows():
             values = [item.v for item in row]
             print(values)

In these 4 earlier examples, we explored four different methods to open and read Excel files in Python. Each method utilizes a specific library, such as pandas, openpyxl, xlrd and xlwt, and pyxlsb, to achieve the desired outcome. The code examples provided should help you get started with opening Excel files and extracting data using the most appropriate method for your needs. Remember to install the required Python libraries and modules and adapt the code samples to your specific file names and data requirements. With these tools at your disposal, you can easily work with Microsoft Excel files in Python and unlock a world of data analysis possibilities.

Next, we look at detailed examples of opening an Excel file in Python, and then using the data in the Excel sheet, we check if the name value is duplicated or not in a specific column. We then create another Excel file, which contains the de-duplicated data only. These methods work on Excel Sheets in any Microsoft Excel workbook, and they also work with any CSV file.

Keep in mind that you may have to modify the file path of the spreadshet file. And if the data is on different sheets, you can process the data on each Excel sheet, one by one, in a loop too.

Here are two detailed examples that demonstrate opening an Excel file in Python, processing the rows step by step, and removing duplicate names from the data. The examples will use the pandas library for data manipulation and the openpyxl library for writing the de-duplicated data into another Excel file. You can use this code in your Python module diretly.

Example 1: Using pandas data frame to Remove Duplicate Names from Excel Data

import pandas as pd

# Step 1: Open Excel file first. This function opens the first Excel sheet.
df = pd.read_excel('input_file.xlsx')

# Step 2: Processing the rows step by step
unique_names = set()
deduplicated_data = []

for index, row in df.iterrows():
    name = row['Name']
    if name not in unique_names:
        unique_names.add(name)
        deduplicated_data.append(row)

# Step 3: Creating a new DataFrame with the de-duplicated data
deduplicated_df = pd.DataFrame(deduplicated_data)

# Step 4: Writing the de-duplicated data to a new Excel file
deduplicated_df.to_excel('output_file.xlsx', index=False)

Explanation:

  1. We start by importing the necessary module, pandas, which will be used for data manipulation.
  2. We open the input Excel file using the read_excel function from pandas, and store the data in a DataFrame called df.
  3. We initialize an empty set called unique_names to store unique names encountered in the data, and an empty list called deduplicated_data to store the de-duplicated rows.
  4. We iterate over each row in the DataFrame using the iterrows() function. For each row, we extract the name from the ‘Name’ column.
  5. We check if the name is already present in the unique_names set. If it is not present, we add it to the set and append the entire row to the deduplicated_data list.
  6. After processing all the rows, we create a new DataFrame called deduplicated_df using the deduplicated_data.
  7. Finally, we write the de-duplicated data from deduplicated_df to a new Excel file called ‘output_file.xlsx’ using the to_excel function. The index=False parameter ensures that the index column is not included in the output file.

Example 2: Using openpyxl to Remove Duplicate Names from Excel Data

import openpyxl

# Step 1: Opening the Excel file
workbook = openpyxl.load_workbook('input_file.xlsx')
worksheet = workbook.active

# Step 2: Processing the rows step by step
unique_names = set()
deduplicated_rows = []

for row in worksheet.iter_rows(values_only=True):
    name = row[0]  # Assuming the name is in the first column (column index 0)
    if name not in unique_names:
        unique_names.add(name)
        deduplicated_rows.append(row)

# Step 3: Creating a new Workbook and Worksheet to store the de-duplicated data
output_workbook = openpyxl.Workbook()
output_worksheet = output_workbook.active

# Step 4: Writing the de-duplicated data to the new Worksheet
for row in deduplicated_rows:
    output_worksheet.append(row)

# Step 5: Saving the new Workbook as an Excel file
output_workbook.save('output_file.xlsx')

Explanation:

  1. We import the openpyxl module, which will be used for working with Excel files.
  2. We open the input Excel file using the load_workbook function from openpyxl and store it in the workbook object. We then access the active worksheet using workbook.active and store it in the worksheet object.
  3. We initialize an empty set called unique_names to store unique names encountered in the data and an empty list called deduplicated_rows to store the de-duplicated rows.
  4. We iterate over each row in the worksheet using the iter_rows function with values_only=True to retrieve the cell values directly. For each row, we extract the name from the first column (assuming it is in the first column, represented by column index 0).
  5. We check if the name is already present in the unique_names set. If it is not present, we add it to the set and append the entire row to the deduplicated_rows list.
  6. After processing all the rows, we create a new Workbook using openpyxl.Workbook() and access the active worksheet using output_workbook.active, which is where we will store the de-duplicated data.
  7. We iterate over each row in the deduplicated_rows list and append it to the output worksheet using the append method.
  8. Finally, we save the new Workbook as an Excel file called ‘output_file.xlsx’ using the save method.
  9. Note: In both examples, make sure to replace ‘input_file.xlsx’ with the actual name of your input Excel file. The de-duplicated data will be written to a new Excel file called ‘output_file.xlsx’.

The above code checks for duplicates in a single cell in any file name. However, the whole row may be duplicated too. And we must learn how to handle this too.

What happens if the whole row is duplicated in the Excel spreadsheet?

We can check for the entire row contents too. Some specific rows may be exact duplicates. And we want to remove it. Here are modified examples where we check if the entire row of data is duplicated and remove it based on the entire row’s duplication.

Example 1: Using pandas to Remove Duplicate Rows from Excel Data

import pandas as pd

# Step 1: Opening the Excel file
df = pd.read_excel('input_file.xlsx')

# Step 2: Removing duplicate rows based on the entire row's duplication
deduplicated_df = df.drop_duplicates()

# Step 3: Writing the de-duplicated data to a new Excel file
deduplicated_df.to_excel('output_file.xlsx', index=False)

Explanation:

  1. We start by importing the necessary module, pandas, which will be used for data manipulation.
  2. We open the input Excel file using the read_excel function from pandas, and store the data in a DataFrame called df.
  3. We use the drop_duplicates function on the DataFrame df to remove duplicate rows based on the entire row’s duplication. This function keeps only the first occurrence of each unique row and drops subsequent duplicates.
  4. The result, deduplicated_df, is a new DataFrame containing the de-duplicated data.
  5. Finally, we write the de-duplicated data from deduplicated_df to a new Excel file called ‘output_file.xlsx’ using the to_excel function. The index=False parameter ensures that the index column is not included in the output file.

Example 2: Using openpyxl to Remove Duplicate Rows from Excel Data

import openpyxl

# Step 1: Opening the Excel file
workbook = openpyxl.load_workbook('input_file.xlsx')
worksheet = workbook.active

# Step 2: Removing duplicate rows based on the entire row's duplication
unique_rows = set()
deduplicated_rows = []

for row in worksheet.iter_rows(values_only=True):
    if row not in unique_rows:
        unique_rows.add(row)
        deduplicated_rows.append(row)

# Step 3: Creating a new Workbook and Worksheet to store the de-duplicated data
output_workbook = openpyxl.Workbook()
output_worksheet = output_workbook.active

# Step 4: Writing the de-duplicated data to the new Worksheet
for row in deduplicated_rows:
    output_worksheet.append(row)

# Step 5: Saving the new Workbook as an Excel file
output_workbook.save('output_file.xlsx')

Explanation:

  1. We import the openpyxl module, which will be used for working with Excel files.
  2. We open the input Excel file using the load_workbook function from openpyxl and store it in the workbook object. We then access the active worksheet using workbook.active and store it in the worksheet object.
  3. We initialize an empty set called unique_rows to store unique rows encountered in the data and an empty list called deduplicated_rows to store the de-duplicated rows.
  4. We iterate over each row in the worksheet using the iter_rows function with values_only=True to retrieve the cell values directly. For each row, we check if the entire row is already present in the unique_rows set. If it is not present, we add it to the set and append the entire row to the deduplicated_rows list.
  5. After processing all the rows, we create a new Workbook using openpyxl.Workbook() and access the active worksheet using output_workbook.active, which is where we will store the de-duplicated data.
  6. We iterate over each row in the deduplicated_rows list and append it to the output worksheet using the append method.
  7. Finally, we save the new Workbook as an Excel file called ‘output_file.xlsx’ using the save method.
  8. Note: In both examples, make sure to replace ‘input_file.xlsx’ with the actual name of your input Excel file. The de-duplicated data will be written to a new Excel file called ‘output_file.xlsx’.

Conclusion:

With so many different examples of opening Excel files in Python, and then reading them to process the data, and eliminating duplicates shows you the power of Python programming. In just a few lines, we can open any Excel file, process it, and clean it. Then begins the interesting work of analyzing the data, which we will explore in another article.

Hope you enjoyed the best information I have compiled from my years of working on Python coding.It will help you to process any Excel spreadsheet document, irrespective of the total number of rows. And in these methods, the file extension can be .xls or .xlsx. We just assume the Microsoft Excel file to be present in the current working directory.

Do write a comment and let me know what you liked, and if this information helped you.

Cheers,
Vinai

vinai@brandrich.com