Python: Pandas Series and DataFrames

The Python Pandas library is a widely used tool for analyzing and modifying large datasets. It provides two basic data structures, Series and DataFrame, along with functions to create, clean, and index data. Together, these features make Pandas useful for tasks ranging from basic data cleaning to statistical analysis. In this tutorial, we’ll cover the fundamental concepts of Pandas and work through multiple examples.

Contents

Introduction to Series and DataFrames
Pandas Series in Python
Pandas DataFrame in Python
Indexing and Accessing Data Elements
Basic Operations on Series and DataFrames
Python Pandas – Common Use Cases and Examples
Going Beyond the Basics

Introduction to Series and DataFrames

Wes McKinney is a software developer and data analyst who played a major role in the development of the Pandas library. He created Pandas to address the challenges he faced in handling financial data and performing data analysis in Python. Development began in 2008, and the library was released as open source in 2009. Since then, it has remained free to use and has become one of the most widely used libraries for data analysis in Python.

As mentioned earlier, the core data structures in Pandas are Series and DataFrame. Most data processing in Pandas revolves around them and the methods they provide.

Pandas Series in Python

A Pandas Series is a one-dimensional labeled array that can hold data of various types. Similar to a column in a table, it supports efficient indexing. All of this lives in a single Python object, which makes data manipulation straightforward.

Create a Series in Pandas

Imagine we have a dataset containing information about the daily temperatures in a city. In this dataset, the variable of interest is “Temperature,” and each day’s temperature measurement represents a data point.

Here’s how we can represent this dataset using a Pandas Series in Python:

import pandas as pds

# Sample dataset: Daily temperatures for a week
t_list = [25, 28, 26, 30, 29, 27, 31]

# Creating a Pandas Series
t_series = pds.Series(t_list, name='Temperature')

# Displaying the Series
print(t_series)

In this example:

Variable: The variable of interest is “Temperature”. It represents the daily temperatures in the city.
Corresponding data points: Each element in the Pandas Series (t_series) represents a specific data point – the temperature on a specific day.

The resulting Pandas Series, “t_series”, combines the single variable “Temperature” with its corresponding data points. This simple arrangement of labels and data makes it easy to perform various operations and analyses.

Let’s explore the key properties and methods of a Series in Pandas. This will equip us with practical knowledge to use them effectively.

Properties of Pandas Series

A Series has three main properties.

Index: Each element in a Series has a unique label or index that we can use to access the specific data points.

data = [10.2, 20.1, 30.3, 40.5]
series = pds.Series(data, index=["a", "b", "c", "d"])
print(series["b"])  # Access element by label
print(series.iloc[1])  # Access element by position

Data Type: All elements in a Series share the same data type. It is important for consistency and enabling smooth operations.

print(series.dtype)  # Output: float64

Shape: The shape of a Series is simply the number of elements it contains.

print(series.shape)  # Output: (4,)
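
As a quick check of the three properties above, here is a minimal sketch. The Series `s` and its values are made up for illustration. It also shows that mixing integers and floats produces a single shared data type.

```python
import pandas as pds

# Hypothetical values chosen only for illustration
s = pds.Series([1, 2.5, 3], index=["x", "y", "z"])

print(list(s.index))  # the labels: ['x', 'y', 'z']
print(s.dtype)        # float64, because the integers are upcast to floats
print(s.shape)        # (3,)
print(s.size)         # 3
```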

Common Methods of Pandas Series

A Pandas Series provides many methods to support various data analysis tasks.

Selection: The loc and iloc indexers return elements by label or by position.

print(series.loc["c"])  # Access by label
print(series.iloc[2])   # Access by position

Arithmetic and Comparison: A Series object allows calculations directly on its data using arithmetic operators (+, -, *, /) and comparison operators (==, !=, <, >).

new_series = series * 2
print(new_series)
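
The comparison operators work the same way. The sketch below rebuilds the same Series so it runs on its own; the threshold of 25 and the name `mask` are arbitrary choices for illustration. A comparison returns a boolean Series, which can then be used to filter the original data.

```python
import pandas as pds

series = pds.Series([10.2, 20.1, 30.3, 40.5], index=["a", "b", "c", "d"])

mask = series > 25     # element-wise comparison gives a boolean Series
print(mask)
print(series[mask])    # keep only the elements where the mask is True
```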

Missing Value Handling: Methods like dropna and fillna help in handling missing values (NaNs).

import numpy as npy

series.iloc[1] = npy.nan  # Introduce a missing value
print(series.dropna())    # Drop the missing entries
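
Dropping is not the only option. As a rough sketch, the snippet below fills the gap with the mean of the remaining values. Whether the mean is a sensible replacement depends on your data, so treat that choice as an assumption.

```python
import numpy as npy
import pandas as pds

series = pds.Series([10.2, npy.nan, 30.3, 40.5], index=["a", "b", "c", "d"])

print(series.isna())                   # True where a value is missing

# mean() skips NaN by default, so this is the mean of the three known values
filled = series.fillna(series.mean())
print(filled)
```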

Aggregation: A Series object also offers methods such as mean, sum, min, and max to do aggregate operations.

print(series.mean())  # Calculate mean

Time Series Analysis: A Series with a datetime index offers methods such as resample for analyzing time-based data.

dates = pds.date_range(start="2024-01-01", periods=4)
temp_series = pds.Series([10, 12, 15, 18], index=dates)
# Calculate monthly avg temperature (pandas 2.2+ prefers the alias "ME" over "M")
print(temp_series.resample("M").mean())
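
With only four daily values in a single month, the monthly average collapses to one number. To see resample produce several groups, here is a small sketch with two weeks of made-up daily readings grouped into 7-day bins. The data and the bin width are assumptions for illustration.

```python
import pandas as pds

# Fourteen hypothetical daily readings: 10, 11, ..., 23
dates = pds.date_range(start="2024-01-01", periods=14, freq="D")
temps = pds.Series(range(10, 24), index=dates)

# Group the days into consecutive 7-day bins and average each bin
weekly_avg = temps.resample("7D").mean()
print(weekly_avg)
```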

With the above information, you should be comfortable with using the Pandas Series in Python.

Pandas DataFrame in Python

A Pandas DataFrame is like a table, holding data in a structured way with rows and columns. It’s like an entire spreadsheet where each column is a Pandas Series. Just as a Series is a single variable, a data frame is a collection of these variables, making it easy to organize, analyze, and manipulate data efficiently in Python.

Create a DataFrame in Pandas

Imagine we have a more comprehensive dataset that includes not just daily temperatures but also additional information, such as humidity, wind speed, and precipitation, for a city over a week. A Pandas DataFrame is a perfect tool to handle such structured data efficiently.

Here’s how we can represent and work with this dataset using a Pandas DataFrame in Python:

import pandas as pds

# Sample dataset: Daily weather data for a week
weather = {
    'Day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
    'Temperature': [25, 28, 26, 30, 29, 27, 31],
    'Humidity': [60, 55, 70, 45, 50, 65, 40],
    'Wind Speed': [12, 10, 8, 15, 14, 11, 13],
    'Precipitation': [0, 0.1, 0, 0, 0, 0.2, 0]
}

# Creating a Pandas data frame
dfr = pds.DataFrame(weather)

# Displaying the data frame
print(dfr)

In this example:

Variables: We created multiple variables of interest, such as “Temperature,” “Humidity,” “Wind Speed,” and “Precipitation”. They represent different aspects of daily weather conditions.
Corresponding data points: Each row in the Pandas DataFrame represents a specific day, and the columns provide data points.
The resulting Pandas DataFrame, “dfr”: Combines all the variables into a single table, making it easy to explore, analyze, and manipulate.

This simple arrangement of rows and columns in a Pandas DataFrame simplifies working with complex datasets, providing a versatile tool for data analysis in Python.
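
To see that each column really is a Series, here is a short sketch using a trimmed version of the weather data. The derived column `Temperature_F` is hypothetical and assumes the readings are in Celsius.

```python
import pandas as pds

weather = {
    "Day": ["Monday", "Tuesday", "Wednesday"],
    "Temperature": [25, 28, 26],
    "Humidity": [60, 55, 70],
}
dfr = pds.DataFrame(weather)

temperature = dfr["Temperature"]     # selecting one column returns a Series
print(type(temperature).__name__)    # Series
print(temperature.name)              # Temperature

# A new column can be computed element-wise from an existing one
dfr["Temperature_F"] = dfr["Temperature"] * 9 / 5 + 32
print(dfr)
```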

Let’s now dive into the essential properties and methods of a data frame in Pandas. This exploration will provide us with practical knowledge to use them effectively.

Properties of Pandas DataFrame

A data frame possesses several crucial properties that define its structure and characteristics.

Columns: A data frame has a group of columns. Each column holds a specific kind of data, like names, ages, or scores. By using the column names, we can easily pick out the info we want.

import pandas as pds

# Creating a DataFrame
data = {'Name': ['Soumya', 'Meenakshi', 'Manya'],
        'Age': [25, 32, 20],
        'City': ['Bangalore', 'Gurgaon', 'Delhi']}

dfr = pds.DataFrame(data)
print(dfr['Age'])  # Accessing the 'Age' column

Index: Similar to a Series, a data frame has an index that uniquely identifies each row.

# Setting a custom index (set_index returns a new DataFrame and leaves dfr unchanged)
indexed_dfr = dfr.set_index('Name')
print(indexed_dfr.loc['Soumya'])  # Accessing a row by its index label

Shape: The shape property indicates the number of rows and columns in a data frame.

print(dfr.shape)  # Output: (3, 3)
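
The three properties can be inspected together. The following sketch rebuilds the same small table so it runs on its own, and also prints dtypes, which reports the data type of each column.

```python
import pandas as pds

dfr = pds.DataFrame({"Name": ["Soumya", "Meenakshi", "Manya"],
                     "Age": [25, 32, 20],
                     "City": ["Bangalore", "Gurgaon", "Delhi"]})

print(list(dfr.columns))  # ['Name', 'Age', 'City']
print(list(dfr.index))    # [0, 1, 2], the default integer index
print(dfr.shape)          # (3, 3)
print(dfr.dtypes)         # one data type per column
```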

Common Methods of Pandas DataFrame

Pandas DataFrames offer various methods to facilitate data analysis and manipulation.

Head and Tail: The head and tail methods let us quickly inspect the top or bottom rows of a data frame.

print(dfr.head(2))  # Display the first 2 rows
print(dfr.tail(1))  # Display the last row

Describe: We can get summary statistics for numerical columns by using the describe method.

print(dfr.describe())

Pandas GroupBy: It helps us perform group-wise operations on the DataFrame.

grouped_data = dfr.groupby('City')['Age'].mean()
print(grouped_data)
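
Because every city in the table above appears only once, each group holds a single row. The sketch below uses made-up rows with repeated cities so the grouping is visible, and computes several statistics per group with agg.

```python
import pandas as pds

# Hypothetical rows; the repeated cities make the grouping visible
people = pds.DataFrame({"City": ["Delhi", "Delhi", "Gurgaon", "Gurgaon"],
                        "Age": [20, 30, 32, 28]})

# One row per city, one column per statistic
summary = people.groupby("City")["Age"].agg(["mean", "min", "max", "count"])
print(summary)
```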

Sorting: Sort the data frame based on one or more columns.

sorted_df = dfr.sort_values(by='Age', ascending=False)
print(sorted_df)

Handling Missing Values: Methods like dropna or fillna assist in identifying and handling missing values.

dfr.iloc[1, 1] = None  # Introducing a missing value
print(dfr.dropna())  # Drop rows with missing values
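
Before dropping rows, it often helps to count what is missing, and sometimes filling is preferable to dropping. Here is a minimal sketch; replacing the missing age with the column mean is just one possible choice and is an assumption here.

```python
import pandas as pds

dfr = pds.DataFrame({"Name": ["Soumya", "Meenakshi", "Manya"],
                     "Age": [25, None, 20],
                     "City": ["Bangalore", "Gurgaon", "Delhi"]})

# Count the missing values in each column
missing_per_column = dfr.isna().sum()
print(missing_per_column)

# Fill only the 'Age' column, using the mean of the known ages
filled = dfr.fillna({"Age": dfr["Age"].mean()})
print(filled)
```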

Merging and Concatenating: Combine multiple DataFrames.

# Concatenation
dfr2 = pds.DataFrame({'Name': ['David'], 'Age': [28], 'City': ['Chicago']})
concatenated_df = pds.concat([dfr, dfr2])
print(concatenated_df)
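
Concatenation stacks rows, while merging matches rows on a shared column. As a rough sketch of a merge, the example below joins the people table with a hypothetical `states` lookup table that is not part of the original dataset.

```python
import pandas as pds

people = pds.DataFrame({"Name": ["Soumya", "Meenakshi", "Manya"],
                        "City": ["Bangalore", "Gurgaon", "Delhi"]})

# Hypothetical lookup table; Gurgaon is left out on purpose
states = pds.DataFrame({"City": ["Bangalore", "Delhi"],
                        "State": ["Karnataka", "Delhi"]})

# A left merge keeps every row of 'people'; unmatched cities get NaN
merged = people.merge(states, on="City", how="left")
print(merged)
```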

Aggregation: DataFrame methods like mean, sum, min, and max allow for aggregate operations.

print(dfr.mean(numeric_only=True))  # Mean of the numeric columns only

With these insights, you’re now equipped to harness the power of Pandas DataFrames for efficient data analysis in Python.

Indexing and Accessing Data Elements

Let’s get hands-on with the Pandas Series and DataFrame to see how indexing fits in for practical scenarios.

Series Indexing

Imagine we’re handling sales data for a product, month by month. Now, we want to figure out how to quickly find the sales for March, May, or any month.

import pandas as pds

# Sample data: Monthly sales of a product
sales_data = [150, 200, 180, 250, 300]
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May']

# Creating a Pandas Series
sales_series = pds.Series(sales_data, index=months, name='Monthly Sales')

# Displaying the Series
print("Original Sales Series:")
print(sales_series)
print()

# Use indexing for data points by label
print("Sales in March:", sales_series['Mar'])
print("Sales in May:", sales_series['May'])
print()

# Use iloc for data points by position
print("Sales in the second month (Feb):", sales_series.iloc[1])
print("Sales in the last month (May):", sales_series.iloc[-1])
print()

# Slicing the Series
print("Sales from Jan to Mar:")
print(sales_series['Jan':'Mar'])
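
One detail worth checking for yourself: a label slice includes its end label, while a positional slice excludes its end position. A small sketch on the same sales data:

```python
import pandas as pds

sales_series = pds.Series([150, 200, 180, 250, 300],
                          index=['Jan', 'Feb', 'Mar', 'Apr', 'May'],
                          name='Monthly Sales')

by_label = sales_series.loc['Jan':'Mar']  # includes 'Mar'
by_position = sales_series.iloc[0:2]      # excludes position 2

print(by_label)
print(by_position)
```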

DataFrame Indexing

Now suppose we track both sales and expenses for a business and want to look up the figures for March, May, or any other month in a flash. Let’s use a Pandas DataFrame to achieve this.

import pandas as pds

# Imagine we're dealing with monthly sales and expenses data for a business
data = {'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May'],
        'Sales': [150, 200, 180, 250, 300],
        'Expenses': [80, 90, 70, 100, 120]}

# Creating a Pandas data frame
sales_df = pds.DataFrame(data)

# Displaying the data frame
print("Original Sales DataFrame:")
print(sales_df)
print()

# Using indexing for specific info
print("Sales in March:", sales_df.loc[2, 'Sales'])  # Row with index label 2 (Mar), column 'Sales'
print("Sales in May:", sales_df.at[4, 'Sales'])  # Row with index label 4 (May), column 'Sales'
print()

# Slicing the data frame (a loc slice includes the end label, so :2 covers rows 0, 1, and 2)
print("Sales and Expenses from Jan to Mar:")
print(sales_df.loc[:2, ['Month', 'Sales', 'Expenses']])
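
Looking rows up by number works, but it means remembering which row holds which month. One possible alternative is to make the Month column the index, so rows can be selected by name. The variable `by_month` is introduced here only for illustration.

```python
import pandas as pds

sales_df = pds.DataFrame({'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May'],
                          'Sales': [150, 200, 180, 250, 300],
                          'Expenses': [80, 90, 70, 100, 120]})

# Use the month names as row labels
by_month = sales_df.set_index('Month')

print(by_month.loc['Mar', 'Sales'])        # a single value
print(by_month.loc['Mar':'May', 'Sales'])  # a label slice, 'May' included
print(by_month.loc[['Mar', 'May']])        # selected rows, all columns
```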

Basic Operations on Series and DataFrames

Let’s now quickly go through some important basic operations that we might have to perform on Series and DataFrames.

Series Operations

At times, we may have to apply arithmetic, logical, and comparison operators directly on Series objects. The example below provides a glimpse of these operations.

sr1 = pds.Series([1, 2, 3])
sr2 = pds.Series([4, 5, 6])
print(sr1 + sr2)        # Addition
print(sr1 * 2)          # Multiplication
print(sr1 > sr2)        # Comparison
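
The two Series above share the same default index, so the elements pair up by position. When the labels differ, arithmetic aligns on the labels instead. The sketch below uses made-up labels to show this, along with the add method and its fill_value parameter for treating missing labels as zero.

```python
import pandas as pds

a = pds.Series([1, 2, 3], index=["x", "y", "z"])
b = pds.Series([10, 20, 30], index=["y", "z", "w"])

# Values are matched by label; labels found in only one Series give NaN
total = a + b
print(total)

# Treat a missing label as 0 instead of producing NaN
total_filled = a.add(b, fill_value=0)
print(total_filled)
```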

DataFrame Operations

Similarly, we can perform the arithmetic or comparison operations on elements in DataFrames.

dfr1 = pds.DataFrame({"A": [11, 22, 34], "B": [14, 55, 36]})
dfr2 = dfr1 * 2
print(dfr2)

# Conditional multiplication
dfr2 = dfr1.copy()
dfr2.loc[dfr2['A'] > 1, 'A'] *= 2

print("Original DataFrame:")
print(dfr1)
print("\nDataFrame after conditional multiplication:")
print(dfr2)

Please note that in the first case, the * 2 operation is applied to every element of the DataFrame dfr1. It multiplies both columns ‘A’ and ‘B’ by 2 in every row, regardless of any condition.

In the second case, the data frame dfr2 is created as a copy of dfr1. It then multiplies values in column ‘A’ by 2, but only for rows where the value in ‘A’ is greater than 1. Since every value in ‘A’ exceeds 1 here, all of them are doubled; a higher threshold would leave some rows unchanged.

Aggregation Functions

In Pandas Series, we can find the total and average of all the numbers. For DataFrames, it’s like summarizing columns, telling us the overall picture of our data. We can even tweak specific values based on certain conditions, giving us more control over our information.

In the example below, we find the sum and mean of the numbers in a Series, summarize the columns of a data frame, and selectively adjust values in one column based on a condition.

import pandas as pds

# Series Aggregation
sr1 = pds.Series([10, 20, 30, 40], name='Numbers')

# Aggregation operations on the Series
sum_sr1 = sr1.sum()
mean_sr1 = sr1.mean()

print("Original Series:")
print(sr1)
print("\nSum of Series:", sum_sr1)
print("Mean of Series:", mean_sr1)

# DataFrame Aggregation
dfr1 = pds.DataFrame({"A": [11, 22, 34], "B": [14, 55, 36]})

# Aggregation operations on the DataFrame
sum_dfr1 = dfr1.sum()
mean_dfr1 = dfr1.mean()

print("\nOriginal DataFrame:")
print(dfr1)
print("\nSum of DataFrame:")
print(sum_dfr1)
print("\nMean of DataFrame:")
print(mean_dfr1)

# Conditional update in DataFrame: double the values in 'A' that exceed 20
dfr2 = dfr1.copy()
dfr2.loc[dfr2['A'] > 20, 'A'] *= 2

print("\nDataFrame after conditional update:")
print(dfr2)
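
The last step above changes values rather than summarizing them. If the goal is to aggregate only the rows that meet a condition, one way to sketch it is to filter first and then aggregate. The variable names below are illustrative.

```python
import pandas as pds

dfr1 = pds.DataFrame({"A": [11, 22, 34], "B": [14, 55, 36]})

# Sum of column 'B' restricted to the rows where 'A' is greater than 20
sum_b_when_a_large = dfr1.loc[dfr1["A"] > 20, "B"].sum()
print(sum_b_when_a_large)  # 91

# Several aggregations per column in a single call
stats = dfr1.agg(["sum", "mean", "max"])
print(stats)
```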

Python Pandas – Common Use Cases and Examples

Let’s explore Python Pandas with practical examples. We’ll uncover its versatility in data cleaning, analysis, and exploration.

Data Cleaning

Let’s understand data cleaning with the help of the example below. We’re using Python Pandas to create a table of products with prices and quantities. After displaying the original table, we’ll clean the data by removing currency symbols and any rows with missing values.

import pandas as pds

# Creating a sample DataFrame
data = {"Product": ["Apple", "Banana", "Orange", "Grapes", None],
        "Price": ["$2.50", "$1.20", "$3.00", None, "$4.50"],
        "Quantity": [10, 15, None, 8, 12]}

dfr = pds.DataFrame(data)

# Displaying the original DataFrame
print("Original DataFrame:")
print(dfr)

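# Data Cleaning
# Remove the currency symbol and convert the text to numbers
# ("$" is a regex anchor, so regex=False is needed to match it literally)
dfr["Price"] = pds.to_numeric(dfr["Price"].str.replace("$", "", regex=False))
# Remove rows with missing values
dfr.dropna(inplace=True)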

# Displaying the cleaned DataFrame
print("\nDataFrame after Data Cleaning:")
print(dfr)
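
Dropping every incomplete row leaves only two of the five products here. As a hedged alternative, the sketch below drops only the rows without a product name and fills the other gaps. Filling a missing price with the mean and a missing quantity with 0 are assumptions made for illustration, not rules.

```python
import pandas as pds

data = {"Product": ["Apple", "Banana", "Orange", "Grapes", None],
        "Price": ["$2.50", "$1.20", "$3.00", None, "$4.50"],
        "Quantity": [10, 15, None, 8, 12]}
dfr = pds.DataFrame(data)

# Strip the literal "$" and convert the text to floats; missing prices become NaN
dfr["Price"] = pds.to_numeric(dfr["Price"].str.replace("$", "", regex=False))

# Drop only the rows that lack a product name
cleaned = dfr.dropna(subset=["Product"]).copy()

# Fill the remaining gaps (assumed defaults: mean price, zero quantity)
cleaned["Price"] = cleaned["Price"].fillna(cleaned["Price"].mean())
cleaned["Quantity"] = cleaned["Quantity"].fillna(0)
print(cleaned)
```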

Exploring and Filtering Data

The Pandas library works well for exploring and filtering large datasets. Let’s see how, continuing with the cleaned product table:

Filtering Data:

# Make sure Price is numeric before comparing it with numbers
dfr["Price"] = pds.to_numeric(dfr["Price"].astype(str).str.replace("$", "", regex=False))

# Filter based on a condition
expensive_products = dfr[dfr["Price"] > 2]
print(expensive_products)

# Combine conditions with boolean indexing
filtered_df = dfr.loc[(dfr["Price"] > 1) & (dfr["Product"] != "Banana")]
print(filtered_df)

Sorting Data:

sorted_df = dfr.sort_values(by="Price", ascending=False)
print(sorted_df)

Grouping and Aggregation:

grouped_df = dfr.groupby("Product")["Price"].mean()
print(grouped_df)

Visualizing Data with Matplotlib

Combine Matplotlib with Pandas for insightful visualizations:

import matplotlib.pyplot as plt

plt.bar(dfr["Product"], dfr["Price"])
plt.xlabel("Product")
plt.ylabel("Price")
plt.title("Product Prices")
plt.show()

Reading and Writing Data Files

Using Pandas, we can easily read and write data in various formats, such as CSV and Excel.

# Read CSV file (assumes data.csv exists in the working directory)
dfr = pds.read_csv("data.csv")

# Write DataFrame to Excel (requires an Excel engine such as openpyxl)
dfr.to_excel("output.xlsx")
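
If you don’t have a data.csv file at hand, you can still try a write-then-read round trip. The sketch below writes to an in-memory text buffer instead of a file on disk; the small product table is made up for illustration.

```python
import io
import pandas as pds

dfr = pds.DataFrame({"Product": ["Apple", "Banana"], "Price": [2.5, 1.2]})

# Write the table as CSV text into an in-memory buffer
buffer = io.StringIO()
dfr.to_csv(buffer, index=False)  # index=False leaves the row labels out
buffer.seek(0)                   # rewind before reading

# Read the same CSV text back into a new DataFrame
restored = pds.read_csv(buffer)
print(restored)
```

Passing a file path such as "output.csv" instead of the buffer writes to disk in the same way.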

Going Beyond the Basics

This tutorial only scratches the surface of Pandas. As you delve deeper, explore powerful features like:

Merging and joining DataFrames: Combine data from multiple sources
Handling time series data: Analyze data with a time-based index
Advanced indexing and selection: Use complex indexing techniques
Data transformation and manipulation: Reshape and modify data with ease

Python’s Pandas library gives you a powerful tool to play around with data. This tutorial shared some basic ideas and showed examples. Just remember, the more you practice, the better you’ll get. Try out different things with real data, and Pandas can become your go-to for handling data in your projects.

Happy coding,
TechBeamers.
