Taro Logo

Reshape Data: Pivot

Easy
Meta logo
Meta
2 views
Topics:
Database Problems

You are given a Pandas DataFrame called weather with the following structure:

weather
+-------------+--------+
| Column Name | Type   |
+-------------+--------+
| city        | object |
| month       | object |
| temperature | int    |
+

Write a function to pivot the DataFrame such that each row represents temperatures for a specific month, and each city is a separate column. The order of the months in the output should be sorted alphabetically.

Example:

Input:

weather
+--------------+----------+-------------+
| city         | month    | temperature |
+--------------+----------+-------------+
| Jacksonville | January  | 13          |
| Jacksonville | February | 23          |
| Jacksonville | March    | 38          |
| ElPaso       | January  | 20          |
| ElPaso       | February | 6           |
| ElPaso       | March    | 26          |
+

Expected Output:

output
+----------+--------+--------------+
| month    | ElPaso | Jacksonville |
+----------+--------+--------------+
| February | 6      | 23           |
| January  | 20     | 13           |
| March    | 26     | 38           |
+

Constraints:

  • The input DataFrame will always have the columns city, month, and temperature.
  • The month column will contain unique month names.
  • The city column will contain city names.
  • The temperature column will contain integer values.

Explain the time and space complexity of your solution.

Solution


Pivoting a DataFrame: Temperatures by City and Month

This problem focuses on reshaping a DataFrame to transform rows into columns based on specific values. We're given a table of weather data and need to pivot it so that each row represents temperatures for a specific month, and each city becomes a separate column.

1. Naive Approach: Iterative Processing

A straightforward but less efficient approach involves iterating through the DataFrame and manually constructing the pivoted table. We can use dictionaries or lists to store the temperatures for each city and month, then combine them into the desired format.

Algorithm:

  1. Create a dictionary to store the pivoted data. The keys of the dictionary will be months, and the values will be dictionaries containing city-temperature mappings.
  2. Iterate through the input DataFrame row by row.
  3. For each row, extract the city, month, and temperature.
  4. If the month is not already a key in the pivoted data dictionary, add it.
  5. Add the city-temperature mapping to the dictionary corresponding to the month.
  6. Finally, construct the output DataFrame from the pivoted data dictionary.

Code (Python):

import pandas as pd

def pivot_weather_naive(weather: pd.DataFrame) -> pd.DataFrame:
    pivoted_data = {}
    for index, row in weather.iterrows():
        city = row['city']
        month = row['month']
        temperature = row['temperature']
        
        if month not in pivoted_data:
            pivoted_data[month] = {}
        
        pivoted_data[month][city] = temperature
    
    # Create a list of dictionaries for the DataFrame
    data = []
    for month in pivoted_data:
        row = {'month': month}
        row.update(pivoted_data[month])
        data.append(row)
    
    return pd.DataFrame(data).set_index('month').sort_index()

Big O Analysis:

  • Time Complexity: O(N), where N is the number of rows in the DataFrame (due to iteration).
  • Space Complexity: O(N), where N is the number of rows in the DataFrame in the worst case, to store the pivoted data in a dictionary.

2. Optimal Approach: Using Pandas pivot function

The Pandas library provides a built-in pivot function that efficiently performs this transformation.

Algorithm:

  1. Use the pivot function with month as the index, city as the columns, and temperature as the values.
  2. Rename the index to 'month'.
  3. Sort the index to match the example.

Code (Python):

import pandas as pd

def pivot_weather(weather: pd.DataFrame) -> pd.DataFrame:
    pivoted = weather.pivot(index='month', columns='city', values='temperature')
    pivoted = pivoted.rename_axis(None, axis=1).reset_index()
    pivoted = pivoted.set_index('month').sort_index()
    return pivoted

Big O Analysis:

  • Time Complexity: O(N), where N is the number of rows in the DataFrame. The pivot function is generally optimized for performance.
  • Space Complexity: O(M * C), where M is the number of unique months and C is the number of unique cities. This represents the space required to store the pivoted table.

3. Edge Cases:

  • Missing Data: If there are missing temperature values for certain city-month combinations, the pivot function will insert NaN values. Handle NaN values by filling them with appropriate values (e.g., 0, the average temperature for that month, etc.) if necessary.
  • Duplicate Entries: If there are duplicate entries for the same city and month, the pivot function will raise an error. You may need to preprocess the data to handle duplicates (e.g., by taking the average temperature).
  • Empty DataFrame: If the input DataFrame is empty, the pivot function will return an empty DataFrame, handle this case if required.
  • Different Data Types: Ensure that the temperature column has a numeric data type. If it's a string, convert it to an integer or float before pivoting.

4. Example

Let's consider a DataFrame

data = {
    'city': ['Jacksonville', 'Jacksonville', 'ElPaso', 'ElPaso'],
    'month': ['January', 'February', 'January', 'February'],
    'temperature': [10, 20, 15, 25]
}

weather_df = pd.DataFrame(data)

print(pivot_weather(weather_df))

The code using pivot will produce the output DataFrame:

           ElPaso  Jacksonville
month                           
February      25            20
January       15            10