Reshape Data: Pivot

#937 Most Asked · Easy
DataFrame weather
+-------------+--------+
| Column Name | Type   |
+-------------+--------+
| city        | object |
| month       | object |
| temperature | int    |
+-------------+--------+

Write a solution to pivot the data so that each row represents temperatures for a specific month, and each city is a separate column.

The result format is in the following example.

Example 1:
Input:
+--------------+----------+-------------+
| city         | month    | temperature |
+--------------+----------+-------------+
| Jacksonville | January  | 13          |
| Jacksonville | February | 23          |
| Jacksonville | March    | 38          |
| Jacksonville | April    | 5           |
| Jacksonville | May      | 34          |
| ElPaso       | January  | 20          |
| ElPaso       | February | 6           |
| ElPaso       | March    | 26          |
| ElPaso       | April    | 2           |
| ElPaso       | May      | 43          |
+--------------+----------+-------------+
Output:
+----------+--------+--------------+
| month    | ElPaso | Jacksonville |
+----------+--------+--------------+
| April    | 2      | 5            |
| February | 6      | 23           |
| January  | 20     | 13           |
| March    | 26     | 38           |
| May      | 43     | 34           |
+----------+--------+--------------+
Explanation:
The table is pivoted so that each column represents a city and each row represents a specific month.

Solution


Clarifying Questions

When you're asked this question in a real interview, the requirements will often be ambiguous (especially at FAANG). Make sure to ask clarifying questions like these:

  1. Is it guaranteed that each combination of `city` and `month` will be unique in the input? If duplicates exist, how should the temperatures be aggregated?
  2. If the data is sparse (for example, one city has data for a month that another doesn't), what value should be placed in the cell for the missing city-month combination?
  3. Does the order of the rows or columns in the output matter? For instance, should the months and cities be sorted alphabetically as they appear in the example?
  4. How should the solution handle an empty input DataFrame? Should it return an empty DataFrame, and if so, what should its columns be?
  5. Can any of the columns in the input DataFrame contain null or `None` values, and if so, how should they be handled during the pivot operation?

Brute Force Solution

Approach

The basic strategy is to build the new, reshaped table from scratch. We first scan the original data to figure out what the new rows and columns should be, and then, for every cell of the new table, we rescan the original data to find the value that belongs there.

Here's how the algorithm would work step-by-step:

  1. First, look through the entire original table just to find all the unique items that will become the rows in our new table.
  2. Next, look through the original table a second time, this time to find all the unique items that will become the new columns.
  3. With this information, create a new, empty grid or table with the rows and columns you just identified.
  4. Now, go back to the original table and examine its very first entry.
  5. Find the cell in your new empty table that corresponds to that entry's specific combination of row and column categories.
  6. Place the value from that original entry into that correct cell.
  7. Repeat this process for every single entry in the original table, one by one, until every value has been moved.
  8. The newly filled table is the final reshaped data.

Code Implementation

def reshape_data_pivot_brute_force(data, row_key, pivot_column, value_column):
    unique_row_labels = sorted(set(item[row_key] for item in data))
    unique_column_headers = sorted(set(item[pivot_column] for item in data))

    pivoted_data_result = []

    # Build the pivoted table one cell at a time, iterating over every
    # new row and column combination.
    for current_row_label in unique_row_labels:
        new_row_dictionary = {row_key: current_row_label}
        for current_column_header in unique_column_headers:
            value_for_cell = None

            # For each cell, a full scan of the original data is required
            # to find the one matching record.
            for original_data_row in data:
                if (original_data_row[row_key] == current_row_label
                        and original_data_row[pivot_column] == current_column_header):
                    value_for_cell = original_data_row[value_column]
                    break

            new_row_dictionary[current_column_header] = value_for_cell

        pivoted_data_result.append(new_row_dictionary)

    return pivoted_data_result
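
To make the brute force concrete, here is a small, hypothetical driver (the records and the printed result are illustrative, not part of the original solution):

weather_records = [
    {'city': 'Jacksonville', 'month': 'January', 'temperature': 13},
    {'city': 'ElPaso', 'month': 'January', 'temperature': 20},
    {'city': 'ElPaso', 'month': 'February', 'temperature': 6},
]

result = reshape_data_pivot_brute_force(
    weather_records, row_key='month', pivot_column='city', value_column='temperature'
)
print(result)
# [{'month': 'February', 'ElPaso': 6, 'Jacksonville': None},
#  {'month': 'January', 'ElPaso': 20, 'Jacksonville': 13}]

Note that the missing Jacksonville reading for February is filled with None, which is one reasonable answer to the sparse-data clarifying question above.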

Big(O) Analysis

Time Complexity
O(n × r × c). Let n be the number of entries in the original table, r the number of unique row labels, and c the number of unique column headers. After two initial scans to collect the labels, the algorithm fills r × c cells, and for each cell it performs a full scan of up to n original entries to find the matching record, so the total work is proportional to r × c × n. For the weather example (n = 10, r = 5 months, c = 2 cities), that is 5 × 2 × 10 = 100 comparisons. In the worst case, where r and c both grow with n, this degrades toward O(n³), so the approach scales poorly.
Space Complexity
O(r × c). The algorithm's space is determined by three new data structures. Two lists hold the unique row labels (r items) and the unique column headers (c items), requiring O(r + c) space. The dominant allocation is the pivoted result itself: r row dictionaries, each holding c cell values, for O(r × c) in total, which makes the result the determining factor for the overall space complexity.

Optimal Solution

Approach

The strategy is to group all the scattered information for each unique item together. Then, for each item, we'll build a single, wide row by taking its individual pieces of information and placing them into their correct new columns.

Here's how the algorithm would work step-by-step:

  1. First, go through the original data once and sort it into piles, where each pile contains all the information for one specific item. The unique categories that will become the new columns emerge during this same pass.
  2. For each pile, start creating a new, wider row in the final table, beginning with the item's unique identifier.
  3. Look at each piece of information within that item's pile. Each piece tells you a category and a value.
  4. Find the column in the new row that matches the category, and place the corresponding value there.
  5. Continue this process for all the pieces of information in the current pile, filling out the new row.
  6. Once one item's row is complete, move on to the next pile and repeat the process until every item has its own consolidated row in the final table.

Code Implementation

def reshape_data_pivot(data, row_key, pivot_column, value_column):
    # This dictionary stores the newly structured records, keyed by the
    # unique row label (the month, in this problem).
    pivoted_records_map = {}

    for data_point in data:
        row_label = data_point[row_key]
        category_name = data_point[pivot_column]
        value_to_assign = data_point[value_column]

        # Create a new, blank record the first time a row label is seen.
        if row_label not in pivoted_records_map:
            pivoted_records_map[row_label] = {}

        pivoted_records_map[row_label][category_name] = value_to_assign

    # Finally, transform the intermediate map into the required
    # list-of-records output format.
    final_reshaped_list = []
    for row_label, record_attributes in pivoted_records_map.items():
        newly_formed_record = {row_key: row_label}
        newly_formed_record.update(record_attributes)
        final_reshaped_list.append(newly_formed_record)

    return final_reshaped_list
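
A matching driver for the optimal version (again hypothetical), which also highlights a behavioral difference from the brute force:

result = reshape_data_pivot(
    [
        {'city': 'Jacksonville', 'month': 'April', 'temperature': 5},
        {'city': 'ElPaso', 'month': 'April', 'temperature': 2},
        {'city': 'ElPaso', 'month': 'May', 'temperature': 43},
    ],
    row_key='month', pivot_column='city', value_column='temperature',
)
print(result)
# [{'month': 'April', 'Jacksonville': 5, 'ElPaso': 2}, {'month': 'May', 'ElPaso': 43}]

Unlike the brute force, this version never enumerates the full set of columns, so a month with no reading for some city simply omits that key rather than storing None. Whether that is acceptable is exactly the sparse-data clarifying question from earlier.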

Big(O) Analysis

Time Complexity
O(n). Let n be the number of rows in the original table. The solution makes a single pass over all n rows, grouping each one under its row label with O(1) average-cost dictionary operations. A second loop then walks the grouped map to build the output, touching each stored value exactly once. Since both passes are linear, the total runtime is O(n).
Space Complexity
O(n). The main use of auxiliary space is the intermediate hash map used to "sort the data into piles", which groups every value under its row label. The map stores one entry for every row in the original data, so its size is directly proportional to n, the total number of rows in the input table. The output list holds the same data in its final shape, so the overall auxiliary space complexity is O(n).

Edge Cases

The input DataFrame is empty.
How to Handle:
The pivot operation will correctly return an empty DataFrame with no rows or columns.
Duplicate (city, month) pairs exist in the input data.
How to Handle:
A standard pivot function (such as pandas' DataFrame.pivot) will raise an error because it cannot decide which temperature value to place in the resulting cell; an aggregating pivot such as pivot_table must be used instead (see the pandas sketch after this list).
A city is missing temperature data for a month that is present for other cities.
How to Handle:
The resulting cell for that city and month combination will be filled with a null value (NaN).
Null values are present in the 'city' or 'month' columns.
How to Handle:
Rows containing nulls in the columns designated for the new index or columns are typically excluded from the pivot operation.
The input contains data for only a single unique city or a single unique month.
How to Handle:
The solution correctly produces a pivoted DataFrame with just one data column or one data row, respectively.
Inconsistent capitalization exists in 'city' or 'month' strings (e.g., 'ElPaso' vs 'elpaso').
How to Handle:
The pivot operation is case-sensitive and will treat differently cased strings as distinct categories, creating separate columns or rows.
The 'temperature' column itself contains null values.
How to Handle:
Any null temperature values from the input are preserved and placed into the corresponding cells of the output DataFrame.
The input DataFrame is extremely large, leading to a very wide or long output.
How to Handle:
The memory usage scales with the product of unique months and cities, which can lead to memory errors for large, sparse datasets.
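
Since the problem statement supplies a pandas DataFrame, it is worth showing how the pivot maps onto pandas itself. This is a minimal sketch, assuming unique (city, month) pairs; pivot_table is the fallback when duplicates are possible, as noted in the edge cases above:

import pandas as pd

def pivot_weather(weather: pd.DataFrame) -> pd.DataFrame:
    # Each month becomes a row, each city becomes a column,
    # and temperatures fill the cells.
    return weather.pivot(index='month', columns='city', values='temperature')

def pivot_weather_with_duplicates(weather: pd.DataFrame) -> pd.DataFrame:
    # pivot_table aggregates duplicate (city, month) pairs instead of
    # raising; mean is one reasonable choice of aggregation.
    return weather.pivot_table(
        index='month', columns='city', values='temperature', aggfunc='mean'
    )

pandas handles the grouping and cell placement internally, in the same spirit as the optimal hash-map solution, and fills missing city-month combinations with NaN.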