DataFrame weather
+-------------+--------+
| Column Name | Type |
+-------------+--------+
| city | object |
| month | object |
| temperature | int |
+-------------+--------+
Write a solution to pivot the data so that each row represents temperatures for a specific month, and each city is a separate column.
The result format is in the following example.
Example 1:
Input:
+--------------+----------+-------------+
| city | month | temperature |
+--------------+----------+-------------+
| Jacksonville | January | 13 |
| Jacksonville | February | 23 |
| Jacksonville | March | 38 |
| Jacksonville | April | 5 |
| Jacksonville | May | 34 |
| ElPaso | January | 20 |
| ElPaso | February | 6 |
| ElPaso | March | 26 |
| ElPaso | April | 2 |
| ElPaso | May | 43 |
+--------------+----------+-------------+
Output:
+----------+--------+--------------+
| month | ElPaso | Jacksonville |
+----------+--------+--------------+
| April | 2 | 5 |
| February | 6 | 23 |
| January | 20 | 13 |
| March | 26 | 38 |
| May | 43 | 34 |
+----------+--------+--------------+
Explanation:
The table is pivoted, each column represents a city, and each row represents a specific month.
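Since the problem statement describes a pandas DataFrame, the pivot itself can be expressed as a single library call. A minimal sketch (the function name `pivotTable` is an assumption based on typical platform stubs):

```python
import pandas as pd

def pivotTable(weather: pd.DataFrame) -> pd.DataFrame:
    # index='month' makes each month a row, columns='city' makes each city a
    # column, and values='temperature' fills the cells; reset_index turns the
    # month index back into an ordinary column, matching the expected output.
    return weather.pivot(index='month', columns='city',
                         values='temperature').reset_index()
```

Note that `pivot` sorts both the row index and the column labels alphabetically, which is why the example output lists April before February and ElPaso before Jacksonville.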
When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask clarifying questions in that case, for example how duplicate (city, month) pairs should be aggregated and whether missing combinations should appear as nulls.
The basic strategy is to build the new, reshaped table from scratch. We first scan the original data to figure out what the new rows and columns should be, and then we go through the original data again, piece by piece, to place each value into its correct location in the new table.
Here's how the algorithm would work step-by-step:
```python
def reshape_data_pivot_brute_force(data, row_key, pivot_column, value_column):
    unique_row_labels = sorted(set(item[row_key] for item in data))
    unique_column_headers = sorted(set(item[pivot_column] for item in data))
    pivoted_data_result = []
    # To build the pivoted table, we must iterate through each new row and column combination.
    for current_row_label in unique_row_labels:
        new_row_dictionary = {row_key: current_row_label}
        for current_column_header in unique_column_headers:
            value_for_cell = None
            # For each cell, a full scan of the original data is required to find the one correct value.
            for original_data_row in data:
                # This checks if a record matches the coordinates of the cell we are currently trying to fill.
                if original_data_row[row_key] == current_row_label and original_data_row[pivot_column] == current_column_header:
                    value_for_cell = original_data_row[value_column]
                    break
            new_row_dictionary[current_column_header] = value_for_cell
        pivoted_data_result.append(new_row_dictionary)
    return pivoted_data_result
```

The strategy is to group all the scattered information for each unique item together. Then, for each item, we'll build a single, wide row by taking its individual pieces of information and placing them into their correct new columns.
Here's how the algorithm would work step-by-step:
```python
def reshape_data_pivot(data_to_reshape):
    # This dictionary will store the newly structured records, using unique item identifiers as keys.
    pivoted_records_map = {}
    for data_point in data_to_reshape:
        item_identifier = data_point['item']
        category_name = data_point['column']
        value_to_assign = data_point['value']
        # This check ensures a new, blank record is created for an item the first time it is seen.
        if item_identifier not in pivoted_records_map:
            pivoted_records_map[item_identifier] = {}
        pivoted_records_map[item_identifier][category_name] = value_to_assign
    final_reshaped_list = []
    # Finally, we transform the intermediate map into the required list-of-records output format.
    for item_identifier, record_attributes in pivoted_records_map.items():
        newly_formed_record = {'item': item_identifier}
        newly_formed_record.update(record_attributes)
        final_reshaped_list.append(newly_formed_record)
    return final_reshaped_list
```

| Case | How to Handle |
|---|---|
| The input DataFrame is empty. | The pivot operation will correctly return an empty DataFrame with no rows or columns. |
| Duplicate (city, month) pairs exist in the input data. | A standard pivot function will raise an error because it cannot decide which temperature value to place in the resulting cell. |
| A city is missing temperature data for a month that is present for other cities. | The resulting cell for that city and month combination will be filled with a null value (NaN). |
| Null values are present in the 'city' or 'month' columns. | Rows containing nulls in the columns designated for the new index or columns are typically excluded from the pivot operation. |
| The input contains data for only a single unique city or a single unique month. | The solution correctly produces a pivoted DataFrame with just one data column or one data row, respectively. |
| Inconsistent capitalization exists in 'city' or 'month' strings (e.g., 'ElPaso' vs 'elpaso'). | The pivot operation is case-sensitive and will treat differently cased strings as distinct categories, creating separate columns or rows. |
| The 'temperature' column itself contains null values. | Any null temperature values from the input are preserved and placed into the corresponding cells of the output DataFrame. |
| The input DataFrame is extremely large, leading to a very wide or long output. | The memory usage scales with the product of unique months and cities, which can lead to memory errors for large, sparse datasets. |
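For the duplicate-pairs case in the table above, `DataFrame.pivot` raises a `ValueError`, whereas `DataFrame.pivot_table` with an explicit `aggfunc` resolves duplicates deterministically. A sketch of the difference (the `'mean'` policy here is an assumption; the right aggregation depends on the requirements you clarified):

```python
import pandas as pd

# Two readings for (ElPaso, January): plain pivot cannot decide which to keep.
weather = pd.DataFrame({
    'city': ['ElPaso', 'ElPaso', 'Jacksonville'],
    'month': ['January', 'January', 'January'],
    'temperature': [20, 22, 13],
})

try:
    weather.pivot(index='month', columns='city', values='temperature')
except ValueError:
    pass  # duplicate (month, city) entries make the reshape ambiguous

# pivot_table aggregates the duplicates instead; 'mean' is an assumed policy.
resolved = weather.pivot_table(index='month', columns='city',
                               values='temperature', aggfunc='mean').reset_index()
```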