Create a DataFrame from List

Easy
Topics:
Arrays

Write a solution to create a DataFrame from a 2D list called student_data. This 2D list contains the IDs and ages of some students.

The DataFrame should have two columns, student_id and age, and be in the same order as the original 2D list.

The result format is in the following example.

Example 1:

Input:
student_data:
[
  [1, 15],
  [2, 11],
  [3, 11],
  [4, 20]
]
Output:
+------------+-----+
| student_id | age |
+------------+-----+
| 1          | 15  |
| 2          | 11  |
| 3          | 11  |
| 4          | 20  |
+------------+-----+
Explanation:
A DataFrame was created on top of student_data, with two columns named student_id and age.

Solution


Clarifying Questions

In a real interview this question is often left ambiguous (especially at FAANG). If it is, ask these questions:

  1. What data types can I expect in the `data` list? Can the inner lists contain mixed data types (e.g., strings, integers, booleans)?
  2. Will the length of each inner list in `data` always match the length of the `columns` list?
  3. Can the `data` list or the `columns` list be empty or null?
  4. Is the order of the columns in the DataFrame significant, or can I rearrange them as needed?
  5. What is the expected output DataFrame format? Should I create it using standard libraries or can I make my own DataFrame class?

Brute Force Solution

Approach

The brute force way to create a DataFrame from a list is like manually building a table. We go row by row, placing each element into the next free cell until the DataFrame is complete, then check that the data fits the requested shape exactly.

Here's how the algorithm would work step-by-step:

  1. Take the first piece of data from the list.
  2. Place it in the first cell of the DataFrame.
  3. Take the next piece of data from the list.
  4. Place it in the next available cell of the DataFrame (moving to the next column or row as needed).
  5. Repeat this process until all the data from the list is placed into the DataFrame.
  6. If the DataFrame must have a defined shape, verify that the data fills that shape exactly; otherwise this step is unnecessary.
  7. If the data runs out early or elements are left over, report an error instead of returning a partial table.
  8. Confirm all the data is correctly placed according to your requirements. This is your solution.

Code Implementation

def create_dataframe_brute_force(data, number_of_rows, number_of_columns):
    """Fill a number_of_rows x number_of_columns table from a flat list, row by row."""
    dataframe = []

    current_index = 0

    # Construct the dataframe row by row
    for row_index in range(number_of_rows):
        row = []

        for column_index in range(number_of_columns):
            # Place each data element into the next cell
            if current_index >= len(data):
                # Raise instead of returning a string so callers always get a list back
                raise ValueError("Data insufficient for the specified shape")
            row.append(data[current_index])
            current_index += 1

        dataframe.append(row)

    # Ensure all data elements have been placed
    if current_index != len(data):
        raise ValueError("Shape does not fit the data")

    return dataframe
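
The function above assumes a flat list plus an explicit shape. The problem's input is already a 2D list, so the same row-by-row idea can be sketched without the shape arguments. This is a minimal pure-Python sketch, not the pandas answer; the helper name `rows_to_records` and the list-of-dictionaries output are illustrative assumptions.

```python
def rows_to_records(student_data, column_names):
    """Build a list of row dictionaries from a 2D list, one cell at a time."""
    records = []
    for row in student_data:
        # Each inner list must supply exactly one value per column.
        if len(row) != len(column_names):
            raise ValueError("Row length does not match the number of columns")
        record = {}
        for column_name, value in zip(column_names, row):
            record[column_name] = value
        records.append(record)
    return records


student_data = [[1, 15], [2, 11], [3, 11], [4, 20]]
records = rows_to_records(student_data, ["student_id", "age"])
print(records[0])  # {'student_id': 1, 'age': 15}
```

Rows are appended in input order, so the result keeps the same order as the original 2D list.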

Big(O) Analysis

Time Complexity
O(n)
The function makes a single pass. The nested loops visit each of the number_of_rows × number_of_columns cells once and do constant work per cell (a bounds check and an append). With n as that cell count, which equals the length of the input list whenever the shape fits, the running time is O(n). Nothing is rearranged or retried: if the data does not fit the shape, the function stops with an error, so the cost never exceeds linear.
Space Complexity
O(n)
The function builds a new list of rows that stores all n elements from the input list, so the space used grows linearly with n. Apart from that output, it keeps only the current row and an index counter.

Optimal Solution

Approach

The goal is to efficiently organize data from a simple list into a structured table-like format, commonly called a DataFrame. We'll achieve this by systematically taking data from the list and arranging it into columns based on the headers provided.

Here's how the algorithm would work step-by-step:

  1. First, identify the column names or headers that will define the structure of the DataFrame. These act as labels for each column of data.
  2. Next, take the data from the list and organize it sequentially. This involves assigning each element of the list to a particular column, repeating the header structure as needed.
  3. If the list has fewer elements than the DataFrame has cells (desired rows × number of columns), fill the remaining cells with a standard placeholder value, such as 'empty' or 'null', to avoid errors.
  4. Finally, package the column headers and the organized data together in the DataFrame structure. The headers define the columns, and the data populates the rows underneath.

Code Implementation

def create_dataframe_from_list(data_list, column_names, desired_rows):
    dataframe = []
    number_of_columns = len(column_names)

    # Iterate through the desired number of rows.
    for row_index in range(desired_rows):
        row = {}

        # Assign data to columns based on the header structure.
        for column_index, column_name in enumerate(column_names):
            list_index = (row_index * number_of_columns) + column_index

            # Ensure we don't exceed the bounds of the input list.
            if list_index < len(data_list):
                row[column_name] = data_list[list_index]

            # Use 'empty' as a placeholder if data is missing.
            else:
                row[column_name] = 'empty'
        dataframe.append(row)

    # Returns a list of dictionaries representing the DataFrame.
    return dataframe
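
For the 2D `student_data` input in the problem statement, pandas can do this packaging in a single constructor call. The sketch below assumes pandas is installed; the function name `create_dataframe` is illustrative, since the source does not specify one.

```python
import pandas as pd


def create_dataframe(student_data):
    """Wrap a 2D list of [student_id, age] rows in a DataFrame, keeping row order."""
    return pd.DataFrame(student_data, columns=["student_id", "age"])


student_data = [[1, 15], [2, 11], [3, 11], [4, 20]]
students = create_dataframe(student_data)
print(students)
```

Each inner list becomes one row and the `columns` argument supplies the headers, so no manual index arithmetic or placeholder handling is needed for well-formed input.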

Big(O) Analysis

Time Complexity
O(n)
The nested loops visit every cell of the output once: desired_rows × len(column_names) cells, each filled in constant time by an index computation and a dictionary assignment. With n defined as that cell count, the running time is O(n). Note that n is set by the requested shape rather than the input length: when the list is shorter than the table, the extra cells are still visited and filled with the placeholder.
Space Complexity
O(n)
The function builds a new list of row dictionaries holding one entry per cell, including the placeholder entries. With n as the total number of cells in the final DataFrame (desired rows × number of columns), the output takes O(n) space. Beyond the output, only a few loop variables are needed.

Edge Cases

| Case | How to Handle |
| --- | --- |
| `data` is null or None | Return an empty DataFrame or raise an appropriate exception such as ValueError. |
| `columns` is null or None | Return an empty DataFrame or raise an appropriate exception such as ValueError. |
| `data` is an empty list | Return an empty DataFrame with the provided column names. |
| `columns` is an empty list | If `data` is not empty, raise an exception since no column names are provided. |
| Number of columns does not match the number of elements in each row of `data` | Raise an exception such as ValueError indicating a mismatch between data and columns. |
| `data` contains rows of varying lengths | Raise an exception such as ValueError indicating inconsistent data row lengths. |
| `columns` contains duplicate column names | Raise an exception or rename the duplicate columns with suffixes like _1, _2, etc. |
| `data` contains non-primitive data types or mixed data types in a column | The DataFrame should handle different data types (strings, integers, floats, booleans), and type coercion should be explicit if necessary. |
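
Several of these checks can be sketched as one validation helper that runs before the DataFrame is built. The function name `validate_table_input` and the choice to raise `ValueError` in every failing case are assumptions for illustration; the cases above also allow returning an empty DataFrame or renaming duplicates instead.

```python
def validate_table_input(data, columns):
    """Raise ValueError if data and columns cannot form a rectangular table."""
    if data is None or columns is None:
        raise ValueError("data and columns must not be None")
    if len(set(columns)) != len(columns):
        raise ValueError("columns contains duplicate names")
    if data and not columns:
        raise ValueError("data is non-empty but no column names were given")
    # One check covers both a column-count mismatch and rows of varying lengths.
    for row_number, row in enumerate(data):
        if len(row) != len(columns):
            raise ValueError(
                f"row {row_number} has {len(row)} values, expected {len(columns)}"
            )


# An empty data list is valid: it describes a table with headers but no rows.
validate_table_input([], ["student_id", "age"])
validate_table_input([[1, 15], [2, 11]], ["student_id", "age"])

try:
    validate_table_input([[1, 15], [2]], ["student_id", "age"])
except ValueError as error:
    print(error)  # row 1 has 1 values, expected 2
```

Mixed data types within a column are deliberately not checked here, since whether to coerce or reject them depends on the answers to the clarifying questions.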