Hi,
I recently received this coding question in an interview:
Compute median in a distributed system. The data is stored on multiple nodes.
How would you approach it? How would you ensure that you can code up a basic solution during an interview? Please provide a pseudocode.
Thank you for your answers in advance.
I'm not sure what the question means. Can you share an example? Is this intended to be a DSA problem or System Design?
A median is a value that has half the dataset above it and half the dataset below it. So I think maybe the question is to handle the scenario where there's so much data that it can't fit in memory or even in one computer.
Related LeetCode problem: https://leetcode.com/problems/find-median-from-data-stream/description/
I'm just stating the question in the same way as the interviewer stated the question. It was ML coding interview, so the output should be a code, and not just high level design. I tried to probe deeper to figure out what the problem was about, and it wasn't the well-known median in a stream problem. The data is stored on multiple machines (nodes), and you want to find median. Imagine that you have 100 machines, and each machine stores a huge file with a lot of integers. Your task is to find the median of all of these numbers on the 100 machines. I'm not sure what was exactly expected. I failed the interview. The interviewer hinted that I should use bucketing. How would you do it?
@OP this is definitely a leetcode hard: https://leetcode.com/discuss/post/444925/amazon-onsite-find-median-in-a-distribut-kp0d/
It definitely looks like DSA rounds are getting harder and harder. Amazon especially has been asking really difficult problems lately. Unfortunately its no longer the case that knowing common patterns is enough. A lot of companies are asking more niche questions that involve specific algorithms for specific use cases. Specific types of sorts, MSTs, cyclic array problems. Dijkstras