The High Performance Computing and Artificial Intelligence (HPC/AI) team at Microsoft is building the next-generation distributed AI supercomputer. This senior software engineering role focuses on developing critical infrastructure for high-performance AI model training at scale.
As a Senior Software Engineer on the HPC & AI Infrastructure team, you'll work at the intersection of AI supercomputing and large-scale networking. You'll be responsible for building network automation tools, observability frameworks, and performance optimization systems that enable ultra-low latency and high throughput in distributed AI workloads.
The role involves working with cutting-edge technologies including InfiniBand, RoCE, and accelerated compute platforms (NVIDIA and AMD GPUs). You'll build core software infrastructure for telemetry, diagnostics, orchestration, and network configuration that ensures operational excellence at exascale.
This is an opportunity to shape how advanced AI models are trained and deployed in the cloud, working with hardware, infrastructure, and ML platform teams. The position offers competitive compensation (base salary range of $119,800 to $234,700), comprehensive benefits, and the chance to work on systems that push the boundaries of AI infrastructure.
Microsoft provides an inclusive work environment with opportunities for growth and innovation. The role offers flexible work arrangements, with up to 100% work from home, and requires 0-25% travel. You'll be part of a team that values collaboration, technical excellence, and continuous learning while working on technology that impacts billions of users worldwide.
The ideal candidate will have strong experience in distributed systems, networking technologies, and software development, with a passion for performance engineering and AI infrastructure. This role requires both technical depth in systems/networking and the ability to collaborate across teams to deliver complex distributed systems.