Senior Site Reliability Engineer, Compute

Crusoe

Building the World's Favorite AI-first Cloud infrastructure company, pioneering vertically integrated, purpose-built AI infrastructure solutions powered by clean, renewable energy.

San Francisco, CA, USA

$183,000 - $210,000

Site Reliability

Senior Software Engineer

Hybrid

5+ years of experience

AI · Enterprise SaaS · Cloud

Description For Senior Site Reliability Engineer, Compute

Crusoe is revolutionizing AI cloud infrastructure by building sustainable, high-performance computing solutions. As a Senior Site Reliability Engineer focused on Compute, you'll be instrumental in developing and optimizing the company's virtualization and compute infrastructure. The role combines deep technical expertise in Linux kernel internals, virtualization technologies, and system optimization with a focus on supporting modern AI and HPC workloads.

You'll work with cutting-edge technologies including SmartNICs, BlueField devices, and TPUs, while being responsible for critical infrastructure components from the kernel to orchestration layers. The position requires strong programming skills in languages like C, Go, or Rust, and extensive experience with system-level debugging and performance optimization.

The company offers a comprehensive benefits package including equity, competitive salary, and various health and wellness benefits. Working in a hybrid environment in San Francisco, you'll be part of a team that's setting new standards in sustainable AI infrastructure. This is an opportunity to make a significant impact at a well-funded technology company that's aligned with both technological advancement and environmental responsibility.

The role is perfect for an experienced SRE who is passionate about infrastructure optimization, has deep Linux expertise, and wants to work on challenging problems at the intersection of AI, cloud computing, and sustainability. You'll be contributing to a platform that's considered the "gold standard" for reliability and performance in AI infrastructure.

Responsibilities For Senior Site Reliability Engineer, Compute

Develop automation and observability tools to monitor compute infrastructure
Support and scale the virtualization stack, including KVM, QEMU, and other hypervisors
Collaborate with Linux kernel and hardware teams to resolve performance bottlenecks
Optimize performance for AI and HPC workloads across CPU, GPU, and DPU/NIC resources
Perform root cause analysis for kernel crashes and hardware-software integration problems
Integrate hypervisor-level enhancements for guest VM reliability and workload isolation
Tune kernel subsystems including process scheduler, NUMA configuration, and memory management
Implement and validate support for emerging compute hardware

Requirements For Senior Site Reliability Engineer, Compute

5+ years of professional experience in SRE, Linux system engineering, or compute infrastructure roles
Strong proficiency in Linux kernel internals, with exposure to scheduler, memory allocation, and driver subsystems
Experience with virtualization architectures and technologies such as KVM, Xen, QEMU, or VMware
Familiarity with SmartNICs/DPUs and kernel bypass techniques
Expert-level skills in at least one programming language: C, Go, or Rust
Experience with system-level debugging, including kdump, kexec, and kernel panic analysis
Proficiency in Infrastructure as Code tooling and CI/CD practices
Strong understanding of compute scheduling, resource management, and high-throughput networking

Benefits For Senior Site Reliability Engineer, Compute

Hybrid work schedule
Industry-competitive pay
Restricted Stock Units
Health insurance package (HDHP and PPO, vision, and dental)
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement
Subscription to the Calm app
MetLife Legal
Company-paid commuter benefit ($50 per pay period)
