System Engineer, GPU Infrastructure & Platform Engineering - GPU Optimization Department (GPUOD)

Japanese e-commerce and fintech company that operates 70+ businesses spanning e-commerce, digital content, communications and fintech services.
DevOps
Senior Software Engineer
In-Person
5,000+ Employees
3+ years of experience
AI · Enterprise SaaS

Description For System Engineer, GPU Infrastructure & Platform Engineering - GPU Optimization Department (GPUOD)

Rakuten's AI & Data Division (AIDD) is seeking a Senior System Engineer to join its GPU Optimization Department. The role is central to managing and optimizing Rakuten's company-wide AI infrastructure, with a focus on high-performance computing and GPU resource management. It involves working with NVIDIA's Hopper and upcoming Blackwell GPU architectures across thousands of accelerators in a hybrid infrastructure.

The role combines DevOps expertise with specialized knowledge of GPU infrastructure and requires a deep understanding of Kubernetes, distributed systems, and ML/AI workloads. You'll be responsible for building and scaling GPU infrastructure that supports both training (ranking models, LLMs) and inference workloads, ensuring efficient utilization and stability of Rakuten's AI computing resources.

This is an excellent opportunity for an experienced engineer who wants to work at the intersection of infrastructure and AI, managing large-scale GPU clusters and optimizing performance for critical AI workloads. You'll be part of a team that enables AI innovation across Rakuten's global operations, working with state-of-the-art hardware and software solutions.

The position offers exposure to cutting-edge AI infrastructure challenges, including work with large language models, real-time AI, and distributed training systems. You'll collaborate with global AI/ML teams and have the opportunity to shape the future of Rakuten's GPU platform architecture.

Last updated 5 days ago

Responsibilities For System Engineer, GPU Infrastructure & Platform Engineering - GPU Optimization Department (GPUOD)

  • Optimize Kubernetes for GPU workloads, including scheduling policies, autoscaling, and multi-tenant resource isolation
  • Deploy and maintain inference serving platforms for high-throughput and low-latency model deployment
  • Automate cluster provisioning, monitoring, and recovery
  • Collaborate with ML engineers to troubleshoot GPU-related issues
  • Implement observability tools to track GPU utilization and cluster health
  • Develop infrastructure-as-code solutions for reproducible GPU environments

Requirements For System Engineer, GPU Infrastructure & Platform Engineering - GPU Optimization Department (GPUOD)

Python · Go · Kubernetes · Linux
  • 3+ years of experience in DevOps/MLOps, GPU infrastructure, or distributed computing
  • Deep expertise in Kubernetes for GPU workload orchestration
  • Strong programming skills in Go or Python for platform development
  • Proficiency in Linux system administration, performance tuning, and networking
  • Experience with IaC tools and CI/CD pipelines
  • Bachelor's degree or higher in Computer Science, Engineering, or a related field
  • Strong teamwork and communication skills
  • Advanced English language skills
