
Research Engineer, Agentic AI Evals

HUD (YC W25) develops agentic evals for Computer Use Agents (CUAs) that browse the web, providing the first comprehensive evaluation tool for such agents.
  • Machine Learning / AI
  • Mid-Level Software Engineer
  • Hybrid
  • 11 - 50 employees
  • 3+ years of experience

Job Description

HUD (YC W25) is an AI company developing comprehensive evaluation tools for Computer Use Agents (CUAs) that browse the web. A YC-backed startup with $2 million in seed funding, the company is experiencing strong demand and rapid growth. The Research Engineer role focuses on building and implementing evaluation frameworks for AI agents, combining engineering expertise with research capabilities.

The position offers a unique opportunity to work with a distinguished team that includes international Olympiad medallists and researchers published at prestigious conferences such as ICLR and NeurIPS. The role involves creating evaluation environments, developing datasets, and contributing to the advancement of AI agent assessment methodologies.

This is an ideal position for someone passionate about AI evaluation and safety, with strong technical skills in Python and web technologies. The company offers flexible work arrangements, with both remote and in-office options in San Francisco or Singapore. It values technical aptitude and learning potential over years of experience, making this an excellent opportunity for motivated engineers interested in AI safety and evaluation.

The role combines practical engineering with research, requiring both technical proficiency and an understanding of AI systems. Working at HUD means joining a fast-growing team of about 15 people, with the opportunity to make significant contributions to the field of AI evaluation. The company provides comprehensive relocation and visa support for strong candidates, demonstrating its commitment to building the best possible team.

The position would be particularly appealing to engineers who enjoy building evaluation systems, have experience with LLM frameworks, and want to contribute to AI safety and alignment. The work environment emphasizes both the quality and the volume of contributions, with concrete goals such as creating multiple evaluation environments per day.


Responsibilities For Research Engineer, Agentic AI Evals

  • Build out environments for HUD's CUA evaluation datasets, including evals for safety red-teaming, general business tasks, and long-horizon agentic tasks
  • Create custom CUA datasets/evaluation pipelines
  • Build out large, high-quality eval datasets

Requirements For Research Engineer, Agentic AI Evals

  • Proficiency in Python, Docker, and Linux environments
  • React experience for frontend development
  • Production-level software development experience preferred
  • Strong technical aptitude and demonstrated problem-solving ability
  • Hands-on experience with LLM evaluation frameworks and methodologies
  • Strong communication skills for remote collaboration across time zones
  • Familiarity with current AI tools and LLM capabilities
  • Understanding of safety and alignment considerations in AI systems

Benefits For Research Engineer, Agentic AI Evals

  • Remote work options
  • Visa sponsorship available
  • Relocation support
  • Flexible work arrangements
  • Office locations in San Francisco and Singapore