
Research Engineer, Agentic AI Evals

HUD (YC W25) develops agentic evals for Computer Use Agents (CUAs) that browse the web, providing the first comprehensive evaluation tool for such agents.
  • Machine Learning / AI
  • Mid-Level Software Engineer
  • Hybrid
  • 11 - 50 employees
  • 3+ years of experience

Job Description

HUD (YC W25) is an AI company developing comprehensive evaluation tools for Computer Use Agents (CUAs) that browse the web. A YC-backed startup with $2 million in seed funding, the company is experiencing strong demand and rapid growth. The Research Engineer role focuses on building and implementing evaluation frameworks for AI agents, combining engineering expertise with research capabilities.

The position offers a unique opportunity to work with a distinguished team that includes international Olympiad medallists and researchers published at prestigious conferences such as ICLR and NeurIPS. The role involves creating evaluation environments, developing datasets, and contributing to the advancement of AI agent assessment methodologies.

This is an ideal position for someone passionate about AI evaluation and safety, with strong technical skills in Python and web technologies. The company offers flexible work arrangements, with both remote and in-office options in San Francisco or Singapore. It values technical aptitude and learning potential over years of experience, making this an excellent opportunity for motivated engineers interested in AI safety and evaluation.

The role combines practical engineering with research, requiring both technical proficiency and an understanding of AI systems. Working at HUD means joining a fast-growing team of about 15 people, with the opportunity to make significant contributions to the field of AI evaluation. The company provides comprehensive relocation and visa support for strong candidates, demonstrating its commitment to building the best possible team.

The position would be particularly appealing to engineers who enjoy building evaluation systems, have experience with LLM frameworks, and want to contribute to AI safety and alignment. The work environment emphasizes both the quality and the volume of contributions, with concrete goals such as creating multiple evaluation environments per day.


Responsibilities For Research Engineer, Agentic AI Evals

  • Build out environments for HUD's CUA evaluation datasets, including evals for safety red-teaming, general business tasks, and long-horizon agentic tasks
  • Create custom CUA datasets/evaluation pipelines
  • Build out large, high-quality eval datasets

Requirements For Research Engineer, Agentic AI Evals

  • Proficiency in Python, Docker, and Linux environments
  • React experience for frontend development
  • Production-level software development experience preferred
  • Strong technical aptitude and demonstrated problem-solving ability
  • Hands-on experience with LLM evaluation frameworks and methodologies
  • Strong communication skills for remote collaboration across time zones
  • Familiarity with current AI tools and LLM capabilities
  • Understanding of safety and alignment considerations in AI systems

Benefits For Research Engineer, Agentic AI Evals

  • Remote work options
  • Visa sponsorship available
  • Relocation support
  • Flexible work arrangements
  • Office locations in San Francisco and Singapore