CentML

Software Engineer - LLM Training

Posted 20 Days Ago
Hybrid
9 Locations
Mid level
Design and implement distributed training systems for large-scale AI models, optimizing performance across many GPUs and ensuring usability and flexibility on the CentML platform.
The summary above was generated by AI

About Us

We believe AI will fundamentally transform how people live and work. CentML's mission is to massively reduce the cost of developing and deploying ML models so we can enable anyone to harness the power of AI and everyone to benefit from its potential.


Our founding team is made up of experts in AI, compilers, and ML hardware and has led efforts at companies like Amazon, Google, Microsoft Research, Nvidia, Intel, Qualcomm, and IBM. Our co-founder and CEO, Gennady Pekhimenko, is a world-renowned expert in ML systems who holds multiple academic and industry research awards from Google, Amazon, Facebook, and VMware.


About the Position

We are seeking highly skilled and motivated software engineers to join our team and empower AI practitioners to develop AI models on the CentML Platform productively and affordably. If you have launched multi-node distributed training jobs and experienced firsthand how painful and cumbersome it is to get them functional, let alone high-performing, and you want to be part of the team that builds solutions to this problem so that other AI practitioners don’t feel the same pain, please come and join us!


What you’ll do

  • Design and implement highly efficient distributed training systems for large-scale deep learning models.
  • Optimize parallelism strategies to improve performance and scalability across hundreds or thousands of GPUs.
  • Develop low-level systems components and algorithms to maximize throughput and minimize memory and compute bottlenecks.
  • Deploy the training systems to production on the CentML Platform.
  • Collaborate with researchers and engineers to productionize cutting-edge model architectures and training techniques.
  • Contribute to the design of APIs, abstractions and UX that make it easier to scale models while maintaining usability and flexibility.
  • Profile, debug, and tune performance at the system, model, and hardware levels.
  • Participate in design discussions, code reviews, and technical planning to ensure the product aligns with business goals.
  • Stay up to date with the latest advancements in large-scale model training and help translate research into practical, robust systems.

What you’ll need to be successful

  • Bachelor’s, Master’s, or PhD degree in Computer Science/Engineering, Software Engineering, or a related field, or equivalent work experience.
  • 3+ years of experience in software development, preferably with Python and C++.
  • Deep understanding of machine learning pipelines and workflows, distributed systems, parallel computing, and high-performance computing principles.
  • Hands-on experience with large-scale training of deep learning models using frameworks such as PyTorch, Megatron Core, or DeepSpeed.
  • Experience optimizing compute, memory, and communication performance in large model training workflows.
  • Familiarity with GPU programming, CUDA, NCCL, and performance profiling tools.
  • Solid grasp of deep learning fundamentals, especially as they relate to transformer-based architectures and training dynamics.
  • Experience working with cloud platforms (AWS, GCP, or Azure) and containerization tools (Docker, Kubernetes).
  • Ability to work closely with both research and engineering teams, translating evolving needs into robust infrastructure.
  • Excellent problem-solving skills, with the ability to debug complex systems.
  • A passion for building high-impact tools that push the boundaries of what’s possible with large-scale AI.

Bonus points if you have

  • Experience building tools or platforms for ML model training or fine-tuning.
  • Experience building backends (e.g., using FastAPI) and frontends (e.g., using React).
  • Experience building and optimizing LLM inference engines (e.g., vLLM, SGLang).
  • Exposure to DevOps practices, CI/CD pipelines, and infrastructure as code.
  • Familiarity with MLOps concepts, including model versioning and serving.

Benefits & Perks

- An open and inclusive work environment

- Employee stock options

- Best-in-class medical and dental benefits

- Parental Leave top-up

- Professional development budget

- Flexible vacation time to promote a healthy work-life blend


We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, disability, or any other protected ground of discrimination under applicable human rights legislation.


CentML strives to respect the dignity and independence of people with disabilities and is committed to giving them the same opportunity to succeed as all other employees.


Inclusiveness is core to our culture at CentML, and we strive to ensure you get the most from your interview experience. CentML makes reasonable accommodations for applicants with disabilities. If a reasonable accommodation is needed to participate in the job application or interview process, please reach out to the Talent team.

Top Skills

AWS
Azure
C++
CUDA
DeepSpeed
Docker
GCP
Kubernetes
Megatron Core
NCCL
Python
PyTorch


