BentoML

Inference Optimization Engineer

Reposted 25 Days Ago

Remote

3 Locations

Mid level

As an Inference Optimization Engineer, you'll enhance the performance of large language models by optimizing inference processes on GPUs, profiling workloads, and sharing insights with the community.

The summary above was generated by AI

About BentoML

BentoML is a leading inference platform provider that helps AI teams run large language models and other generative AI workloads at scale. With support from investors such as DCM, enterprises around the world rely on us for consistent scalability and performance in production. Our portfolio includes both open source and commercial products, and our goal is to help each team build its own competitive advantage through AI.

Role

As an Inference Optimization Engineer, you will improve the speed and efficiency of large language models at the GPU kernel level, in the inference engine, and across distributed architectures. You will profile real workloads, remove bottlenecks, and raise the performance ceiling of each layer of the stack. Every gain you unlock will flow straight into open source code and power fleets of production models, cutting GPU costs for teams around the world. By publishing blog posts and giving conference talks, you will become a trusted voice on efficient LLM inference at large scale.

Example projects:

https://bentoml.com/blog/structured-decoding-in-vllm-a-gentle-introduction
https://bentoml.com/blog/benchmarking-llm-inference-backends
https://bentoml.com/blog/25x-faster-cold-starts-for-llms-on-kubernetes

Responsibilities

Latency & throughput - Identify bottlenecks and optimize inference efficiency in single-GPU, multi-GPU, and multi-node serving setups.
Benchmarking - Build repeatable tests that model production traffic; track and report the performance of vLLM, SGLang, TRT-LLM, and future runtimes.
Resource efficiency - Reduce memory use and compute cost with mixed precision, better KV-cache handling, quantization, and speculative decoding.
Serving features - Improve batching, caching, load balancing, and model-parallel execution.
Knowledge sharing - Write technical posts, contribute code, and present findings to the open-source community.

Qualifications

Deep understanding of transformer architecture and inference engine internals.
Hands-on experience speeding up model serving through batching, caching, and load balancing.
Experienced with inference engines such as vLLM, SGLang, or TRT-LLM (upstream contributions are a plus).
Experienced with inference optimization techniques: quantization, distillation, speculative decoding, or similar.
Proficiency in CUDA and use of profiling tools like Nsight, nvprof, or CUPTI. Proficiency in Triton and ROCm is a bonus.
Track record of blog posts, conference talks, or open-source projects in ML systems is a bonus.

Why join us

Direct impact – your optimizations flow into open source code and production model fleets, cutting real GPU costs.
Technical scope – operate distributed LLM inference and large GPU clusters worldwide.
Customer reach – support organizations around the globe that rely on BentoML.
Influence – mentor teammates, guide open-source contributors, and become a go-to voice on efficient inference in the community.
Remote work – work from where you are most productive and collaborate with teammates in North America and Asia.
Compensation – competitive salary, equity, learning budget, and paid conference travel.

Top Skills

CUDA

ROCm

SGLang

Triton

TRT-LLM

vLLM
