BuildOps

Staff Site Reliability Engineer

Posted Yesterday
Hybrid
Los Angeles, CA
174K-226K Annually
Senior level
Lead reliability strategy and own reliability domains end-to-end. Build observability, SLIs/SLOs, automation, incident response, and AWS IaC tooling. Mentor engineers, run on-call, and drive multi-team reliability projects.
The summary above was generated by AI

At BuildOps, we’re building a software platform that empowers today’s commercial contractors. From service management to project execution, we’re reimagining how our customers operate. Our team thrives on ambition, innovation, and collaboration – qualities we look for in every new hire.

You will join our cloud infrastructure and reliability engineering team as a Staff Site Reliability Engineer (SRE). Your primary responsibility will be to improve and protect the reliability, performance, and operability of our production systems while helping evolve our AWS-based infrastructure. We’re looking for someone with a strong SRE mindset, solid software engineering fundamentals, and deep observability expertise who can work effectively in a distributed team environment.

Reporting to the DevOps and SRE Manager, this is a hands-on, staff-level role where you will influence reliability strategy, build tooling and automation, and contribute directly to day-to-day operations in a fast-moving, industry-defining company.

What You’ll Do
  • Own one or more reliability domains end-to-end (for example observability, incident management workflows, performance of key surfaces, or core platform readiness), including strategy, roadmap, and execution
  • Drive and refine modern SRE practices across services, including SLIs/SLOs, error budgets, and reliability reviews
  • Lead multi-sprint, multi-engineer reliability or performance initiatives, coordinating work across teams and driving them to successful completion
  • Design and maintain end-to-end observability (metrics, logs, traces, dashboards, and alerts) so teams can quickly detect, debug, and prevent issues
  • Act as a subject-matter expert in at least one reliability area (for example observability, incident management, performance engineering, or search/data platforms), helping other teams make good design and operational decisions
  • Partner with product and engineering teams to design reliable services—reviewing architectures, failure modes, rollout strategies, and capacity/latency considerations—and influence system design toward reliability and performance goals
  • Help evolve and operate our AWS infrastructure (networking, compute, data stores) in collaboration with infrastructure experts, working within Infrastructure as Code workflows
  • Contribute code to services, tooling, and automation (for example reliability libraries, deployment and incident tooling, health checks), and use LLMs/AI tools to accelerate high-quality delivery
  • Define, implement, and iterate on SLIs, SLOs, and error budgets with service owners, and use them to guide reliability work and release decisions
  • Participate in production on-call rotations and incident response for high-severity issues, including learning-focused post-incident reviews and follow-through on action items
  • Develop runbooks, safeguards, and automation that reduce manual work, improve time-to-diagnosis, and standardize responses to recurring scenarios
  • Collaborate with engineering and product leadership to prioritize reliability, performance, and operability work alongside feature delivery
  • Document standards, playbooks, and best practices so reliability improvements scale across teams
  • Mentor other SREs and software engineers in reliability-minded design, observability, incident response, and pragmatic use of SRE practices
  • Help build systems, automation, and team practices that reduce reliance on heroics and ad-hoc firefighting
What We Look For
  • 8+ years of experience operating complex, user-facing SaaS systems in production and working on reliability-focused initiatives
  • Proven experience leading multi-sprint, multi-engineer projects (for example reliability, performance, or infrastructure initiatives) to successful completion with clear business impact
  • Experience leading at least one org-wide or multi-team reliability or performance initiative from definition through rollout and follow-through on improvements
  • Thorough understanding of, and hands-on experience with, modern SRE practices, such as:
    • Defining and implementing SLIs/SLOs and error budgets
    • Reducing toil through automation
    • Safe deployment and rollout patterns
    • Structured post-incident reviews and continuous improvement
  • Strong software engineering skills: you’ve written and maintained production-quality code and can work comfortably in at least one modern language (for example Python or Node.js/TypeScript)
  • You regularly use LLMs and AI-assisted tooling in your workflow and know how to validate and improve what they generate
  • Deep expertise in at least one reliability-related domain, such as observability, incident management, performance engineering, or large-scale data/search platforms
  • Strong observability skills, including:
    • Designing metrics, logging, and tracing for multi-service systems
    • Building actionable dashboards and alerts with clear runbooks
    • Correlating metrics, logs, and traces to debug complex issues
    • Experience with tools such as Datadog, Prometheus, Grafana, Honeycomb, or New Relic (we use Datadog, but vendor-agnostic experience is welcome)
  • Experience working with AWS in production and collaborating effectively within Infrastructure as Code workflows (for example Terraform-based systems and container/orchestration platforms such as ECS, EKS, or Kubernetes)
  • Incident management experience, including:
    • Participating in or coordinating incident response
    • Working within an incident management tool (for example incident.io, PagerDuty, Opsgenie, or similar)
    • Helping teams implement durable, high-leverage follow-ups after incidents
  • Strong communication skills: the ability to explain complex technical topics to both technical and non-technical audiences, influence peers and stakeholders, and mentor less-experienced engineers
  • CS degree or equivalent experience running production systems; we are equally interested in people from non-traditional backgrounds who have spent time operating real-world environments
  • Ability and willingness to participate in a production on-call rotation
  • Ability to work a hybrid schedule – Monday/Friday WFH; Tuesday–Thursday in-office

Compensation

  • $174,000 - $226,000 base salary range + annual bonus
What we offer:
  • Generous equity grant: become an owner in our company!
  • MacBook computer provided
  • A comprehensive benefits package
  • Flexible PTO and hybrid work schedules
  • Work from home stipend
  • Hubs in Los Angeles, San Francisco, Toronto, and Raleigh with hybrid work schedules and lunch provided for in-office days
  • Company events like BBQs and team-building activities, both in-person and virtual
  • Fast-paced, collaborative, and dynamic work environment
  • Opportunities for growth and career advancement
  • Chance to work with cutting-edge technology and innovative solutions
  • The chance to get in on the ground floor and build something truly groundbreaking for ourselves and our amazing customers

We welcome applicants from across the U.S. where we are registered to do business and able to support employment. Currently, this excludes the following states: Alabama, Alaska, Connecticut, Hawaii, Kentucky, Mississippi, Nebraska, New Mexico, North Dakota, Rhode Island, South Dakota, West Virginia, and Wyoming. This list is based solely on operational and compliance considerations and is reviewed from time to time as our footprint grows.

About BuildOps

Join BuildOps, the largest commercial trade platform in the country, as we transform the multibillion-dollar commercial contracting industry!

We’re not just talking incremental improvements; we’re talking a full-scale revolution, empowering the hardworking heroes who build and maintain the infrastructure that keeps our world running.

This is your chance to be part of a rocketship. We’re fresh off a $1 billion valuation and a $127M Series C funding round (part of over $275M raised to date) led by industry-leading investors like Meritech Capital, BOND, and SE Ventures, backed by Schneider Electric (Reuters, TechCrunch, LA Business Journal). Our latest investors join our team of industry heavyweights like Next47, former Twitter CEO Dick Costolo, former Salesforce President Gavin Patterson, and Boost Mobile CEO Stephen Stokols. Their investment is fueling our aggressive growth and our commitment to equipping contractors with AI-driven tools to conquer chaos, boost efficiency, skyrocket profitability, and ultimately, deliver exceptional service.

At BuildOps, we’re changing the game and doing the best work of our careers. You’ll be a key player in a company that’s truly making a difference for the backbone of our economy. If you’re ready to tackle big challenges, work with a passionate team, and build something extraordinary, BuildOps is the place for you. 🚀

Top Skills

Python, Node.js, TypeScript, Datadog, Prometheus, Grafana, Honeycomb, New Relic, AWS, Terraform, ECS, EKS, Kubernetes, PagerDuty, incident.io, Opsgenie, LLMs/AI tools
HQ

BuildOps Santa Monica, California, USA Office

Santa Monica, California, United States, 90404


