Finite State

Senior Site Reliability Engineer (SRE)

Posted Yesterday

Remote

2 Locations

215K-250K Annually

Senior level

Lead the design and implementation of observability, SLO/SLA frameworks, and AI-enabled infrastructure automation. Architect scalable AWS infrastructure, improve incident management and on-call practices, and drive organization-wide adoption of telemetry and reliability standards.

The summary above was generated by AI

Finite State partners with product security teams, the guardians of our connected world, to create transparency for their connected devices and supply chains. Our platform handles connected devices and embedded systems across all industries, including those found in enterprises, healthcare, utilities, connected vehicles, manufacturing facilities, critical infrastructure, and government entities.

We are a fast-growing Series B company with a fully distributed workforce. Led by seasoned experts, we are a mission-driven team passionate about arming our customers with the actionable insights, critical vulnerability data, and remediation guidance necessary to mitigate product risk and protect the connected attack surface. We are committed to a remote-first culture.

About the Role

We are seeking a Senior Site Reliability Engineer (SRE) / Infrastructure Engineering leader to define, architect, and drive a modern observability and reliability strategy for an AI-first development organization. This is a highly impactful technical leadership role responsible for establishing best-in-class operational practices, reliability standards, and AI-enabled infrastructure automation across our product ecosystem.

This individual will bring deep experience in reliability engineering, distributed systems, and production operations—along with a forward-thinking mindset around AI-assisted development and infrastructure-as-code.

If you are passionate about building resilient systems, defining SLOs that actually matter, and leveraging AI tooling to accelerate operational excellence, this role is for you.

What You’ll Do

Observability & Reliability Leadership

Leverage AI tools and agentic processes to drive observability, quality, responsiveness, and operational clarity.
Design modern telemetry pipelines (metrics, logs, traces, events) for distributed systems and AI-driven workloads.
Define and implement a comprehensive observability framework across applications and infrastructure.
Establish and operationalize meaningful SLIs, SLOs, and SLAs aligned with business objectives.
Lead the adoption and optimization of observability tooling including Honeycomb, Grafana, and related telemetry platforms.
Drive best practices in error budgeting, alert design, and production health monitoring.

Operational Excellence

Define and evolve incident management processes, including:

On-call structures and escalation models
Postmortems and blameless retrospectives
Runbooks and operational playbooks

Improve system reliability, performance, scalability, and cost efficiency.
Establish operational KPIs and reliability dashboards for engineering and leadership visibility.
Lead reliability reviews for new architecture and product initiatives.

Infrastructure Engineering

Architect and implement scalable cloud infrastructure primarily within AWS.
Work closely with modern application platforms such as Vercel and Supabase.
Implement and improve Infrastructure-as-Code practices.
Leverage AI-assisted tooling to accelerate infrastructure design, validation, and automation.
Ensure production-grade security, compliance, and resilience standards.

AI-First Enablement

Champion the use of AI tools to:

Accelerate infrastructure provisioning
Improve operational workflows
Enhance observability signal quality
Automate incident response and remediation

Partner with AI-focused product teams to ensure observability supports model performance, experimentation, and reliability.

Technical Leadership

Serve as a senior technical authority for reliability and infrastructure decisions.
Mentor engineers on production best practices.
Influence architectural decisions to improve system resilience and maintainability.
Drive a culture of reliability, accountability, and continuous improvement.

What You Bring

Experience

10+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or Production Engineering.
Proven experience defining and implementing SLOs, SLAs, SLIs, and error budget frameworks at scale.
Deep experience building and managing on-call rotations and incident management processes.
Strong background in distributed systems and cloud-native architectures.

Technical Expertise

Hands-on experience with:

Honeycomb
Grafana
AWS
Vercel
Supabase

Strong experience with observability instrumentation and telemetry design.
Infrastructure-as-Code experience (e.g., Terraform, Pulumi, or similar).
Experience designing resilient CI/CD pipelines.
Deep understanding of high-availability, scalability, and performance engineering principles.

AI & Automation

Demonstrated experience leveraging AI tools (Cursor, Claude, Codex, etc.) in development or infrastructure workflows.
Experience using AI-assisted tooling to generate, validate, or optimize infrastructure configurations.
Strong interest in building AI-native operational practices.

Leadership & Communication

Ability to operate as both strategic architect and hands-on implementer.
Strong written and verbal communication skills.
Experience influencing cross-functional teams.
Comfort working in fast-paced, high-growth environments.

Nice to Have

Experience supporting AI/ML workloads in production.
Experience building internal developer platforms (IDP).
Experience with cost observability and FinOps practices.
Experience scaling observability in high-growth SaaS environments.

What Success Looks Like in the First 6 Months

Clear SLO framework implemented across core services.
Observability tooling standardized and adopted organization-wide.
On-call and incident management processes running smoothly with measurable improvements.
AI-driven infrastructure workflows reducing operational toil.
Increased system reliability and reduced mean time to detection (MTTD) and recovery (MTTR).

Compensation

Our salary ranges are categorized into two tiers based on geographic location:

Tier 1 (San Francisco, New York, Seattle): $230,000 - $250,000
Tier 2 (All Other Locations): $215,000 - $240,000

The final base salary will be determined by experience, skill set, and specific location. In addition to base pay, this role is eligible for equity and benefits.

About Finite State

At Finite State, we're on a mission to secure the connected world. Our platform empowers product security teams to detect vulnerabilities, manage software supply chain risks, and ensure compliance across complex device ecosystems. From IoT to critical infrastructure, we provide unparalleled visibility into firmware and software components, helping organizations protect their products and customers.

We move with urgency and intent — we’re transparent, own outcomes, put customers first, speak up, and learn fast — turning evidence into action. CLARITY is how we move fast without breaking trust.

C - Customer first - Learn from customers. Ship with urgency.
L - Leverage - Outsource the routine. Own the result.
A - Agency - We take responsibility—end to end.
R - Results - Ship value. Improve fast.
I - Integrity - Speak up. Experiment boldly. Be kind.
T - Transparency - Clear context. Faster decisions.
Y - "Why" - Our mission—securing the connected products humanity depends on—is the reason Finite State exists. CLARITY is how we make that mission real, every day, at speed.

Bold Innovation – We push boundaries, explore new ideas, and take initiative to solve complex problems.

The Finite State platform brings visibility and control to the supply chains that create connected devices and embedded systems—all in a simple-to-use platform and at the scale manufacturers need to keep device production on time and on budget. After unpacking and analyzing every file, configuration, and setting in a firmware build, the platform generates a complete bill of materials for software components, identifies known and 0-day vulnerabilities, shows a contextual risk score, and provides actionable insights that product teams can use to secure their software.

We are proud to be an Equal Opportunity Employer. We do not discriminate based upon race, religion, color, national origin, gender (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender identity, gender expression, age, status as a protected veteran, status as an individual with a disability, or other applicable legally protected characteristics. Finite State is committed to working with and providing reasonable accommodations to applicants with physical and mental disabilities.

Top Skills

Honeycomb, Grafana, AWS, Vercel, Supabase, Terraform, Pulumi, CI/CD, Observability, Telemetry, Infrastructure-as-Code, Cursor, Claude, Codex, AI-Assisted Tooling

