Zefr

Principal Site Reliability Engineer

Reposted 22 Days Ago

Hybrid

Marina del Rey, CA

210K-235K Annually

Expert/Leader

As a Principal Site Reliability Engineer, you'll lead reliability practices, mentor engineers, and manage cloud infrastructure in a multi-cloud environment, focusing on continuous improvement and innovation.

The summary above was generated by AI

What we do:

Zefr is the leading global technology company enabling responsible marketing in walled garden social environments. Zefr’s solutions empower brands to manage their content adjacency on scaled platforms such as YouTube, Meta, TikTok, and Snap, in accordance with industry-standard frameworks. Through its patented AI technology, Zefr offers brands and agencies more accurate and transparent solutions for social walled gardens. The company is headquartered in Los Angeles, California, with additional locations across the globe.

What you’ll do:
As a Principal Site Reliability Engineer at Zefr, you'll serve as a technical leader and subject matter expert, helping define the technical vision and shape the direction of our reliability practices across the organization.

You'll leverage deep expertise in observability, core SRE principles, cloud infrastructure, CI/CD, and DevSecOps to solve our most complex challenges and set the standard for engineering excellence.

This role requires a blend of hands-on technical expertise and strategic thinking. You'll drive cross-functional initiatives, mentor engineers across teams, and partner with leadership to ensure our AI-powered platform is robust, efficient, and scalable.

We’re looking for someone who combines technical expertise with strong leadership and a passion for continuous improvement and innovation. Zefr wants a candidate who champions reliability as a product feature and can translate complex technical concepts into strategy. This is a role where you'll shape how we build and operate systems at scale.

Support and build systems and tools that enable other engineers to generate, deploy, and manage product features and models both quickly and safely.
Deploy and support a multi-cloud, microservice architecture, including infrastructure tailored for ML workloads, deployed via GitHub Actions, Argo CD, and Kubernetes.
Collaborate with other engineers to architect secure, resilient, scalable, and cost-efficient applications and ML systems/pipelines in AWS and GCP.
Foster and push our DevOps culture and philosophy by encouraging continuous improvement across all engineering teams.
Proactively maintain the health of production environments, including monitoring application performance and resource utilization.
Participate in a 24/7 on-call rotation and respond to system performance issues and outages.
Debug code at the application and infrastructure level.
Mature our CI/CD workflows and release process.
Maintain a forward-thinking approach by actively researching and proposing new solutions.
Propose and review engineering Requests for Comments (RFCs) to drive engineering architecture and practices.

Technology Stack at Zefr:

Core Infrastructure & Cloud Platforms:

Cloud Providers: Google Cloud Platform (primary), Amazon Web Services
Infrastructure as Code (IaC): Terraform, Terragrunt
Containerization & Orchestration: Docker, Kubernetes (experience with GKE and/or EKS expected), Helm, Kustomize
Service Mesh: Istio

CI/CD & Automation:

CI/CD Pipelines: GitHub Actions
GitOps / Continuous Delivery: Argo CD
Primary Scripting/Automation Language: Python

Observability & Monitoring:

Monitoring & Alerting: Prometheus, Chronosphere, PagerDuty
Telemetry Standards: OpenTelemetry

Application & Data Ecosystem (Supporting):

Application Languages/Frameworks: Python, FastAPI, Flask, Node.js, React
Data Streaming: Apache Kafka
Data Processing/Transformation: Pandas, dbt
Workflow Orchestration: Apache Airflow, Ray

Data Stores & Databases:

Relational Databases: PostgreSQL (including managed versions like AWS Aurora, GCP Cloud SQL)
NoSQL Databases: DynamoDB
Search Databases: OpenSearch
Vector Databases: Qdrant
Caching: Redis
Data Warehousing: Snowflake

What we’re looking for:

10+ years of experience designing, managing, deploying, and supporting cloud infrastructure in a production environment using major public cloud providers (GCP experience is a huge bonus)
Experience in Advertising or AdTech
Demonstrated technical leadership experience, including mentoring engineers, driving cross-functional projects, and influencing architectural decisions at an organizational level.
Knowledge of GitOps, including an understanding of modern CI/CD pipelines, techniques, and technologies (GitHub Actions, GitLab, CircleCI, Argo CD, Flux)
Advanced proficiency with IaC and configuration management tools (Terraform, Terragrunt, OpenTofu, Crossplane, Pulumi)
Deep production experience architecting, managing, deploying, and supporting container-based workloads on Kubernetes clusters
Proven track record of building and scaling reliability practices, including SLO/SLI frameworks, incident management, and capacity planning.
Extensive production experience with observability platforms and practices (Prometheus, Grafana, Chronosphere, Datadog, OpenTelemetry); ability to design monitoring strategies for complex distributed systems.
Strong knowledge of cloud networking (service mesh, NAT, load balancers, API gateways, proxies, etc.), cloud security, and cost optimization strategies.
Exceptional written and verbal communication skills; ability to translate complex technical concepts for diverse audiences and build consensus across teams.
Experience authoring technical strategy documents, RFCs, and architectural proposals.

Benefits (for US-based employees):

Flexible PTO
Medical, dental, and vision insurance with FSA options
Company-paid life insurance
Paid parental leave
401(k) with company match
Professional development opportunities
13 paid holidays
Summer Fridays (we leave early)
In-office, hybrid, and fully-remote work options available
In-office lunches and lots of free food
Optional in-person and virtual events (we like to celebrate!)

Compensation (for US-based employees):

The anticipated salary for this position is between $210,000 and $235,000. Within the range, individual pay is determined by factors such as job-related skills, experience, and relevant education or training. If your compensation expectations fall outside of this range, it may still be worth having a conversation.

Zefr is an equal opportunity employer that embraces diversity and inclusion in the workplace. We are committed to building a team that represents a variety of backgrounds, skills, and perspectives because we know this only makes us better. We strongly encourage women, persons of color, LGBTQIA+ individuals, persons with disabilities, members of ethnic minorities, foreign-born residents, and veterans to apply even if you do not meet 100% of the qualifications.

Marina del Rey, CA, United States, 90066
