Zefr

Senior Site Reliability Engineer

Reposted 7 Days Ago
Hybrid
Marina del Rey, CA
150K-170K
Senior level
The Senior Site Reliability Engineer will focus on cloud infrastructure, CI/CD practices, observability, and collaboration with machine learning teams to build and maintain scalable systems for Zefr's products.
The summary above was generated by AI
What we do: 

Zefr is the leading global technology company enabling responsible marketing in walled garden social environments. Zefr’s solutions empower brands to manage their content adjacency on scaled platforms such as YouTube, Meta, TikTok, and Snap, in accordance with industry standard frameworks. Through its patented AI technology, Zefr offers brands and agencies more accurate and transparent solutions for social walled gardens. The company is headquartered in Los Angeles, California, with additional locations across the globe.

What you’ll do: 

As a Site Reliability Engineer at Zefr, you’ll apply your expertise in cloud infrastructure, CI/CD, observability, and core SRE concepts to deliver high-quality, reliable, and scalable solutions. A significant aspect of this role involves working closely with Zefr's Machine Learning team, ensuring the specialized infrastructure required for model training, deployment, and serving is robust, efficient, and scalable.

We’re looking for someone to combine their technical expertise with strong leadership and a passion for continuous improvement and innovation. By ensuring the continuous health and efficiency of our infrastructure, including the systems supporting critical ML workloads, you will directly contribute to Zefr’s commitment to providing a consistently high-quality user experience. This is a role where we expect both to learn from you and to have you learn from us!

  • Support and build systems and tools that enable other engineers to generate, deploy, and manage product features and models both quickly and safely.

  • Deploy and support a multi-cloud, microservice architecture, including infrastructure tailored for ML workloads, deployed via GitHub Actions, Argo CD, and Kubernetes.

  • Collaborate with other engineers, particularly the Machine Learning team, to architect secure, resilient, scalable, and cost-efficient applications and ML systems/pipelines in AWS and GCP.

  • Foster and push our DevOps culture and philosophy by encouraging continuous improvement across all engineering teams.

  • Proactively maintain the health of production environments, including monitoring ML model performance and resource utilization.

  • Participate in a 24/7 on-call rotation and respond to system performance issues and outages.

  • Debug code at the application and infrastructure level.

  • Mature our CI/CD workflows and release process.

  • Maintain a forward-thinking approach, actively researching and proposing new solutions.

  • Propose and review Engineering Requests for Comments (RFCs) to drive engineering architecture and practices.

Technology Stack at Zefr:

Core Infrastructure & Cloud Platforms:

  • Cloud Providers: Google Cloud Platform (GCP), Amazon Web Services (AWS)

  • Infrastructure as Code (IaC): Terraform

  • Containerization & Orchestration: Docker, Kubernetes (experience with GKE and/or EKS expected), Helm, Kustomize

  • Service Mesh: Istio

CI/CD & Automation:

  • CI/CD Pipelines: GitHub Actions

  • GitOps / Continuous Delivery: Argo CD

  • Primary Scripting/Automation Language: Python

Observability & Monitoring:

  • Monitoring & Alerting: Prometheus, Datadog, PagerDuty

  • Telemetry Standards: OpenTelemetry

Application & Data Ecosystem (Supporting):

  • Application Languages/Frameworks: Python, FastAPI, Flask, Node.js, React

  • Data Streaming: Apache Kafka

  • Data Processing/Transformation: Pandas, dbt

  • Workflow Orchestration: Apache Airflow, Ray

  • Machine Learning Stack:

    • Serving: Triton Inference Server

    • MLOps/Experiment Tracking: Weights and Biases, DVC

    • Libraries/Frameworks: Transformers, Hugging Face

    • Model Optimization/Formats: ONNX, TensorRT

Data Stores & Databases:

  • Relational Databases: PostgreSQL (including managed versions like AWS Aurora, GCP Cloud SQL)

  • NoSQL Databases: DynamoDB

  • Search Databases: OpenSearch

  • Vector Databases: Qdrant

  • Caching: Redis

  • Data Warehousing: Snowflake

What we’re looking for:
  • 6+ years of experience designing, managing, deploying, and supporting cloud infrastructure in a production environment using major public cloud providers (at least one of AWS or GCP required)

  • Production experience designing, managing, deploying, and maintaining container-based workloads on Kubernetes clusters

  • 1+ year of experience in machine learning infrastructure development and operations

  • Knowledge of GitOps, including an understanding of modern CI/CD pipelines, techniques, and technologies (GitHub Actions, GitLab, CircleCI, Argo CD, Flux)

  • Knowledge of IaC and configuration management tools (Terraform, OpenTofu, Crossplane, Pulumi, Ansible, CloudFormation)

  • Strong problem-solving experience, focusing on automation

  • Production experience with monitoring and observability tools (Prometheus, Grafana, Datadog, Thanos, New Relic, OpenTelemetry)

  • Understanding of cloud networking concepts (mesh networking, NAT, load balancers, SSL certificates and TLS termination, API gateways, proxies, etc.)

  • Strong written and verbal communication, organization, and documentation skills

Benefits (for US-based employees):
  • Flexible PTO

  • Medical, dental, and vision insurance with FSA options

  • Company-paid life insurance

  • Paid parental leave

  • 401(k) with company match

  • Professional development opportunities

  • 13+ paid holidays off

  • Summer Fridays (we leave early)

  • Hybrid work schedule

  • In-office lunches and lots of free food

  • Optional in-person and virtual events (we like to celebrate!)

Compensation (for US-based employees):

The anticipated salary for this position is between $150,000 and $170,000. Within the range, individual pay is determined by factors such as job-related skills, experience, and relevant education or training. If your compensation expectations fall outside of this range, it may still be worth having a conversation.

Zefr is an equal opportunity employer that embraces diversity and inclusion in the workplace. We are committed to building a team that represents a variety of backgrounds, skills, and perspectives because we know this only makes us better. We strongly encourage women, persons of color, LGBTQIA+ individuals, persons with disabilities, members of ethnic minorities, foreign-born residents, and veterans to apply, even if they do not meet 100% of the qualifications.

Top Skills

Apache Kafka
Argo CD
Datadog
Docker
DynamoDB
FastAPI
Flask
GitHub Actions
Hugging Face
Kubernetes
Node.js
OpenTelemetry
PostgreSQL
Prometheus
Python
React
Redis
Snowflake
TensorRT
Terraform
Transformers
Triton Inference Server
HQ

Zefr Marina del Rey, California, USA Office

Marina del Rey, CA, United States, 90066
