Fractal

Site Reliability Engineer

Posted 23 Days Ago


California

110K-160K Annually

Mid level

Manage, monitor, and optimize Jenkins and C3 clusters on Kubernetes, ensuring reliability, scalability, and performance of CI/CD infrastructure while authoring complex Jenkins pipelines.

The summary above was generated by AI

It's fun to work in a company where people truly BELIEVE in what they are doing!

We're committed to bringing passion and customer focus to the business.

Fractal Analytics is a strategic AI partner to Fortune 500 companies with a vision to power every human decision in the enterprise. Fractal is building a world where individual choices, freedom, and diversity are the greatest assets. An ecosystem where human imagination is at the heart of every decision. Where no possibility is written off, only challenged to get better. We believe that a true Fractalite empowers imagination with intelligence. And that it will be such Fractalites that will continue to build the company for the next 100 years.

For more information about Fractal, please visit the Fractal website (Fractal | Intelligence for Imagination).

**Please Note: This role is located in the San Francisco Bay Area. You will need to work onsite Monday through Friday. We offer paid relocation.**

Role Overview

We are seeking a highly skilled Site Reliability Engineer (SRE) to join our team to manage, monitor, and optimize our Jenkins and C3 clusters on Kubernetes. The ideal candidate will have a deep understanding of Kubernetes, Jenkins, Cloud Infrastructure, and Infrastructure as Code (IaC) practices. You will be responsible for ensuring the reliability, scalability, and performance of our CI/CD infrastructure, while also authoring and managing complex Jenkins pipelines using Groovy scripting.

Responsibilities:

Deploy, Manage, and Scale Jenkins Clusters: Design, implement, and manage Jenkins clusters, ensuring high availability and optimal performance across all environments.
Author and Maintain Jenkins Pipelines: Write, update, and maintain complex Jenkins pipelines using Groovy to automate CI/CD processes for various development teams.
Monitor and Manage C3 Clusters: Ensure the stability, health, and scalability of C3 Clusters, deploying applications and services on Kubernetes.
Kubernetes Management: Deploy, monitor, and scale applications on Kubernetes clusters. Maintain Helm charts, manage services, and ensure resource allocation for optimal cluster performance.
Cloud Infrastructure Management: Work with leading Cloud Platforms (AWS, GCP, Azure) to set up, configure, and manage infrastructure resources using Infrastructure as Code (Terraform, CloudFormation, etc.).
Monitoring & Incident Response: Set up monitoring solutions, define alerts, and manage the incident response process for any issues related to Jenkins, C3, or Kubernetes clusters.
Automate Infrastructure Processes: Build automation tools for scaling, monitoring, and maintaining infrastructure using modern tools like Terraform, Ansible, or equivalent.
Collaborate Across Teams: Work closely with development, DevOps, and operations teams to ensure smooth CI/CD workflows and seamless integration between application development and infrastructure.
Security & Compliance: Ensure all systems follow best practices in terms of security and compliance with relevant regulations. This includes role-based access, encryption, and automated vulnerability scanning.

Requirements:

Interest in and ability to become certified on the end client's AI platform (we will provide all necessary training and support).
Bachelor’s or master’s degree in computer science, a related field, or equivalent professional experience.
3+ years of experience as an SRE, DevOps Engineer, or related role.
Strong expertise in Jenkins administration and writing complex pipelines in Groovy.
Hands-on experience with Kubernetes in production environments (managing clusters, deployments, services, and pods).
Proficiency in cloud platforms like AWS, GCP, or Azure, including managing infrastructure via IaC tools like Terraform, CloudFormation, or equivalent.
Familiarity with monitoring tools like Prometheus, Grafana, or equivalent.
Experience with Helm and with managing Kubernetes applications via Helm charts.
Strong scripting and automation skills in languages like Bash, Python, or Groovy.
Experience with CI/CD tools, GitOps, and best practices for continuous integration and delivery pipelines.
Understanding of networking concepts and security best practices in a cloud-native environment.
Incident management experience, including setting up on-call rotations, managing runbooks, and conducting post-incident reviews.

Pay:

The wage range for this role takes into account the wide range of factors that are considered in making compensation decisions, including but not limited to skill sets; experience and training; licensure and certifications; and other business and organizational needs. The disclosed range estimate has not been adjusted for the applicable geographic differential associated with the location at which the position may be filled. At Fractal, it is not typical for an individual to be hired at or near the top of the range for their role and compensation decisions are dependent on the facts and circumstances of each case. A reasonable estimate of the current range is: $110,000 - $160,000. In addition, you may be eligible for a discretionary bonus for the current performance period.

Benefits:

As a full-time employee of the company or as an hourly employee working more than 30 hours per week, you will be eligible to participate in the health, dental, vision, life insurance, and disability plans in accordance with the plan documents, which may be amended from time to time. You will be eligible for benefits on the first day of employment with the Company. In addition, you are eligible to participate in the Company 401(k) Plan after 30 days of employment, in accordance with the applicable plan terms. The Company provides for 11 paid holidays and 12 weeks of Parental Leave. We also follow a “free time” PTO policy, allowing you the flexibility to take the time needed for either sick time or vacation.

Fractal provides equal employment opportunities to all employees and applicants for employment and prohibits discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state or local laws.

If you like wild growth and working with happy, enthusiastic over-achievers, you'll enjoy your career with us!

Not the right fit? Let us know you're interested in a future opportunity by clicking Introduce Yourself in the top-right corner of the page, or create an account to set up email alerts as new job postings that match your interests become available!

Top Skills: Ansible, AWS, Azure, Bash, Cloud Infrastructure, CloudFormation, GCP, Grafana, Groovy, Helm, Infrastructure as Code, Jenkins, Kubernetes, Prometheus, Python, Terraform

