TWG Global

Platform Reliability Engineer

Reposted 20 Days Ago

In-Office

Santa Monica, CA, USA

120K-190K Annually

Mid level

At TWG Group Holdings, LLC (“TWG Global”), we drive innovation and business transformation across a range of industries—including financial services, insurance, technology, media, and sports—by leveraging data and AI as core assets. Our AI-first, cloud-native approach delivers real-time intelligence and interactive business applications, empowering informed decision-making for both customers and employees.

We prioritize responsible data and AI practices to ensure ethical standards and regulatory compliance. Our decentralized structure enables each business unit to operate autonomously, supported by a central AI Solutions Group, while strategic partnerships with leading data and AI vendors fuel game-changing efforts in marketing, operations, and product development.

You will collaborate with management to advance our data and analytics transformation, enhance productivity, and enable agile, data-driven decisions. By leveraging relationships with top tech startups and universities, you will help create competitive advantages and drive enterprise innovation.

At TWG Global, your contributions will support our goal of sustained growth and superior returns, as we deliver rare value and impact across our businesses. We’re a fast-growing AI/ML team delivering high-impact use case solutions to financial institutions, insurers, and other regulated enterprises. Backed by proven leaders in finance and national security, our team is scaling rapidly to serve clients across North America with robust, secure, and production-grade AI solutions.

Role Overview

We are seeking a Platform Reliability Engineer (SRE) to ensure the scalability, stability, and performance of our data platforms and ML infrastructure. You’ll work closely with data scientists, ML engineers, and platform vendors to deploy and monitor production systems, automate workflows, and reduce operational overhead.

What you'll do:

Build and maintain infrastructure to support real-time and batch ML workloads
Implement observability tools (logging, monitoring, alerting) for model performance and system uptime
Design and manage CI/CD pipelines for applications
Ensure high availability, disaster recovery, and rollback capabilities for production environments
Manage access controls, secrets, and security policies in collaboration with compliance and IT
Troubleshoot incidents, lead postmortems, and drive root-cause resolution
Work with U.S. and international teams to provide 24/7 coverage across time zones

Requirements

3–6 years of experience in DevOps, SRE, or backend engineering roles
Proficiency with tools such as Docker, Kubernetes, Terraform, GitLab/GitHub Actions, and Airflow
Strong scripting in Python or Bash and familiarity with Linux environments
Knowledge of observability stacks (e.g., Prometheus, Grafana, ELK, Datadog)
Familiarity with cloud platforms (e.g., AWS, GCP, or Azure)
Strong documentation, problem-solving, and incident response skills

Preferred Qualifications:

Experience supporting ML/AI workflows using Palantir Foundry (a plus, not required)
Exposure to compliance frameworks like SOC 2, ISO 27001, or financial regulations
Knowledge of MLOps frameworks (e.g., MLflow, Kubeflow, SageMaker Pipelines)
Ability to automate deployments, testing, and monitoring at scale

Benefits

Work on real-world AI applications with high-impact clients
Collaborate with world-class data scientists, engineers, and product leaders
Flat org structure, high trust, high autonomy
Competitive salary + performance-based incentives

Position Location

This is an onsite position based in Jacksonville, FL; New York, NY; or Santa Monica, CA. Remote candidates will be considered on a case-by-case basis.

Compensation

The base pay range for this position is $120,000-$190,000 annually. A bonus will be provided as part of the compensation package, in addition to the full range of medical, financial, and/or other benefits.

Top Skills

Airflow, AWS, Azure, Bash, Datadog, Docker, ELK, GCP, GitHub Actions, GitLab, Grafana, Kubeflow, Kubernetes, MLflow, Palantir Foundry, Prometheus, Python, SageMaker Pipelines, Terraform
