
Subzero Labs

Site Reliability Engineer

Posted 2 Days Ago
Remote
Hiring Remotely in USA
Expert/Leader

About Subzero Labs

Subzero Labs is building the next generation of decentralized infrastructure.

Site Reliability Engineer

We're building the infrastructure behind a next-generation decentralized programmable network with reliability, observability, and confidentiality baked in from the ground up. As a Site Reliability Engineer, you'll ensure the scalability, performance, and reliability of our large-scale blockchain applications and infrastructure.

Position Overview

Combining software engineering and systems administration expertise, you'll take a proactive, software-centric approach to operational challenges. Your responsibilities include detecting issues, automating failure handling, devising disaster recovery plans, maintaining system uptime, and remediating broken systems so that failures do not recur. You'll apply coding, automation, and engineering principles to build resilient, self-healing systems that scale to meet growing demands.

What You'll Do

Network Infrastructure Reliability: Design fault-tolerant systems to run validators, nodes, and indexers across cloud and bare-metal environments. Build self-healing mechanisms that recover automatically from faults, crashes, and partitions.

Infrastructure as Code: Define production systems using Terraform, Helm, Kubernetes, or Pulumi—supporting reproducible deployments, rapid scaling, and multi-region HA clusters.

TEE-Backed Secure Computation: Deploy and manage trusted execution environments (TEEs) such as Intel TDX, AMD SEV-SNP, or Azure Confidential VMs for secure blockchain operations.

Observability & Alerting: Build comprehensive Grafana dashboards and AlertManager alerts to monitor chain liveness, network performance, and quality metrics. Instrument services with tracing, metrics, and logs down to the hardware level.

Performance & Resource Tuning: Profile and tune workloads under sustained high throughput, relieving CPU, memory, and disk I/O pressure. Build tools to detect degraded validators or slow block propagation in real time.

Security Hardening & Key Management: Engineer hardened signing pipelines integrating TEEs, HSMs, or cloud-native KMS systems. Manage key lifecycle (rotation, expiration, revocation) with zero downtime while reducing attack surface area.

CI/CD & Safe Rollouts: Build GitHub Actions workflows that test and ship changes across multiple environments. Own release engineering across devnet, testnet, and mainnet, ensuring protocol compatibility and seamless validator upgrades.

Incident Response & Chaos Engineering: Run fire drills, simulate node failures and partitions, and lead incident postmortems. Design for failure and validate assumptions under pressure.

Cross-Functional Communication: Work closely with engineers, product managers, node operators, and partners to support deployments, debug edge cases, and share best practices. You'll serve as a key interface between core protocol teams and the network operator ecosystem.

Requirements

  • 10+ years in DevOps or SRE roles with focus on tooling, automation, and infrastructure

  • Proficiency in systems and scripting languages: Rust, Go, Python, shell scripting

  • Experience writing and reviewing code, developing documentation and disaster recovery plans, and debugging complex problems on live blockchain systems

  • Advanced knowledge of cloud infrastructure: networking, orchestration tools, containerization, compute, and storage systems

  • Proven ability to design, develop, and deploy systems that improve throughput, latency, reliability, availability, and security

  • Clear communication skills: ability to explain technical concepts simply

  • Self-starter mindset: continuous learning and critical thinking under pressure

Preferred Qualifications

  • Background in distributed systems and consensus protocols

  • Experience with monitoring and observability platforms

  • Knowledge of security best practices for distributed systems


Top Skills

Go
Grafana
Helm
Kubernetes
Pulumi
Python
Rust
Shell Scripting
Terraform
HQ

Subzero Labs Los Angeles, California, USA Office

Los Angeles, CA, United States


