Second Front Systems

Site Reliability Engineer - Observability

Reposted 25 Days Ago

Remote

Hiring Remotely in USA

160K-180K Annually

Senior level

The Senior Site Reliability Engineer will deploy and maintain observability infrastructure, automate processes, and troubleshoot complex systems within DoD networks.

The summary above was generated by AI

ABOUT THE ROLE

Second Front Systems' (2F) Product team is seeking a highly skilled and motivated Senior Site Reliability Engineer to join our Observability team. We are a small team working to accelerate the deployment of emerging technology into national security use cases. We are seeking technical professionals who want to operate on the front lines of an exciting and disruptive mission.

As a Senior SRE for Second Front Systems, you'll be responsible for deploying, maintaining, and scaling our observability infrastructure across multiple DoD networks. You'll work with Kubernetes-based platforms and BigBang charts from DoD Platform One, and you'll build automation to make our monitoring stack easier to deploy for new customers. You'll be empowered to collaborate with others to implement infrastructure that delivers unique capabilities for our commercial and government customers, including the Department of Defense.

The Observability team is looking for a strong SRE with deep DevSecOps and Kubernetes experience: someone who has deployed and maintained monitoring infrastructure at scale, with an eye for security in highly regulated environments. Experience with DoD software deployments, Platform One, and single-tenant architectures is highly valued.

We are a fast-growing entrepreneurial team working at the convergence of technology and national security. If this type of effort interests you, come join us!

Note: This position requires U.S. citizenship due to government contract requirements.

Candidates must be located in one of the following geographic areas: DMV (DC/Maryland/Virginia), Raleigh/Durham/Chapel Hill, Denver/Colorado Springs, or Dallas/Fort Worth.

What You’ll Do

Deploy and maintain our observability stack (Grafana, Mimir, Prometheus) across multiple customer clusters and DoD networks
Build Helm chart abstractions and automation to streamline monitoring deployments for new customers
Troubleshoot and debug complex Kubernetes issues, networking problems, and monitoring stack failures
Configure and maintain BigBang charts and DoD Platform One integrations
Design and implement infrastructure automation using tools like Pulumi, ArgoCD, and Flux
Work with Istio service mesh and Keycloak for authentication in secure environments
Monitor and optimize performance of monitoring infrastructure across multiple environments
Collaborate with security teams to ensure compliance with NIST requirements and DoD standards
Participate in on-call rotation and incident response for production environments

Skills You’ll Bring to Our Team

5+ years of Site Reliability Engineering or DevOps experience
Deep experience with Kubernetes administration, troubleshooting, and scaling
Hands-on experience deploying and maintaining observability tools (Prometheus, Grafana, Mimir/Cortex)
Strong understanding of Helm charts, GitOps practices, and CNCF tooling
Experience with service mesh technologies (Istio preferred)
Proven ability to debug complex distributed systems and networking issues
Understanding of authentication systems and security in regulated environments
Ability to work independently and collaborate with team members in a remote environment

Preferred Qualifications

Active security clearance or ability to obtain a Secret-level security clearance
Previous experience with DoD software deployments and Platform One
Experience with BigBang charts and Iron Bank containers
Experience working in national security or highly regulated environments
Familiarity with compliance frameworks (NIST, FedRAMP, etc.)
Experience with infrastructure as code (Pulumi, Terraform)

Technologies We Use

Observability: Grafana stack, Prometheus, custom alerting tools
Kubernetes: Helm, ArgoCD, Flux, Tekton, BigBang charts
Security: Istio, Keycloak, Kyverno
Infrastructure: AWS/GCP/Azure, Pulumi, Git/GitLab
Languages: YAML, Bash, Go
