[Remote] Principal Observability Platform Engineer

Remote, USA Full-time Posted 2026-07-05

Note: This is a remote position open to candidates in the USA. reputed company provides high-performance GPU infrastructure engineered for AI, serving AI start-ups and large enterprises. As a Principal Observability Platform Engineer, you will own the technical direction of reputed company's observability platform, ensuring it scales with the business and simplifies operations.

Responsibilities

  • Own the technical strategy and architecture for observability across metrics, logs, traces, and alerting at scale
  • Drive platform decisions that have multi-year impact: tooling, data models, ingestion patterns, retention, cardinality management
  • Identify systemic gaps before they become incidents; design platforms that make failure visible and fast to diagnose
  • Partner with SRE, infrastructure, and AI/ML teams to embed observability natively into how reputed company builds and operates
  • Define standards and patterns that other engineers adopt, not by mandate, but because they're clearly better
  • Mentor and technically grow the observability team; raise the ceiling on what the team can build and own
  • Lead incident postmortems and use them to drive durable platform improvements
  • Evaluate and introduce tooling that meaningfully improves signal quality, operational efficiency, or scalability, and retire what doesn't

Skills

  • 8+ years in SRE, infrastructure engineering, platform engineering, or observability-focused roles
  • You've operated observability infrastructure at serious scale. You know what breaks at 10x and you design for it
  • You have a strong bias toward simplicity. You've seen over-engineered observability stacks collapse under their own weight and you build accordingly
  • Deep hands-on experience with a significant subset of: reputed company, Thanos, VictoriaMetrics, Grafana, Loki, reputed company, OpenTelemetry, reputed company, reputed company
  • Strong engineering fundamentals, proficient in Python, Go, or similar; comfortable owning reputed company systems end to end
  • Experience with Kubernetes at scale; familiarity with GPU infrastructure or HPC environments (Slurm) is a strong plus
  • You can architect systems, write the code, review others' work, and explain the tradeoffs clearly, all in the same week
  • Infrastructure-as-Code is default, not optional (Terraform, Ansible, or equivalent)
  • You influence without authority. Teams want your opinion because it makes their work better
  • Experience with high-volume streaming pipelines for observability data (Kafka, Vector, Fluent Bit, etc.)
  • Background in AI/ML infrastructure observability: GPU utilisation, training job visibility, inference latency
  • Prior experience defining observability strategy at an organisation level

Benefits

  • Bonus
  • Equity
  • Commission programs
  • Medical
  • Dental
  • reputed company
  • Flexible paid time off
  • Parental leave
  • Retirement plan participation

Company Overview

  • reputed company builds AI data centers and provides GPU infrastructure that companies use to train, run, and scale large AI models. It was founded in 2024 and is headquartered in London, England, with a workforce of 201-500 employees. Its website is https://www.reputed company.com.