[Remote] Founding Sr. Operations Support Specialist
Note: This is a remote position open to candidates in the USA. The company is a US-based startup developing a FinTech mobile app designed to bring hedge-fund-grade trading intelligence to everyday investors. The Founding Sr. Operations Support Specialist will be responsible for ensuring the reliability and operational excellence of the platform, leading incident management, and establishing a strong observability culture.
Responsibilities
- Define and manage SLIs, SLOs, and error budgets for critical user journeys
- Ensure high system availability, low latency, and minimal error rates
- Proactively identify risks and implement strategies to prevent SLO breaches
- Partner with engineering to balance reliability vs feature velocity
- Act as Incident Commander for high-severity (P0/P1) incidents
- Lead real-time war rooms, ensuring rapid issue resolution
- Own the full incident lifecycle: detection → response → recovery → RCA → prevention
- Establish and enforce incident response frameworks, SLAs, and escalation policies
- Drive blameless postmortems and continuous improvement
- Monitor and analyze observability dashboards across cloud, analytics, and application layers to identify infrastructure issues, detect application downtime, and uncover system anomalies impacting reliability
- Build dashboards and alerts for real-time system visibility
- Correlate signals across infrastructure, application, and AI systems
- Analyze trends from tickets, logs, and telemetry to detect systemic issues
- Monitor AI-specific signals (model drift, inference latency, failures, anomalies)
- Lead intake of customer tickets, alerts, and operational signals
- Define and manage priority classification (P0–P3) and response expectations
- Resolve customer-impacting issues or coordinate with internal teams
- Drive collaboration across AI, Backend, Frontend, Mobile, DevOps, QA, and Product
- Define and optimize ticket workflows and escalation paths
- Lead communication during incidents with both technical and non-technical stakeholders
- Own the release calendar and operational readiness checks
- Ensure monitoring, rollback plans, and risk assessments are in place
- Validate system performance post-deployment
- Build automated runbooks and self-healing systems
- Reduce manual toil through scripting and tooling
- Improve system resilience using failover, scaling, and redundancy mechanisms
Skills
- 10+ years in Production Support / SRE / Technical Operations
- Strong understanding of SLO, SLI, SLA, and error budgets
- Proven experience in incident management and troubleshooting distributed systems
- Hands-on experience with cloud platforms (AWS & GCP)
- Strong debugging and root cause analysis skills
- Experience supporting mobile applications (iOS/Android)
- Understanding of DevOps and SRE practices
- Exposure to AI/ML systems and model behavior monitoring
- Experience with log management and tracing systems
- Monitoring & Observability: Azure Monitor, Grafana (or similar)
- Incident Management: Opsgenie (or similar)
- Scripting/Automation: Python, PowerShell, Bash
- Logging: ELK Stack, Azure Monitor Logs, Splunk, or similar
- Tracing: OpenTelemetry, Jaeger, Zipkin, or Azure Application Insights
- Familiarity with low-latency or financial systems
Company Overview