
Data Engineer

Remote, USA Full-time Posted 2026-07-05

This is a remote position.

Data Engineer: MCA Corporate Filings Pipeline

About Verizol

Verizol is building India's most comprehensive new-company intelligence platform. Every day, thousands of companies register with MCA and GST across India. We capture this data the same day it becomes available, enrich it with director contact details and financial intelligence, and deliver it to CA firms, agencies, NBFCs, and businesses through our subscription platform and white-label reseller network.

This role is the technical core of our product. The data pipeline you build is what subscribers pay for every month.

About This Role

We are looking for a Data Engineer to own and build the MCA Corporate Filings Intelligence Pipeline: the system that converts government filings (PDFs, XBRL, scanned documents, and web data) into clean, structured, queryable business intelligence.

MCA does not provide an official public API. This role requires building a resilient data acquisition system using a combination of unofficial endpoints, web scraping, third-party enrichment APIs, and AI-based document extraction, and making it run reliably, every single day, without breaking.

If you enjoy the challenge of "the data is out there, but it's a mess, so go make it useful," this role is for you.

What You Will Build

Daily Company Ingestion Pipeline: Build and maintain the pipeline that fetches newly incorporated companies (Private Limited, LLP, OPC) from MCA every day, using a combination of MCA's unofficial v3 endpoints, monthly ROC bulk files, and Selenium-based scraping as a fallback. This pipeline must run every morning before 8 AM and handle rate limits, CAPTCHAs, and IP rotation gracefully.
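As a rough illustration of the source fallback described for the ingestion pipeline, the sketch below tries a list of sources in order and returns the first one that yields records. Everything in it is an assumption made for illustration: the `CompanyRecord` shape, the `Source` type, the `fetchWithFallback` name, and the choice to treat an empty result as a failure are hypothetical and are not taken from Verizol's codebase.

```typescript
// Hypothetical record shape; the posting does not define one.
interface CompanyRecord {
  cin: string;
  name: string;
}

// A source is any async loader that returns records for a given date.
interface Source {
  label: string;
  load: (date: string) => Promise<CompanyRecord[]>;
}

interface FallbackResult {
  source: string;          // label of the source that succeeded
  records: CompanyRecord[];
  errors: string[];        // why earlier sources were skipped
}

// Try each source in order and return the first non-empty result.
// Assumption: an empty result counts as a failure and triggers the next source.
async function fetchWithFallback(
  date: string,
  sources: Source[],
): Promise<FallbackResult> {
  const errors: string[] = [];
  for (const source of sources) {
    try {
      const records = await source.load(date);
      if (records.length > 0) {
        return { source: source.label, records, errors };
      }
      errors.push(`${source.label}: returned no records`);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${source.label}: ${message}`);
    }
  }
  // Every source failed or came back empty; surface all reasons for alerting.
  throw new Error(`All sources failed for ${date}: ${errors.join("; ")}`);
}
```

In this sketch the caller would pass the sources in priority order (for example an endpoint client first, a bulk-file reader second, a browser-based scraper last), and the collected `errors` could feed whatever alerting the pipeline uses.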
Director and Contact Enrichment: Enrich every new company with director details (name, DIN, designation) and, where possible, director mobile numbers and emails, using a chained fallback across multiple third-party APIs (including Sandbox and CompData) and GST cross-referencing.

Financial Filings Extraction Pipeline: Build the system that downloads AOC-4, MGT-7, CHG-1, DIR-12, and PAS-3 filings and extracts structured financial data from them, using XBRL parsing for structured filings and a combination of OCR (Tesseract) and AI extraction (Claude API) for scanned PDFs.

Data Transformation and Intelligence Layer: Normalise extracted financial data (currency units, date formats, validation), compute financial ratios (including debt-to-equity and profit margins), generate a 0-100 financial health score per company, and detect business signals (growth companies, loan opportunities, financial distress, recent funding).

Director Network Graph: Build and maintain a graph of director-to-company relationships, used to detect connected companies, serial founders, and director disqualification risks (MCA Section 164).

Pipeline Orchestration and Monitoring: Schedule and monitor jobs using AWS Step Functions / Bull queues with cron scheduling. Build comprehensive failure handling, retry logic, and real-time WhatsApp alerting when pipelines fail or quality drops.

Data Quality and Compliance: Build validation rules, quality scoring, duplicate detection, and DPDPA-compliant data handling (stripping prohibited personal data fields, honeypot record management, opt-out suppression).

What Makes This Role Interesting

You are solving a real puzzle, not following a spec. MCA has no official API. There is no documentation. You will be reverse-engineering endpoints, building fallback chains, and constantly adapting as government websites change without notice. This is data engineering at its most hands-on.

Your output is the product. Every subscriber's morning data alert, every financial health score shown on the dashboard, every "company registered yesterday" notification: all of it comes from the pipeline you build and maintain.

You will work with cutting-edge AI extraction. Using Claude to turn messy scanned PDFs of Indian balance sheets into clean structured JSON is a genuinely novel application. You will be designing and refining prompts that directly affect data accuracy for thousands of users.

High ownership, fast feedback loops. If the pipeline breaks at 6 AM, you will know by 6:15 AM, fix it, and see it reflected for subscribers within the hour. No multi-week deployment cycles.

A Typical Week Might Include

- Investigating why the MCA unofficial API started returning 403s and adding a Selenium fallback
- Writing a new Claude extraction prompt for CHG-1 (loan charge) filings and validating accuracy against 50 sample documents
- Tuning the financial health score weights after reviewing a month of computed scores
- Adding a new enrichment provider to the director mobile fallback chain and measuring the lift in enrichment
- Debugging a 12% spike in validation errors and tracing it back to a currency-unit detection bug
- Reviewing CAPTCHA-solving costs and optimising the caching layer to reduce redundant document downloads

Interview Process

- Initial screening call (30 minutes): background, experience, and role fit
- Technical round (60 to 90 minutes): data pipeline design discussion plus live problem-solving (e.g. "how would you extract structured data from this messy PDF text")
- Take-home task: a small real-world extraction or pipeline design problem similar to what you would work on
- Final round with founder: architecture discussion, culture fit, and Q&A

We aim to complete the entire process within 7 to 10 days.

How to Apply

Send your resume and portfolio link to [email protected] with the subject line "Data Engineer Application — [Your Name]". Include a short note (3 to 4 lines) on a data pipeline or scraping project you have built, especially if it involved messy or unofficial data sources. We read every application personally.

Verizol is an equal opportunity employer. We welcome applications from all backgrounds and experience levels that meet the must-have criteria.

Requirements

Tech Stack You Will Use

- Language/Runtime: Node.js, TypeScript
- Database: PostgreSQL on AWS, with heavy schema and index design work
- Queues/Orchestration: Bull, AWS Step Functions, EventBridge
- Web Scraping: Selenium/Puppeteer, rotating proxies, cookie-jar session management
- CAPTCHA Solving: 2Captcha API integration
- Document Processing: pdf-parse, pdf2pic, Tesseract OCR, xml2js (XBRL parsing)
- AI Extraction: Claude API, with prompt design for structured JSON extraction from messy text
- Storage: AWS S3 (raw document archive)
- Enrichment APIs: Sandbox.co.in, CompData, GST data cross-reference
- Monitoring: CloudWatch, WhatsApp (WATI) alerting

You do not need prior experience with every item on this list, but you should be excited to learn government data systems, OCR, and AI-based extraction if you haven't worked with them before.

What We Are Looking For

Must-Have

- 2+ years of experience building data pipelines: ETL/ELT systems, scheduled jobs, or similar
- Strong Node.js or Python skills (we use Node.js/TypeScript, and willingness to work in this stack is required)
- Solid PostgreSQL experience: schema design, indexing, writing and optimising queries
- Experience with async job processing: queues, cron, retries, and failure handling
- Experience working with external APIs: authentication, rate limiting, pagination, error handling
- Strong debugging skills: comfortable diagnosing why a pipeline silently produced bad data
- Attention to data quality: you care about validating, not just moving, data

Good to Have

- Experience with web scraping at scale: Selenium, Puppeteer, Playwright, proxy rotation, CAPTCHA handling
- Experience with OCR (Tesseract or similar) and PDF text extraction
- Experience parsing structured formats: XML, XBRL, JSON Schema validation
- Experience using LLM APIs (Claude, GPT) for structured data extraction
- AWS experience: Step Functions, S3, EventBridge
- Familiarity with financial statements (balance sheet, P&L): understanding what fields matter and why
- Experience building monitoring/alerting systems for data pipelines

Not Required

- No prior fintech or compliance background necessary. We will explain MCA forms, financial statements, and DPDPA requirements as part of onboarding
- No frontend experience required

Benefits

Compensation and Benefits

- ₹8,00,000 to ₹16,00,000/year based on experience and interview performance
- ESOPs for early team members: meaningful equity in a growing company
- Direct collaboration with the founder on pipeline architecture and prioritisation
- Budget allocated for third-party enrichment APIs, proxies, and CAPTCHA solving, which you decide how to allocate
- Flexible working hours once ramped up. We care about pipeline reliability and output, not hours logged
