Head – DR Orchestration & Testing
Description
1. Role Purpose
Lead the strategy, governance, automation, and execution of Disaster Recovery (DR) orchestration and testing across Equity Group. Ensure critical services recover within agreed Recovery Time and Recovery Point Objectives (RTO/RPO), strengthen operational resilience, and evidence compliance with internal policies and international standards (e.g., ISO 22301, ISO/IEC 27031, ISO/IEC 27001, NIST SP 800-34). The role embeds “prove-it” recovery through continuous testing, telemetry, and executive reporting, supporting COHERE goals and the ARRP agenda.
2. Role Summary
A senior leadership role accountable for an enterprise-wide DR orchestration and testing program spanning data centers, cloud, networks, applications, data platforms, and third-party services. The role builds automated runbooks, governs recovery scenarios, executes end-to-end exercises (from tabletop to full failover), and drives remediation. It works closely with Change/Release, Backup & Recovery, Cybersecurity, SRE/Operations, and Business Units to assure recoverability for core banking, payments, digital channels, and shared services across all subsidiaries.
Key Accountabilities
Strategy, Policy & Governance
• Define and maintain the Group DR Orchestration & Testing Policy, Standards, and Playbooks aligned to ITIL 4, ISO 22301, ISO/IEC 27031, and NIST SP 800-34.
• Institutionalize governance through the IT Steering Committee and the Service Continuity/DR Working Group, which serve as the forums for review cadence, accountability, and reporting.
• Establish decision rights, RACI, and acceptance criteria for “go-live” recoverability (RTO/RPO, data integrity, service dependencies).
• Embed DR impact assessment in Change, Release, and Architecture review gates.
Orchestration & Automation
• Design and implement automated recovery runbooks (e.g., infrastructure, platform, database, application, network/DNS, identity) using workflow/orchestration tools, Infrastructure-as-Code, and CI/CD pipelines.
• Engineer repeatable failover/failback patterns (active–active, active–standby; zone, region, and site level) for on‑prem, hybrid, and cloud workloads.
• Integrate observability (APM, logs, synthetic monitoring) to validate service health during exercises and live incidents.
Testing Program Management
• Own the Group DR Test Calendar (annual/quarterly/monthly) covering tabletop, technical component tests, integrated service tests, and full-scale exercises.
• Define test scenarios based on business impact analyses (BIAs), risk scenarios (e.g., ransomware, data center outage, carrier failure, major release rollback), and regulatory expectations.
• Measure and certify recoverability per service; track defects, action owners, and closure SLAs.
Data, Backup & Cyber Recovery Assurance
• Align backup/restore testing with application-level recovery (including immutable/air-gapped copies, vaulting, and key management).
• Validate data integrity, transaction reconciliation, and journal consistency post-recovery (e.g., core banking, card switch, channels).
• Coordinate with Cybersecurity on ransomware readiness, clean‑room recovery, and malware‑free restore procedures.
Third-Party & Cloud Resilience
• Assess and test DR commitments of critical vendors/fintech partners; verify evidence of recoverability and exit/failover options.
• Govern SaaS and cloud region/zone strategies, data residency constraints, and cross‑border implications for subsidiaries.
Service Mapping & Readiness
• Maintain service dependency maps in the CMDB, linking business services to applications, platforms, data stores, integrations, and infrastructure.
• Define minimal viable service (MVS) configurations for recovery and ensure runbooks reflect current state.
Metrics, Reporting & Continuous Improvement
• Define and report KPIs/KRIs: test coverage %, pass rate, RTO/RPO adherence, MTTR (exercises/incidents), % automated runbooks, restore success rate, findings aging, and resilience confidence score.
• Produce executive dashboards and Monthly/Quarterly Resilience Reports for the Group CIO, CFO, Risk, and Executive Committees.
• Run post-exercise/post-incident reviews and drive structural fixes (automation, design changes, capacity).
Subsidiary Coordination & Incident Readiness
• Coordinate DR readiness across Banking, Insurance, Fintech, Health, and Foundation; tailor scenarios to local contexts while enforcing Group standards.
• Lead or support technical recovery command during major incidents and planned DR events.
Financial Planning & Value Optimization
• Quantify cost‑to‑recover vs. risk; recommend right‑sized patterns (active–active vs. warm/cold) by criticality.
• Support budgeting for resilience tooling, testing, and automation; demonstrate ROI through reduced downtime and faster recovery.
Key Deliverables
• Group DR Orchestration & Testing Policy, Standards, and Runbook Library.
• Annual DR Test Calendar with scenario catalog and success criteria.
• Service-level Recovery Certificates (per critical service) and remediation tracker.
• Enterprise Resilience Dashboard (RTO/RPO, coverage, pass rate, MTTR, confidence score).
• Quarterly Executive Resilience Reports and Board-ready summaries.
• Post-Exercise/Incident Review reports with prioritized corrective actions.
• Up-to-date Service Dependency Maps and MVS definitions.
Qualifications
Required Qualifications & Experience
Education
• Bachelor’s in Computer Science, Engineering, Information Systems, or related field.
• Master’s in IT Management, Business Continuity/Resilience, or Operations is an advantage.
Certifications (Preferred)
• ISO 22301 Lead Implementer/Lead Auditor, BCI (AMBCI/MBCI), or DRI (ABCP/CBCP).
• ITIL 4 Managing Professional.
• ISO 27001 Lead Implementer/Auditor or CISSP/CISM (nice to have).
• Cloud architecture (Azure/AWS) and/or Kubernetes platform certifications (nice to have).
Professional Experience
• 10+ years in IT service continuity/operations/architecture with 5+ years leading DR/BCP programs in regulated, multi-entity environments (BFSI preferred).
• Proven delivery of automated DR runbooks and end-to-end recovery exercises across hybrid/cloud and on‑prem estates.
• Hands-on experience with backup, storage, and database replication (e.g., Commvault, Oracle Data Guard, SQL Server Always On), traffic management/DNS, messaging, container platforms, identity, and observability.
• Experience coordinating vendors/partners and assuring third‑party recoverability.