Regional Disaster Recovery (RDR) Architecture and Deployment in ocs-ci
Overview
Regional Disaster Recovery (RDR) is a disaster recovery solution for OpenShift Data Foundation (ODF) that enables asynchronous replication of persistent volumes across geographically distributed OpenShift clusters. RDR provides application failover and relocate capabilities between a primary and secondary cluster.
Key Characteristics:
Mode: regional-dr (async replication)
Replication Policy: Asynchronous (async)
Cluster Roles: ActiveACM (Hub), PrimaryODF, SecondaryODF
Storage Types: RBD (Ceph Block) and CephFS (Ceph Filesystem)
Deployment Modes: Greenfield and Brownfield
RDR Architecture
High-Level Architecture
graph TB
subgraph ACM["ACM Hub Cluster"]
ACM_COMP["ACM Components"]
ACM_LIST["• Advanced Cluster Management<br/>• Multicluster Engine<br/>• ODF Multicluster Orchestrator<br/>• Ramen DR Hub Operator<br/>• DRPolicy Management<br/>• DRPC Orchestration"]
ACM_COMP -.-> ACM_LIST
end
subgraph PRIMARY["Primary ODF Cluster"]
P_STORAGE["ODF Storage"]
P_STORAGE_LIST["• RBD Mirror<br/>• Ceph Cluster<br/>• VolumeReplication"]
P_DR["DR Components"]
P_DR_LIST["• Ramen DR Cluster<br/>• VRG Primary<br/>• VolSync"]
P_WORKLOAD["Active Workloads"]
P_WORKLOAD_LIST["• Subscriptions<br/>• ApplicationSets"]
P_STORAGE -.-> P_STORAGE_LIST
P_DR -.-> P_DR_LIST
P_WORKLOAD -.-> P_WORKLOAD_LIST
end
subgraph SECONDARY["Secondary ODF Cluster"]
S_STORAGE["ODF Storage"]
S_STORAGE_LIST["• RBD Mirror<br/>• Ceph Cluster<br/>• VolumeReplication"]
S_DR["DR Components"]
S_DR_LIST["• Ramen DR Cluster<br/>• VRG Secondary<br/>• VolSync"]
S_WORKLOAD["Standby Workloads"]
S_STORAGE -.-> S_STORAGE_LIST
S_DR -.-> S_DR_LIST
end
ACM -->|Manages| PRIMARY
ACM -->|Manages| SECONDARY
PRIMARY <-->|Async Replication| SECONDARY
style ACM fill:#e1f5ff
style PRIMARY fill:#c8e6c9
style SECONDARY fill:#fff9c4
Network Connectivity
RDR requires network connectivity between clusters:
Submariner: Provides secure Layer 3 connectivity
Globalnet: Enables overlapping CIDR ranges
S3 Storage: For metadata and backup storage
Latency Requirement: < 10ms RTT for hub-spoke communication
Key Components
1. ACM Hub Cluster Components
Advanced Cluster Management (ACM)
Purpose: Central management and orchestration
Version: 2.12+
Key Functions:
Cluster lifecycle management
Application deployment via GitOps
Policy enforcement
Observability
Multicluster Engine (MCE)
Purpose: Cluster provisioning and management
Deployment: Installed on ACM hub
Functions: Cluster import, managed cluster lifecycle
ODF Multicluster Orchestrator
Deployment: odf-multicluster-orchestrator-controller-manager
Namespace: openshift-operators
Purpose: Coordinates storage operations across clusters
Key Resources:
MirrorPeer: Defines replication relationships
StorageClusterPeer: Manages peer connections
Ramen DR Hub Operator
Purpose: DR orchestration and policy management
Key CRDs:
DRPolicy: Defines DR policies and scheduling intervals
DRPlacementControl (DRPC): Controls application placement
DRCluster: Represents managed clusters in DR topology
2. Managed Cluster (Primary/Secondary) Components
ODF Storage Cluster
Components:
Ceph cluster (Mon, OSD, MGR)
RBD provisioner
CephFS provisioner
Storage classes
RBD Mirroring
Purpose: Asynchronous block storage replication
Components:
rbd-mirror pods
Volume Replication CRDs
Replication secrets
Deployment Modes:
Greenfield: New deployments with bluestore-rdr annotation
Brownfield: Existing deployments
Ramen DR Cluster Operator
Label: app=ramen-dr-cluster
Purpose: Local DR operations on managed clusters
Key CRDs:
VolumeReplicationGroup (VRG): Groups PVCs for replication
VolumeReplication: Per-PVC replication control
VolSync (ODF 4.19+)
Purpose: CephFS replication using Restic/Rclone
Storage Class: ocs-storagecluster-cephfs-vrg
Components:
ReplicationSource
ReplicationDestination
Token Exchange Agent
Label: app=token-exchange-agent
Purpose: Secure credential exchange between clusters
Namespace: openshift-storage
3. Workload Types
Subscription-based Applications (Soon to be Deprecated)
Namespace: Application-specific
DRPC Location: Application namespace
GitOps: ACM ApplicationSet or Subscription
ApplicationSet-based Applications
Namespace: openshift-gitops
DRPC Location: openshift-gitops
GitOps: ArgoCD ApplicationSet
Discovered Applications
Purpose: Protect existing applications without GitOps
DRPC Location: openshift-dr-ops
Features:
KubeObject protection
Recipe-based backup
Multi-namespace support
Multicluster Access Patterns
Context Switching in ocs-ci
The framework uses context switching to manage multiple clusters:
# Switch to ACM hub cluster
config.switch_acm_ctx()
# Switch to primary cluster
primary_config = get_primary_cluster_config()
config.switch_ctx(primary_config.MULTICLUSTER["multicluster_index"])
# Switch by cluster name
config.switch_to_cluster_by_name("cluster-name")
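Each of the calls above changes global state, so a helper that forgets to switch back can leave later steps running against the wrong cluster. One way to guard against that is a restore-on-exit wrapper. The sketch below is illustrative only: `FakeConfig` is a stand-in for the framework config object, and the `cur_index` attribute and the `cluster_context` helper are assumptions made for this example, not confirmed ocs-ci APIs.

```python
from contextlib import contextmanager


class FakeConfig:
    """Stand-in for the framework config; it only tracks the active index."""

    def __init__(self):
        self.cur_index = 0  # assumed attribute name, for illustration only

    def switch_ctx(self, index):
        self.cur_index = index


@contextmanager
def cluster_context(config, index):
    """Switch to cluster `index`, then restore the previous context on exit."""
    previous = config.cur_index
    config.switch_ctx(index)
    try:
        yield config
    finally:
        # Runs even if the body raises, so the caller's context is preserved.
        config.switch_ctx(previous)


config = FakeConfig()
with cluster_context(config, 2):
    inside = config.cur_index  # operations here target cluster 2
after = config.cur_index  # the original context is active again
```

The `try`/`finally` is the point of the pattern: a failed assertion inside the block still returns the session to the cluster it started on.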
Cluster Roles and Indexes
# RDR Roles
RDR_ROLES = ["ActiveACM", "PrimaryODF", "SecondaryODF"]
# Optional: PassiveACM for dual-hub scenarios
if get_passive_acm_index():
    RDR_ROLES.append("PassiveACM")
# Cluster ranking
ACM_RANK = 1
MANAGED_CLUSTER_RANK = 2
DRPC Access Patterns
from ocs_ci.helpers import dr_helpers
from ocs_ci.ocs import constants
from ocs_ci.ocs.resources.drpc import DRPC
# Get current primary cluster
primary_cluster_name = dr_helpers.get_current_primary_cluster_name(
    namespace=workload_namespace,
    workload_type=constants.SUBSCRIPTION,
)
# Get current secondary cluster
secondary_cluster_name = dr_helpers.get_current_secondary_cluster_name(
    namespace=workload_namespace,
    workload_type=constants.SUBSCRIPTION,
)
# Access DRPC object
drpc_obj = DRPC(namespace=workload_namespace)
drpc_data = drpc_obj.get()
# Check DRPC action; spec.action is absent until a failover or relocate is requested
if drpc_data["spec"].get("action") == constants.ACTION_FAILOVER:
    current_cluster = drpc_data["spec"]["failoverCluster"]
else:
    current_cluster = drpc_data["spec"]["preferredCluster"]
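The action check above can be wrapped in a small function that works on any DRPC dictionary, which makes the rule easy to unit-test without a live cluster. This is a minimal sketch: the function name is hypothetical, and the literal `"Failover"` is assumed to match the value of `constants.ACTION_FAILOVER`.

```python
def resolve_active_cluster(drpc_data, failover_action="Failover"):
    """Return the cluster a DRPC currently points its workload at.

    `failover_action` is assumed to equal constants.ACTION_FAILOVER.
    """
    spec = drpc_data.get("spec", {})
    if spec.get("action") == failover_action:
        return spec.get("failoverCluster")
    # No action yet, or a Relocate action: the workload follows preferredCluster.
    return spec.get("preferredCluster")


failed_over = {
    "spec": {
        "action": "Failover",
        "failoverCluster": "secondary-odf-cluster",
        "preferredCluster": "primary-odf-cluster",
    }
}
fresh = {"spec": {"preferredCluster": "primary-odf-cluster"}}

active_after_failover = resolve_active_cluster(failed_over)
active_when_fresh = resolve_active_cluster(fresh)
```

Using `.get()` throughout means a freshly created DRPC, which has no `spec.action` yet, resolves to its preferred cluster instead of raising `KeyError`.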
Replication Resource Access
# Check VolumeReplicationGroup state
vrg_obj = OCP(
    kind=constants.VOLUME_REPLICATION_GROUP,
    namespace=workload_namespace,
)
# Check mirroring status on primary
config.switch_to_cluster_by_name(primary_cluster_name)
dr_helpers.wait_for_mirroring_status_ok(
    replaying_images=pvc_count
)
# Verify replication destinations on secondary
config.switch_to_cluster_by_name(secondary_cluster_name)
dr_helpers.wait_for_replication_destinations_creation(
    pvc_count, workload_namespace
)
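The `wait_for_*` helpers used above follow a poll-until-true pattern: query a status, return once it is healthy, and fail after a deadline. The sketch below shows that pattern in isolation. It is not the ocs-ci implementation; `wait_until` and the fake `mirroring_ok` status source are hypothetical names used only for illustration.

```python
import time


def wait_until(check, timeout=5.0, interval=0.1):
    """Poll `check()` until it returns a truthy value or `timeout` expires.

    Returns the truthy value; raises TimeoutError if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)


# Usage with a fake status source that becomes healthy on the third poll.
polls = {"count": 0}


def mirroring_ok():
    polls["count"] += 1
    return polls["count"] >= 3


wait_until(mirroring_ok, timeout=2.0, interval=0.01)
```

Using `time.monotonic()` for the deadline keeps the timeout immune to wall-clock adjustments during long replication waits.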
OCS-CI Deployment Flow
Phase 1: Infrastructure Setup
1. ACM Hub Cluster Deployment
├── Deploy OpenShift cluster
├── Install ACM operator
├── Install MCE operator
└── Configure observability
2. Managed Clusters Deployment (Primary & Secondary)
├── Deploy OpenShift clusters via ACM
│ ├── Create/import cluster prerequisites
│ ├── Create/import cluster via ACM UI/CLI
│ └── Wait for cluster ready
├── Install ODF operator
├── Create StorageCluster
└── Verify ODF deployment
Phase 2: DR Infrastructure Setup
3. DR Operators Deployment
├── On ACM Hub:
│ ├── Deploy ODF Multicluster Orchestrator
│ │ └── Verify deployment available
│ ├── Enable MCO console plugin
│ └── Create ServiceExporter (4.19+)
│
└── On Managed Clusters:
    ├── Enable RBD mirroring on StorageCluster
    ├── Deploy Ramen DR Cluster Operator
    └── Configure S3 secrets for DR
4. Network Configuration (if Submariner enabled)
├── Download subctl CLI
├── Deploy broker on primary cluster
├── Join clusters to broker
└── Verify connectivity
Phase 3: DR Configuration
5. MirrorPeer Creation
├── Load MirrorPeer template (MIRROR_PEER_RDR)
├── Update cluster names in spec
├── Apply MirrorPeer on ACM hub
└── Validate MirrorPeer status
    ├── Check phase: "ExchangedSecret"
    ├── Verify token-exchange-agent pods
    └── Verify rbd-mirror pods
6. DRPolicy Creation
├── Load DRPolicy template
├── Configure:
│ ├── drClusters: [primary, secondary]
│ ├── schedulingInterval: "5m" (default)
│ └── replicationClassSelector (for RBD)
├── Apply DRPolicy on ACM hub
└── Validate DRPolicy status: "Validated"
7. StorageClusterPeer Validation (4.19+)
├── Verify peer state on both clusters
└── Verify VolSync deployment
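To make the DRPolicy step in Phase 3 concrete, the sketch below builds a DRPolicy manifest as a plain dictionary from the fields listed above (`drClusters` and `schedulingInterval`). The `apiVersion` string, the policy name `odr-policy-5m`, and the helper name are assumptions for illustration; the actual template is the one ocs-ci loads, and `replicationClassSelector` is omitted here because its contents are not described in this document.

```python
def build_dr_policy(name, dr_clusters, scheduling_interval="5m"):
    """Build a DRPolicy manifest as a plain dict (illustrative sketch)."""
    if len(dr_clusters) != 2:
        # This sketch assumes one primary/secondary pair, as in the flow above.
        raise ValueError("expected exactly two clusters: primary and secondary")
    return {
        "apiVersion": "ramendr.openshift.io/v1alpha1",  # assumed API version
        "kind": "DRPolicy",
        "metadata": {"name": name},
        "spec": {
            "drClusters": list(dr_clusters),
            "schedulingInterval": scheduling_interval,
        },
    }


policy = build_dr_policy(
    "odr-policy-5m",  # hypothetical policy name
    ["primary-odf-cluster", "secondary-odf-cluster"],
)
```

Building the manifest in code rather than editing YAML by hand makes it easy to assert on the cluster pair before anything is applied on the hub.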
Phase 4: Workload Deployment
8. Application Deployment with DR Protection
├── Deploy application (Subscription/ApplicationSet)
├── Create DRPC resource
│ ├── Specify drPolicyRef
│ ├── Set preferredCluster (primary)
│ └── Set placementRef
├── Wait for VRG creation
├── Wait for VolumeReplication resources
└── Verify initial replication
9. Verify DR Readiness
├── Check DRPC conditions:
│ ├── PeerReady: True
│ └── ClusterDataProtected: True
├── Verify mirroring status
└── Verify replication destinations
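The readiness check in step 9 amounts to reading `status.conditions` from the DRPC and confirming the listed condition types report `True`. A minimal sketch follows, assuming the conditions use the usual Kubernetes shape (a list of objects with `type` and `status` string fields); the function name is hypothetical.

```python
REQUIRED_CONDITIONS = ("PeerReady", "ClusterDataProtected")


def missing_dr_conditions(drpc_data, required=REQUIRED_CONDITIONS):
    """Return the required condition types that are not reported as "True"."""
    conditions = drpc_data.get("status", {}).get("conditions", [])
    true_types = {
        cond.get("type") for cond in conditions if cond.get("status") == "True"
    }
    return [name for name in required if name not in true_types]


drpc_sample = {
    "status": {
        "conditions": [
            {"type": "PeerReady", "status": "True"},
            {"type": "ClusterDataProtected", "status": "False"},
        ]
    }
}
not_ready = missing_dr_conditions(drpc_sample)
```

Returning the list of missing conditions, rather than a bare boolean, gives a test failure message that names exactly what the DRPC is still waiting on.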
Deployment Class Hierarchy
Deployment (base class)
├── do_deploy_rdr()
│ └── Calls get_multicluster_dr_deployment()
│
└── get_rdr_conf()
    └── Returns DR configuration dict
MultiClusterDROperatorsDeploy (base DR class)
├── deploy_dr_multicluster_orchestrator()
├── configure_mirror_peer()
├── deploy_dr_policy()
└── enable_acm_observability()
RDRMultiClusterDROperatorsDeploy (RDR-specific)
└── deploy()
    ├── Deploy orchestrator on all ACM hubs
    ├── Enable MCO console plugin
    ├── Create ServiceExporter (4.19+)
    ├── Configure MirrorPeer
    ├── Deploy RBD DR operations
    ├── Enable ACM observability
    ├── Deploy DRPolicy
    ├── Validate StorageClusterPeer (4.19+)
    └── Configure backup (if needed)
Key Deployment Methods
Deployment.do_deploy_rdr()
Location: ocs_ci/deployment/deployment.py:739
def do_deploy_rdr(self):
    """Call Regional DR deploy"""
    if config.ENV_DATA.get("skip_dr_deployment", False):
        return
    if config.multicluster:
        dr_conf = self.get_rdr_conf()
        deploy_dr = get_multicluster_dr_deployment()(dr_conf)
        deploy_dr.deploy()
RDRMultiClusterDROperatorsDeploy.deploy()
Location: ocs_ci/deployment/deployment.py:4115
Main deployment orchestration for RDR setup.
MultiClusterDROperatorsDeploy.configure_mirror_peer()
Location: ocs_ci/deployment/deployment.py:3401
Creates and validates MirrorPeer resource.
MultiClusterDROperatorsDeploy.deploy_dr_policy()
Location: ocs_ci/deployment/deployment.py:3583
Creates DRPolicy with cluster relationships.
Important Design Pieces
1. Asynchronous Replication
Scheduling Interval: Defines RPO (Recovery Point Objective)
Default: 5 minutes
IBM Cloud Managed: 10 minutes
Configurable via DRPolicy
Replication Flow:
sequenceDiagram
participant App as Application
participant PVC as Primary PVC
participant RBD as RBD Mirror Daemon
participant Snap as Snapshot
participant SecPVC as Secondary PVC
App->>PVC: Write data
PVC->>RBD: Capture changes
Note over RBD: Continuous monitoring
loop Every Scheduling Interval
RBD->>Snap: Create snapshot
Snap->>SecPVC: Replicate snapshot
SecPVC->>SecPVC: Apply changes
end
Note over PVC,SecPVC: Async Replication (5-10 min RPO)
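Because the scheduling interval sets the RPO, tests that wait for a sync usually need it in seconds. The sketch below converts an interval string such as `"5m"` or `"10m"`. The accepted units (minutes, hours, days) and the function name are assumptions for illustration; check the DRPolicy schema in use before relying on them.

```python
import re

# Assumed unit suffixes; only "m" appears in this document.
_UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def scheduling_interval_seconds(interval):
    """Convert an interval string such as "5m" into seconds."""
    match = re.fullmatch(r"(\d+)([mhd])", interval)
    if match is None:
        raise ValueError(f"unsupported scheduling interval: {interval!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


default_rpo = scheduling_interval_seconds("5m")
ibm_cloud_rpo = scheduling_interval_seconds("10m")
```

A wait of two intervals is a common choice when a test needs at least one full snapshot cycle to complete, since a write can land just after a snapshot was taken.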
2. Failover vs Relocate
Failover (Disaster Scenario)
Trigger: Primary cluster unavailable
Action: spec.action: Failover
Target: spec.failoverCluster
Process:
Detect primary cluster failure
Update DRPC with failover action
Promote secondary VRG to primary
Start application on secondary
Delete resources from primary (when available)
Relocate (Planned Migration)
Trigger: Planned move to another cluster
Action: spec.action: Relocate
Target: spec.preferredCluster
Process:
Ensure both clusters healthy
Update DRPC with relocate action
Quiesce application on current primary
Ensure final sync complete
Promote new primary VRG
Start application on new primary
Demote old primary VRG to secondary
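The two operations differ only in the action value and in which spec field names the target cluster. A minimal sketch of building the corresponding DRPC patch body is shown below; the helper name is hypothetical, and the field names are taken from the Action/Target entries above.

```python
import json


def drpc_action_patch(action, target_cluster):
    """Return a merge-patch body that requests a failover or a relocate."""
    target_field = {
        "Failover": "failoverCluster",
        "Relocate": "preferredCluster",
    }
    if action not in target_field:
        raise ValueError(f"unknown DR action: {action!r}")
    return {"spec": {"action": action, target_field[action]: target_cluster}}


failover_patch = drpc_action_patch("Failover", "secondary-odf-cluster")
relocate_patch = drpc_action_patch("Relocate", "primary-odf-cluster")
failover_json = json.dumps(failover_patch)  # body for a merge patch
```

The JSON string could then be supplied to something like `oc patch drpc <name> -n <namespace> --type merge -p '<json>'` on the hub; how ocs-ci itself applies the change is not covered by this sketch.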
3. VolumeReplicationGroup (VRG)
Purpose: Groups PVCs for coordinated replication
States:
Primary: Active cluster with read/write access
Secondary: Standby cluster receiving replicated data
Key Responsibilities:
Manage VolumeReplication resources
Coordinate snapshots
Handle promotion/demotion
Manage PVC protection
4. Consistency Groups (4.21+)
Purpose: Ensure crash-consistent snapshots across multiple PVCs
Configuration:
# Enabled by default in RDR mode for 4.21+
cg_enabled = config.ENV_DATA.get("cg_enabled", True)
Benefits:
Crash-consistent recovery points across all PVCs of an application
Coordinated snapshot timing
Reduced RPO for multi-PVC applications
5. OSD Deployment Modes
Greenfield (4.14-4.17)
metadata:
  annotations:
    ocs.openshift.io/clusterIsDisasterRecoveryTarget: "true"
OSDs deployed with bluestore-rdr store type
Optimized for DR workloads
Automatic configuration
Brownfield
Existing OSD deployments
Standard bluestore
Manual DR configuration
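The difference between the two modes can be expressed as a small mapping from the configured OSD mode to the StorageCluster annotations it implies. This sketch only restates the description above; the function name is hypothetical, and whether ocs-ci applies the annotation in exactly this way is an assumption.

```python
DR_TARGET_ANNOTATION = "ocs.openshift.io/clusterIsDisasterRecoveryTarget"


def storagecluster_annotations(osd_mode):
    """Return the StorageCluster annotations implied by the OSD deployment mode."""
    if osd_mode == "greenfield":
        # New deployments opt in to the DR-optimized store via the annotation.
        return {DR_TARGET_ANNOTATION: "true"}
    if osd_mode == "brownfield":
        # Existing deployments keep standard bluestore; nothing is added.
        return {}
    raise ValueError(f"unknown OSD deployment mode: {osd_mode!r}")


greenfield_annotations = storagecluster_annotations("greenfield")
brownfield_annotations = storagecluster_annotations("brownfield")
```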
6. Hub Recovery and Backup
Backup Components:
ACM resources
DR policies
Cluster configurations
Backup Schedule:
Resource: schedule-acm
Namespace: ACM namespace
Policy: backup-restore-enabled
Recovery Process:
configure_rdr_hub_recovery()
├── Create backup schedule
├── Validate DPA (Data Protection Application)
└── Verify policy compliance
Configuration and Constants
Key Constants
Mode and Policy
RDR_MODE = "regional-dr"
RDR_REPLICATION_POLICY = "async"
RDR_DR_POLICY_IBM_CLOUD_MANAGED = "odr-policy-10m"
OSD Deployment
RDR_OSD_MODE_GREENFIELD = "greenfield"
RDR_OSD_MODE_BROWNFIELD = "brownfield"
Storage Classes
RDR_VOLSYNC_CEPHFILESYSTEM_SC = "ocs-storagecluster-cephfs-vrg"
RDR_CUSTOM_RBD_POOL = "rdr-test-storage-pool"
RDR_CUSTOM_RBD_STORAGECLASS = "rbd-cnv-custom-sc"
Namespaces
DR_DEFAULT_NAMESPACE = "openshift-dr-system"
DR_OPS_NAMESPACE = "openshift-dr-ops" # For discovered apps
Labels
TOKEN_EXCHANGE_AGENT_LABEL = "app=token-exchange-agent"
RBD_MIRROR_APP_LABEL = "app=rook-ceph-rbd-mirror"
RAMEN_DR_CLUSTER_OPERATOR_APP_LABEL = "app=ramen-dr-cluster"
RDR_VM_PROTECTION_LABEL = "ramendr.openshift.io/k8s-resource-selector"
Templates
MIRROR_PEER_RDR = "ocs_ci/templates/multicluster/mirror_peer_rdr.yaml"
DR_POLICY_YAML = "ocs_ci/templates/multicluster/dr_policy_hub.yaml"
Cluster Roles
RDR_ROLES = ["ActiveACM", "PrimaryODF", "SecondaryODF"]
# Optional for dual-hub scenarios
# RDR_ROLES.append("PassiveACM")
Upgrade Order
RDR has a specific upgrade sequence:
UPGRADE_TEST_ORDER = {
    ORDER_OCP_UPGRADE: 1,  # OCP upgrade
    ORDER_OCS_UPGRADE: 2,  # ODF upgrade
    ORDER_MCO_UPGRADE: 3,  # Multicluster Orchestrator
    ORDER_DR_HUB_UPGRADE: 4,  # DR Hub operator
    ORDER_ACM_UPGRADE: 5,  # ACM upgrade
}
Upgrade Sequence:
ACM Hub OCP upgrade
Primary managed cluster OCP upgrade
Secondary managed cluster OCP upgrade
Primary ODF upgrade
Secondary ODF upgrade
ACM MCO operator upgrade
ACM DR Hub operator upgrade
Primary/Secondary DR cluster operator upgrade (automatic)
ACM upgrade (if selected)
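The ordering dictionary can drive the sequence directly: given whichever upgrade steps were selected, sorting by their rank yields the order in which to run them. The sketch below uses placeholder string values for the `ORDER_*` constants, since their real values are not shown in this document.

```python
# Placeholder values; the real constants live in the ocs-ci code base.
ORDER_OCP_UPGRADE = "ocp_upgrade"
ORDER_OCS_UPGRADE = "ocs_upgrade"
ORDER_MCO_UPGRADE = "mco_upgrade"
ORDER_DR_HUB_UPGRADE = "dr_hub_upgrade"
ORDER_ACM_UPGRADE = "acm_upgrade"

UPGRADE_TEST_ORDER = {
    ORDER_OCP_UPGRADE: 1,
    ORDER_OCS_UPGRADE: 2,
    ORDER_MCO_UPGRADE: 3,
    ORDER_DR_HUB_UPGRADE: 4,
    ORDER_ACM_UPGRADE: 5,
}


def sort_upgrade_steps(selected_steps):
    """Return the selected steps in the order they should be executed."""
    return sorted(selected_steps, key=lambda step: UPGRADE_TEST_ORDER[step])


# Steps may be selected in any order; ACM is optional and omitted here.
plan = sort_upgrade_steps(
    [ORDER_DR_HUB_UPGRADE, ORDER_OCS_UPGRADE, ORDER_OCP_UPGRADE, ORDER_MCO_UPGRADE]
)
```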
Configuration Parameters
# DR configuration dictionary (each flag is a boolean)
dr_conf = {
    "rbd_dr_scenario": True,  # Enable RBD DR (True or False)
    "cephfs_dr_scenario": False,  # Enable CephFS DR (True or False)
}
# ENV_DATA settings
ENV_DATA = {
    "skip_dr_deployment": False,
    "rdr_osd_deployment_mode": "greenfield",
    "cg_enabled": True,
    "submariner_source": "upstream",
    "configure_acm_to_import_mce": False,
}
# Multicluster configuration
MULTICLUSTER = {
    "multicluster_mode": "regional-dr",
    "dr_cluster_relations": [
        ["primary-cluster", "secondary-cluster"]
    ],
}
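A malformed `dr_cluster_relations` entry is cheaper to catch before deployment than after. The sketch below is a hypothetical pre-flight check, not part of ocs-ci: it assumes each relation is a pair of two distinct cluster names, as in the example above, and returns a list of problems instead of raising so that all issues are reported at once.

```python
def validate_dr_relations(multicluster_conf):
    """Return a list of problems found in dr_cluster_relations (empty if none)."""
    problems = []
    relations = multicluster_conf.get("dr_cluster_relations", [])
    if not relations:
        problems.append("no dr_cluster_relations defined")
    for index, relation in enumerate(relations):
        if len(relation) != 2:
            problems.append(
                f"relation {index}: expected 2 clusters, got {len(relation)}"
            )
        elif relation[0] == relation[1]:
            problems.append(f"relation {index}: cluster paired with itself")
    return problems


good_conf = {
    "multicluster_mode": "regional-dr",
    "dr_cluster_relations": [["primary-cluster", "secondary-cluster"]],
}
bad_conf = {"dr_cluster_relations": [["primary-cluster", "primary-cluster"]]}

good_problems = validate_dr_relations(good_conf)
bad_problems = validate_dr_relations(bad_conf)
```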
Testing and Validation
Running RDR Deployment and Tests
Deployment Command
To deploy RDR infrastructure across three clusters (ACM Hub, Primary ODF, Secondary ODF), use the following run-ci command:
run-ci \
multicluster 3 tests/ \
-m deployment \
--deploy \
--ocsci-conf conf/ocsci/multicluster_mode_rdr.yaml \
--color=yes \
--squad-analysis \
--cluster1 \
--cluster-name acm-hub-cluster \
--cluster-path /home/user/clusters/acm-hub-cluster/openshift-cluster-dir \
--ocp-version 4.17 \
--ocs-version 4.17 \
--osd-size 512 \
--ocsci-conf conf/deployment/aws/ipi_3az_rhcos_compactmode_3m_0w.yaml \
--ocsci-conf conf/ocsci/multicluster_active_acm_cluster.yaml \
--ocsci-conf conf/ocsci/submariner_downstream.yaml \
--ocsci-conf conf/ocsci/multicluster_dr_rbd.yaml \
--cluster2 \
--cluster-name primary-odf-cluster \
--cluster-path /home/user/clusters/primary-odf-cluster/openshift-cluster-dir \
--ocp-version 4.17 \
--ocs-version 4.17 \
--osd-size 512 \
--ocsci-conf conf/deployment/aws/ipi_3az_rhcos_3m_3w.yaml \
--ocsci-conf conf/ocsci/multicluster_primary_cluster.yaml \
--ocsci-conf conf/ocsci/multicluster_dr_rbd.yaml \
--ocsci-conf conf/ocsci/submariner_downstream.yaml \
--cluster3 \
--cluster-name secondary-odf-cluster \
--cluster-path /home/user/clusters/secondary-odf-cluster/openshift-cluster-dir \
--ocp-version 4.17 \
--ocs-version 4.17 \
--osd-size 512 \
--ocsci-conf conf/deployment/aws/ipi_3az_rhcos_3m_3w.yaml \
--ocsci-conf conf/ocsci/multicluster_dr_rbd.yaml \
--ocsci-conf conf/ocsci/submariner_downstream.yaml
Command Breakdown:
multicluster 3: Deploy 3 clusters in multicluster mode
-m deployment --deploy: Run deployment marker and execute deployment
--ocsci-conf conf/ocsci/multicluster_mode_rdr.yaml: Enable RDR mode
--cluster1: ACM Hub cluster configuration (compact mode, 3 masters, 0 workers)
--cluster2: Primary ODF cluster configuration (3 masters, 3 workers)
--cluster3: Secondary ODF cluster configuration (3 masters, 3 workers)
--ocsci-conf conf/ocsci/multicluster_dr_rbd.yaml: Enable RBD DR scenario
--ocsci-conf conf/ocsci/submariner_downstream.yaml: Enable Submariner networking
Running RDR Tests
After deployment, run RDR tests with tier1 and rdr markers:
run-ci \
multicluster 3 \
-m "tier1 and rdr" \
--ocsci-conf conf/ocsci/multicluster_mode_rdr.yaml \
--color=yes \
--cluster1 \
--cluster-name acm-hub-cluster \
--cluster-path /home/user/clusters/acm-hub-cluster/openshift-cluster-dir \
--ocsci-conf conf/ocsci/multicluster_active_acm_cluster.yaml \
--cluster2 \
--cluster-name primary-odf-cluster \
--cluster-path /home/user/clusters/primary-odf-cluster/openshift-cluster-dir \
--ocsci-conf conf/ocsci/multicluster_primary_cluster.yaml \
--cluster3 \
--cluster-name secondary-odf-cluster \
--cluster-path /home/user/clusters/secondary-odf-cluster/openshift-cluster-dir
Test Command Options:
-m "tier1 and rdr": Run tests marked with both tier1 and rdr markers
Test path: tests/functional/disaster-recovery/regional-dr/ for all RDR tests
Specific test: Add test file and method name for targeted testing
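The per-cluster sections of the command are repetitive, so they are easy to mistype, for example by leaving a dangling line continuation. A minimal sketch of assembling the same argument list programmatically is shown below. It uses only flags that appear in the commands above; the helper names are hypothetical.

```python
import shlex


def cluster_args(position, name, path, confs):
    """Build the argument group for one cluster section."""
    args = [f"--cluster{position}", "--cluster-name", name, "--cluster-path", path]
    for conf in confs:
        args += ["--ocsci-conf", conf]
    return args


def run_ci_test_command(clusters, marker="tier1 and rdr"):
    """Build the full run-ci argument list for an RDR test run."""
    cmd = [
        "run-ci", "multicluster", str(len(clusters)),
        "-m", marker,
        "--ocsci-conf", "conf/ocsci/multicluster_mode_rdr.yaml",
    ]
    for position, (name, path, confs) in enumerate(clusters, start=1):
        cmd += cluster_args(position, name, path, confs)
    return cmd


base = "/home/user/clusters"
clusters = [
    ("acm-hub-cluster", f"{base}/acm-hub-cluster/openshift-cluster-dir",
     ["conf/ocsci/multicluster_active_acm_cluster.yaml"]),
    ("primary-odf-cluster", f"{base}/primary-odf-cluster/openshift-cluster-dir",
     ["conf/ocsci/multicluster_primary_cluster.yaml"]),
    ("secondary-odf-cluster", f"{base}/secondary-odf-cluster/openshift-cluster-dir",
     []),
]
cmd = run_ci_test_command(clusters)
command_line = shlex.join(cmd)  # shell-safe string, quoting the marker expression
```

Passing `cmd` as a list to `subprocess.run` avoids shell quoting issues entirely; `shlex.join` is only needed when the command has to be logged or pasted into a terminal.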
Test Categories
Failover Tests - test_failover.py
Primary cluster down scenarios
Primary cluster up scenarios
RBD and CephFS interfaces
Relocate Tests - test_relocate.py
Planned migration
Application continuity
Failover and Relocate - test_failover_and_relocate.py
Combined scenarios
CLI and UI testing
Discovered Apps - test_failover_and_relocate_discovered_apps.py
Non-GitOps applications
KubeObject protection
Recipe-based backup
Hub Recovery - test_neutral_hub_failure_and_recovery.py
Hub cluster failure
Backup and restore
Node Operations - test_node_operations_during_failover_relocate.py
Node failures during DR operations
Resilience testing
Test Markers
@rdr # Marks test as RDR-specific
@turquoise_squad # Squad ownership
@tier1 # Test tier
@acceptance # Acceptance test
Validation Helpers
Key validation functions in ocs_ci/helpers/dr_helpers.py:
get_current_primary_cluster_name(): Identify active cluster
get_current_secondary_cluster_name(): Identify standby cluster
wait_for_mirroring_status_ok(): Verify replication health
wait_for_all_resources_creation(): Verify workload deployment
wait_for_all_resources_deletion(): Verify cleanup
wait_for_replication_destinations_creation(): Verify secondary resources
verify_last_kubeobject_protection_time(): Validate backup timing
Troubleshooting
Common Issues
MirrorPeer not reaching ExchangedSecret
Check token-exchange-agent pods
Verify network connectivity
Check S3 secret configuration
DRPolicy not Validated
Verify both clusters are healthy
Check MirrorPeer status
Verify StorageCluster configuration
Replication not working
Check rbd-mirror pods
Verify VolumeReplication resources
Check mirroring status in Ceph
Failover stuck
Check DRPC conditions
Verify VRG state
Check for resource conflicts
Debug Commands
# Check DRPC status
oc get drpc -n <namespace> -o yaml
# Check VRG status
oc get vrg -n openshift-dr-ops -o yaml
# Check MirrorPeer
oc get mirrorpeer -o yaml
# Check DRPolicy
oc get drpolicy -o yaml
# Check replication status
oc get volumereplication -n <namespace>
# Check Ceph mirroring (run inside the Ceph toolbox pod)
rbd mirror pool status <pool-name>
References
Key Files
Deployment: ocs_ci/deployment/deployment.py
Multicluster Deployment: ocs_ci/deployment/multicluster_deployment.py
DR Helpers: ocs_ci/helpers/dr_helpers.py
Constants: ocs_ci/ocs/constants.py
DRPC Resource: ocs_ci/ocs/resources/drpc.py
ACM Integration: ocs_ci/ocs/acm/acm.py
Submariner: ocs_ci/deployment/acm.py
Documentation
Red Hat Advanced Cluster Management for Kubernetes
OpenShift Data Foundation Documentation
Ramen DR Operator Documentation
Submariner Documentation
Summary
RDR in ocs-ci provides a comprehensive framework for testing Regional Disaster Recovery scenarios in OpenShift Data Foundation. The architecture supports:
Asynchronous replication between geographically distributed clusters
Automated failover for disaster scenarios
Planned relocate for maintenance and optimization
Multiple workload types: Subscriptions, ApplicationSets, Discovered Apps
Storage flexibility: RBD and CephFS support
Consistency groups for multi-PVC applications
Hub recovery for ACM cluster failures
The deployment flow is fully automated through ocs-ci, enabling comprehensive testing of DR scenarios across different ODF versions, platforms, and configurations.