Good afternoon. My name is John Whitman. I'm a staff Systems Engineer for the Networking and Security Business Unit here at VMware. Today, I'm going to be talking about disaster recovery using NSX and SRM with automation, data center recovery, and what it's like today in 2018 for your software-defined data center. Now, a quick agenda: we're going to talk about disaster recovery and explain the challenges you might be facing in your current environment. We'll talk about NSX multi-site networking in detail, then understanding Site Recovery Manager, which is known as VMware SRM. We'll talk about NSX and SRM together and how to make DR successful. Then we'll talk about planned migrations, physical infrastructure maintenance without an outage, testing DR and NSX in a real-world environment, and then a summary with some final notes. So, let's talk about disaster recovery. A lot of people put disaster recovery into an individualized bucket. It means: "I've got my backups. I've got my replication, and I'm able to recover between two sites." Some other customers put disaster recovery into the bucket of "Well, I've got five 9s. I'm able to bring my environment up. I'm able to actually work with my data." But at the same time, what does it really mean to recover from an interruption? Now, some of the things we'll talk about are RTO and RPO, but there are other functions that come along with a successful disaster recovery plan and runbook as well. So, when you have a disaster, it's not just an infrastructure failure; it's recovering applications, getting that momentum back, and getting that infrastructure available to the end users and customers in your environment. Now, one thing to talk about is availability versus reliability. Just because your infrastructure is available doesn't mean that it's reliable. You might be able to bring up the environment and have it functional, but is it reliable?
Once you're on the DR site and you're running in your secondary data center, are you able to actually function and work in that environment and be productive, without lag and without infrastructure outages in your secondary site? Now, today, applications fail because of human error. You never find a data center that just crashes into a hole in the ground or where the entire grid fails; it's generally because of human errors, code that gets pushed out incorrectly, infrastructure outages and server issues inside of the data center. So, alleviating that human error in the disaster recovery environment is something that we want to do when you're recovering on the DR side. Now, most of you know what RTO and RPO mean. RTO is the recovery time objective: how long does it take for me to actually recover the environment and bring it back online when I have an outage? RPO is the recovery point objective: where did my data last stop being recorded? Is it five minutes? Is it 15 minutes? Or is it near real time? So, when I do come up on the secondary site, how much of that data have I lost? But a lot of people don't understand what ROC and WOC are. ROC is the rate of change. Now, just because you have real-time asynchronous replication between two sites, or you have reliable backups, doesn't necessarily mean that you have the bandwidth or physical infrastructure capability to actually record all of those changes. If you have a large environment with terabytes of data changing every day, and you're trying to push that over, let's say, a one gig or even a 10 gig pipe that's shared with other services, you might not be able to recover at the mandated business RPO because of the rate of change, the ROC. Now, WOC is write order consistency, and this is something that normally isn't taken into consideration with disaster recovery and business continuity solutions, because write order consistency means: okay, my data's there.
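To make the ROC point concrete, here's a quick back-of-the-envelope check. This is a hypothetical sketch, not any VMware tooling: given a daily rate of change and a replication link, can the link actually keep up?

```python
def replication_feasible(daily_change_tb, link_gbps, link_utilization=0.5):
    """Check whether a replication link can absorb a daily rate of change.

    daily_change_tb  -- terabytes of changed data per day (the ROC)
    link_gbps        -- raw link speed in gigabits per second
    link_utilization -- fraction of the link usable for replication
                        (a shared pipe rarely gives you 100%)
    Returns (feasible, required_gbps).
    """
    seconds_per_day = 24 * 60 * 60
    # 1 TB = 8 * 1000 gigabits (decimal units)
    change_gbits = daily_change_tb * 8 * 1000
    required_gbps = change_gbits / seconds_per_day
    available_gbps = link_gbps * link_utilization
    return required_gbps <= available_gbps, required_gbps

# 10 TB of daily change over a shared 1 Gbps pipe: the link cannot keep up,
# so the mandated RPO will slip no matter what the replication engine promises.
ok, need = replication_feasible(10, 1.0)
```

With 10 TB/day the sustained requirement is roughly 0.93 Gbps, more than half of a 1 Gbps link, which is exactly the scenario the talk warns about.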
It's available, but is it corrupt? Is it real, and can I actually go and access it and work in that database that just had a live failure? So, write order consistency is very important. It's usually handled at the storage and array level, but it really plays a primary factor in your disaster recovery plan and scenario. Now, let's talk about traditional challenges. In a traditional DR environment, you basically have to create an entire mirror of your environment. You have your physical infrastructure. You have your edge routers. You have your application, which might sprawl across multiple racks. And then you have the infrastructure components: the firewalls, the gateways, your security policies. Everything that makes that site run, stay available, and give high availability and reliability to that primary site, you now have to duplicate. The problem is, you don't always have the ability to build out and design an infrastructure that's just sitting somewhere idle, standing by. So being able to recover that application on dissimilar hardware or dissimilar networking infrastructure can be a key capability that allows you not only to diversify your disaster recovery site but also to have different types of design implementations for that secondary environment. Now, one of the key challenges today is having to re-IP that environment: having to actually relocate physical infrastructure, your L2 and L3 stack, between the two sites; being able to recreate security policies; being able to recreate all of the routing in your infrastructure environment; and on top of that, you might have load balancer, DNS and application IP dependencies that you don't necessarily know about or can't see when you're going through the testing of your DR runbook. Now, moving an application between two sites isn't always necessarily the difficult part.
Sometimes, it's bringing up the infrastructure, bringing up your MPLS backbone, or using OTV, which is a very complex technology, to bring that L2 stretch over without having to re-IP. There are other services and other pieces of hardware out there that allow you to do this, but it's complex, and now you're hardware-dependent; you're locked into that environment. It's expensive. It's complex. It's generally proprietary. It doesn't give you any type of flexibility, and it leaves you with a lack of automation in that environment. It's not a holistic solution and is really focused on networking on a per-device basis. Ten years ago, when you had a physical environment, this was fine, because you had that physical infrastructure to rely on. Today, with a mostly virtualized environment, having to be coupled to and rely on that physical infrastructure just doesn't give you the flexibility that's required in a software-defined data center. Now, traditional networking solutions include OTV over dark fiber, and MPLS or VPLS over some sort of carrier backbone, and these are hardware-based solutions that are complex and challenging to maintain. They're not holistic and really only focus on a specific part of your infrastructure. Now, networking with NSX decouples you from that. So, what's needed for a software-defined approach? You need to be able to decouple from the physical hardware. You want ease of use and ease of deployment. You want flexibility. You want hardware and infrastructure diversity, not being locked down to one specific vendor, and you want a high degree of automation so you can rapidly deploy and recover when needed in your environment. Also, having an extensive partner ecosystem gives you the diversity to not just have one infrastructure component fail and be recovered, but to have it all fail and be recovered at the same time. Now, let's talk about NSX and multi-site networking in detail. What is NSX multi-site?
So, first off, NSX is more than just an intelligent switching platform. It's switching, routing, load balancing, physical connectivity to your VLAN-backed infrastructure, portable firewalling with the DFW, site-to-site and end-user-to-site VPN connectivity, data security and activity monitoring. It's a suite of products you can deploy that allows you to decouple yourself from the physical infrastructure. Now, one thing about traditional firewalls is that they can only identify virtual machines or physical boxes through the MAC address or the IP address. Because NSX is a logical component inside a software-defined networking environment, tightly coupled with vSphere and ESXi components, we're able to identify virtual machines in the environment based off of security groups, resource pools, port groups, user identity, VM name, operating system and so on, as well as MAC address and IP address. That allows us to go in and create dynamic inclusion and exclusion lists, so you can identify your environment in any way necessary. Now, micro-segmentation: everybody talks about it, it's a buzzword, and I'm sure some of the previous videos have gone over it in detail, but how does it play into disaster recovery? When you have an environment that is complex and deployed with micro-segmentation, you need to be able to pick that up, transport it, forklift it between sites, or have the flexibility to isolate it between sites. So NSX micro-segmentation gives you policies that are aligned to logical groups, it prevents threats from spreading, and it gives you the ability to move and diversify your environment as you go between two sites. What's nice about this is that it gives you data hygiene: when a VM is deleted, it's automatically removed from its security group, the security policies that apply to it are removed, and now it's cleaned up. So, NSX Cross-VC.
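The dynamic inclusion idea can be sketched as a simple matcher. This is purely illustrative; real NSX security groups are defined through the NSX Manager, and the VM attributes and names here are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class VM:
    name: str
    os: str
    resource_pool: str

def members(vms, criteria):
    """Return the VMs matching every attribute in `criteria`,
    in the spirit of a dynamic-inclusion security group."""
    return [vm for vm in vms
            if all(getattr(vm, k) == v for k, v in criteria.items())]

inventory = [
    VM("web-01", "linux", "prod"),
    VM("db-01", "windows", "prod"),
    VM("web-02", "linux", "dev"),
]

# Security group: all Linux VMs in the prod resource pool.
web_prod = members(inventory, {"os": "linux", "resource_pool": "prod"})

# Data hygiene: deleting a VM from the inventory drops it from the group
# on the next evaluation, with no stale firewall rules left behind.
```

Because membership is evaluated against attributes rather than hard-coded MAC or IP addresses, the same policy follows the workload when it moves to a recovery site.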
This is a very interesting technology in that you can now span your data center and your application across multiple sites. Here in this environment, we have site one, site two and site three. Now, it's not just a play for disaster recovery; what if you want to do physical infrastructure maintenance? We'll talk about that more later on in this video, but as you can see here, workloads can now be moved with L2 adjacency between sites. You're literally able to take a VM from site one, move it to site two, move it to site three, and keep L2 adjacency between the sites. We're decoupling from the physical underlay, from the physical infrastructure, allowing you to not only move and migrate that virtual machine, but recover it on a second or third site with the same IP and the same environment it had surrounding it on the primary site. Now, let's take a look at NSX Cross-VC and the network architecture beneath it. In this example, we have three data centers and three vCenters. Each vCenter has its own cluster of hosts. We deploy an NSX Manager at each location. Take note that the NSX Manager and the vCenter have a one-to-one relationship, so you do need an NSX Manager deployed at each site. But notice here that the primary site has our primary NSX Manager, and the second and third sites have secondary NSX Managers; they're playing the role of secondaries that take orders and commands from the primary NSX Manager. We then deploy a universal control cluster. Now, this control cluster is stateless, so it can live across all the sites, but it's generally best practice to deploy the control cluster on the primary site. Remember, if the primary site fails, that's okay: we can promote one of the secondary NSX Managers to primary. It can take control of the environment and redeploy that universal control cluster, which we'll go into in detail later.
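That promotion behavior can be illustrated as a tiny conceptual model. This is not the NSX API; the class, site names and promotion rule here are invented purely to show the idea that a surviving secondary takes over the primary role:

```python
class NSXManager:
    """Conceptual stand-in for an NSX Manager and its role."""
    def __init__(self, site, role):
        self.site = site
        self.role = role  # "primary" or "secondary"

def promote(managers, failed_site):
    """If the site hosting the primary fails, promote the first healthy
    secondary to primary. That manager can then redeploy the universal
    control cluster and take control of the environment."""
    survivors = [m for m in managers if m.site != failed_site]
    if not any(m.role == "primary" for m in survivors):
        survivors[0].role = "primary"
    return survivors

managers = [NSXManager("site1", "primary"),
            NSXManager("site2", "secondary"),
            NSXManager("site3", "secondary")]

# Primary site goes down; site2's manager is promoted.
survivors = promote(managers, "site1")
```

The point is simply that the primary role is not tied to any one site: losing the primary site costs you nothing permanent, because any secondary holds the replicated state needed to take over.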
So, now that we've got our environment deployed and our universal control cluster deployed, each of the NSX Managers couples itself with this universal control cluster, allowing it to be synchronized across all three sites. Each site is now responsible for its own hosts, but it still synchronizes with the primary NSX Manager and its database. Across all three sites, each of the physical ESXi hosts communicates back to the universal control cluster to get its infrastructure commands, its control plane and its data plane intelligence. Now, understanding the Cross-VC security architecture: how does this work when we deploy DFW rules or multi-site security policies? Once again, we have our primary site with an NSX Manager and a vCenter Server, and we have our two secondary sites with their vCenter Servers and NSX Managers deployed. Each site is once again responsible for its own ESXi hosts. When a user creates a DFW rule, a universal distributed firewall rule, it is pushed down to the primary NSX Manager, stored in the local database, and then replicated via the universal sync process to each of the secondary NSX Managers. Today, this is scalable up to eight NSX Managers, so you're not just limited to three or four sites. Once that DFW rule is replicated and pushed over to the secondary sites, it is also stored in each local NSX Manager's database. So if the primary NSX Manager were to fail, you still have a local repository of all of the universal DFW rules that have been deployed in your environment. Once that rule is deployed and available in each environment, it is then pushed down to the individual ESXi hosts managed at each site. We do this for reliability, redundancy and scalability, so you're not dependent on connectivity to the primary site. Now, let's talk about Site Recovery Manager, which is VMware SRM. So, what is SRM?
Site Recovery Manager is an automation and orchestration tool that does the recovery for you. It is not responsible for the replication, but it can broker the replication. So, let's look at the individual components of the physical architecture of SRM. Once you deploy the SRM server, you have a vSphere Web Client with a plug-in; you attach it to your SSO or your PSC, and then it attaches itself to your vSphere environment. Now, you can have a vSphere Replication appliance do the replication, or you can do hardware-based replication. You get an SRM plug-in to the vSphere Web Client, and each site has its own SRM server. Now, looking at the logical architecture: what is SRM behind the scenes? Once again, we have an SRM server for each vCenter Server that's deployed, and then you have site pairing and resource mapping between the two sites: networks, folders, storage policies, placeholder datastores, resource pools. Those are mapped between the two sites, so if you have a failure on one site, SRM automatically knows what network to attach the virtual machines to, what datastore to put them in, and what cluster and even what host to put them on. Now, a protection group is what you're protecting: it's the group of VMs that you want to categorize together and protect. A recovery plan is how you're going to recover them: in what order and with what dependencies you bring them online on the secondary site. Now, SRM has several deployment models. You can do a one-to-one pairing, where site A protects site B. You can go in a circular fashion, where site A protects site B, site B protects site C, and site C protects back to site A. And you can also have a primary or main data center where multiple remote offices fail over to that primary data center. So remote offices A, B and C can all fail over to a single vCenter with multiple SRM instances in a primary data center. It's simple, it's reliable, it's easy to deploy.
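The resource mappings just described are essentially lookup tables from protected-site objects to recovery-site objects. Here's a minimal sketch of that idea; the network, datastore and pool names are invented, and real mappings are configured in the SRM UI or API, not in code like this:

```python
# SRM-style inventory mappings between protected and recovery site
# (all object names are hypothetical).
mappings = {
    "networks":       {"PROD-Web-VLAN10": "DR-Web-VLAN10"},
    "datastores":     {"prod-ds-01": "dr-ds-01"},
    "resource_pools": {"prod-pool": "dr-pool"},
}

def recovery_placement(vm_network, vm_datastore, vm_pool):
    """Given a protected VM's placement, look up where the mappings
    would land it on the recovery site during a failover."""
    return (mappings["networks"][vm_network],
            mappings["datastores"][vm_datastore],
            mappings["resource_pools"][vm_pool])

# A web VM on the protected site gets a deterministic home on the DR site.
placement = recovery_placement("PROD-Web-VLAN10", "prod-ds-01", "prod-pool")
```

Because the mapping is defined once per site pair, every VM in every protection group inherits its recovery-site placement automatically instead of being placed by hand during a disaster.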
Each VM is only protected and replicated once, and SRM utilizes enhanced linked mode with vCenter. So when you sign on and log in to the vSphere Web Client, you're able to see all your sites universally; that way, you're really managing multiple environments from a single pane of glass. Now, protection groups, explained a little bit more, mean that I want to be able to protect my web app, my email app and my SharePoint tiers in individual groups, and I want to be able to identify them during a disaster recovery event. Recovery plans are how I'm going to recover them. Remember, a protection group can be a member of multiple recovery plans. Recovery plan one can be just a single tier or a single group of VMs; recovery plan two can be multiple tiers or multiple protection groups; and recovery plan three could be the entire data center, if you wanted to recover everything all at once and forklift the entire site. Now, priorities and dependencies. This is very important, because applications can't just be brought online by pushing the power buttons. Inside of SRM, you're able to create priorities and dependencies within a priority group, which allows you to bring up one priority group while everything else is delayed until that group is online. And then inside of that priority group, you can say I want app server one to come online; once app server one is online, we'll bring app server two online; and then we'll keep rolling down through priority groups three, four, five and six. Now, there are several supported failover scenarios: you can have a partial failover, a stretched deployment or a full failover. A partial failover means that we have an application running and we just want to move, or recover, a couple of VMs between the two sites.
A stretched deployment means that I'm going to leave a database or a primary application component on the primary site while I'm recovering just minimal or partial components to the secondary site. And then a full failover is where you've completely brought down the primary site and you're bringing up the secondary site with all of the components necessary and all of the application tiers required to bring that application back online and serve your client base or customers.
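The priority-group and dependency ordering described earlier can be sketched as a simple simulation. This is illustrative only; real recovery plans are defined in SRM itself, and the VM names here are made up:

```python
def power_on_order(priority_groups):
    """Flatten a recovery plan's priority groups into a startup sequence.

    Semantics mirrored from the talk: the next priority group is delayed
    until the whole previous group is online, and within a group each VM
    waits for the one before it (dependency chaining).
    """
    sequence = []
    for group in priority_groups:
        for vm in group:          # in-group order encodes dependencies
            sequence.append(vm)   # e.g. app-server-2 waits for app-server-1
    return sequence

plan = [
    ["app-server-1", "app-server-2"],  # priority group 1
    ["web-01", "web-02"],              # priority group 2: waits for group 1
    ["monitoring-01"],                 # priority group 3: waits for group 2
]

order = power_on_order(plan)
```

Running this yields the startup sequence app-server-1, app-server-2, web-01, web-02, monitoring-01, which is exactly the behavior you want from a runbook: the database-style dependencies come up first, and nothing downstream powers on until its prerequisites are online.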