Azure Confidential VM benchmarks perform higher than same size non-confidential VM instance - azure

This is my first attempt at using AMD SEV, and before deploying any applications I wanted to verify the expected performance overhead compared to a VM without AMD SEV. I am trying to replicate these results, which found Confidential VMs to be 2-8% slower than their non-confidential counterparts. However, in my case, the Confidential VM consistently performs better than the Standard VM.
I deployed 2 Azure VMs, one D4asv5 (Security: Standard) and one DC4asv5 (Security: Confidential VM), both in US West, with the rest of the configuration exactly the same.
I ran CPU and file I/O benchmarks with sysbench (same parameters for both VMs) and CoreMark 666 benchmarks, as in the article above. In all cases, the Standard VM was 4-9% slower than the Confidential VM.
I have double-checked that the VMs are configured as intended, so I assume this must be an issue with the Standard VM performance. I redeployed the VM to force host migration, and ran Performance Diagnostics that didn't identify any issues. The benchmarks ran over a few-days period during which I reallocated the VMs multiple times.
Is there anything else that I am missing, that could justify this performance difference?
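For reference, the comparison described above can be sketched as follows. This is a minimal sketch, assuming each VM produced several scores for the same benchmark and that higher scores are better. The function names and the sample numbers are hypothetical placeholders, not measurements from these VMs, and the noise check is a rough heuristic rather than a proper significance test.

```python
from statistics import mean, stdev


def relative_difference(baseline, candidate):
    """Percent by which the candidate's mean score differs from the baseline's."""
    base = mean(baseline)
    return (mean(candidate) - base) / base * 100.0


def exceeds_noise(baseline, candidate):
    """Rough check: is the gap between means larger than the combined spread?"""
    gap = abs(mean(candidate) - mean(baseline))
    return gap > stdev(baseline) + stdev(candidate)


# Placeholder scores only (assumption: higher is better).
standard_runs = [100.0, 102.0, 98.0]
confidential_runs = [105.0, 107.0, 103.0]

print(relative_difference(standard_runs, confidential_runs))
print(exceeds_noise(standard_runs, confidential_runs))
```

If the gap does not exceed the run-to-run spread, the difference may simply be noise between hosts rather than an effect of the security type.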

Related

How to improve download speed on Azure VMs?

My organization is spinning down its private cloud and preparing to deploy a complete analytics and data warehousing solution on Azure VMs. Naturally, we are doing performance testing so we can identify and address any unforeseen issues before fully decommissioning our datacenter servers.
So far, the only significant issue I've noticed in Azure VMs is that my download speeds don't seem to change at all no matter which VM size I test. The speed is always right around 60 Mbps downstream. This is despite guidance such as this which indicates I should see improvements in ingress based on VM size.
I have done significant and painstaking research into the issue, but everything I've read so far only really addresses "intra-cloud" optimizations (e.g., ExpressRoute, Accelerated Networking, VNet peering) or communications to specific resources. My requirement is for fast download speeds on the open internet (secure and trusted sites only). To preempt any attempts to question this requirement, I will not specifically address the reasons why I can't consider alternatives such as establishing private/dedicated connections, but suffice it to say, I have no choice but to rule out those options.
That said, I'm open to any other advice! What, if anything, can I do to improve download speeds to Azure VMs when data needs to be transferred to the VM from the open web?
Edit: Corrected comment about Azure guidance.
I finally figured out what was going on. #evilSnobu was onto something -- there was something flawed about my tests. Namely, all of them (seemingly by sheer coincidence) "throttled" my data transfers. I was able to confirm this by examining the network throughput carefully. Since my private cloud environment never provisioned enough bandwidth to hit the 50-60 Mbps ceiling that seems to be fairly common among certain hosts, it didn't occur to me that I could eventually be throttled at a higher throughput rate. Real bummer. What this experiment did teach me, is that you should NOT assume more bandwidth will solve all your throughput problems. Throttling appears to be exceptionally common, and I would suggest planning for what to do when you encounter it.
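A rough way to check for this kind of ceiling is to convert each transfer into Mbps and see whether the results stay flat across VM sizes. The sketch below assumes you have already recorded bytes transferred and elapsed seconds per test. The function names and the 5% tolerance are hypothetical choices for illustration, not part of any Azure tooling.

```python
def throughput_mbps(num_bytes, seconds):
    """Convert a transfer of num_bytes over seconds into megabits per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_bytes * 8 / seconds / 1_000_000


def is_flat_ceiling(samples_mbps, tolerance=0.05):
    """True if every sample sits within `tolerance` (as a fraction) of the peak.

    A flat result across different VM sizes suggests the limit is somewhere
    other than the VM, for example throttling on the remote host.
    """
    peak = max(samples_mbps)
    return all(abs(s - peak) <= tolerance * peak for s in samples_mbps)


# Placeholder readings, one per VM size tested.
readings = [59.8, 60.1, 60.0]
print(is_flat_ceiling(readings))
```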

LoadRunner testing over different domains

I have been asked to use LoadRunner to do load testing. While my LR servers are all physical servers, I will need to test a system that is not only on VMs, but that I will also need to access through a VDI, AND the system under test is in a completely different secure domain (different OUs). This makes me believe there is going to be a large disparity and skewed performance results with all of the tokens and authentication that will have to happen. How can I measure this gap, if at all?
Are you trying to assess performance for the application or the VDI terminal interface? This is an important consideration as to your path.
For your load generators, it does not matter whether your target AUT is on a VM or on physical hardware. It does impact your monitoring strategy, since you should collect monitoring stats from the target AUT's hypervisor rather than from the VM guest OS.
You can have load generators and monitors behind a firewall. Take a look at the documentation for a path on this.

What does the Azure Web Apps architecture look like?

I've had a few outages of 10 to 15 minutes, because apparently Microsoft had a 'blip' on their storages. They told me that it is because of a shared file system between the instances (making it a single point of failure?)
I didn't understand it and asked how file share is involved, because I would assume a really dumb stateless IIS app that communicates with SQL Azure for its data.
I would assume the situation below:
This is their reply to my question (I didn't include the drawing)
The file shares are not necessarily for your web app to communicate to
another resources but they are on our end where the app content
resides on. That is what we meant when we suggested that about storage
being unavailable on our file servers. The reason the restarts would
be triggered for your app that is on both the instances is because the
resources are shared, the underlying storage would be the same for
both the instances. That’s the reason if it goes down on one, the
other would also follow eventually. If you really want the
availability of the app to be improved, you can always use a traffic
manager. However, there is no guarantee that even with traffic manager
in place, the app doesn’t go down but it improves overall availability
of your app. Also we have recently rolled out an update to production
that should take care of restarts caused by storage blips ideally, but
for this feature to be kicked it you need to make sure that there is
ample amount of memory needs to be available in the cases where this
feature needs to kick in. We have couple of options that you can have
set up in order to avoid any unexpected restarts of the app because of
a storage blip on our end:
You can evaluate if you want to move to a bigger instance so that
we might have enough memory for the overlap recycling feature to be
kicked in.
If you don’t want to move to a bigger instance, you can always use
local cache feature as outlined by us in our earlier email.
Because of the time differences the communication takes ages. Can anyone tell me what is wrong in my thinking?
The only thing that I think of is that when you've enabled two instances, they run on the same physical server. But that makes really little sense to me.
I have two instances one core, 1.75 GB memory.
My presumption for App Service Plans was that they were automatically split into availability sets (see below for a brief description). This was largely based on the Web Apps sales spiel, which states:
App Service provides availability and automatic scale on a global data centre infrastructure. Easily scale applications up or down on demand, and get high availability within and across different geographical regions.
Following on from David Ebbo's answer and comments, the underlying architecture of Web Apps appears to be that the VMs themselves are separated into availability sets. However, all of the instances use the same file server to share the underlying disk space, which makes this file server a significant single point of failure.
To mitigate this, Azure has created the WEBSITE_LOCAL_CACHE_OPTION setting, which caches the contents of the file server on the individual Web App instances. In other words, caching is being used in lieu of solid, high-availability engineering principles.
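As a quick sanity check from inside the app, you can look at whether the setting is visible to the process. This is a minimal sketch, assuming App Service exposes app settings to the app as environment variables and that the value "Always" is what turns local cache on. Verify both against the current Azure documentation. The function name is hypothetical.

```python
import os


def local_cache_enabled(env=None):
    """Return True if WEBSITE_LOCAL_CACHE_OPTION is set to 'Always'.

    Assumption: 'Always' is the value that enables local cache.
    `env` defaults to the process environment; pass a dict for testing.
    """
    env = os.environ if env is None else env
    value = env.get("WEBSITE_LOCAL_CACHE_OPTION", "")
    return value.strip().lower() == "always"


print(local_cache_enabled())
```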
The problem here is that as a customer we have no visibility into this issue, we've no idea if there is a plan to fix it, or if or when it will ever be fixed since it seems unlikely that Azure is going to issue a document that admits to how badly this has been engineered, even if it is to say that it is fixed.
I also can't imagine that this issue would be any different between ASM and ARM. It seems exceptionally unlikely that there was originally a high availability solution at the backend that they scrapped when ARM came along. So it is very likely that cloud services would suffer the exact same issue.
The small upside is that now that we know this is an issue, one possible solution would be to deploy multiple web apps and have a traffic manager between them. Even if they are in the same region, different apps should have different backend file servers.
My first action would be to reply to that email, with a link to the Web Apps page, (and this question) with a copy of the quote and ask how to enable high availability within a geographic region.
After that you'll likely need to rearchitect your solution!
Availability sets
For virtual machines Azure will let you specify an availability set. An availability set will automatically split VMs into separate update and fault domains. Meaning that servers will end up in different server racks, and those server racks won't get updates at the same time. (it is a little more complex than that, but that's the basics!)
Azure Web Apps do use shared file storage. The best way to think about it is that all the instances of your app map to the same network share that has your files. So if you modify the files by any means (e.g. FTP, msdeploy, git, ...), all the instances instantly get the new files (since there is only one set of files).
And to answer your final question, each instance does run on a separate VM.

HDInsight vs. Virtualized Hadoop Cluster on Azure

I'm investigating two alternatives for using a Hadoop cluster. The first is using HDInsight (with either Blob or HDFS storage), and the second is deploying a powerful Windows Server on Microsoft Azure and running HDP (Hortonworks Data Platform) on it (using virtualization). The second alternative gives me more flexibility; however, what I'm interested in is the overhead of each alternative. Any ideas on that? In particular, how does Blob storage affect efficiency?
This is a pretty broad question, so an answer of "it depends," is appropriate here. When I talk with customers, this is how I see them making the tradeoff. It's a spectrum of control at one end, and convenience on the other. Do you have specific requirements on which Linux distro or Hadoop distro you deploy? Then you will want to go with IaaS and simply deploy there. That's great, you get a lot of control, but patching and operations are still your responsibility.
We refer to HDInsight as a managed service, and what we mean by that is that we take care of running it for you (eg, there is an SLA we provide on the cluster itself, and the apps running on it, not just "can I ping the vm"). We operate that cluster, patch the OS, patch Hadoop, etc. So, lots of convenience there, but, we don't let you choose which Linux distro or allow you to have an arbitrary set of Hadoop bits there.
From a perf perspective, HDInsight can deploy on any Azure node size, similar to IaaS VMs (this is a new feature launched this week). On the question of Blob efficiency, you should try both out and see what you think. The nice part about Blob store is that you get more economic flexibility: you can deploy a small cluster on a massive volume of data if that cluster only needs to run on a small chunk of data (as compared to putting it all in HDFS, where you need all of the nodes running all of the time to fit all of your data).

Azure changing hardware

I have a product which uses CPU ID, network MAC, and disk volume serial numbers for validation. Basically when my product is first installed these values are recorded and then when the app is loaded up, these current values are compared against the old ones.
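The check described above amounts to something like the following. This is a minimal sketch, assuming the recorded and current values are already available as dictionaries. The function name, field names and values are hypothetical, and how the identifiers are actually read from the system is left out.

```python
def changed_fields(recorded, current):
    """Return the sorted names of identifiers whose current value differs.

    A key missing from `current` counts as changed.
    """
    return sorted(
        name for name, value in recorded.items()
        if current.get(name) != value
    )


# Placeholder identifiers, not real hardware values.
recorded = {"cpu_id": "AAAA", "mac": "00:11:22:33:44:55", "volume_serial": "1234-ABCD"}
current = {"cpu_id": "AAAA", "mac": "66:77:88:99:AA:BB", "volume_serial": "1234-ABCD"}

print(changed_fields(recorded, current))
```

Logging which fields changed, rather than only failing to load, would make an incident like the one below much easier to diagnose after the fact.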
Something very mysterious happened recently. Inside of an Azure VM that had not been restarted in weeks, my app failed to load because some of these values were different. Unfortunately the person who caught the error deleted the VM before it was brought to my attention.
My question is, when an Azure VM is running, what hardware resources may change? Is that even possible?
Thanks!
Answering this requires a short rundown of how Azure works.
In each data centre there are thousands of individual machines. Each machine runs a hypervisor which allows a number of operating systems to share the same underlying hardware.
When you start a role, Azure looks for available resources (disk space, CPU, RAM, etc.) and boots up a copy of the appropriate OS VM on those available resources. I understand from your question that this is a VM role, so this VM is the one you uploaded or created.
As long as your VM is running, the underlying virtual resources provided by the hypervisor are not likely to change. (The caveat to this is that Windows Server 2012's hypervisor can move virtual machines around over the network even while they are running. Whether Azure takes advantage of this, I don't know.)
Now, Azure keeps charging you even when your role has stopped, because it considers your role "deployed". So in theory, those underlying resources still "belong" to your role.
This is not guaranteed. Azure could decide to boot up your VM on a different set of virtualized hardware for any number of reasons, with hardware failure at the top of the list and insufficient capacity second.
It is even possible (though unlikely) for your resources to be provided by different hardware nodes.
An additional point of consideration is that it is Azure policy that disaster recovery (or other major event) may include transferring your roles to run in a separate data centre entirely.
My point is that the underlying hardware is virtual and treating it otherwise is most unwise. Roles are at the mercy of the Azure Management Routines, and we can't predict in advance what decisions they may make.
So the answer to your question is that ALL of the underlying resources may change. And it is very, very possible.

Resources