HDInsight vs. Virtualized Hadoop Cluster on Azure

I'm investigating two alternatives for running a Hadoop cluster. The first is using HDInsight (with either Blob or HDFS storage); the second is deploying a powerful Windows Server machine on Microsoft Azure and running HDP (Hortonworks Data Platform) on it using virtualization. The second alternative gives me more flexibility, but what I'm interested in is the overhead of each alternative. Any ideas on that? In particular, what effect does Blob storage have on efficiency?

This is a pretty broad question, so an answer of "it depends" is appropriate here. When I talk with customers, this is how I see them making the tradeoff: it's a spectrum, with control at one end and convenience at the other. Do you have specific requirements on which Linux distro or Hadoop distro you deploy? Then you will want to go with IaaS and simply deploy there. That's great, you get a lot of control, but patching and operations are still your responsibility.
We refer to HDInsight as a managed service, and what we mean by that is that we take care of running it for you (e.g., there is an SLA we provide on the cluster itself and on the apps running on it, not just "can I ping the VM"). We operate that cluster, patch the OS, patch Hadoop, and so on. So there's a lot of convenience there, but we don't let you choose which Linux distro to use or allow you to run an arbitrary set of Hadoop bits there.
From a perf perspective, HDInsight can deploy on any Azure node size, similar to IaaS VMs (this is a new feature launched this week). On the question of Blob efficiency, you should try both out and see what you think. The nice part about Blob storage is that you get more economic flexibility: you can deploy a small cluster against a massive volume of data if that cluster only needs to work on a small chunk of it (as compared to putting it all in HDFS, where you need all of the nodes running all of the time just to hold all of your data).
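To make the "data lives in Blob storage, the cluster is optional" point concrete, here is a minimal sketch using the azure-storage-blob Python package. The account, container, and prefix names are made up for illustration; an HDInsight cluster would reach the same data through wasb:// paths.

```python
# Minimal sketch: the data sits in a Blob container whether or not a cluster
# is running, so you can inspect or grow it without paying for idle worker nodes.
# Account/container/prefix names are placeholders.
# Requires: pip install azure-storage-blob
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")
container = service.get_container_client("hdinsight-data")

total_bytes = 0
for blob in container.list_blobs(name_starts_with="raw/2015/"):
    total_bytes += blob.size

print(f"Data waiting for the next cluster run: {total_bytes / 1024**3:.1f} GiB")
```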

Related

How to migrate data between Apache Pulsar clusters?

How do folks migrate data between Pulsar environments, either for disaster recovery or for blue-green deployment? Are they copying data to a new AWS region or Kubernetes namespace?
One of the easiest approaches is to rely on geo-replication to replicate the data across different clusters.
This PIP was just created to address this exact scenario of blue-green deployments, and to address issues with a cluster becoming non-functional or corrupted when an upgrade or update (perhaps an experimental setting change) goes wrong: https://github.com/pkumar-singh/pulsar/wiki/PIP-95:-Live-migration-of-producer-consumer-or-reader-from-one-Pulsar-cluster-to-another
Until then, aside from using geo-replication, another potential way around this is an active-active setup (with two live clusters) and a DNS endpoint that producers and consumers use, which can be cut over from the old cluster to the new cluster just before taking the old cluster down. If the blip in latency this approach would cause is not acceptable, you could replicate just the ingest topics from the old cluster to the new cluster, cut over your consumers, and then cut over your producers. Depending on how your flows are designed, you may need to replicate additional topics to make that happen. Keep in mind that on the cloud there can be different cost implications depending on how you do this. (Network traffic across regions is typically much more expensive than traffic within a single zone.)
If you want to keep the clusters completely separate (meaning you avoid Pulsar geo-replication), you can splice a Pulsar replicator function into each of your flows to manually produce your messages to the new cluster until the traffic is cut over. If you do that, just be mindful that the function must have its own subscription on each topic you deploy it to, to ensure that you're not losing messages.
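Here is a rough sketch of that replicator idea using the Python pulsar-client, written as a stand-alone process rather than a deployed Pulsar Function. The cluster URLs, topic, and subscription names are placeholders.

```python
# Consume from the old cluster on a dedicated subscription and re-produce
# every message to the same topic on the new cluster.
# Requires: pip install pulsar-client
import pulsar

old_client = pulsar.Client("pulsar://old-cluster:6650")
new_client = pulsar.Client("pulsar://new-cluster:6650")

# The replicator gets its own subscription, so it tracks its own position
# and no messages are skipped while traffic is being cut over.
consumer = old_client.subscribe(
    "persistent://public/default/ingest",
    subscription_name="replicator-sub",
)
producer = new_client.create_producer("persistent://public/default/ingest")

while True:
    msg = consumer.receive()               # blocks; add a timeout/stop signal for a real cut-over
    producer.send(msg.data(), properties=msg.properties())
    consumer.acknowledge(msg)              # ack only after the new cluster has accepted the message
```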

How to effectively share data between scale set VMs

We have an application which is quite scalable as it is. Basically, you have one or more stateless nodes that all do independent work on files that are read from and written to a shared NFS share.
This NFS share can be a bottleneck, but with on-premises deployments customers just buy a big enough box to get sufficient performance.
Now we are moving this to Azure, and I would like a better, more "cloudy" way of sharing data :). Running our own Linux NFS servers isn't an ideal scenario if we have to manage them.
Is the Azure Blob storage the right tool for this job (https://azure.microsoft.com/en-us/services/storage/blobs/)?
we need good scalability, e.g. up to 10k files written per minute
files are quite small, less than 50KB per file on average
files created and read, not changed
files are short-lived; we purge them every day
I am looking for more practical experience with this kind of storage and how good it really is.
There are two possible solutions for your requirements: Azure Storage blobs (recommended for your scenario) or Azure Files.
Azure Blob storage has well-documented scaling targets.
Blob storage can't be attached to a server like a network share.
Blobs do not support a hierarchical file structure beyond containers. Virtual folders can be used, but the downside (relevant to your purging requirement) is that you can't delete a container while it still contains blobs; you can, however, purge blobs with your own code (see the sketch at the end of this answer).
Azure Files: by contrast, this exposes an SMB share that can be mounted like a network drive.
Links recommended:
Comparison between Azure Files and Blobs: https://learn.microsoft.com/en-us/azure/storage/common/storage-decide-blobs-files-disks
Informative SO post here
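To make the "purge with your own code" point above concrete, here is a minimal sketch using the azure-storage-blob Python package. The container name and the per-day prefix convention are assumptions for illustration, not anything Azure requires.

```python
# Minimal sketch: write small, short-lived files as blobs under a per-day
# prefix, then purge yesterday's prefix with your own code (containers that
# still contain blobs cannot simply be deleted).
# Requires: pip install azure-storage-blob
import datetime

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("work-items")

def write_item(name: str, payload: bytes) -> None:
    day = datetime.date.today().isoformat()
    container.upload_blob(f"{day}/{name}", payload)   # e.g. "2018-05-04/job-123.json"

def purge_day(day: datetime.date) -> None:
    prefix = day.isoformat() + "/"
    for blob in container.list_blobs(name_starts_with=prefix):
        container.delete_blob(blob.name)

write_item("job-123.json", b"{...}")
purge_day(datetime.date.today() - datetime.timedelta(days=1))
```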

How to execute parallel computing between several instances in Google Cloud Compute Engine?

I've recently run into a problem processing an 8 GB pickle file with a Python script on VMs in Google Compute Engine. The processing takes too long and I am looking for ways to reduce the time. One possible solution could be splitting the work in the script and mapping it across the CPUs of several VMs. If somebody knows how to do this, please share!
You can use Clusters for Large-scale Technical Computing on Google Cloud Platform (GCP). Open-source software like ElastiCluster provides cluster management and support for provisioning nodes on Google Compute Engine (GCE).
After the cluster is operational, a workload manager handles task execution and node allocation. There are a variety of popular commercial and open-source workload managers, such as HTCondor from the University of Wisconsin, Slurm from SchedMD, Univa Grid Engine, and LSF Symphony from IBM.
This article is also helpful.
It looks like an HPC problem. Have a look at this link: https://cloud.google.com/solutions/architecture/highperformancecomputing.
There are a lot of possible solutions to your problem, but the right one depends on the details of your case. A first, simple approach could be to logically split your task into small jobs and then assign a subset of these jobs to each GCE instance in your group of dedicated instances, as sketched below.
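Here is a minimal sketch of that idea on a single VM, assuming the pickled object is a list (or other sequence) of records that can be processed independently; process_record() is a placeholder for whatever your script currently does per record.

```python
# Split the work into independent records and map them across the VM's CPUs.
import pickle
from multiprocessing import Pool

def process_record(record):
    ...  # placeholder: the per-record work from the original script

def main():
    with open("data.pkl", "rb") as f:
        records = pickle.load(f)          # the 8 GB object; needs enough RAM on the VM

    with Pool() as pool:                  # one worker process per vCPU by default
        results = pool.map(process_record, records, chunksize=1000)

    with open("results.pkl", "wb") as f:
        pickle.dump(results, f)

if __name__ == "__main__":
    main()
```

To spread the same idea across several VMs, each instance could be given an index (for example via instance metadata or a startup-script argument) and work only on its slice, e.g. records[index::number_of_vms], with the partial results merged afterwards.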
You could create a group with a predefined number of instances. Each instance could rely on a startup script to fetch the job it must execute. When the job finishes, the instance can be deleted and replaced by a new one (a Google Compute Engine managed instance group will create the new instance automatically). You only have to manage when the group should start and stop.
Furthermore, you can consider preemptible instances (much cheaper).
Hope this helps you.
Bye

Wildcards in counter specifiers in Azure Diagnostic

What I am trying to achieve is similar to what you can see in the mongo-azure repo: specifically, I want to write the Azure Diagnostics configuration file in such a way that I can get all the instances of a performance counter, for example
\Processor(*)\% Processor Time
and it doesn't seem to be working - no data is visible in the table in my storage account.
Is it achievable at all with configuration, and if so, how?
UPD: We were able to get this working for a simple single VM (so it is possible!), but for some reason it still doesn't work for VMs in a VMSS where a Service Fabric cluster is running.
UPD #2: We upgraded to VS 2015 tools 1.5 and now it magically works. I am not really sure whether that was the root cause or we had screwed up somewhere else.
Is it achievable at all with configuration, and if so, how?
Based on the documentation here, it seems it is not possible. From this link:
Performance counters available for Microsoft Azure
Azure provides a subset of the performance counters available for Windows Server, IIS and the ASP.NET stack. The following table lists some of the performance counters of particular interest for Azure applications.
The table that follows only includes Processor(_Total).
This is possible. We do this in CloudMonix when monitoring VMSS and it is working.
How are you instrumenting Diagnostics and which tables are you looking at?
Specifying \LogicalDisk(*)\Available Megabytes yields all drives and their free space, for example.
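For the "no data is visible in the table" part, a quick way to check which counter instances actually landed in storage is to read the WADPerformanceCountersTable that Azure Diagnostics writes to. The sketch below uses the azure-data-tables Python package; the connection string is a placeholder, and the column names (RoleInstance, CounterName, CounterValue) follow the usual WAD schema, so treat them as assumptions if your setup differs.

```python
# Minimal sketch: peek at what Azure Diagnostics actually wrote to the
# WADPerformanceCountersTable in the diagnostics storage account.
# Requires: pip install azure-data-tables
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<diagnostics-storage-connection-string>")
table = service.get_table_client("WADPerformanceCountersTable")

# Print a handful of rows: which role instance reported which counter instance.
for i, entity in enumerate(table.list_entities()):
    print(entity.get("RoleInstance"), entity.get("CounterName"), entity.get("CounterValue"))
    if i >= 20:
        break
```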

What does the Azure Web Apps architecture look like?

I've had a few outages of 10 to 15 minutes because apparently Microsoft had a "blip" on their storage. They told me that it is because of a shared file system between the instances (making it a single point of failure?).
I didn't understand that and asked how a file share is involved, because I would assume a really dumb stateless IIS app that communicates with SQL Azure for its data.
I would assume the situation to be as in my diagram (not included here).
This is their reply to my question (I didn't include the drawing):

The file shares are not necessarily for your web app to communicate with other resources; they are on our end, where the app content resides. That is what we meant when we suggested that storage was unavailable on our file servers. The reason the restarts would be triggered for your app on both instances is that the resources are shared: the underlying storage would be the same for both instances. That's the reason that, if it goes down on one, the other would also follow eventually. If you really want the availability of the app to be improved, you can always use a traffic manager. However, there is no guarantee that even with a traffic manager in place the app won't go down, but it improves the overall availability of your app. Also, we have recently rolled out an update to production that should ideally take care of restarts caused by storage blips, but for this feature to kick in you need to make sure that there is an ample amount of memory available. We have a couple of options that you can set up in order to avoid any unexpected restarts of the app because of a storage blip on our end:

You can evaluate whether you want to move to a bigger instance, so that we might have enough memory for the overlap recycling feature to kick in.

If you don't want to move to a bigger instance, you can always use the local cache feature as outlined by us in our earlier email.
Because of the time difference, the communication takes ages. Can anyone tell me what is wrong with my thinking?
The only thing I can think of is that when you've enabled two instances, they run on the same physical server. But that makes very little sense to me.
I have two instances, each with one core and 1.75 GB of memory.
My presumption for App Service plans was that they were automatically split into availability sets (see below for a brief description), largely based on the Web Apps sales spiel, which states:
App Service provides availability and automatic scale on a global data centre infrastructure. Easily scale applications up or down on demand, and get high availability within and across different geographical regions.
Following on from David Ebbo's answer and comments, the underlying architecture of Web Apps appears to be that the VMs themselves are separated into availability sets. However, all of the instances use the same file server to share the underlying disk space, and this file server is a significant single point of failure.
To mitigate this, Azure has created the WEBSITE_LOCAL_CACHE_OPTION setting, which caches the contents of the file server onto the individual Web App instances: caching in lieu of solid high-availability engineering.
The problem here is that as customers we have no visibility into this issue; we have no idea whether there is a plan to fix it, or if or when it will ever be fixed, since it seems unlikely that Azure will publish a document admitting how badly this has been engineered, even if only to say that it is fixed.
I also can't imagine that this issue would be any different between ASM and ARM. It seems exceptionally unlikely that there was originally a high availability solution at the backend that they scrapped when ARM came along. So it is very likely that cloud services would suffer the exact same issue.
The small upside is that now that we know this is an issue, one possible solution would be to deploy multiple web apps and have a traffic manager between them. Even if they are in the same region, different apps should have different backend file servers.
My first action would be to reply to that email with a link to the Web Apps page (and this question), include a copy of the quote, and ask how to enable high availability within a geographic region.
After that you'll likely need to rearchitect your solution!
Availability sets
For virtual machines, Azure lets you specify an availability set. An availability set automatically splits VMs into separate update and fault domains, meaning that the servers end up in different server racks, and those racks won't get updates at the same time. (It is a little more complex than that, but those are the basics!)
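For comparison with what you get on the IaaS side, here is a minimal sketch of creating an availability set with explicit fault and update domains using the azure-mgmt-compute Python SDK. The resource names, region, and domain counts are all placeholders, and the exact model fields may differ slightly between SDK versions.

```python
# Minimal sketch: an availability set with explicit fault/update domains.
# Requires: pip install azure-identity azure-mgmt-compute
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.compute.models import AvailabilitySet

client = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

avset = AvailabilitySet(
    location="westeurope",
    platform_fault_domain_count=2,    # VMs spread across separate racks/power/network
    platform_update_domain_count=5,   # VMs are not all rebooted for platform updates at once
)
client.availability_sets.create_or_update("my-resource-group", "web-avset", avset)
```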
Azure Web Apps does use shared file storage. The best way to think about it is that all the instances of your app map to the same network share that has your files. So if you modify the files by any means (e.g. FTP, msdeploy, git, ...), all the instances instantly get the new files (since there is only one set of files).
And to answer your final question, each instance does run on a separate VM.
