Deploying a local Elasticsearch cluster on Azure

I am using Apache Nutch to crawl sites for one of my projects, and the data and content are crawled successfully. For indexing and search queries, I am using an Elasticsearch cluster to process the data easily. I am really new to Elasticsearch clusters; locally everything is working fine. Now I want to deploy the same local cluster to Azure so that I can access the data from another application I am working on. I have seen some tutorials, but none of them cover deploying a local cluster to Azure.
Please guide me through this.

You cannot deploy a local cluster to Azure; you need to set up another Elasticsearch cluster on Azure nodes and point your writing application to that cluster.
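For example, if the writing application uses the Python Elasticsearch client, switching from the local cluster to the Azure-hosted one is mostly a connection-settings change. Below is a minimal sketch assuming an elasticsearch-py 7.x client; the hostname, credentials and index name are placeholders for whatever your Azure deployment exposes.

    from elasticsearch import Elasticsearch

    # Local cluster used during development
    local_es = Elasticsearch(["http://localhost:9200"])

    # Azure-hosted cluster: point the same client at the public endpoint of the
    # VM or load balancer fronting the Elasticsearch nodes you set up on Azure.
    azure_es = Elasticsearch(
        ["https://my-es-cluster.westeurope.cloudapp.azure.com:9200"],  # hypothetical hostname
        http_auth=("elastic", "changeme"),  # whatever auth your cluster enforces
        verify_certs=True,
    )

    # The indexing calls themselves stay the same; only the connection changes.
    doc = {"url": "http://example.com", "content": "crawled page text"}
    azure_es.index(index="nutch-crawl", body=doc)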

Related

Local instance of Databricks for development

I am currently working on a small team that is developing a Databricks based solution. For now we are small enough to work off of cloud instances of Databricks. As the group grows this will not really be practical.
Is there a "local" install of Databricks that can be installed for development purposes (it doesn't need to be a scalable version but does need to be essentially fully featured)? In other words, is there a way each developer can create their own development instance of Databricks on their local machine?
Is there another way to provide a dedicated Databricks environment for each developer?
Databricks, as a cloud-deployed platform, leverages many cloud technologies in its deployment. For example, Auto Loader incrementally ingests new data files as they arrive; on AWS it uses EventBridge, SNS and S3, while on Azure it uses EventHubs, Notification Hubs and ADLS. Databricks aims to create a seamless look and feel across AWS, Azure and GCP, but can do this only in the cloud.
For local development, you may be able to use Apache Spark and MLflow to create a similar experience, but the notebook experience isn't open source. The Databricks workflow is proprietary, even though Databricks has open-sourced many of its technologies, like Delta Lake. A local Spark and MLflow setup may suffice for some teams, who can then use the cloud sparingly, but the seamless workflow Databricks offers is challenging to replicate outside of the major cloud vendors.
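As a rough illustration, a local approximation might look like the sketch below, assuming pyspark and mlflow are pip-installed; it gives you a Spark session and experiment tracking against a local file store, but not the Databricks notebooks or workflow UI.

    import mlflow
    from pyspark.sql import SparkSession

    # Local Spark session instead of a Databricks cluster
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("local-dev")
        .getOrCreate()
    )

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    # MLflow tracking against a local file store instead of the managed,
    # Databricks-hosted tracking server
    mlflow.set_tracking_uri("file:./mlruns")
    mlflow.set_experiment("local-dev-experiment")

    with mlflow.start_run():
        mlflow.log_param("rows", df.count())
        mlflow.log_metric("dummy_metric", 0.42)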

Apache Beam using Spark Runner Deployment on Pivotal Cloud Foundry

I have a requirement to deploy an Apache Beam application using the Spark runner. My question is whether I can deploy a Spark application in a Pivotal Cloud Foundry environment. Could you please provide examples if available?
Thanks
Yes, Cloud Foundry can run Apache Spark applications. CF now has the ability to mount persistent volumes, manage container networking for the Spark cluster itself, and provide isolation segments for different kinds of compute nodes (e.g. to identify a sub-cluster with high-performance networking that might be better suited to Spark apps than general-purpose apps).
You still need a backing store outside of CF for the data you want to feed into Spark or output from Spark. This could be HDFS, Cassandra, JDBC/SQL, NFS, HTTP/S3, etc.
Cloud Foundry is stateless, but it's quite capable of running workloads like Spring Cloud Data Flow today, which integrates with Apache Spark quite nicely, along with HBase, Hadoop, regular RDBMSs, Kafka/Redis/RabbitMQ, FTP servers, cloud services... whatever you need, really.
Here are some links you can refer to:
How to leverage Pivotal Cloud Foundry, Pivotal HD, Apache Spark and EMC ECS to analyze Twitter data
Spark on Cloud Foundry
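For reference, a minimal Beam word-count pipeline targeting the Spark runner (Python SDK) might look like the sketch below. The Spark master URL and the S3 paths are placeholders for however your Spark cluster and backing store are reachable from Cloud Foundry, and the exact pipeline options vary by Beam version (reading s3:// paths also requires the apache-beam[aws] extra).

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        "--runner=SparkRunner",
        "--spark_master_url=spark://spark-master.internal:7077",  # hypothetical address
        "--environment_type=LOOPBACK",  # run the SDK harness in the submitting process
    ])

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("s3://my-bucket/input/*.txt")  # external backing store
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            | "Sum" >> beam.CombinePerKey(sum)
            | "Write" >> beam.io.WriteToText("s3://my-bucket/output/counts")
        )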

Upgrade/Migrate HDInsight Cluster to Last Version

I'm sure this is posted somewhere or has been communicated, but I just can't seem to find anything about upgrading/migrating an HDInsight cluster from one version to the next.
A little background. We've been using Hive with HDInsight to store all of our IIS logs since 1/24/2014. We love it and it provides good insight to our teams.
I was recently reviewing http://azure.microsoft.com/en-us/documentation/articles/hdinsight-component-versioning/ and noticed that our version of HDInsight (2.1.3.0.432823) is no longer supported and will be deprecated in May. That got me thinking about how to get onto version 3.2, but I just can't seem to find anything about how to go about doing this.
Does anyone have any insight into if this is possible and if so how?
HDInsight uses Azure Storage for persistent data, so you should be able to create a new cluster and point to the old data, as long as you are using wasb://*/* for your storage locations. This article has a great overview of the storage architecture: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-blob-storage/
If you are using Hive and have not set up a customized metastore, then you may need to save or recreate some of the tables. Here's a blog post that covers some of those scenarios: http://blogs.msdn.com/b/bigdatasupport/archive/2014/05/01/hdinsight-backup-and-restore-hive-table.aspx
You can configure a new cluster and add the existing cluster's storage container as an "additional" storage account to test this out without first taking down the current cluster. Just be sure not to have both clusters using the same container as their default storage.
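If you do end up recreating Hive tables on the new cluster, the key step is pointing each external table's LOCATION back at the old data. Below is a hedged sketch using PyHive purely for illustration; the HiveServer2 host/port, credentials, table schema and wasb:// URI are all placeholders, and the actual connection details depend on how your HDInsight cluster exposes Hive.

    from pyhive import hive

    # Placeholder connection to HiveServer2 on the new cluster
    conn = hive.Connection(host="new-cluster-hive-host", port=10000, username="admin")
    cursor = conn.cursor()

    # Point the table at the old container, attached to the new cluster as an
    # "additional" storage account (schema and URI are illustrative).
    cursor.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS iislogs (
            log_date STRING,
            client_ip STRING,
            uri STRING,
            status INT
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
        LOCATION 'wasb://oldcontainer@oldstorageaccount.blob.core.windows.net/iislogs'
    """)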

Elastic search azure cluster prevent public write

We recently created an Azure Elasticsearch cluster. Everything around writing, querying and fail-over works fine. But the cluster URL is public, so I can write data to the index from any machine. What is the best way to secure the cluster and prevent public writes, so that only authenticated machines can write data to the index?
Thanks,
Manish

Accessing Raw Data for Hadoop

I am looking at the data.seattle.gov data sets and I'm wondering, in general, how all of this large raw data can get sent to Hadoop clusters. I am using Hadoop on Azure.
It looks like data.seattle.gov is a self-contained data service, not built on top of a public cloud.
They have their own RESTful API for data access.
Therefore I think the simplest way is to download the data you are interested in to your Hadoop cluster, or to S3 and then use EMR or your own clusters on Amazon EC2.
If data.seattle.gov has the relevant query capabilities, you can query the data on demand from your Hadoop cluster, passing data references as input. This will only work if you are doing very serious data reduction in those queries; otherwise network bandwidth will limit performance.
In Windows Azure you can place your data sets (unstructured data, etc.) in Windows Azure Storage and then access them from the Hadoop cluster.
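As a rough sketch of that path, you could pull a data set from data.seattle.gov's RESTful (Socrata) API and stage it in the storage account backing your Hadoop-on-Azure cluster; the blob is then addressable from Hadoop jobs via its storage path. The dataset id, container name and connection string below are placeholders, and the modern azure-storage-blob SDK is used only for illustration.

    import requests
    from azure.storage.blob import BlobServiceClient

    # data.seattle.gov exposes its data sets over a RESTful (Socrata) API
    resp = requests.get(
        "https://data.seattle.gov/resource/<dataset-id>.csv",  # hypothetical dataset id
        params={"$limit": 50000},
    )
    resp.raise_for_status()

    # Upload into the storage account used by the Hadoop-on-Azure cluster
    service = BlobServiceClient.from_connection_string("<storage-connection-string>")
    blob = service.get_blob_client(container="rawdata", blob="seattle/dataset.csv")
    blob.upload_blob(resp.content, overwrite=True)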
Check out the blog post: Apache Hadoop on Windows Azure: Connecting to Windows Azure Storage from Hadoop Cluster:
http://blogs.msdn.com/b/avkashchauhan/archive/2012/01/05/apache-hadoop-on-windows-azure-connecting-to-windows-azure-storage-your-hadoop-cluster.aspx
You can also get your data from the Azure Marketplace e.g. Gov Data sets etc..
http://social.technet.microsoft.com/wiki/contents/articles/6857.how-to-import-data-to-hadoop-on-windows-azure-from-windows-azure-marketplace.aspx
