How to use Zookeeper with Azure HDInsight Linux cluster?

Obviously I need to start a ZooKeeper server on one of the cluster machines, and then I need the other client machines to connect to this server.
The way I did it is that I used ssh to connect to the headnode, where I found a ZooKeeper server running on port 2181. So I used ifconfig to get the machine's IP address (for example 10.0.0.8), and I then had my worker nodes connect to 10.0.0.8:2181.
However, my MR job now completes, but it runs slowly and the output is not correct. I suspect that I'm doing something wrong with ZooKeeper, especially since I didn't follow a tutorial and improvised my steps.

HDInsight has multiple ZooKeeper servers. I'm not sure whether specifying only one might be the cause of the problem you are seeing.
I wrote an example a while back that uses Storm to write to HBase (both servers on the same Azure Virtual Network), and as part of the configuration I had to specify the three ZooKeeper servers for the component that writes to HBase. (https://azure.microsoft.com/en-us/documentation/articles/hdinsight-storm-sensor-data-analysis/ is the article.)
From the cluster head node, you can probably ping zookeeper0, zookeeper1, and zookeeper2 to find the IP address of each.
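As a minimal sketch of what "specifying all three servers" can look like, the helper below joins the hosts into the comma-separated host:port list that a ZooKeeper client accepts as its connect string. The function name is hypothetical, the host names are the ones mentioned above, and port 2181 is taken from the question; check the actual names and port on your own cluster.

```python
def zk_connect_string(hosts, port=2181):
    """Join ZooKeeper hosts into a comma-separated host:port connect string."""
    if not hosts:
        raise ValueError("at least one ZooKeeper host is required")
    return ",".join(f"{host}:{port}" for host in hosts)


# Host names as suggested in the answer; they are assumptions about your cluster.
quorum = zk_connect_string(["zookeeper0", "zookeeper1", "zookeeper2"])
print(quorum)  # zookeeper0:2181,zookeeper1:2181,zookeeper2:2181
```

Passing the whole list rather than a single IP lets the client fail over to another server if the one it is connected to becomes unavailable.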

Related

Spark streaming Kafka not working when Kafka and Worker are on different Machines

I have a simple Spark Streaming app that works with Kafka (deployed on my machine, using the basic config that ships with the distribution). When I run my Spark Streaming app on a standalone server with my master and worker on my machine, and therefore on the same machine as Kafka, everything is fine.
However, as soon as I add another node/worker, or if I simply start the worker only on my second machine (where Kafka is not running), nothing happens anymore. The Streaming tab disappears, but I don't see any error in the stderr of the driver or the worker in the UI.
With no error, I just don't know where to look. The application just does not work.
If anyone has ever experienced something of the sort, could you please share some suggestions?
I use the proper IP address of the machine on my local network.
A possible issue that would cause this behaviour is a misconfiguration of the Kafka advertised host.
By default a Kafka broker advertises itself using whatever java.net.InetAddress.getCanonicalHostName() returns. The returned address might not be reachable from the node running the Spark worker.
To fix the issue, set the advertised address on each Kafka broker to one that is reachable from all the nodes.
The relevant Kafka broker configuration options are:
Kafka 0.9: advertised.host.name
Kafka 0.10: advertised.listeners (with fallback on advertised.host.name)
For further details on these configuration parameters, refer to the Kafka documentation for version 0.9 or 0.10.
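One way to check whether an advertised address is actually reachable is to attempt a plain TCP connection to it from the machine running the Spark worker. The following is only a sketch: the function name is hypothetical, and the host and port in the usage comment are placeholders for whatever your broker advertises (9092 is Kafka's default port).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


# Example usage on the worker node (placeholder address, adjust to your broker):
# can_connect("kafka-broker.example", 9092)
```

If this returns False on the second machine while it returns True on the machine hosting Kafka, the advertised host is a likely culprit. A successful TCP connection does not prove the Kafka configuration is correct, only that the address is reachable.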

Spark Yarn-client mode across network via VPN

I have been trying to get Spark yarn-client mode working through a VPN. More specifically, the Spark driver is launched locally from my laptop, while the YARN cluster is in its own private network, reachable through a non-bridged VPN.
The first challenge was to make the Spark driver service reachable from the YARN cluster: since the VPN is one-way, my laptop is not routable from the cluster.
I managed to get this working by adding an entry in /etc/hosts that points a public domain name to my local network IP, something like
192.168.0.6 spark.driver.mydomain
Then I set spark.driver.host=spark.driver.mydomain.
Now the Spark driver can successfully bind to spark.driver.mydomain and tell the YARN application manager to connect to spark.driver.mydomain. I also needed to configure spark.driver.mydomain to point to my public IP by modifying my domain's DNS, and to configure the firewall to make the service publicly available.
Now I can run Spark from my laptop to drive the cluster, so I am almost there. However, the SparkUI doesn't work. There is no way to connect to the SparkUI, despite the message saying it was successfully started at spark.driver.mydomain:4040. I opened all the ports through my local network's firewall using a DMZ. I also tried using the local network IP address. I can see that it is being redirected to the YARN resource manager's link, http://resourcemanager/proxy/application_id, but it just times out eventually, and I haven't figured out how the proxy works.
The spark session also occasionally spits out warning messages like
WARN ReliableDeliverySupervisor: Association with remote system
[akka.tcp://sparkExecutor#executor:port] has failed, address is
now gated for [5000] ms. Reason is: [Disassociated].
The basic Spark actions all work despite the warning message.
There are still quite a few concerns and questions:
Does the communication between the Spark driver and the YARN cluster contain unencrypted data in this scenario? Are there any data security concerns (assuming the VPN is secure)?
The SparkUI is not accessible, which is intolerable.
The warning messages shown above.
Is it really good practice to run the driver from a remote network in yarn-client mode? There are certainly other benefits to doing so, but is the framework designed for this?
Finally, here is a JIRA issue that may lead to more general solutions. https://issues.apache.org/jira/browse/SPARK-5113

Cassandra native transport port 9042 slow on EC2 Machine

I have a 5 node Cassandra cluster set up on EC2, all in the same region.
If I connect over cqlsh (9160), queries respond in under a second.
When I connect via Dev Center, or using the native Java Driver, both of which use port 9042, the queries take over 20 seconds to respond.
They consistently respond in about 21 seconds; it is never fast and then slow.
I have set up a few Cassandra Clusters on EC2 and have seen this before but do not know how to fix the problem. The last time, I scrapped the cluster and built a new one and the response time on port 9042 was fine.
Any help in how to debug or fix this problem would be appreciated, thanks.
The current version of DevCenter was designed with running (longish) CQL scripts as its main scenario, rather than an interactive console with queries executed one after another. DevCenter uses the DataStax Java driver for Cassandra as its underlying connector.
For the above mentioned scenario, in order to ensure there are no "conflicts", a new Session is created for each execution. When a Session is initialized, the driver performs an auto-node discovery, creates connection pools, etc. Basically it does a lot of preparation work. Depending on the latency from your client machine to the EC2 nodes, the size of the cluster and also the configuration of these nodes (see the connection requirements), this initialization phase can be quite expensive.
As you can imagine, the time spent preparing wouldn't represent a large percentage of the time needed to run a DDL script and a decent number of inserts/updates. But for an interactive scenario, it results in suboptimal behavior (the one you are describing).
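To make the proportion concrete, here is a rough back-of-the-envelope sketch. The function name and all numbers are illustrative assumptions (a 20 second session setup, loosely matching the delay in the question, and 0.5 seconds per statement); they are not measurements of DevCenter or the driver.

```python
def setup_share(setup_seconds, statements, seconds_per_statement):
    """Fraction of total run time spent on one-off session setup."""
    total = setup_seconds + statements * seconds_per_statement
    return setup_seconds / total


# Illustrative numbers only: 20 s of setup, 0.5 s per statement.
print(f"{setup_share(20.0, 1, 0.5):.0%}")     # single interactive query: 98%
print(f"{setup_share(20.0, 2000, 0.5):.0%}")  # long script of 2000 statements: 2%
```

Under these assumed numbers the same fixed setup cost dominates a single query but nearly disappears in a long script.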
The next version(s) of DevCenter will address the interactive scenario and optimize for it so the user experience would be what you'd expect. And supporting this scenario is pretty high on our list of priorities.
The underlying Java driver obtains the whole cluster topology when it initially connects. This enables it to automatically connect to any node in the cluster. On EC2 it only obtains the private addresses, tries each one, and then times out. It then sends the request over the initial connection.

How Can I run more than one cassandra server in single machine and form one cluster ring?

I would like to know whether there is any way to run multiple Cassandra servers on a single machine, so that all the servers on that machine form one ring (cluster).
There's always a way!
There is an excellent tool available, ccm, that allows you to configure a multi-node cluster locally, but it's currently not supported under Windows. When you build a cluster and start it, it will configure the ring for you. You can check out the ring using ./nodetool -h 127.0.0.1 -p 7100 ring after it has started.
Just a side note: the ccm tool starts the cluster as a background process.

install multi-node cassandra in windows

Is there any detailed step-by-step document that covers multi-node Cassandra installation on Windows? I read some documents/blogs and tried on Windows 7 workstations and Windows 2008 servers, but was not able to establish a connection from the 2nd node to the 1st node.
When I was setting up my first cluster on Windows I found this blog post to be excellent. It covers many aspects of the setup including:
Firewall / Networking issues.
Running Cassandra as a service.
Monitoring and maintenance.
If you want to create a complete setup using just Cassandra, have a look at this blog.
But to set up a multi-node cluster, you basically need to have the correct ports open on your servers. When it comes to configuration, you are basically going to have identical cassandra.yaml configs across all your nodes, with the same seeds list. The only two fields that need to be changed are listen_address and possibly rpc_address, although you could just listen on all interfaces for the rpc_address by setting it to:
rpc_address: 0.0.0.0
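The per-node differences described above can be sketched as a small helper that computes, for each node, the values that go into its cassandra.yaml. The function name and the IP addresses are hypothetical; only listen_address, rpc_address and seeds are settings named in the answer, and this sketch does not write the files for you.

```python
def node_overrides(node_ips, seed_ips, rpc_address="0.0.0.0"):
    """Return the cassandra.yaml values that differ (or are shared) per node.

    Every node gets the same seeds list; only listen_address is node-specific.
    """
    seeds = ",".join(seed_ips)
    return {
        ip: {"listen_address": ip, "rpc_address": rpc_address, "seeds": seeds}
        for ip in node_ips
    }


# Hypothetical two-node cluster with the first node acting as the seed.
for ip, settings in node_overrides(["10.0.0.1", "10.0.0.2"], ["10.0.0.1"]).items():
    print(ip, settings)
```

Keeping the seeds list identical on every node is the part that lets the second node find the first; the ports between the machines still have to be open as noted above.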
