Open a port for Kafka communication to the outside-world - linux

I have a VM (Linux OS) in Azure which has Hortonworks on it, which launches Kafka.
Kafka service is running and I am able of creating producer and consumer inside the VM.
I have the server IP and I'm also able to log into Ambari using 8080 port.
When I am trying to send a message to Kafka from my Java application I get a TimoutEception after 60 seconds.
What do I need to do in order to set the right port for Kafka communication from outside the VM?
I think that the m,ain issue here, is that Kafka is listening on local IP and not on the VM IP (WAN).
Any help will be really appreciated...

If you have used the Azure Resource Manager workflow to create the VM you have a Network Security Group that has been created automatically. You need to create rules in the NSG to make Kafka available. See : https://azure.microsoft.com/en-us/documentation/articles/virtual-networks-nsg/
If you have used the Azure classic deployment workflow, you need to define an endpoint to expose Kafka. See: https://azure.microsoft.com/fr-fr/documentation/articles/virtual-machines-windows-classic-setup-endpoints/
Hope this helps,
Julien

Did you set Kafka advertised.host.name and advertised.port environment variables? That's how you present yourself to the outside world.

(Copy and pasting my response to a similar post)
For the recent versions of Kafka (0.10.0 as of this writing), you don't want to use advertised.host.name at all. In fact, even the [documentation] states that advertised.host.name is already deprecated. Moreover, Kafka will use this not only as the "advertised" host name for the producers/consumers, but for other brokers as well (in a multi-broker environment)...which is kind of a pain if you're using using a different (perhaps internal) DNS for the brokers...and you really don't want to get into the business of adding entries to the individual /etc/hosts of the brokers (ew!)
So, basically, you would want the brokers to use the internal name, but use the external FQDNs for the producers and consumers only. To do this, you will update advertised.listeners instead.

Related

How to properly connect client application to Scylla or Cassandra?

Let's say I have a cluster of 3 nodes for ScyllaDB in my local network (it can be AWS VPC).
I have my Java application running in the same local network.
I am concerned how to properly connect app to DB.
Do I need to specify all 3 IP addresses of DB nodes for the app?
What if over time one or several nodes die and get resurrected on other IPs? Do I have to manually reconfigure application?
How is it done properly in big real production cases with tens of DB servers, possibly in different data centers?
I would be much grateful for a code sample of how to connect Java app to multi-node cluster.
You need to specify contact points (you can use DNS names instead of IPs) - several nodes (usually 2-3), and driver will connect to one of them, and will discover the all nodes of the cluster after connection (see the driver's documentation). After connection is established, driver keeps the separate control connection opened, and via it receives the information about nodes that are going up & down, joining or leaving the cluster, etc., so it's able to keep information about cluster topology up-to-date.
If you're specifying DNS names instead of the IP addresses, then it's better to specify configuration parameter datastax-java-driver.advanced.resolve-contact-points as true (see docs), so the names will be resolved to IPs on every reconnect, instead of resolving at the start of application.
Alex Ott's answer is correct, but I wanted to add a bit more background so that it doesn't look arbitrary.
The selection of the 2 or 3 nodes to connect to is described at
https://docs.scylladb.com/kb/seed-nodes/
However, going forward, Scylla is looking to move away from differentiating between Seed and non-Seed nodes. So, in future releases, the answer will likely be different. Details on these developments at:
https://www.scylladb.com/2020/09/22/seedless-nosql-getting-rid-of-seed-nodes-in-scylla/
Answering the specific questions:
Do I need to specify all 3 IP addresses of DB nodes for the app?
No. Your app just needs one to work. But it might not be a bad idea to have a few, just in case one is down.
What if over time one or several nodes die and get resurrected on other IPs?
As long as your app doesn't stop, it maintains its own version of gossip. So it will see the new nodes being added and connect to them as it needs to.
Do I have to manually reconfigure application?
If you're specifying IP addresses, yes.
How is it done properly in big real production cases with tens of DB servers, possibly in different data centers?
By abstracting the need for a specific IP, using something like Consul. If you wanted to, you could easily build a simple restful service to expose an inventory list or even the results of nodetool status.

NodeJS TLS/TCP server in need of an external firewall

Problem:
I have an AWS EC2 instance running FreeBSD. In there, I'm running a NodeJS TLS/TCP server. I'd like to create a set of rules (in my NodeJS application) to be able to individually block IP addresses programmatically based on a few logical conditions.
I'd like to run an external (not on the same machine/instance) firewall or load-balancer, that I can control from NodeJS programmatically, such that when certain conditions are given, I can block a specific remote-address(IP) before it reaches the NodeJS instance.
Things I've tried:
I have initially looked into nginx as an option, running it on a second instance, and placing my NodeJS server behind it, but after skimming through the NGINX
Cookbook
Advanced Recipes for High Performance
Load Balancing I've learned that only the NGINX Plus (the paid version) allows for remote/API control & customization. While I believe that paying $3500/license is not too much (considering all NGINX Plus' features), I simply can not afford to buy it at this point in time; in addition the only feature I'd be using (at this point) would be the remote API control and the IP address blocking.
My second thought was to go with the AWS/ELB (elastic-load-balancer) by integrating AWS' SDK into my project. That sounded feasible, unfortunately, after reading a few forum threads and part of their documentation (unless I'm mistaken) it seems these two features I need are not available on the AWS/ELB. AWS seems to offer an entire different service called WAF that I honestly don't understand very well (both as a service and from a feature-stand-point).
I have also (briefly) looked into CloudFlare, as it was recommended in one of the posts, here on Sackoverflow, though I can't really tell if their firewall would allow this level of (remote) control.
Question:
What are my options? What would you guys recommend I did?
I think Nginx provide such kind of functionality please refer to link
If you want to block an IP with Node TCP you can just edit a nginx config file and deny IP address.
Frankly speaking, If I were you, I would use AWS WAF but if you don’t want to use it, you can simply use Node JS
In Node JS You should have a global array variable where you will store all blocked IP addresses and upon connection, you will check whether connected host IP is in blocked IP variable. However there occurs a problem when machine or application is restarted, you will lose all information about blocked IP-s. So as a solution to that you can just setup Redis (It is key-value database but there are also other datatypes) DB and store blocked IP-s there. Inasmuch as Redis DB is in RAM all interaction with DB will be instantly and as long as machine or node is restarted, Redis makes a backup on hard drive and it syncs from it and continue to work in RAM with old databases.

Submit & monitor spark jobs via java in cluster mode

I have a java class which manage jobs and execute them via spark(using 1.6).
I am using the API - sparkLauncher. startApplication(SparkAppHandle.Listener... listeners) in order to monitor the state of the job.
The problem is I moved to work in a real cluster environment and this way can’t work when the master and workers are not on the same machine, as the internal implementation is making a use of localhost only (loopback) to open a port for the workers to bind to.
The API sparkLauncher.launch() works but doesn’t let me monitor the status.
What is the best practice for cluster environment using a java code?
I also saw the option of hidden Rest API, is it mature enough? Should I enable it in spark somehow (I am getting access denied, even though the port is open from outside) ?
REST API
In addition to viewing the metrics in the UI, they are also available as JSON. This gives developers an easy way to create new visualizations and monitoring tools for Spark. The JSON is available for both running applications, and in the history server. The endpoints are mounted at /api/v1. Eg., for the history server, they would typically be accessible at http://:18080/api/v1, and for a running application, at http://localhost:4040/api/v1.
More details you can find here.
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information.
Information about the running executors
You can access this interface by simply opening http://driver-node:4040 in a web browser. If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc).
More details you can find here.

mesos-dns, best practice for working with ports

I am quite new to Service Discovery and clustered systems. I started experimenting with Mesos and Marathon for the deployment of Docker containers, the Marathon REST API and UI seem to do a good job.
My problem is the actual discovery of deployed services. For testing purposes I deployed a Kafka Cluster scaled to 3 instances via Marathon, as so I did with a MongoDB test-cluster. The Mesos-DNS client gives me a record like kafka.marathon.mesos and mongo.marathon.mesos which implies the dynamically mapped port from the host to the container. The problem is, that my client explicitly needs information about the target port. Is there a general way to get those port information from the service automatically and dymanically? What about apps exposing multiple ports?
My thougts so far:
- Doing a REST call to get a status about the deployed app and somehow extract the relevant data
- Doing a DNS SRV lookup and somehow extract the relevant data
- Having some kind of "master", statically bound to a port, with dynamic "clients".
I searched a lot for those informations but in the end most of the tutorials ended with a manual lookup which is not what I aim for.
You're spot on. I recently gave a presentation at XebiCon around this topic and plan to publish a blog post with details about the setup incl. GitHub repo. For starters you could have a look at a Python implementation for the HTTP API consumption part.
UPDATE: the blog post is now available here.

distributed logging: JMS and log4j?

Been doing some searching for a solution to this problem: I need log entries from apps running on several machines to be sent to & aggregated on a remote server. Requirements:
logging in the app needs to be asynchronous (can't wait for log entry to traverse network)
logging in the app needs to be queued; if the network fails, log entries need to be queued locally and sent to
centralized server when the network becomes available again
I'm looking at using log4j and a JMSAppender. Assuming that's a suitable solution, are there any examples available? What process would be running on the centralized server to receive log entries in this scenario?
Thanks.
One simple setup I came to think about is to use Apache ActiveMQ
It is an open source messaging broker (JMS compatible) that is able to cluster queues among several physical machines and the ActiveMQ installation is rather lightweight. You simple install one ActiveMQ on each of your applications machines. Then on the logging server (Physical Server C in the picture) you would have another ActiveMQ. Your application would use a JMS appender (read more here) and you could actually just use the included apache camel to read from the queue and write a log on file or database without needing to write an application for that task.
It could be as simple as adding something like the following to the camel.xml in the activemq /conf installation and import the camel.xml in the activemq.xml configuration.
<route>
<from uri="activemq:queue:LogQueue"/>
<to uri="file:target/folder/?fileName=logfile.log&fileExist=Append"/>
</route>
You could use a myrriad of other frameworks, JMS servers and technologies, but I think this is a rather easy approach to achieve with very low cost and high stability.

Resources