Spark JDBC connectivity from edge node Jupyter notebook - apache-spark

I need to create a DataFrame over a JDBC connection to an Oracle database. I am doing this from a Jupyter notebook on the edge node, with Spark running in client mode. The database host and port are not reachable from the edge node, but they are open from the data nodes. When I try to create the DataFrame, it fails with a "connect timed out" error. Is this normal? I suspect Spark is trying to open the connection from the edge node, where none can be established. How can I make sure the connection is made from the executors (which, as I understand it, is how it should work in theory)?

It is "normal". In general, the driver node (in your case the edge node) must have the same access to the data as any worker node. Although the data itself is loaded by the executors, the driver handles things like metadata (in your case, fetching and translating the schema) and computing splits (not relevant here).
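One way to confirm this diagnosis is to test plain TCP reachability of the database listener, first from the edge node and then from the executors. The following is a minimal sketch using only the Python standard library. The helper name `can_connect` and the host and port in the commented usage are placeholders assumed for illustration, not values from the question.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Hypothetical usage in the notebook (driver side); host and port are placeholders:
#   can_connect("db.example.com", 1521)
#
# The same check on the executors, assuming `sc` is an active SparkContext:
#   sc.parallelize(range(8), 8) \
#     .map(lambda _: (socket.gethostname(), can_connect("db.example.com", 1521))) \
#     .distinct().collect()
```

If the driver-side call returns False while the executor-side results are True, that matches the situation described above: the schema lookup on the driver is what times out.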

Related

Why is the Spark Cassandra Connector so slow in Java code against a Cassandra cluster?

We have tested this extensively in many scenarios with small data sets.
With a single, non-clustered Cassandra installation everything is fine, but against a Cassandra cluster the same function takes more than 15 seconds.
Our Java code follows the sample code; it simply calls dataset.collectAsList() or dataset.head(10).
The same logic written in Scala and run in spark-shell does not have the problem.
We have tested many JDKs and operating systems. macOS is fine, but Windows and Linux (e.g. CentOS) both show the problem.
It turns out that collectAsList and head try to resolve the host name, which is an expensive operation. So we cannot use IP addresses to connect to the Cassandra cluster; we have to use host names instead. And it works! The Spark Cassandra Connector code should fix this problem.
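To check whether name resolution really is the bottleneck on a given machine, the lookup can be timed directly. The sketch below is in Python rather than the Java used in the question and relies only on the standard library. `time_lookup` is a hypothetical helper name, and the address in the usage comment is a placeholder.

```python
import socket
import time


def time_lookup(address, resolver=socket.getfqdn):
    """Resolve `address` with `resolver` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = resolver(address)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Hypothetical usage; substitute the IP address of one of the Cassandra nodes:
#   name, seconds = time_lookup("10.0.0.1")
#   print(f"{name} resolved in {seconds:.3f}s")
```

A lookup that takes several seconds for a node's IP address but returns immediately for its host name would be consistent with the behaviour described above.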

NiFi Streaming to Spark on EMR

Using blog posts on Apache and Hortonworks I've been able to stream from NiFi to Spark when both are located on the same machine. Now I'm trying to stream from NiFi on one EC2 instance to an EMR cluster in the same subnet and security group and I'm running into problems. The specific error being reported by the EMR Core machine is
Failed to receive data from NiFi
java.net.ConnectException: Connection refused
at sun.nio.ch.Net.connect0(Native Method)
at sun.nio.ch.Net.connect(Net.java:454)
at sun.nio.ch.Net.connect(Net.java:446)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
at java.nio.channels.SocketChannel.open(SocketChannel.java:189)
at org.apache.nifi.remote.client.socket.EndpointConnectionPool.establishSiteToSiteConnection(EndpointConnectionPool.java:708)
at org.apache.nifi.remote.client.socket.EndpointConnectionPool.establishSiteToSiteConnection(EndpointConnectionPool.java:682)
at org.apache.nifi.remote.client.socket.EndpointConnectionPool.getEndpointConnection(EndpointConnectionPool.java:300)
at org.apache.nifi.remote.client.socket.SocketClient.createTransaction(SocketClient.java:129)
at org.apache.nifi.spark.NiFiReceiver$ReceiveRunnable.run(NiFiReceiver.java:149)
Using netstat on the core machine I see it does have an open TCP connection to the NiFi box on the site-to-site port (in my case 8090). On the NiFi machine, in the nifi-app.log file, I see logs from the "Site-to-Site Worker Thread" about my core machine making a connection (and nothing about any errors). So the initial connection seems to succeed, but not much happens after that.
When I ran my Spark code locally I was on the NiFi EC2 instance, so I know that in general it works. I'm just hitting something, probably security related, once the client is an EMR cluster.
As a workaround I can post a file to S3 and then launch a Spark step from NiFi (using a Python script), but I'd much rather stream the data (and using Kafka isn't an option). Has anyone else gotten streaming from NiFi to EMR working?
This post is similar: "Getting data from Nifi to spark streaming". The difference is that I have security turned off and I'm using http, not https (and I'm getting connection refused rather than a 401).
Edit:
nifi.properties:
# Site to Site properties
nifi.remote.input.host=
nifi.remote.input.secure=false
nifi.remote.input.socket.host=
nifi.remote.input.socket.port=8090
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec
Bryan Bende had the solution in a comment above: once I set nifi.remote.input.host to the IP address of the current machine, streaming started working.
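For anyone scripting this change, here is a minimal sketch of setting the property in the text of nifi.properties. The helper name `set_property` and the IP address in the usage comment are assumptions for illustration; editing the file by hand works just as well, and NiFi presumably needs a restart to pick up the new value.

```python
def set_property(properties_text, key, value):
    """Return properties_text with `key` set to `value`, appending the key if absent."""
    lines = properties_text.splitlines()
    found = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # leave comment lines untouched
        name, sep, _ = stripped.partition("=")
        if sep and name.strip() == key:
            lines[i] = f"{key}={value}"
            found = True
    if not found:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Hypothetical usage; 10.0.0.5 stands in for the NiFi machine's own IP address:
#   with open("conf/nifi.properties") as f:
#       updated = set_property(f.read(), "nifi.remote.input.host", "10.0.0.5")
```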

Connecting to Remote Spark Cluster for TitanDB SparkGraphComputer

I am attempting to leverage a Hadoop Spark cluster to batch load a graph into Titan using the SparkGraphComputer and BulkLoaderVertexProgram, as specified here. This requires setting the Spark configuration in a properties file, telling Titan where Spark is located, where to read the graph input from, where to store its output, etc.
The problem is that all of the examples seem to specify a local spark cluster through the option:
spark.master=local[*]
I, however, want to run this job on a remote Spark cluster which is on the same VNet as the VM where the titan instance is hosted. From what I have read, it seems that this can be accomplished by setting
spark.master=<spark_master_IP>:7077
This gives me the error that all Spark masters are unresponsive, which prevents me from sending the job to the Spark cluster to distribute the batch loading computations.
For reference, I am using Titan 1.0.0 and a Spark 1.6.4 cluster, which are both hosted on the same VNet. Spark is being managed by YARN, which may also be contributing to this difficulty.
Any sort of help/reference would be appreciated. I am sure that I have the correct IP for the spark master, and that I am using the right gremlin commands to accomplish bulk loading through the SparkGraphComputer. What I am not sure about is how to properly configure the Hadoop properties file in order to get Titan to communicate with a remote Spark cluster over a VNet.

How to run PySpark (possibly in client mode) on Mesosphere cluster?

I am trying to run a PySpark job on a Mesosphere cluster but I cannot seem to get it to run. I understand that Mesos does not support cluster deploy mode for PySpark applications and that it needs to be run in client mode. I believe this is where the problem lies.
When I try submitting a PySpark job I am getting the output below.
... socket.hpp:107] Shutdown failed on fd=48: Transport endpoint is not connected [107]
I believe that a spark job running in client mode needs to connect to the nodes directly and this is being blocked?
What configuration would I need to change to be able to run a PySpark job in client mode?
When running PySpark in client mode (meaning the driver is running where you invoke Python) the driver becomes the Mesos Framework. When this happens, the host the framework is running on needs to be able to connect to all nodes in the cluster, and they need to be able to connect back, meaning no NAT.
If this is indeed the cause of your problems, there are two environment variables that might be useful. If you can get a VPN in place, you can set LIBPROCESS_IP and SPARK_LOCAL_IP both to the IP of the host machine that cluster nodes can use to connect back to the driver.
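As a minimal sketch, the two variables can be set from Python before the Spark context is created. The helper name `configure_driver_ip` and the address in the usage comment are assumptions for illustration; exporting the variables in the shell before launching Python should have the same effect.

```python
import os


def configure_driver_ip(ip, environ=os.environ):
    """Set LIBPROCESS_IP and SPARK_LOCAL_IP so cluster nodes can connect back to the driver."""
    environ["LIBPROCESS_IP"] = ip
    environ["SPARK_LOCAL_IP"] = ip
    return environ


# This must run before the SparkContext (and its JVM) is started, e.g.:
#   configure_driver_ip("10.0.0.5")  # placeholder: an address the cluster nodes can reach
#   from pyspark import SparkContext
#   sc = SparkContext(appName="example")  # master URL supplied via your usual configuration
```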

Cassandra native transport port 9042 slow on EC2 Machine

I have a 5 node Cassandra cluster set up on EC2, all in the same region.
If I connect over cqlsh (9160), queries respond in under a second.
When I connect via Dev Center, or using the native Java Driver, both of which use port 9042, the queries take over 20 seconds to respond.
They consistently respond in the same 21 second region. Never fast and then slow.
I have set up a few Cassandra Clusters on EC2 and have seen this before but do not know how to fix the problem. The last time, I scrapped the cluster and built a new one and the response time on port 9042 was fine.
Any help in how to debug or fix this problem would be appreciated, thanks.
The current version of DevCenter was designed mainly to support running (longish) CQL scripts, rather than an interactive console with queries executed one after another. DevCenter uses the DataStax Java driver for Cassandra as its underlying connector.
For the scenario above, in order to ensure there are no "conflicts", a new Session is created for each execution. When a Session is initialized, the driver performs automatic node discovery, creates connection pools, etc. Basically it does a lot of preparation work. Depending on the latency from your client machine to the EC2 nodes, the size of the cluster, and the configuration of these nodes (see the connection requirements), this initialization phase can be quite expensive.
As you can imagine, the time spent preparing would not represent a large percentage of the time needed to run a DDL script and a decent number of inserts/updates. But in an interactive scenario it results in suboptimal behavior (the one you are describing).
The next version(s) of DevCenter will address the interactive scenario and optimize for it so the user experience would be what you'd expect. Supporting this scenario is pretty high on our list of priorities.
The underlying Java driver obtains the whole cluster topology when it initially connects. This enables it to automatically connect to any node in the cluster. On EC2 it only obtains the private addresses, tries each one, and then times out. It then sends the request over the initial connection.
