How to configure a multi-node connection for a Spark DataFrame? - apache-spark

I have set up Vertica on a cluster with 5 nodes. I am using the code below to write a DataFrame to a Vertica table:
Map<String, String> opts = new HashMap<>();
opts.put("table", tableName);
opts.put("db", verticaDB);
opts.put("dbschema", dashboardSchema);
opts.put("user", verticaUserName);
opts.put("password", verticaPassword);
opts.put("host", verticaHost);
opts.put("hdfs_url", hdfsUrl);
opts.put("web_hdfs_url", webHdfsUrl);
String SPARK_VERTICA_SOURCE = "com.vertica.spark.datasource.DefaultSource";
dataFrame.write().format(SPARK_VERTICA_SOURCE).options(opts).mode(saveMode).save();
The above code works fine, but it connects to a single master node of Vertica.
I tried to pass the host as a connection URL for multiple cluster nodes:
master_node_ip:5433/schema?Connectionloadbalance=1&backupservernode=node2_ip,node3_ip
I am new to Spark. How can I use load balancing to connect to Vertica from Spark?
Thanks in advance.

If you connect to Vertica that way, ConnectionLoadBalance has exactly this effect: the connection request is sent to master_node_ip (a strange name, as Vertica has no master node). To put it in a simplified way: the node in the cluster that receives the connect request "asks" all nodes in the cluster which one currently has the lowest load in number of connections. That node then responds to the connection request, and you are connected with that one.
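For example, a single load-balanced connection over JDBC might look like this (a hedged Scala sketch; the URL parameters follow Vertica's JDBC connection properties, and all host names, credentials, and the database name are placeholders):

import java.sql.DriverManager

// ConnectionLoadBalance lets the cluster hand the connection to the
// least-loaded node; BackupServerNode lists fallbacks if the first host is down.
val url = "jdbc:vertica://master_node_ip:5433/verticaDB" +
  "?ConnectionLoadBalance=1&BackupServerNode=node2_ip,node3_ip"
val conn = DriverManager.getConnection(url, "verticaUser", "verticaPassword")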
If you want more than that, your client (Spark in this case) will have to instantiate, for example, as many threads as you have Vertica nodes; each connects to a different Vertica node with ConnectionLoadBalance=false, so that each remains connected exactly where it "wanted" to.
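A minimal Scala sketch of that multi-connection approach over plain JDBC, assuming the Vertica JDBC driver is on the classpath; the node addresses, credentials, and database name are all placeholders:

import java.sql.DriverManager
import java.util.Properties

// Placeholder addresses; in reality, list your five Vertica nodes.
val nodes = Seq("node1_ip", "node2_ip", "node3_ip", "node4_ip", "node5_ip")

val threads = nodes.map { node =>
  new Thread(() => {
    val props = new Properties()
    props.setProperty("user", "verticaUser")
    props.setProperty("password", "verticaPassword")
    // Pin this connection to its node: no load balancing.
    props.setProperty("ConnectionLoadBalance", "false")
    val conn = DriverManager.getConnection(s"jdbc:vertica://$node:5433/verticaDB", props)
    try {
      // ... write this thread's share of the data through conn ...
    } finally {
      conn.close()
    }
  })
}

threads.foreach(_.start())
threads.foreach(_.join())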
Hope this helps - Marco

Related

Azure MySQL server connection with Azure Synapse Spark doesn't work

I'm trying to connect to an Azure MySQL database server to create a table from a DataFrame in Azure Synapse with Spark.
I have this URL and these properties.
All variables like jdbcXYZ are filled with the correct values from the database:
import java.util.Properties
val jdbc_url = s"jdbc:mysql://${jdbcHostname}:${jdbcPort}/database=${jdbcDatabase};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=60;"
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
And I try to write to the database with:
spark.table("tabletemp").write.mode("append").jdbc(jdbc_url, "table", connectionProperties)
I also tried
df.write.format("jdbc").mode("append").option("url", jdbc_url).option("dbtable", jdbcDatabase).option("user", jdbcUsername).option("password", jdbcPassword).save()
And I'm always receiving the same error:
com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server
Do you know how to solve it? Thanks in advance
I am very confident that this is a config issue; you are not passing the correct values for all the properties. A quick scan makes me think that this is not correct:
.option("dbtable", jdbcDatabase).

Why is a test node.js app slow compared to running a query in the Astra CQL console?

I made a test account in DataStax (https://astra.datastax.com/) and want to test Cassandra.
On their homepage is a cqlsh console. If I select data there it is very fast, maybe 1 ms.
If I use it with Node.js and the cassandra-driver it takes 2-3 seconds. And I have only ONE row.
Why does it take so long? Is it my code's fault?
const { Client } = require("cassandra-driver");

async function run() {
  const client = new Client({
    cloud: {
      secureConnectBundle: "secure-connect-weinf.zip",
    },
    keyspace: "wf_db",
    credentials: {
      username: "admin",
      password: "password",
    },
  });

  await client.connect();

  // Execute a query
  const rs = await client.execute("SELECT * FROM employ_by_id;");
  console.log(`${rs}`);

  await client.shutdown();
}

// Run the async function
run();
Unfortunately, it's not an apples-to-apples comparison.
Every time your app connects to a Cassandra cluster (Astra or otherwise), the driver executes these high-level steps:
1. Unpack the secure bundle to get cluster info
2. Open a TCP connection over the internet
3. Create a control connection to one of the nodes in the cluster
4. Obtain schema from the cluster using the control connection
5. Discover the topology of the cluster using the control connection
6. Open connections to the nodes in the cluster
7. Compute the query plan (list of hosts to connect to, based on the load-balancing policy)
8. And finally, run the query
In contrast, when you access the CQL Console on the Astra dashboard, the UI automatically connects and authenticates to the cluster, and when you type a CQL statement it goes through the following steps:
1. Skipped (you're already authenticated to the cluster)
2. Skipped (it's already connected to a node within the same local VPC)
3. Skipped (already connected to cluster)
4. Skipped (already connected to cluster)
5. Skipped (already connected to cluster)
6. Skipped (already connected to cluster)
7. Skipped (already connected to cluster)
8. And finally, run the query
As you can see, the CQL Console does not have the same overhead as repeatedly running an app which only has one CQL statement in it.
In reality, your app will reuse the same cluster session to execute queries throughout the life of the app, so it doesn't have the same overhead as just re-running the app you have above. The initialisation phase (steps 1 to 6 above) is only done when the app is started. Once it's already running, it only has to do steps 7 and 8 (see the sketch below). Cheers!
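To make the reuse concrete, here is a minimal sketch with the DataStax Java driver 4.x in Scala (the Node.js driver behaves the same way); the bundle path, credentials, and keyspace are taken from the question:

import com.datastax.oss.driver.api.core.CqlSession
import java.nio.file.Paths

// Pay the connection cost (steps 1 to 6) once, at application startup.
val session = CqlSession.builder()
  .withCloudSecureConnectBundle(Paths.get("secure-connect-weinf.zip"))
  .withAuthCredentials("admin", "password")
  .withKeyspace("wf_db")
  .build()

// Every subsequent query only pays for steps 7 and 8.
val rs = session.execute("SELECT * FROM employ_by_id")
println(rs.one())

// Close only when the application shuts down.
session.close()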

Cassandra Connection with Groovy Script In SoapUI

Thanks for your time. I am trying to access a remote Cassandra DB in order to complete my assertions. I see that the server is running:
Cassandra V 3.0.8.1293
Driver Type: Cassandra CQL
Datastax Java Driver for Apache Cassandra - Core [3.0.5]
So I am trying to access the DB with the following simple code:
import com.datastax.driver.core.*

Cluster cluster = null;
try {
    cluster = Cluster.builder()
        .addContactPoint("x.x.x.x")
        .withCredentials("xxxxxxx", "xxxxxx")
        .withPort(9042)
        .build()
    Session session = cluster.connect();
    ResultSet rs = session.execute("select * from TABLE");
    Row row = rs.one();
} finally {
    if (cluster != null) cluster.close();
}
When I use cassandra-driver-core-2.0.1.jar I am getting the error:
ERROR:com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /x.x.x.x(null))
I read the documentation and a lot of posts here and on other blogs, and saw that there may be an incompatibility with the driver version, so I tried upgrading the driver to several versions (cassandra-driver-core-2.5, cassandra-driver-core-3, cassandra-driver-core-3.2), but with those I am getting the following:
ERROR:java.lang.ExceptionInInitializerError
I have also tried to connect using JDBC, but to no avail, using the configuration proposed in this thread:
SoapUI JDBC connection with Apache Cassandra
I am actually running out of ideas. Can anyone propose or point to some direction on how to achieve this, either a tutorial or any idea?
Thank you very much.
I think you haven't enabled remote access to Cassandra.
Try enabling remote access using the configuration below:
File path: /etc/cassandra/default.conf/cassandra.yaml
rpc_address: 0.0.0.0
broadcast_rpc_address: <serverIPAddress>
After that, restart the Cassandra service.

How to make workers query only local Cassandra nodes?

Suppose I have several machines, each with a Spark worker and a Cassandra node installed. Is it possible to require each Spark worker to query only its local Cassandra node (on the same machine), so that no network operations are involved when I do joinWithCassandraTable after repartitionByCassandraReplica using the spark-cassandra-connector, and each Spark worker fetches data from its local storage?
Inside the spark-cassandra-connector, the LocalNodeFirstLoadBalancingPolicy handles this work. It prefers local nodes first, then checks for nodes in the same DC. Specifically, local nodes are determined using java.net.NetworkInterface to find an address in the host list that matches one in the list of local addresses, as follows:
private val localAddresses =
  NetworkInterface.getNetworkInterfaces.flatMap(_.getInetAddresses).toSet

/** Returns true if given host is local host */
def isLocalHost(host: Host): Boolean = {
  val hostAddress = host.getAddress
  hostAddress.isLoopbackAddress || localAddresses.contains(hostAddress)
}
This logic is used in the creation of the query plan, which returns a list of candidate hosts for the query. Regardless of the plan type (token aware or unaware), the first host in the list is always the local host if it exists.
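To put that to work from the application side, the usual pattern is the one the question names; a sketch with a hypothetical mykeyspace.employees table whose partition key is id:

import com.datastax.spark.connector._

// Case class matching the table's partition key (hypothetical schema).
case class EmployeeKey(id: Int)

val keys = sc.parallelize(Seq(EmployeeKey(1), EmployeeKey(2), EmployeeKey(3)))

val joined = keys
  .repartitionByCassandraReplica("mykeyspace", "employees") // move each key to a replica node
  .joinWithCassandraTable("mykeyspace", "employees")        // each task then reads from its local node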

Connecting to Cassandra with Spark

First, I have bought the new O'Reilly Spark book and tried those Cassandra setup instructions. I've also found other stackoverflow posts and various posts and guides over the web. None of them work as-is. Below is as far as I could get.
This is a test with only a handful of records of dummy test data. I am running the most recent Cassandra 2.0.7 Virtual Box VM provided by plasetcassandra.org linked from the main Cassandra project page.
I downloaded Spark 1.2.1 source and got the latest Cassandra Connector code from github and built both against Scala 2.11. I have JDK 1.8.0_40 and Scala 2.11.6 setup on Mac OS 10.10.2.
I run the spark shell with the cassandra connector loaded:
bin/spark-shell --driver-class-path ../spark-cassandra-connector/spark-cassandra-connector/target/scala-2.11/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar
Then I do what should be a simple row count type test on a test table of four records:
import com.datastax.spark.connector._
sc.stop
val conf = new org.apache.spark.SparkConf(true).set("spark.cassandra.connection.host", "192.168.56.101")
val sc = new org.apache.spark.SparkContext(conf)
val table = sc.cassandraTable("mykeyspace", "playlists")
table.count
I get the following error. What is confusing is that it is getting errors trying to find Cassandra at 127.0.0.1, even though it also recognizes the host name that I configured, which is 192.168.56.101.
15/03/16 15:56:54 INFO Cluster: New Cassandra host /192.168.56.101:9042 added
15/03/16 15:56:54 INFO CassandraConnector: Connected to Cassandra cluster: Cluster on a Stick
15/03/16 15:56:54 ERROR ServerSideTokenRangeSplitter: Failure while fetching splits from Cassandra
java.io.IOException: Failed to open thrift connection to Cassandra at 127.0.0.1:9160
<snip>
java.io.IOException: Failed to fetch splits of TokenRange(0,0,Set(CassandraNode(/127.0.0.1,/127.0.0.1)),None) from all endpoints: CassandraNode(/127.0.0.1,/127.0.0.1)
BTW, I can also use a configuration file at conf/spark-defaults.conf to do the above without having to close/recreate a Spark context or pass in the --driver-class-path argument. I ultimately hit the same error though, and the above steps seem easier to communicate in this post.
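For reference, the equivalent conf/spark-defaults.conf entries might look like this (a sketch; spark.driver.extraClassPath points at the connector assembly from the shell command above):

spark.cassandra.connection.host  192.168.56.101
spark.driver.extraClassPath      ../spark-cassandra-connector/spark-cassandra-connector/target/scala-2.11/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar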
Any ideas?
Check the rpc_address config in your cassandra.yaml file on your Cassandra node. It's likely that the Spark connector is using that value from the system.local/system.peers tables, and it may be set to 127.0.0.1 in your cassandra.yaml.
The Spark connector uses Thrift to get token range splits from Cassandra. Eventually I'm betting this will be replaced, as C* 2.1.4 has a new table called system.size_estimates (CASSANDRA-7688). It looks like it's getting the host metadata to find the nearest host and then making the query using Thrift on port 9160.
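If that is the cause, a sketch of the fix in cassandra.yaml on the Cassandra node (using the address from the question; adjust for your own network):

rpc_address: 192.168.56.101

Then restart Cassandra so system.local reflects the new address.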
