Suppose I have several machines, each with a Spark worker and a Cassandra node installed. Is it possible to require each Spark worker to query only its local Cassandra node (on the same machine), so that no network operations are involved when I do joinWithCassandraTable after repartitionByCassandraReplica using the spark-cassandra-connector, i.e. each Spark worker fetches data from its local storage?
Inside the Spark-Cassandra connector, the LocalNodeFirstLoadBalancingPolicy handles this work. It prefers local nodes first, then checks for nodes in the same DC. Specifically, local nodes are determined using java.net.NetworkInterface to find an address in the host list that matches one in the list of local addresses, as follows:
private val localAddresses =
  NetworkInterface.getNetworkInterfaces.flatMap(_.getInetAddresses).toSet

/** Returns true if given host is local host */
def isLocalHost(host: Host): Boolean = {
  val hostAddress = host.getAddress
  hostAddress.isLoopbackAddress || localAddresses.contains(hostAddress)
}
This logic is used in the creation of the query plan, which returns a list of candidate hosts for the query. Regardless of the plan type (token aware or unaware), the first host in the list is always the local host if it exists.
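For reference, here is a minimal sketch of the pattern the question describes, written in Scala against the spark-cassandra-connector. The keyspace "ks", table "users", the key class, and the SparkContext sc are assumptions for illustration only:

import com.datastax.spark.connector._

// Hypothetical key class; its fields must match the table's partition key columns.
case class UserKey(id: Int)

val keys = sc.parallelize(1 to 1000).map(UserKey(_))

// Move each key to a Spark partition hosted on a replica of that key, then join;
// with LocalNodeFirstLoadBalancingPolicy the reads should hit the co-located node.
val joined = keys
  .repartitionByCassandraReplica("ks", "users")
  .joinWithCassandraTable("ks", "users")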
I made a test account on DataStax Astra (https://astra.datastax.com/) and want to try out Cassandra.
On its homepage there is a cqlsh console. If I select data there, it is very fast, maybe 1 ms.
If I use it with Node.js and the cassandra-driver, it takes 2-3 seconds. And I have only ONE row.
Why does it take so long? Is it my code's fault?
const { Client } = require("cassandra-driver");

async function run() {
  const client = new Client({
    cloud: {
      secureConnectBundle: "secure-connect-weinf.zip",
    },
    keyspace: 'wf_db',
    credentials: {
      username: "admin",
      password: "password",
    },
  });

  await client.connect();

  // Execute a query
  const rs = await client.execute("SELECT * FROM employ_by_id;");
  console.log(`${rs}`);

  await client.shutdown();
}

// Run the async function
run();
Unfortunately, it's not an apples-for-apples comparison.
Every time your app connects to a Cassandra cluster (Astra or otherwise), the driver executes these high-level steps:
1. Unpack the secure bundle to get cluster info
2. Open a TCP connection over the internet
3. Create a control connection to one of the nodes in the cluster
4. Obtain the schema from the cluster using the control connection
5. Discover the topology of the cluster using the control connection
6. Open connections to the nodes in the cluster
7. Compute the query plan (the list of hosts to connect to, based on the load-balancing policy)
8. And finally, run the query
In contrast when you access the CQL Console on the Astra dashboard, the UI automatically connects + authenticates to the cluster and when you type a CQL statement it goes through the following steps:
1. Skipped (you're already authenticated to the cluster)
2. Skipped (it's already connected to a node within the same local VPC)
3. Skipped (already connected to cluster)
4. Skipped (already connected to cluster)
5. Skipped (already connected to cluster)
6. Skipped (already connected to cluster)
7. Skipped (already connected to cluster)
8. And finally, run the query
As you can see, the CQL Console does not have the same overhead as running an app repeatedly which only has 1 CQL statement in it.
In reality, your app will be reusing the same cluster session to execute queries throughout the life of the app, so it doesn't have the same overhead as repeatedly re-running the app you have above. The initialisation phase (steps 1 to 6 above) is only done when the app is started. Once it's already running, it only has to do steps 7 and 8. Cheers!
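To make the point concrete, here is a hedged sketch of that pattern using the DataStax Java driver 4.x from Scala (the same idea applies to the Node.js driver); the bundle path, credentials and keyspace are taken from the question above:

import com.datastax.oss.driver.api.core.CqlSession
import java.nio.file.Paths

// Steps 1-6 happen once, when the session is built at application startup.
val session = CqlSession.builder()
  .withCloudSecureConnectBundle(Paths.get("secure-connect-weinf.zip"))
  .withAuthCredentials("admin", "password")
  .withKeyspace("wf_db")
  .build()

// Every execute() afterwards only does steps 7 and 8: query plan + query.
val rs = session.execute("SELECT * FROM employ_by_id;")
println(rs.all())

// Close only when the application shuts down.
session.close()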
I configured one master on my local PC and a worker node inside VirtualBox, and the result file is created on the worker node instead of being sent back to the master node. I wonder why that is.
Is it because my worker node cannot send the result back to the master node? How can I verify that?
I use Spark 2.2.
I use the same username for the master and the worker node.
I also configured passwordless SSH.
I tried --deploy-mode client and --deploy-mode cluster.
I tried once, then I switched the master/worker nodes and got the same result.
val result = joined.distinct()
result.write.mode("overwrite").format("csv")
  .option("header", "true").option("delimiter", ";")
  .save("file:///home/data/KPI/KpiDensite.csv")
Also, I load the input file like this:
val commerce = spark.read.format("com.databricks.spark.csv")
  .option("header", "true").option("inferSchema", "true").option("delimiter", "|")
  .load("file:///home/data/equip-serv-commerce-infra-2016.csv").distinct()
But why must I place the input file on both the master and the worker node at the same path? I am not using YARN or Mesos right now.
You are exporting to a local file system, which tells Spark to write it on the file system of the machine running the code. On the worker, that will be the file system of the worker machine.
If you want the data to be stored on the file system of the driver (not the master; you'll need to know where the driver is running on your cluster), then you need to collect the RDD or DataFrame and use normal IO code to write the data to a file.
The easiest option, however, is to use a distributed storage system such as HDFS (.save("hdfs://master:port/data/KPI/KpiDensite.csv")) or to export to a database (writing via JDBC or using a NoSQL DB), especially if you're running your application in cluster mode.
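For example, a rough sketch of the collect-and-write approach in Scala (assuming the result is small enough to fit in driver memory; the header string is a placeholder):

import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Bring the rows back to the driver and write them there with plain JVM IO.
val rows = result.collect().map(_.mkString(";")).toSeq
Files.write(Paths.get("/home/data/KPI/KpiDensite.csv"), ("colA;colB" +: rows).asJava)

// Or, with shared storage available, let Spark write in parallel from the executors:
// result.write.mode("overwrite").option("header", "true").option("delimiter", ";")
//   .csv("hdfs://master:port/data/KPI/KpiDensite.csv")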
I have set up Vertica on a cluster of 5 nodes. I am using the code below to write a data frame to a Vertica table:
Map<String, String> opts = new HashMap<>();
opts.put("table", tableName);
opts.put("db", verticaDB);
opts.put("dbschema", dashboardSchema);
opts.put("user", verticaUserName);
opts.put("password", options.verticaPassword);
opts.put("host", verticaHost);
opts.put("hdfs_url", hdfs_url);
opts.put("web_hdfs_url", web_hdfs_url);

String SPARK_VERTICA_SOURCE = "com.vertica.spark.datasource.DefaultSource";

dataFrame.write().format(SPARK_VERTICA_SOURCE).options(opts)
    .mode(saveMode).save();
The above code is working fine, but it connects to a single master node of Vertica.
I tried to pass the host as a connection URL for multiple cluster nodes:
master_node_ip:5433/schema?Connectionloadbalance=1&backupservernode=node2_ip,node3_ip
I am new to Spark. How can I use load balancing to connect to Vertica from Spark?
Thanks in advance.
If you connect to Vertica that way, ConnectionLoadBalance has exactly this effect: you send the connection request to master_node_ip (a strange name, as Vertica has no master node). To put it in a simplified way: the node in the cluster receiving the connection request "asks" all nodes in the cluster which one currently has the lowest load in number of connections. That node will then respond to the connection request, and you will be connected to it.
If you want more than that, your client (Spark in this case) will have to instantiate, for example, as many threads as you have Vertica nodes; each connects to a different Vertica node with ConnectionLoadBalance=False, so that they remain connected exactly where they "wanted" to.
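A rough sketch of that second approach in Scala, using plain Vertica JDBC rather than the Spark connector (the node IPs are placeholders, and verticaUserName/verticaPassword are the question's variables):

import java.sql.DriverManager

// One thread per Vertica node, each pinned to "its" node.
val nodes = Seq("node1_ip", "node2_ip", "node3_ip", "node4_ip", "node5_ip")
val threads = nodes.map { node =>
  new Thread(() => {
    // ConnectionLoadBalance=false keeps the session on the node we asked for.
    val url = s"jdbc:vertica://$node:5433/schema?ConnectionLoadBalance=false"
    val conn = DriverManager.getConnection(url, verticaUserName, verticaPassword)
    try {
      // ... run this thread's share of the work against conn ...
    } finally conn.close()
  })
}
threads.foreach(_.start())
threads.foreach(_.join())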
Hope this helps - Marco
The DataStax C/C++ driver has blacklist filtering functionality as part of its load balancing controls.
https://docs.datastax.com/en/developer/cpp-driver/2.5/topics/configuration/
Correct me if I'm missing something, but my understanding is that a CQL client can't connect to blacklisted hosts.
I'm using the C/C++ driver v2.5 with the code block below, trying to connect to a multi-node cluster:
CassCluster* cluster = cass_cluster_new();
CassSession* session = cass_session_new();
const char* hosts = "192.168.57.101";
cass_cluster_set_contact_points(cluster, hosts);
cass_cluster_set_blacklist_filtering(cluster, hosts);
CassFuture* connect_future = cass_session_connect(session, cluster);
In this code block, the host to which the CQL client is trying to connect is set as blacklisted. However, the CQL client seems to connect to this host and execute queries anyway. Is there something wrong with the code block above? If not, is this the expected behavior? Does it behave differently because it is a multi-node cluster and the driver establishes connections to the other peers?
Any help will be appreciated.
Thank you in advance
Since you are supplying only one contact point, that IP address is being used to establish the control connection into the cluster. Once that control connection is established and the peers table is read to determine other nodes available in the cluster, connections are made to those other nodes. At this point all queries will be routed to those other nodes and not your initial/blacklisted contact point; however the connection to the initial contact point will remain as it is the control connection into the cluster.
To get a better look at what is going on inside the driver you can enable logging in the driver. Here is an example to enable logging via the console:
void on_log(const CassLogMessage* message, void* data) {
  fprintf(stderr, "%u.%03u [%s] (%s:%d:%s): %s\n",
          (unsigned int) (message->time_ms / 1000),
          (unsigned int) (message->time_ms % 1000),
          cass_log_level_string(message->severity),
          message->file, message->line, message->function,
          message->message);
}

/* Log configuration *MUST* be done before any other driver call */
cass_log_set_level(CASS_LOG_TRACE);
cass_log_set_callback(on_log, NULL);
In order to reduce the extra connection on a node that will be blacklisted you can supply a different contact point into the cluster that is not the same as the node (or nodes) that will be blacklisted.
Below is my basic program
public static void main(String[] args) {
    Cluster cluster;
    Session session;

    cluster = Cluster.builder().addContactPoint("192.168.20.131").withPort(9042).build();
    System.out.println("Connection Established");

    cluster.close();
}
Now, I have a 7-node cluster with a Cassandra instance running on all 7 nodes. Assuming that the above-mentioned IP address is my entry point, how does it actually work? Suppose some other user tries to run a program against any other Cassandra node out of the 7; which IP should be entered as the contact point? Or do I have to add all 7 nodes' IP addresses, comma separated, in my main() method?
As it is described here,
The driver discovers the nodes that constitute a cluster by querying the contact points used in building the cluster object. After this it is up to the cluster's load balancing policy to keep track of node events (that is add, down, remove, or up) by its implementation of the Host.StateListener interface.
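So a single contact point is enough for discovery, but it is common to list two or three so the driver can still bootstrap if the first one happens to be down. A minimal sketch with the 3.x driver from Scala (the two extra IPs are hypothetical):

import com.datastax.driver.core.Cluster

// A few contact points for bootstrap redundancy; the remaining nodes are
// discovered automatically from the cluster metadata.
val cluster = Cluster.builder()
  .addContactPoints("192.168.20.131", "192.168.20.132", "192.168.20.133")
  .withPort(9042)
  .build()

val session = cluster.connect()
// ... run queries; the load balancing policy spreads them across the discovered nodes ...
session.close()
cluster.close()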