Why is the Spark Cassandra Connector so slow in Java code with a Cassandra cluster?

We have run many tests in different scenarios on small data sets.
With a single-node Cassandra installation everything is fine, but against a Cassandra cluster the same function takes about 15 seconds longer.
Our Java code is just like the sample code: it simply calls dataset.collectAsList() or dataset.head(10).
But the same logic written in Scala in spark-shell does not have this problem.
We have tested many JDKs and operating systems. macOS is fine, but both Windows and Linux (e.g. CentOS) show the problem.

The collectAsList and head functions try to call getHostName, which is an expensive operation. So we cannot use an IP address to connect to the Cassandra cluster; we have to use the HOSTNAME to connect, and then it works! The Spark Cassandra Connector code should fix this problem.
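For illustration, here is a minimal Scala sketch of the working setup; the host, keyspace, and table names are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-read")
  .master("local[*]") // local mode, just for testing
  // A resolvable hostname avoids the slow reverse-DNS path hit with bare IPs
  .config("spark.cassandra.connection.host", "cassandra-node1.example.com")
  .getOrCreate()

val ds = spark.read
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "my_keyspace")
  .option("table", "my_table")
  .load()

ds.head(10) // the call that stalled while hostnames were being resolved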

Related

Can we use the repartitionByCassandraReplica functionality of spark-cassandra-connector in a Kubernetes environment?

I am trying to understand how to use the repartitionByCassandraReplica functionality of spark-cassandra-connector in a Kubernetes environment.
My initial thought is that hosting the executor on the same host on which the Cassandra pod is running will solve my problem. Am I right in my thinking?
Data locality can only be achieved with repartitionByCassandraReplica if both the Spark worker/executor and Cassandra JVMs run in the same OS instance (OSI). This applies to physical servers, VMs, containers, pods, etc.
Unless you have a way of running both the Spark and Cassandra image in the same container/pod, it won't be possible to achieve data locality.
For what it's worth, there's an open spark-cassandra-connector ticket to look into how this can be achieved (SPARKC-655). It's just a stub right now and there has not been any work done on it yet. Cheers!
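For reference, a minimal Scala sketch of how repartitionByCassandraReplica is typically used (the keyspace, table, and key class here are hypothetical); the locality benefit only materializes when the executor shares an OS instance with a replica:

import com.datastax.spark.connector._

case class UserKey(id: Int) // must match the partition key of the target table

// An RDD of partition keys to look up in my_keyspace.users
val keys = sc.parallelize(1 to 100).map(UserKey(_))

// Places each key on a Spark partition hosted by a node owning a replica;
// without co-located executors this degrades to an ordinary shuffle.
val localKeys = keys.repartitionByCassandraReplica("my_keyspace", "users", partitionsPerHost = 10)

val rows = localKeys.joinWithCassandraTable("my_keyspace", "users")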

How to run Spark SQL queries against Hadoop without running a Spark job

I develop Spark SQL queries to run against Hadoop. Today I must run a Spark job that invokes my query. Is there another way to do this? I find I spend too much time fixing side issues with running the jobs in Spark. Ideally I want to be able to compose and execute Spark SQL queries directly against Hadoop/HBase and bypass the Spark job altogether. This would permit much more rapid iteration when debugging or trying alternate queries.
Note that my queries are often 100 lines long or more, so working from the command line is challenging.
I have to do this from a Windows workstation.
The best option for HBase is Apache Phoenix. It provides an SQL interface.
As an example, on my last project I used NiFi with Phoenix to read and mutate HBase data. It worked great from the command line. I did discover a bug in my usage of it.
See https://phoenix.apache.org/Phoenix-in-15-minutes-or-less.html. You can use an SQL file. In addition, you can use Hue.
I have never tried the following on Windows, but it should be possible. See https://community.cloudera.com/t5/Community-Articles/How-to-connect-a-Windows-JDBC-client-to-Cluster-enabled-with/ta-p/247787
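To sketch what the Phoenix route looks like from code, here is a hedged Scala example using Phoenix's JDBC driver (the ZooKeeper quorum and table are placeholders):

import java.sql.DriverManager

// Phoenix connection strings take the form jdbc:phoenix:<zookeeper quorum>
Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")
val conn = DriverManager.getConnection("jdbc:phoenix:zk1.example.com:2181")
val rs = conn.createStatement().executeQuery("SELECT id, name FROM MY_TABLE LIMIT 10")
while (rs.next()) {
  println(s"${rs.getLong("id")} ${rs.getString("name")}")
}
conn.close()

For queries that run to 100+ lines, Phoenix's bundled psql.py can also execute a .sql file directly, which avoids the command line entirely.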

YCSB for Cassandra 3.0 Benchmarking

I have a Cassandra virtual cluster on Ubuntu and need to benchmark it.
I am trying to do it with Yahoo's YCSB (without using Maven if possible).
I use Cassandra 3.0.1 but I can't find a suitable version of YCSB.
I don't want to downgrade to an older version of Cassandra (YCSB's latest cassandra binding is for Cassandra 2.x).
What should I do?
As suggested here, although Cassandra 3.x is not officially supported, you can use the cassandra-cql binding.
For instance:
/bin/ycsb load cassandra-cql -threads 4 -P workloads/workloada
I just tested it on Cassandra 3.11.0 and it works for both load and run.
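For completeness, the cassandra-cql binding also needs contact points, and YCSB expects its default ycsb keyspace and usertable table to be created beforehand; a typical load/run pair (the host is an example) looks like:

bin/ycsb load cassandra-cql -p hosts=10.0.0.1 -P workloads/workloada
bin/ycsb run cassandra-cql -p hosts=10.0.0.1 -P workloads/workloada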
That said, which benchmarking software to use depends on your test plan. If you want to benchmark only Cassandra, then #gsteiner's solution might be the best. If you want to benchmark different databases using the same tool to avoid variability, then YCSB is the right one.
I would recommend using cassandra-stress to perform a load/performance test on your Cassandra cluster. It is very customizable, to the point that you can test distributions with different data models and specify how hard you want to push your cluster.
Here is a link to the DataStax documentation, which covers how to use the tool in depth:
https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsCStress_t.html
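As an illustration (the node address and operation counts are placeholders), a simple write-then-read run might look like:

cassandra-stress write n=1000000 -rate threads=50 -node 10.0.0.1
cassandra-stress read n=200000 -rate threads=50 -node 10.0.0.1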

Running Spark on a Cluster of machines

I want to run Spark on four computers, and I have read the theory of running Spark on a cluster using Mesos, YARN, and SSH, but I want a practical method and tutorial for this. The operating systems of these machines are macOS and Ubuntu. I've written my code in IntelliJ IDEA using Scala.
Can anybody help me?
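As a starting point, here is a minimal sketch of Spark's standalone mode using the scripts shipped in Spark's sbin directory (the master host, class, and jar names are placeholders; on older releases the worker script is called start-slave.sh):

# on the machine chosen as the master
$SPARK_HOME/sbin/start-master.sh
# on each worker machine
$SPARK_HOME/sbin/start-worker.sh spark://master-host:7077
# submit the jar built from IntelliJ IDEA
$SPARK_HOME/bin/spark-submit --master spark://master-host:7077 --class com.example.Main myapp.jar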

Enable Spark on Same Node As Cassandra

I am trying to test out Spark so I can summarize some data I have in Cassandra. I've been through all the DataStax tutorials and they are very vague as to how you actually enable Spark. The only indication I can find is that it comes enabled automatically when you select the "Analytics" node type during install. However, I have an existing Cassandra node and I don't want to have to use a different machine for testing, as I am just evaluating everything on my laptop.
Is it possible to just enable Spark on the same node and deal with any performance implications? If so, how can I enable it so that it can be tested?
I see the folders there for Spark (although I'm not positive all the files are present), but when I check whether it's set as the Spark master, it says that no Spark nodes are enabled:
dsetool sparkmaster
I am using Linux Ubuntu Mint.
I'm just looking for a quick and dirty way to get my data averaged and so forth and Spark seems like the way to go since it's a massive amount of data, but I want to avoid having to pay to host multiple machines (at least for now while testing).
Yes. Spark is also able to interact with the cluster even if it is not enabled on all the nodes.
Package install
Edit the /etc/default/dse file, setting the appropriate lines depending on the type of node you want:
...
Spark nodes:
SPARK_ENABLED=1
HADOOP_ENABLED=0
SOLR_ENABLED=0
Then restart the DSE service
http://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/reference/refDseServ.html
Tar Install
Stop DSE on the node and then restart it using the following command from the install directory:
...
Spark only node: $ bin/dse cassandra -k - Starts Spark trackers on a cluster of Analytics nodes.
http://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/reference/refDseStandalone.html
Enable Spark by changing SPARK_ENABLED=1 in the file /usr/share/dse/resources/dse/conf/dse.default, for example using:
sudo nano /usr/share/dse/resources/dse/conf/dse.default
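After enabling it, one way to confirm the change took effect (assuming a package install managed as the dse service) is:

sudo service dse restart
dsetool sparkmaster   # should now print the Spark master address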
