Spark-cassandra join: Pool is busy no available connection and the queue has reached its max size 256 - apache-spark

I am trying to join a DataFrame with a Cassandra table using the joinWithCassandraTable function.
With a small dataset in non-prod everything went fine, but when we went to prod, with the huge data volume and other connections to Cassandra, it threw the exception below.
ERROR [org.apache.spark.executor.Executor] [Executor task launch worker for task 498] - Exception in task 4.0 in stage 8.0 (TID 498)
java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /<host1>:9042
(com.datastax.driver.core.exceptions.BusyPoolException: [/<host3>] Pool is busy (no available connection and the queue has reached its max size 256)), Pool is busy (no available connection and the queue has reached its max size 256)),
We have the same code on Cassandra connector 1.6 and it worked absolutely fine, but when we upgraded Spark to 2.1.1 and the Spark Cassandra connector to 2.0.1, it started giving these issues.
Please let me know if you have faced a similar issue and what the resolution could be.
Code we used:
import com.datastax.spark.connector._

// For each key from ourDF, joinWithCassandraTable issues a lookup against the Cassandra table.
ourDF.select("joincolumn")
  .rdd
  .map(row => Tuple1(row.getString(0)))
  .joinWithCassandraTable("key_space", "table", AllColumns, SomeColumns("<join_column_from_cassandra>"))
Spark Version: 2.1.1
Cassandra connector version: 2.0.1
Regards,
Srini

Tune this parameter in your Spark conf:
spark.cassandra.input.reads_per_sec
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#read-tuning-parameters
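For example, a minimal sketch (the value and host placeholder are illustrative, not from the original answer) of setting the throttle when building the SparkConf:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative value: caps the rate of read requests issued by
// joinWithCassandraTable so the connection pool does not fill its queue.
val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "<cassandra_host>")
  .set("spark.cassandra.input.reads_per_sec", "2000")
val sc = new SparkContext(conf)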

Related

Spark job failing with "Fail to know the executor driver is alive or not", "Cannot find endpoint: spark://CoarseGrainedScheduler#<host:port>"

I'm running a job on a local Spark cluster (PySpark). When I run it with a small dataset it works fine, but once it's large, I get an error. I'm wondering (1) how to find logs from the scheduler process that appears to be crashing, and (2) more generally, what might be going on and how to debug the problem. Thanks in advance. Happy to provide more info.
Here's the error (from what I understand to be the driver logs):
block-manager-ask-thread-pool-224 ERROR BlockManagerMasterEndpoint: Fail to know the executor driver is alive or not.
org.apache.spark.SparkException: Exception thrown in awaitResult:
at...
...
<stacktrace>
...
Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://CoarseGrainedScheduler#<host:port>
and then immediately below that
block-manager-ask-thread-pool-224 WARN BlockManagerMasterEndpoint: Error trying to remove shuffle 25. The executor driver may have been lost.
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from <host:port>
What I know about my job: I'm using PySpark and running Spark standalone, using a local cluster with 72 workers (the machine has 96 cores). Here's my config:
spark:
  master: "local[72]"
  files:
    maxPartitionBytes: 67108864
  sql:
    files:
      maxPartitionBytes: 67108864
  driver:
    memory: "50g"
    maxResultSize: "2g"
    supervise: true
    cores: 72
  log:
    dfsDir: <my/logs/dir>
    persistToDfs:
      enabled: true
  loglevel: "WARN"
  logConf: true
I've set SPARK_LOG_DIR and SPARK_WORKER_LOG_DIR to try to see scheduler logs, but as far as I can tell I still only see driver (worker?) logs, which contain the above error. I'm monitoring memory usage and it doesn't seem like my machine is memory-constrained, but I can't be sure I'm checking at the right moments. The machine has about 1 TB of memory and tens of terabytes of free disk space.
Thanks in advance!

Error opening block StreamChunkId : BlockNotFoundException

I am getting some transient exceptions using Spark Streaming with Amazon Kinesis and storage level "MEMORY_AND_DISK_2". We are using Spark 2.2.0 with emr-5.9.0.
19/05/22 01:56:16 ERROR TransportRequestHandler: Error opening block StreamChunkId{streamId=438690479801, chunkIndex=0} for request from /10.1.100.56:38074
org.apache.spark.storage.BlockNotFoundException: Block broadcast_13287_piece0 not found
I have checked that there are no lost nodes in the EMR cluster, and HDFS utilization is at 35%.

"FAILED: Execution Error, return code 3" after setting Hive engine from mr to Spark

I am trying to use the Spark engine for my Hive query.
It is an old query, and I don't want to convert the whole code to a Spark job.
But when I run the query, it gives the following error:
Status: Failed
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
The only thing I have changed is the execution engine:
set hive.execution.engine=spark;
The above change works for other similar queries, so I don't think it's a configuration issue... or am I wrong about that?
Has anybody faced this issue before?
Check the logs of the job to see the true error; return codes 1, 2 and 3 are all generic errors in both MR and Spark.
Use beeline's verbose mode to run the query.
Check the query exception logs, HiveServer2 logs, Spark logs and the Spark web UI worker logs (the last often has the exact stack trace).
Try running Spark in local mode.
Which versions of Hive, Spark and Hadoop do you use?
Execute the command below in the Hive client over a HiveServer2 JDBC connection:
set hive.auto.convert.join=false;
It works for me.
The detailed reason is here: https://www.cnblogs.com/CYan521/p/16716361.html

Spark on cluster: I would like to know the meaning of the following error and possible causes:

I have the following errors/warnings:
1) WARN AkkaRpcEndpointRef: Error sending message [message = Heartbeat(2,[Lscala.Tuple2;#58149ee3,BlockManagerId(2, 192.168.0.171, 49714))] in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
2) ERROR CoarseGrainedExecutorBackend: Driver 192.168.0.131:41837 disassociated! Shutting down.
I'm running a Spark (v1.4.0) app on a cluster of 4 machines, where the driver has less memory (4 GB) than the workers (8 GB each). Is it possible that the driver produces the error due to its workload?
The driver was not able to respond to the executors since it was under stress during the computation.
The problem was solved simply by adding more RAM to the driver.
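If it helps, here is a small hedged check (not from the original answer, assuming sc is your SparkContext) to confirm the driver actually picked up the larger heap. Note that spark.driver.memory must be supplied before the driver JVM starts (e.g. spark-submit --driver-memory or spark-defaults.conf), so it is only read back here:

// Read back the configured driver memory and the JVM's actual max heap.
println(sc.getConf.get("spark.driver.memory", "not set"))
println(s"Driver JVM max heap: ${Runtime.getRuntime.maxMemory / (1024 * 1024)} MB")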

Connecting to Cassandra with Spark

First, I bought the new O'Reilly Spark book and tried its Cassandra setup instructions. I've also found other Stack Overflow posts and various guides over the web. None of them work as-is. Below is as far as I could get.
This is a test with only a handful of records of dummy test data. I am running the most recent Cassandra 2.0.7 Virtual Box VM provided by plasetcassandra.org linked from the main Cassandra project page.
I downloaded the Spark 1.2.1 source, got the latest Cassandra Connector code from GitHub, and built both against Scala 2.11. I have JDK 1.8.0_40 and Scala 2.11.6 set up on Mac OS 10.10.2.
I run the spark shell with the cassandra connector loaded:
bin/spark-shell --driver-class-path ../spark-cassandra-connector/spark-cassandra-connector/target/scala-2.11/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar
Then I do what should be a simple row count type test on a test table of four records:
import com.datastax.spark.connector._
sc.stop
val conf = new org.apache.spark.SparkConf(true).set("spark.cassandra.connection.host", "192.168.56.101")
val sc = new org.apache.spark.SparkContext(conf)
val table = sc.cassandraTable("mykeyspace", "playlists")
table.count
I get the following error. What is confusing is that it gets errors trying to find Cassandra at 127.0.0.1, yet it also recognizes the host I configured, which is 192.168.56.101:
15/03/16 15:56:54 INFO Cluster: New Cassandra host /192.168.56.101:9042 added
15/03/16 15:56:54 INFO CassandraConnector: Connected to Cassandra cluster: Cluster on a Stick
15/03/16 15:56:54 ERROR ServerSideTokenRangeSplitter: Failure while fetching splits from Cassandra
java.io.IOException: Failed to open thrift connection to Cassandra at 127.0.0.1:9160
<snip>
java.io.IOException: Failed to fetch splits of TokenRange(0,0,Set(CassandraNode(/127.0.0.1,/127.0.0.1)),None) from all endpoints: CassandraNode(/127.0.0.1,/127.0.0.1)
BTW, I can also use a configuration file at conf/spark-defaults.conf to do the above without having to close/recreate a Spark context or pass in the --driver-class-path argument. I ultimately hit the same error though, and the above steps seem easier to communicate in this post.
Any ideas?
Check the rpc_address config in the cassandra.yaml file on your Cassandra node. The Spark connector is likely using that value from the system.local/system.peers tables, and it may be set to 127.0.0.1 in your cassandra.yaml.
The Spark connector uses Thrift to get token range splits from Cassandra. Eventually I'm betting this will be replaced, as C* 2.1.4 has a new table called system.size_estimates (CASSANDRA-7688). It looks like it's getting the host metadata to find the nearest host and then making the query over Thrift on port 9160.
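As a hedged diagnostic sketch (not part of the original answer, assuming sc is an already-configured SparkContext), you can ask Cassandra what addresses it advertises to drivers; on a single-node VM system.peers may be empty, in which case check rpc_address directly in cassandra.yaml:

import scala.collection.JavaConverters._
import com.datastax.spark.connector.cql.CassandraConnector

// List the peer addresses the connector will discover from system.peers.
CassandraConnector(sc.getConf).withSessionDo { session =>
  session.execute("SELECT peer, rpc_address FROM system.peers").all().asScala.foreach { row =>
    println(s"peer=${row.getInet("peer")} rpc_address=${row.getInet("rpc_address")}")
  }
}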
