Error opening block StreamChunkId : BlockNotFoundException - apache-spark

I am getting transient exceptions when using Spark Streaming with Amazon Kinesis at storage level "MEMORY_AND_DISK_2". We are using Spark 2.2.0 on emr-5.9.0.
19/05/22 01:56:16 ERROR TransportRequestHandler: Error opening block StreamChunkId{streamId=438690479801, chunkIndex=0} for request from /10.1.100.56:38074
org.apache.spark.storage.BlockNotFoundException: Block broadcast_13287_piece0 not found
I have checked that there are no lost nodes in the EMR cluster, and HDFS utilization is 35%.

Related

Spark job failing with "Fail to know the executor driver is alive or not", "Cannot find endpoint: spark://CoarseGrainedScheduler#<host:port>"

I'm running a job on a local Spark cluster (pyspark). When I run it with a small dataset it works fine, but once it's large, I get an error. I'm wondering 1. How to find logs from the scheduler process that appears to be crashing, and 2. more generally, what might be going on and how to debug the problem. Thanks in advance. Happy to provide more info.
Here's the error (from what I understand to be the driver logs):
block-manager-ask-thread-pool-224 ERROR BlockManagerMasterEndpoint: Fail to know the executor driver is alive or not.
org.apache.spark.SparkException: Exception thrown in awaitResult:
at...
...
<stacktrace>
...
Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://CoarseGrainedScheduler#<host:port>
and then immediately below that
block-manager-ask-thread-pool-224 WARN BlockManagerMasterEndpoint: Error trying to remove shuffle 25. The executor driver may have been lost.
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from <host:port>
What I know about my job... I'm using Pyspark and running Spark standalone, using a local cluster with 72 workers (the machine has 96 cores). Here's my config:
spark:
  master: "local[72]"
  files:
    maxPartitionBytes: 67108864
  sql:
    files:
      maxPartitionBytes: 67108864
  driver:
    memory: "50g"
    maxResultSize: "2g"
    supervise: true
    cores: 72
    log:
      dfsDir: <my/logs/dir>
      persistToDfs:
        enabled: true
  loglevel: "WARN"
  logConf: true
I've set SPARK_LOG_DIR and SPARK_WORKER_LOG_DIR to attempt to see scheduler logs, but I still only see driver (worker?) logs as far as I can tell, with the above error. I'm monitoring memory usage and it doesn't seem like my machine is memory-constrained, but I can't be sure I'm checking at the right moments. The machine has about 1TB of memory and tens of terabytes of free disk space.
Thanks in advance!

Spark-cassandra join: Pool is busy no available connection and the queue has reached its max size 256

I am trying to join a DataFrame against Cassandra using the joinWithCassandraTable function.
With the small dataset in non-prod everything went fine, but in prod, with the much larger data volume and other connections to Cassandra, it throws the exception below.
ERROR [org.apache.spark.executor.Executor] [Executor task launch worker for task 498] - Exception in task 4.0 in stage 8.0 (TID 498)
java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /<host1>:9042
(com.datastax.driver.core.exceptions.BusyPoolException: [/<host3>] Pool is busy (no available connection and the queue has reached its max size 256)), Pool is busy (no available connection and the queue has reached its max size 256)),
The same code worked fine with Cassandra connector 1.6. These issues appeared after we upgraded Spark to 2.1.1 and the Spark Cassandra connector to 2.0.1.
Please let me know if you have faced a similar issue and what the resolution could be.
Code we used:
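// Brings joinWithCassandraTable, AllColumns and SomeColumns into scope
import com.datastax.spark.connector._

ourDF.select("joincolumn")
  .rdd
  .map(row => Tuple1(row.getString(0)))
  .joinWithCassandraTable("key_space", "table", AllColumns, SomeColumns("<join_column_from_cassandra>"))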
Spark Version: 2.1.1
Cassandra connector version: 2.0.1
Regards,
Srini
Tune this parameter in your spark conf.
spark.cassandra.input.reads_per_sec
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#read-tuning-parameters
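As a rough sketch of what that tuning could look like, the property can be set on the SparkConf before the context is created. The value 50 and the application name below are arbitrary placeholders, not recommendations, and the exact property name may differ between connector versions, so check it against the reference linked above for the version you run.

```scala
import org.apache.spark.SparkConf

// Assumption: 50 is only an illustrative starting point. Lower it if the
// BusyPoolException persists, raise it if the join becomes too slow.
val conf = new SparkConf()
  .setAppName("cassandra-join") // hypothetical application name
  .set("spark.cassandra.input.reads_per_sec", "50")
```

The same property can also be passed on the command line with spark-submit --conf spark.cassandra.input.reads_per_sec=50, which avoids recompiling while you experiment with values.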

Spark - Cosmos - connector problems

I am playing around with the Azure Spark-CosmosDB connector, which lets you access CosmosDB nodes directly from a Spark cluster for analytics using Jupyter on HDInsight.
I have been following the steps described here, including uploading the required jars to Azure storage and executing the %%configure magic to prepare the environment.
But it always seems to terminate due to an I/O exception when trying to open the jar (see the YARN log below).
17/10/09 20:10:35 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.IOException: Error accessing /mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_1507534135641_0014/container_1507534135641_0014_01_000001/azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar)
17/10/09 20:10:35 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
17/10/09 20:10:35 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: java.io.IOException: Error accessing /mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_1507534135641_0014/container_1507534135641_0014_01_000001/azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar)
I am not sure whether this is related to the jar not being copied to the worker nodes.
Any ideas? Thanks, Nick

Cassandra Streaming error - Unknown keyspace system_traces

In our dev cluster, which had been running smoothly, replacing a node (which we do constantly) now triggers the following failure and prevents the replacement node from joining.
The Cassandra version is 2.0.7.
What can be done about it?
ERROR [STREAM-IN-/10.128.---.---] 2014-11-19 12:35:58,007 StreamSession.java (line 420) [Stream #9cad81f0-6fe8-11e4-b575-4b49634010a9] Streaming error occurred
java.lang.AssertionError: Unknown keyspace system_traces
at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:260)
at org.apache.cassandra.db.Keyspace.open(Keyspace.java:110)
at org.apache.cassandra.db.Keyspace.open(Keyspace.java:88)
at org.apache.cassandra.streaming.StreamSession.addTransferRanges(StreamSession.java:239)
at org.apache.cassandra.streaming.StreamSession.prepare(StreamSession.java:436)
at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:368)
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:289)
at java.lang.Thread.run(Thread.java:745)
I got the same error while setting up my cluster. I was experimenting with different switches in cassandra.yaml, restarted the service multiple times, and removed the system dir under the data directory (/var/lib/cassandra/data as mentioned here).
My guess is that Cassandra then tries to load the system_traces keyspace (the other dir under /var/lib/cassandra/data) and fails, and nodetool throws this error. You can remove both system and system_traces before starting the Cassandra service, or even better, delete all content of commitlog, data and saved_caches there.
This obviously only works if you don't have any data in the system yet.

cassandra sstable-loader error: "Got an unknow host from describe_ring()"

I am trying to load sstables into a two-node Cassandra cluster with the sstable-loader utility provided in Cassandra 0.8.4.
1) I have loaded the data successfully in a single-node environment.
2) After I created the two-node cluster, loading throws the following exception once gossip completes:
java.lang.RuntimeException: Got an unknow host from describe_ring()
This is a bug in 0.8.4 (https://issues.apache.org/jira/browse/CASSANDRA-3044). It's fixed in 0.8.5; you can test that by following the link on the release thread here.
