I'm trying to ingest data (one partition = 1 MB BLOB) from Spark into Cassandra with these conf parameters:
spark.sql.catalog.cassandra.spark.cassandra.output.batch.size.rows 1
spark.sql.catalog.cassandra.spark.cassandra.output.concurrent.writes 100
spark.sql.catalog.cassandra.spark.cassandra.output.batch.grouping.key none
spark.sql.catalog.cassandra.spark.cassandra.output.throughputMBPerSec 1
spark.sql.catalog.cassandra.spark.cassandra.output.consistency.level LOCAL_QUORUM
spark.sql.catalog.cassandra.spark.cassandra.output.metrics false
spark.sql.catalog.cassandra.spark.cassandra.connection.timeoutMS 90000
spark.sql.catalog.cassandra.spark.cassandra.query.retry.count 10
spark.sql.catalog.cassandra com.datastax.spark.connector.datasource.CassandraCatalog
spark.sql.extensions com.datastax.spark.connector.CassandraSparkExtensions
I started with a Spark job using 16 cores in total, and went down to a job using just 1 core.
Either way, every time, after a while, the response is as follows and the driver goes to the failed state:
21/09/19 19:03:50 ERROR QueryExecutor: Failed to execute: com.datastax.spark.connector.writer.RichBoundStatementWrapper#532adef2
com.datastax.oss.driver.api.core.servererrors.WriteTimeoutException: Cassandra timeout during SIMPLE write query at consistency LOCAL_QUORUM (2 replica were required but only 0 acknowledged the write)
It may be related to some nodes being overloaded, but how do I debug this? Which conf should I adjust?
Thanks
Problem solved!
The problem was MY DATA, and NOT Cassandra.
Indeed, a few partitions (2,000 of 60,000,000) were about 50 MB each, instead of the 1 MB I expected.
I simply filtered out the large partitions while ingesting with Spark:
import org.apache.spark.sql.functions.{col, expr, length}
...
spark.read.parquet("...")
  // EXCLUDE LARGE PARTITIONS
  .withColumn("bytes_count", length(col("blob")))
  .filter("bytes_count < " + argSkipPartitionLargerThan)
  // PROJECT
  .select("data_key", "blob")
  // COMMIT
  .writeTo(DS + "." + argTargetKS + "." + argTargetTable).append()
Ingestion with Spark now completes fine, in just 10 minutes (500 GB of data).
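As a side note for anyone debugging a similar case, a quick way to spot oversized BLOBs before writing is to aggregate on the computed bytes_count column. This is only a sketch reusing the names from the snippet above; the 2 MB threshold is an arbitrary placeholder, not a recommendation.

import org.apache.spark.sql.functions.{col, count, length, max, when}

// Placeholder threshold: flag anything larger than 2 MB (the expected size was ~1 MB).
val argSkipPartitionLargerThan = 2 * 1024 * 1024

val sized = spark.read.parquet("...")                  // same source as above
  .withColumn("bytes_count", length(col("blob")))

// Count rows above the threshold and report the largest BLOB seen.
sized
  .agg(
    count(when(col("bytes_count") > argSkipPartitionLargerThan, true)).as("rows_over_threshold"),
    max(col("bytes_count")).as("max_bytes")
  )
  .show()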
Related
So, I have a 16-node cluster where every node has Spark and Cassandra installed, with a replication factor of 3 and spark.sql.shuffle.partitions set to 96. I am using the Spark-Cassandra Connector 3.0.0 to do a repartitionByCassandraReplica followed by a joinWithCassandraTable, and then some SparkML analysis takes place. My question is: what eventually happens to the Spark partitions?
1st scenario
The partitionsPerHost parameter of repartitionByCassandraReplica is the number of selected Cassandra partition keys, which means if I choose 4 partition keys I get 4 partitions per host. That gives me 64 Spark partitions because I have 16 hosts.
2nd scenario
But, according to the Spark-Cassandra Connector documentation, information from the system.size_estimates table should be used to calculate the Spark partitions. For example, from my system.size_estimates:
estimated_table_size = mean_partition_size x number_of_partitions
= (24416287.87/1000000) MB x 332
= 8106.2 MB
spark_partitions = estimated_table_size / input.split.size_in_mb
= 8106.2 MB / 64 MB
= 126.6593 partitions
So, when does the 1st scenario take place and when the 2nd? Am I calculating something wrong? Are there specific cases where the 1st scenario applies and others where the 2nd does?
Those are two completely different paths by which the number of Spark partitions is calculated.
If you're calling repartitionByCassandraReplica(), the number of Spark partitions is determined by both partitionsPerHost and the number of Cassandra nodes in the local DC.
Otherwise, the connector uses input.split.size_in_mb to determine the number of Spark partitions from the estimated table size. Cheers!
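For illustration, here is a minimal sketch of both paths using the connector's RDD API; the keyspace ks, table tbl and key column id are all hypothetical names.

import com.datastax.spark.connector._   // Spark-Cassandra Connector RDD API

// Hypothetical partition-key case class; the field name must match the table's partition key.
case class Key(id: Int)

// Scenario 1: repartitionByCassandraReplica
// Spark partitions = partitionsPerHost x (Cassandra nodes in the local DC),
// e.g. 4 x 16 hosts = 64, independent of the table size.
val joined = sc.parallelize(Seq(Key(1), Key(2), Key(3), Key(4)))
  .repartitionByCassandraReplica("ks", "tbl", partitionsPerHost = 4)
  .joinWithCassandraTable("ks", "tbl")

// Scenario 2: a full table scan
// Spark partitions ≈ estimated_table_size / spark.cassandra.input.split.size_in_mb,
// using the estimate from system.size_estimates (e.g. 8106.2 MB / 64 MB ≈ 127).
val scanned = sc.cassandraTable("ks", "tbl")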
After a few successful data ingestions into Cassandra with Spark,
an error is now returned every time I try to ingest data with Spark (after a few minutes, or instantly):
Caused by: com.datastax.oss.driver.api.core.AllNodesFailedException: Could not reach any contact point, make sure you've provided valid addresses
I checked with plain cqlsh (not Spark), and a similar error is indeed returned for 2 of the 4 nodes:
Connection error: ('Unable to connect to any servers', {'1.2.3.4': error(111, "Tried connecting to [('1.2.3.4', 9042)]. Last error: Connection refused")})
So basically, when I ingest into Cassandra with Spark, some nodes go down at some point, and I have to reboot the node in order to reach it again through cqlsh (and Spark).
What is strange is that nodetool status still reports the node as "UP", while cqlsh reports a connection refused for that same node.
I tried to investigate the logs, but I have a big problem: there is nothing in the logs, not a single exception triggered server-side.
What should I do in my case? Why does a node go down or become unresponsive in this situation, and how can I prevent it?
Thanks
!!! edit !!!
Some of the details asked for, below:
Cassandra infrastructure :
network : 10 gbps
two datacenters : datacenter1 and datacenter2
4 nodes in each datacenter
2 replicas per datacenter :
CREATE KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': '2', 'datacenter2': '2'} AND durable_writes = true;
consistency used for input and output : LOCAL_QUORUM
total physical memory per node : 128GB.
memory allocation per node : 64 GB dedicated to the Cassandra instance, and 64 GB dedicated to the Spark worker (co-located on each Cassandra node)
storage : 4 TB NVME for each node
Spark application config :
total executor cores : 24 cores (4 instances * 6 cores each)
total executor RAM : 48 GB (4 instances * 8 GB each)
cassandra config on spark :
spark.sql.catalog.cassandra.spark.cassandra.output.batch.size.rows 1
spark.sql.catalog.cassandra.spark.cassandra.output.concurrent.writes 100
spark.sql.catalog.cassandra.spark.cassandra.output.batch.grouping.key none
spark.sql.catalog.cassandra.spark.cassandra.output.throughputMBPerSec 80
spark.sql.catalog.cassandra.spark.cassandra.output.consistency.level LOCAL_QUORUM
spark.sql.catalog.cassandra.spark.cassandra.output.metrics false
spark.sql.catalog.cassandra.spark.cassandra.connection.timeoutMS 90000
spark.sql.catalog.cassandra.spark.cassandra.query.retry.count 10
spark.sql.catalog.cassandra com.datastax.spark.connector.datasource.CassandraCatalog
spark.sql.extensions com.datastax.spark.connector.CassandraSparkExtensions
Just curious, but what is the replication factor (RF) of the keyspace, and what consistency level is being used for the write operation?
I'll echo Alex, and say that usually this happens because Spark is writing faster than Cassandra can process. That leaves you with two options:
Increase the size of the cluster to handle the write load.
Throttle-back the write throughput of the Spark job.
One thing worth calling out:
2 replicas per datacenter
consistency used for input and output : LOCAL_QUORUM
So you'll probably get more throughput by dropping the write consistency to LOCAL_ONE.
Remember, quorum == (RF / 2) + 1, which means a LOCAL_QUORUM with 2 replicas per DC requires 2 acknowledgements.
So I do recommend dropping to LOCAL_ONE, because right now Spark is effectively operating at ALL consistency.
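As a rough sketch of both suggestions (throttling the writes and relaxing the consistency), reusing the catalog-prefixed settings from the question; the values are only illustrative, not tuned recommendations.

import org.apache.spark.sql.SparkSession

// Sketch only: illustrative values, not tuned recommendations.
val spark = SparkSession.builder()
  .appName("cassandra-ingest")
  // Throttle the write path so Spark doesn't outrun Cassandra.
  .config("spark.sql.catalog.cassandra.spark.cassandra.output.throughputMBPerSec", "10")
  .config("spark.sql.catalog.cassandra.spark.cassandra.output.concurrent.writes", "20")
  // With RF=2 per DC, LOCAL_QUORUM means both local replicas must ack; LOCAL_ONE relaxes that.
  .config("spark.sql.catalog.cassandra.spark.cassandra.output.consistency.level", "LOCAL_ONE")
  .getOrCreate()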
Which JMX indicators do I need to care about?
I can't remember the exact name of it, but if you can find the metric for disk IOPS or throughput, I wonder if it's hitting a threshold and plateauing.
I am connecting to an Oracle database over JDBC from Spark and trying to read an Oracle table containing 40 million rows. I am using 30 executors, 5 executor cores and 4 GB of memory per executor when launching spark-shell/spark-submit. When reading the count or trying to write the dataframe's data, it uses only one executor to read/write the data from Oracle. I tried repartitioning the dataframe, but it still uses only 1 executor, causing huge performance degradation.
Below is the syntax used; any suggestion is highly appreciated.
Command snippet:
spark-shell --executor-memory 4G --executor-cores 5 --num-executors 30
val source_df = spark.read.format("jdbc")
  .option("url", JDBC_URL)
  .option("dbtable", src_table)
  .option("user", *****)
  .option("password", *****)
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("numPartitions", 40)
  .option("partitionColumn", "*****")
  .option("lowerBound", 1)
  .option("upperBound", 100000)
  .load()

val df_1_msag = source_df.repartition(40)

df_1_msag.count
[Stage 0:=======================================================> (39 + 1
The number of concurrent connections allowed for the user in the Oracle DB is also important.
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
numPartitions
The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.
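For illustration, a hedged sketch of a partitioned JDBC read following the question's snippet; the partition column name and the bounds are placeholders (the bounds would normally span the real MIN/MAX of a numeric key so the partitions are evenly loaded).

// Sketch only; "id_column" and the bounds are placeholders.
// numPartitions caps both the read parallelism and the number of
// concurrent JDBC connections opened against Oracle.
val source_df = spark.read.format("jdbc")
  .option("url", JDBC_URL)
  .option("dbtable", src_table)
  .option("user", "*****")
  .option("password", "*****")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("partitionColumn", "id_column")   // hypothetical numeric key
  .option("lowerBound", "1")
  .option("upperBound", "40000000")         // placeholder: should reflect the real MAX
  .option("numPartitions", "40")
  .load()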
My Hive insert query is failing with the error below:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Data in table2 = 1.7 TB
Query:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set mapreduce.map.memory.mb=15000;
set mapreduce.map.java.opts=-Xmx9000m;
set mapreduce.reduce.memory.mb=15000;
set mapreduce.reduce.java.opts=-Xmx9000m;
set hive.rpc.query.plan=true;
insert into database1.table1 PARTITION(trans_date) select * from database1.table2;
Error info:
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. GC overhead limit exceeded
Cluster info:
total memory : 1.2 TB
total vcores : 288
total nodes : 8
node version : 2.7.0-mapr-1808
Please note:
I am trying to insert the data from table2, which is in Parquet format, into table1, which is in ORC format.
The data size is 1.8 TB in total.
Adding distribute by partition key should solve the problem:
insert into database1.table1 PARTITION(trans_date) select * from database1.table2
distribute by trans_date;
distribute by trans_date will trigger the reducer step, and each reducer will process a single partition; this reduces the pressure on memory. When each process writes many partitions, it keeps too many ORC buffers in memory.
Also consider adding this setting to control how much data each reducer will process:
set hive.exec.reducers.bytes.per.reducer=67108864; --this is example only, reduce the figure to increase parallelism
I have a 3-node Cassandra cluster with 1 seed node, and a Spark cluster with 1 master and 3 worker nodes, each with 8 GB of RAM and 2 cores. Here is the input to my Spark jobs:
spark.cassandra.input.split.size_in_mb 67108864
When I run with this configuration I see that around 768 partitions are created, with around 89.1 MB of data, roughly 1,706,765 records. I am not able to understand why so many partitions are created. I am using Cassandra Spark Connector version 1.4, so the bug regarding input split size is also fixed.
There are only 11 unique partition keys. My partition key has an appname, which is always "test", and a random number, which is always from 0 to 10, so there are only 11 different unique partitions.
Why are there so many partitions, and how does Spark decide how many partitions to create?
The Cassandra connector does not use defaultParallelism. It checks a system table in C* (post 2.1.5) for an estimate of how many MB of data are in the given table. This amount is read and divided by the input split size to determine the number of splits to make.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md#what-does-inputsplitsize_in_mb-use-to-determine-size
If you are on C* < 2.1.5 you will need to manually set the partitioning via a ReadConf.
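A minimal sketch of setting the split size explicitly; the connection host, keyspace and table names are assumptions, and the property value is interpreted in megabytes.

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._   // adds sc.cassandraTable(...)

// Sketch only; host, keyspace and table names are hypothetical.
// The property is in megabytes, so "64" means ~64 MB of estimated data per split:
// splits ≈ estimated_table_size_in_MB / 64, based on system.size_estimates.
val conf = new SparkConf()
  .setAppName("split-size-example")
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cassandra.input.split.size_in_mb", "64")

val sc = new SparkContext(conf)
val rdd = sc.cassandraTable("my_keyspace", "my_table")
println(rdd.partitions.length)   // roughly estimated_size_in_MB / 64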