Spark 2.x DataFrame write consistency check in Append Mode - apache-spark

I am reading data in Spark inside a for loop, performing joins, and writing the result to a path in append mode.
import org.apache.spark.sql.SaveMode

for (partition <- partitionlist) {
  val df = spark.read.parquet("path")
  val df2 = df.join(anotherdf, df("col1") === anotherdf("col1"))
  df2.write.mode(SaveMode.Append).partitionBy("partitionColumn").format("parquet").save("anotherpath")
}
In the sample code above we are using Spark 2.x. Since the Spark 2 write APIs are not transactionally consistent, is it possible that, in any given iteration, if the write stages/tasks go into retries and only succeed after a few attempts, we end up with duplicated (redundant) data in the output written by that iteration?
EDIT: spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 is being used.
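For reference, here is a minimal sketch (illustrative only; the app name is a placeholder) of how that committer setting is typically applied when building the session, equivalent to passing it via --conf on spark-submit:

import org.apache.spark.sql.SparkSession

// Illustrative: enabling the v2 file output committer on the session
val spark = SparkSession.builder()
  .appName("append-writer")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()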

Related

Does Spark load all data from Kudu when a filter is used?

I am new to Spark. Will the following code load all the data from Kudu, or only the filtered data?
import org.apache.kudu.spark.kudu._

val df: DataFrame = spark.read.options(Map(
  "kudu.master" -> kuduMaster,
  "kudu.table" -> s"impala::platform.${table}")).kudu
val outPutDF = df.filter(row => {
  val recordAt: Long = row.getAs("record_at").toString.toLong
  recordAt >= XXX && recordAt < YYY
})
The easiest way to check whether the filter is pushed down for a given connector is the Spark UI.
The scan nodes in Spark report the number of records read from the data source (you can check this in the Spark UI -> SQL tab after running a query).
Run the query both with and without an explicit predicate (on a small dataset).
Inferences
1. If the number of records in the scan node is the same with and without the predicate, Spark has read the data completely from the data source and the filtering is done in Spark.
2. If the numbers differ, predicate pushdown has been implemented in the data source connector.
3. With this experiment you can also figure out which kinds of predicates are pushed down (this depends on the connector implementation).
Try .explain.
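As a hedged illustration (the column name and bounds reuse the placeholders from the question), rewriting the filter as a Column expression and calling explain lets you inspect the physical plan. Connectors that support pushdown typically list the condition under PushedFilters in the scan node, whereas a lambda filter like the one above is applied to deserialized objects and cannot be pushed down:

import org.apache.spark.sql.functions.col

// Column-based predicate: pushdown is possible for connectors that support it
val filtered = df.filter(col("record_at") >= XXX && col("record_at") < YYY)
filtered.explain(true) // check the scan node / PushedFilters in the physical plan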
Not sure I fully understood the code, but here is an example of some code that works.
val dfX = df.map(row => XYZ(row.getAs[Long]("someVal"))).filter($"someVal" === 2)
But assuming you can get the code to work, Spark "predicate pushdown" will apply in your case, and the filtering will be done by the Kudu storage manager, so not all data is loaded.
This is from the KUDU Guide:
<> and OR predicates are not pushed to Kudu, and instead will be
evaluated by the Spark task. Only LIKE predicates with a suffix
wildcard are pushed to Kudu, meaning that LIKE "FOO%" is pushed down
but LIKE "FOO%BAR" isn’t.
That is to say, your case is OK.

Kafka streaming behaviour for more than one partition

I am consuming from a Kafka topic. This topic has 3 partitions.
I am using foreachRDD to process each batch RDD (using a processData method to process each RDD, and ultimately create a DataSet from it).
Now, you can see that I have a count variable, and I am incrementing this count variable in the processData method to check how many records I have actually processed. (I understand each RDD is a collection of Kafka topic records, and the number depends on the batch interval size.)
Now, the output is something like this:
1 1 1 2 3 2 4 3 5 ....
This makes me think that it's because I might have 3 consumers (as I have 3 partitions), and each of these calls the foreachRDD method separately, so the same count is being printed more than once, as each consumer might have cached its own copy of count.
But the final output DataSet that I get has all the records.
So, does Spark internally union all the data? How does it work out what to union?
I am trying to understand the behaviour so that I can structure my logic around it.
int count = 0;
messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<K, V>>>() {
    @Override
    public void call(JavaRDD<ConsumerRecord<K, V>> rdd) {
        System.out.println("Number of elements in RDD : " + rdd.count());
        List<Row> rows = rdd.map(record -> processData(record))
            .reduce((rows1, rows2) -> {
                rows1.addAll(rows2);
                return rows1;
            });
        StructType schema = DataTypes.createStructType(fields);
        Dataset<Row> ds = ss.createDataFrame(rows, schema);
        ds.createOrReplaceTempView("trades");
        ds.show();
    }
});
The assumptions are not completely accurate.
foreachRDD is one of the so-called output operations in Spark Streaming. The function of output operations is to schedule the provided closure at the interval dictated by the batch interval. The code in that closure executes once per batch interval, on the Spark driver; it is not distributed in the cluster.
In particular, foreachRDD is a general purpose output operation that provides access to the underlying RDD within the DStream. Operations applied on that RDD will execute on the Spark cluster.
So, coming back to the code of the original question, code in the foreachRDD closure such as System.out.println("Number of elements in RDD : " + rdd.count()); executes on the driver. That's also why we can see the output in the console. Note that rdd.count() in this print will trigger a count of the RDD on the cluster, so count is a distributed operation that returns a value to the driver; then, on the driver, the print operation takes place.
Now comes a transformation of the RDD:
rdd.map(record -> processData(record))
As we mentioned, operations applied to the RDD will execute on the cluster, and that execution follows the Spark execution model; that is, transformations are assembled into stages and applied to each partition of the underlying dataset. Given that we are dealing with a Kafka topic that has 3 partitions, we will have 3 corresponding partitions in Spark. Hence, processData is applied once to each partition.
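A minimal way to observe this per-partition execution (sketched in Scala; the variable names are illustrative) is to log the partition id from inside a distributed operation:

import org.apache.spark.TaskContext

// Runs on the executors, once per partition (three times for a 3-partition topic)
rdd.foreachPartition { records =>
  println(s"Processing partition ${TaskContext.getPartitionId()} with ${records.size} records")
}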
So, does Spark internally union all the data? How does it work out what to union?
The same way we have output operations in Spark Streaming, we have actions in Spark. Actions potentially apply an operation to the data and bring the results to the driver. The simplest one is collect, which brings the complete dataset to the driver, with the risk that it might not fit in memory. Another common action, count, returns the total number of records in the dataset as a single number to the driver.
In the code above, we are using reduce, which is also an action: it applies the provided function and brings the resulting data to the driver. It's the use of that action that "internally unions all the data", as expressed in the question. In the reduce expression, we are actually collecting all the data that was distributed into a single local collection. It would be equivalent to doing: rdd.map(record -> processData(record)).collect()
If the intention is to create a Dataset, we should avoid "moving" all the data to the driver first.
A better approach would be:
// Assuming processData returns a List<Row>, flatMap keeps the rows distributed
JavaRDD<Row> rows = rdd.flatMap(record -> processData(record).iterator());
Dataset<Row> df = ss.createDataFrame(rows, schema);
...
In this case, the data of all partitions will remain local to the executor where they are located.
Note that moving data to the driver should be avoided: it is slow, and for large datasets it will probably crash the job, since the driver typically cannot hold all the data available in a cluster.

Use Spark to scan multiple Cassandra tables using spark-cassandra-connector

I have a problem with how to use Spark to manipulate/iterate/scan multiple Cassandra tables. Our project uses Spark and spark-cassandra-connector to connect to Cassandra and scan multiple tables, trying to match related values in the different tables and, if they match, take an extra action such as inserting into another table. The use case is like below:
// Intended logic (this nested use of sc does not actually work, see below)
sc.cassandraTable(KEYSPACE, "table1").foreach( row => {
  val company_url = row.getString("company_url")
  sc.cassandraTable(KEYSPACE, "table2").foreach( row2 => {
    val url = row2.getString("url")
    val value = row2.getString("value")
    if (company_url == url) {
      sc.saveToCassandra(KEYSPACE, "target", SomeColumns(url, value))
    }
  })
})
The problems are:
As a Spark RDD is not serializable, the nested search will fail because sc.cassandraTable returns an RDD. The only workaround I know of is to use sc.broadcast(sometable.collect()). But if sometable is huge, the collect will consume all the memory. Also, if several tables in the use case are broadcast this way, it will drain the memory.
Instead of broadcast, can RDD.persist handle the case? In my case, I would use sc.cassandraTable to read all tables as RDDs and persist them back to disk, then retrieve the data for processing. If it works, how can I guarantee that the RDD is read in chunks?
Other than Spark, is there any other tool (like Hadoop, etc.) which can handle this case gracefully?
It looks like you are actually trying to do a series of inner joins. See the joinWithCassandraTable method.
This allows you to use the elements of one RDD to do a direct query on a Cassandra table. Depending on the fraction of data you are reading from Cassandra, this may be your best bet. If the fraction is too large, though, you are better off reading the two tables separately and then using the RDD.join method to line up the rows.
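A minimal sketch of what that could look like for the tables above (keyspace, table and column names are taken from the question; the assumption that table2 is keyed by url is mine, so treat this as illustrative):

import com.datastax.spark.connector._

// Assumes table2 is keyed by "url", so each company_url from table1 can be
// joined against it directly instead of re-scanning the whole table.
val matches = sc.cassandraTable(KEYSPACE, "table1")
  .map(row => Tuple1(row.getString("company_url")))
  .joinWithCassandraTable(KEYSPACE, "table2")

matches.map { case (_, table2Row) =>
  (table2Row.getString("url"), table2Row.getString("value"))
}.saveToCassandra(KEYSPACE, "target", SomeColumns("url", "value"))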
If all else fails, you can always manually use the CassandraConnector object to directly access the Java driver and do raw requests with that from a distributed context.
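For completeness, a hedged sketch of that fallback (the CQL statement and the assumption that table2 can be queried by url are mine):

import com.datastax.spark.connector.cql.CassandraConnector

val connector = CassandraConnector(sc.getConf)

sc.cassandraTable(KEYSPACE, "table1").foreachPartition { rows =>
  connector.withSessionDo { session =>
    rows.foreach { row =>
      // Raw CQL issued from the executors via the Java driver
      session.execute(s"SELECT value FROM $KEYSPACE.table2 WHERE url = ?", row.getString("company_url"))
    }
  }
}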

FiloDB + Spark Streaming Data Loss

I'm using FiloDB 0.4 with Cassandra 2.2.5 as the column and meta store, and I am trying to insert data into it using Spark Streaming 1.6.1 + Jobserver 0.6.2. I use the following code to insert data:
messages.foreachRDD(parseAndSaveToFiloDb);

private static Function<JavaPairRDD<String, String>, Void> parseAndSaveToFiloDb = initialRdd -> {
    final List<RowWithSchema> parsedMessages = parseMessages(initialRdd.collect());
    final JavaRDD<Row> rdd = javaSparkContext.parallelize(createRows(parsedMessages));
    final DataFrame dataFrame = sqlContext.createDataFrame(rdd, generateSchema(rawMessages));
    dataFrame.write().format("filodb.spark")
        .option("database", keyspace)
        .option("dataset", dataset)
        .option("row_keys", rowKeys)
        .option("partition_keys", partitionKeys)
        .option("segment_key", segmentKey)
        .mode(saveMode).save();
    return null;
};
The segment key is ":string /0", the row key is set to a column which is unique for each row, and the partition key is set to a column which is constant for all rows. In other words, my whole test data set goes to a single segment on a single partition. When I run a single one-node Spark instance everything works fine and all the data gets inserted, but when I run two separate one-node Spark instances (not as a cluster) at the same time, I lose about 30-60% of the data, even if I send messages one by one with an interval of several seconds.
I checked that dataFrame.write() is executed for each message, so the issue happens after this line.
When I set the segment key to a column which is unique for each row, then all data reaches Cassandra/FiloDB.
Please suggest solutions for the scenario with 2 separate Spark instances.
@psyduck, this is most likely because data for each partition can only be ingested on one node at a time -- in the 0.4 version. So to stick with the current version, you would need to partition your data into multiple partitions and then ensure each worker only gets one partition. The easiest way to achieve this is to sort your data by partition key.
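A hedged sketch of that idea in Scala ("partitionColumn" stands in for whichever column is mapped to the FiloDB partition key; this only illustrates arranging the data so each Spark partition holds whole FiloDB partitions):

import org.apache.spark.sql.functions.col

// Group and sort rows by the partition key before the FiloDB write,
// so a given partition key is ingested from a single worker at a time.
val arranged = dataFrame
  .repartition(col("partitionColumn"))
  .sortWithinPartitions(col("partitionColumn"))

arranged.write.format("filodb.spark")
  .option("database", keyspace)
  .option("dataset", dataset)
  .mode(saveMode)
  .save()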
I would highly encourage you to move to the latest version though - master (Spark 2.x / Scala 2.11) or the spark1.6 branch (Spark 1.6 / Scala 2.10). The latest version has many changes that are not in 0.4 and that would solve your problem:
Using Akka Cluster to automatically route your data to the right ingestion node. In this case with the same model your data would all go to the right node and ensure no data loss
TimeUUID-based chunkIDs, so even if multiple workers (say, in a split-brain situation) somehow write to the same partition, data loss is avoided
A new "segment less" data model so you don't need to define any segment keys, more efficient for both reads and writes
Feel free to reach out on our mailing list, https://groups.google.com/forum/#!forum/filodb-discuss

Spark SQL + Streaming issues

We are trying to implement a use case using Spark Streaming and Spark SQL that allows us to run user-defined rules against some data (see below for how the data is captured and used). The idea is to use SQL to specify the rules and return the results as alerts to the users. Executing the query on each incoming event batch seems to be very slow. I would appreciate it if anyone could suggest a better approach to implementing this use case. Also, we would like to know whether Spark executes the SQL on the driver or on the workers. Thanks in advance. Given below are the steps we perform in order to achieve this -
1) Load the initial dataset from an external database as a JDBCRDD
JDBCRDD<SomeState> initialRDD = JDBCRDD.create(...);
2) Create an incoming DStream (that captures updates to the initialized data)
JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
FlumeUtils.createStream(ssc, flumeAgentHost, flumeAgentPort);
JavaDStream<SomeState> incomingDStream = flumeStream.map(...);
3) Create a Pair DStream using the incoming DStream
JavaPairDStream<Object, SomeState> pairDStream =
    incomingDStream.mapToPair(...);
4) Create a Stateful DStream from the pair DStream using the initialized RDD as the base state
JavaPairDStream<Object,SomeState> statefulDStream = pairDStream.updateStateByKey(...);
JavaRDD<SomeState> updatedStateRDD = statefulDStream.map(...);
5) Run a user-driven query against the updated state based on the values in the incoming stream
incomingDStream.foreachRDD(new Function<JavaRDD<SomeState>, Void>() {
    @Override
    public Void call(JavaRDD<SomeState> events) throws Exception {
        updatedStateRDD.count();
        SQLContext sqx = new SQLContext(events.context());
        DataFrame schemaDf = sqx.createDataFrame(updatedStateRDD, SomeState.class);
        schemaDf.registerTempTable("TEMP_TABLE");
        sqx.sql("SELECT col1 FROM TEMP_TABLE WHERE <condition1> AND <condition2> ...");
        // collect the results, process them and send alerts
        ...
        return null;
    }
});
The first step should be to identify which step is taking most of the time.
Please see the Spark Master UI and identify which Step/ Phase is taking most of the time.
There are a few best practices, plus some observations of mine, which you can consider:
Use a singleton SQLContext - see the example at https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala (a minimal sketch of this pattern follows this list).
updateStateByKey can be a memory-intensive operation when there is a large number of keys. You need to check the size of the data processed by the updateStateByKey function and whether it fits well in the given memory.
How is your GC behaving?
Are you really using "initialRDD"? If not, do not load it. If it is a static dataset, cache it.
Check the time taken by your SQL query too.
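As referenced in the first point above, here is a minimal sketch of the singleton pattern used in the linked SqlNetworkWordCount example (the object name here is illustrative; the linked example implements the same lazily-initialized idea):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Lazily instantiated singleton SQLContext, reused across batches instead of
// creating a new context inside every foreachRDD call.
object SQLContextSingleton {
  @transient private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}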
Here are a few more questions/areas which can help you:
What is the StorageLevel of the DStreams?
What are the size and configuration of the cluster?
Which version of Spark?
Lastly - foreachRDD is an output operation which executes the given function on the driver, but that function may trigger RDD actions, and those actions are executed on the worker nodes.
You may need to read this for a better explanation of output operations - http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
I am facing the same issue too - could you please let me know if you have found a solution? I have described the detailed use case in the post below.
Spark SQL + Window + Streaming Issue - Spark SQL query is taking long to execute when running with Spark Streaming
