Does RDD's .first() method shuffle? - apache-spark

Imagine we have small_table and big_table, and we need to do this:
small_table.join(big_table, "left_outer")
Could it be faster if I do this instead:
small_table.map { row =>
  val find = big_table.filter('id === row.id)
  if (find.isEmpty) Smth(row.id, null)
  else Smth(row.id, find.first().name)
}

If you could access the data of one RDD inside a mapping over another RDD, you could run some performance tests here to see the difference. Unfortunately, the following line:
val find = big_table.filter('id === row.id)
is not possible, because it attempts to access the data of one RDD from inside a transformation of another RDD: RDD transformations can only be defined and driven from the driver, not from within tasks running on the executors.
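A minimal sketch of the supported way to express the same lookup, not part of the original answer, assuming small_table and big_table are RDDs of case classes with id and name fields (name being a String):
// Pair both RDDs by id and use leftOuterJoin: every small_table row is kept,
// and the big_table side is None where no id matches.
val smallById = small_table.map(r => r.id -> r)
val bigNames  = big_table.map(r => r.id -> r.name)

val result = smallById.leftOuterJoin(bigNames).map {
  case (id, (_, maybeName)) => Smth(id, maybeName.orNull)
}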

Related

How to do a record count for different DataFrame writes, without an action, using a SparkListener?

I need to know the record count of a DataFrame after a write, without invoking an additional action.
I know we can calculate it with a Spark listener, as below, but onTaskEnd is called for every completed task. Say I have dataframe1 and dataframe2: onTaskEnd fires for the write tasks of both, so I need a flag to tell the dataframe1 calls apart from the dataframe2 calls and increment the right counter.
var dataFrame_1_counter = 0L
var dataFrame_2_counter = 0L
sparkSession.sparkContext.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
    synchronized {
      if (`isDataFrame1Call`) { // any way for isDataFrame1Call?
        dataFrame_1_counter += taskEnd.taskMetrics.outputMetrics.recordsWritten
      } else {
        dataFrame_2_counter += taskEnd.taskMetrics.outputMetrics.recordsWritten
      }
    }
  }
})
I need an isDataFrame1Call flag. Is there any way to get one?
This was solved by setting a job group for each thread in Spark.
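One way to wire up such a flag, sketched below rather than taken from the thread: tag each write with sparkContext.setJobGroup, remember which stages belong to which group in onJobStart (the group id is exposed in the job properties under "spark.jobGroup.id"), and bucket onTaskEnd by stage id. The group names "df1"/"df2" and the output paths are placeholders.
import org.apache.spark.scheduler._
import scala.collection.concurrent.TrieMap

val stageToGroup = TrieMap.empty[Int, String]
var dataFrame_1_counter = 0L
var dataFrame_2_counter = 0L

sparkSession.sparkContext.addSparkListener(new SparkListener() {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // "spark.jobGroup.id" is set by sparkContext.setJobGroup(...)
    val group = Option(jobStart.properties).map(_.getProperty("spark.jobGroup.id", "")).getOrElse("")
    jobStart.stageIds.foreach(id => stageToGroup(id) = group)
  }
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    val written = taskEnd.taskMetrics.outputMetrics.recordsWritten
    stageToGroup.get(taskEnd.stageId) match {
      case Some("df1") => dataFrame_1_counter += written
      case Some("df2") => dataFrame_2_counter += written
      case _           => // task belongs to some other job
    }
  }
})

// Tag each write before running it:
sparkSession.sparkContext.setJobGroup("df1", "write dataframe 1")
dataframe1.write.parquet("/tmp/out1") // placeholder output
sparkSession.sparkContext.setJobGroup("df2", "write dataframe 2")
dataframe2.write.parquet("/tmp/out2") // placeholder output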

Reading/writing with Avro schemas AND Parquet format in SparkSQL

I'm trying to write and read Parquet files from SparkSQL. For reasons of schema evolution, I would like to use Avro schemas with my writes and reads.
My understanding is that this is possible outside of Spark (or manually within Spark) using e.g. AvroParquetWriter and Avro's Generic API. However, I would like to use SparkSQL's write() and read() methods (which work with DataFrameWriter and DataFrameReader) and which integrate well with SparkSQL (I will be writing and reading Datasets).
I can't for the life of me figure out how to do this, and am wondering if this is possible at all. The only options the SparkSQL parquet format seems to support are "compression" and "mergeSchema" -- i.e. no options for specifying an alternate schema format or alternate schema. In other words, it appears that there is no way to read/write Parquet files using Avro schemas using the SparkSQL API. But perhaps I'm just missing something?
To clarify, I also understand that this will basically just add the Avro schema to the Parquet metadata on write, and will add one more translation layer on read (Parquet format -> Avro schema -> SparkSQL internal format) but will specifically allow me to add default values for missing columns (which Avro schema supports but Parquet schema does not).
Also, I am not looking for a way to convert Avro to Parquet, or Parquet to Avro (rather a way to use them together), and I am not looking for a way to read/write plain Avro within SparkSQL (you can do this using databricks/spark-avro).
I am doing something similar: I use an Avro schema to write into Parquet files. I don't read them back as Avro, but the same technique should work on the read side as well. I am not sure this is the best way to do it, but here it is anyway:
I have AvroData.avsc, which holds the Avro schema.
import com.databricks.spark.avro.SchemaConverters
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaArr = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder,
  Tuple2[String, Array[Byte]]](ssc, kafkaProps, fromOffsets, messageHandler)

kafkaArr.foreachRDD { (rdd, time) =>
  // Convert the Avro schema into a Spark SQL StructType
  val schema = SchemaConverters.toSqlType(AvroData.getClassSchema).dataType.asInstanceOf[StructType]
  val ardd = rdd.mapPartitions { itr =>
    itr.map { r =>
      try {
        val cr = avroToListWithAudit(r._2, offsetSaved, loadDate, timeNow.toString)
        Row.fromSeq(cr.toArray)
      } catch {
        case e: Exception =>
          LogHandler.log.error("Exception while converting to Avro" + e.printStackTrace())
          System.exit(-1)
          Row(0) // Only here to satisfy the compiler; on exception the application exits before this point
      }
    }
  }
}
public static List<Object> avroToListWithAudit(byte[] kfkBytes, String kfkOffset, String loaddate, String loadtime) throws IOException {
    AvroData av = getAvroData(kfkBytes);
    av.setLoaddate(loaddate);
    av.setLoadtime(loadtime);
    av.setKafkaOffset(kfkOffset);
    return avroToList(av);
}

public static List<Object> avroToList(AvroData a) throws UnsupportedEncodingException {
    List<Object> l = new ArrayList<>();
    for (Schema.Field f : a.getSchema().getFields()) {
        Object value = a.get(f.name());
        if (value == null) {
            l.add(""); // replace nulls with empty strings
        } else {
            switch (f.schema().getType().getName()) {
                case "union": // unions are flattened to their string form
                    l.add(value.toString());
                    break;
                default:
                    l.add(value);
                    break;
            }
        }
    }
    return l;
}
The getAvroData method needs code to construct the Avro object from the raw bytes. I am also trying to figure out a way to do that without having to specify each attribute setter explicitly, but it seems there isn't one (a possible alternative is sketched after the snippet below).
public static AvroData getAvroData(byte[] kfkBytes)
{
    AvroData av = AvroData.newBuilder().build();
    try {
        av.setAttr(String.valueOf("xyz"));
        .....
    }
}
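One possible way to avoid the per-attribute setters, sketched here in Scala rather than taken from the answer, assuming the Kafka payload is plain Avro binary written with the same AvroData schema (no extra wire-format header):
import org.apache.avro.io.DecoderFactory
import org.apache.avro.specific.SpecificDatumReader

// Deserialize the specific record straight from the raw bytes.
def getAvroData(kfkBytes: Array[Byte]): AvroData = {
  val reader  = new SpecificDatumReader[AvroData](AvroData.getClassSchema)
  val decoder = DecoderFactory.get().binaryDecoder(kfkBytes, null)
  reader.read(null, decoder)
}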
Hope it helps
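For the read side the question actually asks about, a minimal sketch (reusing the SchemaConverters call from above) is to derive a StructType from the Avro schema and impose it on the Parquet read; note that columns missing from older files then come back as null rather than as the Avro default values:
import com.databricks.spark.avro.SchemaConverters
import org.apache.spark.sql.types.StructType

val sparkSchema = SchemaConverters.toSqlType(AvroData.getClassSchema).dataType.asInstanceOf[StructType]

val df = sqlContext.read
  .schema(sparkSchema)          // columns absent in older files are read as null
  .parquet("/path/to/parquet")  // placeholder path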

Spark: Call a custom method before processing an RDD on each executor

I am working on a Spark Streaming application. I have a requirement where I need to verify a certain condition (by reading a file present on the local FS).
I tried doing:
lines.foreachRDD { rdd =>
  verifyCondition
  rdd.map() ..
}

def verifyCondition() {
  ...
}
But verifyCondition is executed only by the driver. Is there any way to have it executed by each executor?
Thanks
You can move the verifyCondition call inside rdd.map(), like
rdd.map { x =>
  verifyCondition
  ...
}
because the body of map is a closure (a closure is a record storing a function together with an environment); Spark will distribute it to the executors, so it will be executed on each of them.
lines.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    verifyCondition(...) // This will be executed on the executors, once per partition
    partition.map(...)
  }
}
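If the condition only needs to be checked once per executor JVM rather than once per partition, one common pattern (a sketch, not from the original answers; the object name and file path are made up) is to hide the check behind a lazily initialized singleton:
object ExecutorSideCheck {
  // A lazy val in an object is evaluated at most once per JVM, i.e. once per executor.
  lazy val verified: Boolean = new java.io.File("/path/on/executor/fs/flag").exists()
}

lines.foreachRDD { rdd =>
  val processed = rdd.mapPartitions { partition =>
    require(ExecutorSideCheck.verified, "condition not met on this executor")
    partition.map(record => record) // real per-record processing goes here
  }
  processed.count() // an action is still needed to trigger the transformation
}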

How do I pass functions into Spark transformations during scalatest?

I am using FlatSpec to run a test and keep hitting an error because I pass a function into map. I've encountered this problem a few times, but have always worked around it by using an anonymous function. That doesn't seem to be possible in this case. Is there a way of passing functions into Spark transformations from a ScalaTest test?
code:
"test" should "fail" in {
  val expected = sc.parallelize(Array(Array("foo", "bar"), Array("bar", "qux")))
  def validateFoos(firstWord: String): Boolean = {
    if (firstWord == "foo") true else false
  }
  val validated = expected.map(x => validateFoos(x(0)))
  val trues = expected.map(row => true)
  assert(None === RDDComparisons.compareWithOrder(validated, trues))
}
error:
org.apache.spark.SparkException: Task not serializable
*This uses Holden Karau's Spark testing base:
https://github.com/holdenk/spark-testing-base
The "normal" way of handing this is to define the outer class to be serilizable, this is a bad practice in anything except for tests since you don't want to ship a lot of data around.

How to insert (not save or update) RDD into Cassandra?

I am working with Apache Spark and Cassandra, and I want to save my RDD to Cassandra with spark-cassandra-connector.
Here's the code:
def saveToCassandra(step: RDD[(String, String, Date, Int, Int)]) = {
step.saveToCassandra("keyspace", "table")
}
This works fine most of the time, but it overwrites data that is already present in the database. I would like not to overwrite any existing data. Is that somehow possible?
What I do is this:
rdd.foreachPartition(x => connector.withSessionDo(session => {
  someUpdater.UpdateEntries(x, session)
  // or
  x.foreach(y => someUpdater.UpdateEntry(y, session))
}))
The connector above is CassandraConnector(sparkConf).
It's not as nice as a simple saveToCassandra, but it allows fine-grained control.
I think it's better to use withSessionDo outside the foreachPartition instead; there's overhead involved in that call that need not be repeated.
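If the goal is a true insert that never replaces an existing row, one option (a sketch under the assumption that lightweight transactions are acceptable; keyspace, table, and column names are placeholders) is to issue CQL INSERT ... IF NOT EXISTS through the connector:
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import java.util.Date

def insertIfAbsent(step: RDD[(String, String, Date, Int, Int)], conf: SparkConf): Unit = {
  val connector = CassandraConnector(conf)
  step.foreachPartition { rows =>
    connector.withSessionDo { session =>
      // IF NOT EXISTS makes the write a lightweight transaction: slower than
      // saveToCassandra, but existing rows are never overwritten.
      val stmt = session.prepare(
        "INSERT INTO keyspace.table (c1, c2, c3, c4, c5) VALUES (?, ?, ?, ?, ?) IF NOT EXISTS")
      rows.foreach { case (a, b, c, d, e) =>
        session.execute(stmt.bind(a, b, c, Int.box(d), Int.box(e)))
      }
    }
  }
}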
