Remove duplicates from a Spark JavaPairDStream / JavaDStream - apache-spark

I'm building a Spark Streaming application which receives data via a SocketTextStream. The problem is that the sent data contains some duplicates. I would like to remove them on the Spark side (without pre-filtering on the sender side). Can I use JavaPairRDD's distinct function via the DStream's foreach? (I can't find a way to do that.) I need the "filtered" Java(Pair)DStream for later actions...
Thank you!

The .transform() method can be used to do arbitrary operations on each time slice of RDDs. Assuming your data are just strings:
someDStream.transform(new Function<JavaRDD<String>, JavaRDD<String>>() {
    @Override
    public JavaRDD<String> call(JavaRDD<String> rows) throws Exception {
        return rows.distinct();
    }
});
For a JavaPairDStream, the analogous transformToPair() method works the same way and gives you back a deduplicated JavaPairDStream for later actions.

Related

How to get the record count for different DataFrame writes, without an action, using a SparkListener?

I need to know the row count of each DataFrame after writing it, without invoking an additional action.
I know that with a SparkListener we can count records as shown below, but onTaskEnd is called for every completed task. Say I have dataframe1 and dataframe2: onTaskEnd fires for the tasks of both writes, so I need a flag to tell which DataFrame a given call belongs to before incrementing the corresponding counter.
var dataFrame_1_counter = 0L
var dataFrame_2_counter = 0L

sparkSession.sparkContext.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    synchronized {
      if (`isDataFrame1Call`) { // any way to get isDataFrame1Call?
        dataFrame_1_counter += taskEnd.taskMetrics.outputMetrics.recordsWritten
      } else {
        dataFrame_2_counter += taskEnd.taskMetrics.outputMetrics.recordsWritten
      }
    }
  }
})
I need the isDataFrame1Call flag. Is there any way to get it?
This was solved by setting a job group (SparkContext.setJobGroup) for each thread/write in Spark, so the listener can tell the two writes apart.
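For reference, a minimal sketch of what that can look like; the group names df1/df2, the output paths, and the stage-to-group bookkeeping are illustrative assumptions, not the exact code behind the answer:
import scala.collection.concurrent.TrieMap
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerTaskEnd}

// Remember which job group started each stage, so onTaskEnd can attribute its records.
val stageToGroup = TrieMap.empty[Int, String]
var dataFrame_1_counter = 0L
var dataFrame_2_counter = 0L

sparkSession.sparkContext.addSparkListener(new SparkListener() {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // "spark.jobGroup.id" is the property that SparkContext.setJobGroup sets
    val group = jobStart.properties.getProperty("spark.jobGroup.id", "")
    jobStart.stageIds.foreach(id => stageToGroup.put(id, group))
  }
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    val written = taskEnd.taskMetrics.outputMetrics.recordsWritten
    stageToGroup.get(taskEnd.stageId) match {
      case Some("df1") => dataFrame_1_counter += written
      case Some("df2") => dataFrame_2_counter += written
      case _           => // task belongs to some other job
    }
  }
})

// Tag the job group on the thread that triggers each write.
sparkSession.sparkContext.setJobGroup("df1", "write dataframe 1")
dataframe1.write.parquet("/tmp/df1_out")  // hypothetical output path
sparkSession.sparkContext.setJobGroup("df2", "write dataframe 2")
dataframe2.write.parquet("/tmp/df2_out")  // hypothetical output path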

Azure Databricks stream foreach fails with NotSerializableException

I want to continuously process the rows of a streaming Dataset (originally fed by Kafka): based on a condition, I want to update a Redis hash. This is my code snippet (lastContacts is the result of a previous command and is a stream of type org.apache.spark.sql.DataFrame = [serialNumber: string, lastModified: long], which expands to org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]):
class MyStreamProcessor extends ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = {
    true
  }

  override def process(record: Row) = {
    val stringHashRDD = sc.parallelize(Seq(("lastContact", record(1).toString)))
    sc.toRedisHASH(stringHashRDD, record(0).toString)(redisConfig)
  }

  override def close(errorOrNull: Throwable): Unit = {}
}

val query = lastContacts
  .writeStream
  .foreach(new MyStreamProcessor())
  .start()

query.awaitTermination()
I receive a huge stack trace, of which the relevant part (I think) is this: java.io.NotSerializableException: org.apache.spark.sql.streaming.DataStreamWriter
Could anyone explain why this exception occurs and how to avoid it? Thank you!
This question is related to the following two:
DataFrame to RDD[(String, String)] conversion
Call a function with each element a stream in Databricks
The Spark context is not serializable.
Any implementation of ForeachWriter must be serializable because each task will get a fresh serialized-deserialized copy of the provided object. Hence, it is strongly recommended that any initialization for writing data (e.g. opening a connection or starting a transaction) is done after the open(...) method has been called, which signifies that the task is ready to generate data.
In your code, you are trying to use the Spark context inside the process method:
override def process(record: Row) = {
  val stringHashRDD = sc.parallelize(Seq(("lastContact", record(1).toString)))
  sc.toRedisHASH(stringHashRDD, record(0).toString)(redisConfig)  // sc is only usable on the driver, not on executors
}
To send data to Redis, you need to create your own connection, open it in the open method, and then use it in the process method.
Take a look at how a Redis connection pool can be created: https://github.com/RedisLabs/spark-redis/blob/master/src/main/scala/com/redislabs/provider/redis/ConnectionPool.scala
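For illustration, here is a minimal sketch of that pattern using the plain Jedis client; the Redis host, port, and hash layout are assumptions, not something the question specifies:
import org.apache.spark.sql.{ForeachWriter, Row}
import redis.clients.jedis.Jedis

class MyStreamProcessor(redisHost: String, redisPort: Int) extends ForeachWriter[Row] {
  // Created on the executor in open(), after deserialization, so nothing
  // non-serializable travels with the writer itself.
  @transient private var jedis: Jedis = _

  override def open(partitionId: Long, version: Long): Boolean = {
    jedis = new Jedis(redisHost, redisPort)
    true
  }

  override def process(record: Row): Unit = {
    // hash key = serialNumber, field "lastContact" = lastModified
    jedis.hset(record.getString(0), "lastContact", record(1).toString)
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (jedis != null) jedis.close()
  }
}

val query = lastContacts
  .writeStream
  .foreach(new MyStreamProcessor("my-redis-host", 6379))  // hypothetical endpoint
  .start()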

Reading/writing with Avro schemas AND Parquet format in SparkSQL

I'm trying to write and read Parquet files from SparkSQL. For reasons of schema evolution, I would like to use Avro schemas with my writes and reads.
My understanding is that this is possible outside of Spark (or manually within Spark) using e.g. AvroParquetWriter and Avro's Generic API. However, I would like to use SparkSQL's write() and read() methods (which work with DataFrameWriter and DataFrameReader) and which integrate well with SparkSQL (I will be writing and reading Datasets).
I can't for the life of me figure out how to do this, and am wondering if this is possible at all. The only options the SparkSQL parquet format seems to support are "compression" and "mergeSchema" -- i.e. no options for specifying an alternate schema format or alternate schema. In other words, it appears that there is no way to read/write Parquet files using Avro schemas using the SparkSQL API. But perhaps I'm just missing something?
To clarify, I also understand that this will basically just add the Avro schema to the Parquet metadata on write, and will add one more translation layer on read (Parquet format -> Avro schema -> SparkSQL internal format) but will specifically allow me to add default values for missing columns (which Avro schema supports but Parquet schema does not).
Also, I am not looking for a way to convert Avro to Parquet, or Parquet to Avro (rather a way to use them together), and I am not looking for a way to read/write plain Avro within SparkSQL (you can do this using databricks/spark-avro).
I am doing something similar. I use the Avro schema to write into Parquet files; however, I don't read them back as Avro. The same technique should work on the read side as well. I am not sure this is the best way to do it, but here it is anyway:
I have AvroData.avsc, which holds the Avro schema.
val kafkaArr = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder, Tuple2[String, Array[Byte]]](
  ssc, kafkaProps, fromOffsets, messageHandler)

kafkaArr.foreachRDD { (rdd, time) =>
  val schema = SchemaConverters.toSqlType(AvroData.getClassSchema).dataType.asInstanceOf[StructType]
  val ardd = rdd.mapPartitions { itr =>
    itr.map { r =>
      try {
        val cr = avroToListWithAudit(r._2, offsetSaved, loadDate, timeNow.toString)
        Row.fromSeq(cr.toArray)
      } catch {
        case e: Exception =>
          LogHandler.log.error("Exception while converting to Avro" + e.printStackTrace())
          System.exit(-1)
          Row(0) // Only here to satisfy the compiler; on exception the application exits before this point
      }
    }
  }
  // ... write ardd with the Avro-derived schema (see the sketch below)
}
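The snippet above stops at building the Row objects; the Parquet write itself isn't shown. Presumably it looks roughly like the following sketch, assuming a SQLContext named sqlContext is in scope and the output path is your own:
// Inside the foreachRDD block, once ardd and schema are built:
val df = sqlContext.createDataFrame(ardd, schema)
df.write
  .mode("append")
  .parquet("/data/output/avro_backed_parquet")  // hypothetical path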
public static List<Object> avroToListWithAudit(byte[] kfkBytes, String kfkOffset, String loaddate, String loadtime) throws IOException {
    AvroData av = getAvroData(kfkBytes);
    av.setLoaddate(loaddate);
    av.setLoadtime(loadtime);
    av.setKafkaOffset(kfkOffset);
    return avroToList(av);
}

public static List<Object> avroToList(AvroData a) throws UnsupportedEncodingException {
    List<Object> l = new ArrayList<>();
    for (Schema.Field f : a.getSchema().getFields()) {
        String field = f.name().toString();
        Object value = a.get(f.name());
        if (value == null) {
            // System.out.println("Adding null");
            l.add("");
        } else {
            switch (f.schema().getType().getName()) {
                case "union":
                    // System.out.println("Adding union");
                    l.add(value.toString());
                    break;
                default:
                    l.add(value);
                    break;
            }
        }
    }
    return l;
}
The getAvroData method needs to contain the code that constructs the Avro object from the raw bytes. I am also trying to figure out a way to do that without having to call each attribute setter explicitly, but it seems there isn't one.
public static AvroData getAvroData(byte[] kfkBytes) {
    AvroData av = AvroData.newBuilder().build();
    try {
        av.setAttr(String.valueOf("xyz"));
        .....
    }
    .....
}
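As an aside (not part of the original answer): if the Kafka payload is plain Avro binary for the generated AvroData class, Avro's SpecificDatumReader can usually rebuild the record straight from the bytes, with no per-field setters. A Scala sketch, under the assumption that the bytes carry no schema-registry framing:
import org.apache.avro.io.DecoderFactory
import org.apache.avro.specific.SpecificDatumReader

def getAvroData(kfkBytes: Array[Byte]): AvroData = {
  val reader = new SpecificDatumReader[AvroData](AvroData.getClassSchema)
  // Assumes raw Avro binary with no extra header bytes in front.
  val decoder = DecoderFactory.get().binaryDecoder(kfkBytes, null)
  reader.read(null, decoder)
}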
Hope it helps

Spark: broadcasting a HashMap throws no NullPointerException but doesn't fetch any values either

I am broadcasting a HashMap and returning a Map from the method below:
public static Map<Object1, Object2> lkpBC(JavaSparkContext ctx, String FilePath) {
    Broadcast<Map<Object1, Object2>> CodeBC = null;
    Map<Object1, Object2> codePairMap = null;
    try {
        Map<Object1, Object2> CodepairMap = LookupUtil.loadLookup(ctx, FilePath);
        CodeBC = ctx.broadcast(codePairMap);
        codePairMap = CodeBC.value();
    } catch (Exception e) {
        LOG.error("Error while broadcasting ", e);
    }
    return codePairMap;
}
and passing the map to the method below:
public static JavaRDD<Object3> fetchDetails(
        JavaSparkContext ctx,
        JavaRDD<Object3> CleanFileRDD,
        String FilePath,
        Map<Object1, Object2> BcMap
) {
    JavaRDD<Object3> assignCd = CleanFileRDD.map(row -> {
        Object3 FileData = null;
        try {
            FileData = row;
            if (BCMap.containsKey("some key")) {......}
        } catch (Exception e) {
            LOG.error("Error in Map function ", e);
        }
        return some object;
    });
    return assignCd;
}
In local mode this works fine without any issues, but when I run it on a Spark standalone cluster (1 master, 3 slaves) on EC2 it doesn't fetch any values, nor does it throw an error. All the objects you see in these methods are serializable. Does it matter whether I call these methods from the main class or from some other class?
PS: We use the Kryo serializer in the Spark conf.
I think what's going on is you are not accessing the broadcast variable inside the closure of your map function. I think you are directly accessing the underlying BcMap (or BCMap, not sure if they are supposed to be different).
The line if (BCMap.containsKey("some key")) isn't accessing the broadcast variable CodeBC, since the type of BCMap is Map, not Broadcast.
To access the broadcast variable you would call CodeBC.value().containsKey(...).
Spark is designed in a functional way, it doesn't "do" anything to the underlying map, it makes a copy of it, broadcasts the copy, and wraps that copy in a Broadcast type.
I don't know what LookupUtil.loadLookup does, but if the file doesn't exist or is empty, does it return an empty map?
Here is an example of how you would do it in Scala:
val bcMap = ctx.broadcast(LookupUtil.loadLookup(ctx, FilePath))

cleanFileRDD.map { row =>
  if (bcMap.value.containsKey("some key")) ...
  else ...
}
I think you will solve your situation by following the wise words of a friend of mine "first solve all the obvious issues, then the harder issues seem to solve themselves". In your case they are:
Using mutable variables that get initialised to null
Using try catches that log errors but don't re-throw them. Just let exceptions bubble up.
Prematurely splitting things out into lots of different methods before you have it working as just one method.
And just because something works locally doesn't mean it will work when distributed. There are a lot of differences between running something locally and across a cluster, such as: a) data locality, b) serialization, c) closure capture, d) number of threads, e) execution order, etc.

How to insert (not save or update) RDD into Cassandra?

I am working with Apache Spark and Cassandra, and I want to save my RDD to Cassandra with spark-cassandra-connector.
Here's the code:
def saveToCassandra(step: RDD[(String, String, Date, Int, Int)]) = {
  step.saveToCassandra("keyspace", "table")
}
This works fine most of the time, but it overwrites data that is already present in the database. I would like not to overwrite any existing data. Is that somehow possible?
What I do is this:
rdd.foreachPartition(x => connector.withSessionDo(session => {
  someUpdater.UpdateEntries(x, session)
  // or
  x.foreach(y => someUpdater.UpdateEntry(y, session))
}))
The connector above is CassandraConnector(sparkConf).
It's not as nice as a simple saveToCassandra, but it allows for fine-grained control.
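To make the "don't overwrite" part concrete, here is a rough sketch of what an insert-only writer could look like using CQL's IF NOT EXISTS; the keyspace, table, and column names are made-up placeholders, not from the question:
import com.datastax.spark.connector.cql.CassandraConnector

val connector = CassandraConnector(sc.getConf)

def saveToCassandra(step: RDD[(String, String, Date, Int, Int)]) = {
  step.foreachPartition { rows =>
    connector.withSessionDo { session =>
      // IF NOT EXISTS makes the insert a lightweight transaction:
      // rows that already exist are left untouched.
      val stmt = session.prepare(
        "INSERT INTO my_keyspace.my_table (k1, k2, ts, v1, v2) VALUES (?, ?, ?, ?, ?) IF NOT EXISTS")
      rows.foreach { case (k1, k2, ts, v1, v2) =>
        session.execute(stmt.bind(k1, k2, ts, Int.box(v1), Int.box(v2)))
      }
    }
  }
}
Keep in mind that lightweight transactions carry a noticeable performance cost compared to plain inserts.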
I think it's better to use withSessionDo outside the foreachPartition instead; there's overhead involved in that call that need not be repeated.
