Doing flatmap on a function returning RDD - apache-spark

I am trying to process multiple Avro files in the code below. The idea is to first get a list of Avro files, then open each Avro file and generate a stream of (String, Int) tuples, and finally group the stream of tuples by key and sum the ints.
object AvroCopyUtil {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Leads Data Analysis").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val fs = FileSystem.get(new Configuration())
    val avroList = GetAvroList(fs, args(0))
    avroList.flatMap(av =>
      sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](av)
        .map(r => (r._1.datum.get("field").toString, 1)))
      .reduceByKey(_ + _)
      .foreach(println)
  }

  def GetAvroList(fs: FileSystem, input: String): List[String] = {
    // get all children
    val masterList: List[FileStatus] = fs.listStatus(new Path(input)).toList
    val (allFiles, allDirs) = masterList.partition(x => x.isDirectory == false)
    allFiles.map(_.getPath.toString) ::: allDirs.map(_.getPath.toString).flatMap(x => GetAvroList(fs, x))
  }
}
The compile error I get is
[error] found : org.apache.spark.rdd.RDD[(org.apache.avro.mapred.AvroKey[org.apache.avro.generic.GenericRecord], org.apache.hadoop.io.NullWritable)]
[error] required: TraversableOnce[?]
[error] avroRdd.flatMap(av => sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](av))
[error] ^
[error] one error found
Edit: Based on the suggestion below, I tried
val rdd = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
AvroKeyInputFormat[GenericRecord]](avroList.mkString(","))
but I got the error
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 0: 2015-10-15-00-1576041136-flumetracker.foo.com-FooAvroEvent.1444867200044.avro,hdfs:

Your function is unnecessary. You are also attempting to create an RDD within a transformation, which doesn't really make sense. A transformation (in this case, flatMap) runs on top of an RDD, and the records within that RDD are what gets transformed. In the case of flatMap, the expected output of the anonymous function is a TraversableOnce object, which is then flattened into multiple records by the transformation. Looking at your code, though, you don't really need a flatMap, as a simple map will suffice. Keep in mind also that, due to the immutability of RDDs, you must always assign the result of a transformation to a new value.
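As a quick illustration of how flatMap flattens the TraversableOnce results, here is a minimal, self-contained sketch on a plain collection RDD (the names are illustrative only):
val wordsRDD = sc.parallelize(Seq("a b", "c d e"))
// map keeps one output record per input record (here: 2 records, each an Array[String])
val arraysRDD = wordsRDD.map(_.split(" "))
// flatMap flattens each returned collection into individual records (here: 5 String records)
val flatWordsRDD = wordsRDD.flatMap(_.split(" "))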
Try something like:
val avroRDD = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](filePath)
val countsRDD = avroRDD.map(av => (av._1.datum.get("field1").toString, 1)).reduceByKey(_ + _)
It seems as though you may need to take some time to grasp some of Spark's basic framework nuances. I would recommend fully reading the Spark Programming Guide. Lastly, if you want to use Avro, please also check out spark-avro, as much of the boilerplate around working with Avro is taken care of there (and DataFrames may be more intuitive and easier to use for your use case).
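For example, a minimal sketch of the spark-avro route might look like the following (this assumes the com.databricks:spark-avro package is on the classpath; the path and the column name "field1" are placeholders):
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// spark-avro infers the schema from the .avro files under the given path
val avroDF = sqlContext.read.format("com.databricks.spark.avro").load("hdfs:///path/to/avro/dir")
// the same count-by-field aggregation expressed with DataFrames
avroDF.groupBy("field1").count().show()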
(EDIT:)
It seems like you may have misunderstood how to load data to be processed in Spark. The parallelize() method is used to distribute an in-memory collection as an RDD, not to load the data inside files. To do the latter, you only need to provide a comma-separated list of input paths to the newAPIHadoopFile() loader. So, assuming your GetAvroList() function works, you can do:
val avroList = GetAvroList(fs, args(0))
val avroRDD = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](avroList.mkString(","))
val countsRDD = avroRDD.map(av => (av._1.datum.get("field1").toString, 1)).reduceByKey(_ + _)
countsRDD.foreach(println)

Related

Create RDD from RDD entry inside foreach loop

I have some custom logic that looks at elements in an RDD and would like to conditionally write to a TempView via the UNION approach using foreach, as per below:
rddX.foreach { x => {
  // Do something, some custom logic
  ...
  val y = create new RDD from this RDD element x
  ...
  or something else
  // UNION to TempView
  ...
}}
Something really basic that I do not get:
How can I convert the nth entry (x) of the RDD into an RDD of length 1?
Or, convert the nth entry (x) directly to a DF?
I get all the set-based cases, but here, for the sake of simplicity, I want to append as soon as I meet a condition, i.e. at the level of the individual entry in the RDD.
Now, before this gets a -1 like SO 41356419: I am only suggesting this because I have a specific use case, and to mutate a TempView in Spark SQL I do need such an approach - at least that is my thinking. Not a typical Spark use case, but that is what we are / I am facing.
Thanks in advance
First of all - you can't create an RDD or a DataFrame inside a foreach() over another RDD or DataFrame/Dataset. But you can get the nth element from an RDD on the driver and create a new RDD from that single element.
EDIT:
The solution, however, is much simpler:
import org.apache.spark.{SparkConf, SparkContext}

object Main {
  val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val n = 534 // This is the input value (index of the element we're interested in)
    sc.setLogLevel("ERROR")
    // Creating a dummy RDD
    val rdd = sc.parallelize(0 to 999).cache()
    // zipWithIndex pairs each element with its index; the index is the second component of the pair
    val singletonRdd = rdd.zipWithIndex().filter(pair => pair._2 == n)
  }
}
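If you also need that single entry as its own RDD, or as a DataFrame (the second part of the question), a hedged sketch along these lines should work, assuming a SQLContext built on the same SparkContext:
// pull the surviving element back to the driver ...
val nthValue = singletonRdd.map(_._1).first()
// ... and redistribute it as a one-element RDD
val singleElementRdd = sc.parallelize(Seq(nthValue))
// or turn it into a one-row DataFrame
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val singleRowDF = singleElementRdd.toDF("value")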
Hope that helps!

Decoding Java enums/custom non case classes using Structured Spark Streaming

I am trying to use Structured Streaming in Spark 2.1.1 to read from Kafka and decode Avro-encoded messages. I have a UDF defined as per this question.
val sr = new CachedSchemaRegistryClient(conf.kafkaSchemaRegistryUrl, 100)
val deser = new KafkaAvroDeserializer(sr)
val decodeMessage = udf { bytes:Array[Byte] => deser.deserialize("topic.name", bytes).asInstanceOf[DeviceRead] }
val topic = conf.inputTopic
val df = session
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", conf.kafkaServers)
  .option("subscribe", topic)
  .load()
df.printSchema()
val result = df.selectExpr("CAST(key AS STRING)", """decodeMessage($"value") as "value_des"""")
val query = result.writeStream
  .format("console")
  .outputMode(OutputMode.Append())
  .start()
However I get the following failure.
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type DeviceRelayStateEnum is not supported
It fails on this line
val decodeMessage = udf { bytes:Array[Byte] => deser.deserialize("topic.name", bytes).asInstanceOf[DeviceRead] }
An alternate approach was to define encoders for the custom classes I have
implicit val enumEncoder = Encoders.javaSerialization[DeviceRelayStateEnum]
implicit val messageEncoder = Encoders.product[DeviceRead]
but that fails with the following error when the messageEncoder is getting registered.
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for DeviceRelayStateEnum
- option value class: "DeviceRelayStateEnum"
- field (class: "scala.Option", name: "deviceRelayState")
- root class: "DeviceRead"
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:602)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:476)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
When I attempt to do this using a map after the load() I get the following compilation error.
val result = df.map((bytes: Row) => deser.deserialize("topic", bytes.getAs[Array[Byte]]("value")).asInstanceOf[DeviceRead])
Error:(76, 26) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[DeviceRead])org.apache.spark.sql.Dataset[DeviceRead].
Unspecified value parameter evidence$6.
val result = df.map((bytes: Row) => deser.deserialize("topic", bytes.getAs[Array[Byte]]("value")).asInstanceOf[DeviceRead])
Error:(76, 26) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val result = df.map((bytes: Row) => deser.deserialize("topic", bytes.getAs[Array[Byte]]("value")).asInstanceOf[DeviceRead])
Does that essentially mean that I cannot use Structured Streaming with Java enums? And that it can only be used with either primitives or case classes?
I read a few related questions (1, 2, 3) around this, and it seems the possibility of specifying a custom Encoder for a class (i.e. a UDT) was removed in 2.1 and replacement functionality has not been added.
Any help will be appreciated.
I think you may be asking for too much in the current version of Structured Streaming (and Spark SQL) in general.
I have not yet been able to fully understand how to deal with the issue of missing encoders in a more principled way, but you would hit the same issue if you tried to create a Dataset of enums. That simply might not be supported yet.
Structured Streaming is just a streaming library on top of Spark SQL and uses it for serialization-deserialization (SerDe).
To make the story short and to get you going (until you figure out a better way), I'd recommend avoiding enums in the business objects you use to represent the schema of your datasets.
So, I'd recommend doing something along these lines:
val decodeMessage = udf { bytes: Array[Byte] =>
  val dr = deser.deserialize("topic.name", bytes).asInstanceOf[DeviceRead]
  // do additional transformation here so you use a custom streaming-specific class
  // Here I'm using a simple tuple to hold what might be relevant
  // You could create a case class instead to have proper names
  (dr.id, dr.value)
}
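A hedged usage sketch for wiring that UDF into the streaming query: since a Scala udf value is not visible inside a selectExpr string unless it is registered, applying it with select and Column expressions is the more direct route (the "key" and "value" columns come from the Kafka source schema):
import org.apache.spark.sql.functions.col

val result = df.select(
  col("key").cast("string").as("key"),
  decodeMessage(col("value")).as("value_des")
)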

RDD toDF() : Erroneous Behavior

I built a Spark Streaming app that fetches content from a Kafka queue and intends to put the data into a MySQL table after some pre-processing and structuring.
I call foreachRDD on the stream. The issue I'm facing is that there's data loss between what saveAsTextFile writes for the RDD and what the DataFrame's write method with format("csv") produces. I can't seem to pinpoint why this is happening.
val ssc = new StreamingContext(spark.sparkContext, Seconds(60))
ssc.checkpoint("checkpoint")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val stream = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
stream.foreachRDD { rdd =>
  rdd.saveAsTextFile("/Users/jarvis/rdds/" + new SimpleDateFormat("hh-mm-ss-dd-MM-yyyy").format(new Date) + "_rdd")
  import spark.implicits._
  val messagesDF = rdd
    .map(_.split("\t"))
    .map(w => Record(w(0), autoTag(w(1), w(4)), w(2), w(3), w(4),
      w(5).substring(w(5).lastIndexOf("http://")), w(6).split("\n")(0)))
    .toDF("recordTS", "tag", "channel_url", "title", "description", "link", "pub_TS")
  messagesDF.write.format("csv").save(dumpPath + new SimpleDateFormat("hh-mm-ss-dd-MM-yyyy").format(new Date) + "_DF")
}
ssc.start()
ssc.awaitTermination()
There's data loss, i.e. many rows don't make it from the RDD to the DataFrame.
There's also duplication: many rows that do reach the DataFrame are replicated many times.
Found the error. There was a misunderstanding about the format of the ingested data.
The intended data was "\t\t\t..." and hence the row was supposed to be split at "\n".
However, the actual data was:
"\t\t\t...\n\t\t\t...\n"
So the rdd.map(...) pipeline needed an additional split at every "\n".
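A hedged sketch of what that extra split could look like, reusing the column positions, Record and autoTag from the question (the trailing w(6).split("\n")(0) is no longer needed once the records are split up front):
val messagesDF = rdd
  .flatMap(_.split("\n"))   // break each multi-record payload into individual records first
  .map(_.split("\t"))
  .map(w => Record(w(0), autoTag(w(1), w(4)), w(2), w(3), w(4),
    w(5).substring(w(5).lastIndexOf("http://")), w(6)))
  .toDF("recordTS", "tag", "channel_url", "title", "description", "link", "pub_TS")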

Spark: Converting nested JSON DataFrame to Parquet crashes with NPE

Spark 1.5.0. My source data is lots of JSON files containing nested JSON structures, which I want to turn into Parquet. However, this crashes with what appears to be a Spark bug, and I'm struggling to find a way around it.
I start by loading:
val df = sqlContext.read.json("s3://...")
This has a schema of 7 columns, but that's just the top-level. Some of those columns are JSON objects, and if we traverse the entire thing we end up with something like 125 columns. However, Parquet doesn't understand this nesting, so we need to flatten it.
Therefore I do this:
val flat = flattenSchema(df.schema.toArray) // Array[StructField]
val goodfields = flat filter(isgood) // gets rid of problematic columns
I have a function that flattens a row the same way we flattened the schema, and which also filters on the way. So this means that goodfields(x) has the definition for row(x) for all x.
Anyway, we now make an RDD of flattened rows, then turn it into a DataFrame with the flattened schema:
val rows = df.map{ row => Row.fromSeq(smartFlattenRow(row, df.schema, isgood)) }
val newdf = sqlContext.createDataFrame(rows, StructType(goodfields))
So far, everything looks great. Then I try to save to Parquet and it goes kaboom:
newdf.saveAsParquetFile("s3://whatever.parquet")
It runs a little while, then fails with
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.DataFrame.schema(DataFrame.scala:290)
at $line122.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:27)
The line that fails is this one:
def schema: StructType = queryExecution.analyzed.schema
However, newdf.schema works just fine, so it's not my new DataFrame that's causing the problem. What's confusing is that rows is an RDD, so that's not the cause either. And we started out by using df.schema, so it's not that DataFrame, either. So where is this coming from?
Am I somehow building this new structure the wrong way? I based the entire thing on the Spark Programming Guide, so it seems like it should be legit.
Additional info
To provide a little more detail, here is some example data and the supporting functions.
The actual data is sensitive, so I can't share that, but to give you an idea it's basically W3C Activity Streams 2.0 with some custom extensions. Example 6 looks a lot like our real data. Posting the entire thing is too much as it's literally 127 fields.
The code to flattenSchema:
def flattenSchema(fields: Array[StructField]): Array[StructField] = {
  if (fields.isEmpty)
    fields
  else {
    val first = fields.head
    val rest = fields.tail
    first.dataType match {
      case subfields: StructType =>
        flattenSchema(subfields.fields).map { (f: StructField) =>
          StructField(first.name + "." + f.name,
            f.dataType, f.nullable, f.metadata)
        } ++ flattenSchema(rest)
      case _ => first +: flattenSchema(rest)
    }
  }
}
And the code to smartFlattenRow:
def smartFlattenRow(row: Row, schema: StructType,
                    pred: StructField => Boolean): Seq[Any] = {
  val cols = schema.toArray
  (0 until row.size).flatMap { ix =>
    cols(ix).dataType match {
      case subfields: StructType =>
        val subrow =
          if (row.isNullAt(ix))
            Row.fromSeq(Seq.fill(subfields.fields.size)(null))
          else
            row(ix).asInstanceOf[Row]
        smartFlattenRow(subrow, subfields, pred)
      case _ =>
        if (pred(cols(ix))) {
          if (row.isNullAt(ix)) Seq(null) else Seq(row(ix))
        } else
          Seq()
    }
  }
}

NotSerializableException: org.apache.hadoop.io.LongWritable

I know this question has been answered many times, but I have tried everything and have not come to a solution. I have the following code, which raises a NotSerializableException:
val ids : Seq[Long] = ...
ids.foreach { id =>
  sc.sequenceFile("file", classOf[LongWritable], classOf[MyWritable]).lookup(new LongWritable(id))
}
With the following exception
Caused by: java.io.NotSerializableException: org.apache.hadoop.io.LongWritable
Serialization stack:
...
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
When creating the SparkContext, I do
val sparkConfig = new SparkConf().setAppName("...").setMaster("...")
sparkConfig.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConfig.registerKryoClasses(Array(classOf[BitString[_]], classOf[MinimalBitString], classOf[org.apache.hadoop.io.LongWritable]))
sparkConfig.set("spark.kryoserializer.classesToRegister", "org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable")
and looking at the environment tab, I can see these entries. However, I do not understand why:
- the Kryo serializer does not seem to be used (the stack does not mention Kryo)
- LongWritable is not serialized.
I'm using Apache Spark v. 1.5.1
Repeatedly loading the same data inside a loop is extremely inefficient. If you perform actions against the same data, load it once and cache it:
val rdd = sc
  .sequenceFile("file", classOf[LongWritable], classOf[MyWritable])
rdd.cache()
Spark doesn't consider Hadoop Writables to be serializable. There is an open JIRA (SPARK-2421) for this. To handle LongWritables, a simple get should be enough:
rdd.map{case (k, v) => k.get()}
Regarding your custom class, it is your responsibility to deal with this problem.
An effective lookup requires a partitioned RDD; otherwise every partition in your RDD has to be searched.
import org.apache.spark.HashPartitioner
val numPartitions: Int = ???
val partitioned = rdd.partitionBy(new HashPartitioner(numPartitions))
Generally speaking, RDDs are not designed for random access. Even with a defined partitioner, lookup has to linearly search the candidate partition. With 5000 uniformly distributed keys and 10M objects in an RDD, it most likely means a repeated search over the whole RDD. You have a few options to avoid that:
filter
val idsSet = sc.broadcast(ids.toSet)
rdd.filter{case (k, v) => idsSet.value.contains(k)}
join
val idsRdd = sc.parallelize(ids).map((_, null))
idsRdd.join(rdd).map{case (k, (_, v)) => (k, v)}
IndexedRDD - it doesn't look like a particularly active project, though
With 10M entries you'll probably be better off searching locally in memory than using Spark. For larger data you should consider using a proper key-value store.
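For instance, a hedged sketch of the local in-memory option, extracting plain serializable values first so no Writables are collected (v.toString stands in for whatever fields MyWritable actually exposes):
// build a driver-side map once, then look up ids locally
val localMap: Map[Long, String] = rdd
  .map { case (k, v) => (k.get(), v.toString) }
  .collect()
  .toMap
ids.foreach(id => println(localMap.get(id)))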
I'm new to Apache Spark, but I tried to solve your problem; please evaluate whether it helps you. The serialization problem occurs because, for Spark, Hadoop's LongWritable and other Writables are not serializable.
val temp_rdd = sc.parallelize(ids.map(id =>
  sc.sequenceFile("file", classOf[LongWritable], classOf[LongWritable]).toArray.toSeq
)).flatMap(identity)

ids.foreach(id => temp_rdd.lookup(new LongWritable(id)))
Try this solution. It worked fine for me.
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("SparkMapReduceApp");
conf.registerKryoClasses(new Class<?>[]{
    LongWritable.class,
    Text.class
});
