Reading binaryFile with Spark Streaming

Does anyone know how to set up
streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
to actually consume binary files?
Where can I find all the available InputFormat classes? The documentation gives no
links for that. I imagine that the ValueClass is related to the
InputFormat class somehow.
In the non-streaming version, using the binaryFiles method, I can get a
byte array for each file. Is there a way I can get the same with
Spark Streaming? If not, where can I find those details, meaning the
supported input formats and the value class each one produces? And finally, can
one pick any KeyClass, or are all those elements connected?
Could someone clarify how this method is meant to be used?
EDIT1
I have tried the following:
val bfiles = ssc.fileStream[BytesWritable, BytesWritable, SequenceFileAsBinaryInputFormat]("/xxxxxxxxx/Casalini_streamed")
However, the compiler complains as follows:
[error] /xxxxxxxxx/src/main/scala/EstimatorStreamingApp.scala:14: type arguments [org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat] conform to the bounds of none of the overloaded alternatives of
[error] value fileStream: [K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](directory: String, filter: org.apache.hadoop.fs.Path => Boolean, newFilesOnly: Boolean, conf: org.apache.hadoop.conf.Configuration)(implicit evidence$10: scala.reflect.ClassTag[K], implicit evidence$11: scala.reflect.ClassTag[V], implicit evidence$12: scala.reflect.ClassTag[F])org.apache.spark.streaming.dstream.InputDStream[(K, V)] <and> [K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](directory: String, filter: org.apache.hadoop.fs.Path => Boolean, newFilesOnly: Boolean)(implicit evidence$7: scala.reflect.ClassTag[K], implicit evidence$8: scala.reflect.ClassTag[V], implicit evidence$9: scala.reflect.ClassTag[F])org.apache.spark.streaming.dstream.InputDStream[(K, V)] <and> [K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](directory: String)(implicit evidence$4: scala.reflect.ClassTag[K], implicit evidence$5: scala.reflect.ClassTag[V], implicit evidence$6: scala.reflect.ClassTag[F])org.apache.spark.streaming.dstream.InputDStream[(K, V)]
[error] val bfiles = ssc.fileStream[BytesWritable, BytesWritable, SequenceFileAsBinaryInputFormat]("/xxxxxxxxx/Casalini_streamed")
What am I doing wrong?

The Hadoop documentation describes all of the available input formats, and there are well-documented answers elsewhere about the sequence file format.
You are facing the compilation issue because of an import mismatch:
Hadoop mapred vs. mapreduce.
E.g.
Java
JavaPairInputDStream<BytesWritable, BytesWritable> dstream =
    sc.fileStream("/somepath",
        org.apache.hadoop.io.BytesWritable.class,
        org.apache.hadoop.io.BytesWritable.class,
        org.apache.hadoop.mapreduce.lib.input.SequenceFileAsBinaryInputFormat.class);
I didn't try it in Scala, but it should be something similar (note that the Scala API takes the classes as type parameters rather than classOf arguments):
val dstream = ssc.fileStream[
  org.apache.hadoop.io.BytesWritable,
  org.apache.hadoop.io.BytesWritable,
  org.apache.hadoop.mapreduce.lib.input.SequenceFileAsBinaryInputFormat]("/somepath")

I finally got it to compile.
The compilation problem was in the import. I used
import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat
I replaced it with
import org.apache.hadoop.mapreduce.lib.input.SequenceFileAsBinaryInputFormat
Then it works. However, I have no idea why. I don't understand the difference between the two hierarchies; the two classes seem to have the same content, so it's hard to say. If someone could help clarify that here, I think it would help a lot.
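For what it's worth, the compiler error above already hints at why the import matters: fileStream bounds its InputFormat type parameter by org.apache.hadoop.mapreduce.InputFormat, i.e. Hadoop's "new" API, so the class from the old mapred package cannot satisfy the bound even though it reads the same file format. Below is a minimal sketch of the setup that finally compiled, extended to pull plain byte arrays out of the stream; the StreamingContext construction and the batch interval are assumptions, not part of the original code.
import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.mapreduce.lib.input.SequenceFileAsBinaryInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("EstimatorStreamingApp") // assumed app setup
val ssc = new StreamingContext(sparkConf, Seconds(10))              // assumed batch interval

// Both the key and the value of SequenceFileAsBinaryInputFormat are BytesWritable.
val bfiles = ssc.fileStream[BytesWritable, BytesWritable, SequenceFileAsBinaryInputFormat]("/xxxxxxxxx/Casalini_streamed")

// copyBytes yields a plain Array[Byte] per record, similar to what the
// non-streaming binaryFiles API exposes.
val byteArrays = bfiles.map { case (_, value) => value.copyBytes() }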

Related

Why does RDD work when I don't have the implicit encoder? Why does providing an implicit encoder also fix the issue?

The Spark quickstart example provides some code to count the occurrences of each word that appears in the README.md document:
val textFile = spark.read.textFile("README.md")
textFile.flatMap(line => line.split(" ")).groupByKey(identity).count().collect()
res0: Array[(String, Long)] = Array(([![PySpark,1), (online,1), (graphs,1), (["Building,1), (documentation,3), (command,,2), (abbreviated,1), (overview,1), (rich,1), (set,2), (-DskipTests,1), (1,000,000,000:,2), (name,1), (["Specifying,1), (stream,1), (run:,1), (not,1), (programs,2), (tests,2), (./dev/run-tests,1), (will,1), ([run,1), (particular,2), (Alternatively,,1), (must,1), (using,3), (./build/mvn,1), (you,4), (MLlib,1), (DataFrames,,1), (variable,1), (Note,1), (core,1), (protocols,1), (Guide](https://spark.apache.org/docs/latest/configuration.html),1), (guidance,2), (shell:,2), (can,6), (site,,1), (*,4), (systems.,1), ([building,1), (configure,1), (for,12), (README,1), (Interactive,2), (how,3), ([Configuration,1), (Hive,2), (provides,1), (Hadoop-supporte...
I thought it would be a good exercise to figure out how to modify the code so that I could count the characters instead of the words. I incorrectly assumed I could replace line.split(" ") with toCharArray(). The original produces a list of String; the replacement produces a list of Char. This didn't work:
textFile.flatMap(line => line.toCharArray()).groupByKey(identity).count().collect()
<console>:24: error: Unable to find encoder for type Char. An implicit Encoder[Char] is needed to store Char instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
textFile.flatMap(line => line.toCharArray()).groupByKey(identity).count().collect()
^
<console>:24: error: missing argument list for method identity in object Predef
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `identity _` or `identity(_)` instead of `identity`.
textFile.flatMap(line => line.toCharArray()).groupByKey(identity).count().collect()
After searching Stack Overflow for the error message I found "Unable to find Encoder[Char] while using flatMap with toCharArray in spark", which says you need to use an RDD. As a first workaround, I instead modified the code above to split each line into single-character strings:
textFile.flatMap(line => line.split("")).groupByKey(identity).count().collect()
res6: Array[(String, Long)] = Array((K,1), (7,3), (l,156), (x,12), (=,5), (<,2), (],19), (g,96), (3,3), (F,7), (Q,2), (*,4), (0,37), (m,75), (!,4), (E,9), (T,14), (f,47), (B,6), ((,22), (n,225), (k,58), (.,81), (_,4), (Y,6), (L,4), (M,8), (V,3), (U,2), (v,43), (e,316), (D,8), (O,2), (o,246), (h,126), (z,1), (C,5), (p,160), (d,108), (J,2), (-,35), (A,18), (/,109), (N,5), (X,1), (y,38), (w,23), (),22), (c,116), (S,35), (u,119), (:,29), (i,215), (R,9), (G,3), (",12), (1,8), (q,2), (j,8), (#,22), (%,1), (`,6), (b,59), (I,6), (&,1), (P,14), (,,32), (a,299), (r,230), ("",41), (" ",462), (?,3), (t,264), (>,6), (2,2), (H,10), (s,228), ([,19))
The issue is summarized by a commenter on the linked Stack Overflow question:
char is not a default Spark datatype and so it cannot be encoded.
An alternative solution, suggested in one of the answers there, is to use the RDD API:
textFile.rdd.flatMap(line => line.toCharArray().map(c=>(c,1))).reduceByKey(_+_).collect()
res7: Array[(Char, Int)] = Array((w,23), (",12), (`,6), (Q,2), (e,316), (G,3), (7,3), (R,9), (B,6), (P,14), (O,2), (b,59), (y,38), (A,18), (#,22), (2,2), (h,126), (o,246), (i,215), (K,1), (3,3), (%,1), (k,58), (n,225), (-,35), (j,8), (J,2), (?,3), (H,10), (S,35), (F,7), (Y,6), (&,1), (1,8), (g,96), (N,5), (l,156), (m,75), (c,116), (T,14), (d,108), (),22), (=,5), (z,1), (s,228), (/,109), (L,4), (x,12), (p,160), (M,8), (a,299), (_,4), (t,264), (.,81), (0,37), (u,119), (I,6), ( ,462), (>,6), (],19), (!,4), (*,4), (f,47), (q,2), (v,43), ((,22), (C,5), (E,9), (U,2), (:,29), (,,32), (V,3), (<,2), ([,19), (X,1), (r,230), (D,8))
Given my experience above, I clearly have a fundamental misunderstanding. I have two questions to help me understand Spark better:
Why does RDD work with a non-primitive type?
Why does providing an implicit encoder also fix the issue? How can I provide an implicit Encoder[Char], so that I don't need RDD?
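Not part of the original thread, but one sketch of a workaround that stays on the Dataset API without needing an Encoder[Char]: convert each character to a one-character String, which the encoders from spark.implicits._ already handle (this is essentially the split("") variant above, written via toCharArray).
import spark.implicits._

// Each Char becomes a one-character String, so the built-in string encoder applies.
val charCounts = textFile
  .flatMap(line => line.toCharArray.map(_.toString))
  .groupByKey(identity)
  .count()
  .collect()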

Merging duplicate columns in seq json hdfs files in spark

I am reading a sequence file of JSON records from HDFS using Spark like this:
val data = spark.read.json(
  spark.sparkContext
    .sequenceFile[String, String]("/prod/data/class1/20190114/2019011413/class2/part-*")
    .map { case (x, y) => y.toString })
data.registerTempTable("data")
val filteredData = data.filter("sourceInfo='Web'")
val explodedData = filteredData.withColumn("A", explode(filteredData("payload.adCsm.vfrd")))
val explodedDataDbg = explodedData.withColumn("B", explode(filteredData("payload.adCsm.dbg"))).drop("payload")
On which I am getting this error:
org.apache.spark.sql.AnalysisException:
Ambiguous reference to fields StructField(adCsm,ArrayType(StructType(StructField(atfComp,StringType,true), StructField(csmTot,StringType,true), StructField(dbc,ArrayType(LongType,true),true), StructField(dbcx,LongType,true), StructField(dbg,StringType,true), StructField(dbv,LongType,true), StructField(fv,LongType,true), StructField(hdr,LongType,true), StructField(hidden,StructType(StructField(duration,LongType,true), StructField(stime,StringType,true)),true), StructField(hvrx,DoubleType,true), StructField(hvry,DoubleType,true), StructField(inf,StringType,true), StructField(isP,LongType,true), StructField(ltav,StringType,true), StructField(ltdb,StringType,true), StructField(ltdm,StringType,true), StructField(lteu,StringType,true), StructField(ltfm,StringType,true), StructField(ltfs,StringType,true), StructField(lths,StringType,true), StructField(ltpm,StringType,true), StructField(ltpq,StringType,true), StructField(ltts,StringType,true), StructField(ltut,StringType,true), StructField(ltvd,StringType,true), StructField(ltvv,StringType,true), StructField(msg,StringType,true), StructField(nl,LongType,true), StructField(prerender,StructType(StructField(duration,LongType,true), StructField(stime,LongType,true)),true), StructField(pt,StringType,true), StructField(src,StringType,true), StructField(states,StringType,true), StructField(tdr,StringType,true), StructField(tld,StringType,true), StructField(trusted,BooleanType,true), StructField(tsc,LongType,true), StructField(tsd,DoubleType,true), StructField(tsz,DoubleType,true), StructField(type,StringType,true), StructField(unloaded,StructType(StructField(duration,LongType,true), StructField(stime,LongType,true)),true), StructField(vdr,StringType,true), StructField(vfrd,LongType,true), StructField(visible,StructType(StructField(duration,LongType,true), StructField(stime,StringType,true)),true), StructField(xpath,StringType,true)),true),true), StructField(adcsm,ArrayType(StructType(StructField(tdr,DoubleType,true), StructField(vdr,DoubleType,true)),true),true);
Not sure how, but ONLY SOMETIMES there are two structs with the same name "adCsm" inside "payload". Since I am interested in fields present in one of them, I need to deal with this ambiguity.
I know one way is to check for fields A and B and drop the column if they are absent, and hence choose the other adCsm. I was wondering if there is a better way to handle this. Can I perhaps just merge the duplicate columns (which hold different data) instead of doing this explicit filtering?
Not sure how duplicate structs are even present in a seq "json" file
TIA!
I think the ambiguity happens because of a case sensitivity issue in the DataFrame column names. In the last part of the schema I see
StructField(adcsm,
ArrayType(StructType(
StructField(tdr,DoubleType,true),
StructField(vdr,DoubleType,true)),true),true)
So there are two StructFields whose names differ only in case (adCsm and adcsm) inside the same StructType.
First enable case sensitivity in Spark SQL with
sqlContext.sql("set spark.sql.caseSensitive=true")
and then it will treat the two fields as distinct. Hopefully this helps.
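For completeness, a small sketch of how the original pipeline might look with case sensitivity switched on (the column names come from the question; whether adCsm or adcsm is the struct you actually want is an assumption you would need to verify against your data):
import org.apache.spark.sql.functions.{col, explode}

// With case sensitivity enabled, payload.adCsm and payload.adcsm resolve to
// different fields instead of raising an ambiguity error.
spark.sql("set spark.sql.caseSensitive=true")

val explodedData = filteredData
  .withColumn("A", explode(col("payload.adCsm.vfrd")))
  .withColumn("B", explode(col("payload.adCsm.dbg")))
  .drop("payload")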

Spark 2.1 - Support for string parameters in callUDF

I have a UDF that accepts string parameters as well as fields, but it seems that "callUDF" can only accept fields.
I found a workaround using selectExpr(...) or by using spark.sql(...), but I wonder if there is any better way of doing that.
Here is an example:
Schema - id, map[String, String]
spark.sqlContext.udf.register("get_from_map", (map: Map[String, String], att: String) => map.getOrElse(att, ""))
val data = spark.read...
data.selectExpr("id", "get_from_map(map, 'attr')").show(15)
This will work, but I was kind of hoping for a better approach like:
data.select($"id", callUDF("get_from_map", $"map", "attr"))
Any ideas? Am I missing something?
I haven't seen any JIRA ticket open about this, so either I'm missing something or I'm misusing it.
Thanks!
You can use the lit function for that:
data.select($"id", callUDF("get_from_map", $"map", lit("attr")))
Essentially, using lit() allows you to pass literals (strings, numbers) where columns are expected.
You might also want to register your function using the udf function, so you can call it directly rather than going through callUDF:
import org.apache.spark.sql.functions._
val getFromMap = udf((map:Map[String,String], att : String) => map.getOrElse(att,""))
data.select($"id", getFromMap($"map", lit("attr")))

How to specify Encoder when mapping a Spark Dataset from one type to another?

I have a Spark dataset of the following type:
org.apache.spark.sql.Dataset[Array[Double]]
I want to map the array to a Vector so that I can use it as the input dataset for ml.clustering.KMeans.fit(...). So I try to do something like this:
val featureVectors = vectors.map(r => Vectors.dense(r))
But this fails with the following error:
error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
I guess I need to specify an encoder for the map operation, but I'm struggling to find a way to do it. Any ideas?
You need the encoder to be available as implicit evidence:
def map[U : Encoder](func: T => U): Dataset[U]
breaks down to:
def map[U](func: T => U)(implicit evidence$1: Encoder[U]): Dataset[U]
So, you need to pass it in or have it available implicitly.
That said, I do not believe that Vector is supported as of yet, so you might have to drop to a DataFrame.
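A brief sketch of the two options described above, using Encoders.kryo as a generic fallback encoder (the ml.linalg import is an assumption about which Vector is in play, and note that a kryo-encoded Dataset stores the vectors as opaque binary, so for KMeans.fit you may still end up converting to a DataFrame with a proper vector column, as suggested):
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{Encoder, Encoders}

// Option 1: make an Encoder[Vector] available implicitly.
implicit val vectorEncoder: Encoder[Vector] = Encoders.kryo[Vector]
val featureVectors = vectors.map(r => Vectors.dense(r))

// Option 2: pass the encoder explicitly in map's second parameter list.
val featureVectors2 = vectors.map(r => Vectors.dense(r))(Encoders.kryo[Vector])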

How to extend Spark Catalyst optimizer with custom rules?

I want to use Catalyst rules to transform a star-schema (https://en.wikipedia.org/wiki/Star_schema) SQL query into a query against a denormalized star schema, where some fields from the dimension tables are represented in the fact table.
I tried to find extension points for adding my own rules to perform the transformation described above, but I couldn't find any. So I have the following questions:
How can I add my own rules to the Catalyst optimizer?
Is there another way to implement the functionality described above?
Following @Ambling's advice, you can use sparkSession.experimental.extraStrategies to add your functionality to the SparkPlanner.
Here is an example strategy that simply prints "Hello world!" to the console:
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

object MyStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = {
    println("Hello world!")
    Nil
  }
}
with example run:
val spark = SparkSession.builder().master("local").getOrCreate()
spark.experimental.extraStrategies = Seq(MyStrategy)
val q = spark.catalog.listTables.filter(t => t.name == "five")
q.explain(true)
spark.stop()
You can find a working example project on a friend's GitHub: https://github.com/bartekkalinka/spark-custom-rule-executor
As a further clue: as of Spark 2.0, you can plug in both extraStrategies and extraOptimizations through SparkSession.experimental.
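Since the question is specifically about optimizer rules rather than planner strategies, here is a sketch along the same lines for extraOptimizations (the rule body is a do-nothing placeholder; a real denormalization rewrite would pattern-match on the logical plan and return a transformed one):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object MyOptimizationRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = {
    println("Optimizer rule invoked")
    plan // returned unchanged; a real rule would rewrite the plan here
  }
}

val spark = SparkSession.builder().master("local").getOrCreate()
spark.experimental.extraOptimizations = Seq(MyOptimizationRule)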
