NPE in Spark Java toBytes(row.<Long>getAs("value2")) - apache-spark

Why does the following generate a NPE?
final SparkSession spark = SparkSession.builder()
    .appName(" test")
    .getOrCreate();
spark.createDataFrame(singletonList(new GenericRow(new Object[] { 1L, null, 3L })),
        new StructType(new StructField[] {
            DataTypes.createStructField("value1", DataTypes.LongType, true),
            DataTypes.createStructField("value2", DataTypes.LongType, true),
            DataTypes.createStructField("value3", DataTypes.LongType, true) }))
    .foreach((ForeachFunction<Row>) row -> {
        System.out.println("###" + row.getAs("value1"));
        System.out.println(row.<Long>getAs("value2"));
        System.out.println(toBytes(row.<Long>getAs("value2")));
        System.out.println("###" + row.getAs("value3"));
    });
I think this does not occur in Spark 1.6, but I'm unsure; it could just be better test data.

So, in the line
System.out.println(toBytes(row.<Long>getAs("value2")));
the call
row.<Long>getAs("value2")
returns a null "Long" object,
but
toBytes(long l)
wants a primitive "long", so Java tries to unbox the null Long into a long => NPE, as shown in this answer.
To protect against this we can use Java's Optional:
toBytes(Optional.ofNullable(row.<Long>getAs(name)).orElse(0L));

Related

How to force Spark SQL into codegen mode?

I'm writing a custom Spark catalyst Expression with custom codegen, but it seems that Spark (3.0.0) doesn't want to use the generated code, and falls back to interpreted mode.
I create my SparkSession in a pretty standard way, except that I try to force codegen:
val spark = SparkSession.builder()
  .appName("test-spark")
  .master("local[5]")
  .config("spark.sql.codegen.factoryMode", "CODEGEN_ONLY")
  .config("spark.sql.codegen.fallback", "false")
  .getOrCreate()
And then I have this custom Expression with both interpreted mode and codegen defined:
case class IsTrimmedExpr(child: Expression) extends UnaryExpression with ExpectsInputTypes {
  override def inputTypes: Seq[DataType] = Seq(StringType)
  override lazy val dataType: DataType = BooleanType
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    throw new RuntimeException("expected code gen")
    nullSafeCodeGen(ctx, ev, input => s"($input.trim().equals($input))")
  }
  override protected def nullSafeEval(input: Any): Any = {
    throw new RuntimeException("should not eval")
    val str = input.asInstanceOf[org.apache.spark.unsafe.types.UTF8String]
    str.trim.equals(str)
  }
}
which I register into the session's registry:
spark.sessionState.functionRegistry.registerFunction(
  FunctionIdentifier("is_trimmed"), {
    case Seq(s) => IsTrimmedExpr(s)
  }
)
To invoke the function/Expression, I do
val df = Seq(" abc", "def", "56 ", " 123 ", "what is a trim").toDF("word")
df.selectExpr("word", "is_trimmed(word)").show()
But instead of the expected exception from the doGenCode function, I get the exception from the nullSafeEval function, which should not run at all.
How do I force Spark to use codegen mode?
Enabling codegen is done by setting spark.sql.codegen to true.
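For reference, a minimal sketch of where that flag would go, added to the builder from the question. Note that spark.sql.codegen dates back to Spark 1.x, so whether Spark 3.0.0 still honours it is an assumption here; treat this as illustrative only:
val spark = SparkSession.builder()
  .appName("test-spark")
  .master("local[5]")
  .config("spark.sql.codegen", "true")                      // flag named in the answer (legacy, assumed still recognised)
  .config("spark.sql.codegen.factoryMode", "CODEGEN_ONLY")  // configs already present in the question
  .config("spark.sql.codegen.fallback", "false")
  .getOrCreate()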

Spark Structured Streaming to read nested Kafka Connect jsonConverter message

I have ingested an XML file using the Kafka Connect file-pulse connector 1.5.3.
Now I want to read it with Spark Structured Streaming and parse/flatten it, as it is quite nested.
The string I read out of Kafka (I read it with the console consumer and put a newline before the payload for illustration) looks like this:
{
"schema":{"type":"struct","fields":[{"type":"struct","fields":[{"type":"string","optional":true,"field":"city"},{"type":"array","items":{"type":"struct","fields":[{"type":"array","items":{"type":"struct","fields":[{"type":"string","optional":true,"field":"unit"},{"type":"string","optional":true,"field":"value"}],"optional":true,"name":"Value"},"optional":true,"field":"value"}],"optional":true,"name":"ForcedArrayType"},"optional":true,"field":"forcedArrayField"},{"type":"string","optional":true,"field":"lastField"}],"optional":true,"name":"Data","field":"data"}],"optional":true}
,"payload":{"data":{"city":"someCity","forcedArrayField":[{"value":[{"unit":"unitField1","value":"123"},{"unit":"unitField1","value":"456"}]}],"lastField":"2020-08-02T18:02:00"}}
}
The data types I attempted:
StructType schema = new StructType();
schema = schema.add("schema", StringType, false);
schema = schema.add("payload", StringType, false);

StructType Data = new StructType();
StructType ValueArray = new StructType(new StructField[]{
    new StructField("unit", StringType, true, Metadata.empty()),
    new StructField("value", StringType, true, Metadata.empty())
});
StructType ForcedArrayType = new StructType(new StructField[]{
    new StructField("valueArray", ValueArray, true, Metadata.empty())
});
Data = Data.add("city", StringType, true);
Data = Data.add("forcedArrayField", ForcedArrayType, true);
Data = Data.add("lastField", StringType, true);

StructType Record = new StructType();
Record = Record.add("data", Data, false);
The query I attempted:
// below worked for payload
Dataset<Row> parsePayload = lines
    .selectExpr("cast (value as string) as json")
    .select(functions.from_json(functions.col("json"), schema).as("schemaAndPayload"))
    .select("schemaAndPayload.payload").as("payload");
System.out.println(parsePayload.isStreaming());

// below makes the output empty:
Dataset<Row> parseValue = parsePayload
    .select(functions.from_json(functions.col("payload"), Record).as("cols"))
    .select(functions.col("cols.data.city"));
    //.select(functions.col("cols.*"));

StreamingQuery query = parseValue
    .writeStream()
    .format("console")
    .outputMode(OutputMode.Append())
    .start();
query.awaitTermination();
When I output the parsePayload stream, I can see the data (still in JSON structure), but when I select certain/all fields, like city above, the output is empty.
Help needed.
Is the cause a wrongly defined data type, or is the query wrong?
P.S.
At the console, when I output 'parsePayload' instead of 'parseValue', it displays some data, which made me think the 'payload' part worked:
|{"data":{"city":"...|
...
Your schema definition seems to be not fully correct. I replicated your problem and was able to parse the JSON with the following schema:
val payloadSchema = new StructType()
  .add("data", new StructType()
    .add("city", StringType)
    .add("forcedArrayField", ArrayType(new StructType()
      .add("value", ArrayType(new StructType()
        .add("unit", StringType)
        .add("value", StringType)))))
    .add("lastField", StringType))
To then access the individual fields, I used the following selection:
val parsePayload = df
.selectExpr("cast (value as string) as json")
.select(functions.from_json(functions.col("json"), schema).as("schemaAndPayload"))
.select("schemaAndPayload.payload").as("payload")
.select(functions.from_json(functions.col("payload"), payloadSchema).as("cols"))
.select(col("cols.data.city").as("city"), explode(col("cols.data.forcedArrayField")).as("forcedArrayField"), col("cols.data.lastField").as("lastField"))
.select(col("city"), explode(col("forcedArrayField.value").as("middleFields")), col("lastField"))
This gives the output
+--------+-----------------+-------------------+
| city| col| lastField|
+--------+-----------------+-------------------+
|someCity|[unitField1, 123]|2020-08-02T18:02:00|
|someCity|[unitField1, 456]|2020-08-02T18:02:00|
+--------+-----------------+-------------------+
Your schema definition is wrong.
payload and schema might not be columns/fields.
Read the data as static JSON (spark.read.json) to get the schema, then use that schema in Structured Streaming.
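A minimal sketch of that suggestion, assuming a sample message has been saved to a local file (the path below is hypothetical); Spark infers the schema of the whole message, which you can then reuse in the streaming from_json call:
val staticDf = spark.read.json("/tmp/sample-kafka-message.json")  // hypothetical sample file with one raw Kafka value
staticDf.printSchema()                                            // inspect the inferred structure
val inferredSchema = staticDf.schema                              // reuse this schema in the streaming query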

How to set variables in "Where" clause when reading cassandra table by spark streaming?

I'm computing some statistics using Spark Streaming and Cassandra. When I read Cassandra tables with the spark-cassandra-connector and turn the Cassandra row RDD into a DStream via ConstantInputDStream, the "currentDate" variable in the where clause stays fixed to the day the program started.
The purpose is to analyze the total score by some dimensions up to the current date, but now the code only analyzes data up to the day it started running. I started the job on 2019-05-25, and data inserted into the table after that time is not picked up.
The code I use is shown below:
class TestJob extends Serializable {
  def test(ssc: StreamingContext): Unit = {
    val readTableRdd = ssc.cassandraTable(Configurations.getInstance().keySpace1, Constants.testTable)
      .select(
        "code",
        "date",
        "time",
        "score"
      ).where("date <= ?", new Utils().getCurrentDate())
    val DStreamRdd = new ConstantInputDStream(ssc, readTableRdd)
    DStreamRdd.foreachRDD { r =>
      // DO SOMETHING
    }
  }
}
object GetSSC extends Serializable {
  def getSSC(): StreamingContext = {
    val conf = new SparkConf()
      .setMaster(Configurations.getInstance().sparkHost)
      .setAppName(Configurations.getInstance().appName)
      .set("spark.cassandra.connection.host", Configurations.getInstance().casHost)
      .set("spark.cleaner.ttl", "3600")
      .set("spark.default.parallelism", "3")
      .set("spark.ui.port", "5050")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    @transient lazy val ssc = new StreamingContext(sc, Seconds(30))
    ssc
  }
}
object Main {
  val logger: Log = LogFactory.getLog(Main.getClass)
  def main(args: Array[String]): Unit = {
    val ssc = GetSSC.getSSC()
    try {
      new TestJob().test(ssc)
      ssc.start()
      ssc.awaitTermination()
    } catch {
      case e: Exception =>
        logger.error(Main.getClass.getSimpleName + " error: ", e)
    }
  }
}
The table used in this demo looks like:
CREATE TABLE test.test_table (
code text PRIMARY KEY, //UUID
date text, // '20190520'
time text, // '12:00:00'
score int); // 90
Any help is appreciated!
In general, the RDDs returned by the Spark Cassandra Connector aren't streaming RDDs - Cassandra has no functionality that would let you subscribe to a change feed and analyze it. You can implement something similar by explicitly looping and fetching the data, but it requires careful design of the tables, and it's hard to say more without digging deeper into the requirements for latency, amount of data, etc.
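As an illustration of that "explicitly looping and fetching" idea, here is a rough sketch based on the code in the question: the point is to evaluate the date filter inside foreachRDD, so each batch queries up to that batch's date rather than the startup date. It reuses the question's Utils, Configurations and Constants helpers and assumes the same connector imports:
DStreamRdd.foreachRDD { _ =>
  // evaluated every batch interval, not once when the job starts
  val upToDate = new Utils().getCurrentDate()
  val batchRdd = ssc.cassandraTable(Configurations.getInstance().keySpace1, Constants.testTable)
    .select("code", "date", "time", "score")
    .where("date <= ?", upToDate)
  // DO SOMETHING with batchRdd
}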

How to repartition Spark DStream Kafka ConsumerRecord RDD

I am getting unevenly sized Kafka topics. We want to repartition the input RDD based on some logic.
But when I try to apply the repartition I get an "object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord)" error.
I found the following workaround in "Job aborted due to stage failure: Task not serializable":
Call rdd.forEachPartition and create the NotSerializable object in there like this:
rdd.forEachPartition(iter -> {
    NotSerializable notSerializable = new NotSerializable();
    // ...Now process iter
});
I applied the above logic here; not sure if I missed anything:
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParam)).map(_.value())

stream.foreachRDD { rdd =>
  val repartitionRDD = flow.repartitionRDD(rdd, 1)
  println("&&&&&&&&&&&&&& repartitionRDD " + repartitionRDD.count())

  val modifiedRDD = rdd.mapPartitions { iter =>
    val customerRecords: List[ConsumerRecord[String, String]] = List[ConsumerRecord[String, String]]()
    while (iter.hasNext) {
      val consumerRecord: ConsumerRecord[String, String] = iter.next()
      customerRecords :+ consumerRecord
    }
    customerRecords.iterator
  }

  val r = modifiedRDD.repartition(1)
  println("************* after repartition " + r.count())
}
But I still get the same "object not serializable" error. Any help is greatly appreciated.
I tried making the stream transient, but that did not resolve the issue either.
I also made the test class Serializable, but that did not fix the issue.

Trying to understand spark streaming flow

I have this piece of code:
val lines: org.apache.spark.streaming.dstream.InputDStream[(String, String)] =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

lines.foreachRDD { rdd =>
  val df = cassandraSQLContext.read.json(rdd.map(x => x._2))
  sparkStreamingService.run(df)
}
ssc.start()
ssc.awaitTermination()
The way I understand it, foreachRDD happens at the driver level? So basically all of this block of code:
lines.foreachRDD { rdd =>
  val df = cassandraSQLContext.read.json(rdd.map(x => x._2))
  sparkStreamingService.run(df)
}
is happening at the driver level? The sparkStreamingService.run(df) method basically does some transformations on the current dataframe to yield a new dataframe, and then calls another method (in another jar) which stores the dataframe to Cassandra.
So if this is all happening at the driver level, we are not utilizing the Spark executors. How can I make it so that the executors are used to process each partition of the RDD in parallel?
My spark streaming service run method:
var metadataDataframe = df.select("customer", "tableName", "messageContent", "initialLoadRunning").collect()
metadataDataframe.foreach(rowD => {
  metaData = populateMetaDataService.populateSiteMetaData(rowD)
  val headers = (rowD.getString(2).split(recordDelimiter)(0))
  val fields = headers.split("\u0001").map(
    fieldName => StructField(fieldName, StringType, nullable = true))
  val schema = StructType(fields)
  val listOfRawData = rowD.getString(2).indexOf(recordDelimiter)
  val dataWithoutHeaders = rowD.getString(2).substring(listOfRawData + 1)
  val rawData = sparkContext.parallelize(dataWithoutHeaders.split(recordDelimiter))
  // val rawData = dataWithoutHeaders.split(recordDelimiter)
  val rowRDD = rawData
    .map(_.split("\u0001"))
    .map(attributes => Row(attributes: _*))
  val newDF = cassandraSQLContext.createDataFrame(rowRDD, schema)
  dataFrameFilterService.processBasedOnOpType(metaData, newDF)
})
The invocation of foreachRDD does happen on the driver node. But since we're operating at the RDD level, any transformation on the RDD will be distributed. In your example, rdd.map will cause each partition to be sent to a particular worker node for computation.
Since we don't know what your sparkStreamingService.run method is doing, we can't tell you about the locality of its execution.
The foreachRDD body may run locally, but that is just the setup. The RDD itself is a distributed collection, so the actual work is distributed.
To comment directly on the code from the docs:
dstream.foreachRDD { rdd =>
  val connection = createNewConnection() // executed at the driver
  rdd.foreach { record =>
    connection.send(record) // executed at the worker
  }
}
Notice that the part of the code that is NOT based around the RDD is executed at the driver. It's the code built up using RDD that is distributed to the workers.
Your code specifically is commented below:
//df.select will be distributed, but collect will pull it all back in
var metadataDataframe = df.select("customer", "tableName", "messageContent", "initialLoadRunning").collect()
//Since collect created a local collection then this is done on the driver
metadataDataframe.foreach(rowD => {
  metaData = populateMetaDataService.populateSiteMetaData(rowD)
  val headers = (rowD.getString(2).split(recordDelimiter)(0))
  val fields = headers.split("\u0001").map(
    fieldName => StructField(fieldName, StringType, nullable = true))
  val schema = StructType(fields)
  val listOfRawData = rowD.getString(2).indexOf(recordDelimiter)
  val dataWithoutHeaders = rowD.getString(2).substring(listOfRawData + 1)
  //This will run locally, creating a distributed record
  val rawData = sparkContext.parallelize(dataWithoutHeaders.split(recordDelimiter))
  // val rawData = dataWithoutHeaders.split(recordDelimiter)
  //This will redistribute the work
  val rowRDD = rawData
    .map(_.split("\u0001"))
    .map(attributes => Row(attributes: _*))
  //again, setting this up locally, to be run distributed
  val newDF = cassandraSQLContext.createDataFrame(rowRDD, schema)
  dataFrameFilterService.processBasedOnOpType(metaData, newDF)
})
Ultimately, you can probably rewrite this so it doesn't need the collect and stays fully distributed, but that is for you, not StackOverflow.
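For what it's worth, here is a very rough sketch of the "keep it distributed" direction, not the actual service code: the column names come from the question, the parsing is illustrative only, and it assumes recordDelimiter is a plain (non-regex) delimiter. Instead of collect() plus a driver-side foreach, the per-message splitting stays on the executors:
import org.apache.spark.sql.functions._

val distributed = df
  .select("customer", "tableName", "messageContent", "initialLoadRunning")
  // explode each message into its records on the executors instead of the driver
  .withColumn("record", explode(split(col("messageContent"), recordDelimiter)))
  .withColumn("fields", split(col("record"), "\u0001"))
// downstream logic can now work on `distributed` without pulling rows back to the driver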

Resources