Spark JSON text field to RDD with null values - apache-spark

I've got a cassandra table with a field named table
[app, ts, data]
I have several JSON objects inside data field in the cassandra table and I need to compute each object independently
val conf = new SparkConf().setAppName("datamonth")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val rdd = sc.cassandraTable[(String, String, String)]("datam", "data_month")
val jsons = rdd.map(_._3)
val jsonSchemaRDD = sqlContext.jsonRDD(jsons)
jsonSchemaRDD.registerTempTable("testjson")
sqlContext.sql("SELECT * FROM testjson where .... ").collect
because of NULL values i got this error
17/03/02 12:00:49 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 24) java.lang.NullPointerException: Unexpected null value of column data in datam.data_month.If you want to receive null values from Cassandra, please wrap the column type into Option or use JavaBeanColumnMapper
How do i proceed ? I am new on Spark and Cassandra and don't know how can i wrap the column type into Option.

Related

Creating an RDD from ConsumerRecord Value in Spark Streaming

I am trying to create a XmlRelation based on ConsumerRecord Value.
val value = record.value();
logger.info(".processRecord() : Value ={}" , value)
if(value !=null) {
val rdd = spark.sparkContext.parallelize(List(new String(value)))
How ever when i try to create an RDD based on the value i am getting NullPointerException.
org.apache.spark.SparkException: Job aborted due to stage failure:
Is this because i cannot create an RDD as i cannot get sparkContext on on worker nodes. Obviously i cannot send this information to back to the Driver as this is an infinite Stream.
What alternatives do i have.
The other alternative is write this record data along with Header info to another topic and write it back to another topic and have another streaming job process that info.
The ConsumerRecord Value i am getting is String (XML) and i want to parse it using an existing schema into an RDD and process it further.
Thanks
Sateesh
I am able to use the following code and make it work
val xmlStringDF:DataFrame = batchDF.selectExpr("value").filter($"value".isNotNull)
logger.info(".convert() : xmlStringDF Schema ={}",xmlStringDF.schema.treeString)
val rdd: RDD[String] = xmlStringDF.as[String].rdd
logger.info(".convert() : Before converting String DataFrame into XML DataFrame")
val relation = XmlRelation(
() => rdd,
None,
parameters.toMap,
xmlSchema)(spark.sqlContext)
val xmlDF = spark.baseRelationToDataFrame(relation)

Spark sql querying a Hive table from workers

I am trying to querying a Hive table from a map operation in Spark, but when it run a query the execution getting frozen.
This is my test code
val sc = new SparkContext(conf)
val datasetPath = "npiCodesMin.csv"
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
val df = sparkSession.read.option("header", true).option("sep", ",").csv(datasetPath)
df.createOrReplaceTempView("npicodesTmp")
sparkSession.sql("DROP TABLE IF EXISTS npicodes");
sparkSession.sql("CREATE TABLE npicodes AS SELECT * FROM npicodesTmp");
val res = sparkSession.sql("SELECT * FROM npicodes WHERE NPI = '1588667638'") //This works
println(res.first())
val NPIs = sc.parallelize(List("1679576722", "1588667638", "1306849450", "1932102084"))//Some existing NPIs
val rows = NPIs.mapPartitions{ partition =>
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
partition.map{code =>
val res = sparkSession.sql("SELECT * FROM npicodes WHERE NPI = '"+code+"'")//The program stops here
res.first()
}
}
rows.collect().foreach(println)
It loads the data from a CSV, creates a new Hive table and fills it with the CSV data.
Then, if I query the table from the master it works perfectly, but if I try to do that in a map operation the execution getting frozen.
It do not generate any error, it continue running without do anything.
The Spark UI shows this situation
Actually, I am not sure if I can query a table in a distributed way, I cannot find it in the documentation.
Any suggestion?
Thanks.

Applying a specific schema on an Apache Spark data frame

I am trying to apply a particular schema on a dataframe , the schema seems to have been applied but all dataframe operations like count, show, etc. always fails with NullPointerException as shown below:
java.lang.NullPointerException was thrown.
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:218)
at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
Here is my code:
var fieldSchema = ListBuffer[StructField]()
val columns = mydf.columns
for (i <- 0 until columns.length) {
val columns = mydf.columns
val colName = columns(i)
fieldSchema += StructField(colName, mydf.schema(i).dataType, true, null)
}
val schema = StructType(fieldSchema.toList)
val newdf = sqlContext.createDataFrame(df.rdd, schema) << df is the original dataframe
newdf.printSchema() << this prints the new applied schema
println("newdf count:"+newdf.count()) << this fails with null pointer exception
In short,there are actually 3 dataframes:
df - the original data frame
mydf- the schema that I'm trying to apply on df is coming from this dataframe
newdf- creating a new dataframe same as that of df, but with different schema

Convert org.apache.avro.generic.GenericRecord to org.apache.spark.sql.Row

I have list of org.apache.avro.generic.GenericRecord, avro schemausing this we need to create dataframe with the help of SQLContext API, to create dataframe it needs RDD of org.apache.spark.sql.Row and avro schema. Pre-requisite to create DF is we should have RDD of org.apache.spark.sql.Row and it can be achieved using below code but some how it is not working and giving error, sample code.
1. Convert GenericRecord to Row
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.avro.Schema
import org.apache.spark.sql.types.StructType
def convertGenericRecordToRow(genericRecords: Seq[GenericRecord], avroSchema: Schema, schemaType: StructType): Seq[Row] =
{
val fields = avroSchema.getFields
var rows = new Seq[Row]
for (avroRecord <- genericRecords) {
var avroFieldsSeq = Seq[Any]();
for (i <- 0 to fields.size - 1) {
avroFieldsSeq = avroFieldsSeq :+avroRecord.get(fields.get(i).name)
}
val avroFieldArr = avroFieldsSeq.toArray
val genericRow = new GenericRowWithSchema(avroFieldArr, schemaType)
rows = rows :+ genericRow
}
return rows;
}
2. Convert `Avro schema` to `Structtype`
Use `com.databricks.spark.avro.SchemaConverters -> toSqlType` function , it will convert avro schema to StructType
3. Create `Dataframe` using `SQLContext`
val rowSeq= convertGenericRecordToRow(genericRecords, avroSchema, schemaType)
val rowRdd = sc.parallelize(rowSeq, 1)
val finalDF =sqlContext.createDataFrame(rowRDD,structType)
But it is throwing an error at creation of DataFrame. Can someone please help me what is wrong in above code. Apart from this if someone has different logic for converting and creation of dataframe.
Whenever I will invoke any action on Dataframe, it will execute DAG and try to create DF object but in this it is failing with below exception as
ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
Error :Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, hdpoc-c01-r06-01, executor 1): java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateFormat; local class incompatible: stream classdesc serialVersionUID = 2, local class serialVersionUID = 1
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
After this I am trying to give correct version jar in jar parameter of spark submit and with other parameter as --conf spark.driver.userClassPathFirst=true
but now it is failing with MapR as
ERROR CLDBRpcCommonUtils: Exception during init
java.lang.UnsatisfiedLinkError: com.mapr.security.JNISecurity.SetClusterOption(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)
at com.mapr.security.JNISecurity.SetClusterOption(Native Method)
at com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils.init(CLDBRpcCommonUtils.java:163)
at com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils.<init>(CLDBRpcCommonUtils.java:73)
at com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils.<clinit>(CLDBRpcCommonUtils.java:63)
at org.apache.hadoop.conf.CoreDefaultProperties.<clinit>(CoreDefaultProperties.java:69)
at java.lang.Class.forName0(Native Method)
We are using MapR distribution and after class path change in spark-submit, it is failing with above exception.
Can someone please help here or my basic need it to convert Avro GenericRecord into Spark Row so i can create Dataframe with it, please help
Thanks.
Maybe this helps somebody coming a bit later to the game.
Since spark-avro is deprecated and now integrated in Spark, there is a different way this can be accomplished.
import org.apache.spark.sql.avro._
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.catalyst.encoders.RowEncoder
...
val avroSchema = data.head.getSchema
val sparkTypes = SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
val converter = new AvroDeserializer(avroSchema, sparkTypes)
val enconder = RowEncoder.apply(sparkTypes).resolveAndBind()
val rows = data.map { record =>
enconder.fromRow(converter.deserialize(record).asInstanceOf[InternalRow])
}
val df = sparkSession.sqlContext.createDataFrame(sparkSession.sparkContext.parallelize(rows), sparkTypes)
While creating dataframe from RDD[GenericRecord] there are few steps
First need to convert org.apache.avro.generic.GenericRecord into org.apache.spark.sql.Row
Use com.databricks.spark.avro.SchemaConverters.createConverterToSQL(
sourceAvroSchema: Schema,targetSqlType: DataType)
this is private method in spark-avro 3.2 version. If we are having same or less than 3.2 then copy this method into your own util class and use it else directly use it.
Create Dataframe from collection of Row (rowSeq).
val rdd = ssc.sparkContext.parallelize(rowSeq,numParition) val
dataframe = sparkSession.createDataFrame(rowRDD, schemaType)
This resolves my problem.
Hopefully this will help. In the first part you can find how convert from GenericRecord to Row
How to convert RDD[GenericRecord] to dataframe in scala?

WriteConf of Spark-Cassandra Connector being used or not

I am using Spark version 1.6.2, Spark-Cassandra Connector 1.6.0, Cassandra-Driver-Core 3.0.3
I am writing a simple Spark job in which I am trying to insert some rows to a table in Cassandra. The code snippet used was:
val sparkConf = (new SparkConf(true).set("spark.cassandra.connection.host", "<Cassandra IP>")
.set("spark.cassandra.auth.username", "test")
.set("spark.cassandra.auth.password", "test")
.set("spark.cassandra.output.batch.size.rows", "1"))
val sc = new SparkContext(sparkConf)
val cassandraSQLContext = new CassandraSQLContext(sc)
cassandraSQLContext.setKeyspace("test")
val query = "select * from test"
val dataRDD = cassandraSQLContext.cassandraSql(query).rdd
val addRowList = (ListBuffer(
Test(111, 10, 100000, "{'test':'0','test1':'1','others':'2'}"),
Test(111, 20, 200000, "{'test':'0','test1':'1','others':'2'}")
))
val insertRowRDD = sc.parallelize(addRowList)
insertRowRDD.saveToCassandra("test", "test")
Test() is a case class
Now, I have passed the WriteConf parameter output.batch.size.rows when making sparkConf object. I am expecting that this code will write 1 row in a batch at a time in Cassandra. I am not getting any method through which I can cross verify that the configuration of writing a batch in cassandra is not the default one but the one passed in the code snippet.
I could not find anything in the cassandra cassandra.log, system.log and debug.log
So can anyone help me with the method of cross verifying the WriteConf being used by Spark-Cassandra Connector to write batches in Cassandra?
There are two things you can do to verify that your setting was correctly set.
First you can call the method which creates WriteConf
WriteConf.fromSparkConf(sparkConf)
The resulting object can be inspected to make sure all the values are what you want. This is the default arg to SaveToCassandra
You can explicitly pass a WriteConf to the saveToCassandraMethod
saveAsCassandraTable(keyspace, table, writeConf = WriteConf(...))

Resources