Spark RDD to DataFrame with schema specification - apache-spark

It seems that Spark can't apply a schema (with types other than String) to a DataFrame when converting it from an RDD of Row objects. I've tried this on both Spark 1.4 and 1.5.
Snippet (Java API):
JavaPairInputDStream<String, String> directKafkaStream = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet);

directKafkaStream.foreachRDD(rdd -> {
    rdd.foreach(x -> System.out.println("x._1() = " + x._1()));
    rdd.foreach(x -> System.out.println("x._2() = " + x._2()));

    JavaRDD<Row> rowRdd = rdd.map(x -> RowFactory.create(x._2().split("\t")));
    rowRdd.foreach(x -> System.out.println("x = " + x));

    SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());

    StructField id = DataTypes.createStructField("id", DataTypes.IntegerType, true);
    StructField name = DataTypes.createStructField("name", DataTypes.StringType, true);
    List<StructField> fields = Arrays.asList(id, name);
    StructType schema = DataTypes.createStructType(fields);

    DataFrame sampleDf = sqlContext.createDataFrame(rowRdd, schema);
    sampleDf.printSchema();
    sampleDf.show();

    return null;
});

jssc.start();
jssc.awaitTermination();
It produces the following output if I specify DataTypes.StringType for the "id" field:
x._1() = null
x._2() = 1 item1
x = [1,item1]
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
+---+-----+
| id| name|
+---+-----+
| 1|item1|
+---+-----+
For the code as specified (with IntegerType for "id") it throws an error:
x._1() = null
x._2() = 1 item1
x = [1,item1]
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
15/09/16 04:13:33 ERROR JobScheduler: Error running job streaming job 1442402013000 ms.0
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:40)
at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getInt(rows.scala:220)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$IntConverter$.toScalaImpl(CatalystTypeConverters.scala:358)
A similar issue was reported on the Spark Confluence, but it is marked as resolved for version 1.3.

You are mixing two different things - data types and the DataFrame schema. When you create a Row like this:
RowFactory.create(x._2().split("\t"))
you get Row(_: String, _: String), but your schema states that you have Row(_: Integer, _: String). Since there is no automatic type conversion, you get the error.
To make it work you can either cast the values when you create the rows, or define id as a StringType and use the Column.cast method after you've created the DataFrame, as in the sketch below.
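For illustration, here is a hedged Scala sketch of both options; lines stands for an RDD[String] of tab-separated records like x._2(), and rowRdd for the RDD of Rows already built above (adapt to the Java API as needed):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Option 1: convert the id to Int while building each Row, so it matches IntegerType.
val typedRowRdd = lines.map { line =>
  val parts = line.split("\t")
  Row(parts(0).toInt, parts(1))
}

// Option 2: declare id as StringType, then cast the column once the DataFrame exists.
val stringSchema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("name", StringType, nullable = true)))
val sampleDf = sqlContext.createDataFrame(rowRdd, stringSchema)
  .withColumn("id", col("id").cast(IntegerType))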

Related

Spark Structured streaming - reading timestamp from file using schema

I am working on a Structured Streaming job.
The data I am reading from files contains the timestamp (in millis), deviceId and a value reported by that device.
Multiple devices report data.
I am trying to write a job that aggregates (sums) values sent by all devices into tumbling windows of 1 minute.
The issue I am having is with the timestamp.
When I try to parse "timestamp" as a Long, the window function complains that it expects a timestamp type.
When I try to read it as TimestampType, as in the snippet below, I get a MatchError exception (the full exception can be seen below), and I am struggling to figure out why and what the correct way to handle it is.
// Create schema
StructType readSchema = new StructType()
        .add("value", "integer")
        .add("deviceId", "long")
        .add("timestamp", new TimestampType());

// Read data from file
Dataset<Row> inputDataFrame = sparkSession.readStream()
        .schema(readSchema)
        .parquet(path);

Dataset<Row> aggregations = inputDataFrame
        .groupBy(window(inputDataFrame.col("timestamp"), "1 minutes"),
                 inputDataFrame.col("deviceId"))
        .agg(sum("value"));
The exception:
scala.MatchError: org.apache.spark.sql.types.TimestampType@3eeac696 (of class org.apache.spark.sql.types.TimestampType)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:215)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:212)
at org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.<init>(objects.scala:1692)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:175)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:171)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:66)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:92)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:232)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:242)
at org.apache.spark.sql.streaming.DataStreamReader.parquet(DataStreamReader.scala:450)
Typically, when your timestamp is stored in millis as a long, you would convert it into a timestamp type as shown below:
// Create schema and keep column 'timestamp' as long
StructType readSchema = new StructType()
        .add("value", "integer")
        .add("deviceId", "long")
        .add("timestamp", "long");

// Read data from file
Dataset<Row> inputDataFrame = sparkSession.readStream()
        .schema(readSchema)
        .parquet(path);

// Convert the timestamp column (epoch millis) into a proper timestamp type
Dataset<Row> df1 = inputDataFrame.withColumn("new_timestamp",
        expr("timestamp/1000").cast(DataTypes.TimestampType));

df1.show(false);
+-----+--------+-------------+-----------------------+
|value|deviceId|timestamp |new_timestamp |
+-----+--------+-------------+-----------------------+
|1 |1337 |1618836775397|2021-04-19 14:52:55.397|
+-----+--------+-------------+-----------------------+
df1.printSchema();
root
|-- value: integer (nullable = true)
|-- deviceId: long (nullable = true)
|-- timestamp: long (nullable = true)
|-- new_timestamp: timestamp (nullable = true)
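With the converted column in place, the tumbling-window aggregation from the question can then group on new_timestamp. Here is a hedged Scala sketch under the same assumptions (df1 and the column names follow the snippet above):
import org.apache.spark.sql.functions.{col, sum, window}

// Aggregate values per device into 1-minute tumbling windows keyed on the proper timestamp column.
val aggregations = df1
  .groupBy(window(col("new_timestamp"), "1 minute"), col("deviceId"))
  .agg(sum("value"))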

error: overloaded method value createDataFrame

I tried to create an Apache Spark DataFrame:
val valuesCol = Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
valuesCol: Seq[(String, String)] = List((Male,2019-09-06), (Female,2019-09-06), (Male,2019-09-07))
Schema
val someSchema = List(StructField("sex", StringType, true),StructField("date", DateType, true))
someSchema: List[org.apache.spark.sql.types.StructField] = List(StructField(sex,StringType,true), StructField(date,DateType,true))
It does not work
val someDF = spark.createDataFrame(spark.sparkContext.parallelize(valuesCol),StructType(someSchema))
I got this error:
<console>:30: error: overloaded method value createDataFrame with alternatives:
(data: java.util.List[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rows: java.util.List[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.rdd.RDD[(String, String)], org.apache.spark.sql.types.StructType)
val someDF = spark.createDataFrame(spark.sparkContext.parallelize(valuesCol),StructType(someSchema))
Should I change date formatting in valuesCol? What actually causes this error?
With import spark.implicits._ you can convert a Seq into a DataFrame in place:
val df: DataFrame = Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
.toDF() // <--- Here
Explicitly setting column names:
val df: DataFrame = Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
.toDF("sex", "date")
For the desired schema, you can either cast the column or use a different element type:
//Cast
Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
.toDF("sex", "date")
.select($"sex", $"date".cast(DateType))
.printSchema()
//Types
val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
Seq(
("Male", new java.sql.Date(format.parse("2019-09-06").getTime)),
("Female", new java.sql.Date(format.parse("2019-09-06").getTime)),
("Male", new java.sql.Date(format.parse("2019-09-07").getTime)))
.toDF("sex", "date")
.printSchema()
//Output
root
|-- sex: string (nullable = true)
|-- date: date (nullable = true)
Regarding your question: since your RDD's element type is known, Spark will create the schema from it accordingly.
val rdd: RDD[(String, String)] = spark.sparkContext.parallelize(valuesCol)
spark.createDataFrame(rdd)
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
You can specify your valuesCol as a Seq of Row instead of a Seq of tuples:
val valuesCol = Seq(
  Row("Male", "2019-09-06"),
  Row("Female", "2019-09-06"),
  Row("Male", "2019-09-07"))

Convert from long to Timestamp for Insert into DB

Goal:
Read data from a JSON file where timestamp is a long type, and insert into a table that has a Timestamp type. The problem is that I don't know how to convert the long type to a Timestamp type for the insert.
Input File Sample:
{"sensor_id":"sensor1","reading_time":1549533263587,"notes":"My Notes for
Sensor1","temperature":24.11,"humidity":42.90}
I want to read this, create a Bean from it, and insert it into a table. Here is my Bean definition:
public class DummyBean {
    private String sensor_id;
    private String notes;
    private Timestamp reading_time;
    private double temperature;
    private double humidity;
    // getters and setters omitted
}
Here is the table I want to insert into:
create table dummy (
id serial not null primary key,
sensor_id varchar(40),
notes varchar(40),
reading_time timestamp with time zone default (current_timestamp at time zone 'UTC'),
temperature decimal(15,2),
humidity decimal(15,2)
);
Here is my Spark app to read the JSON file and do the insert (append):
SparkSession spark = SparkSession
        .builder()
        .appName("SparkJDBC2")
        .getOrCreate();
// Java Bean used to apply schema to JSON Data
Encoder<DummyBean> dummyEncoder = Encoders.bean(DummyBean.class);
// Read JSON file to DataSet
String jsonPath = "input/dummy.json";
Dataset<DummyBean> readings = spark.read().json(jsonPath).as(dummyEncoder);
// Diagnostics and Sink
readings.printSchema();
readings.show();
// Write to JDBC Sink
String url = "jdbc:postgresql://dbhost:5432/mydb";
String table = "dummy";
Properties connectionProperties = new Properties();
connectionProperties.setProperty("user", "foo");
connectionProperties.setProperty("password", "bar");
readings.write().mode(SaveMode.Append).jdbc(url, table, connectionProperties);
Output and Error Message:
root
|-- humidity: double (nullable = true)
|-- notes: string (nullable = true)
|-- reading_time: long (nullable = true)
|-- sensor_id: string (nullable = true)
|-- temperature: double (nullable = true)
+--------+--------------------+-------------+---------+-----------+
|humidity| notes| reading_time|sensor_id|temperature|
+--------+--------------------+-------------+---------+-----------+
| 42.9|My Notes for Sensor1|1549533263587| sensor1| 24.11|
+--------+--------------------+-------------+---------+-----------+
Exception in thread "main" org.apache.spark.sql.AnalysisException: Column "reading_time" not found in schema Some(StructType(StructField(id,IntegerType,false), StructField(sensor_id,StringType,true), StructField(notes,StringType,true), StructField(temperature,DecimalType(15,2),true), StructField(humidity,DecimalType(15,2),true)));
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$4$$anonfun$6.apply(JdbcUtils.scala:147)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$4$$anonfun$6.apply(JdbcUtils.scala:147)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$4.apply(JdbcUtils.scala:146)
The exception in your post says the "reading_time" column was not found, so please cross-check that the table has the required column on the db end. Also, the timestamp is coming in milliseconds, so you need to divide it by 1000 before applying the to_timestamp() function, otherwise you'll get a weird date.
I'm able to replicate this below and convert the reading_time:
scala> val readings = Seq((42.9,"My Notes for Sensor1",1549533263587L,"sensor1",24.11)).toDF("humidity","notes","reading_time","sensor_id","temperature")
readings: org.apache.spark.sql.DataFrame = [humidity: double, notes: string ... 3 more fields]
scala> readings.printSchema();
root
|-- humidity: double (nullable = false)
|-- notes: string (nullable = true)
|-- reading_time: long (nullable = false)
|-- sensor_id: string (nullable = true)
|-- temperature: double (nullable = false)
scala> readings.show(false)
+--------+--------------------+-------------+---------+-----------+
|humidity|notes |reading_time |sensor_id|temperature|
+--------+--------------------+-------------+---------+-----------+
|42.9 |My Notes for Sensor1|1549533263587|sensor1 |24.11 |
+--------+--------------------+-------------+---------+-----------+
scala> readings.withColumn("ts", to_timestamp('reading_time/1000)).show(false)
+--------+--------------------+-------------+---------+-----------+-----------------------+
|humidity|notes |reading_time |sensor_id|temperature|ts |
+--------+--------------------+-------------+---------+-----------+-----------------------+
|42.9 |My Notes for Sensor1|1549533263587|sensor1 |24.11 |2019-02-07 04:54:23.587|
+--------+--------------------+-------------+---------+-----------+-----------------------+
scala>
Thanks for your help. Yes, the table was missing the column, so I fixed that.
This is what solved it (Java version):
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.to_timestamp;
...
Dataset<Row> readingsRow = readings.withColumn("reading_time", to_timestamp(col("reading_time").$div(1000L)));
// Write to JDBC Sink
String url = "jdbc:postgresql://dbhost:5432/mydb";
String table = "dummy";
Properties connectionProperties = new Properties();
connectionProperties.setProperty("user", "foo");
connectionProperties.setProperty("password", "bar");
readingsRow.write().mode(SaveMode.Append).jdbc(url, table, connectionProperties);
If your date is a String you can use
String readtime = obj.getString("reading_time");
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ"); //Z for time zone
Date reading_time = sdf.parse(readtime);
or, if it is a long, use
new Date(json.getLong(milliseconds))
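For the reading_time in this question, which is an epoch value in milliseconds, a small hedged Scala sketch of the direct conversion to the type the Bean expects (the sample value comes from the input file above):
// Build a java.sql.Timestamp straight from the epoch-milliseconds long,
// which is the type the DummyBean's reading_time field needs.
val readingTimeMillis = 1549533263587L
val readingTime = new java.sql.Timestamp(readingTimeMillis)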

How do I convert Binary string to scala string in spark scala

I am reading an Avro file which contains a field stored as a binary string, and I need to convert it into a java.lang.String to pass it to another library (spark-xml-util). How do I convert it to a java.lang.String efficiently? This is the code I have so far:
val df = sqlContext.read.format("com.databricks.spark.avro").load("filePath/fileName.avro")
df.select("myField").collect().mkString
The last line gives me the following exception:
Exception in thread "main" java.lang.ClassCastException: [B cannot be cast to java.lang.String
at org.apache.spark.sql.Row$class.getString(Row.scala:255)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getString(rows.scala:165)
The df schema is:
root
|-- id: string (nullable = true)
|-- myField: binary (nullable = true)
Considering the state of the API right now (2.2.0), your best bet is to create a UDF to do just that and replace the column:
import org.apache.spark.sql.functions.udf
val toString = udf((payload: Array[Byte]) => new String(payload))
df.withColumn("myField", toString(df("myField")))
or, if as you seem to imply the data is compressed using GZIP, you can:
import java.io.ByteArrayInputStream
import java.util.zip.GZIPInputStream
import org.apache.spark.sql.functions.udf

val toString = udf((payload: Array[Byte]) => {
  val inputStream = new GZIPInputStream(new ByteArrayInputStream(payload))
  scala.io.Source.fromInputStream(inputStream).mkString
})
df.withColumn("myField", toString(df("myField")))
In Spark 3.0 you can cast between BINARY and STRING data.
scala> val df = sc.parallelize(Seq("ABC", "BCD", "CDE", "DEF")).toDF("value")
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.select($"value", $"value".cast("BINARY"),
$"value".cast("BINARY").cast("STRING")).show()
+-----+----------+-----+
|value| value|value|
+-----+----------+-----+
| ABC|[41 42 43]| ABC|
| BCD|[42 43 44]| BCD|
| CDE|[43 44 45]| CDE|
| DEF|[44 45 46]| DEF|
+-----+----------+-----+
I don't have your data to test with, but you should be able to do:
df.select($"myField".cast("STRING"))
This obviously depends on the actual data (i.e. don't cast a JPEG to STRING) but assuming it's UTF-8 encoded this should work.
In the previous solution, the code new String(payload) did not work for me on true binary data.
Ultimately the solution was a little more involved, with the length of the binary data required as a 2nd parameter.
def binToString(payload: Array[Byte], payload_length: Int): String = {
  val ac: Array[Char] = Range(0, payload_length).map(i => payload(i).toChar).toArray
  ac.mkString
}

val binToStringUDF = udf(binToString(_: Array[Byte], _: Int): String)
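A possible way to apply this UDF, assuming the byte length of each value can be supplied via the built-in length function (a sketch, not the answerer's original call site):
import org.apache.spark.sql.functions.{col, length}

// Pass the binary column and its byte length to the two-argument UDF defined above.
val converted = df.withColumn("myField", binToStringUDF(col("myField"), length(col("myField"))))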

How to parse Nested Json string from DynamoDB table in spark? [duplicate]

I have a Cassandra table that for simplicity looks something like:
key: text
jsonData: text
blobData: blob
I can create a basic data frame for this using spark and the spark-cassandra-connector using:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
I'm struggling though to expand the JSON data into its underlying structure. I ultimately want to be able to filter based on the attributes within the json string and return the blob data. Something like jsonData.foo = "bar" and return blobData. Is this currently possible?
Spark >= 2.4
If needed, the schema can be determined using the schema_of_json function (note that this assumes an arbitrary row is a valid representative of the schema).
import org.apache.spark.sql.functions.{lit, schema_of_json, from_json}
import collection.JavaConverters._
val schema = schema_of_json(lit(df.select($"jsonData").as[String].first))
df.withColumn("jsonData", from_json($"jsonData", schema, Map[String, String]().asJava))
Spark >= 2.1
You can use from_json function:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
val schema = StructType(Seq(
StructField("k", StringType, true), StructField("v", DoubleType, true)
))
df.withColumn("jsonData", from_json($"jsonData", schema))
Spark >= 1.6
You can use get_json_object which takes a column and a path:
import org.apache.spark.sql.functions.get_json_object
val exprs = Seq("k", "v").map(
c => get_json_object($"jsonData", s"$$.$c").alias(c))
df.select($"*" +: exprs: _*)
which extracts fields to individual strings that can be further cast to the expected types.
The path argument is expressed using dot syntax, with a leading $. denoting the document root (since the code above uses string interpolation, $ has to be escaped, hence $$.).
Spark <= 1.5:
Is this currently possible?
As far as I know it is not directly possible. You can try something similar to this:
val df = sc.parallelize(Seq(
("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
("2", """{"k": "bar", "v": 3.0}""", "some_other_field_2")
)).toDF("key", "jsonData", "blobData")
I assume that the blob field cannot be represented in JSON. Otherwise you can omit the splitting and joining:
import org.apache.spark.sql.Row
val blobs = df.drop("jsonData").withColumnRenamed("key", "bkey")
val jsons = sqlContext.read.json(df.drop("blobData").map{
case Row(key: String, json: String) =>
s"""{"key": "$key", "jsonData": $json}"""
})
val parsed = jsons.join(blobs, $"key" === $"bkey").drop("bkey")
parsed.printSchema
// root
// |-- jsonData: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: double (nullable = true)
// |-- key: long (nullable = true)
// |-- blobData: string (nullable = true)
An alternative (cheaper, although more complex) approach is to use a UDF to parse the JSON and output a struct or map column. For example, something like this:
import net.liftweb.json.parse
case class KV(k: String, v: Int)
val parseJson = udf((s: String) => {
implicit val formats = net.liftweb.json.DefaultFormats
parse(s).extract[KV]
})
val parsed = df.withColumn("parsedJSON", parseJson($"jsonData"))
parsed.show
// +---+--------------------+------------------+----------+
// |key| jsonData| blobData|parsedJSON|
// +---+--------------------+------------------+----------+
// | 1|{"k": "foo", "v":...|some_other_field_1| [foo,1]|
// | 2|{"k": "bar", "v":...|some_other_field_2| [bar,3]|
// +---+--------------------+------------------+----------+
parsed.printSchema
// root
// |-- key: string (nullable = true)
// |-- jsonData: string (nullable = true)
// |-- blobData: string (nullable = true)
// |-- parsedJSON: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: integer (nullable = false)
zero323's answer is thorough but misses one approach that is available in Spark 2.1+ and is simpler and more robust than using schema_of_json():
import org.apache.spark.sql.functions.from_json
val json_schema = spark.read.json(df.select("jsonData").as[String]).schema
df.withColumn("jsonData", from_json($"jsonData", json_schema))
Here's the Python equivalent:
from pyspark.sql.functions import from_json
json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
df.withColumn("jsonData", from_json("jsonData", json_schema))
The problem with schema_of_json(), as zero323 points out, is that it inspects a single string and derives a schema from that. If you have JSON data with varied schemas, then the schema you get back from schema_of_json() will not reflect what you would get if you were to merge the schemas of all the JSON data in your DataFrame. Parsing that data with from_json() will then yield a lot of null or empty values where the schema returned by schema_of_json() doesn't match the data.
By using Spark's ability to derive a comprehensive JSON schema from an RDD of JSON strings, we can guarantee that all the JSON data can be parsed.
Example: schema_of_json() vs. spark.read.json()
Here's an example (in Python, the code is very similar for Scala) to illustrate the difference between deriving the schema from a single element with schema_of_json() and deriving it from all the data using spark.read.json().
>>> df = spark.createDataFrame(
... [
... (1, '{"a": true}'),
... (2, '{"a": "hello"}'),
... (3, '{"b": 22}'),
... ],
... schema=['id', 'jsonData'],
... )
a has a boolean value in one row and a string value in another. The merged schema for a would set its type to string. b would be an integer.
Let's see how the different approaches compare. First, the schema_of_json() approach:
>>> json_schema = schema_of_json(df.select("jsonData").take(1)[0][0])
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: boolean (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true]|
| 2| null|
| 3| []|
+---+--------+
As you can see, the JSON schema we derived was very limited. "a": "hello" couldn't be parsed as a boolean and returned null, and "b": 22 was just dropped because it wasn't in our schema.
Now with spark.read.json():
>>> json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: long (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true,]|
| 2|[hello,]|
| 3| [, 22]|
+---+--------+
Here we have all our data preserved, and with a comprehensive schema that accounts for all the data. "a": true was cast as a string to match the schema of "a": "hello".
The main downside of using spark.read.json() is that Spark will scan through all your data to derive the schema. Depending on how much data you have, that overhead could be significant. If you know that all your JSON data has a consistent schema, it's fine to go ahead and just use schema_of_json() against a single element. If you have schema variability but don't want to scan through all your data, you can set samplingRatio to something less than 1.0 in your call to spark.read.json() to look at a subset of the data.
Here are the docs for spark.read.json(): Scala API / Python API
The from_json function is exactly what you're looking for. Your code will look something like:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
//You can define whatever struct type that your json states
val schema = StructType(Seq(
StructField("key", StringType, true),
StructField("value", DoubleType, true)
))
df.withColumn("jsonData", from_json(col("jsonData"), schema))
The underlying JSON string is:
"{ \"column_name1\":\"value1\",\"column_name2\":\"value2\",\"column_name3\":\"value3\",\"column_name5\":\"value5\"}";
Below is the script to filter the JSON and load the required data in to Cassandra.
sqlContext.read.json(rdd).select("column_name1 or fields name in Json", "column_name2","column_name2")
.write.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "Table_name", "keyspace" -> "Key_Space_name"))
.mode(SaveMode.Append)
.save()
I use the following (available since 2.2.0; I am assuming that your JSON string column is at column index 0):
def parse(df: DataFrame, spark: SparkSession): DataFrame = {
  val stringDf = df.map((value: Row) => value.getString(0), Encoders.STRING)
  spark.read.json(stringDf)
}
It will automatically infer the schema of your JSON. It is documented here:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameReader.html
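A hypothetical usage of that helper, assuming the JSON string column is selected first so that it sits at index 0:
// Select just the JSON column, then let Spark infer its schema while parsing it.
val parsedJson = parse(df.select("jsonData"), spark)
parsedJson.printSchema()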
