How do I convert Binary string to scala string in spark scala - string

I am reading an avro file which contains a field as binary string, I need to convert it into a java.lang.string to pass it to another library(spark-xml-util), how do I convert it into java.lang.string efficiently. This is the code I have got so far : -
val df = sqlContext.read.format("com.databricks.spark.avro").load("filePath/fileName.avro")
df.select("myField").collect().mkString
The last line gives me the following exception: -
Exception in thread "main" java.lang.ClassCastException: [B cannot be cast to java.lang.String
at org.apache.spark.sql.Row$class.getString(Row.scala:255)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getString(rows.scala:165)
df schema is: -
root
|-- id: string (nullable = true)
|-- myField: binary (nullable = true)

Considering the state of the API right now (2.2.0), your best call is to create a UDF to do just that and replace the column :
import org.apache.spark.sql.functions.udf
val toString = udf((payload: Array[Byte]) => new String(payload))
df.withColumn("myField", toString(df("myField")))
or if as you seem to imply the data is compressed using GZIP you can :
import org.apache.spark.sql.functions.udf
val toString = udf((payload: Array[Byte]) => {
val inputStream = new GZIPInputStream(new ByteArrayInputStream(payload))
scala.io.Source.fromInputStream(inputStream).mkString
})
df.withColumn("myField", toString(df("myField")))

In Spark 3.0 you can cast between BINARY and STRING data.
scala> val df = sc.parallelize(Seq("ABC", "BCD", "CDE", "DEF")).toDF("value")
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.select($"value", $"value".cast("BINARY"),
$"value".cast("BINARY").cast("STRING")).show()
+-----+----------+-----+
|value| value|value|
+-----+----------+-----+
| ABC|[41 42 43]| ABC|
| BCD|[42 43 44]| BCD|
| CDE|[43 44 45]| CDE|
| DEF|[44 45 46]| DEF|
+-----+----------+-----+
I don't have your data to test with, but you should be able to do:
df.select($"myField".cast("STRING"))
This obviously depends on the actual data (i.e. don't cast a JPEG to STRING) but assuming it's UTF-8 encoded this should work.

In the previous solution, the code new String(payload) did not work for me on true binary data.
Ultimately the solution was a little more involved, with the length of the binary data required as a 2nd parameter.
def binToString(payload: Array[Byte], payload_length: Int): String = {
val ac: Array[Char] = Range(0,payload_length).map(i => payload(i).toChar).toArray
return ac.mkString
}
val binToStringUDF = udf( binToString(_: Array[Byte], _: Int): String )

Related

error: overloaded method value createDataFrame

I tried to create Apache Spark dataframe
val valuesCol = Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
valuesCol: Seq[(String, String)] = List((Male,2019-09-06), (Female,2019-09-06), (Male,2019-09-07))
Schema
val someSchema = List(StructField("sex", StringType, true),StructField("date", DateType, true))
someSchema: List[org.apache.spark.sql.types.StructField] = List(StructField(sex,StringType,true), StructField(date,DateType,true))
It does not work
val someDF = spark.createDataFrame(spark.sparkContext.parallelize(valuesCol),StructType(someSchema))
I got error
<console>:30: error: overloaded method value createDataFrame with alternatives:
(data: java.util.List[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rows: java.util.List[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.rdd.RDD[(String, String)], org.apache.spark.sql.types.StructType)
val someDF = spark.createDataFrame(spark.sparkContext.parallelize(valuesCol),StructType(someSchema))
Should I change date formatting in valuesCol? What actually causes this error?
With import spark.implicits._ you could convert Seq into Dataframe in place
val df: DataFrame = Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
.toDF() // <--- Here
Explicitly setting column names:
val df: DataFrame = Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
.toDF("sex", "date")
For the desired schema, you could either cast column or use a different type
//Cast
Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
.toDF("sex", "date")
.select($"sex", $"date".cast(DateType))
.printSchema()
//Types
val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
Seq(
("Male", new java.sql.Date(format.parse("2019-09-06").getTime)),
("Female", new java.sql.Date(format.parse("2019-09-06").getTime)),
("Male", new java.sql.Date(format.parse("2019-09-07").getTime)))
.toDF("sex", "date")
.printSchema()
//Output
root
|-- sex: string (nullable = true)
|-- date: date (nullable = true)
Regarding your question, your rdd type is known, Spark will create schema accordingly to it.
val rdd: RDD[(String, String)] = spark.sparkContext.parallelize(valuesCol)
spark.createDataFrame(rdd)
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
You can specify your valuesCol as Seq of Row instead of Seq of Tuple :
val valuesCol = Seq(
Row("Male", "2019-09-06"),
Row ("Female", "2019-09-06"),
Row("Male", "2019-09-07"))

spark custom kryo encoder not providing schema for UDF

When following along with How to store custom objects in Dataset? and trying to register my own kryo encoder for a data frame I face an issue of Schema for type com.esri.core.geometry.Envelope is not supported
There is a function which will parse a String (WKT) to an geometry object like:
def mapWKTToEnvelope(wkt: String): Envelope = {
val envBound = new Envelope()
val spatialReference = SpatialReference.create(4326)
// Parse the WKT String into a Geometry Object
val ogcObj = OGCGeometry.fromText(wkt)
ogcObj.setSpatialReference(spatialReference)
ogcObj.getEsriGeometry.queryEnvelope(envBound)
envBound
}
This is applied to an UDF like:
implicit val envelopeEncoder: Encoder[Envelope] = Encoders.kryo[Envelope]
val ST_Envelope = udf((wkt: String) => mapWKTToEnvelope(wkt))
However, the UDF will compile but throw a runtime error of:
[error] Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type com.esri.core.geometry.Envelope is not supported
[error] at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:733)
[error] at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:671)
[error] at org.apache.spark.sql.functions$.udf(functions.scala:3076)
edit
Whereas
val first = df[(String, String)].first
val envBound = new Envelope()
val ogcObj = OGCGeometry.fromText(first._1)
ogcObj.setSpatialReference(spatialReference)
ogcObj.getEsriGeometry.queryEnvelope(envBound)
spark.createDataset(Seq((envBound)))(envelopeEncoder)
Works just fine:
root
|-- value: binary (nullable = true)
+--------------------+
| value|
+--------------------+
|[01 00 63 6F 6D 2...|
+--------------------+
How can I get it to work in the UDF as well

How to add a schema to a Dataset in Spark?

I am trying to load a file into spark.
If I load a normal textFile into Spark like below:
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
The outcome is:
partFile: org.apache.spark.sql.Dataset[String] = [value: string]
I can see a dataset in the output. But if I load a Json file:
val pfile = spark.read.json("hdfs://quickstart:8020/user/cloudera/pjson")
The outcome is a dataframe with a readymade schema:
pfile: org.apache.spark.sql.DataFrame = [address: struct<city: string, state: string>, age: bigint ... 1 more field]
The Json/parquet/orc files have schema. So I can understand that this is a feature from Spark version:2x, which made things easier as we directly get a DataFrame in this case and for a normal textFile you get a dataset where there is no schema which makes sense.
What I'd like to know is how can I add a schema to a dataset that is a resultant of loading a textFile into spark. For an RDD, there is case class/StructType option to add the schema and convert it to a DataFrame.
Could anyone let me know how can I do it ?
When you use textFile, each line of the file will be a string row in your Dataset. To convert to DataFrame with a schema, you can use toDF:
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
import sqlContext.implicits._
val df = partFile.toDF("string_column")
In this case, the DataFrame will have a schema of a single column of type StringType.
If your file contains a more complex schema, you can either use the csv reader (if the file is in a structured csv format):
val partFile = spark.read.option("header", "true").option("delimiter", ";").csv("hdfs://quickstart:8020/user/cloudera/partfile")
Or you can process your Dataset using map, then using toDF to convert to DataFrame. For example, suppose you want one column to be the first character of the line (as an Int) and the other column to be the fourth character (also as an Int):
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
val processedDataset: Dataset[(Int, Int)] = partFile.map {
line: String => (line(0).toInt, line(3).toInt)
}
import sqlContext.implicits._
val df = processedDataset.toDF("value0", "value3")
Also, you can define a case class, which will represent the final schema for your DataFrame:
case class MyRow(value0: Int, value3: Int)
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
val processedDataset: Dataset[MyRow] = partFile.map {
line: String => MyRow(line(0).toInt, line(3).toInt)
}
import sqlContext.implicits._
val df = processedDataset.toDF
In both cases above, calling df.printSchema would show:
root
|-- value0: integer (nullable = true)
|-- value3: integer (nullable = true)

How to parse Nested Json string from DynamoDB table in spark? [duplicate]

I have a Cassandra table that for simplicity looks something like:
key: text
jsonData: text
blobData: blob
I can create a basic data frame for this using spark and the spark-cassandra-connector using:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
I'm struggling though to expand the JSON data into its underlying structure. I ultimately want to be able to filter based on the attributes within the json string and return the blob data. Something like jsonData.foo = "bar" and return blobData. Is this currently possible?
Spark >= 2.4
If needed, schema can be determined using schema_of_json function (please note that this assumes that an arbitrary row is a valid representative of the schema).
import org.apache.spark.sql.functions.{lit, schema_of_json, from_json}
import collection.JavaConverters._
val schema = schema_of_json(lit(df.select($"jsonData").as[String].first))
df.withColumn("jsonData", from_json($"jsonData", schema, Map[String, String]().asJava))
Spark >= 2.1
You can use from_json function:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
val schema = StructType(Seq(
StructField("k", StringType, true), StructField("v", DoubleType, true)
))
df.withColumn("jsonData", from_json($"jsonData", schema))
Spark >= 1.6
You can use get_json_object which takes a column and a path:
import org.apache.spark.sql.functions.get_json_object
val exprs = Seq("k", "v").map(
c => get_json_object($"jsonData", s"$$.$c").alias(c))
df.select($"*" +: exprs: _*)
and extracts fields to individual strings which can be further casted to expected types.
The path argument is expressed using dot syntax, with leading $. denoting document root (since the code above uses string interpolation $ has to be escaped, hence $$.).
Spark <= 1.5:
Is this currently possible?
As far as I know it is not directly possible. You can try something similar to this:
val df = sc.parallelize(Seq(
("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
("2", """{"k": "bar", "v": 3.0}""", "some_other_field_2")
)).toDF("key", "jsonData", "blobData")
I assume that blob field cannot be represented in JSON. Otherwise you cab omit splitting and joining:
import org.apache.spark.sql.Row
val blobs = df.drop("jsonData").withColumnRenamed("key", "bkey")
val jsons = sqlContext.read.json(df.drop("blobData").map{
case Row(key: String, json: String) =>
s"""{"key": "$key", "jsonData": $json}"""
})
val parsed = jsons.join(blobs, $"key" === $"bkey").drop("bkey")
parsed.printSchema
// root
// |-- jsonData: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: double (nullable = true)
// |-- key: long (nullable = true)
// |-- blobData: string (nullable = true)
An alternative (cheaper, although more complex) approach is to use an UDF to parse JSON and output a struct or map column. For example something like this:
import net.liftweb.json.parse
case class KV(k: String, v: Int)
val parseJson = udf((s: String) => {
implicit val formats = net.liftweb.json.DefaultFormats
parse(s).extract[KV]
})
val parsed = df.withColumn("parsedJSON", parseJson($"jsonData"))
parsed.show
// +---+--------------------+------------------+----------+
// |key| jsonData| blobData|parsedJSON|
// +---+--------------------+------------------+----------+
// | 1|{"k": "foo", "v":...|some_other_field_1| [foo,1]|
// | 2|{"k": "bar", "v":...|some_other_field_2| [bar,3]|
// +---+--------------------+------------------+----------+
parsed.printSchema
// root
// |-- key: string (nullable = true)
// |-- jsonData: string (nullable = true)
// |-- blobData: string (nullable = true)
// |-- parsedJSON: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: integer (nullable = false)
zero323's answer is thorough but misses one approach that is available in Spark 2.1+ and is simpler and more robust than using schema_of_json():
import org.apache.spark.sql.functions.from_json
val json_schema = spark.read.json(df.select("jsonData").as[String]).schema
df.withColumn("jsonData", from_json($"jsonData", json_schema))
Here's the Python equivalent:
from pyspark.sql.functions import from_json
json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
df.withColumn("jsonData", from_json("jsonData", json_schema))
The problem with schema_of_json(), as zero323 points out, is that it inspects a single string and derives a schema from that. If you have JSON data with varied schemas, then the schema you get back from schema_of_json() will not reflect what you would get if you were to merge the schemas of all the JSON data in your DataFrame. Parsing that data with from_json() will then yield a lot of null or empty values where the schema returned by schema_of_json() doesn't match the data.
By using Spark's ability to derive a comprehensive JSON schema from an RDD of JSON strings, we can guarantee that all the JSON data can be parsed.
Example: schema_of_json() vs. spark.read.json()
Here's an example (in Python, the code is very similar for Scala) to illustrate the difference between deriving the schema from a single element with schema_of_json() and deriving it from all the data using spark.read.json().
>>> df = spark.createDataFrame(
... [
... (1, '{"a": true}'),
... (2, '{"a": "hello"}'),
... (3, '{"b": 22}'),
... ],
... schema=['id', 'jsonData'],
... )
a has a boolean value in one row and a string value in another. The merged schema for a would set its type to string. b would be an integer.
Let's see how the different approaches compare. First, the schema_of_json() approach:
>>> json_schema = schema_of_json(df.select("jsonData").take(1)[0][0])
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: boolean (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true]|
| 2| null|
| 3| []|
+---+--------+
As you can see, the JSON schema we derived was very limited. "a": "hello" couldn't be parsed as a boolean and returned null, and "b": 22 was just dropped because it wasn't in our schema.
Now with spark.read.json():
>>> json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: long (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true,]|
| 2|[hello,]|
| 3| [, 22]|
+---+--------+
Here we have all our data preserved, and with a comprehensive schema that accounts for all the data. "a": true was cast as a string to match the schema of "a": "hello".
The main downside of using spark.read.json() is that Spark will scan through all your data to derive the schema. Depending on how much data you have, that overhead could be significant. If you know that all your JSON data has a consistent schema, it's fine to go ahead and just use schema_of_json() against a single element. If you have schema variability but don't want to scan through all your data, you can set samplingRatio to something less than 1.0 in your call to spark.read.json() to look at a subset of the data.
Here are the docs for spark.read.json(): Scala API / Python API
The from_json function is exactly what you're looking for. Your code will look something like:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
//You can define whatever struct type that your json states
val schema = StructType(Seq(
StructField("key", StringType, true),
StructField("value", DoubleType, true)
))
df.withColumn("jsonData", from_json(col("jsonData"), schema))
underlying JSON String is
"{ \"column_name1\":\"value1\",\"column_name2\":\"value2\",\"column_name3\":\"value3\",\"column_name5\":\"value5\"}";
Below is the script to filter the JSON and load the required data in to Cassandra.
sqlContext.read.json(rdd).select("column_name1 or fields name in Json", "column_name2","column_name2")
.write.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "Table_name", "keyspace" -> "Key_Space_name"))
.mode(SaveMode.Append)
.save()
I use the following
(available since 2.2.0, and i am assuming that your json string column is at column index 0)
def parse(df: DataFrame, spark: SparkSession): DataFrame = {
val stringDf = df.map((value: Row) => value.getString(0), Encoders.STRING)
spark.read.json(stringDf)
}
It will automatically infer the schema in your JSON. Documented here:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameReader.html

Spark RDD to Dataframe with schema specifying

It seems that spark can't apply schema (differs from String) for DataFrame when converting it from RDD through Row object. I've tried both on Spark 1.4 and 1.5 version.
Snippet (Java API):
JavaPairInputDStream<String, String> directKafkaStream = KafkaUtils.createDirectStream(jssc, String.class, String.class,
StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet);
directKafkaStream.foreachRDD(rdd -> {
rdd.foreach(x -> System.out.println("x._1() = " + x._1()));
rdd.foreach(x -> System.out.println("x._2() = " + x._2()));
JavaRDD<Row> rowRdd = rdd.map(x -> RowFactory.create(x._2().split("\t")));
rowRdd.foreach(x -> System.out.println("x = " + x));
SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
StructField id = DataTypes.createStructField("id", DataTypes.IntegerType, true);
StructField name = DataTypes.createStructField("name", DataTypes.StringType, true);
List<StructField> fields = Arrays.asList(id, name);
StructType schema = DataTypes.createStructType(fields);
DataFrame sampleDf = sqlContext.createDataFrame(rowRdd, schema);
sampleDf.printSchema();
sampleDf.show();
return null;
});
jssc.start();
jssc.awaitTermination();
It produces following output if specify DataTypes.StringType for "id" field:
x._1() = null
x._2() = 1 item1
x = [1,item1]
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
+---+-----+
| id| name|
+---+-----+
| 1|item1|
+---+-----+
For specified code it throws error:
x._1() = null
x._2() = 1 item1
x = [1,item1]
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
15/09/16 04:13:33 ERROR JobScheduler: Error running job streaming job 1442402013000 ms.0
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:40)
at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getInt(rows.scala:220)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$IntConverter$.toScalaImpl(CatalystTypeConverters.scala:358)
Similar issue was on Spark Confluence, but it marked as resolved for 1.3 ver.
You are mixing two different things - data types and DataFrame schema. When you create Row like this:
RowFactory.create(x._2().split("\t"))
you get Row(_: String, _: String) but your schema states that you have Row(_: Integer, _: String). Since there is no automatic type conversion you get the error.
To make it work you can either cast values when you create rows or define id as a StringType and use Column.cast method after you've created a DataFrame.

Resources