error: overloaded method value createDataFrame - apache-spark

I tried to create Apache Spark dataframe
val valuesCol = Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
valuesCol: Seq[(String, String)] = List((Male,2019-09-06), (Female,2019-09-06), (Male,2019-09-07))
Schema
val someSchema = List(StructField("sex", StringType, true),StructField("date", DateType, true))
someSchema: List[org.apache.spark.sql.types.StructField] = List(StructField(sex,StringType,true), StructField(date,DateType,true))
It does not work
val someDF = spark.createDataFrame(spark.sparkContext.parallelize(valuesCol),StructType(someSchema))
I got error
<console>:30: error: overloaded method value createDataFrame with alternatives:
(data: java.util.List[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rows: java.util.List[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.rdd.RDD[(String, String)], org.apache.spark.sql.types.StructType)
val someDF = spark.createDataFrame(spark.sparkContext.parallelize(valuesCol),StructType(someSchema))
Should I change date formatting in valuesCol? What actually causes this error?

With import spark.implicits._ you could convert Seq into Dataframe in place
val df: DataFrame = Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
.toDF() // <--- Here
Explicitly setting column names:
val df: DataFrame = Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
.toDF("sex", "date")
For the desired schema, you could either cast column or use a different type
//Cast
Seq(("Male","2019-09-06"),("Female","2019-09-06"),("Male","2019-09-07"))
.toDF("sex", "date")
.select($"sex", $"date".cast(DateType))
.printSchema()
//Types
val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
Seq(
("Male", new java.sql.Date(format.parse("2019-09-06").getTime)),
("Female", new java.sql.Date(format.parse("2019-09-06").getTime)),
("Male", new java.sql.Date(format.parse("2019-09-07").getTime)))
.toDF("sex", "date")
.printSchema()
//Output
root
|-- sex: string (nullable = true)
|-- date: date (nullable = true)
Regarding your question, your rdd type is known, Spark will create schema accordingly to it.
val rdd: RDD[(String, String)] = spark.sparkContext.parallelize(valuesCol)
spark.createDataFrame(rdd)
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)

You can specify your valuesCol as Seq of Row instead of Seq of Tuple :
val valuesCol = Seq(
Row("Male", "2019-09-06"),
Row ("Female", "2019-09-06"),
Row("Male", "2019-09-07"))

Related

Connecting to HBase through Spark Phoenix Connector

I am trying to load a HBase Table through spark sql connector.
I am able to get the schema of the table
val port = s"${configuration.get(ZOOKEEPER_CLIENT_PORT, "2181")}"
val znode = s"${configuration.get(ZOOKEEPER_ZNODE_PARENT, "/hbase")}"
val zkUrl = s"${configuration.get(ZOOKEEPER_QUORUM, "localhost")}"
val url = s"jdbc:phoenix:$zkUrl:$port:$znode"
val props = new Properties()
val table ="SOME_Metrics_Test"
props.put("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
val df = spark.read.jdbc(url, getEscapedFullTableName(table), props)
If I do df.printSchema I can fetch the schema of the table
scala> df.printSchema
root
|-- PK: string (nullable = false)
|-- status: string (nullable = true)
|-- other_Status: string (nullable = true)
But when I do df.show I am getting this error :
org.apache.phoenix.schema.TableNotFoundException: ERROR 1012 (42M03): Table undefined. tableName=SOME_Metrics_Test
at org.apache.phoenix.query.ConnectionQueryServicesImpl.getAllTableRegions(ConnectionQueryServicesImpl.java:542)
at org.apache.phoenix.iterate.BaseResultIterators.getParallelScans(BaseResultIterators.java:480)
Any idea why this error is coming and what can I do to resolve it?
While starting the spark shell I have added phoenix-4.7.0-HBase-1.1-client-spark.jar and hbase-site.xml in the spark-shell command.
Try this,
val phoenixDF = spark.read.format("org.apache.phoenix.spark")
.option("table", "my_table")
.option("zkUrl", "0.0.0.0:2181")
.load()

How to filter all the rows in a dataframe_1 for each value in dataframe_2?

I have two spark dataframes: dataframe_1 with all the transactions of items and dataframe_2 with unique item ids. I want to filter all the rows from dataframe1 for each values in dataframe_2 .
I am trying to use a join query, but not sure how to proceed below is the main program written in scala:
def main(args: Array[String]) {
Logger.getLogger("org").setLevel(Level.ERROR)
val ss = SparkSession
.builder
.appName("Self_apriori_1").master("local[*]")
.getOrCreate()
import ss.implicits._
val df_1 = ss.read.option("inferSchema", value=true).csv("/home/balakumar/scala work files/matrimony_3.csv")
df_1.show()
df_1.printSchema()
val df_2 = ss.read.option("inferSchema" ,value=true) csv("/home/balakumar/scala work files/unique_1.txt")
df_2.show(false)
df_2.printSchema()
}
}
Below are the dataframe_1 and dataframe_2 output for show() command
for each values in dataframe_2, say (101), I require the following lines to be filtered where ever 101 is present
>|101|201| 301| 401|
>|101|201| 301|null|
>|101|201|null|null|
>|101|501|null|null|
Similarly for all the values in dataframe_2 I require all the rows to be filtered. Is it possible with join command?
>#dataframe_1
>+---+---+----+----+
> _c0|_c1| _c2| _c3
>+---+---+----+----+
>|101|201| 301| 401|
>|101|201| 301|null|
>|101|201|null|null|
>|101|501|null|null|
>|601|201| 301|null|
>|601|201|null|null|
>|601|501|null|null|
>+---+---+----+----+
>root
> |-- _c0: integer (nullable = true)
> |-- _c1: integer (nullable = true)
> |-- _c2: integer (nullable = true)
> |-- _c3: integer (nullable = true)
>#dataframe_2
>+---+
>|_c0|
>+---+
>|101|
>|201|
>|301|
>|401|
>|501|
>|601|

How to add a schema to a Dataset in Spark?

I am trying to load a file into spark.
If I load a normal textFile into Spark like below:
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
The outcome is:
partFile: org.apache.spark.sql.Dataset[String] = [value: string]
I can see a dataset in the output. But if I load a Json file:
val pfile = spark.read.json("hdfs://quickstart:8020/user/cloudera/pjson")
The outcome is a dataframe with a readymade schema:
pfile: org.apache.spark.sql.DataFrame = [address: struct<city: string, state: string>, age: bigint ... 1 more field]
The Json/parquet/orc files have schema. So I can understand that this is a feature from Spark version:2x, which made things easier as we directly get a DataFrame in this case and for a normal textFile you get a dataset where there is no schema which makes sense.
What I'd like to know is how can I add a schema to a dataset that is a resultant of loading a textFile into spark. For an RDD, there is case class/StructType option to add the schema and convert it to a DataFrame.
Could anyone let me know how can I do it ?
When you use textFile, each line of the file will be a string row in your Dataset. To convert to DataFrame with a schema, you can use toDF:
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
import sqlContext.implicits._
val df = partFile.toDF("string_column")
In this case, the DataFrame will have a schema of a single column of type StringType.
If your file contains a more complex schema, you can either use the csv reader (if the file is in a structured csv format):
val partFile = spark.read.option("header", "true").option("delimiter", ";").csv("hdfs://quickstart:8020/user/cloudera/partfile")
Or you can process your Dataset using map, then using toDF to convert to DataFrame. For example, suppose you want one column to be the first character of the line (as an Int) and the other column to be the fourth character (also as an Int):
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
val processedDataset: Dataset[(Int, Int)] = partFile.map {
line: String => (line(0).toInt, line(3).toInt)
}
import sqlContext.implicits._
val df = processedDataset.toDF("value0", "value3")
Also, you can define a case class, which will represent the final schema for your DataFrame:
case class MyRow(value0: Int, value3: Int)
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
val processedDataset: Dataset[MyRow] = partFile.map {
line: String => MyRow(line(0).toInt, line(3).toInt)
}
import sqlContext.implicits._
val df = processedDataset.toDF
In both cases above, calling df.printSchema would show:
root
|-- value0: integer (nullable = true)
|-- value3: integer (nullable = true)

How to parse Nested Json string from DynamoDB table in spark? [duplicate]

I have a Cassandra table that for simplicity looks something like:
key: text
jsonData: text
blobData: blob
I can create a basic data frame for this using spark and the spark-cassandra-connector using:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
I'm struggling though to expand the JSON data into its underlying structure. I ultimately want to be able to filter based on the attributes within the json string and return the blob data. Something like jsonData.foo = "bar" and return blobData. Is this currently possible?
Spark >= 2.4
If needed, schema can be determined using schema_of_json function (please note that this assumes that an arbitrary row is a valid representative of the schema).
import org.apache.spark.sql.functions.{lit, schema_of_json, from_json}
import collection.JavaConverters._
val schema = schema_of_json(lit(df.select($"jsonData").as[String].first))
df.withColumn("jsonData", from_json($"jsonData", schema, Map[String, String]().asJava))
Spark >= 2.1
You can use from_json function:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
val schema = StructType(Seq(
StructField("k", StringType, true), StructField("v", DoubleType, true)
))
df.withColumn("jsonData", from_json($"jsonData", schema))
Spark >= 1.6
You can use get_json_object which takes a column and a path:
import org.apache.spark.sql.functions.get_json_object
val exprs = Seq("k", "v").map(
c => get_json_object($"jsonData", s"$$.$c").alias(c))
df.select($"*" +: exprs: _*)
and extracts fields to individual strings which can be further casted to expected types.
The path argument is expressed using dot syntax, with leading $. denoting document root (since the code above uses string interpolation $ has to be escaped, hence $$.).
Spark <= 1.5:
Is this currently possible?
As far as I know it is not directly possible. You can try something similar to this:
val df = sc.parallelize(Seq(
("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
("2", """{"k": "bar", "v": 3.0}""", "some_other_field_2")
)).toDF("key", "jsonData", "blobData")
I assume that blob field cannot be represented in JSON. Otherwise you cab omit splitting and joining:
import org.apache.spark.sql.Row
val blobs = df.drop("jsonData").withColumnRenamed("key", "bkey")
val jsons = sqlContext.read.json(df.drop("blobData").map{
case Row(key: String, json: String) =>
s"""{"key": "$key", "jsonData": $json}"""
})
val parsed = jsons.join(blobs, $"key" === $"bkey").drop("bkey")
parsed.printSchema
// root
// |-- jsonData: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: double (nullable = true)
// |-- key: long (nullable = true)
// |-- blobData: string (nullable = true)
An alternative (cheaper, although more complex) approach is to use an UDF to parse JSON and output a struct or map column. For example something like this:
import net.liftweb.json.parse
case class KV(k: String, v: Int)
val parseJson = udf((s: String) => {
implicit val formats = net.liftweb.json.DefaultFormats
parse(s).extract[KV]
})
val parsed = df.withColumn("parsedJSON", parseJson($"jsonData"))
parsed.show
// +---+--------------------+------------------+----------+
// |key| jsonData| blobData|parsedJSON|
// +---+--------------------+------------------+----------+
// | 1|{"k": "foo", "v":...|some_other_field_1| [foo,1]|
// | 2|{"k": "bar", "v":...|some_other_field_2| [bar,3]|
// +---+--------------------+------------------+----------+
parsed.printSchema
// root
// |-- key: string (nullable = true)
// |-- jsonData: string (nullable = true)
// |-- blobData: string (nullable = true)
// |-- parsedJSON: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: integer (nullable = false)
zero323's answer is thorough but misses one approach that is available in Spark 2.1+ and is simpler and more robust than using schema_of_json():
import org.apache.spark.sql.functions.from_json
val json_schema = spark.read.json(df.select("jsonData").as[String]).schema
df.withColumn("jsonData", from_json($"jsonData", json_schema))
Here's the Python equivalent:
from pyspark.sql.functions import from_json
json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
df.withColumn("jsonData", from_json("jsonData", json_schema))
The problem with schema_of_json(), as zero323 points out, is that it inspects a single string and derives a schema from that. If you have JSON data with varied schemas, then the schema you get back from schema_of_json() will not reflect what you would get if you were to merge the schemas of all the JSON data in your DataFrame. Parsing that data with from_json() will then yield a lot of null or empty values where the schema returned by schema_of_json() doesn't match the data.
By using Spark's ability to derive a comprehensive JSON schema from an RDD of JSON strings, we can guarantee that all the JSON data can be parsed.
Example: schema_of_json() vs. spark.read.json()
Here's an example (in Python, the code is very similar for Scala) to illustrate the difference between deriving the schema from a single element with schema_of_json() and deriving it from all the data using spark.read.json().
>>> df = spark.createDataFrame(
... [
... (1, '{"a": true}'),
... (2, '{"a": "hello"}'),
... (3, '{"b": 22}'),
... ],
... schema=['id', 'jsonData'],
... )
a has a boolean value in one row and a string value in another. The merged schema for a would set its type to string. b would be an integer.
Let's see how the different approaches compare. First, the schema_of_json() approach:
>>> json_schema = schema_of_json(df.select("jsonData").take(1)[0][0])
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: boolean (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true]|
| 2| null|
| 3| []|
+---+--------+
As you can see, the JSON schema we derived was very limited. "a": "hello" couldn't be parsed as a boolean and returned null, and "b": 22 was just dropped because it wasn't in our schema.
Now with spark.read.json():
>>> json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: long (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true,]|
| 2|[hello,]|
| 3| [, 22]|
+---+--------+
Here we have all our data preserved, and with a comprehensive schema that accounts for all the data. "a": true was cast as a string to match the schema of "a": "hello".
The main downside of using spark.read.json() is that Spark will scan through all your data to derive the schema. Depending on how much data you have, that overhead could be significant. If you know that all your JSON data has a consistent schema, it's fine to go ahead and just use schema_of_json() against a single element. If you have schema variability but don't want to scan through all your data, you can set samplingRatio to something less than 1.0 in your call to spark.read.json() to look at a subset of the data.
Here are the docs for spark.read.json(): Scala API / Python API
The from_json function is exactly what you're looking for. Your code will look something like:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
//You can define whatever struct type that your json states
val schema = StructType(Seq(
StructField("key", StringType, true),
StructField("value", DoubleType, true)
))
df.withColumn("jsonData", from_json(col("jsonData"), schema))
underlying JSON String is
"{ \"column_name1\":\"value1\",\"column_name2\":\"value2\",\"column_name3\":\"value3\",\"column_name5\":\"value5\"}";
Below is the script to filter the JSON and load the required data in to Cassandra.
sqlContext.read.json(rdd).select("column_name1 or fields name in Json", "column_name2","column_name2")
.write.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "Table_name", "keyspace" -> "Key_Space_name"))
.mode(SaveMode.Append)
.save()
I use the following
(available since 2.2.0, and i am assuming that your json string column is at column index 0)
def parse(df: DataFrame, spark: SparkSession): DataFrame = {
val stringDf = df.map((value: Row) => value.getString(0), Encoders.STRING)
spark.read.json(stringDf)
}
It will automatically infer the schema in your JSON. Documented here:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameReader.html

Spark RDD to Dataframe with schema specifying

It seems that spark can't apply schema (differs from String) for DataFrame when converting it from RDD through Row object. I've tried both on Spark 1.4 and 1.5 version.
Snippet (Java API):
JavaPairInputDStream<String, String> directKafkaStream = KafkaUtils.createDirectStream(jssc, String.class, String.class,
StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet);
directKafkaStream.foreachRDD(rdd -> {
rdd.foreach(x -> System.out.println("x._1() = " + x._1()));
rdd.foreach(x -> System.out.println("x._2() = " + x._2()));
JavaRDD<Row> rowRdd = rdd.map(x -> RowFactory.create(x._2().split("\t")));
rowRdd.foreach(x -> System.out.println("x = " + x));
SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
StructField id = DataTypes.createStructField("id", DataTypes.IntegerType, true);
StructField name = DataTypes.createStructField("name", DataTypes.StringType, true);
List<StructField> fields = Arrays.asList(id, name);
StructType schema = DataTypes.createStructType(fields);
DataFrame sampleDf = sqlContext.createDataFrame(rowRdd, schema);
sampleDf.printSchema();
sampleDf.show();
return null;
});
jssc.start();
jssc.awaitTermination();
It produces following output if specify DataTypes.StringType for "id" field:
x._1() = null
x._2() = 1 item1
x = [1,item1]
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
+---+-----+
| id| name|
+---+-----+
| 1|item1|
+---+-----+
For specified code it throws error:
x._1() = null
x._2() = 1 item1
x = [1,item1]
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
15/09/16 04:13:33 ERROR JobScheduler: Error running job streaming job 1442402013000 ms.0
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:40)
at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getInt(rows.scala:220)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$IntConverter$.toScalaImpl(CatalystTypeConverters.scala:358)
Similar issue was on Spark Confluence, but it marked as resolved for 1.3 ver.
You are mixing two different things - data types and DataFrame schema. When you create Row like this:
RowFactory.create(x._2().split("\t"))
you get Row(_: String, _: String) but your schema states that you have Row(_: Integer, _: String). Since there is no automatic type conversion you get the error.
To make it work you can either cast values when you create rows or define id as a StringType and use Column.cast method after you've created a DataFrame.

Resources