Not able to convert from RDD to Dataset successfully at runtime [duplicate] - apache-spark

This question already has answers here:
How to convert a simple DataFrame to a DataSet Spark Scala with case class?
(2 answers)
Closed 4 years ago.
I am trying to run this small Spark program (Spark version 2.1.1):
val rdd = sc.parallelize(List((2012, "Tesla", "S"), (1997, "Ford", "E350"), (2015, "Chevy", "Volt")))
import spark.implicits._
val carDetails: Dataset[CarDetails] = spark.createDataset(rdd).as[CarDetails] // Error Line
carDetails.map(car => {
val name = if (car.name == "Tesla") "S" else car.name
CarDetails(car.year, name, car.model)
}).collect().foreach(print)
It is throwing error on this line:
val carDetails: Dataset[CarDetails] = spark.createDataset(rdd).as[CarDetails]
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`year`' given input columns: [_1, _2, _3];
There is no compilation error!
I tried many changes, such as using a List instead of an RDD, and converting to a Dataset first and then calling as[CarDetails], but nothing worked. Now I am clueless.
Why is it taking the columns as _1, _2 and _3 when I have already given the case class?
case class CarDetails(year: Int, name: String, model: String)
I tried changing year from Int to Long in the case class. It still did not work.
Edit:
I changed this line after referring to the probable duplicate question, and it worked:
val carDetails: Dataset[CarDetails] = spark.createDataset(rdd)
.withColumnRenamed("_1","year")
.withColumnRenamed("_2","name")
.withColumnRenamed("_3","model")
.as[CarDetails]
But I am still not clear on why I need to rename the columns even after explicitly mapping to a case class.

The rules of as conversion are explained in detail in the API docs:
The method used to map columns depends on the type of U:
When U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
When U is a primitive type (i.e. String, Int, etc), then the first column of the DataFrame will be used.
If the schema of the Dataset does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.
To explain this with code: conversion from a case class to a Tuple* is valid (fields are matched structurally):
Seq(CarDetails(2012, "Tesla", "S")).toDF.as[(Int, String, String)]
but conversion from a Tuple* to an arbitrary case class is not (fields are matched by name). You have to rename the fields first:
Seq((2012, "Tesla", "S")).toDF("year", "name", "model").as[CarDetails]
This has some quite interesting practical implications. A tuple-typed Dataset cannot contain extraneous fields:
case class CarDetailsWithColor(
year: Int, name: String, model: String, color: String)
Seq(
CarDetailsWithColor(2012, "Tesla", "S", "red")
).toDF.as[(Int, String, String)]
// org.apache.spark.sql.AnalysisException: Try to map struct<year:int,name:string,model:string,color:string> to Tuple3, but failed as the number of fields does not line up.;
while a case-class-typed Dataset can:
Seq(
(2012, "Tesla", "S", "red")
).toDF("year", "name", "model", "color").as[CarDetails]
Of course, starting with the case-class-typed variant would save you all the trouble:
sc.parallelize(Seq(CarDetails(2012, "Tesla", "S"))).toDS
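Applied to the RDD from the question, that means mapping the tuples into the case class before creating the Dataset (a minimal sketch, assuming spark.implicits._ is in scope as in the question):
val carDetails: Dataset[CarDetails] = rdd
  .map { case (year, name, model) => CarDetails(year, name, model) }
  .toDS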

Related

Problem with selecting a JSON object in PySpark, which may sometimes have null values

I have a big nested JSON object from which I need to make a DataFrame. One of the inner JSON elements sometimes comes in empty and sometimes comes with values in it.
I am giving a simple example here:
When it is filled:
{"student_address": {"Door Number":"1234",
"Place":"xxxx",
"Zip code":"12345"}}
When it is empty:
{"student_address":""}
So, in the final DataFrame I want all three columns: Door Number, Place and Zip code. When the address is empty, I should put null values in the respective columns, and fill them in when there is data.
The code I tried:
test = test.withColumn("place",when(col("student_address") == "", lit(None)).otherwise(col("student_address.place")))\
.withColumn("door_num",when(col("student_address") == "",lit(None)).otherwise(col("student_address.door_num")))\
.withColumn("zip_code",when(col("student_address") == "", lit(None)).otherwise(col("student_address.zip_code")))
So, I am trying to check whether the value is empty or not.
This is the error I am getting:
AnalysisException: Can't extract value from student_address#34: need struct type but got string
I am not able to understand why PySpark evaluates the expression in otherwise even when the condition in when is met. (I tried giving simple values in otherwise instead of the nested path, and it worked.)
I am struggling to understand what is happening here and would like to know if there is any simple way to do this.
val addressSchema = StructType(StructField("Place", StringType, false) :: Nil) // add the remaining fields ("Door Number", "Zip code") the same way
val schema = StructType(StructField("student_address", addressSchema, true) :: Nil) // the point is that student_address is nullable
val df = spark.read.schema(schema).json("example.json")
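With that schema the nested field can then be selected directly (a sketch, assuming the default PERMISSIVE read mode, where a value such as "" that does not match the struct schema comes back as null):
import org.apache.spark.sql.functions.col

// A null student_address struct yields null for its fields, so no when/otherwise is needed
df.select(col("student_address").getField("Place").alias("place")).show()
// extract "Door Number" and "Zip code" the same way once they are added to addressSchema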
Alternatively, without the schema, use get_json_object on the raw student_address string, with the JSON path pointing at the field you want:
df.withColumn("place", when(col("student_address") === "", lit(null)).otherwise(get_json_object(col("student_address"), "$.Place")))

How to efficiently select distinct rows on an RDD based on a subset of its columns

Consider a Case Class:
case class Prod(productId: String, date: String, qty: Int, many other attributes ..)
And an
val rdd: RDD[Prod]
containing many instances of that class.
The unique key is intended to be the (productId,date) tuple. However we do have some duplicates.
Is there any efficient means to remove the duplicates?
The operation
rdd.distinct
would look for entire rows that are duplicated.
A fallback would involve joining the unique (productId, date) combinations back to the full rows; I am working through exactly how to do this. But even so, it takes several operations. A simpler (and perhaps faster) approach would be useful if it exists.
I'd use dropDuplicates on Dataset:
val rdd = sc.parallelize(Seq(
Prod("foo", "2010-01-02", 1), Prod("foo", "2010-01-02", 2)
))
rdd.toDS.dropDuplicates("productId", "date")
but reduceByKey should work as well:
rdd.keyBy(prod => (prod.productId, prod.date)).reduceByKey((x, _) => x).values

Can a DataFrame be converted to Dataset of a case class if a column name contains a space?

I have a Spark DataFrame where a column name contains a space. Is it possible to convert these rows into case classes?
For example, if I do this:
val data = Seq(1, 2, 3).toDF("a number")
case class Record(`a number`: Int)
data.as[Record]
I get this exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`a$u0020number`' given input columns: [a number];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
...
Is there any way to do this?
(Of course I can work around this by renaming the column before converting to a case class. I was hoping to have the case class match the input schema exactly.)
Can you try this solution? It worked for me without changing the column name.
import sqlContext.implicits._
case class Record(`a number`: Int)
val data = Seq(1, 2, 3)
val recDF = data.map(x => Record(x)).toDF()
recDF.collect().foreach(println)
[1]
[2]
[3]
I'm using Spark 1.6.0. The only part of your code that doesn't work for me is the part where you're setting up your test data. I have to use a sequence of tuples instead of a sequence of integers:
case class Record(`a number`:Int)
val data = Seq(Tuple1(1),Tuple1(2),Tuple1(3)).toDF("a number")
data.as[Record]
// returns org.apache.spark.sql.Dataset[Record] = [a$u0020number: int]
If you need a DataFrame instead of a Dataset, you can always call toDF again:
data.as[Record].toDF

Defining DataFrame Schema for a table with 1500 columns in Spark

I have a table with around 1500 columns in SQL Server. I need to read the data from this table, convert it to the proper datatype format and then insert the records into Oracle DB.
What is the best way to define the schema for this type of table with more than 1500 columns? Is there any option other than hard-coding the column names along with the datatypes?
Using Case class
Using StructType.
Spark Version used is 1.4
For this type of requirement, I'd suggest the case class approach to prepare a DataFrame.
Yes, there are some limitations, such as product arity, but we can overcome them.
For Scala versions below 2.11, you can do it like the example below:
Prepare a class which extends Product and overrides its methods, like so:
productArity(): Int returns the number of attributes. In our case, it's 33.
productElement(n: Int): Any returns the attribute at a given index. As protection, we also have a default case, which throws an IndexOutOfBoundsException.
canEqual(that: Any): Boolean is the last of the three methods; it serves as a boundary condition when an equality check is being done against the class (see the sketch below).
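A reduced sketch of that pattern (three fields instead of 33; the class and field names here are illustrative, not from the linked example):
class Student(val id: Int, val name: String, val marks: Double)
  extends Product with Serializable {

  // Number of attributes; for the 33-field class this would return 33
  override def productArity: Int = 3

  // Return the attribute at a given index, with a guard for bad indices
  override def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => name
    case 2 => marks
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }

  // Boundary condition for equality checks against this class
  override def canEqual(that: Any): Boolean = that.isInstanceOf[Student]
}
The same shape extends to the 33-field case, and an RDD of such Product instances should then work with sqlContext.createDataFrame in the same way a case class does.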
For a full example implementation, you can refer to a Student class which has 33 fields in it, along with an example student dataset description.
Another option: use StructType to define the schema and create the DataFrame (if you don't want to use the spark-csv API).
The options for reading a table with 1500 columns:
1) Using a case class
A case class would not work, because it is limited to 22 fields (for Scala versions < 2.11).
2) Using StructType
You can use StructType to define the schema and create the DataFrame.
3) Using the spark-csv package
You can use the spark-csv package with .option("inferSchema", "true"), which will automatically infer the schema from the file.
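For example, with the Databricks spark-csv package on Spark 1.x (a sketch; the file path is a placeholder):
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/path/to/table_extract.csv")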
You can keep your schema, with its hundreds of columns, in JSON format, and then read that JSON file to construct your custom schema.
For example, your schema JSON could be:
[
{
"columnType": "VARCHAR",
"columnName": "NAME",
"nullable": true
},
{
"columnType": "VARCHAR",
"columnName": "AGE",
"nullable": true
},
.
.
.
]
Now you can read the JSON and parse it into a case class to form the StructType.
case class Field(columnName: String, columnType: String, nullable: Boolean)
You can create a Map from the column type strings in the JSON schema to the corresponding Spark DataTypes:
val dataType = Map(
"VARCHAR" -> StringType,
"NUMERIC" -> LongType,
"TIMESTAMP" -> TimestampType,
.
.
.
)
import scala.io.Source
import org.apache.spark.sql.types._
import org.json4s._
import org.json4s.jackson.JsonMethods._

def parseJsonForSchema(jsonFilePath: String): StructType = {
  implicit val formats = DefaultFormats
  val jsonString = Source.fromFile(jsonFilePath).mkString
  val parsedJson = parse(jsonString)
  val fields = parsedJson.extract[List[Field]]
  val schemaColumns = fields.map(field => StructField(field.columnName, dataType(field.columnType), field.nullable))
  StructType(schemaColumns)
}
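A sketch of how the resulting StructType could then be applied when reading the data (the paths are placeholders, and this assumes the spark-csv package mentioned above):
val schema = parseJsonForSchema("/path/to/table_schema.json")
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(schema)
  .load("/path/to/table_extract.csv")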

Define UDF in Spark Scala

I need to use a UDF in Spark that takes in a timestamp, an integer and another DataFrame, and returns a tuple of 3 values.
I keep hitting error after error, and I'm not sure I'm going about fixing it the right way anymore.
Here is the function:
def determine_price (view_date: org.apache.spark.sql.types.TimestampType , product_id: Int, price_df: org.apache.spark.sql.DataFrame) : (Double, java.sql.Timestamp, Double) = {
var price_df_filtered = price_df.filter($"mkt_product_id" === product_id && $"created"<= view_date)
var price_df_joined = price_df_filtered.groupBy("mkt_product_id").agg("view_price" -> "min", "created" -> "max").withColumn("last_view_price_change", lit(1))
var price_df_final = price_df_joined.join(price_df_filtered, price_df_joined("max(created)") === price_df_filtered("created")).filter($"last_view_price_change" === 1)
var result = (price_df_final.select("view_price").head().getDouble(0), price_df_final.select("created").head().getTimestamp(0), price_df_final.select("min(view_price)").head().getDouble(0))
return result
}
val det_price_udf = udf(determine_price)
the error it gives me is:
error: missing argument list for method determine_price
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `determine_price _` or `determine_price(_,_,_)` instead of `determine_price`.
If I start adding the arguments, I keep running into other errors, such as "Int expected, Int.type found" or "object DataFrame is not a member of package org.apache.spark.sql".
To give some context:
The idea is that I have one dataframe of prices (with a product ID and a creation date), and another dataframe containing product IDs and view dates.
I need to determine the price based on the last created price entry that is older than the view date.
Since each product ID has multiple view dates in the second dataframe, I thought a UDF would be faster than a cross join. If anyone has a different idea, I'd be grateful.
You cannot pass a DataFrame into a UDF, because the UDF runs on the workers, against a particular partition. And just as you cannot use an RDD on a worker (Is it possible to create nested RDDs in Apache Spark?), you cannot use a DataFrame on a worker either.
You need to work around this, for example by expressing the price lookup as a join, as in the sketch below.
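A sketch of that join-based workaround, assuming the second DataFrame is named views_df with columns product_id and view_date (the price_df column names are taken from the question):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// For each (product_id, view_date) pair, keep the most recently created
// price entry that is older than the view date.
val w = Window.partitionBy("product_id", "view_date").orderBy(col("created").desc)

val lastPriceBeforeView = views_df
  .join(price_df, col("product_id") === col("mkt_product_id") && col("created") <= col("view_date"))
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")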
