Read parquet into spark dataset ignoring missing fields [duplicate] - apache-spark

This question already has an answer here:
Spark 2.0 implicit encoder, deal with missing column when type is Option[Seq[String]] (scala)
(1 answer)
Closed 5 years ago.
Let's assume I create a parquet file as follows:
case class A (i:Int,j:Double,s:String)
var l1 = List(A(1,2.0,"s1"),A(2,3.0,"S2"))
val ds = spark.createDataset(l1)
ds.write.parquet("/tmp/test.parquet")
Is it possible to read it into a Dataset of a type with a different schema, where the only difference is a few additional fields?
Eg:
case class B (i:Int,j:Double,s:String,d:Double=1.0) // d is extra and has a default value
Is there a way I can make this work?
val ds2 = spark.read.parquet("/tmp/test.parquet").as[B]

In Spark, if the schema of the Dataset does not match the desired type U, you can use select along with alias or as to rearrange or rename as required. This means that for the following code to work:
val ds2 = spark.read.parquet("/tmp/test.parquet").as[B]
the following modification needs to be made:
val ds2 = spark.read.parquet("/tmp/test.parquet").withColumn("d", lit(1D)).as[B]
Or, if creating an additional column is not possible, the following can be done:
val ds2 = spark.read.parquet("/tmp/test.parquet").map { row =>
  B(row.getInt(0), row.getDouble(1), row.getString(2))
}
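For completeness, here is a minimal end-to-end sketch of the withColumn approach, assuming a local SparkSession named spark and the case classes A and B defined above; the app name and output path are just placeholders:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("parquet-missing-field").master("local[*]").getOrCreate()
import spark.implicits._

// Write the narrow schema (type A), then read it back as the wider type B,
// filling in the missing column d with its default value.
Seq(A(1, 2.0, "s1"), A(2, 3.0, "S2")).toDS.write.mode("overwrite").parquet("/tmp/test.parquet")

val ds2 = spark.read.parquet("/tmp/test.parquet")
  .withColumn("d", lit(1.0))
  .as[B]

ds2.show()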

Related

How to pass more than one column as a parameter to Spark dataframe

I want to pass more than one column name as a parameter to a DataFrame window operation.
val readData = spark.sqlContext
  .read.format("csv")
  .option("delimiter", ",")
  .schema(Schema)
  .load("emp.csv")

val cols_list1 = "emp_id,emp_dt"
val cols_list2 = "emp_num"

val RemoveDupli_DF = readData
  .withColumn("rnk", row_number().over(Window.partitionBy(s"$cols_list1").orderBy(s"$cols_list2")))
The above code works if I have one column name, whereas with two or more columns it gives the error below.
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'emp_id,emp_dt'
I am using Scala 2.x.
The partitionBy method has multiple signatures:
def partitionBy(colName: String, colNames: String*)
// or
def partitionBy(cols: Column*)
Your code provides the list of columns as a single string, which fails because there is no column called emp_id,emp_dt. Hence, you get the error message.
You could define your column names (as Strings) in a collection:
val cols_seq1 = Seq("emp_id","emp_dt")
and then call partitionBy like this:
Window.partitionBy(cols_seq1.map(col): _*)   // col is org.apache.spark.sql.functions.col
Mapping each name through col turns it into a Column, and the : _* ascription tells the compiler to pass each element of the resulting sequence as its own argument to partitionBy (matching its Column* overload) rather than all of it as a single argument.
As an alternative you could also just use:
Window.partitionBy("emp_id", "emp_dt")

Spark infer schema with limit during a read.csv

I'd like to infer a Spark.DataFrame schema from a directory of CSV files using a small subset of the rows (say limit(100)).
However, setting inferSchema to True means that the Input Size / Records for the FileScanRDD seems to always be equal to the number of rows in all the CSV files.
Is there a way to make the FileScan more selective, such that Spark looks at fewer rows when inferring a schema?
Note: setting the samplingRatio option to be < 1.0 does not have the desired behaviour, though it is clear that inferSchema uses only the sampled subset of rows.
You could read a subset of your input data into a Dataset of String.
The csv method allows you to pass this as a parameter.
Here is a simple example (I'll leave reading the sample of rows from the input file to you):
val data = List("1,2,hello", "2,3,what's up?")
val csvRDD = sc.parallelize(data)
val df = spark.read.option("inferSchema","true").csv(csvRDD.toDS)
df.schema
When run in spark-shell, the final line from the above prints (I reformatted it for readability):
res4: org.apache.spark.sql.types.StructType =
StructType(
StructField(_c0,IntegerType,true),
StructField(_c1,IntegerType,true),
StructField(_c2,StringType,true)
)
This is the correct schema for my limited input data set.
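For the part the answer leaves to the reader, one possible way to build that sample from real input files, assuming the CSVs live under a hypothetical path /data/csv and that 100 rows are enough for inference:
// Take the first 100 lines as plain text, then let the CSV reader infer the
// schema from just those lines instead of scanning every file.
// Note: csv(Dataset[String]) requires Spark 2.2+.
val sampleLines = spark.read.textFile("/data/csv").limit(100)     // Dataset[String]
val sampleDF = spark.read.option("inferSchema", "true").csv(sampleLines)
val inferredSchema = sampleDF.schema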
Assuming you are only interested in the schema, here is a possible approach based on cipri.l's post in this link
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.execution.datasources.csv.{CSVOptions, TextInputCSVDataSource}

def inferSchemaFromSample(sparkSession: SparkSession, fileLocation: String, sampleSize: Int, isFirstRowHeader: Boolean): StructType = {
  // Build a Dataset composed of the first sampleSize lines from the input files as plain text strings
  val dataSample: Array[String] = sparkSession.read.textFile(fileLocation).head(sampleSize)
  import sparkSession.implicits._
  val sampleDS: Dataset[String] = sparkSession.createDataset(dataSample)
  // Provide information about the CSV files' structure
  val firstLine = dataSample.head
  val extraOptions = Map("inferSchema" -> "true", "header" -> isFirstRowHeader.toString)
  val csvOptions: CSVOptions = new CSVOptions(extraOptions, sparkSession.sessionState.conf.sessionLocalTimeZone)
  // Infer the CSV schema based on the sample data
  val schema = TextInputCSVDataSource.inferFromDataset(sparkSession, sampleDS, Some(firstLine), csvOptions)
  schema
}
Unlike GMc's answer above, this approach infers the schema directly, the same way DataFrameReader.csv() does in the background, but without building an additional Dataset with that schema only to read the schema back from it.
The schema is inferred based on a Dataset[String] containing only the first sampleSize lines from the input files as plain text strings.
When retrieving samples from data, Spark has only two types of methods:
1. Methods that retrieve a given percentage of the data. This operation takes random samples from all partitions. It benefits from higher parallelism, but it must read all the input files.
2. Methods that retrieve a specific number of rows. This operation must collect the data on the driver, but it could read a single partition (if the required row count is low enough).
Since you mentioned you want to use a specific small number of rows and you want to avoid touching all the data, I provided a solution based on option 2.
PS: The DataFrameReader.textFile method accepts paths to files and folders, and it also has a varargs variant, so you could pass in one or more files or folders.
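A hypothetical usage of the helper above: infer the schema from the first 100 lines only, then reuse it for the full read so the whole dataset is not scanned just for inference (the /data/csv path and the options are placeholders):
val schema = inferSchemaFromSample(spark, "/data/csv", sampleSize = 100, isFirstRowHeader = true)

val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("/data/csv")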

How to convert from dataframe to RDD and back with a case class [duplicate]

I am trying to convert a dataframe of multiple case classes to an RDD of these case classes. I can't find any solution. This WrappedArray has driven me crazy :P
For example, assuming I am having the following:
case class randomClass(a:String,b: Double)
case class randomClass2(a:String,b: Seq[randomClass])
case class randomClass3(a:String,b:String)
val anRDD = sc.parallelize(Seq(
  (randomClass2("a", Seq(randomClass("a1", 1.1), randomClass("a2", 1.1))), randomClass3("aa", "aaa")),
  (randomClass2("b", Seq(randomClass("b1", 1.2), randomClass("b2", 1.2))), randomClass3("bb", "bbb")),
  (randomClass2("c", Seq(randomClass("c1", 3.2), randomClass("c2", 1.2))), randomClass3("cc", "Ccc"))
))
val aDF = anRDD.toDF()
Assuming I have aDF, how can I get anRDD back?
I tried something like this just to get the second column, but it gives an error:
aDF.map { case r:Row => r.getAs[randomClass3]("_2")}
You can convert indirectly using Dataset[randomClass3]:
aDF.select($"_2.*").as[randomClass3].rdd
A Spark DataFrame / Dataset[Row] represents data as Row objects using the mapping described in the Spark SQL, DataFrames and Datasets Guide. Any call to getAs should use this mapping.
For the second column, which is struct<a: string, b: string>, it would be a Row as well:
aDF.rdd.map { _.getAs[Row]("_2") }
As commented by Tzach Zohar, to get back a full RDD you'll need:
aDF.as[(randomClass2, randomClass3)].rdd
I don't know the Scala API, but have you considered the rdd value?
Maybe something like:
aDF.rdd.map { case r: Row => r.getAs[randomClass3]("_2") }
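Pulling the accepted pieces together, a minimal round-trip sketch, assuming a SparkSession named spark and the case classes from the question:
import org.apache.spark.rdd.RDD
import spark.implicits._

// DataFrame -> typed Dataset -> RDD of the original tuple type
val backToRdd: RDD[(randomClass2, randomClass3)] = aDF.as[(randomClass2, randomClass3)].rdd

// Or just the second element, typed as randomClass3
val secondOnly: RDD[randomClass3] = aDF.select($"_2.*").as[randomClass3].rdd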

withcolumn() in spark for adding new column is not showing result [duplicate]

This question already has answers here:
Create new Dataframe with empty/null field values
(2 answers)
Closed 4 years ago.
I am trying to add a new column using withColumn whose value should be NULL, but it's not working.
import org.apache.spark.sql.types._

val schema = StructType(
  StructField("uid", StringType, true) ::
  StructField("sid", StringType, true) ::
  StructField("astid", StringType, true) ::
  StructField("timestamp", StringType, true) ::
  StructField("start", StringType, true) ::
  StructField("end", StringType, true) ::
  StructField("geo", StringType, true) ::
  StructField("stnid", StringType, true) ::
  StructField("end_type", LongType, true) ::
  StructField("like", LongType, true) ::
  StructField("dislike", LongType, true) :: Nil
)
val Mobpath = spark.read.schema(schema).csv("/data/mob.txt")
Mobpath.printSchema()
Mobpath.createOrReplaceTempView("Mobpathsql")
val showall = spark.sql("select * from Mobpathsql")
showall.show()
val newcol = Mobpath.withColumn("new1",functions.lit("null"))
newcol.show()
Using withColumn, it shows no error but also no output.
What about this:
val newcol = showall.withColumn("new1",functions.lit("null"))
newcol.show()
I just tested the above code and it worked; I don't know why it does not work with Mobpath.
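Note that lit("null") produces the literal string "null", not a SQL NULL. If the goal is a genuinely null column, a typed null literal is the usual pattern; a minimal sketch, assuming the showall DataFrame from the question:
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// Adds a column whose values are real NULLs (typed as string), not the text "null"
val withNullCol = showall.withColumn("new1", lit(null).cast(StringType))
withNullCol.show()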

Partitioning by multiple columns in Spark SQL

With Spark SQL's window functions, I need to partition by multiple columns to run my data queries, as follows:
val w = Window.partitionBy($"a").partitionBy($"b").rangeBetween(-100, 0)
I currently do not have a test environment (working on setting this up), but as a quick question: is this currently supported as part of Spark SQL's window functions, or will this not work?
This won't work. The second partitionBy will overwrite the first one. Both partition columns have to be specified in the same call:
val w = Window.partitionBy($"a", $"b").rangeBetween(-100, 0)
If you are using the columns in several places where you do partitionBy, you can put the names in a list, turn them into Columns, and pass that list directly as the argument to partitionBy:
import org.apache.spark.sql.functions.col

val partitioncolumns = List("a", "b").map(col)
val w = Window.partitionBy(partitioncolumns: _*).rangeBetween(-100, 0)
The : _* at the end expands the list into varargs, which is the argument form the Column* overload of partitionBy takes, so your code works the way you want.
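For completeness, a small sketch of the combined window in use, with hypothetical columns a, b, ts and value on an assumed DataFrame df; note that rangeBetween also needs an orderBy on a single sortable column before the frame can be evaluated:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._

// Partition by both columns, order by ts, and sum value over the last 100
// units of ts up to and including the current row.
val w = Window
  .partitionBy($"a", $"b")
  .orderBy($"ts")
  .rangeBetween(-100, 0)

val result = df.withColumn("rolling_sum", sum($"value").over(w))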
