How to add a schema to a Dataset in Spark?

I am trying to load a file into Spark.
If I load a plain text file into Spark like this:
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
The outcome is:
partFile: org.apache.spark.sql.Dataset[String] = [value: string]
I get a Dataset in the output. But if I load a JSON file:
val pfile = spark.read.json("hdfs://quickstart:8020/user/cloudera/pjson")
The outcome is a DataFrame with a ready-made schema:
pfile: org.apache.spark.sql.DataFrame = [address: struct<city: string, state: string>, age: bigint ... 1 more field]
JSON, Parquet, and ORC files carry a schema, so I understand why Spark 2.x returns a DataFrame directly in that case, while a plain text file yields a Dataset[String] with no real schema. That makes sense.
What I'd like to know is how I can add a schema to a Dataset that results from loading a text file into Spark. For an RDD, there is the case class/StructType option to add the schema and convert it to a DataFrame.
Could anyone let me know how I can do it?

When you use textFile, each line of the file becomes a string row in your Dataset. To convert it to a DataFrame with a named column, you can use toDF:
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
import spark.implicits._
val df = partFile.toDF("string_column")
In this case, the DataFrame will have a schema with a single column of type StringType.
If your file contains a more complex schema, you can either use the csv reader (if the file is in a structured csv format):
val partFile = spark.read.option("header", "true").option("delimiter", ";").csv("hdfs://quickstart:8020/user/cloudera/partfile")
Or you can process your Dataset using map and then call toDF to convert it to a DataFrame. For example, suppose you want one column to be the first character of the line (as an Int) and the other column to be the fourth character (also as an Int):
import org.apache.spark.sql.Dataset
import spark.implicits._ // provides the encoders that map needs, so import it before calling map
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
// Note: Char.toInt returns the character code; use asDigit if you need the numeric value of a digit
val processedDataset: Dataset[(Int, Int)] = partFile.map {
line: String => (line(0).toInt, line(3).toInt)
}
val df = processedDataset.toDF("value0", "value3")
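One thing to watch in the snippet above: calling toInt on a Char returns its character code, not the digit it displays. If the characters are digits and you want their numeric value, asDigit is probably what you need. A small plain-Scala sketch (the sample line is made up for illustration):

```scala
// Hypothetical sample line; only the first and fourth characters matter here
val line = "7ab3"

// Char.toInt yields the Unicode code point of the character
val codes = (line(0).toInt, line(3).toInt)

// Char.asDigit yields the numeric value of a digit character
val digits = (line(0).asDigit, line(3).asDigit)

println(codes)
println(digits)
```

Which of the two you want depends on what the columns are supposed to mean, so pick deliberately before defining the schema.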
Also, you can define a case class that represents the final schema of your DataFrame:
import org.apache.spark.sql.Dataset
import spark.implicits._ // must be in scope before map so the MyRow encoder can be found
case class MyRow(value0: Int, value3: Int)
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
val processedDataset: Dataset[MyRow] = partFile.map {
line: String => MyRow(line(0).toInt, line(3).toInt)
}
val df = processedDataset.toDF
In both cases above, calling df.printSchema would show:
root
|-- value0: integer (nullable = true)
|-- value3: integer (nullable = true)
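The question also mentions the StructType option. A minimal sketch of that route, assuming a comma-separated line layout: go through the Dataset's underlying RDD, build a Row per line, and pass an explicit schema to createDataFrame. The column names, the sample lines, and the local master are all assumptions made for illustration, not part of the original setup.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Local session for the sketch; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("schema-sketch").getOrCreate()
import spark.implicits._

// Stand-in for spark.read.textFile(...): a Dataset[String] with one line per element
val lines = Seq("alice,30", "bob,25").toDS()

// Explicit schema (hypothetical column names)
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// Parse each line into a Row whose values match the schema's types
val rows = lines.rdd.map { line =>
  val parts = line.split(",")
  Row(parts(0), parts(1).trim.toInt)
}

val df = spark.createDataFrame(rows, schema)
df.printSchema()
```

This is mostly useful when the schema is only known at runtime; when it is known at compile time, the case class version above is usually shorter.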

Related

How to parse RDD to Dataframe

I'm trying to parse an RDD[Seq[String]] into a DataFrame.
Although it's a Seq of Strings, the values could have a more specific type such as Int, Boolean, Double, String, and so on.
For example, a line could be:
"hello", "1", "bye", "1.1"
"hello1", "11", "bye1", "2.1"
...
Another execution could have a different number of columns.
Within one execution the column types are fixed: for example, the first column is always a String, the second an Int, and so on. The number of columns, however, depends on the execution: one could have Seqs of five elements and another 2000. For each execution, the names and types of the columns are defined.
To do it, I could have something like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// I could have a parameter to generate the StructType dynamically.
def getSchema(): StructType = {
val schemaArray = scala.collection.mutable.ArrayBuffer[StructField]()
schemaArray += StructField("col1", IntegerType, true)
schemaArray += StructField("col2", StringType, true)
schemaArray += StructField("col3", DoubleType, true)
StructType(schemaArray.toSeq)
}
// A Seq of Any?? It doesn't seem like the best option!
val l1: Seq[Any] = Seq(1, "2", 1.1)
// Wrap l1 in an outer Seq so that each RDD element is one whole row
val rdd1 = sc.parallelize(Seq(l1)).map(Row.fromSeq(_))
val schema = getSchema()
val df = sqlContext.createDataFrame(rdd1, schema)
df.show()
df.schema
I don't like having a Seq of Any at all, but it's really what I have. Is there another option?
On the other hand, since what I have is similar to a CSV, I could create one. Spark has a library that reads a CSV and returns a DataFrame with inferred types. Is it possible to call it if I already have an RDD[String]?
Since the number of columns changes for each execution, I would suggest going with the CSV option, with the delimiter set to a space or something else. This way Spark will figure out the column types for you.
Update:
Since you mentioned that you read the data from HBase, one way to go is to convert each HBase row to JSON or CSV and then convert the RDD to a DataFrame:
val jsons = hbaseContext.hbaseRDD(tableName, scan).map{case (_, r) =>
val currentJson = new JSONObject
val cScanner = r.cellScanner
while (cScanner.advance) {
currentJson.put(Bytes.toString(cScanner.current.getQualifierArray, cScanner.current.getQualifierOffset, cScanner.current.getQualifierLength),
Bytes.toString(cScanner.current.getValueArray, cScanner.current.getValueOffset, cScanner.current.getValueLength))
}
currentJson.toString
}
val df = spark.read.json(spark.createDataset(jsons))
A similar thing can be done for CSV.
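For the CSV variant, a rough sketch under some assumptions: Spark 2.2 or later (where DataFrameReader.csv accepts a Dataset[String]), values that contain no commas or quotes, and a local session. The sample rows are the ones from the question; the variable names are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("csv-infer-sketch").getOrCreate()
import spark.implicits._

// Stand-in for the real RDD[Seq[String]]; values taken from the question's example
val rdd = spark.sparkContext.parallelize(Seq(
  Seq("hello", "1", "bye", "1.1"),
  Seq("hello1", "11", "bye1", "2.1")
))

// Join each row into one CSV line (assumes the values contain no commas or quotes)
val csvLines = rdd.map(_.mkString(",")).toDS()

// Let the CSV reader infer the column types from the data
val df = spark.read.option("inferSchema", "true").csv(csvLines)
df.printSchema()
```

Schema inference costs an extra pass over the data, so with thousands of columns it may be worth checking whether building the StructType yourself is cheaper.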

How to get the value of the location for a Hive table using a Spark object?

I am interested in being able to retrieve the location value of a Hive table given a Spark object (SparkSession). One way to obtain this value is by parsing the output of the location via the following SQL query:
describe formatted <table name>
I was wondering if there is another way to obtain the location value without having to parse the output. An API would be great in case the output of the above command changes between Hive versions. If an external dependency is needed, which would it be? Is there some sample spark code that can obtain the location value?
Here is the correct answer:
import org.apache.spark.sql.catalyst.TableIdentifier
lazy val tblMetadata = spark.sessionState.catalog.getTableMetadata(new TableIdentifier(tableName,Some(schema)))
You can also call .toDF on the result of desc formatted and then filter the resulting DataFrame.
DataFrame API:
scala> :paste
spark.sql("desc formatted data_db.part_table")
.toDF //convert to dataframe will have 3 columns col_name,data_type,comment
.filter('col_name === "Location") //filter on colname
.collect()(0)(1)
.toString
Result:
String = hdfs://nn:8020/location/part_table
(or)
RDD Api:
scala> :paste
spark.sql("desc formatted data_db.part_table")
.collect()
.filter(r => r(0).equals("Location")) //filter on r(0) value
.map(r => r(1)) //get only the location
.mkString //convert as string
.split("8020")(1) //change the split based on your namenode port..etc
Result:
String = /location/part_table
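Splitting on the namenode port is fragile, since the port can change or appear elsewhere in the path. A small alternative sketch, assuming the location is a well-formed URI string like the one shown above, is to let java.net.URI do the parsing:

```scala
import java.net.URI

// Location string as returned by "desc formatted" in the example above
val location = "hdfs://nn:8020/location/part_table"

// URI.getPath drops the scheme and the host:port authority
val path = new URI(location).getPath
println(path)
```

The same call works unchanged for file:/ locations, where there is no port to split on.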
First approach
You can use input_file_name with a DataFrame. It gives you the absolute file path of a part file:
import org.apache.spark.sql.functions.input_file_name
spark.read.table("zen.intent_master").select(input_file_name()).take(1)
You can then extract the table path from it.
Second approach
This is more of a hack:
package org.apache.spark.sql.hive
import java.net.URI
import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
import org.apache.spark.sql.catalyst.parser.ParserInterface
import org.apache.spark.sql.internal.{SessionState, SharedState}
import org.apache.spark.sql.SparkSession
class TableDetail {
def getTableLocation(table: String, spark: SparkSession): URI = {
val sessionState: SessionState = spark.sessionState
val sharedState: SharedState = spark.sharedState
val catalog: SessionCatalog = sessionState.catalog
val sqlParser: ParserInterface = sessionState.sqlParser
val client = sharedState.externalCatalog match {
case catalog: HiveExternalCatalog => catalog.client
case _: InMemoryCatalog => throw new IllegalArgumentException("In Memory catalog doesn't " +
"support hive client API")
}
val idtfr = sqlParser.parseTableIdentifier(table)
require(catalog.tableExists(idtfr), s"$idtfr does not exist")
val rawTable = client.getTable(idtfr.database.getOrElse("default"), idtfr.table)
rawTable.location
}
}
Here is how to do it in PySpark:
(spark.sql("desc formatted mydb.myschema")
.filter("col_name=='Location'")
.collect()[0].data_type)
Use this as a reusable function in your Scala project:
import org.apache.spark.sql.{DataFrame, SparkSession}
def getHiveTablePath(tableName: String, spark: SparkSession): String =
{
import org.apache.spark.sql.functions._
val sql: String = String.format("desc formatted %s", tableName)
val result: DataFrame = spark.sql(sql).filter(col("col_name") === "Location")
result.show(false) // just for debug purpose
val info: String = result.collect().mkString(",")
val path: String = info.split(',')(1)
path
}
The caller would be:
println(getHiveTablePath("src", spark)) // you can prefix schema if you have
Result (I ran it locally, so the path starts with file:/; on HDFS it would start with hdfs://):
+--------+------------------------------------+-------+
|col_name|data_type                           |comment|
+--------+------------------------------------+-------+
|Location|file:/Users/hive/spark-warehouse/src|       |
+--------+------------------------------------+-------+
file:/Users/hive/spark-warehouse/src
Use ExternalCatalog
scala> spark
res15: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@4eba6e1f
scala> val metastore = spark.sharedState.externalCatalog
metastore: org.apache.spark.sql.catalyst.catalog.ExternalCatalog = org.apache.spark.sql.hive.HiveExternalCatalog@24b05292
scala> val location = metastore.getTable("meta_data", "mock").location
location: java.net.URI = hdfs://10.1.5.9:4007/usr/hive/warehouse/meta_data.db/mock

Spark load files collection in batch and find the line from each file with additional info from file level

I have a collection of files specified as a comma-separated list, like:
hdfs://user/cloudera/date=2018-01-15,hdfs://user/cloudera/date=2018-01-16,hdfs://user/cloudera/date=2018-01-17,hdfs://user/cloudera/date=2018-01-18,hdfs://user/cloudera/date=2018-01-19,hdfs://user/cloudera/date=2018-01-20,hdfs://user/cloudera/date=2018-01-21,hdfs://user/cloudera/date=2018-01-22
and I'm loading the files with Apache Spark, all at once, with:
val input = sc.textFile(files)
Also, I have additional information associated with each file - the unique ID, for example:
File ID
--------------------------------------------------
hdfs://user/cloudera/date=2018-01-15 | 12345
hdfs://user/cloudera/date=2018-01-16 | 09245
hdfs://user/cloudera/date=2018-01-17 | 345hqw4
and so on
As output, I need a DataFrame in which each row carries the ID of the file that its line was read from.
Is it possible to pass this information to Spark in some way so that it can be associated with the lines?
A Spark SQL approach with a UDF (you can achieve the same thing with a join if you represent the File -> ID mapping as a DataFrame):
import org.apache.spark.sql.functions
val inputDf = sparkSession.read.text(".../src/test/resources/test")
.withColumn("fileName", functions.input_file_name())
def withId(mapping: Map[String, String]) = functions.udf(
(file: String) => mapping.get(file)
)
val mapping = Map(
"file:///.../src/test/resources/test/test1.txt" -> "id1",
"file:///.../src/test/resources/test/test2.txt" -> "id2"
)
val resultDf = inputDf.withColumn("id", withId(mapping)(inputDf("fileName")))
resultDf.show(false)
Result:
+-----+---------------------------------------------+---+
|value|fileName |id |
+-----+---------------------------------------------+---+
|row1 |file:///.../src/test/resources/test/test1.txt|id1|
|row11|file:///.../src/test/resources/test/test1.txt|id1|
|row2 |file:///.../src/test/resources/test/test2.txt|id2|
|row22|file:///.../src/test/resources/test/test2.txt|id2|
+-----+---------------------------------------------+---+
test1.txt:
row1
row11
test2.txt:
row2
row22
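The join alternative mentioned at the top of this answer could look roughly like the sketch below. To keep it self-contained it uses an in-memory DataFrame in place of spark.read.text(...) plus input_file_name(), and the file names and IDs are made-up values; the local master is also an assumption.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Local session for the sketch; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("file-id-join-sketch").getOrCreate()
import spark.implicits._

// Stand-in for spark.read.text(...).withColumn("fileName", input_file_name())
val inputDf = Seq(
  ("row1", "test1.txt"),
  ("row11", "test1.txt"),
  ("row2", "test2.txt"),
  ("row3", "test3.txt")
).toDF("value", "fileName")

// File -> ID mapping represented as a DataFrame (hypothetical values)
val mappingDf = Seq(("test1.txt", "id1"), ("test2.txt", "id2")).toDF("fileName", "id")

// Left join keeps lines from files without a known ID; their id column is null.
// broadcast hints that the small mapping table should be shipped to every executor.
val resultDf = inputDf.join(broadcast(mappingDf), Seq("fileName"), "left")
resultDf.show(false)
```

Compared with the UDF, the join keeps the mapping visible to the query optimizer and avoids serializing a Map into a closure, which may matter when the mapping is large.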
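This could help (not tested):
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{StringType, StructField, StructType}
// read single text file into DataFrame and add 'id' column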
def readOneFile(filePath: String, fileId: String)(implicit spark: SparkSession): DataFrame = {
val dfOriginal: DataFrame = spark.read.text(filePath)
val dfWithIdColumn: DataFrame = dfOriginal.withColumn("id", lit(fileId))
dfWithIdColumn
}
// read all text files into DataFrame
def readAllFiles(filePathIdsSeq: Seq[(String, String)])(implicit spark: SparkSession): DataFrame = {
// create empty DataFrame with expected schema
val emptyDfSchema: StructType = StructType(List(
StructField("value", StringType, false),
StructField("id", StringType, false)
))
val emptyDf: DataFrame = spark.createDataFrame(
rowRDD = spark.sparkContext.emptyRDD[Row],
schema = emptyDfSchema
)
val unionDf: DataFrame = filePathIdsSeq.foldLeft(emptyDf) { (intermediateDf: DataFrame, filePathIdTuple: (String, String)) =>
intermediateDf.union(readOneFile(filePathIdTuple._1, filePathIdTuple._2))
}
unionDf
}

Syntax for using toEpochDate with a Dataframe with Spark Scala - elegantly

Deriving the epoch day is nice and easy with an RDD:
val rdd2 = rdd.map(x => (x._1, x._2, x._3,
LocalDate.parse(x._2.toString).toEpochDay, LocalDate.parse(x._3.toString).toEpochDay))
All fields in the RDD are Strings. This gives the desired result, for example:
...(Mike,2018-09-25,2018-09-30,17799,17804), ...
Doing the same with a String column in a DataFrame seems too tricky for me, and I would like to see something elegant, if possible. Something like this, and variations of it, do not work:
val df2 = df.withColumn("s", $"start".LocalDate.parse.toString.toEpochDay)
Get:
notebook:50: error: value LocalDate is not a member of org.apache.spark.sql.ColumnName
I understand the error, but what is an elegant way of doing the conversion?
You can define to_epoch_day as datediff since the beginning of the epoch:
import org.apache.spark.sql.functions.{datediff, lit, to_date}
import org.apache.spark.sql.Column
def to_epoch_day(c: Column) = datediff(c, to_date(lit("1970-01-01")))
and apply it directly on a Column:
df.withColumn("s", to_epoch_day(to_date($"start")))
As long as the string format complies with ISO 8601, you could even skip the explicit conversion (datediff will do it implicitly):
df.withColumn("s", to_epoch_day($"start"))
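The reason the datediff trick works is that an epoch day is, by definition, the number of days between 1970-01-01 and the date. A plain-Scala sketch of the same arithmetic with java.time (the helper name toEpochDayViaDiff is made up for illustration), checked against the values from the question:

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Start of the epoch, the same reference date that to_epoch_day passes to datediff
val epoch = LocalDate.of(1970, 1, 1)

// Days between the epoch and the given ISO 8601 date string
def toEpochDayViaDiff(s: String): Long =
  ChronoUnit.DAYS.between(epoch, LocalDate.parse(s))

println(toEpochDayViaDiff("2018-09-25"))
println(LocalDate.parse("2018-09-25").toEpochDay)
```

Both lines print the same number, which is why the built-in datediff can stand in for LocalDate.toEpochDay without a UDF.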
$"start" is of type ColumnName, not String, so you cannot call LocalDate.parse on it directly.
You will need to define a UDF.
Example below:
scala> import java.time._
import java.time._
scala> def toEpochDay(s: String) = LocalDate.parse(s).toEpochDay
toEpochDay: (s: String)Long
scala> val toEpochDayUdf = udf(toEpochDay(_: String))
toEpochDayUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,LongType,Some(List(StringType)))
scala> val df = List("2018-10-28").toDF
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.withColumn("s", toEpochDayUdf($"value")).collect
res0: Array[org.apache.spark.sql.Row] = Array([2018-10-28,17832])

Spark.ml DataFrame containing SparseVector

I have a spark.ml DataFrame that contains many columns, each of these columns containing a SparseVector per row. I would like to apply MultivariateStatisticalSummary.colStats to each column, and colStats signature is:
def colStats(X: RDD[Vector]): MultivariateStatisticalSummary
which seems perfect... except that I can't seem to select a column from that DataFrame and get it to be a RDD[Vector]. Here is my attempt:
val df: DataFrame = data.select(shardId)
val col = df.as[(org.apache.spark.mllib.linalg.Vector)].rdd
val s: MultivariateStatisticalSummary = Statistics.colStats(col)
which doesn't compile with the message (in Scala):
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
val col = df.as[(org.apache.spark.mllib.linalg.Vector)].rdd
I also tried:
val df = data.select(shardId)
val col: RDD[Vector] = df.map(x => x.asInstanceOf[org.apache.spark.mllib.linalg.Vector])
val s: MultivariateStatisticalSummary = Statistics.colStats(col)
which fails at runtime with error:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to org.apache.spark.mllib.linalg.Vector
How can I bridge the gap between DataFrame and colStats?
I found the answer after all:
val df = data.select(shardId)
val col: RDD[Vector] = df.map { _.get(0).asInstanceOf[org.apache.spark.mllib.linalg.Vector] }
val s: MultivariateStatisticalSummary = Statistics.colStats(col)
The trick was only to extract the first element of each row before casting it.
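On Spark 2.x two details differ from the snippet above, so here is a hedged sketch for that case: df.map returns a Dataset rather than an RDD, so going through df.rdd is simpler, and a spark.ml DataFrame usually holds org.apache.spark.ml.linalg vectors, which Statistics.colStats does not accept directly. The sketch assumes ml vectors and converts them with Vectors.fromML; the column name, the sample vectors, and the local master are illustrative.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("colstats-sketch").getOrCreate()
import spark.implicits._

// Hypothetical single-column DataFrame of spark.ml sparse vectors
val df = Seq(
  Tuple1(MLVectors.sparse(3, Array(0), Array(1.0))),
  Tuple1(MLVectors.sparse(3, Array(0, 2), Array(3.0, 4.0)))
).toDF("features")

// Take the first field of each Row and convert the ml vector to an mllib vector
val col: RDD[OldVector] = df.rdd.map(row => OldVectors.fromML(row.getAs[MLVector](0)))

val summary = Statistics.colStats(col)
println(summary.mean)
```

If the column already holds mllib vectors, as in the original question, the fromML conversion is unnecessary and row.getAs[org.apache.spark.mllib.linalg.Vector](0) should be enough.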
