Spark - How to convert a text file into a multi-column DataFrame/Dataset - apache-spark

I am trying to read a text file and convert it into a DataFrame.
val inputDf: DataFrame = spark.read.text(filePath.get.concat("/").concat(fileName.get))
  .map((row) => row.toString().split(","))
  .map(attributes => {
    Row(attributes(0), attributes(1), attributes(2), attributes(3), attributes(4))
  }).as[Row]
When I run inputDf.printSchema, I get a single column:
root
|-- value: binary (nullable = true)
How can I convert this text file into a multi-column DataFrame/Dataset?

Solved:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import spark.implicits._

val inputSchema: StructType = StructType(
  List(
    StructField("1", StringType, true),
    StructField("2", StringType, true),
    StructField("3", StringType, true),
    StructField("4", StringType, true),
    StructField("5", StringType, true)
  )
)
val encoder = RowEncoder(inputSchema)
val inputDf: DataFrame = spark.read.text(filePath.get.concat("/").concat(fileName.get))
  .map((row) => row.toString().split(","))
  .map(attributes => {
    Row(attributes(0), attributes(1), attributes(2), attributes(3), "BUY")
  })(encoder)
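Note that if the file is really just comma-delimited text, the CSV reader can split the lines and apply the schema in one step, with no manual Row mapping or encoder. A minimal sketch, assuming the same five string columns and no header row:

val inputDf: DataFrame = spark.read
  .schema(inputSchema)          // reuse the StructType defined above
  .option("header", "false")    // assumption: the file has no header row
  .csv(filePath.get.concat("/").concat(fileName.get))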

Related

Spark SQL: adding column comments with withComment does not work

I want to add comments (remarks) to a DataFrame's columns and then write it to a Hive table, but it does not work; that is to say, the table's comments are never added.
I tried Spark 2.4 and Spark 3, and it does not work in either, though lower versions seem to work. I don't know why; I tried reading the source code but found nothing. If you know why, please tell me, thank you.
The code is as follows:
val personRDD: RDD[Row] = GetTestRDD.map((line: String) => {
  val arr: Array[String] = line.split(" ")
  Row(arr(0).toInt, arr(1), arr(2).toInt)
})
val schema: StructType = StructType(List(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)
))
val frame: DataFrame = sparkSession.createDataFrame(personRDD, schema)
println("schema before adding comments")
frame.schema.foreach((s: StructField) => println(s.name, s.metadata))
// process after adding the comments
val commentMap: Map[String, String] = Map("id" -> "unique identifier", "name" -> "name", "age" -> "age")
val newSchema: Seq[StructField] = frame.schema.map((s: StructField) => {
  println(commentMap(s.name))
  s.withComment(commentMap(s.name))
})
sparkSession.createDataFrame(frame.rdd, StructType(newSchema)).repartition(10)
println("schema after adding comments")
frame.schema.foreach((s: StructField) => println(s.name, s.metadata))
The output:
schema before adding comments
(id,{})
(name,{})
(age,{})
schema after adding comments
(id,{})
(name,{})
(age,{})
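One observation on the snippet above (an assumption about the cause, not a confirmed fix): the DataFrame built with the commented schema is never assigned, and the final println inspects the original frame again, whose schema of course still has empty metadata. A minimal sketch of keeping and inspecting the new frame instead:

// keep the DataFrame built from the commented schema instead of discarding it
val commented: DataFrame = sparkSession.createDataFrame(frame.rdd, StructType(newSchema)).repartition(10)
// inspect the new frame's schema, not the original frame's
commented.schema.foreach((s: StructField) => println(s.name, s.metadata))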

Spark DataFrame returns NULL for the entire row when one column value of that row is NULL

Input data -
{"driverId":1,"driverRef":"hamilton","number":44,"code":"HAM","name":{"forename":"Lewis","surname":"Hamilton"},"dob":"1985-01-07","nationality":"British","url":"http://en.wikipedia.org/wiki/Lewis_Hamilton"}
{"driverId":2,"driverRef":"heidfeld","number":"\\N","code":"HEI","name":{"forename":"Nick","surname":"Heidfeld"},"dob":"1977-05-10","nationality":"German","url":"http://en.wikipedia.org/wiki/Nick_Heidfeld"}
{"driverId":3,"driverRef":"rosberg","number":6,"code":"ROS","name":{"forename":"Nico","surname":"Rosberg"},"dob":"1985-06-27","nationality":"German","url":"http://en.wikipedia.org/wiki/Nico_Rosberg"}
{"driverId":4,"driverRef":"alonso","number":14,"code":"ALO","name":{"forename":"Fernando","surname":"Alonso"},"dob":"1981-07-29","nationality":"Spanish","url":"http://en.wikipedia.org/wiki/Fernando_Alonso"}
{"driverId":5,"driverRef":"kovalainen","number":"\\N","code":"KOV","name":{"forename":"Heikki","surname":"Kovalainen"},"dob":"1981-10-19","nationality":"Finnish","url":"http://en.wikipedia.org/wiki/Heikki_Kovalainen"}
{"driverId":6,"driverRef":"nakajima","number":"\\N","code":"NAK","name":{"forename":"Kazuki","surname":"Nakajima"},"dob":"1985-01-11","nationality":"Japanese","url":"http://en.wikipedia.org/wiki/Kazuki_Nakajima"}
{"driverId":7,"driverRef":"bourdais","number":"\\N","code":"BOU","name":{"forename":"Sébastien","surname":"Bourdais"},"dob":"1979-02-28","nationality":"French","url":"http://en.wikipedia.org/wiki/S%C3%A9bastien_Bourdais"}
After reading this data into a Spark DataFrame and displaying it, I can see that the entire row for driverId 2, 5, 6 and 7 is NULL; in the raw data, only the number column is "\N" for those driver ids.
Here is my code. Any mistakes here?
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

name_field = StructType(fields=[
    StructField("forename", StringType(), True),
    StructField("surname", StringType(), True)
])
driver_schema = StructType(fields=[
    StructField("driverId", IntegerType(), False),
    StructField("driverRef", StringType(), True),
    StructField("number", IntegerType(), True),
    StructField("code", StringType(), True),
    StructField("name", name_field),
    StructField("dob", DateType(), True),
    StructField("nationality", StringType(), True),
    StructField("url", StringType(), True)
])
driver_df = spark.read\
    .schema(driver_schema)\
    .json('dbfs:/mnt/databrickslearnf1azure/raw/drivers.json')
driver_df.printSchema()
root
|-- driverId: integer (nullable = true)
|-- driverRef: string (nullable = true)
|-- number: integer (nullable = true)
|-- code: string (nullable = true)
|-- name: struct (nullable = true)
| |-- forename: string (nullable = true)
| |-- surname: string (nullable = true)
|-- dob: date (nullable = true)
|-- nationality: string (nullable = true)
|-- url: string (nullable = true)
display(driver_df)
This happens because, with a user-supplied schema, Spark's default PERMISSIVE parse mode treats the whole record as malformed when any field fails to parse ("\N" is not a valid integer) and nulls out the entire row. You can change your initial schema as follows, treating number as a string:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

name_field = StructType(fields=[
    StructField("forename", StringType(), True),
    StructField("surname", StringType(), True)
])
driver_schema = StructType(fields=[
    StructField("driverId", IntegerType(), False),
    StructField("driverRef", StringType(), True),
    StructField("number", StringType(), True),
    StructField("code", StringType(), True),
    StructField("name", name_field),
    StructField("dob", DateType(), True),
    StructField("nationality", StringType(), True),
    StructField("url", StringType(), True)
])
Then you can read the data from the JSON file using the same code as before:
driver_df = spark.read\
    .schema(driver_schema)\
    .json('dbfs:/mnt/databrickslearnf1azure/raw/drivers.json')
driver_df.printSchema()
Once you have read the data, you can convert "\N" to a real null and then change the column's data type from string to integer as below:
from pyspark.sql.functions import when

# replace the "\N" placeholder with a real null, then cast the column to integer
df = driver_df.withColumn("number", when(driver_df.number == "\\N", None).otherwise(driver_df.number))
finaldf = df.withColumn("number", df.number.cast(IntegerType()))
finaldf.printSchema()
Now if you do display or show on the dataframe, the rows appear fully populated, with null in the number column for the "\N" entries.
You are seeing this because, according to the official Databricks docs:
Cause: Spark 3.0 and above (Databricks Runtime 7.3 LTS and above) cannot parse JSON arrays as structs.
Solution: Pass the schema as ArrayType instead of StructType.
driver_schema = ArrayType(StructType(fields=[
    StructField("driverId", IntegerType(), False),
    StructField("driverRef", StringType(), True),
    StructField("number", IntegerType(), True),
    StructField("code", StringType(), True),
    StructField("name", name_field),
    StructField("dob", DateType(), True),
    StructField("nationality", StringType(), True),
    StructField("url", StringType(), True)
]))

Spark withColumn changes column nullable property in schema

I'm using withColumn to override a certain column (applying the same value to the entire data frame); my problem is that withColumn changes the nullable property of the column:
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.lit
val schema = StructType(Array(
  StructField("id", StringType, true),
  StructField("name", StringType, true)
))
val data = Seq(Row("1", "pepsi"), Row("2", "coca cola"))
val rdd = spark.sparkContext.parallelize(data)
val df = spark.createDataFrame(rdd, schema)
val df2 = df.withColumn("name", lit("*******"))
df2.printSchema
result:
root
|-- id: string (nullable = true)
|-- name: string (nullable = false)
The best idea I have is to change the schema after the manipulation; I was wondering if someone has a better idea.
Thanks!
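For what it's worth, here is a minimal sketch of the schema-rebuild idea mentioned above, with df2 being the frame returned by withColumn as in the snippet: copy the schema, relax nullable on the overridden column, and recreate the DataFrame.

// force nullable = true back onto the column overridden by withColumn
val relaxedSchema = StructType(df2.schema.map {
  case f if f.name == "name" => f.copy(nullable = true)
  case f => f
})
val df3 = spark.createDataFrame(df2.rdd, relaxedSchema)
df3.printSchema // name is nullable = true again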

How to programmatically generate a StructType with StringType for all fields in Spark?

I have n fields (something like 200-300), and I want the StructType to use StringType for all of them. Is there a simple way, like the one below?
val schema = StructType(schemaString.split(" ").map(fieldName ⇒ StructField(fieldName, StringType, true)))
Below is the code I tried:
StructType schema = new StructType()
    .add("field1", StringType)
    .add("field2", StringType)
    .add("field3", StringType);
ExpressionEncoder<Row> express = RowEncoder.apply(schema);
You can use pattern matching:
import org.apache.spark.sql.types._
import spark.implicits._

val df = Seq(
  (1L, BigDecimal(12.34), "a", BigDecimal(10.001)),
  (2L, BigDecimal(56.78), "b", BigDecimal(20.002))
).toDF("c1", "c2", "c3", "c4")

val newSchema = df.schema.fields.map {
  case StructField(name, _: DecimalType, nullable, _)
    => StructField(name, DoubleType, nullable)
  case field => field
}
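Note that swapping the StructField types alone does not convert the underlying data; one way to apply the remapped schema is to cast each column, and the same idiom can force every field to StringType, which is what the question asks for. A minimal sketch:

import org.apache.spark.sql.functions.col

// apply the remapped schema by casting each column to its new type
val converted = df.select(newSchema.map(f => col(f.name).cast(f.dataType)): _*)

// or, per the original question: cast every column to StringType
val allStrings = df.select(df.columns.map(c => col(c).cast(StringType)): _*)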

Spark: create a nested schema

With Spark:
import spark.implicits._

val data = Seq(
  (1, ("value11", "value12")),
  (2, ("value21", "value22")),
  (3, ("value31", "value32"))
)
val df = data.toDF("id", "v1")
df.printSchema()
The result is the following:
root
|-- id: integer (nullable = false)
|-- v1: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: string (nullable = true)
Now if I want to create the schema myself, how should I proceed?
val schema = StructType(Array(
  StructField("id", IntegerType),
  StructField("nested", ???)
))
Thanks.
According to the example here:
https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/types/StructType.html
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val innerStruct =
  StructType(
    StructField("f1", IntegerType, true) ::
    StructField("f2", LongType, false) ::
    StructField("f3", BooleanType, false) :: Nil)

val struct = StructType(
  StructField("a", innerStruct, true) :: Nil)

// Create a Row with the schema defined by struct
val row = Row(Row(1, 2, true))
And in your case it will be:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val schema = StructType(Array(
  StructField("id", IntegerType),
  StructField("nested", StructType(Array(
    StructField("value1", StringType),
    StructField("value2", StringType)
  )))
))
Output:
StructType(
StructField(id,IntegerType,true),
StructField(nested,StructType(
StructField(value1,StringType,true),
StructField(value2,StringType,true)
),true)
)
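If it helps, here is a minimal sketch of using that schema to rebuild the same DataFrame from explicit Rows; nested structs are passed as nested Rows:

// rows whose nesting mirrors the schema's struct
val rows = Seq(
  Row(1, Row("value11", "value12")),
  Row(2, Row("value21", "value22")),
  Row(3, Row("value31", "value32"))
)
val df2 = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
df2.printSchema()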
