Trim csv file before importing into a Spark Dataset - apache-spark

I've seen this post about how to specify a schema for creating a Dataset:
Spark Scala: Cannot up cast from string to int as it may truncate
val spark = SparkSession.builder()
.master("local")
.appName("test")
.getOrCreate()
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Record].schema
val ds = spark.read
.option("header", "true")
.schema(schema) // passing schema
.option("timestampFormat", "MM/dd/yyyy HH:mm") // passing timestamp format
.csv(path)// csv path
.as[Record] // convert to DS
It works for me, but not when there are whitespaces in the csv. Is it possible to trim the csv in this same spark.read sequence?
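One possible approach, not taken from the linked post: Spark's CSV reader has ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options that strip whitespace around each value while parsing, so they fit into the same spark.read chain. A minimal sketch, assuming Spark 2.x or later; the Record fields and the temporary sample file are made up for illustration:

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical record type; replace with your own case class.
case class Record(name: String, age: Int)

val spark = SparkSession.builder()
  .master("local")
  .appName("trim-csv")
  .getOrCreate()
import spark.implicits._ // provides the encoder needed by .as[Record]

// Sample file with padded values, written to a temp location for the demo.
val path = Files.createTempFile("padded", ".csv")
Files.write(path, "name,age\n alice , 30 \n bob ,41\n".getBytes("UTF-8"))

val ds = spark.read
  .option("header", "true")
  .option("ignoreLeadingWhiteSpace", "true")  // drop spaces before each value
  .option("ignoreTrailingWhiteSpace", "true") // drop spaces after each value
  .schema(Encoders.product[Record].schema)
  .csv(path.toString)
  .as[Record]

val records = ds.collect().toList
```

Because the trimming happens inside the parser, the padded numeric values can be read as Int directly, without a separate trim step on string columns.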

Related

Spark Delta Table Add new columns in middle Schema Evolution

I have to ingest a file with a new column into an existing table structure.
create table sch.test (
name string ,
address string
) USING DELTA
--OPTIONS ('mergeSchema' 'true')
PARTITIONED BY (name)
LOCATION '/mnt/loc/fold'
TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true);
Code to read the file:
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("/mnt/loc/fold")
display(df)
File in path contains below data
name,address
raghu,india
raj,usa
On writing it to a table,
import org.apache.spark.sql.functions._
df.withColumn("az_insert_ts", current_timestamp())
.withColumn("exec_run_id",lit("233"))
.withColumn("az_inp_file_name",lit("24234filename"))
.coalesce(12)
.write
.mode("append")
.option("mergeSchema", "true")
.format("delta")
.saveAsTable("sch.test")
display(spark.read.table("sch.test"))
Adding a new column,
name,address,age
raghu,india,12
raj,usa,13
Read the file,
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("/mnt/loc/fold")
display(df)
While writing into the table using insertInto,
import org.apache.spark.sql.functions._
df.withColumn("az_insert_ts", current_timestamp())
.withColumn("exec_run_id",lit("233"))
.withColumn("az_inp_file_name",lit("24234filename"))
.coalesce(12)
.write
.mode("append")
.option("mergeSchema", "true")
.format("delta")
.insertInto("sch.test")
display(spark.read.table("sch.test"))
Getting the below error,
Setting overwriteSchema to true will wipe out the old schema and let you create a completely new table.
import org.apache.spark.sql.functions._
df.withColumn("az_insert_ts", current_timestamp())
.withColumn("exec_run_id",lit("233"))
.withColumn("az_inp_file_name",lit("24234filename"))
.coalesce(12)
.write
.mode("append")
.option("overwriteSchema", "true")
.format("delta")
.insertInto("sch.test")
display(spark.read.table("sch.test"))

Trouble using withColumn() when reading stream

I'm trying to read stream data with spark using the following code:
eventsDF = (
spark
.readStream
.schema(schema)
.option("header", "true")
.option("maxFilesPerTrigger", 1)
.withColumn("time", unix_timestamp("time")
.cast("double")
.cast("timestamp"))
.csv(inputPath)
)
But I'm getting the error:
'DataStreamReader' object has no attribute 'withColumn'
Is there an alternative for withColumn() in spark.readStream()? I just want to change the column type of my time column from string to timestamp.
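withColumn is a DataFrame method, not a DataStreamReader method. Try moving .withColumn to after the DataFrame is created, that is, after .csv:
from pyspark.sql.functions import unix_timestamp

eventsDF = (
spark
.readStream
.schema(schema)
.option("header", "true")
.option("maxFilesPerTrigger", 1)
.csv(inputPath)
.withColumn("time", unix_timestamp("time").cast("double").cast("timestamp"))
)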

Use hive on spark to merge small file

I would like to merge the output into files of about 128 MB each in Hive. In Spark, I have set the following properties, but it still doesn't work. Can someone give me a suggestion?
val spark = SparkSession.builder
.appName("MyExample")
.master("local[*]")
.enableHiveSupport()
.getOrCreate()
spark.sqlContext.setConf("hive.mapred.supports.subdirectories", "true")
spark.sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")
spark.sqlContext.setConf("hive.hadoop.supports.splittable.combineinputformat", "true")
spark.sqlContext.setConf("hive.exec.compress.output", "false")
spark.sqlContext.setConf("hive.input.format", "org.apache.hadoop.hive.ql.io.CombineHiveInputFormat")
spark.sqlContext.setConf("hive.merge.mapfiles", "true")
spark.sqlContext.setConf("hive.merge.mapredfiles", "true")
spark.sqlContext.setConf("hive.merge.size.per.task", "128000000")
spark.sqlContext.setConf("hive.merge.smallfiles.avgsize", "128000000")
spark.sqlContext.setConf("hive.groupby.skewindata", "true")
spark.sqlContext.setConf("hive.merge.sparkfiles", "true")
spark.sqlContext.setConf("hive.merge.mapfiles", "true")
val df = spark.read.format("csv")
.option("header", "false").load(path)
df.write.format("csv").saveAsTable("test_table")
You can either estimate or calculate the size of the DataFrame, as described in the post How to find spark RDD/Dataframe size?
Then compute the number of partitions and repartition before writing:
val nPartitions = math.ceil(sizeInMB / 128.0).toInt
df.repartition(nPartitions).write.format(....).saveAsTable(...)
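The partition arithmetic can be sketched on its own. This is only an illustration: partitionsFor and its targetMB parameter are hypothetical names, and sizeInMB is assumed to be an estimate you obtained elsewhere.

```scala
// Hypothetical helper: number of output files needed so that each one is
// roughly targetMB in size. Always returns at least 1, so an empty or tiny
// DataFrame still gets a valid partition count.
def partitionsFor(sizeInMB: Double, targetMB: Double = 128.0): Int =
  math.max(1, math.ceil(sizeInMB / targetMB).toInt)

val small = partitionsFor(100.0)  // less than one target file's worth of data
val large = partitionsFor(1000.0) // 1000 / 128 = 7.8125, rounded up
```

It would then be used as df.repartition(partitionsFor(sizeInMB)). Keep in mind that the in-memory size of a DataFrame and the size of the written files can differ, depending on format and compression, so the result is a starting point rather than a guarantee of 128 MB files.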

Spark read avro

Trying to read an avro file.
val df = spark.read.avro(file)
Running into Avro schema cannot be converted to a Spark SQL StructType: [ "null", "string" ]
Tried to manually create a schema, but now running into the following:
val s = StructType(List(StructField("value", StringType, nullable = true)))
val df = spark.read
.option("inferSchema", "false")
.schema(s)
.avro(file)
com.databricks.spark.avro.SchemaConverters$IncompatibleSchemaException: Cannot convert Avro schema to catalyst type because schema at path is not compatible (avroType = StructType(StructField(value,StringType,true)), sqlType = STRING).
Source Avro schema: ["null","string"].
Target Catalyst type: StructType(StructField(value,StringType,true))
Trying to override the avro schema (without the null) also does not work:
val df = spark.read
.option("inferSchema", "false")
.option("avroSchema", """["string"]""")
.avro(file)
Avro schema cannot be converted to a Spark SQL StructType: [ "string" ]
Looks like spark-avro only creates a GenericDatumReader[GenericRecord] and I need a GenericDatumReader[Utf8] :(
Please make sure you are providing the correct AVSC with the right data type.
["null", "string"] is there to allow null values in the Avro data.
You can create the schema of your Avro file with:
import java.io.File
import org.apache.avro.Schema
val schema = new Schema.Parser().parse(new File("user.avsc"))
Or, if you have a Java class generated from the schema (User below is a placeholder for that class), you can get the schema from it:
val schema = User.getClassSchema
Once you have the schema, it is simple to build a DataFrame with it.
Code snippet:
val df = sparkSession.read.format("com.databricks.spark.avro")
.option("avroSchema", schema.toString)
.load("/home/garvit.vijay/000009_0.avro")
df.printSchema()
df.show()
Hope it works for you.

Reading csv file as data frame in spark

I am new to Spark and have a CSV file with over 1500 columns. I would like to load it as a DataFrame in Spark, but I am not sure how to do this.
Thanks
Use this project https://github.com/databricks/spark-csv
There is an example from the front page:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
