Spark Delta table: add new columns in the middle (schema evolution) - Azure

I have to ingest a file with a new column into an existing table structure.
create table sch.test (
name string ,
address string
) USING DELTA
--OPTIONS ('mergeSchema' = 'true')
PARTITIONED BY (name)
LOCATION '/mnt/loc/fold'
TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true);
Code to read the file:
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("/mnt/loc/fold")
display(df)
The file in the path contains the below data:
name,address
raghu,india
raj,usa
On writing it to the table:
import org.apache.spark.sql.functions._
df.withColumn("az_insert_ts", current_timestamp())
.withColumn("exec_run_id",lit("233"))
.withColumn("az_inp_file_name",lit("24234filename"))
.coalesce(12)
.write
.mode("append")
.option("mergeSchema", "true")
.format("delta")
.saveAsTable("sch.test")
display(spark.read.table("sch.test"))
After adding a new column, the file contains:
name,address,age
raghu,india,12
raj,usa,13
Reading the file again:
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("/mnt/loc/fold")
display(df)
While writing into the table using insertInto,
import org.apache.spark.sql.functions._
df.withColumn("az_insert_ts", current_timestamp())
.withColumn("exec_run_id",lit("233"))
.withColumn("az_inp_file_name",lit("24234filename"))
.coalesce(12)
.write
.mode("append")
.option("mergeSchema", "true")
.format("delta")
.insertInto("sch.test")
display(spark.read.table("sch.test"))
I am getting the below error:

Setting overwriteSchema to true will wipe out the old schema and let you create a completely new table.
import org.apache.spark.sql.functions._
df.withColumn(""az_insert_ts"", current_timestamp())
.withColumn(""exec_run_id"",lit(""233""))
.withColumn(""az_inp_file_name"",lit(""24234filename""))
.coalesce(12)
.write
.mode(""append"")
.option(""overwriteSchema"", ""true"")
.format(""delta"")
.insertInto(""sch.test"")
display(spark.read.table(""sch.test""))

Related

Mapping Kafka to Spark dataFrame with the Schema

I have an application which runs queries on Kafka topics with the schema specified.
Below is my code:
SparkSession spark = SparkSession.builder()
.appName("Spark-Kafka-Integration")
.config("spark.master", "local")
.getOrCreate();
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "abc:9092,bcs:9092")
.option("subscribe","topic")
.option("auto.offset.reset", "latest")
.option("checkpointLocation", "/tmp")
.load();
// Mapping it to the schema
Dataset<Row> ds2 = df.select( from_json(col("value").cast("string") , Kafkaschema).as("rows"),col("timestamp"));
ds2.createOrReplaceTempView("ds2");
// Making a Row having timestamp and the values
Dataset<Row> ds3 = spark.sql("select rows.* , timestamp from ds2 ");
ds3.createOrReplaceTempView("table");
Dataset<Row> result2 = spark.sql(query.getQuery());
This runs fine. I now have a view named table which has all the columns plus timestamp, and I can run SQL like: select column1, column2 from table group by window(timestamp, '1 minutes'), column1, column2
My question:
Is this an efficient way to do it? If I have multiple topics, i.e. .option("subscribe","topic1,topic2,..."), then I have to create multiple DataFrames in order to run join queries on them, and how do I handle the timestamp column?
In the case of multiple topics I would have the following code:
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "abc:9092,bcs:9092")
.option("subscribe","topic1, topic2,....topicn")
.option("auto.offset.reset", "latest")
.option("checkpointLocation", "/tmp")
.load();
Dataset<Row> ds = df.select( from_json(col("value").cast("string") , Kafkaschema).as("rows"),col("timestamp")).where("topic=topic1");
Dataset<Row> ds1 = df.select( from_json(col("value").cast("string") , Kafkaschema).as("rows"),col("timestamp")).where("topic=topic2");
.... so on and have to same for other data frame
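One way to avoid creating a reader per topic is to keep a single subscription and filter on the topic column that the Kafka source adds to every row. A minimal sketch in Scala (the schema and topic names are placeholders standing in for the question's Kafkaschema and topics, not a tested end-to-end job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder()
  .appName("Spark-Kafka-Integration")
  .master("local")
  .getOrCreate()

// Placeholder for the question's Kafkaschema.
val Kafkaschema = new StructType().add("column1", "string")

// One stream subscribed to every topic of interest.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "abc:9092,bcs:9092")
  .option("subscribe", "topic1,topic2")
  .load()

// The Kafka source exposes a `topic` column, so the single stream can be
// split per topic instead of re-reading it for each one.
val topic1Rows = df
  .where(col("topic") === "topic1")
  .select(from_json(col("value").cast("string"), Kafkaschema).as("rows"), col("timestamp"))
  .select(col("rows.*"), col("timestamp"))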

Trouble using withColumn() when reading stream

I'm trying to read streaming data with Spark using the following code:
eventsDF = (
spark
.readStream
.schema(schema)
.option("header", "true")
.option("maxFilesPerTrigger", 1)
.withColumn("time", unix_timestamp("time")
.cast("double")
.cast("timestamp"))
.csv(inputPath)
)
But I'm getting the error:
'DataStreamReader' object has no attribute 'withColumn'
Is there an alternative for withColumn() in spark.readStream()? I just want to change the column type of my time column from string to timestamp.
Try moving .withColumn to after the DataFrame is created, i.e. after .csv:
eventsDF = (
spark
.readStream
.schema(schema)
.option("header", "true")
.option("maxFilesPerTrigger", 1)
.csv(inputPath)
.withColumn("time", unix_timestamp("time").cast("double").cast("timestamp"))
)

Use hive on spark to merge small file

I would like to merge the output into files of about 128 MB each in Hive. In Spark I have set the following properties, but it still doesn't work. Can someone give me a suggestion?
val spark = SparkSession.builder
.appName("MyExample")
.master("local[*]")
.enableHiveSupport()
.getOrCreate()
spark.sqlContext.setConf("hive.mapred.supports.subdirectories", "true")
spark.sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")
spark.sqlContext.setConf("hive.hadoop.supports.splittable.combineinputformat", "true")
spark.sqlContext.setConf("hive.exec.compress.output", "false")
spark.sqlContext.setConf("hive.input.format", "org.apache.hadoop.hive.ql.io.CombineHiveInputFormat")
spark.sqlContext.setConf("hive.merge.mapfiles", "true")
spark.sqlContext.setConf("hive.merge.mapredfiles", "true")
spark.sqlContext.setConf("hive.merge.size.per.task", "128000000")
spark.sqlContext.setConf("hive.merge.smallfiles.avgsize", "128000000")
spark.sqlContext.setConf("hive.groupby.skewindata", "true")
spark.sqlContext.setConf("hive.merge.sparkfiles", "true")
spark.sqlContext.setConf("hive.merge.mapfiles", "true")
val df = spark.read.format("csv")
.option("header", "false").load(path)
df.write.format("csv").saveAsTable("test_table")
You can either estimate or calculate the size of the DataFrame as described in the post How to find spark RDD/Dataframe size?, and then do:
val nPartitions = math.ceil(sizeInMB / 128.0).toInt
df.repartition(nPartitions).write.format(...).saveAsTable(...)
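As a rough sketch of where sizeInMB could come from (my assumption, not something from the linked post), one option is to sum the input files' on-disk size via the Hadoop FileSystem API:

import org.apache.hadoop.fs.{FileSystem, Path}

// Estimate the input size on disk; `path` is the CSV input path used above.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val sizeInBytes = fs.getContentSummary(new Path(path)).getLength
val sizeInMB = sizeInBytes / (1024.0 * 1024.0)

// Aim for roughly 128 MB per output file.
val nPartitions = math.ceil(sizeInMB / 128.0).toInt.max(1)
df.repartition(nPartitions).write.format("csv").saveAsTable("test_table")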

Trim csv file before importing into a Spark Dataset

I've seen this post about how to specify a schema for creating a Dataset:
Spark Scala: Cannot up cast from string to int as it may truncate
val spark = SparkSession.builder()
.master("local")
.appName("test")
.getOrCreate()
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Record].schema
val ds = spark.read
.option("header", "true")
.schema(schema) // passing schema
.option("timestampFormat", "MM/dd/yyyy HH:mm") // passing timestamp format
.csv(path)// csv path
.as[Record] // convert to DS
It works for me, but not when there are whitespaces in the csv. Is it possible to trim the csv in this same spark.read sequence?
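One thing worth trying (a sketch on my part, assuming the whitespace is leading/trailing padding around the values rather than inside them): the CSV reader has ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options, which trim values during the same spark.read call:

import org.apache.spark.sql.Encoders

val schema = Encoders.product[Record].schema

val ds = spark.read
  .option("header", "true")
  .option("ignoreLeadingWhiteSpace", "true")   // strip spaces before each value
  .option("ignoreTrailingWhiteSpace", "true")  // strip spaces after each value
  .schema(schema)
  .option("timestampFormat", "MM/dd/yyyy HH:mm")
  .csv(path)
  .as[Record]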

Reading csv file as data frame in spark

I am new to Spark and I have a csv file with over 1500 columns. I would like to load it as a DataFrame in Spark. I am not sure how to do this.
Thanks
Use this project https://github.com/databricks/spark-csv
Here is an example from its front page:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
