I have to ingest a file with a new column into an existing table structure.
create table sch.test (
name string,
address string
) USING DELTA
--OPTIONS ('mergeSchema' 'true')
PARTITIONED BY (name)
LOCATION '/mnt/loc/fold'
TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true);
Code to read the file:
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("/mnt/loc/fold")
display(df)
The file at that path contains the data below:
name,address
raghu,india
raj,usa
On writing it to a table,
import org.apache.spark.sql.functions._
df.withColumn("az_insert_ts", current_timestamp())
.withColumn("exec_run_id",lit("233"))
.withColumn("az_inp_file_name",lit("24234filename"))
.coalesce(12)
.write
.mode("append")
.option("mergeSchema", "true")
.format("delta")
.saveAsTable("sch.test")
display(spark.read.table("sch.test"))
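After this first append the table should hold the two file columns plus the three audit columns; a quick way to check (a small sketch, the exact printed output may differ):
spark.read.table("sch.test").printSchema()
// expected columns: name, address, az_insert_ts, exec_run_id, az_inp_file_name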
A new column is then added to the file:
name,address,age
raghu,india,12
raj,usa,13
Reading the file again,
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("/mnt/loc/fold")
display(df)
While writing into the table using insertInto,
import org.apache.spark.sql.functions._
df.withColumn("az_insert_ts", current_timestamp())
.withColumn("exec_run_id",lit("233"))
.withColumn("az_inp_file_name",lit("24234filename"))
.coalesce(12)
.write
.mode("append")
.option("mergeSchema", "true")
.format("delta")
.insertInto("sch.test")
display(spark.read.table("sch.test"))
This fails with an error. Since setting overwriteSchema to true will wipe out the old schema and let you create a completely new table, I also tried the following:
import org.apache.spark.sql.functions._
df.withColumn(""az_insert_ts"", current_timestamp())
.withColumn(""exec_run_id"",lit(""233""))
.withColumn(""az_inp_file_name"",lit(""24234filename""))
.coalesce(12)
.write
.mode(""append"")
.option(""overwriteSchema"", ""true"")
.format(""delta"")
.insertInto(""sch.test"")
display(spark.read.table(""sch.test""))
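For comparison, a hedged sketch (not necessarily the accepted fix): as far as I know, insertInto resolves columns by position rather than by name and does not apply the mergeSchema write option, so appending through saveAsTable with mergeSchema, exactly as in the first write above, is one way to let the new age column evolve the table schema:
import org.apache.spark.sql.functions._

// append via saveAsTable with mergeSchema, mirroring the first write above
df.withColumn("az_insert_ts", current_timestamp())
  .withColumn("exec_run_id", lit("233"))
  .withColumn("az_inp_file_name", lit("24234filename"))
  .coalesce(12)
  .write
  .mode("append")
  .option("mergeSchema", "true")
  .format("delta")
  .saveAsTable("sch.test")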
Related
I have an application which runs queries on Kafka topics with the schema specified.
Below is my code:
SparkSession spark = SparkSession.builder()
.appName("Spark-Kafka-Integration")
.config("spark.master", "local")
.getOrCreate();
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "abc:9092,bcs:9092")
.option("subscribe","topic")
.option("auto.offset.reset", "latest")
.option("checkpointLocation", "/tmp")
.load();
// Mapping it to the schema
Dataset<Row> ds2 = df.select( from_json(col("value").cast("string") , Kafkaschema).as("rows"),col("timestamp"));
ds2.createOrReplaceTempView("ds2");
// Making a Row having timestamp and the values
Dataset<Row> ds3 = spark.sql("select rows.* , timestamp from ds2 ");
ds3.createOrReplaceTempView("table");
Dataset<Row> result2 = spark.sql(query.getQuery());
This runs fine. Now I have a view named table which has all the columns plus the timestamp, so I can run SQL like SELECT column1, column2 FROM table GROUP BY window(timestamp, '1 minutes'), column1, column2.
My question:
Is this an efficient way to do it? If I have multiple topics, i.e. .option("subscribe","topic1,topic2,..."), then I have to create multiple DataFrames in order to run join queries on them, and how can I handle the timestamp column?
In the case of multiple topics I will have the following code:
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "abc:9092,bcs:9092")
.option("subscribe","topic1, topic2,....topicn")
.option("auto.offset.reset", "latest")
.option("checkpointLocation", "/tmp")
.load();
Dataset<Row> ds = df.where("topic = 'topic1'").select(from_json(col("value").cast("string"), Kafkaschema).as("rows"), col("timestamp"));
Dataset<Row> ds1 = df.where("topic = 'topic2'").select(from_json(col("value").cast("string"), Kafkaschema).as("rows"), col("timestamp"));
... and so on, repeating the same for each remaining topic.
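For what it's worth, here is a rough Scala sketch (the question is in Java; Kafkaschema and the topic names are taken from the question) of parsing the multi-topic stream once, keeping Kafka's topic and timestamp columns, and only then filtering per topic:
import org.apache.spark.sql.functions.{col, from_json}

// one stream for all topics; keep Kafka's topic column so each frame can be filtered
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "abc:9092,bcs:9092")
  .option("subscribe", "topic1,topic2")
  .load()

val parsed = raw.select(
  from_json(col("value").cast("string"), Kafkaschema).as("rows"),
  col("timestamp"),
  col("topic"))

// per-topic frames share the same parsed stream and the same timestamp column
val t1 = parsed.where("topic = 'topic1'").select("rows.*", "timestamp")
val t2 = parsed.where("topic = 'topic2'").select("rows.*", "timestamp")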
I'm trying to read streaming data with Spark using the following code:
eventsDF = (
spark
.readStream
.schema(schema)
.option("header", "true")
.option("maxFilesPerTrigger", 1)
.withColumn("time", unix_timestamp("time")
.cast("double")
.cast("timestamp"))
.csv(inputPath)
)
But I'm getting the error:
'DataStreamReader' object has no attribute 'withColumn'
Is there an alternative for withColumn() in spark.readStream()? I just want to change the column type of my time column from string to timestamp.
Try moving .withColumn to after the DataFrame is created, i.e. after .csv:
from pyspark.sql.functions import unix_timestamp  # needed for the conversion below

eventsDF = (
spark
.readStream
.schema(schema)
.option("header", "true")
.option("maxFilesPerTrigger", 1)
.csv(inputPath)
# convert the existing "time" string column, rather than the current time
.withColumn("time", unix_timestamp("time").cast("double").cast("timestamp"))
)
I would like to merge the output into 128 MB files in Hive. In Spark, I have set the following properties, but it still doesn't work. Can someone give me a suggestion?
val spark = SparkSession.builder
.appName("MyExample")
.master("local[*]")
.enableHiveSupport()
.getOrCreate()
spark.sqlContext.setConf("hive.mapred.supports.subdirectories", "true")
spark.sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")
spark.sqlContext.setConf("hive.hadoop.supports.splittable.combineinputformat", "true")
spark.sqlContext.setConf("hive.exec.compress.output", "false")
spark.sqlContext.setConf("hive.input.format", "org.apache.hadoop.hive.ql.io.CombineHiveInputFormat")
spark.sqlContext.setConf("hive.merge.mapfiles", "true")
spark.sqlContext.setConf("hive.merge.mapredfiles", "true")
spark.sqlContext.setConf("hive.merge.size.per.task", "128000000")
spark.sqlContext.setConf("hive.merge.smallfiles.avgsize", "128000000")
spark.sqlContext.setConf("hive.groupby.skewindata", "true")
spark.sqlContext.setConf("hive.merge.sparkfiles", "true")
spark.sqlContext.setConf("hive.merge.mapfiles", "true")
val df = spark.read.format("csv")
.option("header", "false").load(path)
df.write.format("csv").saveAsTable("test_table")
You can either estimate or calculate the size of the DataFrame, as described in this post: How to find spark RDD/Dataframe size?
And then do:
val nPartitions = (sizeInMB / 128.0).ceil.toInt
df.repartition(nPartitions).write.format(...).saveAsTable(...)
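For the estimate itself, a minimal sketch (assuming Spark 2.3+, where the optimizer statistics are exposed on the query plan; the target format and table name are taken from the question):
// approximate size from Catalyst statistics, then repartition to ~128 MB files
val sizeInBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes // BigInt
val sizeInMB = sizeInBytes.toDouble / (1024 * 1024)
val nPartitions = math.max(1, (sizeInMB / 128).ceil.toInt)
df.repartition(nPartitions).write.format("csv").saveAsTable("test_table")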
I've seen this post about how to specify a schema for creating a Dataset:
Spark Scala: Cannot up cast from string to int as it may truncate
val spark = SparkSession.builder()
.master("local")
.appName("test")
.getOrCreate()
import org.apache.spark.sql.Encoders
import spark.implicits._ // needed for .as[Record] below
val schema = Encoders.product[Record].schema
val ds = spark.read
.option("header", "true")
.schema(schema) // passing schema
.option("timestampFormat", "MM/dd/yyyy HH:mm") // passing timestamp format
.csv(path)// csv path
.as[Record] // convert to DS
It works for me, but not when there are whitespaces in the CSV. Is it possible to trim the values in this same spark.read sequence?
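One thing that may help, as a sketch rather than a verified fix for this exact case: Spark's CSV reader has ignoreLeadingWhiteSpace / ignoreTrailingWhiteSpace options that strip surrounding whitespace from values at read time, so they can be added to the same spark.read chain:
val ds = spark.read
  .option("header", "true")
  .option("ignoreLeadingWhiteSpace", "true")  // trim leading spaces in values
  .option("ignoreTrailingWhiteSpace", "true") // trim trailing spaces in values
  .schema(schema)
  .option("timestampFormat", "MM/dd/yyyy HH:mm")
  .csv(path)
  .as[Record]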
I am new to Spark and I have a CSV file with over 1500 columns. I would like to load it as a DataFrame in Spark. I am not sure how to do this.
Thanks
Use this project: https://github.com/databricks/spark-csv
Here is an example from the front page:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")