Trouble using withColumn() when reading stream - apache-spark

I'm trying to read stream data with spark using the following code:
eventsDF = (
  spark
    .readStream
    .schema(schema)
    .option("header", "true")
    .option("maxFilesPerTrigger", 1)
    .withColumn("time", unix_timestamp("time")
      .cast("double")
      .cast("timestamp"))
    .csv(inputPath)
)
But I'm getting the error:
'DataStreamReader' object has no attribute 'withColumn'
Is there an alternative for withColumn() in spark.readStream()? I just want to change the column type of my time column from string to timestamp.

Try moving .withColumn() to after the DataFrame is created, i.e. after .csv():
eventsDF = (
  spark
    .readStream
    .schema(schema)
    .option("header", "true")
    .option("maxFilesPerTrigger", 1)
    .csv(inputPath)
    .withColumn("time", unix_timestamp("time").cast("double").cast("timestamp"))
)

Related

Spark Delta Table Add new columns in middle Schema Evolution

I have to ingest a file with a new column into an existing table structure.
create table sch.test (
name string ,
address string
) USING DELTA
--OPTIONS ('mergeSchema' 'true')
PARTITIONED BY (name)
LOCATION '/mnt/loc/fold'
TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true);
Code to read the file:
val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/mnt/loc/fold")
display(df)
The file in the path contains the data below:
name,address
raghu,india
raj,usa
On writing it to a table,
import org.apache.spark.sql.functions._

df.withColumn("az_insert_ts", current_timestamp())
  .withColumn("exec_run_id", lit("233"))
  .withColumn("az_inp_file_name", lit("24234filename"))
  .coalesce(12)
  .write
  .mode("append")
  .option("mergeSchema", "true")
  .format("delta")
  .saveAsTable("sch.test")
display(spark.read.table("sch.test"))
Adding a new column,
name,address,age
raghu,india,12
raj,usa,13
Read the file,
val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/mnt/loc/fold")
display(df)
While writing into the table using insertInto,
import org.apache.spark.sql.functions._

df.withColumn("az_insert_ts", current_timestamp())
  .withColumn("exec_run_id", lit("233"))
  .withColumn("az_inp_file_name", lit("24234filename"))
  .coalesce(12)
  .write
  .mode("append")
  .option("mergeSchema", "true")
  .format("delta")
  .insertInto("sch.test")
display(spark.read.table("sch.test"))
I am getting an error when doing this.
Setting overwriteSchema to true will wipe out the old schema and let you create a completely new table.
import org.apache.spark.sql.functions._

// overwriteSchema takes effect with overwrite mode when saving the table
df.withColumn("az_insert_ts", current_timestamp())
  .withColumn("exec_run_id", lit("233"))
  .withColumn("az_inp_file_name", lit("24234filename"))
  .coalesce(12)
  .write
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .format("delta")
  .saveAsTable("sch.test")
display(spark.read.table("sch.test"))
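If the goal is to keep the rows already in sch.test and only add the new age column, an alternative worth trying (a sketch, assuming the same df as above, not part of the original answer) is to append with mergeSchema through saveAsTable; insertInto resolves columns by position and does not evolve the schema:

import org.apache.spark.sql.functions._

// Append the new file and let Delta evolve the table schema to pick up "age".
// mergeSchema is honoured by save/saveAsTable, not by insertInto.
df.withColumn("az_insert_ts", current_timestamp())
  .withColumn("exec_run_id", lit("233"))
  .withColumn("az_inp_file_name", lit("24234filename"))
  .coalesce(12)
  .write
  .mode("append")
  .option("mergeSchema", "true")
  .format("delta")
  .saveAsTable("sch.test")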

Error while reading an Excel file using Spark Scala with filename as a parameter

Can someone help me with reading an Excel file using the Spark Scala read API? I tried installing com.crealytics:spark-excel_2.11:0.13.1 (from Maven) on a cluster with Databricks Runtime 6.5 and 6.6 (Apache Spark 2.4.5, Scala 2.11), but it only works if I hard-code the file path.
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Listing_Attributed")
  .option("header", "true")
  .option("inferSchema", "false")
  .option("addColorColumns", "true") // Optional, default: false
  .option("badRecordsPath", Vars.rootSourcePath + "BadRecords/" + DataCategory)
  .option("dateFormat", "dd-MON-yy")
  .option("timestampFormat", "MM/dd/yyyy hh:mm:ss")
  .option("ignoreLeadingWhiteSpace", true)
  .option("ignoreTrailingWhiteSpace", true)
  .option("escape", " ")
  .load("/ABC/Test_Filename_6.12.20.xlsx") // hard-coded path works...
  // .load(filepath) // filepath is a parameter and throws "java.io.IOException: GC overhead limit exceeded"
Use .option("location", inputPath) like below:
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Listing_Attributed")
  .option("header", "true")
  .option("location", inputPath)
  .load()

Use hive on spark to merge small file

I would like to merge the output into 128 MB files in Hive. In Spark, I have set the following properties, but it still doesn't work. Can someone give me a suggestion?
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("MyExample")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()
spark.sqlContext.setConf("hive.mapred.supports.subdirectories", "true")
spark.sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")
spark.sqlContext.setConf("hive.hadoop.supports.splittable.combineinputformat", "true")
spark.sqlContext.setConf("hive.exec.compress.output", "false")
spark.sqlContext.setConf("hive.input.format", "org.apache.hadoop.hive.ql.io.CombineHiveInputFormat")
spark.sqlContext.setConf("hive.merge.mapfiles", "true")
spark.sqlContext.setConf("hive.merge.mapredfiles", "true")
spark.sqlContext.setConf("hive.merge.size.per.task", "128000000")
spark.sqlContext.setConf("hive.merge.smallfiles.avgsize", "128000000")
spark.sqlContext.setConf("hive.groupby.skewindata", "true")
spark.sqlContext.setConf("hive.merge.sparkfiles", "true")
spark.sqlContext.setConf("hive.merge.mapfiles", "true")
val df = spark.read.format("csv")
.option("header", "false").load(path)
df.write.format("csv").saveAsTable("test_table")
You can either estimate or calculate the size of the DataFrame as described in this post: How to find spark RDD/Dataframe size?
And then do:
val nPartitions = math.ceil(sizeInMB / 128.0).toInt
df.repartition(nPartitions).write.format(....).saveAsTable(...)
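A rough end-to-end sketch, assuming the optimized logical plan's size estimate (exposed on the Dataset's queryExecution) is accurate enough for sizing the output files:

// Read, estimate the data size from the optimized plan statistics,
// and repartition so each output file lands near 128 MB.
val df = spark.read.format("csv")
  .option("header", "false")
  .load(path)

val sizeInBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes // BigInt
val sizeInMB = sizeInBytes.toDouble / (1024 * 1024)
val nPartitions = math.max(1, math.ceil(sizeInMB / 128.0).toInt)

df.repartition(nPartitions).write.format("csv").saveAsTable("test_table")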

Trim csv file before importing into a Spark Dataset

I've seen this post about how to specify a schema for creating a Dataset:
Spark Scala: Cannot up cast from string to int as it may truncate
import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder()
  .master("local")
  .appName("test")
  .getOrCreate()
import spark.implicits._ // needed for .as[Record]

// Record is the case class from the linked post
val schema = Encoders.product[Record].schema

val ds = spark.read
  .option("header", "true")
  .schema(schema) // passing schema
  .option("timestampFormat", "MM/dd/yyyy HH:mm") // passing timestamp format
  .csv(path) // csv path
  .as[Record] // convert to DS
It works for me, but not when there are whitespaces in the CSV. Is it possible to trim the values within this same spark.read sequence?
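One thing worth trying: the CSV source has ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options (the same ones that appear in the Excel question above), which should trim each field during the read itself. A sketch on top of the same code, with the rest unchanged:

// Trim surrounding whitespace while parsing, so values cast cleanly to the schema types.
val ds = spark.read
  .option("header", "true")
  .option("ignoreLeadingWhiteSpace", "true")
  .option("ignoreTrailingWhiteSpace", "true")
  .schema(schema)
  .option("timestampFormat", "MM/dd/yyyy HH:mm")
  .csv(path)
  .as[Record]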

Reading csv file as data frame in spark

I am new to Spark and I have a CSV file with over 1500 columns. I would like to load it as a DataFrame in Spark, but I am not sure how to do this. Thanks.
Use this project: https://github.com/databricks/spark-csv
Here is an example from its front page:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("cars.csv")
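Note that spark-csv was merged into Spark itself as of 2.x, so on a current version the same read works without any external package. A minimal equivalent, assuming a SparkSession named spark:

// Built-in CSV source in Spark 2.x+; wide files (1500+ columns) are fine,
// though inferSchema adds an extra pass over the data.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("cars.csv")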
