Reading csv file as data frame in spark - apache-spark

I am new to Spark and I have a CSV file with over 1500 columns. I would like to load it as a DataFrame in Spark, but I am not sure how to do this.
Thanks

Use the spark-csv project: https://github.com/databricks/spark-csv
Here is the example from its front page:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")

Related

Spark Delta Table Add new columns in middle Schema Evolution

I have to ingest a file with a new column into an existing table structure.
create table sch.test (
name string ,
address string
) USING DELTA
--OPTIONS ('mergeSchema' 'true')
PARTITIONED BY (name)
LOCATION '/mnt/loc/fold'
TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true);
Code to read the file:
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("/mnt/loc/fold")
display(df)
The file in the path contains the data below:
name,address
raghu,india
raj,usa
On writing it to a table,
import org.apache.spark.sql.functions._
df.withColumn("az_insert_ts", current_timestamp())
.withColumn("exec_run_id",lit("233"))
.withColumn("az_inp_file_name",lit("24234filename"))
.coalesce(12)
.write
.mode("append")
.option("mergeSchema", "true")
.format("delta")
.saveAsTable("sch.test")
display(spark.read.table("sch.test"))
Adding a new column,
name,address,age
raghu,india,12
raj,usa,13
Read the file,
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("/mnt/loc/fold")
display(df)
While writing into the table using insertInto,
import org.apache.spark.sql.functions._
df.withColumn("az_insert_ts", current_timestamp())
.withColumn("exec_run_id",lit("233"))
.withColumn("az_inp_file_name",lit("24234filename"))
.coalesce(12)
.write
.mode("append")
.option("mergeSchema", "true")
.format("delta")
.insertInto("sch.test")
display(spark.read.table("sch.test"))
This fails with an error.
Setting overwriteSchema to true (together with mode("overwrite")) will wipe out the old schema and let you create a completely new table:
import org.apache.spark.sql.functions._
df.withColumn("az_insert_ts", current_timestamp())
.withColumn("exec_run_id", lit("233"))
.withColumn("az_inp_file_name", lit("24234filename"))
.coalesce(12)
.write
.mode("overwrite")
.option("overwriteSchema", "true")
.format("delta")
.saveAsTable("sch.test")
display(spark.read.table("sch.test"))
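If the goal is to keep the existing rows and only evolve the schema with the new age column, an alternative sketch (assuming Delta Lake's mergeSchema schema evolution, and noting that insertInto matches columns by position rather than by name) is to append through saveAsTable, as in the first write:
import org.apache.spark.sql.functions._

df.withColumn("az_insert_ts", current_timestamp())
  .withColumn("exec_run_id", lit("233"))
  .withColumn("az_inp_file_name", lit("24234filename"))
  .write
  .mode("append")
  .option("mergeSchema", "true") // adds the new column to the table schema instead of failing
  .format("delta")
  .saveAsTable("sch.test")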

Use hive on spark to merge small file

I would like to merge the output into files of about 128 MB each in Hive. In Spark, I have set the following properties, but it still doesn't work. Can someone give me a suggestion?
val spark = SparkSession.builder
.appName("MyExample")
.master("local[*]")
.enableHiveSupport()
.getOrCreate()
spark.sqlContext.setConf("hive.mapred.supports.subdirectories", "true")
spark.sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")
spark.sqlContext.setConf("hive.hadoop.supports.splittable.combineinputformat", "true")
spark.sqlContext.setConf("hive.exec.compress.output", "false")
spark.sqlContext.setConf("hive.input.format", "org.apache.hadoop.hive.ql.io.CombineHiveInputFormat")
spark.sqlContext.setConf("hive.merge.mapfiles", "true")
spark.sqlContext.setConf("hive.merge.mapredfiles", "true")
spark.sqlContext.setConf("hive.merge.size.per.task", "128000000")
spark.sqlContext.setConf("hive.merge.smallfiles.avgsize", "128000000")
spark.sqlContext.setConf("hive.groupby.skewindata", "true")
spark.sqlContext.setConf("hive.merge.sparkfiles", "true")
spark.sqlContext.setConf("hive.merge.mapfiles", "true")
val df = spark.read.format("csv")
.option("header", "false").load(path)
df.write.format("csv").saveAsTable("test_table")
You can either estimate or calculate the size of the DataFrame, as described in the post How to find spark RDD/Dataframe size?
And then do:
val nPartitions = (sizeInMB / 128.0).ceil.toInt
df.repartition(nPartitions).write.format(...).saveAsTable(...)
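A more complete sketch of the flow, assuming Spark 2.3 or later (where the optimizer's size estimate is available via queryExecution.optimizedPlan.stats) and a 128 MB target file size; the input path and table name are placeholders:
val path = "/path/to/input" // placeholder input path

val df = spark.read.format("csv")
  .option("header", "false")
  .load(path)

// Optimizer's estimate of the data size in bytes; this is only an estimate and
// can differ from the final on-disk size, especially with compression.
val sizeInBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes

val targetFileBytes = 128L * 1024 * 1024
val nPartitions = math.max(1, (sizeInBytes.toDouble / targetFileBytes).ceil.toInt)

// One output file per partition, each roughly 128 MB.
df.repartition(nPartitions).write.format("csv").saveAsTable("test_table")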

How to add multidimensional array to an existing Spark DataFrame

If I understand correctly, ArrayType columns can be added to a Spark DataFrame. I am trying to add a multidimensional array to an existing Spark DataFrame by using the withColumn method. My idea is to have this array available with each DataFrame row in order to use it to send back information from the map function.
The error I get says that the withColumn function is looking for a Column type but it is getting an array. Are there any other functions that will allow adding an ArrayType?
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object TestDataFrameWithMultiDimArray {
val nrRows = 1400
val nrCols = 500
/** Our main function where the action happens */
def main(args: Array[String]) {
// Create a SparkContext using every core of the local machine, named TestDataFrameWithMultiDimArray
val sc = new SparkContext("local[*]", "TestDataFrameWithMultiDimArray")
val sqlContext = new SQLContext(sc)
val PropertiesDF = sqlContext.read
.format("com.crealytics.spark.excel")
.option("location", "C:/Users/tjoha/Desktop/Properties.xlsx")
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns", "False")
.option("sheetName", "Sheet1")
.load()
PropertiesDF.show()
PropertiesDF.printSchema()
val PropertiesDFPlusMultiDimArray = PropertiesDF.withColumn("ArrayCol", Array.ofDim[Any](nrRows,nrCols))
}
}
Thanks for your help.
Kind regards,
Johann
There are two problems in your code:
The second argument to withColumn needs to be a Column; you can wrap a constant value with the lit function.
Spark can't take Any as its column type; you need to use a specific supported type.
import org.apache.spark.sql.functions.lit

val PropertiesDFPlusMultiDimArray = PropertiesDF.withColumn("ArrayCol", lit(Array.ofDim[Int](nrRows, nrCols)))
will do the trick
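One caveat, as a hedged note: on some Spark versions lit does not accept nested Scala collections, in which case typedLit (available since Spark 2.2) can be used for the two-dimensional array:
import org.apache.spark.sql.functions.typedLit

// typedLit handles nested arrays, Seqs and Maps, producing an
// ArrayType(ArrayType(IntegerType)) column here.
val PropertiesDFPlusMultiDimArray =
  PropertiesDF.withColumn("ArrayCol", typedLit(Array.ofDim[Int](nrRows, nrCols)))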

How can you go about creating a csv file from an empty Dataset<Row> in spark 2.1 with headers

Spark 2.1's default behaviour is to write empty files when creating a CSV from an empty Dataset.
How can you go about creating a CSV file with headers?
This is what I am using to write the file:
dataFrame.repartition(NUM_PARTITIONS).write()
.option("header", "true")
.option("delimiter", "\t")
.option("overwrite", "true")
.option("nullValue", "null")
.option("codec", "org.apache.hadoop.io.compress.GzipCodec")
.csv("some/path");

Trim csv file before importing into a Spark Dataset

I've seen this post about how to specify a schema for creating a Dataset:
Spark Scala: Cannot up cast from string to int as it may truncate
val spark = SparkSession.builder()
.master("local")
.appName("test")
.getOrCreate()
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Record].schema
val ds = spark.read
.option("header", "true")
.schema(schema) // passing schema
.option("timestampFormat", "MM/dd/yyyy HH:mm") // passing timestamp format
.csv(path)// csv path
.as[Record] // convert to DS
It works for me, but not when there are whitespaces in the CSV. Is it possible to trim the CSV in this same spark.read sequence?
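One option worth trying, as a sketch (assuming Spark 2.x's built-in CSV reader, which exposes whitespace-trimming options for reads): the ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options trim values while parsing, so no separate pass over the Dataset is needed:
val ds = spark.read
  .option("header", "true")
  .schema(schema)                                // same schema from Encoders.product[Record]
  .option("timestampFormat", "MM/dd/yyyy HH:mm")
  .option("ignoreLeadingWhiteSpace", "true")     // trim leading whitespace while parsing
  .option("ignoreTrailingWhiteSpace", "true")    // trim trailing whitespace while parsing
  .csv(path)
  .as[Record]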
