Schema not writing to csv even if header=true is set - apache-spark

I am trying to create an empty DataFrame and simply write it to a CSV file. I expected the schema to be written to the file since I specified header=true, but the write produces an empty .csv file.
I have tried setting different properties, but nothing works.
import org.apache.spark.sql.{DataFrame, SparkSession}

object HeaderTest extends App {
  val spark = SparkSession.builder
    .master("local")
    .appName("learning spark")
    .getOrCreate

  val sc = spark.sparkContext
  import spark.implicits._

  val df: DataFrame = Seq.empty[(String, Int)].toDF("k", "v")
  val f = "E:\\data.csv"
  df.write.mode("overwrite").option("header", "true").csv(f)
}
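One hedged workaround sketch (my own assumption, not from the thread): when the DataFrame has no rows, write the header line yourself from the schema, and fall back to the normal csv writer otherwise. The output path below is hypothetical.

import java.io.PrintWriter

// Minimal sketch: emit only the column names when there is nothing else to write,
// so downstream consumers still see the schema.
if (df.head(1).isEmpty) {
  val writer = new PrintWriter("E:\\header_only.csv") // hypothetical path
  try writer.println(df.columns.mkString(","))
  finally writer.close()
} else {
  df.write.mode("overwrite").option("header", "true").csv(f)
}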

Related

How to code a dynamic datatype check for a column in Spark/Scala to avoid errors from manual files

We are getting a lot of manually created files, and we need to validate a few datatypes before processing the dataframe. Can someone please suggest how I can proceed with this requirement? Basically, I need to write one generic/common Spark program that works for many files. If possible, please also send more detail to this email id: pathirammi1#gmail.com.
I'm wondering whether your files have delimiter-separated records (like a csv file). If so, you could read them as a text file, split each record on the delimiter, and process it. (A schema-comparison sketch for the datatype check itself follows the examples below.)
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RDDFromCSVFile {
  def main(args: Array[String]): Unit = {
    def splitString(row: String): Array[String] = row.split(",")

    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.textFile("randomfile.csv")
    val rdd2: RDD[(String, String, String, String)] = rdd.map { row =>
      val strArray = splitString(row)
      val field1 = strArray(0)
      val field2 = strArray(1)
      val field3 = strArray(3)
      val field4 = strArray(4)
      // Do custom processing here and return the values to build the RDD
      (field1, field2, field3, field4)
    }
    rdd2.foreach(a => println(a.toString))
  }
}
If you have unstructured data, then you can use the code below:
import org.apache.spark.sql.SparkSession

object RDDFromWholeTextFile {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.wholeTextFiles("alice.txt")
    rdd.foreach(a => println(a._1 + "---->" + a._2))
  }
}
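To address the datatype-validation part of the question, here is a minimal sketch of my own (the expected column names and types below are hypothetical) that compares the inferred schema of a csv against an expected schema before processing:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SchemaCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("SchemaCheck").getOrCreate()

    // Hypothetical expected schema for the incoming file.
    val expected = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType)
    ))

    // Let Spark infer the schema, then compare name/type pairs.
    val df = spark.read.option("header", "true").option("inferSchema", "true").csv("randomfile.csv")
    val actualTypes = df.schema.fields.map(f => (f.name, f.dataType)).toMap
    val mismatches = expected.fields.filterNot(f => actualTypes.get(f.name).contains(f.dataType))

    if (mismatches.nonEmpty)
      println("Datatype check failed for: " + mismatches.map(_.name).mkString(", "))
    else
      println("All columns match the expected datatypes")
  }
}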
Hope this helps !!
Thanks,
Naveen

Write dataframe to csv with datatype map<string,bigint> in Spark

I have a file, file1.snappy.parquet. It has a complex data structure, with a map and an array inside it. After processing it I got the final result, but while writing that result to csv I get an error saying
"Exception in thread "main" java.lang.UnsupportedOperationException: CSV data source does not support map<string,bigint> data type."
The code I have used:
val conf = new SparkConf().setAppName("student-example").setMaster("local")
val sc = new SparkContext(conf)
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val datadf = sqlcontext.read.parquet("C:\\file1.snappy.parquet")

def sumaggr = udf((aggr: Map[String, collection.mutable.WrappedArray[Long]]) =>
  if (aggr.keySet.contains("aggr")) aggr("aggr").sum else 0)

datadf.select(col("neid"), sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0).show(false)
datadf.write.format("com.databricks.spark.csv").option("header", "true").save("C:\\myfile.csv")
I tried converting with datadf.toString(), but I still face the same issue.
How can I write that result to csv?
Spark version: 2.1.1
The Spark CSV source supports only atomic types; you cannot store any column that is non-atomic.
I think the best option is to convert the column that has the map<string,bigint> datatype to JSON and save it to csv as below.
import spark.implicits._
import org.apache.spark.sql.functions._
datadf.withColumn("column_name_with_map_type", to_json(struct($"column_name_with_map_type"))).write.csv("outputpath")
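As a follow-up sketch of my own (not part of the original answer), the JSON string can be parsed back into a map later with from_json; note that to_json(struct(...)) wraps the map in a struct, so the schema below mirrors that:

import org.apache.spark.sql.functions.{from_json, struct, to_json}
import org.apache.spark.sql.types.{LongType, MapType, StringType, StructField, StructType}

// Schema of the JSON produced above: a struct with a single map field.
val jsonSchema = StructType(Seq(
  StructField("column_name_with_map_type", MapType(StringType, LongType))
))

// Round-trip check on the in-memory dataframe: serialize, then parse back.
val withJson = datadf.withColumn("as_json", to_json(struct($"column_name_with_map_type")))
val restored = withJson.withColumn("restored_map",
  from_json($"as_json", jsonSchema)("column_name_with_map_type"))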
Hope this helps!
You are trying to save the output of
val datadf = sqlcontext.read.parquet("C:\\file1.snappy.parquet")
which I guess is a mistake, because the udf function and all the aggregation you did would go to waste if you do so.
I think you actually want to save the output of
datadf.select(col("neid"), sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0)
So you need to assign it to a new dataframe variable and use that variable for saving:
val finalDF = datadf.select(col("neid"), sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0)
finalDF.write.format("com.databricks.spark.csv").option("header", "true").save("C:\\myfile.csv")
And you should be fine.

Why has the data changed after converting to parquet format, when testing by unioning two dataframes?

I wrote a function that operates on a csv file and converts it to parquet format, and I want to make sure the data stays the same, with nothing lost or added.
So I wrote a test for it, but it turns out they are not the same.
My logic is:
1) Load the csv into dataframe A.
2) Write dataframe A out in parquet format, saved to a directory.
3) Read the parquet files back as a new dataframe B.
4) Compute A.union(B).
5) Count A, B, and A.union(B).
If the three counts are the same, then I can conclude that they contain the same data.
But the third count comes out different.
def doJob(sc: SparkContext, data: RDD[String]): DataFrame = {
  logInfo("Extracting omniture data")
  val result = data
    .filter(_.contains("PAGE."))
    .filter(_.contains(".PACKAGE"))
  val sqlsqlContext = new SQLContext(sc)
  // just ignore the code above...
  val packagesCsvDF = sqlsqlContext.load("com.databricks.spark.csv",
    Map("path" -> "file:///D:/test/testsample.csv", "header" -> "true"))
  val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
  import sqlContext.implicits._
  // we should have some additional filter here
  // val mydf = packagesDF.groupBy($"page_url").agg(last($"pagename"), last($"prop46"), last($"prop56"), last($"post_evar34"))
  // logInfo("show mydf")
  // mydf.show()
  // TODO
  // save files
  logInfo("Saving omniture packages data to S3")
  if (true) {
    packagesCsvDF
      .repartition(sc.defaultParallelism, col("pagename"))
      .write
      .mode(SaveMode.Append)
      .partitionBy("pagename")
      .parquet("file:///D:/test/parquet")
    logInfo("packagesDF")
  }
  packagesCsvDF // Has packagesCsvDF been changed at this point??????
}
TEST:
object ParquetDataTestsSpec {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet data test Logs").setMaster("local"))
    val input = PackagesOmnitureMapReduceJob.formatToJson(sc.textFile("file:///D:/test/option.json", sc.defaultParallelism))
    val df = PackagesOmnitureMapReduceJob.doJob(sc, input) // call the function I want to test; it writes to "file:///D:/test/parquet"
    val sqlContext = new SQLContext(sc)
    val SourceCSVDF = sqlContext.load("com.databricks.spark.csv", Map("path" -> "file:///D:/test/testsample.csv", "header" -> "true")) // original
    val parquetDataFrame = sqlContext.read.parquet("file:///D:/test/parquet") // get the new dataframe
    val dfCount = df.count()
    val SourceCSVDFcount = SourceCSVDF.count()
    val parquetDataCount = parquetDataFrame.count()
    val unionCount = parquetDataFrame.union(SourceCSVDF).count()
    println(dfCount, SourceCSVDFcount, parquetDataCount, unionCount)
  }
}
This prints:
(200,200,200,400)
Then I tried writing all of the dataframes out to json:
parquetDataFrame.write.json("file:///D:/test/parquetDataFrame")
SourceCSVDF.write.json("file:///D:/test/SourceCSVDF")
df.write.json("file:///D:/test/Desktop/df")
and when I open the json files, I find they are all the same. Is the problem coming from the keyword union? With
val unionalldis3 = parquetDataFrame.unionAll(SourceCSVDF).distinct().count()
the count is right...
But I am very confused. I thought union() was the deduplicating unionAll....
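As a hedged note of my own: in Spark 2.x, Dataset.union keeps duplicate rows, the same as SQL UNION ALL, and unionAll is just a deprecated alias for it, so unioning two identical 200-row dataframes is expected to give 400 unless distinct() is applied. A minimal self-contained sketch illustrating this:

import org.apache.spark.sql.SparkSession

object UnionCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("UnionCountSketch").getOrCreate()
    import spark.implicits._

    val a = Seq(1, 2, 3).toDF("v")
    val b = Seq(1, 2, 3).toDF("v") // same rows as a

    println(a.union(b).count())            // 6: union keeps duplicates (UNION ALL semantics)
    println(a.union(b).distinct().count()) // 3: distinct() removes the duplicates
  }
}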

Can I read a CSV represented as a string into Apache Spark using spark-csv

I know how to read a csv file into spark using spark-csv (https://github.com/databricks/spark-csv), but I already have the csv file represented as a string and would like to convert this string directly to dataframe. Is this possible?
Update: Starting from Spark 2.2.x, there is finally a proper way to do it, using a Dataset.
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().appName("CsvExample").master("local").getOrCreate()
import spark.implicits._

val csvData: Dataset[String] = spark.sparkContext.parallelize(
  """
    |id, date, timedump
    |1, "2014/01/01 23:00:01",1499959917383
    |2, "2014/11/31 12:40:32",1198138008843
  """.stripMargin.lines.toList).toDS()

val frame = spark.read.option("header", true).option("inferSchema", true).csv(csvData)
frame.show()
frame.printSchema()
Old Spark versions
Actually you can, though it uses library internals and is not widely advertised. Just create and use your own CsvParser instance.
Below is an example that works for me on Spark 1.6.0 and spark-csv_2.10-1.4.0:
import com.databricks.spark.csv.CsvParser

val csvData = """
  |userid,organizationid,userfirstname,usermiddlename,userlastname,usertitle
  |1,1,user1,m1,l1,mr
  |2,2,user2,m2,l2,mr
  |3,3,user3,m3,l3,mr
  |""".stripMargin
val rdd = sc.parallelize(csvData.lines.toList)

val csvParser = new CsvParser()
  .withUseHeader(true)
  .withInferSchema(true)

val csvDataFrame: DataFrame = csvParser.csvRdd(sqlContext, rdd)
You can parse your string into a csv using, e.g., scala-csv:
val myCSVdata: Array[List[String]] =
  myCSVString.split('\n').flatMap(CSVParser.parseLine(_))
Here you can do a bit more processing, data cleaning, and verifying that every line parses well and has the same number of fields, etc.
You can then make this an RDD of records:
val myCSVRDD: RDD[List[String]] = sparkContext.parallelize(myCSVdata)
Here you can massage your lists of Strings into a case class, to better reflect the fields of your csv data. You should get some inspiration from the creation of Persons in this example:
https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
I omit this step.
You can then convert to a DataFrame:
import spark.implicits._
val myCSVDataframe = myCSVRDD.toDF()
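For illustration only (the Person case class and field positions below are hypothetical, following the linked Spark guide), the omitted mapping step could look roughly like this:

// Hypothetical record type for the parsed csv rows;
// define it at top level so Spark can find its encoder.
case class Person(name: String, age: Int)

import spark.implicits._

// Map each parsed List[String] onto the case class, then convert to a DataFrame.
val typedDF = myCSVRDD
  .map(fields => Person(fields(0), fields(1).trim.toInt))
  .toDF()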
The accepted answer wasn't working for me in Spark 2.2.0, but it led me to what I needed with csvData.lines.toList:
import java.io.InputStream
import scala.io.Source

val fileUrl = getClass.getResource(s"/file_in_resources.csv")
val stream = fileUrl.getContent.asInstanceOf[InputStream]
val streamString = Source.fromInputStream(stream).mkString
val csvList = streamString.lines.toList

spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvList.toDS())
  .as[SomeCaseClass]

Merge parquet file on standalone spark

Is there a simple way to save a DataFrame into a single parquet file, or to merge the directory containing the metadata and the parts of a parquet file produced by sqlContext.saveAsParquetFile() into a single file stored on NFS, without using HDFS and Hadoop?
To save only one file, rather than many, you can call coalesce(1) / repartition(1) on the RDD/DataFrame before the data is saved.
If you already have a directory of small files, you could create a compacter process which reads in the existing files and saves them to one new file. E.g.
val rows = parquetFile(...).coalesce(1)
rows.saveAsParquetFile(...)
You can store to a local file system using saveAsParquetFile. e.g.
rows.saveAsParquetFile("/tmp/onefile/")
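Side note (my assumption, not from the original answer): saveAsParquetFile comes from the old Spark 1.x API; with the DataFrame writer in Spark 2.x the same compaction would look roughly like this, with placeholder paths:

// Read the existing parquet directory, compact to one partition,
// and write it back out as a single part file.
val compacted = sqlContext.read.parquet("/tmp/manyfiles/").coalesce(1)
compacted.write.mode("overwrite").parquet("/tmp/onefile/")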
I was able to use this method to compress parquet files using snappy format with Spark 1.6.1. I used overwrite so that I could repeat the process if needed. Here is the code.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode

object CompressApp {
  val serverPort = "hdfs://myserver:8020/"
  val inputUri = serverPort + "input"
  val outputUri = serverPort + "output"

  val config = new SparkConf()
    .setAppName("compress-app")
    .setMaster("local[*]")
  val sc = SparkContext.getOrCreate(config)
  val sqlContext = SQLContext.getOrCreate(sc)
  sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
  import sqlContext.implicits._

  def main(args: Array[String]): Unit = {
    println("Compressing Parquet...")
    val df = sqlContext.read.parquet(inputUri).coalesce(1)
    df.write.mode(SaveMode.Overwrite).parquet(outputUri)
    println("Done.")
  }
}
coalesce(N) has saved me so far. If your table is partitioned, then use repartition("partition key") as well.
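A rough sketch of that suggestion (the column name and path are placeholders): repartitioning by the partition column before partitionBy typically leaves a single part file per partition directory.

import org.apache.spark.sql.functions.col

// Repartition by the partition key so each partition directory
// typically ends up with one part file.
df.repartition(col("partition_key"))
  .write
  .mode("overwrite")
  .partitionBy("partition_key")
  .parquet("/tmp/partitioned-output/")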
