Cannot write Dataframe result as a Hive table/LFS file - apache-spark

I have encountered an issue while writing filtered data to a file. Around 27 files are created in the local file system, but they contain no output.
Below is the code used:
I'm reading the file as a dataframe
val in_df=spark.read.csv("file:///home/Desktop/Project/inputdata.csv").selectExpr("_c0 as Id","_c1 as name","_c2 as dept")
Then I register this dataframe as a temp table:
in_df.registerTempTable("employeeDetails")
Now the requirement is to count the number of employees for each department and store the result in a file.
val employeeDeptCount=spark.sql("select dept,count(*) from employeedetails group by dept")
// The following code writes to the Hive default warehouse as n parquet files.
employeeDeptCount.write.saveAsTable("aggregatedcount")
// The following code writes to the LFS, but the n files created contain no output.
employeeDeptCount.write.mode("append").csv("file:///home/Desktop/Project")

import org.apache.spark.sql.SaveMode

val in_df = spark.read.csv("file:///home/Desktop/Project/inputdata.csv")
  .selectExpr("_c0 as Id", "_c1 as name", "_c2 as dept")
// Show the result first to confirm the input was read correctly
in_df.show(false)
// groupBy().count() already yields a column named "count"
val employeeDeptCount = in_df.groupBy("dept").count()
employeeDeptCount.persist()
// Write as a CSV-backed table into the Hive warehouse
employeeDeptCount.write.format("csv").mode(SaveMode.Overwrite).saveAsTable("aggregatedcount")
// repartition(1) produces a single part file in the local file system
employeeDeptCount.repartition(1).write.mode("append").csv("file:///home/Desktop/Project")
employeeDeptCount.unpersist()
// Prefer these over the deprecated registerTempTable:
// in_df.createOrReplaceTempView("employeeDetails")
// in_df.createOrReplaceGlobalTempView("employeeDetails")
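If the part files under /home/Desktop/Project still come out empty, a quick sanity check is to confirm the aggregation itself returns rows and to write a single, headered CSV into a fresh subdirectory. This is only a sketch reusing the employeeDeptCount above; the deptcounts subdirectory name is an arbitrary choice.
employeeDeptCount.show(false)  // confirm the aggregation is non-empty
println(s"rows to write: ${employeeDeptCount.count()}")
employeeDeptCount
  .coalesce(1)                  // one part file instead of many
  .write
  .option("header", "true")     // include column names in the CSV
  .mode("overwrite")
  .csv("file:///home/Desktop/Project/deptcounts")
Writing into its own subdirectory also keeps the output part files from mixing with the input CSV that lives directly under /home/Desktop/Project.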

Related

Aborting job in spark

Actually I want to extract a single column from a CSV file which I have already stored in an RDD, and I used this code: val csvArray = filerdd.map(line => { val colArray = line.split(","); List(colArray(13)) }) ...... After this code I printed it with csvArray.foreach(println). I got the output, but an ArrayIndexOutOfBoundsException: 13 also comes up in between.
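The exception means some lines have fewer than 14 comma-separated fields, so colArray(13) does not exist for them. A minimal sketch, assuming the same filerdd as above, that simply skips such lines:
val csvArray = filerdd
  .map(_.split(",", -1))   // -1 keeps trailing empty fields
  .filter(_.length > 13)   // drop lines that have no 14th field instead of throwing
  .map(cols => List(cols(13)))
csvArray.foreach(println)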

How to handle NullPointerException while reading, filtering and counting the lines of CSV files using SparkSession?

I'm trying to read CSV files stored on HDFS using sparkSession, count the number of lines, and print the value on the console. However, I keep getting a NullPointerException while calculating the count. Below is the code snippet:
val validEmployeeIds = Set("12345", "6789")
val count = sparkSession
.read
.option("escape", "\"")
.option("quote", "\"")
.csv(inputPath)
.filter(row => validEmployeeIds.contains(row.getString(0)))
.distinct()
.count()
println(count)
I'm getting the NPE exactly at the .filter condition. If I remove the .filter from the code, it runs fine and prints the count. How can I handle this NPE?
The inputPath is a folder that contains multiple CSV files. Each CSV file has two columns: one is the employee Id and the other is the employee name. A sample CSV extract is below:
12345,Employee1
AA888,Employee2
I'm using Spark version 2.3.1.
Try using the isin function.
import spark.implicits._

val validEmployeeIds = List("12345", "6789")
// Read the CSV the same way as in the question
val df = spark.read.option("escape", "\"").option("quote", "\"").csv(inputPath)
// Rows whose _c0 is null simply do not match, so the filter drops them
df.filter('_c0.isin(validEmployeeIds: _*)).distinct().count()
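If you would rather keep the row-level lambda from the question, a null guard on the first column is a possible hedge; whether it resolves the NPE depends on where the null actually originates. A sketch reusing the question's sparkSession, inputPath and validEmployeeIds:
val count = sparkSession
  .read
  .option("escape", "\"")
  .option("quote", "\"")
  .csv(inputPath)
  .filter(row => !row.isNullAt(0) && validEmployeeIds.contains(row.getString(0)))  // skip rows with a null first column
  .distinct()
  .count()
println(count)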

How to iterate in Databricks to read hundreds of files stored in different subdirectories in a Data Lake?

I have to read hundreds of avro files in Databricks from an Azure Data Lake Gen2, extract the data from the Body field inside every file, and concatenate all the extracted data into a single dataframe. The catch is that the avro files to read are stored in different subdirectories in the lake, following the pattern:
root/YYYY/MM/DD/HH/mm/ss.avro
This forces me to loop the ingestion and selection of data. I'm using this Python code, in which list_avro_files is the list of paths to all files:
from functools import reduce
from pyspark.sql import DataFrame

list_data = []
for file_avro in list_avro_files:
    df = spark.read.format('avro').load(file_avro)
    data1 = spark.read.json(df.select(df.Body.cast('string')).rdd.map(lambda x: x[0]))
    list_data.append(data1)
data = reduce(DataFrame.unionAll, list_data)
Is there any way to do this more efficiently? How can I parallelize/speed up this process?
As long as your list_avro_files can be expressed through standard wildcard syntax, you can probably use Spark's own ability to parallelize the read. All you'd need is to specify a base path and a filename pattern for your avro files:
scala> var df = spark.read
  .option("basePath", "/user/hive/warehouse/root")
  .format("avro")
  .load("/user/hive/warehouse/root/*/*/*/*/*/*.avro")  // one * per YYYY/MM/DD/HH/mm level
And, in case you find that you need to know exactly which file any given row came from, use the input_file_name() built-in function to enrich your dataframe:
scala> df = df.withColumn("source",input_file_name())
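Once the single wildcard load replaces the loop, the extraction of the Body field from the question can also be done in one pass. This is just a sketch, assuming Body holds JSON-encoded payloads as the Python loop implies, and reusing the df loaded above:
import spark.implicits._
// Cast Body to string and parse all JSON payloads in a single job,
// instead of one spark.read.json call per file
val data = spark.read.json(df.select($"Body".cast("string")).as[String])
data.printSchema()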

RDD String to Spark csv Reader

I want to read an RDD[String] using the Spark CSV reader. The reason I am doing this is that I need to filter some records before using the CSV reader.
val fileRDD: RDD[String] = spark.sparkContext.textFile("file")
I need to read the fileRDD using the Spark CSV reader. I don't want to write the filtered file back out, as that would increase the I/O on HDFS. I have looked at the options available in the Spark CSV reader, but didn't find any.
spark.read.csv(file)
Sample Data
PHM|MERC|PHARMA|BLUEDRUG|50
CLM|BSH|CLAIM|VISIT|HSA|EMPLOYER|PAID|250
PHM|GSK|PHARMA|PARAC|70
CLM|UHC|CLAIM|VISIT|HSA|PERSONAL|PAID|72
As you can see, the records starting with PHM have one number of columns and the CLM records have another. That is the reason I am filtering first and then applying a schema; PHM and CLM records have different schemas.
val fileRDD: RDD[String] = spark.sparkContext.textFile("file").filter(_.startWith("PHM"))
spark.read.option(schema,"phcschema").csv(fileRDD.toDS())
Since Spark 2.2, the .csv method can read a dataset of strings. It can be implemented this way:
import spark.implicits._  // needed for rdd.toDS()

val rdd: RDD[String] = spark.sparkContext.textFile("csv.txt")
// ... do filtering
spark.read.csv(rdd.toDS())
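Combining the filter from the question with an explicit schema could look like the sketch below. Only the five-field, pipe-delimited layout comes from the sample data; the column names in phcSchema are made up for illustration.
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import spark.implicits._

// Hypothetical schema for the PHM lines (5 pipe-delimited fields in the sample)
val phcSchema = StructType(Seq(
  StructField("recordType", StringType),
  StructField("manufacturer", StringType),
  StructField("category", StringType),
  StructField("product", StringType),
  StructField("amount", IntegerType)
))

val phmDf = spark.read
  .schema(phcSchema)
  .option("sep", "|")   // the sample records are pipe-delimited
  .csv(spark.sparkContext.textFile("file").filter(_.startsWith("PHM")).toDS())
phmDf.show(false)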

Filter JSON records to different datasets Spark-Java

I'm using Java-Spark.
I have the following JSON records in an RDD from Kafka (as strings):
{"code":"123", "date":"14/07/2018",....}
{"code":"124", "date":"15/07/2018",....}
{"code":"123", "date":"15/07/2018",....}
{"code":"125", "date":"14/07/2018",....}
I read these into a Dataset as follows:
Dataset<Row> df = sparkSession.read().json(jsonSet);
Dataset<Row> dfSelect = df.select(cols);//Where cols is Column[]
I want to write the JSON records to different Hive tables and different partitions by mapping them to different datasets, meaning that:
{"code":"123", "date":"14/07/2018",....} Write to HDFS dir -> /../table123/partition=14_07_2018
{"code":"124", "date":"15/07/2018",....} Write to HDFS dir -> /../table124/partition=15_07_2018
{"code":"123", "date":"15/07/2018",....} Write to HDFS dir -> /../table123/partition=15_07_2018
{"code":"125", "date":"14/07/2018",....} Write to HDFS dir -> /../table125/partition=14_07_2018
How can I map the JSONs by code and by date and then write them like this:
dfSelectByTableAndDate123.write().format("parquet").mode("append").save(pathByTableAndDate);
dfSelectByTableAndDate124.write().format("parquet").mode("append").save(pathByTableAndDate);
dfSelectByTableAndDate125.write().format("parquet").mode("append").save(pathByTableAndDate);
Thanks
You can convert your JSON to Java objects, then group them by date, which will give you the rows grouped by the same date. You can then write each set as you wish. Below is pseudocode in Scala:
import spark.implicits._
case class MyType(code: String, date: String)
val newDs = df.as[MyType]
// Group rows that share the same date; each group can then be written out separately
val byDate = newDs.groupByKey(_.date)
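Since the target layout is one directory per code and one partition per date, an alternative sketch in Scala (the answer's language, not the asker's Java) is to split the dataframe by code and let partitionBy create the date partitions. The base path /path/to is a placeholder; dfSelect and the column names come from the question:
import org.apache.spark.sql.functions.{col, regexp_replace}
import spark.implicits._

val codes = dfSelect.select("code").distinct().as[String].collect()
codes.foreach { code =>
  dfSelect
    .filter(col("code") === code)
    .withColumn("partition", regexp_replace(col("date"), "/", "_"))  // 14/07/2018 -> 14_07_2018
    .write.format("parquet").mode("append")
    .partitionBy("partition")                                        // creates partition=14_07_2018 subdirectories
    .save(s"/path/to/table$code")
}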
