How to handle NullPointerException while reading, filtering and counting the lines of CSV files using SparkSession? - apache-spark

I'm trying to read the CSV files stored on HDFS using sparkSession and count the number of lines and print the value on the console. However, I'm constantly getting NullPointerException while calculating the count. Below is the code snippet,
val validEmployeeIds = Set("12345", "6789")
val count = sparkSession
.read
.option("escape", "\"")
.option("quote", "\"")
.csv(inputPath)
.filter(row => validEmployeeIds.contains(row.getString(0)))
.distinct()
.count()
println(count)
I'm getting an NPE exactly at .filter condition. If I remove .filter in the code, it runs fine and prints the count. How can I handle this NPE?
The inputPath is a folder that contains contains multiple CSV files. Each CSV file has two columns, one represents Id and other represents name of the employee. A sample CSV extract is below:
12345,Employee1
AA888,Employee2
I'm using Spark version 2.3.1.

Try using isin function.
import spark.implicits._
val validEmployeeIds = List("12345", "6789")
val df = // Read CSV
df.filter('_c0.isin(validEmployeeIds:_*)).distinct().count()

Related

Write dataframe to csv with datatype map<string,bigint> in Spark

I have a file which is file1snappy.parquet. It is having a complex data structure like a map, array inside that.After processing that I got final result.while writing that results to csv I am getting some error saying
"Exception in thread "main" java.lang.UnsupportedOperationException: CSV data source does not support map<string,bigint> data type."
Code which I have used:
val conf=new SparkConf().setAppName("student-example").setMaster("local")
val sc = new SparkContext(conf)
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val datadf = sqlcontext.read.parquet("C:\\file1.snappy.parquet")
def sumaggr=udf((aggr: Map[String, collection.mutable.WrappedArray[Long]]) => if (aggr.keySet.contains("aggr")) aggr("aggr").sum else 0)
datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0).show(false)
datadf.write.format("com.databricks.spark.csv").option("header", "true").save("C:\\myfile.csv")
I tried converting datadf.toString() but still I am facing same issue.
How can write that result to CSV.
spark version 2.1.1
Spark CSV source supports only atomic types. You cannot store any columns that are non-atomic
I think best is to create a JSON for the column that has map<string,bigint> as a datatype and save it in csv as below.
import spark.implicits._
import org.apache.spark.sql.functions._
datadf.withColumn("column_name_with_map_type", to_json(struct($"column_name_with_map_type"))).write.csv("outputpath")
Hope this helps!
You are trying to save the output of
val datadf = sqlcontext.read.parquet("C:\\file1.snappy.parquet")
which I guess is a mistake as the udf function and all the aggregation done would go in vain if you do so
So I think you want to save the output of
datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0).show(false)
So you need to save it in a new dataframe variable and use that variable to save.
val finalDF = datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0)
finalDF.write.format("com.databricks.spark.csv").option("header", "true").save("C:\\myfile.csv")
And you should be fine.

race condition in FileScanRDD.scala

I am attempting to create a DataSet from a single CSV file :
val ss: SparkSession = ....
val ddr = ss.read.option("path", path)
var df = ddr.option("sep", ",")
.option("quote", "\"")
.option("escape", "\"") // want to retain backslashes (\) ...
.option("delimiter", ",")
.option("comment", "#")
.option("header", "true")
.option("format", "csv")
ddr.csv(path)
df.count() returns 2 times the number of lines in the CSV file.
There appears to be a problem in FileScanRDD.compute.
I am using spark version 2.0.1 with scala 2.11. I am not going to include the entire contents of FileScanRDD.scala here.
In FileScanRDD.compute there is the following:
private[this] val files = split.asInstanceOf[FilePartition].files.toIterator
If i put a breakpoint in either FileScanRDD.compute or FIleScanRDD.nextIterator the resulting dataset has the correct number of rows.
Moreover, the code in FileScanRDD.scala is:
private def nextIterator(): Boolean = {
updateBytesReadWithFileSize()
if (files.hasNext) {
currentFile = files.next()
....
}
else { .... }
....
}
if i put a breakpoint on the files.hasNext line all is well; however, if i put a breakpoint on the files.next() line the code will fail when i continue because the files iterator has become empty. Disabling the breakpoint winds up creating a Dataset with each line of the csv file duplicated.
So it appears that multiple threads are using the files iterator or the underling split value (an RDDPartition) and timing wise on my system 2 workers wind up processing the same file, with the resulting DataSet having 2 copies of each of the input lines.
This code is not active when parsing an XML file.
I couldn't find any questions about this.
Do i have do something to prevent this behavior?

Can I read a CSV represented as a string into Apache Spark using spark-csv

I know how to read a csv file into spark using spark-csv (https://github.com/databricks/spark-csv), but I already have the csv file represented as a string and would like to convert this string directly to dataframe. Is this possible?
Update : Starting from Spark 2.2.x
there is finally a proper way to do it using Dataset.
import org.apache.spark.sql.{Dataset, SparkSession}
val spark = SparkSession.builder().appName("CsvExample").master("local").getOrCreate()
import spark.implicits._
val csvData: Dataset[String] = spark.sparkContext.parallelize(
"""
|id, date, timedump
|1, "2014/01/01 23:00:01",1499959917383
|2, "2014/11/31 12:40:32",1198138008843
""".stripMargin.lines.toList).toDS()
val frame = spark.read.option("header", true).option("inferSchema",true).csv(csvData)
frame.show()
frame.printSchema()
Old spark versions
Actually you can, though it's using library internals and not widely advertised. Just create and use your own CsvParser instance.
Example that works for me on spark 1.6.0 and spark-csv_2.10-1.4.0 below
import com.databricks.spark.csv.CsvParser
val csvData = """
|userid,organizationid,userfirstname,usermiddlename,userlastname,usertitle
|1,1,user1,m1,l1,mr
|2,2,user2,m2,l2,mr
|3,3,user3,m3,l3,mr
|""".stripMargin
val rdd = sc.parallelize(csvData.lines.toList)
val csvParser = new CsvParser()
.withUseHeader(true)
.withInferSchema(true)
val csvDataFrame: DataFrame = csvParser.csvRdd(sqlContext, rdd)
You can parse your string into a csv using, e.g. scala-csv:
val myCSVdata : Array[List[String]] =
myCSVString.split('\n').flatMap(CSVParser.parseLine(_))
Here you can do a bit more processing, data cleaning, verifying that every line parses well and has the same number of fields, etc ...
You can then make this an RDD of records:
val myCSVRDD : RDD[List[String]] = sparkContext.parallelize(msCSVdata)
Here you can massage your lists of Strings into a case class, to reflect the fields of your csv data better. You should get some inspiration from the creations of Persons in this example:
https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
I omit this step.
You can then convert to a DataFrame:
import spark.implicits._
myCSVDataframe = myCSVRDD.toDF()
The accepted answer wasn't working for me in spark 2.2.0 but lead me to what I needed with csvData.lines.toList
val fileUrl = getClass.getResource(s"/file_in_resources.csv")
val stream = fileUrl.getContent.asInstanceOf[InputStream]
val streamString = Source.fromInputStream(stream).mkString
val csvList = streamString.lines.toList
spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(csvList.toDS())
.as[SomeCaseClass]

createDataFrame() returning a list instead of DataFrame in Spark

I am running Spark 1.5.1. On startup I have HiveContext available as sqlContext but set
sqlContext2 = SQLContext(sc)
I create a pipelined RDD by parsing a list of strings to JSON
data = points.map(lambda line: json.loads(line))
I then try to convert this into a dataframe using
DF = sqlContext2.createDataFrame(data).collect()
This runs perfectly, but then when i run type(DF) it says that it is a list.
How is this possible? How is a list coming out of a createDataFrame()
That's because when you apply collect() on a DataFrame, it return a list that contains all of the elements (Rows) in this DataFrame.
if you want just a DatFrame, df = sqlContext.createDataFrame(data) is enough.
There is no need for sqlContext2 here.

Add a header before text file on save in Spark

I have some spark code to process a csv file. It does some transformation on it. I now want to save this RDD as a csv file and add a header. Each line of this RDD is already formatted correctly.
I am not sure how to do it. I wanted to do a union with the header string and my RDD but the header string is not an RDD so it does not work.
You can make an RDD out of your header line and then union it, yes:
val rdd: RDD[String] = ...
val header: RDD[String] = sc.parallelize(Array("my,header,row"))
header.union(rdd).saveAsTextFile(...)
Then you end up with a bunch of part-xxxxx files that you merge.
The problem is that I don't think you're guaranteed that the header will be the first partition and therefore end up in part-00000 and at the top of your file. In practice, I'm pretty sure it will.
More reliable would be to use Hadoop commands like hdfs to merge the part-xxxxx files, and as part of the command, just throw in the header line from a file.
Some help on writing it without Union(Supplied the header at the time of merge)
val fileHeader ="This is header"
val fileHeaderStream: InputStream = new ByteArrayInputStream(fileHeader.getBytes(StandardCharsets.UTF_8));
val output = IOUtils.copyBytes(fileHeaderStream,out,conf,false)
Now loop over you file parts to write the complete file using
val in: DataInputStream = ...<data input stream from file >
IOUtils.copyBytes(in, output, conf, false)
This made sure for me that header always comes as first line even when you use coalasec/repartition for efficient writing
def addHeaderToRdd(sparkCtx: SparkContext, lines: RDD[String], header: String): RDD[String] = {
val headerRDD = sparkCtx.parallelize(List((-1L, header))) // We index the header with -1, so that the sort will put it on top.
val pairRDD = lines.zipWithIndex()
val pairRDD2 = pairRDD.map(t => (t._2, t._1))
val allRDD = pairRDD2.union(headerRDD)
val allSortedRDD = allRDD.sortByKey()
return allSortedRDD.values
}
Slightly diff approach with Spark SQL
From Question: I now want to save this RDD as a CSV file and add a header. Each line of this RDD is already formatted correctly.
With Spark 2.x you have several options to convert RDD to DataFrame
val rdd = .... //Assume rdd properly formatted with case class or tuple
val df = spark.createDataFrame(rdd).toDF("col1", "col2", ... "coln")
df.write
.format("csv")
.option("header", "true") //adds header to file
.save("hdfs://location/to/save/csv")
Now we can even use Spark SQL DataFrame to load, transform and save CSV file
spark.sparkContext.parallelize(Seq(SqlHelper.getARow(temRet.columns,
temRet.columns.length))).union(temRet.rdd).map(x =>
x.mkString("\x01")).coalesce(1, true).saveAsTextFile(retPath)
object SqlHelper {
//create one row
def getARow(x: Array[String], size: Int): Row = {
var columnArray = new Array[String](size)
for (i <- 0 to (size - 1)) {
columnArray(i) = x(i).toString()
}
Row.fromSeq(columnArray)
}
}

Resources