Aborting job in spark - apache-spark

I want to extract a single column from a CSV file that I have already loaded into an RDD. I used this code: val csvArray = filerdd.map(line => { val colArray = line.split(","); List(colArray(13)) }). When I print the result with csvArray.foreach(println), I get the expected output, but partway through I also get an ArrayIndexOutOfBoundsException: 13.
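A common cause of this is blank or short lines that split into fewer than 14 fields. A minimal guard, reusing the filerdd from the question (a sketch, not necessarily the only fix):
val csvArray = filerdd
  .map(_.split(",", -1))       // -1 keeps trailing empty fields
  .filter(_.length > 13)       // skip blank or short lines instead of failing
  .map(cols => List(cols(13)))
csvArray.foreach(println)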

Related

How to handle NullPointerException while reading, filtering and counting the lines of CSV files using SparkSession?

I'm trying to read the CSV files stored on HDFS using SparkSession, count the number of lines, and print the value on the console. However, I constantly get a NullPointerException while calculating the count. Below is the code snippet:
val validEmployeeIds = Set("12345", "6789")
val count = sparkSession
.read
.option("escape", "\"")
.option("quote", "\"")
.csv(inputPath)
.filter(row => validEmployeeIds.contains(row.getString(0)))
.distinct()
.count()
println(count)
I'm getting an NPE exactly at the .filter condition. If I remove the .filter from the code, it runs fine and prints the count. How can I handle this NPE?
The inputPath is a folder that contains multiple CSV files. Each CSV file has two columns: one is the Id and the other is the name of the employee. A sample CSV extract is below:
12345,Employee1
AA888,Employee2
I'm using Spark version 2.3.1.
Try using the isin function.
import spark.implicits._
val validEmployeeIds = List("12345", "6789")
val df = // Read CSV
df.filter('_c0.isin(validEmployeeIds:_*)).distinct().count()
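If you prefer to keep the typed filter, a null-safe variant of the original code (a sketch; it simply skips rows whose first column is null, which may or may not be what triggers the NPE on your data) is:
val count = sparkSession
  .read
  .option("escape", "\"")
  .option("quote", "\"")
  .csv(inputPath)
  .filter(row => Option(row.getString(0)).exists(validEmployeeIds.contains))
  .distinct()
  .count()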

Cannot write Dataframe result as a Hive table/LFS file

I have encountered an issue while writing the filtered data to a file. Around 27 files are created in the local file system, but they contain no output.
Below is the code used:
I'm reading the file as a DataFrame:
val in_df=spark.read.csv("file:///home/Desktop/Project/inputdata.csv").selectExpr("_c0 as Id","_c1 as name","_c2 as dept")
Then I register this DataFrame as a temp table:
in_df.registerTempTable("employeeDetails")
Now the requirement is to count the number of employees for each department and store it to a file.
val employeeDeptCount=spark.sql("select dept,count(*) from employeedetails group by dept")
// The following writes to the Hive default warehouse as n parquet files.
employeeDeptCount.write.saveAsTable("aggregatedcount")
// The following writes to LFS, but the n files created contain no output.
employeeDeptCount.write.mode("append").csv("file:///home/Desktop/Project")
import org.apache.spark.sql.SaveMode

val in_df = spark.read.csv("file:///home/Desktop/Project/inputdata.csv").selectExpr("_c0 as Id", "_c1 as name", "_c2 as dept")
// please, show your result
in_df.show(false)
val employeeDeptCount= in_df.groupBy("dept").count().alias("count")
employeeDeptCount.persist()
employeeDeptCount.write.format("csv").mode(SaveMode.Overwrite).saveAsTable("aggregatedcount")
employeeDeptCount.repartition(1).write.mode("append").csv("file:///home/Desktop/Project")
employeeDeptCount.unpersist()
// in_df.createOrReplaceTempView()
// in_df.createOrReplaceGlobalTempView()
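If the goal is a single, readable CSV on the local file system, a simpler variant (a sketch; the output directory name is illustrative) is to let the DataFrame writer add the header and collapse to one partition:
employeeDeptCount
  .coalesce(1)
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv("file:///home/Desktop/Project/deptCounts")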

how to check if rdd is empty using spark streaming?

I have the following pyspark code, which I use to read log files from the logs/ directory and then save the results to a text file only when there is data in them, in other words only when the RDD is not empty. But I am having issues implementing it. I have tried both take(1) and isEmpty. Since this is a DStream of RDDs, I can't apply RDD methods to it directly. Please let me know if I am missing anything.
conf = SparkConf().setMaster("local").setAppName("PysparkStreaming")
sc = SparkContext.getOrCreate(conf = conf)
ssc = StreamingContext(sc, 3)  # streaming batches execute every 3 seconds
lines = ssc.textFileStream('/Users/rocket/Downloads/logs/')  # 'logs/' is the directory name
audit = lines.map(lambda x: x.split('|')[3])
result = audit.countByValue()
#result.pprint()
#result.foreachRDD(lambda rdd: rdd.foreach(sendRecord))
# Print the first ten elements of each RDD generated in this DStream to the console
if result.foreachRDD(lambda rdd: rdd.take(1)):
    result.pprint()
    result.saveAsTextFiles("/Users/rocket/Downloads/output", "txt")
else:
    result.pprint()
    print("empty")
The correct structure would be
import uuid
def process_batch(rdd):
    # Skip empty batches; each non-empty batch goes to its own directory
    if not rdd.isEmpty():
        rdd.saveAsTextFile(
            "/Users/rocket/Downloads/output-{}".format(str(uuid.uuid4()))
        )

result.foreachRDD(process_batch)
That, however, requires a separate directory for each batch, since the RDD API has no append mode.
An alternative could be:
def process_batch(rdd):
    if not rdd.isEmpty():
        lines = rdd.map(str)
        spark.createDataFrame(lines, "string") \
            .write.mode("append").format("text") \
            .save("/Users/rocket/Downloads/output")

Apache SPARK with SQLContext:: IndexError

I am trying to execute a basic example from the "Inferring the Schema Using Reflection" section of the Apache Spark documentation.
I'm doing this on the Cloudera QuickStart VM (CDH5).
The example I'm trying to execute is below:
# sc is an existing SparkContext.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
# Load a text file and convert each line to a Row.
lines = sc.textFile("/user/cloudera/analytics/book6_sample.csv")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
# Infer the schema, and register the DataFrame as a table.
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")
# SQL can be run over DataFrames that have been registered as a table.
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
# The results of SQL queries are RDDs and support all the normal RDD operations.
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
    print(teenName)
I ran the code exactly as shown above, but I always get the error "IndexError: list index out of range" when I execute the last command (the for loop).
The input file book6_sample is available at book6_sample.csv.
Please suggest pointers on where I'm going wrong.
Thanks in advance.
Regards,
Sri
Your file has one empty line at the end, which is causing this error. Open your file in a text editor and remove that line; it should then work.

Add a header before text file on save in Spark

I have some Spark code that processes a CSV file and does some transformations on it. I now want to save this RDD as a CSV file and add a header. Each line of this RDD is already formatted correctly.
I am not sure how to do it. I wanted to do a union with the header string and my RDD but the header string is not an RDD so it does not work.
You can make an RDD out of your header line and then union it, yes:
val rdd: RDD[String] = ...
val header: RDD[String] = sc.parallelize(Array("my,header,row"))
header.union(rdd).saveAsTextFile(...)
Then you end up with a bunch of part-xxxxx files that you merge.
The problem is that I don't think you're guaranteed that the header will be the first partition and therefore end up in part-00000 and at the top of your file. In practice, I'm pretty sure it will.
More reliable would be to use Hadoop commands such as hdfs to merge the part-xxxxx files and, as part of that command, just prepend the header line from a file.
Some help on writing it without union (supplying the header at merge time):
import java.io.{ByteArrayInputStream, DataInputStream, InputStream}
import java.nio.charset.StandardCharsets
import org.apache.hadoop.io.IOUtils

// 'out' (the destination output stream of the merged file) and 'conf' (a Hadoop Configuration)
// are assumed to have been set up beforehand.
val fileHeader = "This is header"
val fileHeaderStream: InputStream = new ByteArrayInputStream(fileHeader.getBytes(StandardCharsets.UTF_8))
IOUtils.copyBytes(fileHeaderStream, out, conf, false)
Now loop over your part files to write the complete file, using:
val in: DataInputStream = ...<data input stream from file >
IOUtils.copyBytes(in, out, conf, false)
This made sure for me that the header always comes as the first line, even when you use coalesce/repartition for efficient writing.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def addHeaderToRdd(sparkCtx: SparkContext, lines: RDD[String], header: String): RDD[String] = {
  val headerRDD = sparkCtx.parallelize(List((-1L, header))) // Index the header with -1 so the sort puts it on top
  val pairRDD = lines.zipWithIndex()
  val pairRDD2 = pairRDD.map(t => (t._2, t._1))
  val allRDD = pairRDD2.union(headerRDD)
  val allSortedRDD = allRDD.sortByKey()
  allSortedRDD.values
}
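For example, a hypothetical usage (dataRdd and the output path are placeholders):
val withHeader = addHeaderToRdd(sc, dataRdd, "id,name,dept")
withHeader.saveAsTextFile("hdfs://location/to/save/withHeader")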
A slightly different approach with Spark SQL
From the question: "I now want to save this RDD as a CSV file and add a header. Each line of this RDD is already formatted correctly."
With Spark 2.x you have several options to convert an RDD to a DataFrame:
val rdd = .... //Assume rdd properly formatted with case class or tuple
val df = spark.createDataFrame(rdd).toDF("col1", "col2", ... "coln")
df.write
.format("csv")
.option("header", "true") //adds header to file
.save("hdfs://location/to/save/csv")
We can also prepend a header Row built from the DataFrame's column names (temRet below) and save the result as a delimited text file:
spark.sparkContext
  .parallelize(Seq(SqlHelper.getARow(temRet.columns, temRet.columns.length)))
  .union(temRet.rdd)
  .map(x => x.mkString("\x01"))
  .coalesce(1, true)
  .saveAsTextFile(retPath)
import org.apache.spark.sql.Row

object SqlHelper {
  // Build a single Row from the column names
  def getARow(x: Array[String], size: Int): Row = {
    val columnArray = new Array[String](size)
    for (i <- 0 until size) {
      columnArray(i) = x(i)
    }
    Row.fromSeq(columnArray)
  }
}
