how to check if rdd is empty using spark streaming? - python-3.x

I have following pyspark code which I am using to read log files from logs/ directory and then saving results to a text file only when it has the data in it ... in other words when RDD is not empty. But I am having issues implementing it. I have tried both take(1) and notempty. As this is dstream rdd we can't apply rdd methods to it. Please let me know if I am missing anything.
conf = SparkConf().setMaster("local").setAppName("PysparkStreaming")
sc = SparkContext.getOrCreate(conf = conf)
ssc = StreamingContext(sc, 3) #Streaming will execute in each 3 seconds
lines = ssc.textFileStream('/Users/rocket/Downloads/logs/') #'logs/ mean directory name
audit = lines.map(lambda x: x.split('|')[3])
result = audit.countByValue()
#result.pprint()
#result.foreachRDD(lambda rdd: rdd.foreach(sendRecord))
# Print the first ten elements of each RDD generated in this DStream to the console
if result.foreachRDD(lambda rdd: rdd.take(1)):
result.pprint()
result.saveAsTextFiles("/Users/rocket/Downloads/output","txt")
else:
result.pprint()
print("empty")

The correct structure would be
import uuid
def process_batch(rdd):
if not rdd.isEmpty():
result.saveAsTextFiles("/Users/rocket/Downloads/output-{}".format(
str(uuid.uuid4())
) ,"txt")
result.foreachRDD(process_batch)
That however, as you see above, requires a separate directory for each batch, as RDD API doesn't have append mode.
And alternative could be:
def process_batch(rdd):
if not rdd.isEmpty():
lines = rdd.map(str)
spark.createDataFrame(lines, "string").save.mode("append").format("text").save("/Users/rocket/Downloads/output")

Related

How to handle NullPointerException while reading, filtering and counting the lines of CSV files using SparkSession?

I'm trying to read the CSV files stored on HDFS using sparkSession and count the number of lines and print the value on the console. However, I'm constantly getting NullPointerException while calculating the count. Below is the code snippet,
val validEmployeeIds = Set("12345", "6789")
val count = sparkSession
.read
.option("escape", "\"")
.option("quote", "\"")
.csv(inputPath)
.filter(row => validEmployeeIds.contains(row.getString(0)))
.distinct()
.count()
println(count)
I'm getting an NPE exactly at .filter condition. If I remove .filter in the code, it runs fine and prints the count. How can I handle this NPE?
The inputPath is a folder that contains contains multiple CSV files. Each CSV file has two columns, one represents Id and other represents name of the employee. A sample CSV extract is below:
12345,Employee1
AA888,Employee2
I'm using Spark version 2.3.1.
Try using isin function.
import spark.implicits._
val validEmployeeIds = List("12345", "6789")
val df = // Read CSV
df.filter('_c0.isin(validEmployeeIds:_*)).distinct().count()

Spark job reading the sorted AVRO files in dataframe, but writing to kafka without order

I have AVRO files sorted with ID and each ID has folder called "ID=234" and data inside the folder is in AVRO format and sorted on the basis of date.
I am running spark job which takes input path and reads avro in dataframe. This dataframe then writes to kafka topic with 5 partition.
val properties: Properties = getProperties(args)
val spark = SparkSession.builder().master(properties.getProperty("master"))
.appName(properties.getProperty("appName")).getOrCreate()
val sqlContext = spark.sqlContext
val sourcePath = properties.getProperty("sourcePath")
val dataDF = sqlContext.read.avro(sourcePath).as("data")
val count = dataDF.count();
val schemaRegAdd = properties.getProperty("schemaRegistry")
val schemaRegistryConfs = Map(
SchemaManager.PARAM_SCHEMA_REGISTRY_URL -> schemaRegAdd,
SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY -> SchemaManager.SchemaStorageNamingStrategies.TOPIC_NAME
)
val start = Instant.now
dataDF.select(functions.struct(properties.getProperty("message.key.name")).alias("key"), functions.struct("*").alias("value"))
.toConfluentAvroWithPlainKey(properties.getProperty("topic"), properties.getProperty("schemaName"),
properties.getProperty("schemaNamespace"))(schemaRegistryConfs)
.write.format("kafka")
.option("kafka.bootstrap.servers",properties.getProperty("kafka.brokers"))
.option("topic",properties.getProperty("topic")).save()
}
My use case is to write all messages from each ID (sorted on date) sequencially such as all sorted data from one ID 1 should be added first then from ID 2 and so on. Kafka message has key as ID.
Dont forgot that the data inside a RDD/dataset is shuffle when you do transformations so you lose the order.
the best way to achieve this is to read file one by one and send it to kafka instead of read a full directory in your val sourcePath = properties.getProperty("sourcePath")

How to create a DStream from a List of string?

I have a list of string, but i cant find a way to change the list to a DStream of spark streaming.
I tried this:
val tmpList = List("hi", "hello")
val rdd = sqlContext.sparkContext.parallelize(Seq(tmpList))
val rowRdd = rdd.map(v => Row(v: _*))
But the eclipse says sparkContext is not a member of sqlContext, so, How can i do this?
Appreciate your help, Please.
DStream is the sequence of RDD and it is created when you have register a received to some streaming source like Kafka. For testing if you want to create DStream from list of RDD's you can do that as follows:
val rdd1 = sqlContext.sparkContext.parallelize(Seq(tmpList))
val rdd2 = sqlContext.sparkContext.parallelize(Seq(tmpList1))
ssc.queueStream[String](mutable.Queue(rdd1,rdd2))
Hope it answers your question.

How to evaluate spark Dstream objects with an spark data frame

I am writing an spark app ,where I need to evaluate the streaming data based on the historical data, which sits in a sql server database
Now the idea is , spark will fetch the historical data from the database and persist it in the memory and will evaluate the streaming data against it .
Now I am getting the streaming data as
import re
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext,functions as func,Row
sc = SparkContext("local[2]", "realtimeApp")
ssc = StreamingContext(sc,10)
files = ssc.textFileStream("hdfs://RealTimeInputFolder/")
########Lets get the data from the db which is relavant for streaming ###
driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
dataurl = "jdbc:sqlserver://myserver:1433"
db = "mydb"
table = "stream_helper"
credential = "my_credentials"
########basic data for evaluation purpose ########
files_count = files.flatMap(lambda file: file.split( ))
pattern = '(TranAmount=Decimal.{2})(.[0-9]*.[0-9]*)(\\S+ )(TranDescription=u.)([a-zA-z\\s]+)([\\S\\s]+ )(dSc=u.)([A-Z]{2}.[0-9]+)'
tranfiles = "wasb://myserver.blob.core.windows.net/RealTimeInputFolder01/"
def getSqlContextInstance(sparkContext):
if ('sqlContextSingletonInstance' not in globals()):
globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
return globals()['sqlContextSingletonInstance']
def pre_parse(logline):
"""
to read files as rows of sql in pyspark streaming using the pattern . for use of logging
added 0,1 in case there is any failure in processing by this pattern
"""
match = re.search(pattern,logline)
if match is None:
return(line,0)
else:
return(
Row(
customer_id = match.group(8)
trantype = match.group(5)
amount = float(match.group(2))
),1)
def parse():
"""
actual processing is happening here
"""
parsed_tran = ssc.textFileStream(tranfiles).map(preparse)
success = parsed_tran.filter(lambda s: s[1] == 1).map(lambda x:x[0])
fail = parsed_tran.filter(lambda s:s[1] == 0).map(lambda x:x[0])
if fail.count() > 0:
print "no of non parsed file : %d", % fail.count()
return success,fail
success ,fail = parse()
Now I want to evaluate it by the data frame that I get from the historical data
base_data = sqlContext.read.format("jdbc").options(driver=driver,url=dataurl,database=db,user=credential,password=credential,dbtable=table).load()
Now since this being returned as a data frame how do I use this for my purpose .
The streaming programming guide here says
"You have to create a SQLContext using the SparkContext that the StreamingContext is using."
Now this makes me even more confused on how to use the existing dataframe with the streaming object . Any help is highly appreciated .
To manipulate DataFrames, you always need a SQLContext so you can instanciate it like :
sc = SparkContext("local[2]", "realtimeApp")
sqlc = SQLContext(sc)
ssc = StreamingContext(sc, 10)
These 2 contexts (SQLContext and StreamingContext) will coexist in the same job because they are associated with the same SparkContext.
But, keep in mind, you can't instanciate two different SparkContext in the same job.
Once you have created your DataFrame from your DStreams, you can join your historical DataFrame with the DataFrame created from your stream.
To do that, I would do something like :
yourDStream.foreachRDD(lambda rdd: sqlContext
.createDataFrame(rdd)
.join(historicalDF, ...)
...
)
Think about the amount of streamed data you need to use for your join when you manipulate streams, you may be interested by the windowed functions

Add a header before text file on save in Spark

I have some spark code to process a csv file. It does some transformation on it. I now want to save this RDD as a csv file and add a header. Each line of this RDD is already formatted correctly.
I am not sure how to do it. I wanted to do a union with the header string and my RDD but the header string is not an RDD so it does not work.
You can make an RDD out of your header line and then union it, yes:
val rdd: RDD[String] = ...
val header: RDD[String] = sc.parallelize(Array("my,header,row"))
header.union(rdd).saveAsTextFile(...)
Then you end up with a bunch of part-xxxxx files that you merge.
The problem is that I don't think you're guaranteed that the header will be the first partition and therefore end up in part-00000 and at the top of your file. In practice, I'm pretty sure it will.
More reliable would be to use Hadoop commands like hdfs to merge the part-xxxxx files, and as part of the command, just throw in the header line from a file.
Some help on writing it without Union(Supplied the header at the time of merge)
val fileHeader ="This is header"
val fileHeaderStream: InputStream = new ByteArrayInputStream(fileHeader.getBytes(StandardCharsets.UTF_8));
val output = IOUtils.copyBytes(fileHeaderStream,out,conf,false)
Now loop over you file parts to write the complete file using
val in: DataInputStream = ...<data input stream from file >
IOUtils.copyBytes(in, output, conf, false)
This made sure for me that header always comes as first line even when you use coalasec/repartition for efficient writing
def addHeaderToRdd(sparkCtx: SparkContext, lines: RDD[String], header: String): RDD[String] = {
val headerRDD = sparkCtx.parallelize(List((-1L, header))) // We index the header with -1, so that the sort will put it on top.
val pairRDD = lines.zipWithIndex()
val pairRDD2 = pairRDD.map(t => (t._2, t._1))
val allRDD = pairRDD2.union(headerRDD)
val allSortedRDD = allRDD.sortByKey()
return allSortedRDD.values
}
Slightly diff approach with Spark SQL
From Question: I now want to save this RDD as a CSV file and add a header. Each line of this RDD is already formatted correctly.
With Spark 2.x you have several options to convert RDD to DataFrame
val rdd = .... //Assume rdd properly formatted with case class or tuple
val df = spark.createDataFrame(rdd).toDF("col1", "col2", ... "coln")
df.write
.format("csv")
.option("header", "true") //adds header to file
.save("hdfs://location/to/save/csv")
Now we can even use Spark SQL DataFrame to load, transform and save CSV file
spark.sparkContext.parallelize(Seq(SqlHelper.getARow(temRet.columns,
temRet.columns.length))).union(temRet.rdd).map(x =>
x.mkString("\x01")).coalesce(1, true).saveAsTextFile(retPath)
object SqlHelper {
//create one row
def getARow(x: Array[String], size: Int): Row = {
var columnArray = new Array[String](size)
for (i <- 0 to (size - 1)) {
columnArray(i) = x(i).toString()
}
Row.fromSeq(columnArray)
}
}

Resources