Keeping csv order when creating pyspark dataframe - apache-spark

I have a csv file from which I'm trying to create a Spark streaming dataframe. The problem is that when printing the records I noticed they come out in random order. Is there a way to keep the order of the original csv file when creating the stream?
I tried orderBy on the Index field but got this error: pyspark.sql.utils.AnalysisException: Sorting is not supported on streaming DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete output mode;;
I need the order because later, I'll be using a concept drift model that requires ordered data.
def f(row):
    print(row)

df = spark.readStream.format('csv').schema(csv_schema).option('header', 'true').option('delimiter', ',').load('./Swat_dataset*.csv')
df_final = df.select("Index", "MV301", "AIT501", "PIT503", "Normal/Attack")
query = df_final.writeStream.outputMode("append").foreach(f).option("checkpointLocation", "checkpoints").start()
query.awaitTermination()
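A streaming query gives no global ordering guarantee, but if per-batch ordering is enough for the drift model, one workaround (a sketch, reusing the Index column from the question) is to switch from foreach to foreachBatch: each micro-batch is a static DataFrame, so sorting it is allowed. Note this only orders rows within a micro-batch, not across the whole file.

def process_batch(batch_df, batch_id):
    # batch_df is a static DataFrame here, so orderBy is permitted
    for row in batch_df.orderBy("Index").collect():
        print(row)

query = (df_final.writeStream
         .outputMode("append")
         .foreachBatch(process_batch)
         .option("checkpointLocation", "checkpoints")
         .start())
query.awaitTermination()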

Related

Spark caching - when to cache within foreachBatch (Spark Streaming)

I'm currently reading from a Kafka topic using Spark Structured Streaming. Then, in foreachBatch(df), I do some transformations. I first filter the batch df by an id (df_filtered - I can apply this filter n times), then create a dataframe based on that filtered df (new_df_filtered - because the data comes as a JSON message and I want to convert it to a normal column structure by providing a schema), and finally write it to 2 sinks.
Here's a sample of the code:
def sink_process(self, df: DataFrame, current_ids: list):
    # repartition returns a new DataFrame, so the result must be reassigned
    df = df.repartition(int(os.environ.get("SPARK_REPARTITION_NUMBER")))
    df.cache()
    for id in current_ids:
        df_filtered = self.df_filter_by_id(df, id)  # returns the new dataframe with the schema; uses a .where and then a .createDataFrame
        first_row = df_filtered.take(1)  # making sure that this filter returns any data
        if first_row:
            df_filtered.cache()
            self.sink_process(df_filtered, id)
            df_filtered.unpersist()
    df.unpersist()
My question is where I should cache this data for optimal performance. Right now I cache the batch before applying any transformations, which I have come to realise isn't really doing anything at that point, since the data is only cached when the first action occurs. So following this logic, I'm only really caching this df when I reach that .take, right? But at that point I'm also caching the filtered df. The idea behind caching the batch data before the filter was that, if I had a lot of different ids, I wouldn't be re-fetching the data every time I applied the filter, but I might have gotten this all wrong.
Can anyone please help clarify what would be the best approach? Maybe only caching df_filtered, which is the one that is going to be used for the different sinks?
Thanks
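For reference, a minimal sketch of one common pattern (df_filter_by_id comes from the question; write_to_sinks is a hypothetical placeholder for the two sink writes): cache the batch once, force it to materialize with a cheap action before the per-id loop, and cache each filtered frame only while it is being written.

def sink_process(self, df: DataFrame, current_ids: list):
    df = df.repartition(int(os.environ.get("SPARK_REPARTITION_NUMBER")))
    df.cache()
    df.count()  # materializes the cache so every per-id filter reads from memory
    for id in current_ids:
        df_filtered = self.df_filter_by_id(df, id)
        if df_filtered.take(1):
            df_filtered.cache()  # reused by both sinks, so caching pays off here
            self.write_to_sinks(df_filtered, id)  # hypothetical: writes df_filtered to the two sinks
            df_filtered.unpersist()
    df.unpersist()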

PySpark rdd of pandas data frames

I'm extracting information of different source files. Each source file corresponds to a given snapshot time of some measurement data. I have a preprocessing function that takes one of these files and outputs a pandas data frame. So I did a spark sc.wholeTextFiles call, which gave me a list of all input files, and then I called map on it, which provided me with an rdd where each element is a pandas data frame. What would now be the best approach to "reshape" this structure such that I have only one resulting data frame consisting of the concatenated smaller data frames?
You can create a Spark dataframe directly. Assuming these files are situated in one location and are delimited, you can use Spark to create a new dataframe holding the data from all the files.
spark.read.option("header", "true").csv("../location/*")
After that you can use the many transformations available in Spark. They are quite similar to pandas, work on big data, and are even faster than the RDD API.
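If you would rather keep the existing pandas preprocessing, another option (a sketch; preprocess stands for the per-file function described in the question and its signature is an assumption) is to flatten the RDD of pandas data frames into rows and build a single Spark dataframe from them:

from pyspark.sql import Row

files_rdd = sc.wholeTextFiles("../location/*")            # (path, content) pairs
pandas_rdd = files_rdd.map(lambda kv: preprocess(kv[1]))   # one pandas data frame per file

# Flatten each pandas data frame into Rows, then let Spark build one schema
rows_rdd = pandas_rdd.flatMap(lambda pdf: [Row(**rec) for rec in pdf.to_dict("records")])
combined_df = spark.createDataFrame(rows_rdd)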

Xml parsing on spark Structured Streaming

I'm trying to analyze data using Kinesis source in PySpark Structured Streaming on Databricks.
I created a Dataframe as shown below.
kinDF = spark.readStream.format("kinesis").option("streamName", "test-stream-1").load()
Later I converted the data from base64 encoding as below.
df = kinDF.withColumn("xml_data", expr("CAST(data as string)"))
Now, I need to extract a few fields from the df.xml_data column using XPath. Can you please suggest any possible solution?
If I create a dataframe directly for these xml files as xml_df = spark.read.format("xml").options(rowTag='Consumers').load("s3a://bkt/xmldata"), I'm able to query using xpath:
xml_df.select("Analytics.Amount1").show()
But I'm not sure how to extract elements in the same way on a Spark Streaming dataframe where the data is in text format.
Are there any xml functions to convert text data using schema? I saw an example for json data using from_json.
Is it possible to use spark.read on a dataframe column?
I need to find the aggregated "Amount1" for every 5-minute window.
Thanks for your help
You can use com.databricks.spark.xml.XmlReader to read xml data from a column, but it requires an RDD, which means you need to transform your df to an RDD using df.rdd, and that may impact performance.
Below is untested Scala code along those lines:
import com.databricks.spark.xml.XmlReader

val xmlRdd = kinDF.select("xml_data").rdd.map(r => r.getString(0))
new XmlReader().xmlRdd(spark, xmlRdd)
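Another option worth sketching (the element path /Consumers/Analytics/Amount1 and the use of the approximateArrivalTimestamp column are assumptions based on the question; the Databricks Kinesis source normally exposes such a timestamp): Spark SQL's built-in xpath_* functions operate directly on a string column, so they also work on a streaming dataframe and avoid the RDD round trip.

from pyspark.sql.functions import expr, window, col

parsed = df.select(
    expr("xpath_string(xml_data, '/Consumers/Analytics/Amount1')").cast("double").alias("Amount1"),
    col("approximateArrivalTimestamp").alias("event_time"),  # assumed timestamp column from the Kinesis source
)

# 5-minute tumbling window aggregation with a watermark for late data
agg = (parsed
       .withWatermark("event_time", "10 minutes")
       .groupBy(window(col("event_time"), "5 minutes"))
       .sum("Amount1"))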

How to save pandas DataFrame to Hive within `foreachPartition` (PySpark)

I have a PySpark DataFrame. I want to apply some function with foreachPartition and then save each result to Hive. The result is a pandas dataframe (within each partition). What is the best way to do this?
I have tried the following without success (it gives a serialization error):
def processData(x):
    # do something that produces pandas_df
    spark_df = spark.createDataFrame(pandas_df)
    spark_df.write.mode("append").format("parquet").saveAsTable("db.table_name")

original_spark_df.rdd.foreachPartition(processData)
I guess one solution would be to turn the pandas dataframes into an RDD and return it (using mapPartitions instead of foreachPartition), and then use rdd.toDF() and saveAsTable().
Is there some solution to save the pandas dataframes to Hive within foreachPartition?
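For what it's worth, a minimal sketch of the mapPartitions alternative described above (the pandas processing and the schema are placeholders; the table name db.table_name is taken from the question):

from pyspark.sql import Row
import pandas as pd

def process_partition(rows):
    pandas_df = pd.DataFrame([r.asDict() for r in rows])
    # ... do something with pandas_df ...
    for rec in pandas_df.to_dict("records"):
        yield Row(**rec)

result_df = original_spark_df.rdd.mapPartitions(process_partition).toDF(schema)
result_df.write.mode("append").format("parquet").saveAsTable("db.table_name")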

Spark Parse Text File to DataFrame

Currently, I can parse a text file to a Spark DataFrame by way of the RDD API with the following code:
def row_parse_function(raw_string_input):
    # Do parse logic...
    return pyspark.sql.Row(...)

raw_rdd = spark_context.textFile(full_source_path)

# Convert RDD of strings to RDD of pyspark.sql.Row
row_rdd = raw_rdd.map(row_parse_function).filter(bool)

# Convert RDD of pyspark.sql.Row to Spark DataFrame.
data_frame = spark_sql_context.createDataFrame(row_rdd, schema)
Is this current approach ideal?
Or is there a better way to do this without using the older RDD API?
FYI, Spark 2.0.
Clay,
This is a good approach for loading a file that doesn't have a specific format such as CSV, JSON, ORC, or Parquet, and isn't coming from a database.
If you have any kind of specific logic to apply to it, this is the best way to do that. The RDD API is meant for this kind of situation, when you need to run specific, non-trivial logic over your data.
You can read here about the uses of the Spark APIs; your situation is one where the RDD API is the best approach.
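For comparison, a sketch of a DataFrame-only alternative (it assumes pipe-delimited lines with three fields, which is not stated in the question; adjust the split pattern and column names to the real format): read the file with spark.read.text and parse the value column with built-in functions.

from pyspark.sql.functions import split, col

raw_df = spark.read.text(full_source_path)  # single column named "value"

parts = split(col("value"), r"\|")  # assumed delimiter
data_frame = raw_df.select(
    parts.getItem(0).alias("field_a"),
    parts.getItem(1).alias("field_b"),
    parts.getItem(2).cast("double").alias("field_c"),
).na.drop(subset=["field_a"])  # rough analogue of the .filter(bool) step above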
