How does Structured Streaming execute pandas_udf? - apache-spark

I'd like to understand how structured streaming treats new data coming.
If more rows arrive at the same time, spark append them to the input streaming dataframe, right?
If I have a withColumn and apply a pandas_udf, the function is called once per each row, or only one time and the rows are passed to the pandas_udf?
Let's say something like this:
dfInt = spark \
.readStream \
.load() \
.withColumn("prediction", predict( (F.struct([col(x) for x in (features)]))))
If more rows arrive at the same time, they are processed together or once per each?=
There is the chance to limit this to only one row per time?

If more rows arrive at the same time, spark append them to the input streaming dataframe, right?
Let's talk Micro-Batch Execution Engine only, right? That's what you most likely use in streaming queries.
Structured Streaming queries the streaming sources in a streaming query using Source.getBatch (DataSource API V1):
getBatch(start: Option[Offset], end: Offset): DataFrame
Returns the data that is between the offsets (start, end]. When start is None, then the batch should begin with the first record.
Whatever the source returns in a DataFrame is the data to be processed in a micro-batch.
If I have a withColumn and apply a pandas_udf, the function is called once per each row
Always. That's how user-defined functions work in Spark SQL.
or only one time and the rows are passed to the pandas_udf?
This says:
Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and Pandas to work with the data.
The Python function should take pandas.Series as inputs and return a pandas.Series of the same length. Internally, Spark will execute a Pandas UDF by splitting columns into batches and calling the function for each batch as a subset of the data, then concatenating the results together.
If more rows arrive at the same time, they are processed together or once per each?
If "arrive" means "part of a single DataFrame", then "they are processed together", but one row at a time (per the UDF contract).
There is the chance to limit this to only one row per time?
You don't have to. It's as such by design. One row at a time only.

Related

How to extract a Dataset content n rows by n rows?

I have to output the results of a Dataset into a Postgis (spatial) database. Spark doesn't handle it and I had to write specific code that cannot be serialized. It means that I can't use dataset.foreach(...) method, and I have to execute my database insertions from outside Spark tasks.
But a whole
List<Row> rows = ds.collectAsList()
will produce an out of memory error.
And a
List<Row> row = takeList();
only returns the n first rows of the dataset.
Is there a way to read sequentially the dataset, so that I can read its whole content from the beginning to the end, extracting each time only a fixed amount of rows ?
You can try randomSplit method to split your dataframe into multiple dataframes.
For example, to split into 3:
ds.randomSplit(Array(1,1,1))

Is there a way to slice dataframe based on index in pyspark?

In python or R, there are ways to slice DataFrame using index.
For example, in pandas:
df.iloc[5:10,:]
Is there a similar way in pyspark to slice data based on location of rows?
Short Answer
If you already have an index column (suppose it was called 'id') you can filter using pyspark.sql.Column.between:
from pyspark.sql.functions import col
df.where(col("id").between(5, 10))
If you don't already have an index column, you can add one yourself and then use the code above. You should have some ordering built in to your data based on some other columns (orderBy("someColumn")).
Full Explanation
No it is not easily possible to slice a Spark DataFrame by index, unless the index is already present as a column.
Spark DataFrames are inherently unordered and do not support random access. (There is no concept of a built-in index as there is in pandas). Each row is treated as an independent collection of structured data, and that is what allows for distributed parallel processing. Thus, any executor can take any chunk of the data and process it without regard for the order of the rows.
Now obviously it is possible to perform operations that do involve ordering (lead, lag, etc), but these will be slower because it requires spark to shuffle data between the executors. (The shuffling of data is typically one of the slowest components of a spark job.)
Related/Futher Reading
PySpark DataFrames - way to enumerate without converting to Pandas?
PySpark - get row number for each row in a group
how to add Row id in pySpark dataframes
You can convert your spark dataframe to koalas dataframe.
Koalas is a dataframe by Databricks to give an almost pandas like interface to spark dataframe. See here https://pypi.org/project/koalas/
import databricks.koalas as ks
kdf = ks.DataFrame(your_spark_df)
kdf[0:500] # your indexes here

Spark's dataframe count() function taking very long

In my code, I have a sequence of dataframes where I want to filter out the dataframe's which are empty. I'm doing something like:
Seq(df1, df2).map(df => df.count() > 0)
However, this is taking extremely long and is consuming around 7 minutes for approximately 2 dataframe's of 100k rows each.
My question: Why is Spark's implementation of count() is slow. Is there a work-around?
Count is a lazy operation. So it does not matter how big is your dataframe. But if you have too many costly operations on the data to get this dataframe, then once the count is called spark would actually do all the operations to get these dataframe.
Some of the costly operations may be operations which needs shuffling of data. Like groupBy, reduce etc.
So my guess is you have some complex processing to get these dataframes or your initial data which you used to get this dataframe is too huge.

Does Spark's RDD.combineByKey() preserve the order of a previously sorted DataFrame?

I've done this in PySpark:
Created a DataFrame using a SELECT statement to get asset data ordered by asset serial number and then time.
Used DataFrame.map() to convert the DataFrame to an RDD.
Used RDD.combineByKey() to collate all the data for each asset, using the asset's serial number as the key.
Question: Can I be certain that the data for each asset will still be sorted in time order in the RDD resulting from the last step?
Time order is crucial for me (I need to calculate statistics over a moving time window across the data for each asset). When RDD.combineByKey() combines data from different nodes in the Spark cluster for a given key, is any order in that key's data retained? Or is the data from the different nodes combined in no particular order for a given key?
Can I be certain that the data for each asset will still be sorted in time order in the RDD resulting from the last step?
You cannot. When you apply sort across multiple dimensions (data ordered by asset serial number and then time) records for a single asset can be spread across multiple partitions. combineByKey will require a shuffle and the order in which these parts are combined is not guaranteed.
You can try with repartition and sortWithinPartitions (or its equivalent on RDDs):
df.repartition("asset").sortWithinPartitions("time")
or
df.repartition("asset").sortWithinPartitions("asset", "time")
or window functions with frame definition as follows:
w = Window.partitionBy("asset").orderBy("time")
In Spark >= 2.0 window functions can be used with UserDefinedFunctions so if you're fine with writing your own SQL extensions in Scala you can skip conversion to RDD completely.

Join Spark dataframe with Cassandra table [duplicate]

Dataframe A (millions of records) one of the column is create_date,modified_date
Dataframe B 500 records has start_date and end_date
Current approach:
Select a.*,b.* from a join b on a.create_date between start_date and end_date
The above job takes half hour or more to run.
how can I improve the performance
DataFrames currently doesn't have an approach for direct joins like that. It will fully read both tables before performing a join.
https://issues.apache.org/jira/browse/SPARK-16614
You can use the RDD API to take advantage of the joinWithCassandraTable function
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable
As others suggested, one of the approach is to broadcast the smaller dataframe. This can be done automatically also by configuring the below parameter.
spark.sql.autoBroadcastJoinThreshold
If the dataframe size is smaller than the value specified here, Spark automatically broadcasts the smaller dataframe instead of performing a join. You can read more about this here.

Resources