Fill missing value in Spark dataframe - apache-spark

I'm trying to fill missing values in a Spark DataFrame using PySpark, but I haven't found a proper way to do it. My task is to fill a row's missing values based on its previous or following rows. Concretely, I would change a 0.0 value in a row to the value of the previous row, while doing nothing to a non-zero row. I did see the Window functions in Spark, but they only support simple operations like max, min, and mean, which are not suitable for my case. It would be ideal if we could have a user-defined function sliding over a given Window.
Does anybody have a good idea?

Use the Spark Window API to access previous-row data. If you work with time series data, see also this package for missing-data imputation.
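As a minimal sketch of the window approach, assuming a hypothetical ordering column "ts" and value column "value": map 0.0 to null, then carry the last non-null value forward with last(..., ignorenulls=True) over a window.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: "ts" orders the rows, "value" uses 0.0 for missing entries
df = spark.createDataFrame([(1, 10.0), (2, 0.0), (3, 0.0), (4, 7.0)], ["ts", "value"])

# Window from the first row up to the current row (add partitionBy for large data)
w = Window.orderBy("ts").rowsBetween(Window.unboundedPreceding, 0)

filled = (df
    # Treat 0.0 as missing by mapping it to null
    .withColumn("value_or_null", F.when(F.col("value") != 0.0, F.col("value")))
    # Carry the last non-null value forward over the window
    .withColumn("value_filled", F.last("value_or_null", ignorenulls=True).over(w))
    .drop("value_or_null"))

filled.show()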

Related

Spark SQL Window functions - manual repartitioning necessary?

I am processing data partitioned by column "A" with PySpark.
Now, I need to use a window function over another column "B" to get the max value for this frame and increment it for new entries.
As it says here, "Also, the user might want to make sure all rows having the same value for the category column are collected to the same machine before ordering and calculating the frame."
Do I need to manually repartition the data by column "B" before applying the window, or does Spark do this automatically?
I.e. would I have to do:
data = data.repartition("B")
before:
w = Window().partitionBy("B").orderBy(col("id").desc())
Thanks a lot!
If you use Window.partitionBy(someCol) and you have not set a value for the shuffle partitions parameter, the partitioning will default to 200 partitions.
A similar, though not identical, post should provide guidance: spark.sql.shuffle.partitions of 200 default partitions conundrum
So, in short, you need not explicitly perform the repartition; the shuffle partitions parameter is the more relevant setting.
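A minimal sketch, assuming a hypothetical dataframe with columns "A", "B", and "id": the partitionBy("B") in the window spec already shuffles rows with the same "B" value onto the same partition, so no explicit repartition("B") is needed; spark.sql.shuffle.partitions controls how many partitions that shuffle produces.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.getOrCreate()

# Illustrative value; defaults to 200 if left unset
spark.conf.set("spark.sql.shuffle.partitions", "400")

# Hypothetical data with columns "A", "B", and "id"
data = spark.createDataFrame([("a1", "b1", 1), ("a1", "b1", 2), ("a2", "b2", 3)], ["A", "B", "id"])

# No manual data.repartition("B") beforehand: the window's partitionBy
# triggers the shuffle that collects equal "B" values onto the same partition.
w = Window.partitionBy("B").orderBy(col("id").desc())
data.withColumn("rank_in_B", row_number().over(w)).show()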

Assigning indexes across rows within python DataFrame

I'm currently trying to assign a unique index across rows, rather than along columns. The main problem is that these values can never repeat and must be preserved with every monthly report that I run.
I've thought about merging the columns and assigning an index to that, but my concern is that I won't be able to easily modify the dataframe and still preserve the same index values for each cell with this method.
I'm expecting my df to look something like this below:
Sample DataFrame
I haven't yet found a viable solution so haven't got any code to show yet. Any solutions would be much appreciated. Thank you.
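One way to realize the "merge the columns" idea, sketched with pandas and entirely hypothetical column names ("account", "item"): build a stable key from the identifying columns and persist a key-to-index mapping between monthly runs, so existing rows keep their index and only new keys receive fresh, never-reused values.

import pandas as pd

# Hypothetical monthly report; "account" and "item" together identify a row
df = pd.DataFrame({
    "account": ["A1", "A1", "B7"],
    "item": ["rent", "utilities", "rent"],
    "amount": [1200, 300, 950],
})

# Stable key built from the identifying columns
df["row_key"] = df["account"] + "|" + df["item"]

# In practice, load this mapping from last month's run; empty on the first run
existing = pd.Series(dtype="int64")

# Assign new, never-repeating indexes only to keys not seen before
new_keys = df.loc[~df["row_key"].isin(existing.index), "row_key"].unique()
start = int(existing.max()) + 1 if len(existing) else 0
mapping = pd.concat([existing, pd.Series(range(start, start + len(new_keys)), index=new_keys)])

df["row_index"] = df["row_key"].map(mapping)
print(df)  # persist `mapping` to disk so next month's run reuses it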

Previous item search in apache spark

I'm quite new to the big data area and I'm trying to solve a problem. I am currently evaluating Spark as a solution and would like to check whether this could be achieved with it.
My simplified input data schema:
|TransactionID|CustomerID|Timestamp|
What I'd like to do is, for each transaction ID, find the 5 previous transaction IDs for the same customer. So the output data schema would look like:
|TransactionID|1stPrevTID|2ndPrevTID|...|5thPrevTID|
My input data source contains around a billion entries.
Here my question would be, is Spark a good candidate for solution or should I consider something else?
This can be done using the lag function.
from pyspark.sql.functions import lag
from pyspark.sql import Window
# Assuming the DataFrame is named df, with columns TransactionID, CustomerID, Timestamp
w = Window.partitionBy(df.CustomerID).orderBy(df.Timestamp)
df_with_lag = df.withColumn('t1_prev', lag(df.TransactionID, 1).over(w))\
    .withColumn('t2_prev', lag(df.TransactionID, 2).over(w))\
    .withColumn('t3_prev', lag(df.TransactionID, 3).over(w))\
    .withColumn('t4_prev', lag(df.TransactionID, 4).over(w))\
    .withColumn('t5_prev', lag(df.TransactionID, 5).over(w))
df_with_lag.show()
Documentation on lag
Window function: returns the value that is offset rows before the current row, and defaultValue if there is less than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.

Calculating Kernel Density of every column in a Spark DataFrame

Is there a way to calculate the KDE of every column of a DataFrame?
I have a DataFrame where each column represents the values of one feature. The KDE function of Spark MLlib needs an RDD[Double] of the sample values. The problem is I need a way that does not collect the values for each column, because that would slow down the program too much.
Does anyone have an idea how I could solve this? Sadly, all my attempts have failed so far.
You can probably create a new RDD using the sample function (refer here) and then perform your operation to get optimal performance.
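A minimal sketch of the per-column approach, assuming a hypothetical feature DataFrame with columns "f1" and "f2": select each column as an RDD of doubles (no collect to the driver) and feed it to MLlib's KernelDensity.

from pyspark.sql import SparkSession
from pyspark.mllib.stat import KernelDensity

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame; each column holds the sample values of one feature
df = spark.createDataFrame([(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)], ["f1", "f2"])

eval_points = [0.0, 1.0, 2.0, 3.0]  # points at which each density is evaluated

densities = {}
for col_name in df.columns:
    # One column as an RDD[Double], without collecting values to the driver
    sample_rdd = df.select(col_name).rdd.map(lambda row: float(row[0]))
    kd = KernelDensity()
    kd.setSample(sample_rdd)
    kd.setBandwidth(1.0)  # illustrative bandwidth
    densities[col_name] = kd.estimate(eval_points)

print(densities)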

How does Apache Spark Structured Streaming 2.3.0 let the sink know that a new row is an update of an existing row?

How does Spark Structured Streaming let the sink know that a new row is an update of an existing row when run in update mode? Does it look at all the values of all columns of the new row and an existing row for an equality match, or does it compute some sort of hash?
Reading the documentation, we see some interesting information about update mode (bold formatting added by me):
Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it will be equivalent to Append mode.
So, to use update mode there needs to be some kind of aggregation; otherwise all data will simply be appended to the end of the result table. In turn, to use aggregation the data needs one or more columns to serve as a key. Since a key is needed, it is easy to know whether a row has been updated: simply compare the values with the previous iteration of the table (the key tells you which row to compare with). In aggregations that contain a groupBy, the columns being grouped on are the keys.
Simple aggregations that return a single value do not require a key. However, since only a single value is returned, it will be updated whenever that value changes. An example here would be taking the sum of a column (without a groupBy).
The documentation contains a picture that gives a good understanding of this, see the "Model of the Quick Example" from the link above.
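For reference, a minimal sketch in the spirit of that quick example (word count over a hypothetical socket source): the grouped column "word" acts as the key, and in update mode only rows whose count changed since the last trigger are written to the sink.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.getOrCreate()

# Hypothetical streaming source on localhost:9999
lines = (spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load())

# groupBy("word") makes "word" the key that identifies updated rows
words = lines.select(explode(split(lines.value, " ")).alias("word"))
word_counts = words.groupBy("word").count()

# Update mode: only changed counts are emitted each trigger
query = (word_counts.writeStream
    .outputMode("update")
    .format("console")
    .start())
query.awaitTermination()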
