Previous item search in apache spark - apache-spark

I'm quite new to the big data area and have a problem to solve. I am currently evaluating Spark and would like to check whether this can be achieved with it.
My simplified input data schema:
|TransactionID|CustomerID|Timestamp|
What I'd like to get, for each transaction ID, is the 5 previous transaction IDs for the same customer. The output data schema would look like:
|TransactionID|1stPrevTID|2ndPrevTID|...|5thPrevTID|
My input data source is around a billion entries.
My question is: is Spark a good candidate for this, or should I consider something else?

This can be done using the lag function.
from pyspark.sql.functions import lag
from pyspark.sql import Window

# Assuming the dataframe is named df
w = Window.partitionBy(df.customerid).orderBy(df.timestamp)

df_with_lag = df.withColumn('t1_prev', lag(df.transactionID, 1).over(w))\
                .withColumn('t2_prev', lag(df.transactionID, 2).over(w))\
                .withColumn('t3_prev', lag(df.transactionID, 3).over(w))\
                .withColumn('t4_prev', lag(df.transactionID, 4).over(w))\
                .withColumn('t5_prev', lag(df.transactionID, 5).over(w))

df_with_lag.show()
Documentation on lag
Window function: returns the value that is offset rows before the current row, and defaultValue if there is less than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.
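For completeness, a minimal sketch of projecting the lagged columns into the output schema from the question; the column names follow the answer code above and the target schema described by the asker.
from pyspark.sql.functions import col

# Rename the lag columns to match the desired output schema.
result = df_with_lag.select(
    col("transactionID").alias("TransactionID"),
    col("t1_prev").alias("1stPrevTID"),
    col("t2_prev").alias("2ndPrevTID"),
    col("t3_prev").alias("3rdPrevTID"),
    col("t4_prev").alias("4thPrevTID"),
    col("t5_prev").alias("5thPrevTID"),
)
result.show()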

Related

Efficient way to get the first item satisfying a condition

Using PySpark, I want to get the first element from a column satisfying a condition, and I want this operation to be efficient so that the element is returned as soon as the condition is first satisfied.
Currently, I am trying
df.filter(df.seller_id==6).take(1)
It is taking a lot of time, and I suspect something is causing a scan or read of the entire dataset. However, I would expect the filter to be pushed down while reading, so that the value is returned as soon as a row with seller_id 6 is found. The first row in df_sales has seller_id 6, so reading that row alone should be enough.
How can I do this more efficiently than the code I have mentioned?
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

windowSpec = Window.partitionBy("seller_id")\
                   .orderBy("how you determine first in partition")

df.withColumn("row_number", row_number().over(windowSpec)).where(col("row_number") == 1)
The code above should do what you are looking for. It partitions your data by seller ID; within each partition, the column you pass to orderBy() defines what "first" means, row_number() ranks the rows by that ordering, and the final where keeps only the first row of each ordered partition. A concrete version follows below.
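A runnable version of the sketch, under the assumption that a timestamp column named "created_at" (hypothetical, not in the original question) defines what "first" means:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

# "created_at" is a hypothetical ordering column; replace it with whatever
# column defines "first" in your data.
windowSpec = Window.partitionBy("seller_id").orderBy(col("created_at"))

first_per_seller = (
    df.withColumn("row_number", row_number().over(windowSpec))
      .where(col("row_number") == 1)
      .drop("row_number")
)
first_per_seller.where(col("seller_id") == 6).show()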

Spark SQL Window functions - manual repartitioning necessary?

I am processing data partitioned by column "A" with PySpark.
Now, I need to use a window function over another column "B" to get the max value for this frame and count it up for new entries.
As it says here, "Also, the user might want to make sure all rows having the same value for the category column are collected to the same machine before ordering and calculating the frame."
Do I need to manually repartition the data by column "B" before applying the window, or does Spark do this automatically?
I.e. would I have to do:
data = data.repartition("B")
before:
w = Window().partitionBy("B").orderBy(col("id").desc())
Thanks a lot!
If you use Window.partitionBy(someCol) and have not set a value for the shuffle partitions parameter, the partitioning will default to 200 partitions.
A similar, though not identical, post should provide guidance: spark.sql.shuffle.partitions of 200 default partitions conundrum
So, in short, you do not need to explicitly perform the repartition; the shuffle partitions parameter is the more relevant setting. See the sketch below.
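As a minimal sketch (assuming an active SparkSession named spark; 400 is an arbitrary illustrative value, and data is the DataFrame from the question):
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Tune the shuffle partition count instead of repartitioning manually.
spark.conf.set("spark.sql.shuffle.partitions", "400")

# No data.repartition("B") is needed first: partitionBy("B") already shuffles
# all rows with the same value of "B" onto the same partition.
w = Window.partitionBy("B").orderBy(col("id").desc())
data = data.withColumn("row_number_in_B", row_number().over(w))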

Spark find max of date partitioned column

I have a parquet partitioned in the following way:
data
/batch_date=2020-01-20
/batch_date=2020-01-21
/batch_date=2020-01-22
/batch_date=2020-01-23
/batch_date=2020-01-24
Here batch_date which is the partition column is of date type.
I want to read only the data from the latest date partition, but as a consumer I don't know what the latest value is.
I could use a simple group by, something like
df.groupby().agg(max(col('batch_date'))).first()
While this would work, it's very inefficient since it involves a groupBy over the whole dataset.
I want to know if we can query the latest partition in a more efficient way.
Thanks.
The method suggested by @pasha701 would involve loading the entire Spark dataframe with all the batch_date partitions and then finding the max of that. I think the author is asking for a way to directly find the max partition date and load only that.
One way is to use hdfs or s3fs to list the contents of the S3 path, find the max partition from that list, and then load only that partition. That would be more efficient.
Assuming you are using the AWS S3 layout, something like this:
import s3fs

datelist = []
inpath = "s3://bucket_path/data/"

fs = s3fs.S3FileSystem(anon=False)
dirs = fs.ls(inpath)

for path in dirs:
    date = path.split('=')[1]
    datelist.append(date)

maxpart = max(datelist)
df = spark.read.parquet("s3://bucket_path/data/batch_date=" + maxpart)
This does all the work on plain Python lists, without loading any data into Spark until it finds the partition you want to load.
Function "max" can be used without "groupBy":
df.select(max("batch_date"))
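A sketch combining this with the original goal of reading only the latest partition; the path is reused from the earlier answer, and the equality filter on the partition column lets Spark prune down to that single directory:
from pyspark.sql.functions import col, max as max_

base = spark.read.parquet("s3://bucket_path/data/")
# Take the max of the partition column, then filter on it so only the
# latest batch_date directory is actually read.
latest = base.select(max_("batch_date")).first()[0]
latest_df = base.where(col("batch_date") == latest)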
Use show partitions to get all partitions of the table:
show partitions TABLENAME
The output will look like:
pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1
We can get data from a specific partition using a query like:
select * from TABLENAME where pt='2012.07.28.10' and is_complete='1' limit 1;
Additional filters or a group by can be applied on top of it.
This worked for me in PySpark v2.4.3. First extract the partitions (this is for a table partitioned on a single date column; I haven't tried it with more than one partition column):
df_partitions = spark.sql("show partitions database.dataframe")
"show partitions" returns dataframe with single column called 'partition' with values like partitioned_col=2022-10-31. Now we create a 'value' column extracting just the date part as string. This is then converted to date and the max is taken:
date_filter = df_partitions.withColumn('value', to_date(split('partition', '=')[1], 'yyyy-MM-dd')).agg({"value":"max"}).first()[0]
date_filter now contains the maximum partition date and can be used in a where clause when pulling from the same table.
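For example (a sketch, reusing the placeholder table and column names from the answer above):
from pyspark.sql.functions import col

# Read only the rows belonging to the latest partition.
latest_df = spark.table("database.dataframe").where(col("partitioned_col") == date_filter)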

spark dataset : how to get count of occurence of unique values from a column

I'm trying the Spark Dataset APIs to read a CSV file and count occurrences of unique values in a particular field. One approach which I think should work is not behaving as expected. Let me know what I am overlooking. I have posted both the working and the buggy approach below.
// get all records from a column
val professionColumn = data.select("profession")

// breakdown by professions in descending order
// ***** DOES NOT WORK ***** //
val breakdownByProfession = professionColumn.groupBy().count().collect()

// ***** WORKS ***** //
val breakdownByProfessiond = data.groupBy("profession").count().sort("count")

println(s"\n\nbreakdown by profession\n")
breakdownByProfession.show()
Also, please let me know which approach is more efficient; my guess would be the first one (which is why I attempted it in the first place).
Also, what is the best way to save the output of such an operation to a text file using the Dataset APIs?
In the first case, since there are no grouping columns specified, the entire dataset is considered as one group -- this behavior holds even though there is only one column present in the dataset. So, you should always pass the list of columns to groupBy().
Now the two options would be: data.select("profession").groupBy("profession").count vs. data.groupBy("profession").count. In most cases, the performance of these two alternatives will be exactly the same since Spark tries to push projections (i.e., column selection) down the operators as much as possible. So, even in the case of data.groupBy("profession").count, Spark first selects the profession column before it does the grouping. You can verify this by looking at the execution plan -- org.apache.spark.sql.Dataset.explain()
In the groupBy transformation you need to provide the column name, as below:
val breakdownByProfession = professionColumn.groupBy("profession").count()

Fill missing value in Spark dataframe

I'm trying to fill missing values in a Spark dataframe using PySpark, but I can't find a proper way to do it. My task is to fill the missing values of some rows with respect to their previous or following rows. Concretely, I would change the 0.0 value of a row to the value of the previous row, while doing nothing on a non-zero row. I did see the Window function in Spark, but it only supports simple operations like max, min, and mean, which are not suitable for my case. It would be optimal if we could have a user-defined function sliding over the given window.
Does anybody have a good idea ?
Use the Spark window API to access the previous row's data, as in the sketch below. If you work on time series data, see also this package for missing-data imputation.
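A minimal sketch of that window approach, assuming hypothetical column names: "series_id" to partition by, "timestamp" to order by, and "value" holding the 0.0 placeholders:
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("series_id").orderBy("timestamp")

# Replace a 0.0 with the immediately preceding row's value; leave other rows alone.
filled = df.withColumn(
    "value",
    F.when(F.col("value") == 0.0, F.lag("value", 1).over(w)).otherwise(F.col("value")),
)
# For runs of consecutive zeros, F.last("value", ignorenulls=True) over an
# unbounded-preceding frame (after nulling out the zeros) would be needed instead.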
