Efficient way to get the first item satisfying a condition - apache-spark

Using PySpark, I want to get the first element from a column that satisfies a condition. I want this operation to be efficient, so that Spark returns as soon as the first matching element is found.
Currently, I am trying
df.filter(df.seller_id==6).take(1)
It is taking a lot of time, and I suspect something is causing a scan or read of the entire dataset. However, I would expect the filter to be pushed down while reading the data, so that as soon as a row with seller_id equal to 6 is found, its value is returned. The first row in df_sales has seller_id 6, so reading that row alone should be enough.
How can I manage something more efficient than the code I have mentioned?

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col
windowSpec = Window.partitionBy("seller_id")\
.orderBy("how you determine first in partition")
df.withColumn("row_number", row_number().over(windowSpec)).where(col("row_number") == 1)
I think the above code should do what you are looking for. It partitions your data by the seller ID; within each partition you order on a column of your choice in orderBy(), row_number() ranks the rows according to that ordering, and the final where() keeps only the first row of each ordered partition.
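If the concern is whether the filter actually reaches the data source, one quick check is to inspect the physical plan. Below is a minimal sketch; the Parquet path is a placeholder, and df and the seller_id column are taken from the question. In the printed plan, the scan node's PushedFilters entry shows whether the seller_id = 6 predicate is pushed down to the source (this works best with columnar formats such as Parquet or ORC).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path; df and the seller_id column follow the question.
df = spark.read.parquet("/path/to/df_sales")

# If the scan node of the printed plan lists seller_id = 6 under PushedFilters,
# the source itself is skipping non-matching data.
df.filter(df.seller_id == 6).explain()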

Related

speed up pandas search for a certain value not in the whole df

I have a large pandas DataFrame consisting of some 100k rows and ~100 columns with different dtypes and arbitrary content.
I need to assert that it does not contain a certain value, let's say -1.
Using assert(not (any(test1.isin([-1]).sum() > 0))) results in a processing time of several seconds.
Any idea how to speed it up?
Just to make a full answer out of my comment:
With -1 not in test1.values you can check whether -1 occurs anywhere in your DataFrame.
Regarding performance, this still has to check every single value, which in your case is
10^5 * 10^2 = 10^7 comparisons.
All you save is the cost of the per-column summation and the additional comparison of those results.
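Both checks side by side, as a small sketch (the 100k x 100 integer frame below is synthetic, built only to mirror the question's shape; a real frame with mixed dtypes will be somewhat slower since .values falls back to an object array):
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's ~100k x ~100 DataFrame.
test1 = pd.DataFrame(np.random.randint(0, 10, size=(100_000, 100)))

# Original check: per-column isin + sum, then reduce over columns.
assert not any(test1.isin([-1]).sum() > 0)

# Simpler membership test over the underlying ndarray.
assert -1 not in test1.values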

Previous item search in apache spark

I'm quite new to the big data area and I'm trying to solve a problem. I am currently evaluating Spark and would like to check whether this could be achieved with it.
My simplified input data schema:
|TransactionID|CustomerID|Timestamp|
What I'd like to get is, for each transaction ID, the 5 previous transaction IDs of the same customer. So the output data schema would look like:
|TransactionID|1stPrevTID|2ndPrevTID|...|5thPrevTID|
My input data source is around a billion entries.
Here my question would be, is Spark a good candidate for solution or should I consider something else?
This can be done using the lag function.
from pyspark.sql.functions import lag
from pyspark.sql import Window
# Assuming the DataFrame is named df, with the schema shown in the question.
# Window over each customer's transactions, ordered by time.
w = Window.partitionBy(df.CustomerID).orderBy(df.Timestamp)
# t1_prev .. t5_prev hold the 1st .. 5th previous transaction IDs within the window.
df_with_lag = df.withColumn('t1_prev', lag(df.TransactionID, 1).over(w))\
.withColumn('t2_prev', lag(df.TransactionID, 2).over(w))\
.withColumn('t3_prev', lag(df.TransactionID, 3).over(w))\
.withColumn('t4_prev', lag(df.TransactionID, 4).over(w))\
.withColumn('t5_prev', lag(df.TransactionID, 5).over(w))
df_with_lag.show()
Documentation on lag
Window function: returns the value that is offset rows before the current row, and defaultValue if there is less than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.
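A more compact variant of the same idea, as a sketch that assumes the df and window w defined above, builds the five lag columns in a loop:
from pyspark.sql.functions import lag

# Add t1_prev .. t5_prev without writing out the withColumn chain by hand.
df_with_lag = df
for i in range(1, 6):
    df_with_lag = df_with_lag.withColumn(f"t{i}_prev", lag(df.TransactionID, i).over(w))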

Pyspark: filter DataFrame where column value equals some value in a list of Row objects

I have a list of pyspark.sql.Row objects as follows:
[Row(artist=1255340), Row(artist=942), Row(artist=378), Row(artist=1180), Row(artist=813)]
From a DataFrame with schema (id, name) I want to filter rows where id equals some artist in the given list of Rows. What is the correct way to go about it?
To clarify further, I want to do something like: select rows from the dataframe where row.id is in list_of_row_objects.
The main question is how big list_of_row_objects is. If it is small, then the approach in the link provided by @Karthik Ravindra works fine.
If it is big, you can instead build a dataframe_of_row_objects and do an inner join between your dataframe and dataframe_of_row_objects, matching the artist column in dataframe_of_row_objects against the id column in your original dataframe. This effectively removes every id that is not in dataframe_of_row_objects.
Of course a join is slower, but it is more flexible. For lists which are not small but still fit comfortably into memory, you can use the broadcast hint to get better performance, as in the sketch below.
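A minimal sketch of that join, assuming the list of Row(artist=...) objects is named rows and the original DataFrame df has the (id, name) schema from the question:
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Turn the small list of Row(artist=...) objects into a DataFrame.
dataframe_of_row_objects = spark.createDataFrame(rows)

# The inner join keeps only the rows of df whose id appears among the artists;
# broadcast() hints that the small side should be shipped to every executor.
filtered = df.join(
    broadcast(dataframe_of_row_objects),
    df.id == dataframe_of_row_objects.artist,
    "inner",
).drop("artist")
For a genuinely small list, df.filter(df.id.isin([r.artist for r in rows])) avoids the join entirely.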

spark dataset : how to get count of occurrence of unique values from a column

I am trying the Spark Dataset API, reading a CSV file and counting the occurrences of unique values in a particular field. One approach which I think should work is not behaving as expected. Let me know what I am overlooking. I have posted both the working and the buggy approach below.
// get all records from a column
val professionColumn = data.select("profession")
// breakdown by professions in descending order
// ***** DOES NOT WORKS ***** //
val breakdownByProfession = professionColumn.groupBy().count().collect()
// ***** WORKS ***** //
val breakdownByProfessiond = data.groupBy("profession").count().sort("count") // WORKS
println ( s"\n\nbreakdown by profession \n")
breakdownByProfession.show()
Also, please let me know which approach is more efficient. My guess would be the first one (which is the reason I attempted it in the first place).
Also, what is the best way to save the output of such an operation to a text file using the Dataset API?
In the first case, since there are no grouping columns specified, the entire dataset is considered as one group -- this behavior holds even though there is only one column present in the dataset. So, you should always pass the list of columns to groupBy().
Now the two options would be: data.select("profession").groupBy("profession").count vs. data.groupBy("profession").count. In most cases, the performance of these two alternatives will be exactly the same since Spark tries to push projections (i.e., column selection) down the operators as much as possible. So, even in the case of data.groupBy("profession").count, Spark first selects the profession column before it does the grouping. You can verify this by looking at the execution plan -- org.apache.spark.sql.Dataset.explain()
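To see that projection pushdown in action, comparing the two plans is enough. The snippet below is a PySpark sketch (the Scala Dataset API exposes the same explain() method); the CSV path and header option are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder input; adjust the path and options to your file.
data = spark.read.option("header", "true").csv("/path/to/input.csv")

# Both plans should show a file scan whose ReadSchema contains only 'profession'.
data.select("profession").groupBy("profession").count().explain()
data.groupBy("profession").count().explain()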
In the groupBy transformation you need to provide the column name, as below:
val breakdownByProfession = professionColumn.groupBy("profession").count()

SQL dataframe first and last not returning "real" first and last values

I tried using the Apache Spark SQL DataFrame aggregate functions "first" and "last" on a large file, with a Spark master and 2 workers. When I do the "first" and "last" operations I am expecting back the last column from the file; but it looks like Spark is returning the "first" or "last" from the worker partitions instead.
Is there any way to get the "real" first and last values in aggregate functions?
Thanks,
Yes, it is possible, depending on what you mean by the "real" first and last values. For example, if you are dealing with timestamped data and the "real" first value refers to the oldest record, just orderBy the data by time and take the first value.
When you say "When I do the "first" and "last" operations I am expecting back the last column from the file", I understand that you are in fact referring to the first/last row of data from the file. Please correct me if I mistook this.
Thanks.
Edit :
You can read the file in a single partition (by setting numPartitions = 1), then zipWithIndex, and finally parallelize the resulting collection. This way you get a column to order on, and you don't change the source file either. A sketch of this idea is shown below.
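A rough PySpark sketch along those lines; the input path is a placeholder, and the index column exists purely so that "first" and "last" have a well-defined order:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder input path.
lines = spark.sparkContext.textFile("/path/to/input.txt")

# zipWithIndex numbers elements by partition index and position within the
# partition, which for a text file follows the file's line order.
indexed = lines.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
df = spark.createDataFrame(indexed, ["line_no", "line"])

# Ordering by the index recovers the file's real first and last lines.
first_row = df.orderBy(F.col("line_no").asc()).first()
last_row = df.orderBy(F.col("line_no").desc()).first()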
