PySpark: filter DataFrame where column value equals some value in a list of Row objects - apache-spark

I have a list of pyspark.sql.Row objects as follows:
[Row(artist=1255340), Row(artist=942), Row(artist=378), Row(artist=1180), Row(artist=813)]
From a DataFrame with schema (id, name), I want to keep only the rows where id equals some artist in the given list of Rows. What is the correct way to go about it?
To clarify further, I want to do something like: select row from dataframe where row.id is in list_of_row_objects

The main question is how big list_of_row_objects is. If it is small, then the approach in the link provided by @Karthik Ravindra will do.
If it is big, you can instead use a dataframe_of_row_objects: do an inner join between your dataframe and dataframe_of_row_objects, matching the artist column of dataframe_of_row_objects against the id column of your original dataframe. This effectively removes any id that is not in dataframe_of_row_objects.
Of course using a join is slower, but it is more flexible. For lists that are not small yet still small enough to fit into memory, you can use the broadcast hint to get better performance. A rough sketch of both routes follows.
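As a minimal sketch (the sample data is made up, and an existing SparkSession is assumed):

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.getOrCreate()

# Hypothetical df and list, standing in for the ones in the question
df = spark.createDataFrame([(942, "a"), (378, "b"), (999, "c")], ["id", "name"])
list_of_row_objects = [Row(artist=1255340), Row(artist=942), Row(artist=378)]

# Small list: pull the plain values out of the Rows and use isin()
artist_ids = [r.artist for r in list_of_row_objects]
small_result = df.filter(col("id").isin(artist_ids))

# Large list: turn the Rows into a DataFrame and do a broadcast inner join
dataframe_of_row_objects = spark.createDataFrame(list_of_row_objects)
big_result = df.join(
    broadcast(dataframe_of_row_objects),
    df.id == dataframe_of_row_objects.artist,
    "inner",
).drop("artist")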

Related

Efficient way to get the first item satisfying a condition

Using PySpark, I want to get the first element from a column that satisfies a condition, and I want the operation to be efficient, so that the element is returned as soon as the condition is first satisfied.
Currently, I am trying
df.filter(df.seller_id==6).take(1)
It is taking a lot of time, and I think something is causing a scan or a read of the entire data. However, I would expect the filter to be pushed down while reading the data, so that the value is returned the first moment seller_id is 6. The first row in df_sales has seller_id 6, so reading that row should be enough.
How can I manage something more efficient than the code I have mentioned?
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

windowSpec = Window.partitionBy("seller_id") \
    .orderBy("how you determine first in partition")  # replace with your ordering column
df.withColumn("row_number", row_number().over(windowSpec)).where(col("row_number") == 1)
I think the above code should do what you are looking for. It partitions your data by seller ID; within each partition you choose an ordering column in orderBy(), row_number() ranks the rows by that ordering, and the final where() keeps only the first row of each ordered partition.

How to Calculate Unique values using 2 columns in a PySpark dataframe

I have the dataframe below; I need to group by user_id and calculate the unique categories corresponding to each user_id.
[The input dataframe and the expected output were posted as images in the original question.]
If you need to get the distinct categories for each user, one way is to use a simple distinct(). This will give you each combination of the user_id and the category columns:
df.select("user_id", "category").distinct()
Another approach is to use collect_set() as an aggregation function. It automatically gets rid of duplicates, and you end up with one record per user_id, with all of that user's categories collected into an array column.
from pyspark.sql.functions import collect_set
df.groupBy("user_id").agg(collect_set("category")).display()
The first approach looks like the one that you need.
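For example, with some made-up data (the real input and output were only shown as images, so the user_id/category values here are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_set

spark = SparkSession.builder.getOrCreate()

# Hypothetical input standing in for the posted image
df = spark.createDataFrame(
    [(1, "books"), (1, "books"), (1, "music"), (2, "music")],
    ["user_id", "category"],
)

# Approach 1: one row per distinct (user_id, category) combination
df.select("user_id", "category").distinct().show()

# Approach 2: one row per user_id, categories collected into an array
df.groupBy("user_id").agg(collect_set("category").alias("categories")).show()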

How to get a list of columns that will give me a unique record in Pyspark Dataframe

My intention is to write a Python function that takes a PySpark DataFrame as input and whose output is a list of columns (possibly multiple lists) that give a unique record when combined together.
So, if you take a set of values for the columns in the list, you would always get just 1 record from the DataFrame.
Example:
Input Dataframe
Name Role id
--------------------
Tony Dev 130
Stark Qa 131
Steve Prod 132
Roger Dev 133
--------------------
Output:
Name,Role
Name,id
Name,id,Role
Why is the output what it is?
For any Name,Role combination I will always get just 1 record
And, for any Name, id combination I will always get just 1 record.
There are ways to define a function that will do exactly what you are asking for.
I will only show one possibility, and it is a very naive solution: iterate through all combinations of columns and check whether each combination forms a unique entry in the table:
import itertools as it

def find_all_unique_columns_naive(df):
    cols = df.columns
    res = []
    for num_of_cols in range(1, len(cols) + 1):
        for comb in it.combinations(cols, num_of_cols):
            num_of_nonunique = df.groupBy(*comb).count().where("count > 1").count()
            if not num_of_nonunique:
                res.append(comb)
    return res
With a result for your example being:
[('Name',), ('id',), ('Name', 'Role'), ('Name', 'id'), ('Role', 'id'),
('Name', 'Role', 'id')]
There is an obvious performance issue, since the runtime of this function grows exponentially with the number of columns, i.e. O(2^N), meaning that even a table with just 20 columns already takes quite a long time.
There are, however, some obvious ways to speed this up. For example, if you already know that column Name is unique, then any combination containing it must be unique as well, so you can deduce that (Name, Role), (Name, id) and (Name, Role, id) are unique without checking them against the data, which prunes the search space quite effectively; a sketch of this pruning follows below. The worst-case scenario remains the same, though: if the table has no unique combination of columns, you still have to exhaust the entire search space to reach that conclusion.
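To make that pruning concrete, a minimal sketch (the same naive idea, just skipping the Spark aggregation for any combination that already contains a known unique combination):

import itertools as it

def find_all_unique_columns_pruned(df):
    cols = df.columns
    res = []
    for num_of_cols in range(1, len(cols) + 1):
        for comb in it.combinations(cols, num_of_cols):
            # Any superset of a known unique combination is unique by
            # implication, so record it without running a Spark job.
            if any(set(unique).issubset(comb) for unique in res):
                res.append(comb)
                continue
            num_of_nonunique = df.groupBy(*comb).count().where("count > 1").count()
            if not num_of_nonunique:
                res.append(comb)
    return res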
As a conclusion, I'd suggest you think about why you want this function in the first place. There may be specific use cases for small tables, just to save some time, but to be completely honest, this is not how a table should be treated. If a table exists, there should be a purpose behind it and a proper table design, i.e. a clear idea of how the data inside it is structured and updated, and that should be the starting point when looking for unique identifiers. Even if this method finds other unique identifiers now, it may well be that the next update to the table destroys them. I'd much rather suggest using the table's metadata and documentation, because then you can be sure you are treating the table the way it was designed, and, when the table has a lot of columns, it is also faster.

Groupby Strings

I'm trying to create a new dataframe from an existing dataframe. I tried groupby, but it didn't seem to sum the strings up into a whole number; instead it returned many strings (colleges in this case).
[The original dataframe was posted as an image.]
I tried groupby to get the number of colleges (a whole number), but it returned many college names (strings) instead.
How do I return the number of colleges as an integer in the new column 'totalPlayer'? Please help.
Hoping I understand your question correctly: you need to count the distinct values in the college column. Assuming df is the name of your dataframe, the code below will help.
df['college'].nunique()
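For example, with a made-up frame (the original data was only posted as an image):

import pandas as pd

# Hypothetical data standing in for the posted image
df = pd.DataFrame({"college": ["Duke", "Duke", "Kentucky", "UCLA"]})

total_colleges = df["college"].nunique()
print(total_colleges)  # 3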
Helpful links:
Counting unique values in a column in pandas dataframe like in Qlik?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.nunique.html

PySpark - A more efficient method to count common elements

I have two dataframes, say dfA and dfB.
I want to take their intersection and then count the number of unique user_ids in that intersection.
I've tried the following, which is very slow and crashes a lot:
dfA.join(broadcast(dfB), ['user_id'], how='inner').select('user_id').dropDuplicates().count()
I need to run many such lines, in order to get a plot.
How can I perform such a query in an efficient way?
As described in the question, the only relevant part of each dataframe is the user_id column (in your question you describe that you join on user_id and afterwards use only the user_id field).
The source of the performance problem is joining two big dataframes when you need only the distinct values of one column in each dataframe.
In order to improve the performance I'd do the following:
Create two small DFs that hold only the user_id column of each dataframe
This dramatically reduces the size of each dataframe, as each will hold only one column (the only relevant one).
dfAuserid = dfA.select("user_id")
dfBuserid = dfB.select("user_id")
Get the distinct values of each dataframe (note: distinct() is equivalent to dropDuplicates())
This dramatically reduces the size of each dataframe, as each new dataframe will hold only the distinct values of the user_id column.
dfAuseridDist = dfA.select("user_id").distinct()
dfBuseridDist = dfB.select("user_id").distinct()
Perform the join on the two minimal dataframes above in order to get the unique values in the intersection
I think you can select the necessary columns first and perform the join afterwards. It should also be beneficial to move the dropDuplicates before the join, since then you get rid of user_ids that appear multiple times in one of the dataframes.
The resulting query could look like this (note that broadcast needs to be imported from pyspark.sql.functions):
from pyspark.sql.functions import broadcast

dfA.select("user_id").join(broadcast(dfB.select("user_id")), ['user_id'], how='inner') \
    .select('user_id').dropDuplicates().count()
OR:
dfA.select("user_id").dropDuplicates(["user_id"]).join(broadcast(dfB.select("user_id") \
    .dropDuplicates(["user_id"])), ['user_id'], how='inner').select('user_id').count()
OR the version with distinct should work as well:
dfA.select("user_id").distinct().join(broadcast(dfB.select("user_id").distinct()), \
    ['user_id'], how='inner').select('user_id').count()
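Put together as a small self-contained sketch (the sample data is made up, just to show the shape of the query):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Made-up data standing in for dfA and dfB
dfA = spark.createDataFrame([(1,), (2,), (2,), (3,)], ["user_id"])
dfB = spark.createDataFrame([(2,), (3,), (3,), (4,)], ["user_id"])

common_users = (
    dfA.select("user_id").distinct()
    .join(broadcast(dfB.select("user_id").distinct()), ["user_id"], "inner")
    .count()
)
print(common_users)  # user_ids 2 and 3 are common, so this prints 2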
