How to calculate unique values using 2 columns in a PySpark dataframe - apache-spark

I have the dataframe below; I need to group by all user_id values and calculate the unique categories corresponding to each user_id.
[The input dataframe and the expected output were shown as images in the original post.]

If you need to get the distinct categories for each user, one way is to use a simple distinct(). This will give you each combination of the user_id and the category columns:
df.select("user_id", "category").distinct()
Another approach is to use collect_set() as an aggregation function. It will automatically get rid of the duplicates. This way there will be one record for each user_id, with all the categories aggregated into an array column.
from pyspark.sql.functions import collect_set
df.groupBy("user_id").agg(collect_set("category")).display()
The first approach looks like the one that you need.
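If what's wanted is the number of unique categories per user rather than the categories themselves (a guess, since the expected output was only shown as an image), a countDistinct sketch could look like this:
from pyspark.sql.functions import countDistinct
# assumes df has user_id and category columns, as in the question
df.groupBy("user_id").agg(countDistinct("category").alias("unique_categories")).show()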

Related

Groupby Strings

I'm trying to create a new dataframe from an existing dataframe. I tried groupby, but it didn't sum the strings up into a whole number; instead it returned many strings (colleges, in this case).
[The original dataframe was shown as an image in the original post.]
I tried groupby to get the number of colleges as a whole number, but it returned many college strings instead.
How do I return the number of colleges as an integer in the new column 'totalPlayer'? Please help.
Hoping I understand your question correctly.
You need to count the distinct values in the college column.
Assuming df is the name of your data frame, the code below will help.
df['college'].nunique()
Helpful links:
Counting unique values in a column in pandas dataframe like in Qlik?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.nunique.html
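A small runnable sketch of that, using hypothetical data and putting the integer into the 'totalPlayer' column the question mentions:
import pandas as pd
# hypothetical data; 'college' is the column named in the question
df = pd.DataFrame({'college': ['Duke', 'Duke', 'Kentucky', 'Kansas']})
df['totalPlayer'] = df['college'].nunique()  # the distinct-college count (3), broadcast to every row
print(df)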

PySpark - A more efficient method to count common elements

I have two dataframes, say dfA and dfB.
I want to take their intersection and then count the number of unique user_ids in that intersection.
I've tried the following, which is very slow and crashes a lot:
dfA.join(broadcast(dfB), ['user_id'], how='inner').select('user_id').dropDuplicates().count()
I need to run many such lines in order to get a plot.
How can I perform such a query in an efficient way?
As described in the question, the only relevant part of the dataframes is the user_id column (in your question you describe that you join on user_id and afterwards use only the user_id field).
The source of the performance problem is joining two big dataframes when you need only the distinct values of one column in each dataframe.
In order to improve the performance I'd do the following:
Create two small DFs which will hold only the user_id column of each dataframe
This will dramatically reduce the size of each dataframe, as it will hold only one column (the only relevant one)
dfAuserid = dfA.select("user_id")
dfBuserid = dfB.select("user_id")
Get the distinct values of each dataframe (note: distinct() is equivalent to dropDuplicates())
This will dramatically reduce the size of each dataframe, as each new dataframe will hold only the distinct values of the user_id column.
dfAuseridDist = dfAuserid.distinct()
dfBuseridDist = dfBuserid.distinct()
Perform the join on the above two minimalist dataframes in order to get the unique values in the intersection
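Putting the three steps together (and keeping the broadcast hint from the question for the smaller side), a sketch of the final count could look like:
from pyspark.sql.functions import broadcast
# join the two deduplicated single-column dataframes and count the intersection
intersect_count = dfAuseridDist.join(broadcast(dfBuseridDist), ['user_id'], how='inner').count()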
I think you can select the necessary columns first and perform the join afterwards. It should also be beneficial to move the dropDuplicates before the join, since then you get rid of user_ids that appear multiple times in one of the dataframes.
The resulting query could look like:
dfA.select("user_id").join(broadcast(dfB.select("user_id")), ['user_id'], how='inner')\
.select('user_id').dropDuplicates().count()
OR:
dfA.select("user_id").dropDuplicates(["user_id",]).join(broadcast(dfB.select("user_id")\
.dropDuplicates(["user_id",])), ['user_id'], how='inner').select('user_id').count()
OR the version with distinct should work as well.
dfA.select("user_id").distinct().join(broadcast(dfB.select("user_id").distinct()),\
['user_id'], how='inner').select('user_id').count()

How to automatically index DataFrame created from groupby in Pandas

Using the Kiva Loan_Data from Kaggle, I aggregated the loan amounts by country. Pandas allows this to be easily turned into a DataFrame, but it indexes on the country data. reset_index can be used to create a numerical/sequential index, but I'm guessing I am adding an unnecessary step. Is there a way to create an automatic default index when creating a DataFrame like this?
Use as_index=False (see the pandas documentation on groupby and the split-apply-combine pattern).
df.groupby('country', as_index=False)['loan_amount'].sum()
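For comparison, the reset_index route mentioned in the question produces the same result, just as an extra step; a small sketch with hypothetical Kiva-style data:
import pandas as pd
# hypothetical loan data, mirroring the column names in the question
df = pd.DataFrame({'country': ['Kenya', 'Kenya', 'Peru'], 'loan_amount': [100, 250, 400]})
print(df.groupby('country', as_index=False)['loan_amount'].sum())  # as_index=False, as above
print(df.groupby('country')['loan_amount'].sum().reset_index())    # equivalent via reset_index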

Pyspark: filter DataFrame where column value equals some value in list of Row objects

I have a list of pyspark.sql.Row objects as follows:
[Row(artist=1255340), Row(artist=942), Row(artist=378), Row(artist=1180), Row(artist=813)]
From a DataFrame having schema (id, name), I want to filter out rows where id equals some artist in the given list of Rows. What will be the correct way to go about it?
To clarify further, I want to do something like: select row from dataframe where row.id is in list_of_row_objects
The main question is how big list_of_row_objects is. If it is small, then the approach in the link provided by @Karthik Ravindra should work.
If it is big, then you can instead use a dataframe_of_row_objects: do an inner join between your dataframe and dataframe_of_row_objects, matching the artist column in dataframe_of_row_objects against the id column in your original dataframe. This would basically remove any id not in dataframe_of_row_objects.
Of course using a join is slower but it is more flexible. For lists which are not small but are still small enough to fit into memory you can use the broadcast hint to still get better performance.
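A minimal sketch of that join-based filter, assuming df is the (id, name) DataFrame from the question:
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import broadcast
spark = SparkSession.builder.getOrCreate()
rows = [Row(artist=1255340), Row(artist=942), Row(artist=378), Row(artist=1180), Row(artist=813)]
dataframe_of_row_objects = spark.createDataFrame(rows)
# keep only the rows of df whose id appears among the artist values
filtered = df.join(broadcast(dataframe_of_row_objects), df.id == dataframe_of_row_objects.artist, "inner").select("id", "name")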

How to cube on two columns as if they were one?

I have the following attributes on which I am interested in performing aggregations (a regular count, for example):
'category', 'sub-category', age, city, education... (around 10 more)
I am interested in all possible combinations of the attributes in a group by, so the DataFrame cube function could help me achieve that.
But here is the catch: sub-category does not make any sense without category, so in order to achieve this I need to combine rollup(category, sub-category) with cube(age, city, education...).
How to do this?
This is what I tried, where test is the name of my table:
val data = sqlContext.sql("select category,'sub-category',age from test group by cube(rollup(category,'sub-category'), age )")
and this is the error I get:
org.apache.spark.sql.AnalysisException: expression 'test.category' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
I think what you want is the struct or expr function to combine the two columns into one and use that to cube on.
With struct it'd be as follows:
df.rollup(struct("category", "sub-category") as "(cat,sub)")
With expr it's as simple as using "pure" SQL, i.e.
df.rollup(expr("(category, 'sub-category')") as "(cat,sub)")
But I'm just guessing...
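For reference, a rough PySpark translation of the struct idea above (just the rollup part, and still a guess, like the answer itself):
from pyspark.sql.functions import struct, count
# roll up on the combined (category, sub-category) column and count each group
result = df.rollup(struct("category", "sub-category").alias("(cat,sub)")).agg(count("*").alias("cnt"))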
