PySpark - A more efficient method to count common elements - apache-spark

I have two dataframes, say dfA and dfB.
I want to take their intersection and then count the number of unique user_ids in that intersection.
I've tried the following, which is very slow and crashes a lot:
dfA.join(broadcast(dfB), ['user_id'], how='inner').select('user_id').dropDuplicates().count()
I need to run many such lines in order to get a plot.
How can I perform such a query in an efficient way?

As described in the question, the only relevant part of each dataframe is the user_id column (you describe that you join on user_id and afterwards use only the user_id field).
The source of the performance problem is joining two big dataframes when you need only the distinct values of one column in each dataframe.
In order to improve the performance I'd do the following:
Create two small DFs which hold only the user_id column of each dataframe.
This will dramatically reduce the size of each dataframe, as it will hold only one column (the only relevant one):
dfAuserid = dfA.select("user_id")
dfBuserid = dfB.select("user_id")
Get the distinct values of each dataframe (note: distinct() is equivalent to dropDuplicates()).
This will dramatically reduce the size of each dataframe, as each new dataframe will hold only the distinct values of the user_id column:
dfAuseridDist = dfA.select("user_id").distinct()
dfBuseridDist = dfB.select("user_id").distinct()
Perform the join on the two minimalist dataframes above in order to get the unique values in the intersection.
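A minimal sketch of that final step, building on the dfAuseridDist and dfBuseridDist dataframes above (the broadcast hint is optional and assumes the distinct user_ids of dfB are small enough to broadcast):

from pyspark.sql.functions import broadcast

common_count = dfAuseridDist.join(broadcast(dfBuseridDist), ['user_id'], how='inner').count()

Since both sides are already distinct, the join result contains each user_id at most once, so no dropDuplicates is needed before the count.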

I think you can select the necessary columns first and perform the join afterwards. It should also be beneficial to move the dropDuplicates before the join, since then you get rid of user_ids that appear multiple times in one of the dataframes.
The resulting query could look like:
dfA.select("user_id").join(broadcast(dfB.select("user_id")), ['user_id'], how='inner')\
.select('user_id').dropDuplicates().count()
OR:
dfA.select("user_id").dropDuplicates(["user_id",]).join(broadcast(dfB.select("user_id")\
.dropDuplicates(["user_id",])), ['user_id'], how='inner').select('user_id').count()
OR the version with distinct should work as well.
dfA.select("user_id").distinct().join(broadcast(dfB.select("user_id").distinct()),\
['user_id'], how='inner').select('user_id').count()

Related

How to check Spark DataFrame difference?

I need to check my solution for idempotency and check how different it is from the previous solution.
I tried the following:
spark.sql('''
select * from t1
except
select * from t2
''').count()
This gives me information about how different the tables are (t1 is my solution, t2 is the original data). If there is a lot of differing data, I want to check where it differs.
So I tried this:
diff = {}
columns = t1.columns
for col in columns:
    cntr = spark.sql(f'''
        select {col} from t1
        except
        select {col} from t2
    ''').count()
    diff[col] = cntr
print(diff)
This isn't good for me, because it takes about 1-2 hours (both tables have 30 columns and 30 million rows of data).
Do you guys have an idea how to calculate this quickly?
Except is a kind of join on all columns at the same time. Does your data have a primary key? It could even be composite, comprising multiple columns, but it's still much better than taking all 30 columns into account.
Once you figure out the primary key you can do the FULL OUTER JOIN and:
check NULLs on the left
check NULLs on the right
check other columns of matching rows (it's much cheaper to compare the values after the join)
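A minimal PySpark sketch of that approach, assuming a single primary-key column named id (the key name id and the column col_a are illustrative):

from pyspark.sql import functions as F

t1 = spark.table("t1")
t2 = spark.table("t2")

cond = F.col("a.id") == F.col("b.id")
joined = t1.alias("a").join(t2.alias("b"), cond, "full_outer")

only_in_t1 = joined.filter(F.col("b.id").isNull()).count()   # NULLs on the right: rows only in t1
only_in_t2 = joined.filter(F.col("a.id").isNull()).count()   # NULLs on the left: rows only in t2
col_a_diffs = joined.filter(F.col("a.col_a") != F.col("b.col_a")).count()   # differing values in matching rows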
Given that your resources remain unchanged, I think there are three ways you can optimize:
Join the two dataframes once instead of looping over except: I assume your datasets have a key / index, otherwise there is no ordering in either dataframe and you can't use except to check the difference. Unless you have very limited resources, just do the join once to combine the two dataframes instead of running multiple excepts.
Check your data partitioning: Even if you use point 1 or the method you're currently using, make sure the data partitions are evenly distributed with an optimal number of partitions. Most of the time, data skew is one of the critical factors that lowers performance. If your key is a string, use repartition. If you're using a sequence number, use repartitionByRange.
Use the when-otherwise pair to check the differences: once you have joined the two dataframes, you can use a when-otherwise condition to compare the columns, for example: df.select(func.sum(func.when(func.col('df1.col_a') != func.col('df2.col_a'), func.lit(1)).otherwise(func.lit(0))).alias('diff_in_col_a_count')). This way you can calculate all the differences within one action instead of multiple actions.
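To make this concrete, a minimal sketch that counts the differences for every shared column in a single aggregation, assuming the two tables have already been joined on an id key (the key column and the df1/df2 aliases are illustrative):

from pyspark.sql import functions as func

df1 = spark.table("t1").alias("df1")
df2 = spark.table("t2").alias("df2")
joined = df1.join(df2, func.col("df1.id") == func.col("df2.id"), "inner")

# One sum(when(...)) per column, computed in a single pass over the joined data
diff_counts = joined.agg(*[
    func.sum(
        func.when(func.col("df1." + c) != func.col("df2." + c), func.lit(1)).otherwise(func.lit(0))
    ).alias("diff_in_" + c + "_count")
    for c in df1.columns if c != "id"
])
diff_counts.show()

Note that != treats a null-vs-value pair as "no difference"; if that matters, a null-safe comparison such as eqNullSafe can be used instead.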

Order of rows shown changes on selection of columns from dependent pyspark dataframe

Why does the order of rows displayed differ when I take a subset of the dataframe columns to display via show?
Here is the original dataframe:
The dates are in the given order, as you can see via show.
Now the order of rows displayed via show changes when I select a subset of predict_df's columns into a new dataframe.
Because a Spark dataframe itself is unordered. This is due to the parallel processing principles which Spark uses. Different records may be located in different files (and on different nodes), and different executors may read the data at different times and in a different sequence.
So you have to explicitly specify the order in a Spark action using the orderBy (or sort) method. E.g.:
df.orderBy('date').show()
In this case the result will be ordered by the date column and will be more predictable. But if many records have an equal date value, then the records within each such date subset will still be unordered. So in this case, in order to obtain strongly ordered data, we have to perform orderBy on a set of columns, and the combined values of that set of columns must be unique across rows. E.g.:
df.orderBy(col("date").asc(), col("other_column").desc())
In general, unordered datasets are the normal case for data processing systems. Even "traditional" DBMSs like PostgreSQL or MS SQL Server generally return unordered records, and we have to explicitly use an ORDER BY clause in the SELECT statement. And even if we sometimes see the same results for one query, the DBMS does not guarantee that another execution will return the same result, especially when reading a large amount of data.
The situation occurs because show is an action that is called twice.
As no .cache is applied, the whole cycle starts again from the beginning. Moreover, I tried this a few times and got the same order, not the differing order the questioner observed; processing is non-deterministic.
As soon as I used .cache, I always got the same result.
This means that ordering is preserved over a narrow transformation on a dataframe if caching has been applied; otherwise the second action will invoke processing from the start again, and the basics apply here as well. The bottom line is: always do the ordering explicitly, if it matters.
As #Ihor Konovalenko and #mck mentioned, a Spark dataframe is unordered by nature. Also, it looks like your dataframe doesn't have a reliable key to order by, so one solution is to use monotonically_increasing_id (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html) to create an id, which will keep your dataframe ordered. However, if your dataframe is big, be aware this function might take some time to generate an id for each row.
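A minimal sketch of that suggestion, applied to predict_df from the question (the row_id column name is illustrative, and the cache is added so the generated ids are not recomputed between actions):

from pyspark.sql import functions as F

predict_df_with_id = predict_df.withColumn("row_id", F.monotonically_increasing_id()).cache()
predict_df_with_id.orderBy("row_id").show()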

Spark: sorting and assigning ids to dataset which has no unique id

I have a spark dataset which I get from some Hive table via SQL:
Dataset<Row> dataset = session.sql("select * from mytable order by myDate");
I need to assign unique increasing but not necessary sequential ids to the rows of my dataset sorted by myDate field i.e. assign ids like
1,4,6,7,8,9,16 etc
First thing I tried was row_number() function.
Dataset<Row> dataset = session.sql("select *,row_number() over () as rn from mytable order by myDate");
But I failed because myDate is not a unique key (and there is no unique key in my dataset!), and I faced a very interesting bug. It turned out that each time I modify my dataset...
dataset.drop("redundantColumn");
dataset.join(..with something..);
dataset.select("rn","myDate");
... the dataset is recalculated and thus the sequence of row number assignments is different! In other words, because my SQL query is non-deterministic, each time I do something with the dataset generated by that query, I get a different order of rows and thus different row-to-row_number matches.
Questions:
(1) is it possible to force Spark not to recalculate my dataset each time I do something with it? Indeed, why can't I drop columns or join the dataset without re-running the initial query?
(2) any other solutions to this problem? It looks like the only option is to combine the monotonically_increasing_id function with row_number, which looks quite verbose (see the sketch after this question):
get dataset
add column with monotonically_increasing_id
create temp view from dataset
add row_number while selecting from that temp view
And again, I am not sure Spark will not reassign monotonically_increasing_id during some operations on my dataset, like joins and adding columns.
(3) instead of option 2, is there any way to assign monotonically_increasing_id aligned with some sorted column?
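For reference, a minimal PySpark sketch of option (2) as described above (the question itself uses the Java API; the temp-view step is replaced by the equivalent DataFrame calls, the cache is an assumption to keep the generated ids stable, and the mid column name is illustrative):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

dataset = spark.sql("select * from mytable")

# Freeze a per-row id first, then number rows by myDate with mid as a tie-breaker
with_mid = dataset.withColumn("mid", F.monotonically_increasing_id()).cache()
w = Window.orderBy("myDate", "mid")
with_rn = with_mid.withColumn("rn", F.row_number().over(w))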

Spark SQL - group by after repartitioning

I want to group all items in a source based on a specified, pre-defined category. The number of items per category could be in the order of millions. The groupBy helps me achieve this, but I want to understand whether repartitioning on the product-type before grouping would be more efficient.
The source for the Spark jobs is Hive tables. The Spark version is the latest, 2.4.4. The problem statement for me is that I want to run a customised similarity algorithm for every item against every other item in a given category. So, by the end of this operation, for every item, I would have the 10 most similar items to it.
Since this involves a groupBy operation, and since groupBy involves shuffling of data, I thought I would first repartition the data based on the category. I can even set the number of partitions to the number of categories that I have (on the order of hundreds).
Once the data is repartitioned and sent to individual workers, running groupBy should be a local operation, if I group by the same column. Is this assumption correct?
// For demo, I am reading from CSV. The final source is a hive table
Dataset<Row> rows = spark.read().option("sep", "\t")
.csv("<some path>")
.repartition(20, new Column("category"))
.cache();
Dataset<Row> ids_grouped_by_category = rows.map((MapFunction<Row, Row>) items -> {
        // Some transformation returns a row in the format I need.
        return new-row;   // placeholder for the transformed row
    }, <encoder>)
    .groupBy(functions.col("category"))
    .agg(functions.collect_list("category").as("ids"));
At the end of this operation, I have been able to group all item-ids for a given category into a list. Something like this:
+---------------------------+------------------------------------------+
|category | ids |
+---------------------------+------------------------------------------+
|category-1 | [id1, id2...] |
|category-2 | [idx, idy...] |
+---------------------------+------------------------------------------+
I have been able to get the data in the format I need, but I wanted to understand: is this way of doing a group-by correct?
Also, what are the implications of the collect_list operation? Does it load everything in memory?

Pyspark: filter DataFrame where column value equals some value in a list of Row objects

I have a list of pyspark.sql.Row objects as follows:
[Row(artist=1255340), Row(artist=942), Row(artist=378), Row(artist=1180), Row(artist=813)]
From a DataFrame having schema (id, name), I want to filter out rows where id equals some artist in the given list of Rows. What would be the correct way to go about it?
To clarify further, I want to do something like: select row from dataframe where row.id is in list_of_row_objects
The main question is how big list_of_row_objects is. If it is small, then the link provided by #Karthik Ravindra will work.
If it is big, then you can instead use a dataframe_of_row_objects: do an inner join between your dataframe and dataframe_of_row_objects, on the artist column in dataframe_of_row_objects and the id column in your original dataframe. This would basically remove any id not in dataframe_of_row_objects.
Of course using a join is slower but it is more flexible. For lists which are not small but are still small enough to fit into memory you can use the broadcast hint to still get better performance.
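A minimal sketch of both variants (df stands for the (id, name) dataframe from the question; the isin filter is one common way to handle a small list, and the join uses the broadcast hint mentioned above):

from pyspark.sql import functions as F

artist_ids = [row.artist for row in list_of_row_objects]

# Small list: filter directly against the collected ids
small_result = df.filter(F.col("id").isin(artist_ids))

# Bigger list: build a dataframe from the rows and join
dataframe_of_row_objects = spark.createDataFrame(list_of_row_objects)
big_result = df.join(
    F.broadcast(dataframe_of_row_objects),
    df["id"] == dataframe_of_row_objects["artist"],
    "inner",
)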
