How to use monotonically_increasing_id to join two pyspark dataframes having no common column? - apache-spark

I have two PySpark dataframes with the same number of rows, but they don't have any common column. So I am adding a new column to both of them using monotonically_increasing_id():
from pyspark.sql.functions import monotonically_increasing_id as mi
id=mi()
df1 = df1.withColumn("match_id", id)
cont_data = cont_data.withColumn("match_id", id)
cont_data = cont_data.join(df1,df1.match_id==cont_data.match_id, 'inner').drop(df1.match_id)
But after the join, the resulting data frame has fewer rows.
What am I missing here? Thanks

You just don't. This is not an applicable use case for monotonically_increasing_id, which is by definition non-deterministic. Instead (a sketch follows this list):
convert to RDD
zipWithIndex
convert back to DataFrame
join
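A minimal sketch of that approach, assuming both DataFrames really do have the same number of rows (here spark is assumed to be the active SparkSession, and df1 / cont_data are the DataFrames from the question):
from pyspark.sql.types import StructField, StructType, LongType
def add_index(df, colname="match_id"):
    # zipWithIndex assigns consecutive, deterministic indices to the rows
    schema = StructType(df.schema.fields + [StructField(colname, LongType(), False)])
    indexed = df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1],))
    return spark.createDataFrame(indexed, schema)
df1_indexed = add_index(df1)
cont_indexed = add_index(cont_data)
cont_data_joined = cont_indexed.join(df1_indexed, "match_id", "inner")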

You can generate the ids with monotonically_increasing_id, save the result to disk, read it back in, and then do whatever joining you need. I would only suggest this approach if you just need to generate the ids once. At that point they can be used for joining, but for the reasons mentioned above this is hacky and not a good solution for anything that runs regularly.
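A rough sketch of that workaround (the parquet paths and the spark session variable are placeholders; the write happens once, and later runs only read the materialized ids back):
from pyspark.sql.functions import monotonically_increasing_id as mi
# one-off job: freeze the generated ids by writing them out
df1.withColumn("match_id", mi()).write.parquet("/tmp/df1_with_id")
cont_data.withColumn("match_id", mi()).write.parquet("/tmp/cont_data_with_id")
# later jobs: read the stored ids back instead of regenerating them
df1_fixed = spark.read.parquet("/tmp/df1_with_id")
cont_fixed = spark.read.parquet("/tmp/cont_data_with_id")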

If you want to get an incremental number on both dataframes and then join them, you can generate a consecutive number with monotonically_increasing_id and a window, as in the following code:
df1 = df1.withColumn("monotonically_increasing_id",monotonically_increasing_id())
window = Window.orderBy(scol('monotonically_increasing_id'))
df1 = df1.withColumn("match_id", row_number().over(window))
df1 = df1.drop("monotonically_increasing_id")
cont_data = cont_data.withColumn("monotonically_increasing_id",monotonically_increasing_id())
window = Window.orderBy(scol('monotonically_increasing_id'))
cont_data = cont_data.withColumn("match_id", row_number().over(window))
cont_data = cont_data.drop("monotonically_increasing_id")
cont_data = cont_data.join(df1,df1.match_id==cont_data.match_id, 'inner').drop(df1.match_id)
Warning: ordering a window over the whole dataset may move all the data to a single partition! So it may be better to separate the match_id into a different dataframe containing only the monotonically_increasing_id column, generate the consecutive incremental number there, and then join it back with the data (see the sketch below).
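One way to read that suggestion, sketched for a single DataFrame (caching the raw ids is an assumption here, to keep the non-deterministic values stable before they are reused in the join):
from pyspark.sql import Window
from pyspark.sql.functions import col, monotonically_increasing_id, row_number
# materialize the raw ids once so the non-deterministic values stay fixed
df1_ids = df1.withColumn("mono_id", monotonically_increasing_id()).cache()
# run the single-partition window over the narrow id column only
ids_numbered = df1_ids.select("mono_id").withColumn(
    "match_id", row_number().over(Window.orderBy(col("mono_id")))
)
# join the consecutive numbers back onto the full rows
df1_numbered = df1_ids.join(ids_numbered, "mono_id").drop("mono_id")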

Related

Create multiple columns over the same window

The following code is pretty slow.
Is there a way of creating multiple columns at once over the same window, so Spark does not need to partition and order the data multiple times?
from pyspark.sql import Window
from pyspark.sql import functions as F
w = Window.partitionBy("k").orderBy("t")
df = df.withColumn("a", F.last("a", True).over(w))
df = df.withColumn("b", F.last("b", True).over(w))
df = df.withColumn("c", F.last("c", True).over(w))
...
I'm not sure that Spark actually partitions and reorders the data several times here, since you use the same window consecutively. However, a single .select is usually a better alternative than chained .withColumn calls.
df = df.select(
    "*",
    F.last("a", True).over(w).alias("a"),
    F.last("b", True).over(w).alias("b"),
    F.last("c", True).over(w).alias("c"),
)
To find out whether partitioning and ordering are done several times, you need to analyse the df.explain() output.
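For example, a quick check along those lines, reusing w and F from above (the alias names are just illustrative, and the exact plan text varies by Spark version):
# consecutive expressions over the same window typically show up under a
# single Window operator with one exchange/sort in the physical plan
df.select(
    "*",
    F.last("a", True).over(w).alias("a_filled"),
    F.last("b", True).over(w).alias("b_filled"),
).explain()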
You don't have to generate one column at a time. Use a list comprehension. Code below:
new = ['a', 'b', 'c']
df = df.select(
    "*", *[F.last(x, True).over(w).alias(x) for x in new]
)

Create column and use as join key inside of JOIN in PySpark

I'd like to create a column to use as the join key inside of the join like:
df1.join(df2.withColumn('NewDF2Column', SOME_OPERATION),
         df1['key'] == df2['NewDF2Column'], how='left')
PySpark can never find the NewDF2Column to use as the join key. It works if I create it first in another dataframe, but not dynamically like this. Is it possible? Thank you!
DataFrames are immutable, which means you need to reassign the result to a variable every time you want to keep a change. In this case, you are creating NewDF2Column in the first parameter of the join operation, but the second parameter, where you reference NewDF2Column again, can't see the change made before. How to solve it?
First option
# Creating before joining
df2 = df2.withColumn('NewDF2Column', SOME_OPERATION)
output_df = df1.join(df2, df1['key'] == df2['NewDF2Column'], how='left')
Second option
# Creating a column to join with the same name as df1
output_df = df1.join(df2.withColumn('key', SOME_OPERATION), on='key', how='left')

Is there a way to add multiple columns to a dataframe, calculated as moving averages of different columns and/or over different durations?

I have a dataframe with time-series data and I am trying to add a lot of moving average columns to it with different windows of various ranges. When I do this column by column, results are pretty slow.
I have tried to just pile the withColumn calls until I have all of them.
Pseudo code:
from pyspark.sql import Window
from pyspark.sql.functions import col
import pyspark.sql.functions as pysparkSqlFunctions
## working from a data frame with 12 columns:
## - key as a String
## - time as a DateTime
## - col_{1:10} as numeric values
window_1h = Window.partitionBy("key") \
    .orderBy(col("time").cast("long")) \
    .rangeBetween(-3600, 0)
window_2h = Window.partitionBy("key") \
    .orderBy(col("time").cast("long")) \
    .rangeBetween(-7200, 0)
df = df.withColumn("col1_1h", pysparkSqlFunctions.avg("col_1").over(window_1h))
df = df.withColumn("col1_2h", pysparkSqlFunctions.avg("col_1").over(window_2h))
df = df.withColumn("col2_1h", pysparkSqlFunctions.avg("col_2").over(window_1h))
df = df.withColumn("col2_2h", pysparkSqlFunctions.avg("col_2").over(window_2h))
What I would like is the ability to add all 4 columns (or many more) in one call, hopefully traversing the data only once for better performance.
I prefer to import the functions library as F as it looks neater and it is the standard alias used in the official Spark documentation.
The star string, '*', should capture all the current columns within the dataframe. Alternatively, you could replace the star string with *df.columns. Here the star explodes the list into separate parameters for the select method.
from pyspark.sql import functions as F
df = df.select(
    "*",
    F.avg("col_1").over(window_1h).alias("col1_1h"),
    F.avg("col_1").over(window_2h).alias("col1_2h"),
    F.avg("col_2").over(window_1h).alias("col2_1h"),
    F.avg("col_2").over(window_2h).alias("col2_2h"),
)
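If you really have "many more" of these, the same select can be driven by a comprehension over the column/window combinations (a sketch; the column list and the naming scheme are illustrative):
from pyspark.sql import functions as F
windows = {"1h": window_1h, "2h": window_2h}
value_cols = ["col_1", "col_2"]  # extend up to col_10 as needed
df = df.select(
    "*",
    *[
        F.avg(c).over(w).alias(c.replace("_", "") + "_" + suffix)
        for c in value_cols
        for suffix, w in windows.items()
    ],
)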

PySpark isin function

I am converting my legacy Python code to Spark using PySpark.
I would like to get a PySpark equivalent of:
usersofinterest = actdataall[actdataall['ORDValue'].isin(orddata['ORDER_ID'].unique())]['User ID']
Both, actdataall and orddata are Spark dataframes.
I don't want to use toPandas() function given the drawback associated with it.
If both dataframes are big, you should consider using an inner join which will work as a filter:
First let's create a dataframe containing the order IDs we want to keep:
orderid_df = orddata.select(orddata.ORDER_ID.alias("ORDValue")).distinct()
Now let's join it with our actdataall dataframe:
usersofinterest = actdataall.join(orderid_df, "ORDValue", "inner").select('User ID').distinct()
If your target list of order IDs is small, then you can use the pyspark.sql isin function as mentioned in furianpandit's post. Don't forget to broadcast your variable before using it (Spark will copy the object to every node, making the tasks a lot faster):
orderid_list = orddata.select('ORDER_ID').distinct().rdd.flatMap(lambda x: x).collect()
orderid_list_bc = sc.broadcast(orderid_list)
usersofinterest = actdataall.filter(actdataall['ORDValue'].isin(orderid_list_bc.value)).select('User ID')
The most direct translation of your code would be:
from pyspark.sql import functions as F
# collect all the unique ORDER_IDs to the driver
order_ids = [x.ORDER_ID for x in orddata.select('ORDER_ID').distinct().collect()]
# filter ORDValue column by list of order_ids, then select only User ID column
usersofinterest = actdataall.filter(F.col('ORDValue').isin(order_ids)).select('User ID')
However, you should only filter like this if the number of distinct 'ORDER_ID' values is definitely small (perhaps <100,000 or so).
If the number of 'ORDER_ID's is large, you should use a broadcast variable, which sends the list of order_ids to each executor so it can be compared against locally for faster processing. Note that this will work even if the list of 'ORDER_ID's is small.
order_ids = [x.ORDER_ID for x in orddata.select('ORDER_ID').distinct().collect()]
order_ids_broadcast = sc.broadcast(order_ids) # send to broadcast variable
usersofinterest = actdataall.filter(F.col('ORDValue').isin(order_ids_broadcast.value)).select('User ID')
For more information on broadcast variables, check out: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-broadcast.html
So, you have two Spark dataframes: one is actdataall and the other is orddata. Then use the following command to get your desired result:
usersofinterest = actdataall.where(actdataall['ORDValue'].isin(orddata.select('ORDER_ID').distinct().rdd.flatMap(lambda x: x).collect())).select('User ID')

How to remove rows in DataFrame on column based on another DataFrame?

I'm trying to use SQLContext.subtract() in Spark 1.6.1 to remove rows from a dataframe based on a column from another dataframe. Let's use an example:
from pyspark.sql import Row
df1 = sqlContext.createDataFrame([
Row(name='Alice', age=2),
Row(name='Bob', age=1),
]).alias('df1')
df2 = sqlContext.createDataFrame([
Row(name='Bob'),
])
df1_with_df2 = df1.join(df2, 'name').select('df1.*')
df1_without_df2 = df1.subtract(df1_with_df2)
Since I want all rows from df1 which don't include name='Bob' I expect Row(age=2, name='Alice'). But I also retrieve Bob:
print(df1_without_df2.collect())
# [Row(age='1', name='Bob'), Row(age='2', name='Alice')]
After various experiments to get down to this MCVE, I found out that the issue is with the age key. If I omit it:
df1_noage = sqlContext.createDataFrame([
Row(name='Alice'),
Row(name='Bob'),
]).alias('df1_noage')
df1_noage_with_df2 = df1_noage.join(df2, 'name').select('df1_noage.*')
df1_noage_without_df2 = df1_noage.subtract(df1_noage_with_df2)
print(df1_noage_without_df2.collect())
# [Row(name='Alice')]
Then I only get Alice as expected. The weirdest observation I made is that it's possible to add keys, as long as they're after (in the lexicographical order sense) the key I use in the join:
df1_zage = sqlContext.createDataFrame([
Row(zage=2, name='Alice'),
Row(zage=1, name='Bob'),
]).alias('df1_zage')
df1_zage_with_df2 = df1_zage.join(df2, 'name').select('df1_zage.*')
df1_zage_without_df2 = df1_zage.subtract(df1_zage_with_df2)
print(df1_zage_without_df2.collect())
# [Row(name='Alice', zage=2)]
I correctly get Alice (with her zage)! In my real examples, I'm interested in all columns, not only the ones that are after name.
Well, there are some bugs here (the first issue looks related to the same problem as SPARK-6231) and filing a JIRA looks like a good idea, but SUBTRACT / EXCEPT is not the right choice for partial matches.
Instead, as of Spark 2.0, you can use anti-join:
df1.join(df1_with_df2, ["name"], "leftanti").show()
In 1.6 you can do pretty much the same thing with a standard outer join:
import pyspark.sql.functions as F
ref = df1_with_df2.select("name").alias("ref")
(df1
    .join(ref, ref.name == df1.name, "leftouter")
    .filter(F.isnull("ref.name"))
    .drop(F.col("ref.name")))
