passing array into isin() function in Databricks - apache-spark

I have a requirement to filter records from a df if the value is present in an array. The array holds the distinct values from another df's column, like below.
dist_eventCodes = Event_code.select('Value').distinct().collect()
Now I am passing this dist_eventCodes to a filter like below.
ADT_df_select = ADT_df.filter(ADT_df.eventTypeCode.isin(dist_eventCodes))
When I run this code I get the below error message:
"AttributeError: 'DataFrame' object has no attribute '_get_object_id'"
Can somebody please help me understand what I am doing wrong?
Thanks in advance

If I understood correctly, you want to retain only those rows where eventTypeCode is present in the Value column of the Event_code dataframe.
Let me know if this is not the case.
This can be achieved by a simple left-semi join in Spark. This way you don't need to collect the dataframe, which makes it the right approach in a distributed environment.
ADT_df.alias("df1").join(Event_code.select("value").distinct().alias("df2"), [F.col("df1.eventTypeCode") == F.col("df2.value")], 'leftsemi')
Or if there is a specific need to use isin, this would work (collect_set will take care of distinct):
dist_eventCodes = Event_code.select("value").groupBy(F.lit("dummy")).agg(F.collect_set("value").alias("value")).first().asDict()
ADT_df_select = ADT_df.filter(ADT_df["eventTypeCode"].isin(dist_eventCodes["value"]))
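For illustration, a minimal self-contained sketch of the left-semi approach, assuming toy data (the column names follow the question; the values are made up):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data assumed for illustration only
ADT_df = spark.createDataFrame([("A01", 1), ("A02", 2), ("A05", 3)], ["eventTypeCode", "id"])
Event_code = spark.createDataFrame([("A01",), ("A02",), ("A02",)], ["Value"])

# Keep only ADT_df rows whose eventTypeCode exists in Event_code.Value
ADT_df_select = ADT_df.alias("df1").join(
    Event_code.select("Value").distinct().alias("df2"),
    F.col("df1.eventTypeCode") == F.col("df2.Value"),
    "leftsemi",
)
ADT_df_select.show()  # rows with A01 and A02 remain; A05 is dropped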

Related

Returning a default value when looking up a pyspark map Column

I have a map Column that I created using pyspark.sql.functions.create_map. I am performing some actions that require me to look up a key in this map column, as shown below.
lookup_map[col("col1")]
If a key does not exist in the lookup_map column, I want it to return a default value. How can I achieve this?
Use coalesce:
F.coalesce(lookup_map[col("col1")], F.lit("default"))
E.g.
For the map below:
mapping = {'1': 'value'}
mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping.items())])
and an input DF with an id column, the output of
df.withColumn("value", F.coalesce(mapping_expr[F.col("id")], F.lit("x"))).show()
will contain the mapped value where the id is a key of the map, and the default "x" otherwise.
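A runnable sketch of that example, with a small assumed input DF (ids '1' and '2', where only '1' is a key of the map):
from itertools import chain
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed toy input for illustration
df = spark.createDataFrame([("1",), ("2",)], ["id"])

mapping = {'1': 'value'}
mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping.items())])

# id '1' resolves to 'value'; id '2' is not in the map and falls back to 'x'
df.withColumn("value", F.coalesce(mapping_expr[F.col("id")], F.lit("x"))).show()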
I managed to do it using when and otherwise:
df.withColumn("col", when(col("mapColumn").getItem("key").isNotNull(),col("mapColumn").getItem("key")).otherwise(lit("DEFAULT_VALUE")))
There is another answer that suggests using coalesce. I haven't tried it, and I am not sure if there are any differences in performance between them.

why am I getting column object not callable error in pyspark?

I am doing a simple parquet file read and running a query to find the unmatched rows from the left table. Please see the code snippet below.
argTestData = '<path to parquet file>'
tst_DF = spark.read.option('header', True).parquet(argTestData)
argrefData = '<path to parquet file>'
refDF = spark.read.option('header', True).parquet(argrefData)
cond = ["col1", "col2", "col3"]
fi = tst_DF.join(refDF, cond , "left_anti")
So far things are working. However, as a requirement, I need to get the list of elements if the above gives a count > 0, i.e. if fi.count() > 0, then I need the element names. So, I tried the code below, but it is throwing an error.
if fi.filter(col("col1").count() > 0).collect():
    fi.show()
error
TypeError: 'Column' object is not callable
Note:
I have 3 columns as a joining condition, which are in a list assigned to the variable cond, and I need to get the unmatched records for those 3 columns, so the if condition has to accommodate them. Of course, there are many other columns due to the join.
Please suggest where am I making mistakes.
Thank you
If I understand correctly, that's simply:
fi.select(cond).collect()
The left_anti join already gets the records which do not match (exist in tst_DF but not in refDF).
You can add a distinct before the collect to remove duplicates.
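Putting both points together, a short sketch:
# Distinct unmatched join-key combinations from the left-anti join
unmatched = fi.select(cond).distinct().collect()
if unmatched:
    fi.show()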
Did you import the column function?
from pyspark.sql import functions as F
...
if fi.filter(F.col("col1").count() > 0).collect():
    fi.show()

Multiple if elif conditions to be evaluated for each row of pyspark dataframe

I need help with a PySpark dataframe topic.
I have a dataframe of, say, 1000+ columns and 100,000+ rows. I also have 10,000+ if elif conditions, and under each if else condition a few global variables get incremented by some values.
Now my question is: how can I achieve this in PySpark only?
I read about the filter and where functions, which return rows based on a condition, but I need to check those 10,000+ if else conditions and perform some manipulations.
Any help would be appreciated.
If you could give an example with small dataset that would be of great help.
Thank you
You can define a function to contain all of your if elif conditions, then apply this function to each row of the DataFrame.
Just use .rdd to convert the DataFrame to a normal RDD, then use the map() function.
e.g., DF.rdd.map(lambda row: func(row))
Hope it can help you.
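A minimal sketch of that approach, assuming a toy DataFrame and a hypothetical func standing in for the if/elif chain:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed toy DataFrame for illustration
DF = spark.createDataFrame([(1, 10), (2, -3)], ["id", "value"])

def func(row):
    # Stand-in for the real if/elif chain; returns a derived value per row
    if row.value > 0:
        return (row.id, "positive")
    elif row.value < 0:
        return (row.id, "negative")
    else:
        return (row.id, "zero")

result = DF.rdd.map(lambda row: func(row)).collect()
print(result)  # [(1, 'positive'), (2, 'negative')]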
As I understand it, you just want to update some global counters while iterating over your DataFrame. For this, you need to:
1) Define one or more accumulators:
ac_0 = sc.accumulator(0)
ac_1 = sc.accumulator(0)
2) Define a function to update your accumulators for a given row, e.g:
def accumulate(row):
    if row.foo:
        ac_0.add(1)
    elif row.bar:
        ac_1.add(row.baz)
3) Call foreach on your DataFrame:
df.foreach(accumulate)
4) Inspect the accumulator values
>>> ac_0.value
123
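Combining the four steps into one runnable sketch, with an assumed toy DataFrame that has foo, bar, and baz columns:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Assumed toy data for illustration
df = spark.createDataFrame(
    [(True, False, 0), (False, True, 5), (False, True, 7)],
    ["foo", "bar", "baz"],
)

ac_0 = sc.accumulator(0)
ac_1 = sc.accumulator(0)

def accumulate(row):
    if row.foo:
        ac_0.add(1)
    elif row.bar:
        ac_1.add(row.baz)

df.foreach(accumulate)

print(ac_0.value)  # 1  (one row with foo == True)
print(ac_1.value)  # 12 (5 + 7 from the rows where bar == True)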

How to apply function to each row of specified column of PySpark DataFrame

I have a PySpark DataFrame consisting of three columns, whose structure is as below.
In[1]: df.take(1)
Out[1]:
[Row(angle_est=-0.006815859163590619, rwsep_est=0.00019571401752467945, cost_est=34.33651951754235)]
What I want to do is to retrieve each value of the first column (angle_est), and pass it as parameter xMisallignment to a defined function to set a particular property of a class object. The defined function is:
def setMisAllignment(self, xMisallignment):
    if np.abs(xMisallignment) > 0.8:
        warnings.warn('You might set misallignment angle too large.')
    self.MisAllignment = xMisallignment
I am trying to select the first column, convert it into an RDD, and apply the above function through a map() function, but it does not seem to work; MisAllignment did not change.
df.select(df.angle_est).rdd.map(lambda row: model0.setMisAllignment(row))
In[2]: model0.MisAllignment
Out[2]: 0.00111511718224
Anyone has ideas to help me let that function work? Thanks in advance!
You can register your function as a Spark UDF, similar to the following:
spark.udf.register("misallign", setMisAllignment)
You can get many examples of creating and registering UDFs in this test suite:
https://github.com/apache/spark/blob/master/sql/core/src/test/java/test/org/apache/spark/sql/JavaUDFSuite.java
Hope it answers your question
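As a rough, hypothetical sketch (not from the original answer): since a UDF runs on the executors and cannot mutate driver-side state such as model0.MisAllignment, one option is to return the checked value per row instead, assuming a helper like check_misallignment below:
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Hypothetical helper: return the angle if it passes the check, otherwise None
def check_misallignment(x):
    return float(x) if abs(x) <= 0.8 else None

misallign_udf = F.udf(check_misallignment, DoubleType())
df_checked = df.withColumn("angle_checked", misallign_udf(F.col("angle_est")))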

PySpark isin function

I am converting my legacy Python code to Spark using PySpark.
I would like to get a PySpark equivalent of:
usersofinterest = actdataall[actdataall['ORDValue'].isin(orddata['ORDER_ID'].unique())]['User ID']
Both, actdataall and orddata are Spark dataframes.
I don't want to use the toPandas() function, given the drawbacks associated with it.
If both dataframes are big, you should consider using an inner join which will work as a filter:
First let's create a dataframe containing the order IDs we want to keep:
orderid_df = orddata.select(orddata.ORDER_ID.alias("ORDValue")).distinct()
Now let's join it with our actdataall dataframe:
usersofinterest = actdataall.join(orderid_df, "ORDValue", "inner").select('User ID').distinct()
If your target list of order IDs is small, then you can use the pyspark.sql isin function as mentioned in furianpandit's post. Don't forget to broadcast your variable before using it (Spark will copy the object to every node, making the tasks a lot faster):
orderid_list = orddata.select('ORDER_ID').distinct().rdd.flatMap(lambda x: x).collect()
orderid_list_bc = sc.broadcast(orderid_list)
The most direct translation of your code would be:
from pyspark.sql import functions as F
# collect all the unique ORDER_IDs to the driver
order_ids = [x.ORDER_ID for x in orddata.select('ORDER_ID').distinct().collect()]
# filter ORDValue column by list of order_ids, then select only User ID column
usersofinterest = actdataall.filter(F.col('ORDValue').isin(order_ids)).select('User ID')
However, you should only filter like this if the number of 'ORDER_ID' values is definitely small (perhaps <100,000 or so).
If the number of 'ORDER_ID' values is large, you should use a broadcast variable, which sends the list of order_ids to each executor so the comparison can be done locally for faster processing. Note, this will work even if the number of 'ORDER_ID' values is small.
order_ids = [x.ORDER_ID for x in orddata.select('ORDER_ID').distinct().collect()]
order_ids_broadcast = sc.broadcast(order_ids) # send to broadcast variable
usersofinterest = actdataall.filter(F.col('ORDValue').isin(order_ids_broadcast.value)).select('User ID')
For more information on broadcast variables, check out: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-broadcast.html
So, you have two Spark dataframes. One is actdataall and the other is orddata; use the following command to get your desired result.
usersofinterest = actdataall.where(actdataall['ORDValue'].isin(orddata.select('ORDER_ID').distinct().rdd.flatMap(lambda x: x).collect())).select('User ID')

Resources