How to reference an aliased aggregate function applied to a column in a join condition - apache-spark

I'm applying an aggregate function (max) to a column, which I'm then referencing in a join.
The column becomes max(column_name) in the data frame. So, to make it easier to reference using Python's dot notation, I aliased the column, but I'm still getting an error:
tmp = hiveContext.sql("SELECT * FROM s3_data.nate_glossary WHERE profile_id_guid='ffaff64b-e87c-4a43-b593-b0e4bccc2731'"
)
max_processed = tmp.groupby('profile_id_guid','profile_version_id','record_type','scope','item_id','attribute_key') \
.agg(max("processed_date").alias("max_processed_date"))
df = max_processed.join(tmp, [max_processed.profile_id_guid == tmp.profile_id_guid,
max_processed.profile_version_id == tmp.profile_version_id,
max_processed.record_type == tmp.record_type,
max_processed.scope == tmp.scope,
max_processed.item_id == tmp.item_id,
max_processed.attribute_key == tmp.attribute_key,
max_processed.max_processed_date == tmp.processed_date])
The error:
File "", line 7, in File
"/usr/hdp/2.5.0.0-1245/spark/python/pyspark/sql/dataframe.py", line
650, in join
jdf = self._jdf.join(other._jdf, on._jc, "inner") File "/usr/hdp/2.5.0.0-1245/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 813, in call File
"/usr/hdp/2.5.0.0-1245/spark/python/pyspark/sql/utils.py", line 51, in
deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: u'resolved attribute(s)
processed_date#10 missing from
record_type#41,scope#4,item_id#5,profile_id_guid#1,data_type#44,attribute_value#47,logical_id#45,profile_version_id#40,profile_version_id#2,attribute_key#8,max_processed_date#37,attribute_key#46,processed_date#48,scope#42,record_type#3,item_id#43,profile_id_guid#39,ems_system_id#38
in operator !Join Inner, Some((((((((profile_id_guid#1 =
profile_id_guid#1) && (profile_version_id#2 = profile_version_id#2))
&& (record_type#3 = record_type#3)) && (scope#4 = scope#4)) &&
(item_id#5 = item_id#5)) && (attribute_key#8 = attribute_key#8)) &&
(max_processed_date#37 = processed_date#10)));'
Note the error message: "processed_date#10 missing". I see processed_date#48 and processed_date#10 in the list of attributes.

See:
# DataFrame transformation
tmp -> max_processed -> df
The above three DataFrame share the same lineage, so if you want to use same column more than once, you need to use alias.
For example:
tmp = spark.createDataFrame([(1, 3, 1), (1, 3, 0), (2, 3, 1)], ['key1', 'key2', 'val'])
max_processed = tmp.groupBy(['key1', 'key2']).agg(f.max(tmp['val']).alias('max_val'))\
.withColumnRenamed('key1', 'max_key1').withColumnRenamed('key2', 'max_key2')\
df = max_processed.join(tmp, on=[max_processed['max_key1'] == tmp['key1'],
max_processed['max_key2'] == tmp['key2'],
max_processed['max_val'] == tmp['val']])
df.show()
+--------+--------+-------+----+----+---+
|max_key1|max_key2|max_val|key1|key2|val|
+--------+--------+-------+----+----+---+
| 1| 3| 1| 1| 3| 1|
| 2| 3| 1| 2| 3| 1|
+--------+--------+-------+----+----+---+
I still think this is a defect in spark lineage, to be honest

Related

Merging two or more dataframes/rdd efficiently in PySpark

I'm trying to merge three RDD's based on the same key. The following is the data.
+------+---------+-----+
|UserID|UserLabel|Total|
+------+---------+-----+
| 2| Panda| 15|
| 3| Candy| 15|
| 1| Bahroze| 15|
+------+---------+-----+
+------+---------+-----+
|UserID|UserLabel|Total|
+------+---------+-----+
| 2| Panda| 7342|
| 3| Candy| 5669|
| 1| Bahroze| 8361|
+------+---------+-----+
+------+---------+-----+
|UserID|UserLabel|Total|
+------+---------+-----+
| 2| Panda| 37|
| 3| Candy| 27|
| 1| Bahroze| 39|
+------+---------+-----+
I'm able to merge these three DF. I converted them to RDD dict with the following code for all three
new_rdd = userTotalVisits.rdd.map(lambda row: row.asDict(True))
After RDD conversion, I'm taking one RDD and the other two as lists. Mapping the first RDD and then adding other keys to it based on the same UserID. I was hoping there was a better way of doing this using pyspark. Here's the code I've written.
def transform(row):
# Add a new key to each row
for x in conversion_list: # first rdd in list of object as[{}] after using collect()
if( x['UserID'] == row['UserID'] ):
row["Total"] = { "Visitors": row["Total"], "Conversions": x["Total"] }
for y in Revenue_list: # second rdd in list of object as[{}] after using collect()
if( y['UserID'] == row['UserID'] ):
row["Total"]["Revenue"] = y["Total"]
return row
potato = new_rdd.map(lambda row: transform(row)) #first rdd
How should I efficiently merge these three RDDs/DFs? (because I had to perform three different task on a huge DF). Looking for a better efficient idea. PS I'm still spark newbie. The result of my code does is as follows which is what I need.
{'UserID': '2', 'UserLabel': 'Panda', 'Total': {'Visitors': 37, 'Conversions': 15, 'Revenue': 7342}}
{'UserID': '3', 'UserLabel': 'Candy', 'Total': {'Visitors': 27, 'Conversions': 15, 'Revenue': 5669}}
{'UserID': '1', 'UserLabel': 'Bahroze', 'Total': {'Visitors': 39, 'Conversions': 15, 'Revenue': 8361}}
Thank you.
You can join the 3 dataframes on columns ["UserID", "UserLabel"], create a new struct total from the 3 total columns:
from pyspark.sql import functions as F
result = df1.alias("conv") \
.join(df2.alias("rev"), ["UserID", "UserLabel"], "left") \
.join(df3.alias("visit"), ["UserID", "UserLabel"], "left") \
.select(
F.col("UserID"),
F.col("UserLabel"),
F.struct(
F.col("conv.Total").alias("Conversions"),
F.col("rev.Total").alias("Revenue"),
F.col("visit.Total").alias("Visitors")
).alias("Total")
)
# write into json file
result.write.json("output")
# print result:
for i in result.toJSON().collect():
print(i)
# {"UserID":3,"UserLabel":"Candy","Total":{"Conversions":15,"Revenue":5669,"Visitors":27}}
# {"UserID":1,"UserLabel":"Bahroze","Total":{"Conversions":15,"Revenue":8361,"Visitors":39}}
# {"UserID":2,"UserLabel":"Panda","Total":{"Conversions":15,"Revenue":7342,"Visitors":37}}
You can just do the left joins on all the three dataframes but make sure the first dataframe that you use has all the UserID and UserLabel Values. You can ignore the GroupBy operation as suggested by #blackbishop and still it would give you the required output
I am showing how it can be done in scala but you could do something similar in python.
//source data
val visitorDF = Seq((2,"Panda",15),(3,"Candy",15),(1,"Bahroze",15),(4,"Test",25)).toDF("UserID","UserLabel","Total")
val conversionsDF = Seq((2,"Panda",37),(3,"Candy",27),(1,"Bahroze",39)).toDF("UserID","UserLabel","Total")
val revenueDF = Seq((2,"Panda",7342),(3,"Candy",5669),(1,"Bahroze",8361)).toDF("UserID","UserLabel","Total")
import org.apache.spark.sql.functions._
val finalDF = visitorDF.as("v").join(conversionsDF.as("c"),Seq("UserID","UserLabel"),"left")
.join(revenueDF.as("r"),Seq("UserID","UserLabel"),"left")
.withColumn("TotalArray",struct($"v.Total".as("Visitor"),$"c.Total".as("Conversions"),$"r.Total".as("Revenue")))
.drop("Total")
display(finalDF)
You can see the output as below :

Split Text in Dataframe and Check if Contains Substring

So I want to check if my text contains the word 'baby' and not any other word that contains 'baby'. For example, "maybaby" would not be a match. I already have piece of code that works, but I wanted to see if there was a better way to format so that I don't have to go through the data twice. Here is what I have thus far:
import pyspark.sql.functions as F
rows = sc.parallelize([['14-banana'], ['12-cheese'], ['13-olives'], ['11-almonds'], ['23-maybaby'], ['54-baby']])
rows_df = rows.toDF(["ID"])
split = F.split(rows_df.ID, '-')
rows_df = rows_df.withColumn('fruit', split)
+----------+-------------+
| ID| fruit|
+----------+-------------+
| 14-banana| [14, banana]|
| 12-cheese| [12, cheese]|
| 13-olives| [13, olives]|
|11-almonds|[11, almonds]|
|23-maybaby|[23, maybaby]|
| 54-baby| [54, baby]|
+----------+-------------+
from pyspark.sql.types import StringType
def func(col):
for item in col:
if item == "baby":
return "yes"
return "no"
func_udf = udf(func, StringType())
df_hierachy_concept = rows_df.withColumn('new',func_udf(rows_df['fruit']))
+----------+-------------+---+
| ID| fruit|new|
+----------+-------------+---+
| 14-banana| [14, banana]| no|
| 12-cheese| [12, cheese]| no|
| 13-olives| [13, olives]| no|
|11-almonds|[11, almonds]| no|
|23-maybaby|[23, maybaby]| no|
| 54-baby| [54, baby]|yes|
+----------+-------------+---+
Ultimately, I just want the "ID" and "new" column only.
I'll show two ways to resolve this. Probably there's a lot other ways to reach the same result.
See the examples below:
from pyspark.shell import sc
from pyspark.sql.functions import split, when
rows = sc.parallelize(
[
['14-banana'], ['12-cheese'], ['13-olives'],
['11-almonds'], ['23-maybaby'], ['54-baby']
]
)
# Resolves with auxiliary column named "fruit"
rows_df = rows.toDF(["ID"])
rows_df = rows_df.withColumn('fruit', split(rows_df.ID, '-')[1])
rows_df = rows_df.withColumn('new', when(rows_df.fruit == 'baby', 'yes').otherwise('no'))
rows_df = rows_df.drop('fruit')
rows_df.show()
# Resolves directly without creating an auxiliary column
rows_df = rows.toDF(["ID"])
rows_df = rows_df.withColumn(
'new',
when(split(rows_df.ID, '-')[1] == 'baby', 'yes').otherwise('no')
)
rows_df.show()
# Resolves without forcing `split()[1]` call, avoiding out of index exception
rows_df = rows.toDF(["ID"])
is_new_udf = udf(lambda col: 'yes' if any(value == 'baby' for value in col) else 'no')
rows_df = rows_df.withColumn('new', is_new_udf(split(rows_df.ID, '-')))
rows_df.show()
All outputs are the same:
+----------+---+
| ID|new|
+----------+---+
| 14-banana| no|
| 12-cheese| no|
| 13-olives| no|
|11-almonds| no|
|23-maybaby| no|
| 54-baby|yes|
+----------+---+
I'd use pyspark.sql.functions.regexp_extract for this. Make the column new equal to "yes" if you're able to extract the word "baby" with a word boundary on both sides, and "no" otherwise.
from pyspark.sql.functions import regexp_extract, when
rows_df.withColumn(
'new',
when(
regexp_extract("ID", "(?<=(\b|\-))baby(?=(\b|$))", 0) == "baby",
"yes"
).otherwise("no")
).show()
#+----------+-------------+---+
#| ID| fruit|new|
#+----------+-------------+---+
#| 14-banana| [14, banana]| no|
#| 12-cheese| [12, cheese]| no|
#| 13-olives| [13, olives]| no|
#|11-almonds|[11, almonds]| no|
#|23-maybaby|[23, maybaby]| no|
#| 54-baby| [54, baby]|yes|
#+----------+-------------+---+
The last argument to regexp_extract is the index of the match to extract. We pick the first index (index 0). If the pattern doesn't match, an empty string is returned. Finally use when() to check if the extracted string equals the desired value.
The regex pattern means:
(?<=(\b|\-)): Positive look-behind for either a word boundary (\b) or a literal hyphen (-).
baby: The literal word "baby"
(?=(\b|$)): Positive look-ahead for either a word boundary or the end of the line ($).
This method also doesn't require you to first split the string, because it's unclear if that part is needed for your purposes.

How to detect null column in pyspark

I have a dataframe defined with some null values. Some Columns are fully null values.
>> df.show()
+---+---+---+----+
| A| B| C| D|
+---+---+---+----+
|1.0|4.0|7.0|null|
|2.0|5.0|7.0|null|
|3.0|6.0|5.0|null|
+---+---+---+----+
In my case, I want to return a list of columns name that are filled with null values. My idea was to detect the constant columns (as the whole column contains the same null value).
this is how I did it:
nullCoulumns = [c for c, const in df.select([(min(c) == max(c)).alias(c) for c in df.columns]).first().asDict().items() if const]
but this does no consider null columns as constant, it works only with values.
How should I then do it ?
Extend the condition to
from pyspark.sql.functions import min, max
((min(c).isNull() & max(c).isNull()) | (min(c) == max(c))).alias(c)
or use eqNullSafe (PySpark 2.3):
(min(c).eqNullSafe(max(c))).alias(c)
One way would be to do it implicitly: select each column, count its NULL values, and then compare this with the total number or rows. With your data, this would be:
spark.version
# u'2.2.0'
from pyspark.sql.functions import col
nullColumns = []
numRows = df.count()
for k in df.columns:
nullRows = df.where(col(k).isNull()).count()
if nullRows == numRows: # i.e. if ALL values are NULL
nullColumns.append(k)
nullColumns
# ['D']
But there is a simpler way: it turns out that the function countDistinct, when applied to a column with all NULL values, returns zero (0):
from pyspark.sql.functions import countDistinct
df.agg(countDistinct(df.D).alias('distinct')).collect()
# [Row(distinct=0)]
So the for loop now can be:
nullColumns = []
for k in df.columns:
if df.agg(countDistinct(df[k])).collect()[0][0] == 0:
nullColumns.append(k)
nullColumns
# ['D']
UPDATE (after comments): It seems possible to avoid collect in the second solution; since df.agg returns a dataframe with only one row, replacing collect with take(1) will safely do the job:
nullColumns = []
for k in df.columns:
if df.agg(countDistinct(df[k])).take(1)[0][0] == 0:
nullColumns.append(k)
nullColumns
# ['D']
How about this? In order to guarantee the column are all nulls, two properties must be satisfied:
(1) The min value is equal to the max value
(2) The min or max is null
Or, equivalently
(1) The min AND max are both equal to None
Note that if property (2) is not satisfied, the case where column values are [null, 1, null, 1] would be incorrectly reported since the min and max will be 1.
import pyspark.sql.functions as F
def get_null_column_names(df):
column_names = []
for col_name in df.columns:
min_ = df.select(F.min(col_name)).first()[0]
max_ = df.select(F.max(col_name)).first()[0]
if min_ is None and max_ is None:
column_names.append(col_name)
return column_names
Here's an example in practice:
>>> rows = [(None, 18, None, None),
(1, None, None, None),
(1, 9, 4.0, None),
(None, 0, 0., None)]
>>> schema = "a: int, b: int, c: float, d:int"
>>> df = spark.createDataFrame(data=rows, schema=schema)
>>> df.show()
+----+----+----+----+
| a| b| c| d|
+----+----+----+----+
|null| 18|null|null|
| 1|null|null|null|
| 1| 9| 4.0|null|
|null| 0| 0.0|null|
+----+----+----+----+
>>> get_null_column_names(df)
['d']

PySpark: withColumn() with two conditions and three outcomes

I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode:
df = df.withColumn('new_column',
IF fruit1 == fruit2 THEN 1, ELSE 0. IF fruit1 IS NULL OR fruit2 IS NULL 3.)
I am trying to do this in PySpark but I'm not sure about the syntax. Any pointers? I looked into expr() but couldn't get it to work.
Note that df is a pyspark.sql.dataframe.DataFrame.
There are a few efficient ways to implement this. Let's start with required imports:
from pyspark.sql.functions import col, expr, when
You can use Hive IF function inside expr:
new_column_1 = expr(
"""IF(fruit1 IS NULL OR fruit2 IS NULL, 3, IF(fruit1 = fruit2, 1, 0))"""
)
or when + otherwise:
new_column_2 = when(
col("fruit1").isNull() | col("fruit2").isNull(), 3
).when(col("fruit1") == col("fruit2"), 1).otherwise(0)
Finally you could use following trick:
from pyspark.sql.functions import coalesce, lit
new_column_3 = coalesce((col("fruit1") == col("fruit2")).cast("int"), lit(3))
With example data:
df = sc.parallelize([
("orange", "apple"), ("kiwi", None), (None, "banana"),
("mango", "mango"), (None, None)
]).toDF(["fruit1", "fruit2"])
you can use this as follows:
(df
.withColumn("new_column_1", new_column_1)
.withColumn("new_column_2", new_column_2)
.withColumn("new_column_3", new_column_3))
and the result is:
+------+------+------------+------------+------------+
|fruit1|fruit2|new_column_1|new_column_2|new_column_3|
+------+------+------------+------------+------------+
|orange| apple| 0| 0| 0|
| kiwi| null| 3| 3| 3|
| null|banana| 3| 3| 3|
| mango| mango| 1| 1| 1|
| null| null| 3| 3| 3|
+------+------+------------+------------+------------+
You'll want to use a udf as below
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
def func(fruit1, fruit2):
if fruit1 == None or fruit2 == None:
return 3
if fruit1 == fruit2:
return 1
return 0
func_udf = udf(func, IntegerType())
df = df.withColumn('new_column',func_udf(df['fruit1'], df['fruit2']))
The withColumn function in pyspark enables you to make a new variable with conditions, add in the when and otherwise functions and you have a properly working if then else structure.
For all of this you would need to import the sparksql functions, as you will see that the following bit of code will not work without the col() function.
In the first bit, we declare a new column -'new column', and then give the condition enclosed in when function (i.e. fruit1==fruit2) then give 1 if the condition is true, if untrue the control goes to the otherwise which then takes care of the second condition (fruit1 or fruit2 is Null) with the isNull() function and if true 3 is returned and if false, the otherwise is checked again giving 0 as the answer.
from pyspark.sql import functions as F
df=df.withColumn('new_column',
F.when(F.col('fruit1')==F.col('fruit2'), 1)
.otherwise(F.when((F.col('fruit1').isNull()) | (F.col('fruit2').isNull()), 3))
.otherwise(0))

Column name when using concat

I am concatenating some columns in Spark SQL using the concat function. Here is some dummy code:
import org.apache.spark.sql.functions.{concat, lit}
val data1 = sc.parallelize(Array(2, 0, 3, 4, 5))
val data2 = sc.parallelize(Array(4, 0, 0, 6, 7))
val data3 = sc.parallelize(Array(1, 2, 3, 4, 10))
val dataJoin = data1.zip(data2).zip(data3).map((x) => (x._1._1, x._1._2, x._2 )).toDF("a1","a2","a3")
val dataConcat = dataJoin.select($"a1",concat(lit("["),$"a1", lit(","),$"a2", lit(","),$"a3", lit("]")))
Is there a way to specify or to change the label of the columns in order to avoid the default name which is not very practical?
+---+------------------------+
| a1|concat([,a1,,,a2,,,a3,])|
+---+------------------------+
| 2| [2,4,1]|
| 0| [0,0,2]|
| 3| [3,0,3]|
| 4| [4,6,4]|
| 5| [5,7,10]|
+---+------------------------+
Use as or alias methods to give a name to your column.

Resources