Pyspark - Looking to create a normalized version of a Double column - apache-spark

As the title states, I'd like to create a normalized version of an existing Double column.
As I'm quite new to pyspark, this was my attempt at solving this:
df2 = df.groupBy('id').count().toDF(*['id','count_trans'])
df2 = df2.withColumn('count_trans_norm', F.col('count_trans') / (F.max(F.col('count_trans'))))
When I do this, I get the following error:
"grouping expressions sequence is empty, and '`movie_id`' is not an aggregate function.
Any help would be much appreciated.

You need to specify an empty window if you want to get the maximum of count_trans in df2:
df2 = df.groupBy('id').count().toDF(*['id','count_trans'])
df3 = df2.selectExpr('*', 'count_trans / max(count_trans) over () as count_trans_norm')
Or if you prefer pyspark syntax:
from pyspark.sql import functions as F, Window
df3 = df2.withColumn('count_trans_norm', F.col('count_trans') / F.max(F.col('count_trans')).over(Window.orderBy()))

Related

Palantir Foundry spark.sql query

When I attempt to query my input table as a view, I get the error com.palantir.foundry.spark.api.errors.DatasetPathNotFoundException. My code is as follows:
def Median_Product_Revenue_Temp2(Merchant_Segments):
    Merchant_Segments.createOrReplaceTempView('Merchant_Segments_View')
    df = spark.sql('select * from Merchant_Segments_View limit 5')
    return df
I need to dynamically query this table, since I am trying to calculate the median using percentile_approx across numerous fields, and I'm not sure how to do this without using spark.sql.
If I try to avoid using spark.sql to calculate median across numerous fields using something like the below code, it results in the error Missing Transform Attribute: A module object does not have an attribute percentile_approx. Please check the spelling and/or the datatype of the object.
import pyspark.sql.functions as F
exprs = {x: percentile_approx("x", 0.5) for x in df.columns if x is not exclustion_list}
df = df.groupBy(['BANK_NAME','BUS_SEGMENT']).agg(exprs)
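Try createGlobalTempView; it worked for me.
For example:
df.createGlobalTempView("people")
Note that a global temp view lives in the global_temp database, so it has to be queried as global_temp.people.
(I don't know the root cause of why the local temp view does not work.)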
I managed to avoid using dynamic sql for calculating median across columns using the following code:
df_result = df.groupBy(group_list).agg(
    *[F.expr('percentile_approx(nullif('+col+',0), 0.5)').alias(col) for col in df.columns if col not in exclusion_list]
)
Embedding percentile_approx in an F.expr bypassed the issue I was encountering in the second half of my post.

How to use a lambda function to select the larger values from two Python dataframes while comparing them by date?

I want to go through the rows of df1 and compare them with the values of df2 by month and day, across every year in df2, keeping only the values in df1 that are larger than those in df2 and adding them to a new column, 'New'. df1 and df2 are the same size and are indexed by 'Month' and 'Day'. What would be the best way to do this?
df1=pd.DataFrame({'Date':['2015-01-01','2015-01-02','2015-01-03','2015-01-04','2005-01-05'],'Values':[-5.6,-5.6,0,3.9,9.4]})
df1.Date=pd.to_datetime(df1.Date)
df1['Day']=pd.DatetimeIndex(df1['Date']).day
df1['Month']=pd.DatetimeIndex(df1['Date']).month
df1.set_index(['Month','Day'],inplace=True)
df1
df2 = pd.DataFrame({'Date':['2005-01-01','2005-01-02','2005-01-03','2005-01-04','2005-01-05'],'Values':[-13.3,-12.2,6.7,8.8,15.5]})
df2.Date=pd.to_datetime(df2.Date)
df2['Day']=pd.DatetimeIndex(df2['Date']).day
df2['Month']=pd.DatetimeIndex(df2['Date']).month
df2.set_index(['Month','Day'],inplace=True)
df2
df2['New']=df2[df2['Values']<df1['Values']]
gives
ValueError: Can only compare identically-labeled Series objects
I have also tried
df2['New']=df2[df2['Values'].apply(lambda x: x < df1['Values'].values)]
The best way to handle your problem is with NumPy. NumPy has a function called "where" that helps a lot in cases like this.
This is how the call is structured:
df1['new column that will contain the comparison results'] = np.where(condition, 'value if true', 'value if false')
First import the library:
import numpy as np
Using the condition provided by you:
df2['New'] = np.where(df2['Values'] > df1['Values'], df2['Values'],'')
So, I think that solves your problem. You can change the value passed for the False case to anything you want; this is only an example.
Tell us if it worked!
Let's try two possible solutions:
The first solution is to sort the index first.
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
Perform a simple test to see if it works!
df1 == df2
This may still raise an error; if that happens, try this correction instead:
df1.sort_index(inplace=True, axis=1)
df2.sort_index(inplace=True, axis=1)
The second solution is to drop the indexes and reset them:
df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
Perform a simple test to see if it works!
df1 == df2
See if it works and tell us the result.
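Another option is to let pandas align the two frames on the (Month, Day) index explicitly and then keep df1's value only where it is larger. This is a minimal sketch, assuming each (Month, Day) pair appears once per frame; the helper index_by_month_day and the Values_1 / Values_2 suffixes are names made up for illustration, and the dates are simplified to one year per frame.

```python
import pandas as pd

# Sample frames shaped like the question's data.
df1 = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04", "2015-01-05"]),
    "Values": [-5.6, -5.6, 0, 3.9, 9.4],
})
df2 = pd.DataFrame({
    "Date": pd.to_datetime(["2005-01-01", "2005-01-02", "2005-01-03", "2005-01-04", "2005-01-05"]),
    "Values": [-13.3, -12.2, 6.7, 8.8, 15.5],
})

def index_by_month_day(df):
    """Return a copy indexed by (Month, Day), sorted so labels line up."""
    out = df.copy()
    out["Month"] = out["Date"].dt.month
    out["Day"] = out["Date"].dt.day
    return out.set_index(["Month", "Day"]).sort_index()

df1 = index_by_month_day(df1)
df2 = index_by_month_day(df2)

# join() aligns rows on the shared (Month, Day) index; a day missing
# from df2 would simply produce NaN on the right-hand side.
both = df1[["Values"]].join(df2[["Values"]], how="left", lsuffix="_1", rsuffix="_2")

# Keep df1's value where it exceeds df2's, otherwise leave NaN.
df1["New"] = both["Values_1"].where(both["Values_1"] > both["Values_2"])
```

Joining first sidesteps the "identically-labeled" error, because the comparison is then done between two columns of the same frame.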

PySpark equivalent of a simple SQL join

This is probably far too simple a question.
But I'm not getting very far on my own.
I'm trying to use PySpark in Databricks to do the SQL equivalent of a lookup:
select
a.*
, b.MASTER_ID as PLAYER_ID
from vGame a
join PLAYER_XREF b
on a.PLAYER_NAME = b.PLAYER
Note that the two attributes on both sides of the on are NOT named the same.
Can you show me the pyspark version of the same?
It seems to me that the numerous tangential posts here are over-the-top complex compared to this.
I found this, and it is really close, but the returned dataframe has all the columns of ta & tb.
inner_join = ta.join(tb, ta.name == tb.name)
I can list out all the ta columns individually & alias the one tb column with:
from pyspark.sql.functions import *
inner_join = ta.join(tb, ta.PLAYER_NAME == tb.PLAYER).select('<taCol1>', '<taCol2>', ... col('MASTER_ID').alias('PLAYER_ID'))
display(inner_join)
Logic:
1.) We first rename PLAYER_NAME in the ta dataframe to PLAYER so that both sides share the join column name.
2.) Once the column names are the same, we can join on a list of column names using square brackets [].
3.) We also dynamically select the columns from dataframe ta.
Code:
from pyspark.sql.functions import col
ta = ta.withColumnRenamed("PLAYER_NAME", "PLAYER")
inner_join = ta.join(tb, ["PLAYER"]).select([col(x) for x in ta.columns])

How to use monotonically_increasing_id to join two pyspark dataframes having no common column?

I have two pyspark dataframes with the same number of rows, but they don't have any common column. So I am adding a new column to both of them using monotonically_increasing_id() as follows:
from pyspark.sql.functions import monotonically_increasing_id as mi
id=mi()
df1 = df1.withColumn("match_id", id)
cont_data = cont_data.withColumn("match_id", id)
cont_data = cont_data.join(df1,df1.match_id==cont_data.match_id, 'inner').drop(df1.match_id)
But after the join, the resulting dataframe has fewer rows.
What am I missing here? Thanks.
You just don't. This is not an applicable use case for monotonically_increasing_id, which is by definition non-deterministic. Instead:
convert to RDD
zipWithIndex
convert back to DataFrame.
join
You can generate the IDs with monotonically_increasing_id, save the file to disk, and then read it back in, THEN do whatever joining process you need. I would only suggest this approach if you just need to generate the IDs once. At that point they can be used for joining, but for the reasons mentioned above, this is hacky and not a good solution for anything that runs regularly.
If you want to get an incremental number on both dataframes and then join, you can generate a consecutive number with monotonically_increasing_id and a window, using the following code:
from pyspark.sql import Window
from pyspark.sql.functions import col, monotonically_increasing_id, row_number
df1 = df1.withColumn("monotonically_increasing_id",monotonically_increasing_id())
window = Window.orderBy(col('monotonically_increasing_id'))
df1 = df1.withColumn("match_id", row_number().over(window))
df1 = df1.drop("monotonically_increasing_id")
cont_data = cont_data.withColumn("monotonically_increasing_id",monotonically_increasing_id())
window = Window.orderBy(col('monotonically_increasing_id'))
cont_data = cont_data.withColumn("match_id", row_number().over(window))
cont_data = cont_data.drop("monotonically_increasing_id")
cont_data = cont_data.join(df1,df1.match_id==cont_data.match_id, 'inner').drop(df1.match_id)
Warning: this may move all the data to a single partition! So it may be better to separate match_id into a different dataframe with the monotonically_increasing_id, generate the consecutive incremental number there, and then join it back with the data.

How to remove rows in DataFrame on column based on another DataFrame?

I'm trying to use SQLContext.subtract() in Spark 1.6.1 to remove rows from a dataframe based on a column from another dataframe. Let's use an example:
from pyspark.sql import Row
df1 = sqlContext.createDataFrame([
    Row(name='Alice', age=2),
    Row(name='Bob', age=1),
]).alias('df1')
df2 = sqlContext.createDataFrame([
    Row(name='Bob'),
])
df1_with_df2 = df1.join(df2, 'name').select('df1.*')
df1_without_df2 = df1.subtract(df1_with_df2)
Since I want all rows from df1 which don't include name='Bob' I expect Row(age=2, name='Alice'). But I also retrieve Bob:
print(df1_without_df2.collect())
# [Row(age='1', name='Bob'), Row(age='2', name='Alice')]
After various experiments to get down to this MCVE, I found out that the issue is with the age key. If I omit it:
df1_noage = sqlContext.createDataFrame([
    Row(name='Alice'),
    Row(name='Bob'),
]).alias('df1_noage')
df1_noage_with_df2 = df1_noage.join(df2, 'name').select('df1_noage.*')
df1_noage_without_df2 = df1_noage.subtract(df1_noage_with_df2)
print(df1_noage_without_df2.collect())
# [Row(name='Alice')]
Then I only get Alice as expected. The weirdest observation I made is that it's possible to add keys, as long as they're after (in the lexicographical order sense) the key I use in the join:
df1_zage = sqlContext.createDataFrame([
    Row(zage=2, name='Alice'),
    Row(zage=1, name='Bob'),
]).alias('df1_zage')
df1_zage_with_df2 = df1_zage.join(df2, 'name').select('df1_zage.*')
df1_zage_without_df2 = df1_zage.subtract(df1_zage_with_df2)
print(df1_zage_without_df2.collect())
# [Row(name='Alice', zage=2)]
I correctly get Alice (with her zage)! In my real examples, I'm interested in all columns, not only the ones that are after name.
Well, there are some bugs here (the first issue looks related to the same problem as SPARK-6231) and opening a JIRA looks like a good idea, but SUBTRACT / EXCEPT is not the right choice for partial matches.
Instead, as of Spark 2.0, you can use anti-join:
df1.join(df1_with_df2, ["name"], "leftanti").show()
In 1.6 you can do pretty much the same thing with standard outer join:
import pyspark.sql.functions as F
ref = df1_with_df2.select("name").alias("ref")
(df1
    .join(ref, ref.name == df1.name, "leftouter")
    .filter(F.isnull("ref.name"))
    .drop(F.col("ref.name")))
