I want to use PySpark to create a new dataframe from the input below that keeps only the first occurrence of each distinct value in the VALUE column. Would row_number() or a window work, or would Spark SQL be best? I'm not sure of the best way to approach this. Basically the second table is the output I want: it keeps just the first occurrence of each VALUE from the input. I am only interested in the first occurrence of the "value" column; if a value is repeated, only show the first one seen.
+-----+----+-----+
|VALUE| DAY|Color|
+-----+----+-----+
|   20| MON| BLUE|
|   20|TUES| BLUE|
|   30| WED| BLUE|
+-----+----+-----+
+-----+---+-----+
|VALUE|DAY|Color|
+-----+---+-----+
|   20|MON| BLUE|
|   30|WED| BLUE|
+-----+---+-----+
Here's how I'd do this without using a window. It will likely perform better on large data sets because it can use more of the cluster to do the work. In your case you would use 'VALUE' in place of Department and 'DAY' in place of Salary.
from pyspark.sql import SparkSession, Row
import pyspark.sql.functions as f

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James","Sales",3000),("Michael","Sales",4600),
        ("Robert","Sales",4100),("Maria","Finance",3000),
        ("Raman","Finance",3000),("Scott","Finance",3300),
        ("Jen","Finance",3900),("Jeff","Marketing",3000),
        ("Kumar","Marketing",2000)]
df = spark.createDataFrame(data, ["Name","Department","Salary"])

unGroupedDf = df.select(
    df["Department"],
    f.struct(                            # make a struct with all the record elements
        df["Salary"].alias("Salary"),    # will be sorted on Salary first
        df["Department"].alias("Dept"),
        df["Name"].alias("Name")
    ).alias("record"))

(unGroupedDf
    .groupBy("Department")                  # group
    .agg(f.collect_list("record")           # gather all the elements in a group
         .alias("record"))
    .select(
        f.reverse(                          # make the sort descending
            f.array_sort(                   # sort the array ascending
                f.col("record")             # the struct column
            )
        )[0].alias("record"))               # grab the "max" element in the array
    .select(f.col("record.*"))              # use the struct fields as columns
    .show())
+------+---------+-------+
|Salary|     Dept|   Name|
+------+---------+-------+
|  4600|    Sales|Michael|
|  3900|  Finance|    Jen|
|  3000|Marketing|   Jeff|
+------+---------+-------+
It appears to me you want to drop duplicate rows by VALUE. If so, use dropDuplicates:
df.dropDuplicates(['VALUE']).show()
+-----+---+-----+
|VALUE|DAY|Color|
+-----+---+-----+
| 20|MON| BLUE|
| 30|WED| BLUE|
+-----+---+-----+
Here's how to do it with a window. This example uses salary; in your case I think you'd use 'DAY' for orderBy and 'VALUE' for partitionBy.
from pyspark.sql import SparkSession,Row
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James","Sales",3000),("Michael","Sales",4600),
("Robert","Sales",4100),("Maria","Finance",3000),
("Raman","Finance",3000),("Scott","Finance",3300),
("Jen","Finance",3900),("Jeff","Marketing",3000),
("Kumar","Marketing",2000)]
df = spark.createDataFrame(data,["Name","Department","Salary"])
df.show()
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
w2 = Window.partitionBy("department").orderBy(col("salary"))
df.withColumn("row",row_number().over(w2)) \
.filter(col("row") == 1).drop("row") \
.show()
+-----+----------+------+
| Name|Department|Salary|
+-----+----------+------+
|James|     Sales|  3000|
|Maria|   Finance|  3000|
|Kumar| Marketing|  2000|
+-----+----------+------+
Yes, you'd need to develop a way of ordering days, but I think you can see that it's possible and that you picked the right tool. I always like to warn people: this uses a window, and a window pulls all the data for a partition onto a single executor to do the work, so it is not particularly efficient. On small datasets it will likely perform fine; on larger data sets it may take far too long to complete.
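To make the day-ordering point concrete, here is a minimal sketch (my own adaptation, not from the answer above) of the window approach applied to the original VALUE/DAY/Color data, assuming a simple hand-built MON/TUES/WED index:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(20, "MON", "BLUE"), (20, "TUES", "BLUE"), (30, "WED", "BLUE")],
    ["VALUE", "DAY", "Color"])

# Hypothetical day-to-index mapping so "first occurrence" is well defined.
day_idx = (when(col("DAY") == "MON", 1)
           .when(col("DAY") == "TUES", 2)
           .when(col("DAY") == "WED", 3)
           .otherwise(99))

w = Window.partitionBy("VALUE").orderBy(day_idx)
df.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn").show()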
I have this DataFrame:
+----+----+---+
|NAME|RANK| ID|
+----+----+---+
|null| 1|100|
| abc| 5|100|
| cyz| 2|100|
+----+----+---+
I am trying to access the column so that I can get the first non-null element, but I am getting the error:
TypeError: Column is not iterable
Here is what I tried:
grouped_df = df1.groupby('ID').agg(collect_list('NAME').alias("name")).select("*")
+---+----------------+
| ID| name|
+---+----------------+
|100|[null, abc, cyz]|
+---+----------------+
grouped_df.withColumn('temp',next(s for s in grouped_df["name"] if s))
I can access an item in the list by using the getItem method, but I am trying to get it dynamically:
grouped_df.select("*").withColumn('finalName',grouped_df["name"].getItem(1))
I want output like this
+---+----------------+
| ID| name|
+---+----------------+
|100| abc|
+---+----------------+
If anyone has any idea, please let me know.
You're trying to apply a Python generator expression to a Column object (grouped_df["name"] returns a Column, not a list).
Actually, when you use the collect_list function Spark ignores null values, so you don't need to search for the first non-null value in the array; just select the first element:
from pyspark.sql.functions import col
grouped_df.withColumn('temp', col("name").getItem(0))
But a better way to do this is to groupBy and select the first value using the first function with ignorenulls=True:
grouped_df = df1.groupby('ID').agg(first(col('NAME'), ignorenulls=True).alias("name")).select("*")
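For completeness, a minimal end-to-end sketch of that approach (my own, assuming df1 has exactly the NAME/RANK/ID columns shown above):
from pyspark.sql import SparkSession
from pyspark.sql.functions import first, col

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame(
    [(None, 1, 100), ("abc", 5, 100), ("cyz", 2, 100)],
    ["NAME", "RANK", "ID"])

# first(..., ignorenulls=True) picks the first non-null NAME Spark encounters
# in each group; without an explicit ordering that is "abc" here, but the pick
# is not guaranteed on larger, shuffled data.
grouped_df = df1.groupby("ID").agg(first(col("NAME"), ignorenulls=True).alias("name"))
grouped_df.show()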
I want to loop through a Spark DataFrame, check whether a condition (the aggregated value of multiple rows) is true or false, and then create a DataFrame accordingly. Please see the code outline below; can you help me fix it? I'm pretty new to Spark and Python and am struggling my way through it, so any help is greatly appreciated.
# sort trades by Instrument and Date (in ascending order)
dfsorted = df.orderBy('Instrument', 'Date').show()

# new temp variable to keep track of the quantity sum
sumofquantity = 0

# for each row in dfsorted:
    sumofquantity = sumofquantity + dfsorted['Quantity']
    # keep appending the rows looped over so far into a new dataframe called dftemp
    dftemp = dfsorted  # (how do I write this?)

    if sumofquantity == 0:
        # once sumofquantity becomes zero, for all the rows in dftemp add a new column
        # with a unique sequential number and append the rows into the final dataframe
        dffinal = dftemp.withColumn('trade#', ...)  # assign a unique trade number
        # reset sumofquantity back to 0
        sumofquantity = 0
        # clear dftemp - how do I clear the dataframe so I can start with zero rows
        # for the next iteration?
trade_sample.csv ( raw input file)
Customer ID,Instrument,Action,Date,Price,Quantity
U16,ADM6,BUY,20160516,0.7337,2
U16,ADM6,SELL,20160516,0.7337,-1
U16,ADM6,SELL,20160516,0.9439,-1
U16,CLM6,BUY,20160516,48.09,1
U16,CLM6,SELL,20160517,48.08,-1
U16,ZSM6,BUY,20160517,48.09,1
U16,ZSM6,SELL,20160518,48.08,-1
Expected Result ( notice last new column-that is all that I'm trying to add)
Customer ID,Instrument,Action,Date,Price,Quantity,trade#
U16,ADM6,BUY,20160516,0.7337,2,10001
U16,ADM6,SELL,20160516,0.7337,-1,10001
U16,ADM6,SELL,20160516,0.9439,-1,10001
U16,CLM6,BUY,20160516,48.09,1,10002
U16,CLM6,SELL,20160517,48.08,-1,10002
U16,ZSM6,BUY,20160517,48.09,1,10003
U16,ZSM6,SELL,20160518,48.08,-1,10003
Looping this way is not good practice. You cannot build up a DataFrame cumulatively row by row and then clear it, because DataFrames are immutable. For your problem you can use Spark's windowing concept.
As far as I understand your problem, you want to calculate a running sum of Quantity for each Customer ID, and once that sum reaches zero you reset sumofquantity to zero. If so, you can partition by Customer ID, order by Instrument and Date, and calculate the running sum for each Customer ID. Once you have the sum, you can derive trade# from your conditions.
Just refer to the code below:
>>> from pyspark.sql.window import Window
>>> from pyspark.sql.functions import row_number, col, sum
>>> w = Window.partitionBy("Customer_ID").orderBy("Instrument", "Date")
>>> w1 = Window.partitionBy("Customer_ID").orderBy("Instrument", "Date", "rn")
>>> dftemp = Df.withColumn("rn", row_number().over(w)) \
...            .withColumn("sumofquantity", sum("Quantity").over(w1)) \
...            .select("Customer_ID", "Instrument", "Action", "Date", "Price", "Quantity", "sumofquantity")
>>> dftemp.show()
+-----------+----------+------+--------+------+--------+-------------+
|Customer_ID|Instrument|Action| Date| Price|Quantity|sumofquantity|
+-----------+----------+------+--------+------+--------+-------------+
| U16| ADM6| BUY|20160516|0.7337| 2| 2|
| U16| ADM6| SELL|20160516|0.7337| -1| 1|
| U16| ADM6| SELL|20160516|0.9439| -1| 0|
| U16| CLM6| BUY|20160516| 48.09| 1| 1|
| U16| CLM6| SELL|20160517| 48.08| -1| 0|
| U16| ZSM6| BUY|20160517| 48.09| 1| 1|
| U16| ZSM6| SELL|20160518| 48.08| -1| 0|
+-----------+----------+------+--------+------+--------+-------------+
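To finish the trade# part, here is one possible continuation (my sketch, not from the original answer): a new trade starts on the first row of each customer and on any row whose previous running sum closed at zero, so a cumulative count of those starts, offset by 10000, reproduces the expected numbering. It assumes the dftemp frame shown above.
>>> from pyspark.sql.functions import lag, when, lit, row_number, sum as sum_
>>> w2 = Window.partitionBy("Customer_ID").orderBy("Instrument", "Date")
>>> w3 = Window.partitionBy("Customer_ID").orderBy("rn")
>>> dffinal = (dftemp
...     .withColumn("rn", row_number().over(w2))   # deterministic row order per customer
...     .withColumn("new_trade",                   # 1 when this row starts a new trade
...                 when(lag("sumofquantity", 1, 0).over(w3) == 0, 1).otherwise(0))
...     .withColumn("trade#", lit(10000) + sum_("new_trade").over(w3))
...     .drop("rn", "new_trade"))
>>> dffinal.show()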
You can read more about window functions at the links below:
https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
I am trying to get the sum of Revenue over the last 3 Month rows (excluding the current row) for each Client. Minimal example with current attempt in Databricks:
import numpy as np
import pandas as pd

cols = ['Client','Month','Revenue']
df_pd = pd.DataFrame([['A',201701,100],
['A',201702,101],
['A',201703,102],
['A',201704,103],
['A',201705,104],
['B',201701,201],
['B',201702,np.nan],
['B',201703,203],
['B',201704,204],
['B',201705,205],
['B',201706,206],
['B',201707,207]
])
df_pd.columns = cols
spark_df = spark.createDataFrame(df_pd)
spark_df.createOrReplaceTempView('df_sql')
df_out = sqlContext.sql("""
select *, (sum(ifnull(Revenue,0)) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding)) as Total_Sum3
from df_sql
""")
df_out.show()
+------+------+-------+----------+
|Client| Month|Revenue|Total_Sum3|
+------+------+-------+----------+
| A|201701| 100.0| null|
| A|201702| 101.0| 100.0|
| A|201703| 102.0| 201.0|
| A|201704| 103.0| 303.0|
| A|201705| 104.0| 306.0|
| B|201701| 201.0| null|
| B|201702| NaN| 201.0|
| B|201703| 203.0| NaN|
| B|201704| 204.0| NaN|
| B|201705| 205.0| NaN|
| B|201706| 206.0| 612.0|
| B|201707| 207.0| 615.0|
+------+------+-------+----------+
As you can see, if a null value exists anywhere in the 3 month window, a null value is returned. I would like to treat nulls as 0, hence the ifnull attempt, but this does not seem to work. I have also tried a case statement to change NULL to 0, with no luck.
Just coalesce outside sum:
df_out = sqlContext.sql("""
select *, coalesce(sum(Revenue) over (partition by Client
                order by Client,Month
                rows between 3 preceding and 1 preceding), 0) as Total_Sum3
from df_sql
""")
It is Apache Spark, my bad! (I am working in Databricks and I thought it was MySQL under the hood.) Is it too late to change the title?
@Barmar, you are right in that IFNULL() doesn't treat NaN as null. I managed to figure out the fix thanks to @user6910411 from here: SO link. I had to change the numpy NaNs to Spark nulls. The correct code, starting from after the sample df_pd is created:
spark_df = spark.createDataFrame(df_pd)
from pyspark.sql.functions import isnan, col, when
#this converts all NaNs in numeric columns to null:
spark_df = spark_df.select([
when(~isnan(c), col(c)).alias(c) if t in ("double", "float") else c
for c, t in spark_df.dtypes])
spark_df.createOrReplaceTempView('df_sql')
df_out = sqlContext.sql("""
select *, (sum(ifnull(Revenue,0)) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding)) as Total_Sum3
from df_sql order by Client,Month
""")
df_out.show()
which then gives the desired:
+------+------+-------+----------+
|Client| Month|Revenue|Total_Sum3|
+------+------+-------+----------+
| A|201701| 100.0| null|
| A|201702| 101.0| 100.0|
| A|201703| 102.0| 201.0|
| A|201704| 103.0| 303.0|
| A|201705| 104.0| 306.0|
| B|201701| 201.0| null|
| B|201702| null| 201.0|
| B|201703| 203.0| 201.0|
| B|201704| 204.0| 404.0|
| B|201705| 205.0| 407.0|
| B|201706| 206.0| 612.0|
| B|201707| 207.0| 615.0|
+------+------+-------+----------+
Is sqlContext the best way to approach this or would it be better / more elegant to achieve the same result via pyspark.sql.window?
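For reference, a rough DataFrame-API equivalent of the same rolling sum via pyspark.sql.window (my sketch; it assumes the NaN-to-null conversion above has already been applied to spark_df):
from pyspark.sql.window import Window
from pyspark.sql.functions import col, sum as sum_, coalesce, lit

w = Window.partitionBy("Client").orderBy("Month").rowsBetween(-3, -1)

# coalesce inside the sum treats null Revenue as 0, mirroring the ifnull in the SQL;
# the first row of each client still gets null because its frame is empty.
df_out = spark_df.withColumn("Total_Sum3",
                             sum_(coalesce(col("Revenue"), lit(0))).over(w))
df_out.orderBy("Client", "Month").show()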
I have one dataframe with two columns:
+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|   1| 2.1|
|   5|52.1|
|   2|62.9|
|  77|33.3|
+----+----+
I would like to create a new dataframe which takes only the rows where
"value of col1" > "value of col2"
Just as a note, col1 has long type and col2 has double type.
The result should look like this:
+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|  77|33.3|
+----+----+
I think the best way would be to simply use "filter".
df_filtered=df.filter(df.col1>df.col2)
df_filtered.show()
+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|  77|33.3|
+----+----+
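The same filter can also be written with column expressions or a SQL expression string; these are equivalent variants, not a different technique:
from pyspark.sql.functions import col

df_filtered = df.filter(col("col1") > col("col2"))
# or, using a SQL expression string:
df_filtered = df.filter("col1 > col2")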
Another possible way is to use the where function of the DataFrame.
For example this:
output = df.where("col1 > col2")
will give you the expected result:
+----+----+
|col1|col2|
+----+----+
| 22|12.2|
| 77|33.3|
+----+----+
The best way to keep rows based on a condition is to use filter, as mentioned by others.
To answer the question as stated in the title, one option to remove rows based on a condition is to use left_anti join in Pyspark.
For example, to delete all rows with col1 > col2, use:
rows_to_delete = df.filter(df.col1>df.col2)
df_with_rows_deleted = df.join(rows_to_delete, on=[key_column], how='left_anti')
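As a concrete sketch of that idea (my example; I assume here that col1 can serve as the join key, which only works if it uniquely identifies rows):
from pyspark.sql.functions import col

rows_to_delete = df.filter(col("col1") > col("col2"))
df_with_rows_deleted = df.join(rows_to_delete, on=["col1"], how="left_anti")
df_with_rows_deleted.show()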
You can use sqlContext to simplify the challenge.
First register the DataFrame as a temp table, for example:
df.createOrReplaceTempView("tbl1")
then run the SQL:
sqlContext.sql("select * from tbl1 where col1 > col2")