show first occurrence(s) of a column - apache-spark

I want to use pyspark to create a new dataframe, based on the input, that shows only the first occurrence of each different value in the "value" column. Would row_number() work, or a window? I'm not sure of the best way to approach this, or whether Spark SQL would be best. Basically the second table below is what I want the output to be: just the first occurrence of each value from the input. I'm only interested in the first occurrence of the "value" column; if a value is repeated, only show the first one seen.
+-----+----+-----+
|VALUE| DAY|Color|
+-----+----+-----+
|   20| MON| BLUE|
|   20|TUES| BLUE|
|   30| WED| BLUE|
+-----+----+-----+

+-----+----+-----+
|VALUE| DAY|Color|
+-----+----+-----+
|   20| MON| BLUE|
|   30| WED| BLUE|
+-----+----+-----+

Here's how I'd do this without using a window. It will likely perform better on large data sets, as it can use more of the cluster to do the work. In your case you'd use 'VALUE' in place of Department and 'DAY' in place of Salary.
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James", "Sales", 3000), ("Michael", "Sales", 4600),
        ("Robert", "Sales", 4100), ("Maria", "Finance", 3000),
        ("Raman", "Finance", 3000), ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900), ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000)]
df = spark.createDataFrame(data, ["Name", "Department", "Salary"])

# Pack each record into a struct; Salary comes first so the array sort below
# effectively sorts on Salary.
unGroupedDf = df.select(
    df["Department"],
    f.struct(
        df["Salary"].alias("Salary"),
        df["Department"].alias("Dept"),
        df["Name"].alias("Name")
    ).alias("record"))

(unGroupedDf
    .groupBy("Department")                              # group
    .agg(f.collect_list("record").alias("record"))      # gather all the records in a group
    .select(
        f.reverse(                                      # make the sort descending
            f.array_sort(f.col("record"))               # sort the array ascending
        )[0].alias("record"))                           # grab the "max" element of the array
    .select("record.Dept", "record.Salary", "record.Name")  # use the struct fields as columns
    .show())
+---------+------+-------+
| Dept|Salary| Name|
+---------+------+-------+
| Sales| 4600|Michael|
| Finance| 3900| Jen|
|Marketing| 3000| Jeff|
+---------+------+-------+

It appears to me you want to drop duplicated rows by VALUE. If so, use dropDuplicates:
df.dropDuplicates(['VALUE']).show()
+-----+---+-----+
|VALUE|DAY|Color|
+-----+---+-----+
| 20|MON| BLUE|
| 30|WED| BLUE|
+-----+---+-----+

Here's how to do it with a window. This example uses salary; in your case I think you'd use 'DAY' for orderBy and 'VALUE' for partitionBy.
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James", "Sales", 3000), ("Michael", "Sales", 4600),
        ("Robert", "Sales", 4100), ("Maria", "Finance", 3000),
        ("Raman", "Finance", 3000), ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900), ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000)]
df = spark.createDataFrame(data, ["Name", "Department", "Salary"])
df.show()

# number rows within each department by salary and keep the first one
w2 = Window.partitionBy("Department").orderBy(col("Salary"))
df.withColumn("row", row_number().over(w2)) \
  .filter(col("row") == 1).drop("row") \
  .show()
+-----+----------+------+
| Name|Department|Salary|
+-----+----------+------+
|James|     Sales|  3000|
|Maria|   Finance|  3000|
|Kumar| Marketing|  2000|
+-----+----------+------+
Yes, you'd need to develop a way of ordering days, but I think you can see that it's possible and that you picked the correct tool. I always like to warn people: this uses a window, and a window pulls all the data for a partition onto one executor to complete the work. That is not particularly efficient. On small datasets this is likely fine; on larger data sets it may take far too long to complete.
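To make that concrete for the question's data, here is a minimal sketch (an assumption on my part: df is the asker's VALUE/DAY/Color dataframe and DAY holds weekday abbreviations like MON/TUES, so I map them to a sortable index with a when chain; adjust the mapping to whatever ordering your data really needs):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# hypothetical ordering of the DAY column; adjust to your real data
day_index = (F.when(F.col("DAY") == "MON", 1)
              .when(F.col("DAY") == "TUES", 2)
              .when(F.col("DAY") == "WED", 3)
              .when(F.col("DAY") == "THURS", 4)
              .when(F.col("DAY") == "FRI", 5)
              .when(F.col("DAY") == "SAT", 6)
              .otherwise(7))   # SUN or anything unexpected sorts last

w = Window.partitionBy("VALUE").orderBy(day_index)
first_seen = (df.withColumn("row", F.row_number().over(w))
                .filter(F.col("row") == 1)
                .drop("row"))
first_seen.show()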

Related

Pyspark: grouping contiguous rows by boolean column

I have a Spark dataframe in Python, and its rows are in a specific order in which they can be sectioned into the right groups according to a column "start_of_section", which has values 1 or 0. For each collection of rows that needs to be grouped together, every column other than "value" and "start_of_section" is equal. I want to collapse each such collection into one row that has the same values for every other column, plus a column "list_values" holding an array of all the values in those rows.
So some rows might look like:
Row(category=fruit, object=apple, value=60, start_of_section=1)
Row(category=fruit, object=apple, value=160, start_of_section=0)
Row(category=fruit, object=apple, value=30, start_of_section=0)
and in the new dataframe this would be
Row(category=fruit, object=apple, list_values=[60, 160, 30])
(Edit: note that the column "start_of_section" should not have been included in the final dataframe.)
The issue I've had in trying to research the answer is that I've only found ways of grouping by column value without regard for ordering, so this would wrongly produce two rows: one grouping all rows with "start_of_section"=1 and one grouping all rows with "start_of_section"=0.
What code can achieve this?
Assuming your order column is order_col
df.show()
+--------+------+---------+----------------+-----+
|category|object|order_col|start_of_section|value|
+--------+------+---------+----------------+-----+
| fruit| apple| 1| 1| 60|
| fruit| apple| 2| 0| 160|
| fruit| apple| 3| 0| 30|
| fruit| apple| 4| 1| 50|
+--------+------+---------+----------------+-----+
You need to generate an id that groups the lines of the same section together, then group by this id and the dimensions you want. Here is how to do it:
from pyspark.sql import functions as F, Window as W

df.withColumn(
    "id",
    F.sum("start_of_section").over(
        W.partitionBy("category", "object").orderBy("order_col")
    ),
).groupBy("category", "object", "id").agg(
    F.collect_list("value").alias("values")
).drop("id").show()
+--------+------+-------------+
|category|object| values|
+--------+------+-------------+
| fruit| apple|[60, 160, 30]|
| fruit| apple| [50]|
+--------+------+-------------+
EDIT: if you do not have any order_col, it is an impossible task. Think of the lines of a dataframe as marbles in a bag: they do not have any order. You can order them as you pull them out of the bag according to some criterion, but otherwise you cannot assume any order. show is just you pulling 10 marbles (lines) out of the bag; the order may be the same each time you do it, but it can suddenly change, and you have no control over it.
Well, now I got it. You can group by a column that is the running sum of start_of_section.
To make sure of the result, you should include the ordering column.
from pyspark.sql import Row, Window
from pyspark.sql.functions import *

data = [Row(category='fruit', object='apple', value=60, start_of_section=1),
        Row(category='fruit', object='apple', value=160, start_of_section=0),
        Row(category='fruit', object='apple', value=30, start_of_section=0),
        Row(category='fruit', object='apple', value=50, start_of_section=1),
        Row(category='fruit', object='apple', value=30, start_of_section=0),
        Row(category='fruit', object='apple', value=60, start_of_section=1),
        Row(category='fruit', object='apple', value=110, start_of_section=0)]
df = spark.createDataFrame(data)

w = Window.partitionBy('category', 'object').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('group', sum('start_of_section').over(w)) \
  .groupBy('category', 'object', 'group').agg(collect_list('value').alias('list_value')) \
  .drop('group').show()
+--------+------+-------------+
|category|object| list_value|
+--------+------+-------------+
| fruit| apple|[60, 160, 30]|
| fruit| apple| [50, 30]|
| fruit| apple| [60, 110]|
+--------+------+-------------+
FAILS: monotonically_increasing_id fails when you have many partitions.
df.repartition(7) \
  .withColumn('id', monotonically_increasing_id()) \
  .withColumn('group', sum('start_of_section').over(w)) \
  .groupBy('category', 'object', 'group').agg(collect_list('value').alias('list_value')) \
  .drop('group').show()
+--------+------+--------------------+
|category|object| list_value|
+--------+------+--------------------+
| fruit| apple| [60]|
| fruit| apple|[60, 160, 30, 30,...|
| fruit| apple| [50]|
+--------+------+--------------------+
This is totally not wanted.
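If there really is no order column in the source, one workaround (a sketch, assuming the order in which the data was read is the order you care about, and applied before any repartitioning) is to attach a positional index with the RDD's zipWithIndex:
from pyspark.sql import functions as F, Window as W

# attach a position-based order column before any shuffle happens
indexed = (df.rdd
             .zipWithIndex()
             .map(lambda pair: pair[0] + (pair[1],))   # Row + (index,) -> plain tuple
             .toDF(df.columns + ['order_col']))

w = W.partitionBy('category', 'object').orderBy('order_col')
(indexed
    .withColumn('group', F.sum('start_of_section').over(w))
    .groupBy('category', 'object', 'group')
    .agg(F.collect_list('value').alias('list_value'))
    .drop('group')
    .show())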

List is unordered even after using windowing with collect_set

So I'm trying to collect_set a group of dates from a dataframe. The problem I'm facing is that the dates do not come out in the order the dataframe has them.
Example dataframe (this is from a much larger dataset; basically it tracks the beginning date of a week for every single day in a year):
+--------+-----------+----------+
|year_num|week_beg_dt| cal_dt|
+--------+-----------+----------+
| 2013| 2012-12-31|2012-12-31|
| 2013| 2012-12-31|2013-01-03|
| 2013| 2013-01-07|2013-01-07|
| 2013| 2013-01-07|2013-01-12|
| 2013| 2013-01-14|2013-01-14|
| 2013| 2013-01-14|2013-01-15|
| 2014| 2014-01-01|2014-01-01|
| 2014| 2014-01-01|2014-01-05|
| 2014| 2014-01-07|2014-01-07|
| 2014| 2014-01-07|2014-01-12|
| 2014| 2014-01-15|2014-01-15|
| 2014| 2014-01-15|2014-01-16|
+--------+-----------+----------+
What I'm trying to get to is this:
+--------+------------------------------------+
|year_num|dates                               |
+--------+------------------------------------+
| 2013   |[2012-12-31, 2013-01-07, 2013-01-14]|
| 2014   |[2014-01-01, 2014-01-07, 2014-01-15]|
+--------+------------------------------------+
I have tried windowing to do it, since collect_set together with groupBy results in an unordered set:
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('year_num').orderBy('week_beg_dt')
business_days_ = df2.withColumn('dates', F.collect_set('week_beg_dt').over(w)) \
    .groupBy('year_num') \
    .agg(F.max('dates').alias('dates')) \
    .collect()
But I still end up with unordered sets. Any suggestions what I'm doing wrong and how to fix it?
For Spark 2.4+, use the array_sort built-in function on collect_set to get an ordered list.
Example:
from pyspark.sql.functions import col, collect_set, array_sort
df1.show()
#+--------+-----------+----------+
#|year_num|week_beg_dt| cal_dt|
#+--------+-----------+----------+
#| 2013| 2012-12-31|2012-12-31|
#| 2013| 2012-12-31|2012-12-31|
#| 2013| 2013-01-07|2013-01-03|
#+--------+-----------+----------+
#without array_sort
df1.groupBy("year_num").agg(collect_set(col("week_beg_dt"))).show(10,False)
#+--------+------------------------+
#|year_num|collect_set(week_beg_dt)|
#+--------+------------------------+
#|2013 |[2013-01-07, 2012-12-31]|
#+--------+------------------------+
#using array_sort
df1.groupBy("year_num").agg(array_sort(collect_set(col("week_beg_dt")))).show(10,False)
#+--------+------------------------------------+
#|year_num|array_sort(collect_set(week_beg_dt))|
#+--------+------------------------------------+
#|2013 |[2012-12-31, 2013-01-07] |
#+--------+------------------------------------+
For earlier versions of Spark:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
#udf to sort the collected array
sort_arr_udf = udf(lambda x: sorted(x), ArrayType(StringType()))
df1.groupBy("year_num").agg(sort_arr_udf(collect_set(col("week_beg_dt")))).show(10,False)
#+--------+----------------------------------------+
#|year_num|<lambda>(collect_set(week_beg_dt, 0, 0))|
#+--------+----------------------------------------+
#|2013 |[2012-12-31, 2013-01-07] |
#+--------+----------------------------------------+
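As a possible alternative for older versions (my assumption being that sort_array, available since Spark 1.5, is acceptable here), the udf may not be needed at all:
from pyspark.sql.functions import col, collect_set, sort_array

df1.groupBy("year_num") \
   .agg(sort_array(collect_set(col("week_beg_dt"))).alias("dates")) \
   .show(10, False)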

MySQL sum over a window that contains a null value returns null

I am trying to get the sum of Revenue over the last 3 Month rows (excluding the current row) for each Client. Minimal example with current attempt in Databricks:
import numpy as np
import pandas as pd

cols = ['Client', 'Month', 'Revenue']
df_pd = pd.DataFrame([['A', 201701, 100],
                      ['A', 201702, 101],
                      ['A', 201703, 102],
                      ['A', 201704, 103],
                      ['A', 201705, 104],
                      ['B', 201701, 201],
                      ['B', 201702, np.nan],
                      ['B', 201703, 203],
                      ['B', 201704, 204],
                      ['B', 201705, 205],
                      ['B', 201706, 206],
                      ['B', 201707, 207]])
df_pd.columns = cols
spark_df = spark.createDataFrame(df_pd)
spark_df.createOrReplaceTempView('df_sql')
df_out = sqlContext.sql("""
    select *, (sum(ifnull(Revenue,0)) over (partition by Client
                                            order by Client, Month
                                            rows between 3 preceding and 1 preceding)) as Total_Sum3
    from df_sql
    """)
df_out.show()
+------+------+-------+----------+
|Client| Month|Revenue|Total_Sum3|
+------+------+-------+----------+
| A|201701| 100.0| null|
| A|201702| 101.0| 100.0|
| A|201703| 102.0| 201.0|
| A|201704| 103.0| 303.0|
| A|201705| 104.0| 306.0|
| B|201701| 201.0| null|
| B|201702| NaN| 201.0|
| B|201703| 203.0| NaN|
| B|201704| 204.0| NaN|
| B|201705| 205.0| NaN|
| B|201706| 206.0| 612.0|
| B|201707| 207.0| 615.0|
+------+------+-------+----------+
As you can see, if a null value exists anywhere in the 3-month window, a null is returned. I would like to treat nulls as 0, hence the ifnull attempt, but that does not seem to work. I have also tried a case statement to change NULL to 0, with no luck.
Just coalesce outside the sum:
df_out = sqlContext.sql("""
    select *, coalesce(sum(Revenue) over (partition by Client
                                          order by Client, Month
                                          rows between 3 preceding and 1 preceding), 0) as Total_Sum3
    from df_sql
    """)
It is Apache Spark, my bad! (I am working in Databricks and thought it was MySQL under the hood.) Is it too late to change the title?
@Barmar, you are right that IFNULL() doesn't treat NaN as null. I managed to figure out the fix thanks to @user6910411 from here: SO link. I had to change the numpy NaNs to Spark nulls. The correct code, picking up after the sample df_pd is created:
spark_df = spark.createDataFrame(df_pd)

from pyspark.sql.functions import isnan, col, when
# this converts all NaNs in numeric columns to null:
spark_df = spark_df.select([
    when(~isnan(c), col(c)).alias(c) if t in ("double", "float") else c
    for c, t in spark_df.dtypes])
spark_df.createOrReplaceTempView('df_sql')

df_out = sqlContext.sql("""
    select *, (sum(ifnull(Revenue,0)) over (partition by Client
                                            order by Client, Month
                                            rows between 3 preceding and 1 preceding)) as Total_Sum3
    from df_sql order by Client, Month
    """)
df_out.show()
which then gives the desired:
+------+------+-------+----------+
|Client| Month|Revenue|Total_Sum3|
+------+------+-------+----------+
| A|201701| 100.0| null|
| A|201702| 101.0| 100.0|
| A|201703| 102.0| 201.0|
| A|201704| 103.0| 303.0|
| A|201705| 104.0| 306.0|
| B|201701| 201.0| null|
| B|201702| null| 201.0|
| B|201703| 203.0| 201.0|
| B|201704| 204.0| 404.0|
| B|201705| 205.0| 407.0|
| B|201706| 206.0| 612.0|
| B|201707| 207.0| 615.0|
+------+------+-------+----------+
Is sqlContext the best way to approach this or would it be better / more elegant to achieve the same result via pyspark.sql.window?
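For what it's worth, here is a sketch of the same rolling sum via pyspark.sql.window instead of SQL (assuming spark_df from above, with the NaNs already converted to nulls):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 3 preceding rows up to, but excluding, the current row, per client
w = Window.partitionBy('Client').orderBy('Month').rowsBetween(-3, -1)

df_out = spark_df.withColumn(
    'Total_Sum3',
    F.sum(F.coalesce(F.col('Revenue'), F.lit(0))).over(w))
df_out.orderBy('Client', 'Month').show()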

Remove rows from dataframe based on condition in pyspark

I have one dataframe with two columns:
+--------+-----+
| col1| col2|
+--------+-----+
|22 | 12.2|
|1 | 2.1|
|5 | 52.1|
|2 | 62.9|
|77 | 33.3|
I would like to create a new dataframe which will take only rows where
"value of col1" > "value of col2"
Just as a note, col1 has long type and col2 has double type.
the result should be like this:
+--------+----+
| col1|col2|
+--------+----+
|22 |12.2|
|77 |33.3|
I think the best way would be to simply use filter:
df_filtered = df.filter(df.col1 > df.col2)
df_filtered.show()
+--------+----+
| col1|col2|
+--------+----+
|22 |12.2|
|77 |33.3|
Another possible way could be using the where function of a DataFrame.
For example this (Scala syntax; the equivalent where call exists in PySpark too):
val output = df.where("col1>col2")
will give you the expected result:
+----+----+
|col1|col2|
+----+----+
| 22|12.2|
| 77|33.3|
+----+----+
The best way to keep rows based on a condition is to use filter, as mentioned by others.
To answer the question as stated in the title, one option to remove rows based on a condition is to use left_anti join in Pyspark.
For example to delete all rows with col1>col2 use:
rows_to_delete = df.filter(df.col1>df.col2)
df_with_rows_deleted = df.join(rows_to_delete, on=[key_column], how='left_anti')
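For instance, with the asker's two-column dataframe (assuming, purely for illustration, that col1 can serve as the join key):
rows_to_delete = df.filter(df.col1 > df.col2)
df_with_rows_deleted = df.join(rows_to_delete, on=['col1'], how='left_anti')
df_with_rows_deleted.show()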
You can use Spark SQL to simplify the challenge.
First register the dataframe as a temp table, for example:
df.createOrReplaceTempView("tbl1")
then run SQL like:
sqlContext.sql("select * from tbl1 where col1 > col2")

How to put avg in descending order?

I want to use Spark to get the average Zhvi number per state from a csv file, rounded to an integer and sorted in descending order.
However, when I try sort(desc("Zhvi")) at the end of my code, it always gives me an error.
from pyspark.sql.functions import col, desc
stateByZhvi = home.select('State','Zhvi').groupBy((col("State"))).avg("Zhvi").show()
and part of my result:
+-----+------------------+
|State| avg(Zhvi)|
+-----+------------------+
| AZ|246687.01298701297|
| SC|143188.94736842104|
| LA|159991.74311926606|
| MN|236449.40239043825|
| NJ| 367156.5637065637|
| DC| 586109.5238095238|
| OR| 306646.3768115942|
| VA| 282764.4986449864|
Can anyone help with this?
// input dataframe
+-----+------------------+
|State| avg|
+-----+------------------+
| AZ|246687.01298701297|
| SC|143188.94736842104|
| LA|159991.74311926606|
+-----+------------------+
df.orderBy(desc("avg")).show()
//
+-----+------------------+
|State| avg|
+-----+------------------+
| AZ|246687.01298701297|
| LA|159991.74311926606|
| SC|143188.94736842104|
+-----+------------------+
There might be another issue: it seems you are using sort(desc("Zhvi")), but the column name changes after the avg function to avg(Zhvi).
Thanks
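A small sketch of the fix this points to (an assumption on my part: alias the aggregate, then sort on the alias; round is included since the question asks for an integer):
from pyspark.sql.functions import avg, col, desc, round as spark_round

stateByZhvi = (home.groupBy("State")
                   .agg(spark_round(avg(col("Zhvi"))).alias("avg_Zhvi"))
                   .orderBy(desc("avg_Zhvi")))
stateByZhvi.show()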
What about using SQL:
home.createOrReplaceTempView("home")
spark.sql("select State, round(avg(Zhvi)) as avg_Zhvi from home group by State order by 2 desc").show()
I worked on the same problem you had; here is my solution. Use agg, avg, alias, and orderBy (with the ascending parameter set to False):
from pyspark.sql.functions import *
stateByZhvi = home.groupBy(col("State")) \
    .agg(avg(col("Zhvi")).alias("avg_Zhvi")) \
    .orderBy("avg_Zhvi", ascending=False) \
    .select('State', 'avg_Zhvi')
stateByZhvi.show()