How to put avg in descending order? - apache-spark

I want to sort in descending order, and round to an integer, the average Zhvi value from a CSV file using Spark.
However, when I append sort(desc("Zhvi")) at the end of my code, it always gives me an error.
from pyspark.sql.functions import col, desc
stateByZhvi = home.select('State','Zhvi').groupBy((col("State"))).avg("Zhvi").show()
and part of my result:
+-----+------------------+
|State| avg(Zhvi)|
+-----+------------------+
| AZ|246687.01298701297|
| SC|143188.94736842104|
| LA|159991.74311926606|
| MN|236449.40239043825|
| NJ| 367156.5637065637|
| DC| 586109.5238095238|
| OR| 306646.3768115942|
| VA| 282764.4986449864|
Can anyone help with this?

// input dataframe
+-----+------------------+
|State| avg|
+-----+------------------+
| AZ|246687.01298701297|
| SC|143188.94736842104|
| LA|159991.74311926606|
+-----+------------------+
df.orderBy(desc("avg")).show()
//
+-----+------------------+
|State| avg|
+-----+------------------+
| AZ|246687.01298701297|
| LA|159991.74311926606|
| SC|143188.94736842104|
+-----+------------------+
There might be another issue: it seems you are using sort(desc("Zhvi")), but the column name changes after the avg aggregation, as shown in your output header |State| avg(Zhvi)|.
Thanks
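For reference, a minimal sketch (assuming the dataframe is named home, as in the question) that sorts on the generated column name:
from pyspark.sql.functions import desc

home.groupBy("State").avg("Zhvi").orderBy(desc("avg(Zhvi)")).show()
Aliasing the aggregate first, as the answers below do, avoids having to reference the generated name avg(Zhvi).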

What about using SQL:
home.createOrReplaceTempView("home")
spark.sql("select State, round(avg(Zhvi)) as avg_Zhvi from home group by State order by 2 desc").show()

I worked on the same problem you had; here is my solution. Use agg, avg, alias, and orderBy (with the ascending parameter set to False):
from pyspark.sql.functions import avg, col
stateByZhvi = home.groupBy(col("State")).agg(avg(col("Zhvi")).alias("avg_Zhvi")) \
    .orderBy("avg_Zhvi", ascending=False).select('State', 'avg_Zhvi')
stateByZhvi.show()

Related

Check if timestamp is inside range

I'm trying to obtain the following:
+---------+---------+
|work_time|day_shift|
+---------+---------+
| 00:45:40|       No|
| 10:05:47|      Yes|
| 15:25:28|      Yes|
| 19:38:52|       No|
+---------+---------+
where I classify the "work_time" into "day_shift".
"Yes" - if the time falls between 09:00:00 and 18:00:00
"No" - otherwise
My "work_time" is in datetime format showing only the time. I tried the following, but I'm just getting "No" for everything.
df = df.withColumn('day_shift', when((df.work_time >= to_timestamp(lit('09:00:00'), 'HH:mm:ss')) & (df.work_time <= to_timestamp(lit('18:00:00'), 'HH:mm:ss')), 'Yes').otherwise('No'))
You can use the Column class method between. It works for both timestamps and strings in the "HH:mm:ss" format. Use this:
F.col("work_time").between("09:00:00", "18:00:00")
Full test:
from pyspark.sql import functions as F
df = spark.createDataFrame([('00:45:40',), ('10:05:47',), ('15:25:28',), ('19:38:52',)], ['work_time'])
day_shift = F.col("work_time").between("09:00:00", "18:00:00")
df = df.withColumn("day_shift", F.when(day_shift, "Yes").otherwise("No"))
df.show()
# +---------+---------+
# |work_time|day_shift|
# +---------+---------+
# | 00:45:40| No|
# | 10:05:47| Yes|
# | 15:25:28| Yes|
# | 19:38:52| No|
# +---------+---------+
First of all, Spark doesn't have a so-called "Time" data type; it only supports TimestampType or DateType. Therefore, I believe the work_time in your dataframe is a string.
Secondly, when you check func.to_timestamp(func.lit('09:00:00'), 'HH:mm:ss') in a select statement, it will show:
+--------------------------------+
|to_timestamp(09:00:00, HH:mm:ss)|
+--------------------------------+
|1970-01-01 09:00:00 |
+--------------------------------+
only showing top 1 row
The best way to achieve this is either to split your work_time column into hour, minute and second columns and filter on those, or to prepend a date value to your work_time column before any timestamp filtering.
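A minimal sketch of the second option (an assumption, not the poster's code: work_time is taken to be a string in "HH:mm:ss" format, and the 1970-01-01 prefix is arbitrary, only there to make the values parseable as timestamps):
from pyspark.sql import functions as F

# prepend an arbitrary date so the time strings can be parsed as timestamps
df = df.withColumn("work_ts", F.to_timestamp(F.concat(F.lit("1970-01-01 "), F.col("work_time"))))
df = df.withColumn(
    "day_shift",
    F.when((F.col("work_ts") >= F.to_timestamp(F.lit("1970-01-01 09:00:00"))) &
           (F.col("work_ts") <= F.to_timestamp(F.lit("1970-01-01 18:00:00"))), "Yes")
     .otherwise("No"))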

show first occurence(s) of a column

I want to use pyspark to create a new dataframe from the input that keeps only the first occurrence of each distinct value in the value column. Would row_number() or a window work, or would Spark SQL be the better approach? Basically, the second table below is what I want the output to be: it shows just the first occurrence of each value from the input. I am only interested in the first occurrence of the "value" column; if a value is repeated, only show the first one seen.
+--------+--------+--------+
|   VALUE|     DAY|   Color|
+--------+--------+--------+
|      20|     MON|    BLUE|
|      20|    TUES|    BLUE|
|      30|     WED|    BLUE|
+--------+--------+--------+

+--------+--------+--------+
|   VALUE|     DAY|   Color|
+--------+--------+--------+
|      20|     MON|    BLUE|
|      30|     WED|    BLUE|
+--------+--------+--------+
Here's how I'd do this without using a window. It will likely perform better on large data sets, as it can use more of the cluster to do the work. In your case you would use 'VALUE' in place of Department and 'DAY' in place of Salary.
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James","Sales",3000),("Michael","Sales",4600),
        ("Robert","Sales",4100),("Maria","Finance",3000),
        ("Raman","Finance",3000),("Scott","Finance",3300),
        ("Jen","Finance",3900),("Jeff","Marketing",3000),
        ("Kumar","Marketing",2000)]
df = spark.createDataFrame(data, ["Name","Department","Salary"])

# Pack each row into a struct; array_sort orders structs field by field,
# so Salary goes first in the struct to drive the sort.
unGroupedDf = df.select(
    df["Department"],
    f.struct(
        df["Salary"].alias("Salary"),       # sorted on Salary first
        df["Department"].alias("Dept"),
        df["Name"].alias("Name")
    ).alias("record"))

(unGroupedDf.groupBy("Department")                  # group
    .agg(f.collect_list("record").alias("record"))  # gather all the elements in a group
    .select(
        f.reverse(                                  # make the sort descending
            f.array_sort(f.col("record"))           # sort the array ascending
        )[0].alias("record"))                       # grab the max element of the array
    .select(f.col("record.*"))                      # use the struct fields as columns
    .show())
+---------+------+-------+
| Dept|Salary| Name|
+---------+------+-------+
| Sales| 4600|Michael|
| Finance| 3900| Jen|
|Marketing| 3000| Jeff|
+---------+------+-------+
It appears to me you want to drop duplicated items by VALUE. If so, use dropDuplicates:
df.dropDuplicates(['VALUE']).show()
+-----+---+-----+
|VALUE|DAY|Color|
+-----+---+-----+
| 20|MON| BLUE|
| 30|WED| BLUE|
+-----+---+-----+
Here's how to do it with a window. This example uses salary; in your case I think you'd use 'DAY' for orderBy and 'VALUE' for partitionBy.
from pyspark.sql import SparkSession,Row
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James","Sales",3000),("Michael","Sales",4600),
("Robert","Sales",4100),("Maria","Finance",3000),
("Raman","Finance",3000),("Scott","Finance",3300),
("Jen","Finance",3900),("Jeff","Marketing",3000),
("Kumar","Marketing",2000)]
df = spark.createDataFrame(data,["Name","Department","Salary"])
df.show()
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
w2 = Window.partitionBy("department").orderBy(col("salary"))
df.withColumn("row",row_number().over(w2)) \
.filter(col("row") == 1).drop("row") \
.show()
+-----+----------+------+
| Name|Department|Salary|
+-----+----------+------+
|James|     Sales|  3000|
|Maria|   Finance|  3000|
|Kumar| Marketing|  2000|
+-----+----------+------+
Yes, you'd need to develop a way of ordering days (one option is sketched below), but I think you can see that it's possible and that you picked the right tool. I always like to warn people: this uses a window, and windows pull all the data for a partition onto one executor to complete the work, which is not particularly efficient. On small datasets this is likely fine; on larger data sets it may take far too long to complete.
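A minimal sketch of one way to order the day names (assuming df is the input dataframe from the question; the day_index lookup is hypothetical and only illustrates turning DAY into something sortable):
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

# hypothetical lookup mapping day abbreviations to a sortable index
day_index = spark.createDataFrame(
    [("MON", 1), ("TUES", 2), ("WED", 3), ("THURS", 4), ("FRI", 5), ("SAT", 6), ("SUN", 7)],
    ["DAY", "day_idx"])

w = Window.partitionBy("VALUE").orderBy(col("day_idx"))
(df.join(day_index, "DAY")
   .withColumn("row", row_number().over(w))
   .filter(col("row") == 1)
   .drop("row", "day_idx")
   .show())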

How to identify if a particular string/pattern exists in a column using pySpark

Below is my sample dataframe for household things.
Here W represents Wooden, G represents Glass and P represents Plastic, and different items are classified into those categories.
I want to identify which category (W, G or P) each item falls into. As an initial step, I tried classifying it for Chair:
M = sqlContext.createDataFrame([('W-Chair-Shelf;G-Vase;P-Cup',''),
('W-Chair',''),
('W-Shelf;G-Cup;P-Chair',''),
('G-Cup;P-ShowerCap;W-Board','')],
['Household_chores_arrangements','Chair'])
M.createOrReplaceTempView('M')
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
| W-Chair-Shelf;G-Vase;P-Cup| |
| W-Chair| |
| W-Shelf;G-Cup;P-Chair| |
| G-Cup;P-ShowerCap;W-Board| |
+-----------------------------+-----+
I tried to do it for one condition where I can mark it as W, but I am not getting the expected results; maybe my condition is wrong.
df = sqlContext.sql("select * from M where Household_chores_arrangements like '%W%Chair%'")
display(df)
Is there a better way to do this in pySpark?
Expected output:
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
| W-Chair-Shelf;G-Vase;P-Cup| W|
| W-Chair| W|
| W-Shelf;G-Cup;P-Chair| P|
| G-Cup;P-ShowerCap;W-Board| NULL|
+-----------------------------+-----+
Thanks #mck - for the solution.
Update
In addition to that, I was trying to analyse the regexp_extract option further, so I altered the sample set:
M = sqlContext.createDataFrame([('Wooden|Chair',''),
('Wooden|Cup;Glass|Chair',''),
('Wooden|Cup;Glass|Showercap;Plastic|Chair','') ],
['Household_chores_arrangements','Chair'])
M.createOrReplaceTempView('M')
df = spark.sql("""
select
Household_chores_arrangements,
nullif(regexp_extract(Household_chores_arrangements, '(Wooden|Glass|Plastic)(|Chair)', 1), '') as Chair
from M
""")
display(df)
Result:
+----------------------------------------+------+
|Household_chores_arrangements           |Chair |
+----------------------------------------+------+
|Wooden|Chair                            |Wooden|
|Wooden|Cup;Glass|Chair                  |Wooden|
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Wooden|
+----------------------------------------+------+
I changed the delimiter to | instead of - and made changes in the query as well. I was expecting the results below, but got a wrong result:
+----------------------------------------+-------+
|Household_chores_arrangements           |Chair  |
+----------------------------------------+-------+
|Wooden|Chair                            |Wooden |
|Wooden|Cup;Glass|Chair                  |Glass  |
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Plastic|
+----------------------------------------+-------+
If only the delimiter is changed, do we need to change anything else?
Update 2
I have got the solution for the above-mentioned update. For a pipe delimiter we have to escape it with four backslashes, as sketched below.
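A sketch of what the escaped pattern might look like (an assumption based on this update, not the poster's exact code: in a non-raw Python string passed to spark.sql, the four backslashes reach the regex engine as a single backslash, so the pipe before Chair is matched literally):
df = spark.sql("""
select
    Household_chores_arrangements,
    nullif(regexp_extract(Household_chores_arrangements,
                          '(Wooden|Glass|Plastic)\\\\|Chair', 1), '') as Chair
from M
""")
df.show(truncate=False)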
You can use regexp_extract to extract the categories, and if no match is found, replace empty string with null using nullif.
df = spark.sql("""
select
Household_chores_arrangements,
nullif(regexp_extract(Household_chores_arrangements, '([A-Z])-Chair', 1), '') as Chair
from M
""")
df.show(truncate=False)
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
|W-Chair-Shelf;G-Vase;P-Cup |W |
|W-Chair |W |
|W-Shelf;G-Cup;P-Chair |P |
|G-Cup;P-ShowerCap;W-Board |null |
+-----------------------------+-----+

List is unordered even after using windowing with collect_set

So I'm trying to collect_set a group of dates from a dataframe. The problem I'm facing is that the collected dates are not in the order they appear in the dataframe.
Example dataframe (this is a much larger dataset; basically this dataframe tracks the
beginning date of the week for every single day in a year):
+--------+-----------+----------+
|year_num|week_beg_dt| cal_dt|
+--------+-----------+----------+
| 2013| 2012-12-31|2012-12-31|
| 2013| 2012-12-31|2013-01-03|
| 2013| 2013-01-07|2013-01-07|
| 2013| 2013-01-07|2013-01-12|
| 2013| 2013-01-14|2013-01-14|
| 2013| 2013-01-14|2013-01-15|
| 2014| 2014-01-01|2014-01-01|
| 2014| 2014-01-01|2014-01-05|
| 2014| 2014-01-07|2014-01-07|
| 2014| 2014-01-07|2014-01-12|
| 2014| 2014-01-15|2014-01-15|
| 2014| 2014-01-15|2014-01-16|
What I'm trying to get to is this:
+--------+------------------------------------+
|year_num|dates                               |
+--------+------------------------------------+
|    2013|[2012-12-31, 2013-01-07, 2013-01-14]|
|    2014|[2014-01-01, 2014-01-07, 2014-01-14]|
+--------+------------------------------------+
I have tried windowing to do it, since collect_set together with groupBy results in an unordered set:
from pyspark.sql import functions as F
from pyspark.sql import Window
w = Window.partitionBy('year_num').orderBy('week_beg_dt')
business_days_ = df2.withColumn('dates', F.collect_set('week_beg_dt').over(w)) \
.groupBy('year_num') \
.agg(F.max('dates').alias('dates')) \
.collect()
But I still end up with unordered sets. Any suggestions on what I'm doing wrong and how to fix it?
For Spark 2.4+, use the built-in array_sort function on collect_set to get an ordered list.
Example:
from pyspark.sql.functions import array_sort, collect_set, col
df1.show()
#+--------+-----------+----------+
#|year_num|week_beg_dt| cal_dt|
#+--------+-----------+----------+
#| 2013| 2012-12-31|2012-12-31|
#| 2013| 2012-12-31|2012-12-31|
#| 2013| 2013-01-07|2013-01-03|
#+--------+-----------+----------+
#without array_sort
df1.groupBy("year_num").agg(collect_set(col("week_beg_dt"))).show(10,False)
#+--------+------------------------+
#|year_num|collect_set(week_beg_dt)|
#+--------+------------------------+
#|2013 |[2013-01-07, 2012-12-31]|
#+--------+------------------------+
#using array_sort
df1.groupBy("year_num").agg(array_sort(collect_set(col("week_beg_dt")))).show(10,False)
#+--------+------------------------------------+
#|year_num|array_sort(collect_set(week_beg_dt))|
#+--------+------------------------------------+
#|2013 |[2012-12-31, 2013-01-07] |
#+--------+------------------------------------+
For earlier versions of Spark:
from pyspark.sql.functions import udf, collect_set, col
from pyspark.sql.types import *
#udf to sort the collected array
sort_arr_udf = udf(lambda x: sorted(x), ArrayType(StringType()))
df1.groupBy("year_num").agg(sort_arr_udf(collect_set(col("week_beg_dt")))).show(10,False)
#+--------+----------------------------------------+
#|year_num|<lambda>(collect_set(week_beg_dt, 0, 0))|
#+--------+----------------------------------------+
#|2013 |[2012-12-31, 2013-01-07] |
#+--------+----------------------------------------+
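Applied back to the question's dataframe, a minimal sketch (assuming it is named df2 as in the question and Spark 2.4+ is available):
from pyspark.sql import functions as F

business_days_ = df2.groupBy("year_num") \
    .agg(F.array_sort(F.collect_set("week_beg_dt")).alias("dates"))
business_days_.show(truncate=False)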

MySQL sum over a window that contains a null value returns null

I am trying to get the sum of Revenue over the last 3 Month rows (excluding the current row) for each Client. Minimal example with current attempt in Databricks:
import pandas as pd
import numpy as np

cols = ['Client','Month','Revenue']
df_pd = pd.DataFrame([['A',201701,100],
['A',201702,101],
['A',201703,102],
['A',201704,103],
['A',201705,104],
['B',201701,201],
['B',201702,np.nan],
['B',201703,203],
['B',201704,204],
['B',201705,205],
['B',201706,206],
['B',201707,207]
])
df_pd.columns = cols
spark_df = spark.createDataFrame(df_pd)
spark_df.createOrReplaceTempView('df_sql')
df_out = sqlContext.sql("""
select *, (sum(ifnull(Revenue,0)) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding)) as Total_Sum3
from df_sql
""")
df_out.show()
+------+------+-------+----------+
|Client| Month|Revenue|Total_Sum3|
+------+------+-------+----------+
| A|201701| 100.0| null|
| A|201702| 101.0| 100.0|
| A|201703| 102.0| 201.0|
| A|201704| 103.0| 303.0|
| A|201705| 104.0| 306.0|
| B|201701| 201.0| null|
| B|201702| NaN| 201.0|
| B|201703| 203.0| NaN|
| B|201704| 204.0| NaN|
| B|201705| 205.0| NaN|
| B|201706| 206.0| 612.0|
| B|201707| 207.0| 615.0|
+------+------+-------+----------+
As you can see, if a null value exists anywhere in the 3 month window, a null value is returned. I would like to treat nulls as 0, hence the ifnull attempt, but this does not seem to work. I have also tried a case statement to change NULL to 0, with no luck.
Just coalesce outside sum:
df_out = sqlContext.sql("""
select *, coalesce(sum(Revenue) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding), 0) as Total_Sum3
from df_sql
""")
It is Apache Spark, my bad! (I am working in Databricks and I thought it was MySQL under the hood.) Is it too late to change the title?
#Barmar, you are right in that IFNULL() doesn't treat NaN as null. I managed to figure out the fix thanks to #user6910411 from here: SO link. I had to change the numpy NaNs to Spark nulls. The correct code, from after the sample df_pd is created:
spark_df = spark.createDataFrame(df_pd)
from pyspark.sql.functions import isnan, col, when
#this converts all NaNs in numeric columns to null:
spark_df = spark_df.select([
when(~isnan(c), col(c)).alias(c) if t in ("double", "float") else c
for c, t in spark_df.dtypes])
spark_df.createOrReplaceTempView('df_sql')
df_out = sqlContext.sql("""
select *, (sum(ifnull(Revenue,0)) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding)) as Total_Sum3
from df_sql order by Client,Month
""")
df_out.show()
which then gives the desired:
+------+------+-------+----------+
|Client| Month|Revenue|Total_Sum3|
+------+------+-------+----------+
| A|201701| 100.0| null|
| A|201702| 101.0| 100.0|
| A|201703| 102.0| 201.0|
| A|201704| 103.0| 303.0|
| A|201705| 104.0| 306.0|
| B|201701| 201.0| null|
| B|201702| null| 201.0|
| B|201703| 203.0| 201.0|
| B|201704| 204.0| 404.0|
| B|201705| 205.0| 407.0|
| B|201706| 206.0| 612.0|
| B|201707| 207.0| 615.0|
+------+------+-------+----------+
Is sqlContext the best way to approach this or would it be better / more elegant to achieve the same result via pyspark.sql.window?
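For comparison, a minimal sketch of the same rolling sum through pyspark.sql.window (an assumption, not tested against the exact data above; it relies on the same NaN-to-null conversion shown earlier):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("Client").orderBy("Month").rowsBetween(-3, -1)
df_out = spark_df.withColumn(
    "Total_Sum3", F.sum(F.coalesce(F.col("Revenue"), F.lit(0))).over(w))
df_out.orderBy("Client", "Month").show()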
