Select the top rows whose sum adds up to 50 in a dataframe - apache-spark

This is my dataframe
+--------------+-----------+------------------+
| _c3|sum(number)| perc|
+--------------+-----------+------------------+
| France| 5170305|1.3201573334529797|
| Germany| 9912088|2.5308982087190754|
| Vietnam| 14729566| 3.760966630301244|
|United Kingdom| 19435674| 4.962598446648971|
| Philippines| 21994132| 5.615861086093151|
| Japan| 35204549| 8.988936539189615|
| China| 39453426|10.073821498682275|
| Hong Kong| 39666589| 10.1282493704753|
| Thailand| 57202857|14.605863902228613|
| Malaysia| 72364309| 18.47710593603423|
| Indonesia| 76509597|19.535541048174547|
+--------------+-----------+------------------+
I want to select only the top countries that together account for 50 percent of passengers (country, number of passengers, percentage of passengers). How can I do it?

You can use a running total to compute the cumulative percentage and then filter on it. So, assuming your dataframe is small enough, something like this should do it:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// Running total of perc, largest share first. A window without partitionBy
// moves all rows to a single partition, which is fine for a small dataframe.
val result = df.withColumn("cumulativepercentage",
  sum("perc").over(Window.orderBy(col("perc").desc))
).where(col("cumulativepercentage").leq(50))
result.show(false)
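For reference, a rough PySpark sketch of the same running-total approach (assuming the same dataframe df and perc column; the descending order and the 50 cutoff simply mirror the Scala snippet above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Cumulative percentage, largest share first; keep rows until the total passes 50.
w = Window.orderBy(F.col('perc').desc())
result = (df
          .withColumn('cumulativepercentage', F.sum('perc').over(w))
          .where(F.col('cumulativepercentage') <= 50))
result.show(truncate=False)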

Related

Spark SQL - select record having least difference in 2 date columns

This is the logic in SQL:
coalesce(if effc_dt <= tran_dt select(max of effc_dt) , if effc_dt >= tran_dt select (min of effc_dt)).
I want similar logic in PySpark: when effc_date is earlier than tran_date, it should select the effc_date closest to tran_date; if no earlier date is present, it should check for later dates and select the effc_date closest to tran_date.
Input dataframe:
|id|tran_date |effc_date |
|--|-----------|-----------|
|12|2020-02-01 |2019-02-01 |
|12|2020-02-01 |2018-02-01 |
|34|2020-02-01 |2021-02-15 |
|34|2020-02-01 |2020-02-15 |
|40|2020-02-01 |2019-02-15 |
|40|2020-02-01 |2020-03-15 |
Expected Output:
|id|tran_date |effc_date |
|--|-----------|-----------|
|12|2020-02-01 |2019-02-01 |
|34|2020-02-01 |2020-02-15 |
|40|2020-02-01 |2019-02-15 |
You can rank the rows within each id, preferring effc_date values on or before tran_date and then the smallest date gap, and keep only the first row per id:
from pyspark.sql import functions as F, Window

# Prefer effc_date on or before tran_date, then the date closest to tran_date.
w = Window.partitionBy('id').orderBy(
    F.when(F.col('effc_date') <= F.col('tran_date'), 0).otherwise(1),
    F.abs(F.datediff('effc_date', 'tran_date'))
)
df2 = df.withColumn('rn', F.row_number().over(w)).filter('rn = 1').drop('rn')
df2.show()
+---+----------+----------+
| id| tran_date| effc_date|
+---+----------+----------+
| 12|2020-02-01|2019-02-01|
| 34|2020-02-01|2020-02-15|
| 40|2020-02-01|2019-02-15|
+---+----------+----------+
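If you would rather avoid a window function, here is a sketch of the same preference logic with a plain groupBy (column names are assumed to match the question; min over a struct compares its fields left to right, so the preferred row wins):
from pyspark.sql import functions as F

# Sort key: prefer effc_date on or before tran_date (flag 0), then the smallest
# day gap; the dates ride along so they can be pulled back out after the groupBy.
key = F.struct(
    F.when(F.col('effc_date') <= F.col('tran_date'), 0).otherwise(1).alias('before_flag'),
    F.abs(F.datediff('effc_date', 'tran_date')).alias('gap'),
    F.col('tran_date').alias('tran_date'),
    F.col('effc_date').alias('effc_date'),
)
result = (df.groupBy('id')
            .agg(F.min(key).alias('k'))
            .select('id', 'k.tran_date', 'k.effc_date'))
result.show()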

Multiple metrics over large dataset in spark

I have a big dataset grouped by a certain field, and I need to run descriptive statistics on each field.
Let's say the dataset is 200M+ records and there are about 15 stat functions I need to run - sum/avg/min/max/stddev etc. The problem is that it's very hard to scale this task since there's no clear way to partition the dataset.
Example dataset:
+------------+----------+-------+-----------+------------+
| Department | PartName | Price | UnitsSold | PartNumber |
+------------+----------+-------+-----------+------------+
| Texas | Gadget1 | 5 | 100 | 5943 |
| Florida | Gadget3 | 484 | 2400 | 4233 |
| Alaska | Gadget34 | 44 | 200 | 4235 |
+------------+----------+-------+-----------+------------+
Right now I am doing it this way (example):
columns_to_profile = ['Price', 'UnitsSold', 'PartNumber']

# Function, count_zeros, num_hist, get_functions and column_types are
# user-defined helpers (not shown here).
functions = [
    Function(F.mean, 'mean'),
    Function(F.min, 'min_value'),
    Function(F.max, 'max_value'),
    Function(F.variance, 'variance'),
    Function(F.kurtosis, 'kurtosis'),
    Function(F.stddev, 'std'),
    Function(F.skewness, 'skewness'),
    Function(count_zeros, 'n_zeros'),
    Function(F.sum, 'sum'),
    Function(num_hist, 'hist_data'),
]

functions_to_apply = [f.function(c).alias(f'{c}${f.alias}')
                      for c in columns_to_profile
                      for f in get_functions(column_types, c)]

df.groupby('Department').agg(*functions_to_apply).toPandas()
The problem here is that the actual list of functions is bigger than this (about 16-20) and applies to each column, yet the cluster spends most of its time shuffling and CPU load is only about 5-10%.
How should I partition this data, or is my approach incorrect?
If departments are skewed (e.g. Texas has 90% of the volume), what should my approach be?
This is my Spark DAG for this job: [DAG screenshot not included]

PySpark: Filtering duplicates of a union, keeping only the groupby rows with the maximum value for a specified column

I want to create a DataFrame that contains all the rows from two DataFrames, and where there are duplicates we keep only the row with the max value of a column.
For example, given two tables with the same schema, like the ones below, we merge them into one table that keeps only the row with the maximum column value (highest score) within each group of rows grouped by another column ("name" in the example below).
Table A
+--------------------------+
| name | source | score |
+--------+---------+-------+
| Finch | Acme | 62 |
| Jones | Acme | 30 |
| Lewis | Acme | 59 |
| Smith | Acme | 98 |
| Starr | Acme | 87 |
+--------+---------+-------+
Table B
+--------------------------+
| name | source | score |
+--------+---------+-------+
| Bryan | Beta | 93 |
| Jones | Beta | 75 |
| Lewis | Beta | 59 |
| Smith | Beta | 64 |
| Starr | Beta | 81 |
+--------+---------+-------+
Final Table
+--------------------------+
| name | source | score |
+--------+---------+-------+
| Bryan | Beta | 93 |
| Finch | Acme | 62 |
| Jones | Beta | 75 |
| Lewis | Acme | 59 |
| Smith | Acme | 98 |
| Starr | Acme | 87 |
+--------+---------+-------+
Here's what seems to work:
from pyspark.sql import functions as F
schema = ["name", "source", "score"]
rows1 = [("Smith", "Acme", 98),
("Jones", "Acme", 30),
("Finch", "Acme", 62),
("Lewis", "Acme", 59),
("Starr", "Acme", 87)]
rows2 = [("Smith", "Beta", 64),
("Jones", "Beta", 75),
("Bryan", "Beta", 93),
("Lewis", "Beta", 59),
("Starr", "Beta", 81)]
df1 = spark.createDataFrame(rows1, schema)
df2 = spark.createDataFrame(rows2, schema)
df_union = df1.unionAll(df2)
df_agg = df_union.groupBy("name").agg(F.max("score").alias("score"))
df_final = df_union.join(df_agg, on="score", how="leftsemi").orderBy("name", F.col("score").desc()).dropDuplicates(["name"])
The above results in the DataFrame I expect. It seems like a convoluted way to do this, but I don't know as I'm relatively new to Spark. Can this be done in a more efficient, elegant, or "Pythonic" manner?
You can use window functions. Partition by name and choose the record with the highest score.
from pyspark.sql.functions import *
from pyspark.sql.window import Window
w=Window().partitionBy("name").orderBy(desc("score"))
df_union.withColumn("rank", row_number().over(w))\
.filter(col("rank")==1).drop("rank").show()
+-----+------+-----+
| name|source|score|
+-----+------+-----+
|Bryan| Beta| 93|
|Finch| Acme| 62|
|Jones| Beta| 75|
|Lewis| Acme| 59|
|Smith| Acme| 98|
|Starr| Acme| 87|
+-----+------+-----+
I don't see anything wrong with your answer, except for the last line: you cannot join on score only; you need to join on the combination of "name" and "score". You can also choose an inner join, which eliminates the need to remove rows with lower scores for the same name:
df_final = (df_union.join(df_agg, on=["name", "score"], how="inner")
            .orderBy("name")
            .dropDuplicates(["name"]))
Notice that there is no need to order by score, and .dropDuplicates(["name"]) is only needed if you want to avoid displaying two rows for name = Lewis who has the same score in both dataframes.
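For completeness, a join-free sketch of the same "top row per name" idea (using the df_union from above; this is just an alternative pattern, not a correction): take the max of a struct whose first field is score, then unpack it.
from pyspark.sql import functions as F

# max() over a struct compares fields left to right, so the highest score wins;
# on a score tie the lexicographically larger source would be kept.
df_top = (df_union
          .groupBy("name")
          .agg(F.max(F.struct("score", "source")).alias("top"))
          .select("name", "top.source", "top.score")
          .orderBy("name"))
df_top.show()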

Sum sublevel calculations in DAX - Power Pivot

I have two levels of calculations: Region (upper level) and City (lower level).
| DATE_1 | REGION_1 | city_1 | 1560 |
| DATE_2 | REGION_1 | city_2 | 1234 |
| DATE_2 | REGION_1 | city_3 | 245 |
| DATE_3 | REGION_1 | city_2 | 2345 |
| DATE_2 | REGION_1 | city_1 | 654 |
At the lower City level I use a formula to calculate a statistical sample size based on the total orders made in that city.
At the upper Region level I do NOT need to calculate the formula (sample size); I just need to sum the calculations from the lower level.
Expected output:
| Region | 1119 |
| city_1 | 384 |
| city_2 | 370 |
| city_3 | 365 |
So in the Pivot table I SUM sales for each City, and then, based on the total sales in that city, I calculate the statistically significant sample size.
But at the Region level in the pivot table I need to SUM those sample sizes.
I tried using SUMX but the results don't come out right (I guess I'm not fully getting the sense of SUMX).
I also feel like using GROUPBY (or some other way to pre-compute a table) and then summing the results from the lower level.

Multiple column categories in MS Excel pivot table

Is it possible to design a pivot table in such a way that I have multiple column categories? Please see examples below:
|Group | Category 1 | Category 2 |
| | good | bad | total | good | bad | total |
|---------------------------------------------------|
|Group 1| 40% | 60% | 100% | 60% | 40% | 100% |
|Group 2| 30% | 70% | 100% | 20% | 80% | 100% |
...
I can get the Category 1 part or the Category 2 part, but not both. If I put both as my column input, I get the combined version (i.e. good/good, good/bad, bad/good, and bad/bad).
Thanks
Yes, it is possible - see the attached image:
[pivot table screenshot not included]
