Can I select 2 fields as index in a Pivot Table in Excel?

I am trying to create a pivot table in Excel that takes 2 fields as the key for grouping data.
Example:
Original Table:
| Fruit  | Country  | Sold |
| ------ | -------- | ---- |
| Apple  | USA      | 10   |
| Apple  | JAPAN    | 20   |
| Orange | JAPAN    | 5    |
| Orange | USA      | 3    |
| Orange | JAPAN    | 100  |
| Orange | THAILAND | 30   |
| Banana | THAILAND | 20   |
| Banana | THAILAND | 10   |
Pivot Table I want:
| Fruit  | Country  | TotalSold |
| ------ | -------- | --------- |
| Apple  | USA      | 10        |
| Apple  | JAPAN    | 20        |
| Orange | JAPAN    | 105       |
| Orange | USA      | 3         |
| Orange | THAILAND | 30        |
| Banana | THAILAND | 30        |
Basically, I want to use 2 columns as the key to group the Sold amount. I have played around in Excel for a while and still cannot find a way to group the data this way.
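(In the PivotTable UI this amounts to putting both Fruit and Country in the Rows area and Sold in the Values area.) For comparison with the PySpark threads below, here is a minimal pandas sketch of the same two-key grouping; the file name is hypothetical and the column names follow the example above:

import pandas as pd

# Hypothetical load of the sheet; column names follow the example above.
df = pd.read_excel("sales.xlsx")

# Group on both key columns and sum the Sold amounts.
total = (df.groupby(["Fruit", "Country"], as_index=False)["Sold"]
           .sum()
           .rename(columns={"Sold": "TotalSold"}))
print(total)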

Related

PySpark: creating aggregated columns out of the different values of a string-type column

I have this dataframe:
+---------+--------+------+
| topic| emotion|counts|
+---------+--------+------+
| dog | sadness| 4 |
| cat |surprise| 1 |
| bird | fear| 3 |
| cat | joy| 2 |
| dog |surprise| 10 |
| dog |surprise| 3 |
+---------+--------+------+
I want to create a column for every distinct emotion, aggregating the counts per topic and emotion, so I end up with an output like this:
+---------+--------+---------+-----+----------+
| topic| fear | sadness | joy | surprise |
+---------+--------+---------+-----+----------+
| dog | 0 | 4 | 0 | 13 |
| cat | 0 | 0 | 2 | 1 |
| bird | 3 | 0 | 0 | 0 |
+---------+--------+---------+-----+----------+
This is what I tried so far for the fear column, but the rest of the emotions keep showing up for every topic. How can I get a result like the above?
import pyspark.sql.functions as F

agg_emotion = df.groupby("topic", "emotion") \
    .agg(F.sum(F.when(F.col("emotion").eqNullSafe("fear"), 1)
         .otherwise(0)).alias('fear'))
Group by and sum first, then group by topic and pivot the outcome:
df.groupby('topic', 'emotion').agg(F.sum('counts').alias('counts')) \
  .groupby('topic').pivot('emotion').agg(F.first('counts')) \
  .na.fill(0).show()
+-----+----+---+-------+--------+
|topic|fear|joy|sadness|surprise|
+-----+----+---+-------+--------+
| dog| 0| 0| 4| 13|
| cat| 0| 2| 0| 1|
| bird| 3| 0| 0| 0|
+-----+----+---+-------+--------+
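A side note on the pivot step: when the distinct pivot values are known up front, they can be passed to pivot() so Spark skips the extra pass it otherwise runs to collect them. A sketch, assuming the four emotions shown are the complete set:

from pyspark.sql import functions as F

emotions = ['fear', 'joy', 'sadness', 'surprise']  # assumed complete list

(df.groupby('topic', 'emotion')
   .agg(F.sum('counts').alias('counts'))
   .groupby('topic')
   .pivot('emotion', emotions)  # explicit values skip the discovery pass
   .agg(F.first('counts'))
   .na.fill(0)
   .show())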

pyspark pivot without aggregation

I am looking to essentially pivot without requiring an aggregation at the end, to keep the dataframe intact and not create a grouped object.
As an example have this:
+---------+------------+--------+-----+
| country | code       | Value  | ids |
+---------+------------+--------+-----+
| Mexico  | food_1_3   | apple  | 1   |
| Mexico  | food_1_3   | banana | 2   |
| Canada  | beverage_2 | milk   | 1   |
| Mexico  | beverage_2 | water  | 2   |
+---------+------------+--------+-----+
Need this:
+---------+-----+----------+------------+
| country | id  | food_1_3 | beverage_2 |
+---------+-----+----------+------------+
| Mexico  | 1   | apple    |            |
| Mexico  | 2   | banana   | water      |
| Canada  | 1   |          | milk       |
+---------+-----+----------+------------+
I have tried
(df.groupby(df.country, df.id).pivot("code").agg(first('Value').alias('Value')))
but I essentially just get a top 1. In my real case I have 20 columns, some with just integers and others with strings... so sums, counts, collect_list: none of those aggs have worked out...
That's because your 'id' is not unique. Add a unique index column and that should work:
import pyspark.sql.functions as F

# The per-row index keeps duplicate (country, ids) pairs apart during the pivot.
pivoted = (df.groupby(df.country, df.ids, F.monotonically_increasing_id().alias('index'))
             .pivot("code").agg(F.first('Value').alias('Value')).drop('index'))
pivoted.show()
+-------+---+----------+--------+
|country|ids|beverage_2|food_1_3|
+-------+---+----------+--------+
| Mexico| 1| null| apple|
| Mexico| 2| water| null|
| Canada| 1| milk| null|
| Mexico| 2| null| banana|
+-------+---+----------+--------+
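Dropping the index only removes the helper column; duplicate (country, ids) pairs such as Mexico/2 still come out as separate rows. If the single-row layout from the question is the goal, a hedged follow-up is to collapse them while keeping the first non-null value per pivoted column:

collapsed = pivoted.groupBy('country', 'ids').agg(
    F.first('food_1_3', ignorenulls=True).alias('food_1_3'),
    F.first('beverage_2', ignorenulls=True).alias('beverage_2'),
)
collapsed.show()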

pyspark detect change of categorical variable

I have a spark dataframe consisting of two columns.
+-------+-----------+
|Metric |Recipe_name|
+-------+-----------+
| 100.  | A         |
| 200.  | A         |
| 300.  | A         |
| 10.   | A         |
| 20.   | A         |
| 10.   | B         |
| 20.   | B         |
| 10.   | A         |
| 20.   | A         |
| ..    | ..        |
| ..    | ..        |
| 10.   | B         |
+-------+-----------+
The dataframe is time-ordered (you can imagine there is an increasing timestamp column). I need to add a column 'Cycle'. There are two scenarios where I say a new cycle begins:
If the same recipe is running, let's say recipe 'A', and the value of Metric decreases (with respect to the last row), then a new cycle begins.
If we switch from the current recipe 'A' to a second recipe 'B' and then switch back to recipe 'A', we say a new cycle for recipe 'A' has begun.
So in the end I would like to have a column 'Cycle' which looks like this:
+-------+-----------+------+
|Metric |Recipe_name| Cycle|
+-------+-----------+------+
| 100.  | A         | 0    |
| 200.  | A         | 0    |
| 300.  | A         | 0    |
| 10.   | A         | 1    |
| 20.   | A         | 1    |
| 10.   | B         | 0    |
| 20.   | B         | 0    |
| 10.   | A         | 2    |
| 20.   | A         | 2    |
| ..    | ..        | 2    |
| ..    | ..        | 2    |
| 10.   | B         | 1    |
+-------+-----------+------+
So recipe A has cycle 0; then the metric decreases and the cycle changes to 1.
Then a new recipe B starts, so it gets a new cycle 0.
When we get back to recipe A, a new cycle begins for recipe A, and with respect to its last cycle number it is cycle 2 (and similarly for recipe B).
In total there are 200 recipes.
Thanks for the help.
Replace my order column with your ordering column. Compare your condition against the previous row using the lag function over a window partitioned by the Recipe_name column.
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, sum, when

w = Window.partitionBy('Recipe_name').orderBy('order')

# Flag metric drops within a recipe, then running-sum the flags per recipe.
df.withColumn('Cycle', when(col('Metric') < lag('Metric', 1, 0).over(w), 1).otherwise(0)) \
  .withColumn('Cycle', sum('Cycle').over(w)) \
  .orderBy('order') \
  .show()
+------+-----------+-----+
|Metric|Recipe_name|Cycle|
+------+-----------+-----+
| 100| A| 0|
| 200| A| 0|
| 300| A| 0|
| 10| A| 1|
| 20| A| 1|
| 10| B| 0|
| 20| B| 0|
| 10| A| 2|
| 20| A| 2|
| 10| B| 1|
+------+-----------+-----+
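One detail worth noting: the running sum above works because a window with an ORDER BY uses Spark's default frame of unbounded preceding to the current row. If you prefer to spell that out, an explicit variant of the window (a sketch, not required for the answer above) would be:

from pyspark.sql import Window

w_cum = (Window.partitionBy('Recipe_name')
               .orderBy('order')
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))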

Reverse Group By function in pyspark?

Sample Data:
+-----------+------------+---------+
|City |Continent | Price|
+-----------+------------+---------+
| A | Asia | 100|
| B | Asia | 110|
| C | Africa | 60|
| D | Europe | 170|
| E | Europe | 90|
| F | Africa | 100|
+-----------+------------+---------+
Output:
For the second column I know we can just use
df.groupby("Continent").agg({'Price':'avg'})
But how can we calculate the third column? The third column groups the cities that do not belong to each continent and then calculates their average price.
expected output
+-----------+--------------+----------------------------------------------+
|Continent  | Average Price|Average Price for cities not in this continent|
+-----------+--------------+----------------------------------------------+
| Asia      | 105          | 105                                          |
| Africa    | 80           | 117.5                                        |
| Europe    | 130          | 92.5                                         |
+-----------+--------------+----------------------------------------------+
>>> from pyspark.sql.functions import col,avg
>>> df.show()
+----+---------+-----+
|City|Continent|Price|
+----+---------+-----+
| A| Asia| 100|
| B| Asia| 110|
| C| Africa| 60|
| D| Europe| 170|
| E| Europe| 90|
| F| Africa| 100|
+----+---------+-----+
>>> df1 = df.alias("a").join(df.alias("b"), col("a.Continent") != col("b.Continent"),"left").select(col("a.*"), col("b.price").alias("b_price"))
>>> df1.groupBy("Continent").agg(avg(col("Price")).alias("Average Price"), avg(col("b_price")).alias("Average Price for cities not in this continent")).show()
+---------+-------------+----------------------------------------------+
|Continent|Average Price|Average Price for cities not in this continent|
+---------+-------------+----------------------------------------------+
| Europe| 130.0| 92.5|
| Africa| 80.0| 117.5|
| Asia| 105.0| 105.0|
+---------+-------------+----------------------------------------------+
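The self-join compares every city against every city outside its continent, which can get expensive on larger data. A hedged alternative sketch: aggregate once per continent, attach the grand totals with a cross join, and derive the complement average as (total sum minus continent sum) divided by (total count minus continent count):

from pyspark.sql import functions as F

per_cont = df.groupBy('Continent').agg(
    F.avg('Price').alias('Average Price'),
    F.sum('Price').alias('cont_sum'),
    F.count('Price').alias('cont_cnt'),
)
totals = df.agg(F.sum('Price').alias('tot_sum'), F.count('Price').alias('tot_cnt'))

(per_cont.crossJoin(totals)
 .withColumn('Average Price for cities not in this continent',
             (F.col('tot_sum') - F.col('cont_sum')) /
             (F.col('tot_cnt') - F.col('cont_cnt')))
 .select('Continent', 'Average Price',
         'Average Price for cities not in this continent')
 .show())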

Generating columns based on query results

I have four pivots I'm pulling data from that look like this:
tracked_work_by_users_by_operation_pivot:
+-------------------+---------------+---------------+-----------------------+---------------+
| DATE(start_time) | userid | operation_id | Time estimated | Time Elapsed |
+-------------------+---------------+---------------+-----------------------+---------------+
| 1/2/2011-1/8/2011 | jsmith | 11| 40| 40|
| 1/2/2011-1/8/2011 | jsmith | 10| 20| 24|
+-------------------+---------------+---------------+-----------------------+---------------+
faults_by_user_pivot:
+-------------------+---------------+---------------+-----------------------+----------+
| date(date_entered)| userid | operation_id | Major | Minor |
+-------------------+---------------+---------------+-----------------------+----------+
| 1/2/2011-1/8/2011 | jsmith | 11| 2 | 1|
+-------------------+---------------+---------------+-----------------------+----------+
paid_hours_by_user_pivot:
+-------------------+---------------+---------+
|date_range | userid | Total |
+-------------------+---------------+---------+
| 1/2/2011-1/8/2011 | jsmith | 40 |
+-------------------+---------------+---------+
tracked_work_by_users_pivot:
+-------------------+---------------+---------+
|DATE(start_time) | userid | Total |
+-------------------+---------------+---------+
| 1/2/2011-1/8/2011 | jsmith | 24 |
| | | |
+-------------------+---------------+---------+
What I need to do is compile a report for each user for each operation. From what I see the best way to do that is to have a format similar to:
+--------------+--------------+ +--------------+--------------+
| jsmith | packaging | | jsmith | machining |
+--------------------+--------------+--------------+----------------+--------------+--------------+--------------+--------------+----------------+--------------+--------------+
| DATE | time_elapsed | hours_worked | estimated_work | minor_faults | major_faults | time_elapsed | hours_worked | estimated_work | minor_faults | major_faults |
+--------------------+--------------+--------------+----------------+--------------+--------------+--------------+--------------+----------------+--------------+--------------+
| 1/2/2011-1/8/2011 | 24 | 40 | 36 | 1 | 2 | 24 | 40 | 36 | 1 | 2 |
+--------------------+--------------+--------------+----------------+--------------+--------------+--------------+--------------+----------------+--------------+--------------+
So jsmith will have separate entries for machining and for packaging, because we want to be able to rank him against all machining operators and all packaging operators. How can I best do this so that I will not have to add another 12 entries (since there are twelve operations) every time I add a new user?
The best way to do this seems to be to simply write an app instead of strong-arming Excel into it.
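If the app route is taken, the joining step itself is small. A minimal pandas sketch, assuming the four pivots are exported to files and that their date columns have been normalised to a common 'date' field (file and column names below are hypothetical):

import pandas as pd

# Hypothetical exports of the pivots from the question.
tracked_by_op = pd.read_csv("tracked_work_by_users_by_operation_pivot.csv")
faults = pd.read_csv("faults_by_user_pivot.csv")
paid_hours = pd.read_csv("paid_hours_by_user_pivot.csv")

# One row per (date, userid, operation_id): adding a user or an operation adds
# rows, not columns, so nothing needs to be edited by hand.
report = (tracked_by_op
          .merge(faults, on=["date", "userid", "operation_id"], how="left")
          .merge(paid_hours, on=["date", "userid"], how="left"))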
