Transpose a dataframe in Pyspark - apache-spark

How can I transpose the following data frame in PySpark?
The idea is to achieve the result that appears below.
import pandas as pd
d = {'id' : pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
     'place' : pd.Series(['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
     'value' : pd.Series([10, 30, 20, 10, 30, 20, 10, 30, 20], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
     'attribute' : pd.Series(['size', 'height', 'weigth', 'size', 'height', 'weigth', 'size', 'height', 'weigth'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])}
df = pd.DataFrame(d)
print(df)
id place value attribute
a 1 A 10 size
b 1 A 30 height
c 1 A 20 weigth
d 2 A 10 size
e 2 A 30 height
f 2 A 20 weigth
g 3 A 10 size
h 3 A 30 height
i 3 A 20 weigth
The desired result, built the same way:
d = {'id' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'place' : pd.Series(['A', 'A', 'A'], index=['a', 'b', 'c']),
     'size' : pd.Series([10, 30, 20], index=['a', 'b', 'c']),
     'height' : pd.Series([10, 30, 20], index=['a', 'b', 'c']),
     'weigth' : pd.Series([10, 30, 20], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)
print(df)
id place size height weigth
a 1 A 10 10 10
b 2 A 30 30 30
c 3 A 20 20 20
Any help is welcome. Thanks in advance!

First of all, I don't think your sample output is correct. Your input data has size set to 10, height set to 30, and weigth set to 20 for every id, but the desired output sets everything to 10 for id 1. If this is really what you want, please explain it a bit more. If this was a mistake, then you want the pivot function. Example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import first

# Assumes a running Spark session; create one if needed.
spark = SparkSession.builder.getOrCreate()

l = [(1, 'A', 10, 'size'),
     (1, 'A', 30, 'height'),
     (1, 'A', 20, 'weigth'),
     (2, 'A', 10, 'size'),
     (2, 'A', 30, 'height'),
     (2, 'A', 20, 'weigth'),
     (3, 'A', 10, 'size'),
     (3, 'A', 30, 'height'),
     (3, 'A', 20, 'weigth')]
df = spark.createDataFrame(l, ['id', 'place', 'value', 'attribute'])
df.groupBy(df.id, df.place).pivot('attribute').agg(first("value")).show()
+---+-----+------+----+------+
| id|place|height|size|weigth|
+---+-----+------+----+------+
| 2| A| 30| 10| 20|
| 3| A| 30| 10| 20|
| 1| A| 30| 10| 20|
+---+-----+------+----+------+

Refer to the documentation. Pivoting is always done in the context of an aggregation, and I have chosen sum here. So, if there are multiple values for the same id, place and attribute, their sum will be taken. You could use min, max or mean as well, depending on what you need.
df = df.groupBy(["id","place"]).pivot("attribute").sum("value")
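As a side note, if the attribute values are known up front, you can pass them to pivot() explicitly, which spares Spark an extra pass over the data to discover the distinct values. A small sketch, reusing the sample data above:
# Listing the pivot values skips the distinct-value scan Spark
# otherwise performs; the values here are those from the sample input.
df.groupBy("id", "place") \
    .pivot("attribute", ["size", "height", "weigth"]) \
    .sum("value") \
    .show()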

Related

Multi-list to an ordered dictionary by a different column

I have an OrderedDict below, in which column1 and column2 represent a relationship.
This produced the following structure for me:
OrderedDict([('AD',
              [['A', 'Q_30', 100],
               ['A', 'Q_24', 74],
               ['B', 'Q_28', 37],
               ['B', 'Q_30', 100],
               ['C', 'Q_25', 38],
               ['C', 'Q_30', 100],
               ['D', 'D_4', 44],
               ['E', 'D_4', 44],
               ['F', 'D_5', 44]])])
I would like to iterate over the elements, each time looking at the other rows and collecting column2.
E.g.:
element A contains Q_30 and Q_24, so collect the related members from the other rows;
element B contains Q_30, so collect Q_24, Q_28, Q_30 and order by column3.
The expected result:
OrderedDict([('AD',
              [{'Q_30': 100, 'Q_24': 74, 'Q_25': 38, 'Q_28': 37},
               {'D_4': 44},
               {'D_5': 44}])])
If I understand this correctly, your "OrderedDict" is currently a tuple with a list inside, which in turn contains more lists, and is meant to look like this:
OrderedList = ('AD',
               [['A', 'Q_30', 100],
                ['A', 'Q_24', 74],
                ['B', 'Q_28', 37],
                ['B', 'Q_30', 100],
                ['C', 'Q_25', 38],
                ['C', 'Q_30', 100],
                ['D', 'D_4', 44],
                ['E', 'D_4', 44],
                ['F', 'D_5', 44]])
and you want to convert it into a tuple with a list inside which holds dicts:
OrderedDict = ('AD',
               [{'Q_30': 100,
                 'Q_24': 74,
                 'Q_25': 38,
                 'Q_28': 37},
                {'D_4': 44},
                {'D_5': 44}])
In this case I am guessing you are looking for groupby():
from itertools import groupby

OrderedList = ('AD',
               [['A', 'Q_30', 100],
                ['A', 'Q_24', 74],
                ['B', 'Q_28', 37],
                ['B', 'Q_30', 100],
                ['C', 'Q_25', 38],
                ['C', 'Q_30', 100],
                ['D', 'D_4', 44],
                ['E', 'D_4', 44],
                ['F', 'D_5', 44]])

for key, group in groupby(OrderedList[1], lambda x: x[0]):
    for thing in group:
        print("%s is a %s." % (thing[1], key))
Gives:
Q_30 is a A.
Q_24 is a A.
Q_28 is a B.
Q_30 is a B.
Q_25 is a C.
Q_30 is a C.
D_4 is a D.
D_4 is a E.
D_5 is a F.
This is not the full answer, but an example, as I feel it would be spoon-feeding otherwise.
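That said, a minimal sketch of one possible continuation (my own reading of the expected output, not part of the answer above): build one dict per letter, then merge dicts whose column2 keys overlap, directly or transitively.
def merge_overlapping(rows):
    # One dict {column2: column3} per column1 label; merge dicts that
    # share a key with the current one (handles transitive links too).
    merged = []
    for _, group in groupby(rows, lambda x: x[0]):
        d = {r[1]: r[2] for r in group}
        for m in [m for m in merged if set(m) & set(d)]:
            d.update(m)
            merged.remove(m)
        merged.append(d)
    return merged

print(merge_overlapping(OrderedList[1]))
# One merged dict with Q_30/Q_24/Q_28/Q_25, plus {'D_4': 44} and
# {'D_5': 44}; key order within each dict may differ.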

Make a list inside the dictionary in python

I have a data frame like the one below. I want to get a dictionary of lists, one list of costs per department. Can you please assist me?
You can use the handy groupby function in Pandas:
import pandas as pd

df = pd.DataFrame({
    'Department': ['y1', 'y1', 'y1', 'y2', 'y2', 'y2'],
    'Section': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Cost': [10, 20, 30, 40, 50, 60]
})
output = {dept: group['Cost'].tolist() for dept, group in df.groupby('Department')}
gives
{'y1': [10, 20, 30], 'y2': [40, 50, 60]}
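An equivalent one-liner, using pandas' built-in list aggregation (same result, just a stylistic alternative):
output = df.groupby('Department')['Cost'].apply(list).to_dict()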

How do I get a for loop to step through a list according to another list, and how do I wrap back to the start to keep counting?

So this is the chromatic scale:
chromatic_scale = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
I want to form major, minor and other scales based on this.
To form a major scale, for example, I need to leap through the chromatic scale like this: major = [2, 2, 1, 2, 2, 2, 1] (in music, we say "tone, tone, semitone, tone, tone, tone, semitone" where tone means leaping 2 list items and semitone means leaping 1, thus the list)
A practical example: Starting from 'C', I should get ['C', 'D', 'E', 'F', 'G', 'A', 'B', 'C'] (yes, it should loop back to 'C' at the end).
1 - I thought of doing it with a for loop, but how do I get for to work on one list (chromatic) while leaping through it according to another list (major)?
2 - And if I start from 'A', for instance, how do I get it back to the beginning to keep counting?
In the end I managed to do it, but I used while instead of for.
# The chromatic scale is doubled so the index can run past 'B'
# without any wrapping logic.
chromatic = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B',
             'C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def major(tom):
    major_scale = [2, 2, 1, 2, 2, 2, 1]
    step = 0
    t = chromatic.index(tom)
    m = []
    while len(m) < 8:
        m.append(chromatic[t])
        if len(m) == 8:
            break
        t += major_scale[step]
        step += 1
    return m

x = major('D')
print(x)
Thank you everyone!
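For anyone landing here later, a sketch of the for-loop version the question originally asked about: step through the interval list with for and use modulo arithmetic to wrap around, so the doubled chromatic list is no longer needed (my own variant, not the poster's code).
chromatic = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def major(tonic):
    steps = [2, 2, 1, 2, 2, 2, 1]  # tone, tone, semitone, ...
    t = chromatic.index(tonic)
    scale = [chromatic[t]]
    for step in steps:
        t = (t + step) % len(chromatic)  # wraps past 'B' back to 'C'
        scale.append(chromatic[t])
    return scale

print(major('D'))  # ['D', 'E', 'F#', 'G', 'A', 'B', 'C#', 'D']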

How to aggregate on percentiles in PySpark?

I want to be able to aggregate based on percentiles (or, more accurately in my case, complement percentiles).
Consider the following code:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ['a', 1, 'w'],
        ['a', 1, 'y'],
        ['a', 11, 'x'],
        ['a', 111, 'zzz'],
        ['a', 1111, 'zz'],
        ['a', 1111, 'zz'],
        ['b', 2, 'w'],
        ['b', 2, 'w'],
        ['b', 2, 'w'],
        ['b', 22, 'y'],
        ['b', 2222, 'x'],
        ['b', 2222, 'z'],
    ],
    ['grp', 'val1', 'val2'])
grouped = df.groupby('grp').agg(
    F.count('*').alias('count'),
    F.expr('percentile(val1, array(0.5, 0.75)) as percentiles'),
    # val2 manipulation....
)
grouped.show()
In addition to the grouping and the percentiles calculation, I would like to count the distinct values of val2 in the complement percentiles respectively.
For group b, for example, the 50th percentile of val1 is 12, and the complement is the last 3 rows, which contain 3 distinct values of val2 (y, x, z).
Similarly, the 75th percentile of val1 is 1672, and the complement is the last 2 rows, which contain 2 distinct values (x, z).
So my desired output would be:
+---+-----+--------------+--------------+
|grp|count|   percentiles|distinct count|
+---+-----+--------------+--------------+
|  a|    6| [61.0, 861.0]|        [2, 1]|
|  b|    6|[12.0, 1672.0]|        [3, 2]|
+---+-----+--------------+--------------+
How can I achieve this?
For Spark 2.3.2, you can use a Window function to calculate the percentiles, derive columns holding the val2 values that satisfy the condition associated with each percentile, and then aggregate:
from pyspark.sql import Window, functions as F
w1 = Window.partitionBy('grp')
df1 = df.withColumn('percentiles', F.expr('percentile(val1, array(0.5, 0.75))').over(w1)) \
        .withColumn('c1', F.expr('IF(val1 > percentiles[0], val2, NULL)')) \
        .withColumn('c2', F.expr('IF(val1 > percentiles[1], val2, NULL)'))

grouped = df1.groupby('grp').agg(
    F.count('*').alias('count'),
    F.first('percentiles').alias('percentiles'),
    # countDistinct ignores NULLs, so only the complement rows are counted
    F.array(F.countDistinct('c1'), F.countDistinct('c2')).alias('distinct_count')
)
grouped.show()
+---+-----+--------------+--------------+
|grp|count| percentiles|distinct_count|
+---+-----+--------------+--------------+
| b| 6|[12.0, 1672.0]| [3, 2]|
| a| 6| [61.0, 861.0]| [2, 1]|
+---+-----+--------------+--------------+
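As a quick sanity check (my own snippet, not part of the answer above), you can recompute group b's complement counts directly:
# Percentiles for group b, then distinct val2 counts above each one.
pcts = df.where(F.col('grp') == 'b') \
    .selectExpr('percentile(val1, array(0.5, 0.75)) as p').first()['p']
print([df.where((F.col('grp') == 'b') & (F.col('val1') > p))
    .select('val2').distinct().count() for p in pcts])
# [3, 2]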

Python: Create DataFrame from highest frequency of occurrence per item

I have a dataframe as given below
data = {
    'Code': ['P', 'J', 'M', 'Y', 'P', 'Z', 'P', 'P', 'J', 'P', 'J', 'M', 'P', 'Z', 'Y', 'M', 'Z', 'J', 'J'],
    'Value': [10, 10, 20, 30, 10, 40, 50, 10, 10, 20, 10, 50, 60, 40, 30, 20, 40, 20, 10]
}
example = pd.DataFrame(data)
Using Python 3, I want to create another dataframe from the dataframe example such that, for each Value, the Code most frequently associated with it is obtained.
The new dataframe should look like solution below
output = {'Code': ['J', 'M', 'Y', 'Z', 'P', 'M'],'Value': [10, 20, 30, 40, 50, 50]}
solution = pd.DataFrame(output)
As can be seen, J is associated with Value 10 more often than any other Code, so J is selected, and so on.
You could define a function that returns the most occurring items and apply it to the grouped elements. Finally, explode the resulting lists to rows.
>>> from collections import Counter
>>> def most_occurring(grp):
...     res = Counter(grp)
...     highest = max(res.values())
...     return [k for k, v in res.items() if v == highest]
...
>>> example.groupby('Value')['Code'].apply(lambda x: most_occurring(x)).explode().reset_index()
Value Code
0 10 J
1 20 M
2 30 Y
3 40 Z
4 50 P
5 50 M
6 60 P
If I understood correctly, you need something like this:
grouped = example.groupby(['Code', 'Value']).indices
arr_tmp = [[code, value, len(idx)] for (code, value), idx in grouped.items()]
output = pd.DataFrame(data=arr_tmp, columns=['Code', 'Value', 'index_count'])
output = output.sort_values(by=['index_count'], ascending=False)
output.reset_index(inplace=True)
output
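A more idiomatic sketch of the same idea (assuming pandas >= 1.1, where DataFrame.value_counts accepts a column subset): count the (Code, Value) pairs and keep, for each Value, the Code(s) with the highest count.
# Count each (Code, Value) pair, then keep the per-Value maxima.
counts = example.value_counts(['Code', 'Value']).reset_index(name='n')
top = counts[counts['n'] == counts.groupby('Value')['n'].transform('max')]
solution = top.sort_values('Value')[['Code', 'Value']].reset_index(drop=True)
print(solution)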
