pyspark pivot a string data with multiple columns - apache-spark

I have data in a DataFrame as below:
Transaction_ID type value status code
123 MC_ID 1 Y MI
123 LOC_ID 2 Y MI
123 TCP_ID 65 Y MI
456 TCP_ID 65 NY TOI
456 LOC_ID 65 NY TOI
456 MC_ID 65 NY TOI
Expected output:
Transaction_ID MC_ID LOC_ID TCP_ID status code
123 1 2 65 Y MI
456 65 65 65 NY TOI
Can anyone help me with PySpark code or Spark SQL for this?
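A sketch of one possible approach using groupBy + pivot, assuming the DataFrame is named df and that status and code are constant within each Transaction_ID (as in the sample data):
from pyspark.sql import functions as F
result = (df.groupBy("Transaction_ID", "status", "code")
            .pivot("type", ["MC_ID", "LOC_ID", "TCP_ID"])
            .agg(F.first("value")))
result.select("Transaction_ID", "MC_ID", "LOC_ID", "TCP_ID", "status", "code").show()
Passing the expected values explicitly to pivot avoids an extra pass over the data to discover them.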

Related

Calculate weighted average results for multiple columns based on another dataframe in Pandas

Let's say we have students' score data df1 and credit data df2 as follows:
df1:
stu_id major Python English C++
0 U202010521 computer 56 81 82
1 U202010522 management 92 56 64
2 U202010523 management 95 88 81
3 U202010524 BigData&AI 79 53 74
4 U202010525 computer 53 71 -1
5 U202010526 computer 78 96 53
6 U202010527 BigData&AI 69 63 74
7 U202010528 BigData&AI 86 57 82
8 U202010529 BigData&AI 81 100 85
9 U202010530 BigData&AI 79 67 80
df2:
class credit
0 Python 2
1 English 4
2 C++ 3
I need to calculate the weighted average of each student's scores. The weight for each class is its share of the 9 total credits (2 + 4 + 3):
df2['credit_ratio'] = df2['credit']/9
Out:
class credit credit_ratio
0 Python 2 0.222222
1 English 4 0.444444
2 C++ 3 0.333333
I.e., for U202010521, the weighted score will be 56*0.22 + 81*0.44 + 82*0.33 = 75.02 (with rounded ratios). I need to compute each student's weighted_score as a new column. How can I do that in Pandas?
Try set_index + mul, then sum on axis=1:
df1['weighted_score'] = (
    df1[df2['class']].mul(df2.set_index('class')['credit_ratio']).sum(axis=1)
)
df1:
stu_id major Python English C++ weighted_score
0 U202010521 computer 56 81 82 75.777778
1 U202010522 management 92 56 64 66.666667
2 U202010523 management 95 88 81 87.222222
3 U202010524 BigData&AI 79 53 74 65.777778
4 U202010525 computer 53 71 -1 43.000000
5 U202010526 computer 78 96 53 77.666667
6 U202010527 BigData&AI 69 63 74 68.000000
7 U202010528 BigData&AI 86 57 82 71.777778
8 U202010529 BigData&AI 81 100 85 90.777778
9 U202010530 BigData&AI 79 67 80 74.000000
Explanation:
By setting the index of df2 to class, multiplication will now align correctly with the columns of df1:
df2.set_index('class')['credit_ratio']
class
Python 0.222222
English 0.444444
C++ 0.333333
Name: credit_ratio, dtype: float64
Select the specific columns from df1 using the values from df2:
df1[df2['class']]
Python English C++
0 56 81 82
1 92 56 64
2 95 88 81
3 79 53 74
4 53 71 -1
5 78 96 53
6 69 63 74
7 86 57 82
8 81 100 85
9 79 67 80
Multiply to apply the weights:
df1[df2['class']].mul(df2.set_index('class')['credit_ratio'])
Python English C++
0 12.444444 36.000000 27.333333
1 20.444444 24.888889 21.333333
2 21.111111 39.111111 27.000000
3 17.555556 23.555556 24.666667
4 11.777778 31.555556 -0.333333
5 17.333333 42.666667 17.666667
6 15.333333 28.000000 24.666667
7 19.111111 25.333333 27.333333
8 18.000000 44.444444 28.333333
9 17.555556 29.777778 26.666667
Then sum across each row to get the total weighted score:
df1[df2['class']].mul(df2.set_index('class')['credit_ratio']).sum(axis=1)
0 75.777778
1 66.666667
2 87.222222
3 65.777778
4 43.000000
5 77.666667
6 68.000000
7 71.777778
8 90.777778
9 74.000000
dtype: float64
I can do it in several steps; the complete workflow is below:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(
"""stu_id major Python English C++
U202010521 computer 56 81 82
U202010522 management 92 56 64
U202010523 management 95 88 81
U202010524 BigData&AI 79 53 74
U202010525 computer 53 71 -1
U202010526 computer 78 96 53
U202010527 BigData&AI 69 63 74
U202010528 BigData&AI 86 57 82
U202010529 BigData&AI 81 100 85
U202010530 BigData&AI 79 67 80"""), sep=r"\s+")
df2 = pd.read_csv(StringIO(
"""class credit
Python 2
English 4
C++ 3"""), sep=r"\s+")
df2['credit_ratio'] = df2['credit']/9
# reshape to long format: one row per (student, class) pair
df3 = df.melt(id_vars=["stu_id", "major"])
# map each class name to its credit ratio
df3["credit_ratio"] = df3["variable"].map(df2.set_index("class")["credit_ratio"])
df3["G"] = df3["value"] * df3["credit_ratio"]
>>> df3.groupby("stu_id")["G"].sum()
stu_id
U202010521 75.777778
U202010522 66.666667
U202010523 87.222222
U202010524 65.777778
U202010525 43.000000
U202010526 77.666667
U202010527 68.000000
U202010528 71.777778
U202010529 90.777778
U202010530 74.000000
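If the result is needed as a new column on df, as the question asks, the grouped sums can be mapped back (a small sketch, assuming the frames built above):
df["weighted_score"] = df["stu_id"].map(df3.groupby("stu_id")["G"].sum())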

Adding empty rows based on two columns in Pandas DataFrame

I have a dataframe with the following structure:
x y z
93 122 787.185547
93 123 847.964905
93 124 908.932190
93 125 1054.865845
93 126 1109.340576
x and y are coordinates, and I know their ranges. For example:
x_range=np.arange(90,130)
y_range=np.arange(100,130)
z is the measurement data.
Now I want to insert the missing points with NaN values in z,
so it looks like:
x y z
90 100 NaN
90 101 NaN
...........................
93 121 NaN
93 122 787.185547
93 123 847.964905
93 124 908.932190
...........................
129 128 NaN
129 129 NaN
It can be done with a simple but inefficient for loop, but is there a cleaner way to do this?
I would recommend itertools.product followed by a left merge:
import itertools
import pandas as pd
df = pd.DataFrame(itertools.product(x_range, y_range), columns=['x', 'y']).merge(df, how='left')
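An equivalent sketch using reindex on a MultiIndex built from the two ranges (assuming df has the columns x, y, z as above):
import pandas as pd
# every (x, y) pair in the full grid; pairs absent from df get NaN in z
full = pd.MultiIndex.from_product([x_range, y_range], names=['x', 'y'])
df = df.set_index(['x', 'y']).reindex(full).reset_index()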

Generating all the combinations of 7 columns in a dataframe and adding the corresponding rows to generate new columns

I have a dataframe that looks similar to below:
Wave A B C
340 77 70 15
341 80 73 15
342 83 76 16
343 86 78 17
I want to generate columns that will have all the possible combinations of the existing columns. I showed 3 cols here but in my actual data, I have 7 columns and therefore 127 total combinations. The desired output is as follows:
Wave A B C AB AC BC ... ABC
340 77 70 15 147 92 ...
341 80 73 15 153 95 ...
342 83 76 16 159 99 ...
I implemented a quite inefficient version where the user inputs the combinations (AB, AC, etc.) and a new column is created with the sum of the rows. This seems almost impossible to do for 127 combinations, especially with descriptive column names.
Create a list of all combinations with chain + combinations from itertools, then sum the appropriate columns:
from itertools import combinations, chain

cols = [*df.iloc[:, 1:]]
l = list(chain.from_iterable(combinations(cols, n + 2) for n in range(len(cols))))
# [('A', 'B'), ('A', 'C'), ('B', 'C'), ('A', 'B', 'C')]
for items in l:
    df[''.join(items)] = df.loc[:, list(items)].sum(axis=1)
Wave A B C AB AC BC ABC
0 340 77 70 15 147 92 85 162
1 341 80 73 15 153 95 88 168
2 342 83 76 16 159 99 92 175
3 343 86 78 17 164 103 95 181
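As a quick check that this scales to the 7-column case in the question (the column names here are hypothetical):
from itertools import combinations, chain
cols = list('ABCDEFG')  # 7 hypothetical column names
l = list(chain.from_iterable(combinations(cols, n) for n in range(2, len(cols) + 1)))
len(l)  # 120 multi-column sums: 127 total combinations minus the 7 single columns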
Another way: get all the combinations first, then build a mapping (a dict or Series) from each original column to its combined column name:
l = df.columns[1:].tolist()
l1 = [list(map(list, itertools.combinations(l, i))) for i in range(len(l) + 1)]
d = [dict.fromkeys(y, ''.join(y)) for x in l1 for y in x]
maps = pd.Series(d).apply(pd.Series).stack()
df.set_index('Wave', inplace=True)
# reindex so the column order of df matches the keys of maps
df = df.reindex(columns=maps.index.get_level_values(1))
# assign the combined names; columns in the same combination now share a name
df.columns = maps.tolist()
# sum the columns that share a name (note: sum(level=...) is removed in newer
# pandas; use df.groupby(level=0, axis=1).sum() there instead)
df.sum(level=0, axis=1)
Out[303]:
A B C AB AC BC ABC
Wave
340 77 70 15 147 92 85 162
341 80 73 15 153 95 88 168
342 83 76 16 159 99 92 175
343 86 78 17 164 103 95 181

pandas how to convert a dataframe to a matrix using transpose

I have the following df:
code y_m count
101 2017-11 86
101 2017-12 32
102 2017-11 11
102 2017-12 34
102 2018-01 46
103 2017-11 56
103 2017-12 89
Now I want to convert this df into a matrix that transposes the y_m column into rows and uses count as the cell values, like:
0 1 2 3 4
0 -1 0 2017-11 2017-12 2018-01
1 0 354 153 155 46
2 101 118 86 32 -1
3 102 91 11 34 46
4 103 145 56 89 -1
Specifically, -1 is a dummy value indicating either that no value exists for a given (code, y_m) pair or padding to maintain the matrix shape; 0 represents 'all' values, aggregating over code, y_m, or both. E.g., cell (1, 1) sums count over all y_m and code values; (1, 2) sums count for 2017-11.
You can first use pivot_table:
df1 = (df.pivot_table(index='code',
                      columns='y_m',
                      values='count',
                      margins=True,
                      aggfunc='sum',
                      fill_value=-1,
                      margins_name='0'))
print (df1)
y_m 2017-11 2017-12 2018-01 0
code
101 86 32 -1 118
102 11 34 46 91
103 56 89 -1 145
0 153 155 46 354
Then reshape to the final format (note this mixes numeric values with strings):
# change the order of the index and columns for reindex
idx = df1.index[-1:].tolist() + df1.index[:-1].tolist()
cols = df1.columns[-1:].tolist() + df1.columns[:-1].tolist()
df2 = (df1.reindex(index=idx, columns=cols)
          .reset_index()
          .rename(columns={'code': -1})
          .rename_axis(None, axis=1))
# prepend the column names as the first row
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df3 = pd.concat([df2.columns.to_frame().T, df2]).reset_index(drop=True)
# reset the column names to a range
df3.columns = range(len(df3.columns))
print (df3)
0 1 2 3 4
0 -1 0 2017-11 2017-12 2018-01
1 0 354 153 155 46
2 101 118 86 32 -1
3 102 91 11 34 46
4 103 145 56 89 -1

Conditional date join in python Pandas

I have two pandas dataframes: matches with columns (match_id, team_id, date, ...) and teams_att with columns (id, team_id, date, overall_rating, ...).
I want to join the two dataframes on matches.team_id = teams_att.team_id, with teams_att.date closest to matches.date.
Example
matches
match_id team_id date
1 101 2012-05-17
2 101 2014-07-11
3 102 2010-05-21
4 102 2017-10-24
teams_att
id team_id date overall_rating
1 101 2010-02-22 67
2 101 2011-02-22 69
3 101 2012-02-20 73
4 101 2013-09-17 79
5 101 2014-09-10 74
6 101 2015-08-30 82
7 102 2015-03-21 42
8 102 2016-03-22 44
Desired results
match_id team_id matches.date teams_att.date overall_rating
1 101 2012-05-17 2012-02-20 73
2 101 2014-07-11 2014-09-10 74
3 102 2010-05-21 2015-03-21 42
4 102 2017-10-24 2016-03-22 44
You can use merge_asof with by and direction parameters:
pd.merge_asof(matches.sort_values('date'),
              teams_att.sort_values('date'),
              on='date', by='team_id',
              direction='nearest')
Output:
match_id team_id date id overall_rating
0 3 102 2010-05-21 7 42
1 1 101 2012-05-17 3 73
2 2 101 2014-07-11 5 74
3 4 102 2017-10-24 8 44
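Note that merge_asof requires both frames to be sorted by the on key, and the key must be an ordered dtype such as datetime; if the date columns were read as strings, convert them first:
matches['date'] = pd.to_datetime(matches['date'])
teams_att['date'] = pd.to_datetime(teams_att['date'])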
We can also use merge_asof per group (please check Scott's answer above; that is the right way to solve this type of problem, cheers):
g1 = df1.groupby('team_id')
g = df.groupby('team_id')
l = []
for x in [101, 102]:
    l.append(pd.merge_asof(g.get_group(x), g1.get_group(x), on='date', direction='nearest'))
pd.concat(l)
Out[405]:
match_id team_id_x date id team_id_y overall_rating
0 1 101 2012-05-17 3 101 73
1 2 101 2014-07-11 5 101 74
0 3 102 2010-05-21 7 102 42
1 4 102 2017-10-24 8 102 44
