How to convert repeated rows data to columns in Python? - python-3.x

Hi, I have a data frame df where the values in the name column repeat every four rows (john, mark, pieter, beth). I need to pivot each repeated group into a single row.
This is how df looks:
name marks
john 63
mark 45
pieter 32
beth 02
john 25
mark 01
pieter 23
beth 42
john 03
mark 43
pieter 42
beth 23
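For reference, a minimal sketch to rebuild this example frame (assuming the marks are strings, to keep the leading zeros shown above):
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'mark', 'pieter', 'beth'] * 3,
    'marks': ['63', '45', '32', '02', '25', '01', '23', '42', '03', '43', '42', '23'],
})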
I need the output in the following format
type john mark pieter beth
marks 63 45 32 02
marks 25 01 23 42
marks 03 43 42 23

Consider this:
df = (df.assign(id=df.groupby("name").cumcount())  # number each repetition of a name
        .pivot(columns="name", index="id")         # one column per name
        .stack(level=0)                            # move the "marks" level into the index
        .reset_index(level=1)
        .rename(columns={"level_1": "type"}))
# del df.index.name no longer works on recent pandas; clear the names explicitly
df.index.name = None
df.columns.name = None
Outputs:
type beth john mark pieter
0 marks 02 63 45 32
1 marks 42 25 01 23
2 marks 23 03 43 42
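If you need the columns in the original order rather than alphabetical, a reindex afterwards restores it (as the last answer below also does):
df = df.reindex(['type', 'john', 'mark', 'pieter', 'beth'], axis=1)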

IIUC:
new_df = (df.pivot_table(index=df.groupby('name').cumcount(), columns='name')
            .rename_axis(columns=['type', None])
            .stack(level=0)
            .reset_index(level=1))
print(new_df)
type beth john mark pieter
0 marks 2 63 45 32
1 marks 42 25 1 23
2 marks 23 3 43 42
or
new_df = (df.assign(index=df.groupby('name').cumcount())
            .melt(['index', 'name'], var_name='type')
            .pivot_table(index=['type', 'index'], columns='name', values='value')
            .reset_index('type'))

An alternative:
res = (df
       .pivot(columns='name', values='marks')
       .bfill()
       # remove repetitions on the beth column
       # this impacts the other columns as well
       .drop_duplicates('beth')
       .rename_axis(columns=None)
       .astype(int)
       .assign(type='marks')
       # adjust column positions to match your output
       .reindex(['type', 'john', 'mark', 'pieter', 'beth'], axis=1)
       .reset_index(drop=True)
      )
res
type john mark pieter beth
0 marks 63 45 32 2
1 marks 25 1 23 42
2 marks 3 43 42 23
You could also step out of pandas into NumPy, using the reshape method, and create a new dataframe at the end:
name_len = df.name.nunique()
names = df.name.unique()
df_len = len(df)
reshape_tuple = (df_len // name_len, name_len)
reshaped = df.marks.to_numpy().reshape(reshape_tuple)
# create new dataframe
res = pd.DataFrame(reshaped, columns=names)
# insert the 'type' column at the beginning of the dataframe
res.insert(0, 'type', 'marks')
print(res)
type john mark pieter beth
0 marks 63 45 32 2
1 marks 25 1 23 42
2 marks 3 43 42 23
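One caveat: the reshape approach assumes the names cycle in exactly the same order throughout the frame. A quick sanity check, as a sketch:
# verify every window of rows repeats the names in the same order
assert (df.name.to_numpy().reshape(reshape_tuple) == names).all()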

Related

Analysis on dataframe with Python

I want to be able to calculate the average 'goal', 'shot', and 'miss' per shooterName to use for further analysis and visualization.
The code below gives me the count of the three attributes (shot, goal, miss) in the 'event' column, grouped by 'shooterName'.
Dataframe columns:
season period time teamCode event goal xCord yCord xCordAdjusted yCordAdjusted ... playerPositionThatDidEvent timeSinceFaceoff playerNumThatDidEvent shooterPlayerId shooterName shooterLeftRight shooterTimeOnIce shooterTimeOnIceSinceFaceoff shotDistance
Corresponding data
2020 1 16 PHI SHOT 0 -74 29 74 -29 ... C 16 11 8478439.0 Travis Konecny R 16 16 32.649655
2020 1 34 PIT SHOT 0 49 -25 49 -25 ... C 34 9 8478542.0 Evan Rodrigues R 34 34 47.169906
2020 1 65 PHI SHOT 0 -52 -31 52 31 ... L 65 86 8480797.0 Joel Farabee L 31 31 48.270074
2020 1 171 PIT SHOT 0 43 39 43 39 ... C 42 9 8478542.0 Evan Rodrigues R 42 42 60.307545
2020 1 209 PHI MISS 0 -46 33 46 -33 ... D 38 5 8479026.0 Philippe Myers R 38 38 54.203321
Current code:
dft['count'] = df.groupby(['shooterName', 'event'])['event'].agg(['count'])
dft
Current Output:
shooterName event count
A.J. Greer GOAL 1
MISS 6
SHOT 29
Aaron Downey GOAL 1
MISS 4
SHOT 35
Zenon Konopka GOAL 8
MISS 57
SHOT 176
Desired Output:
shooterName event count %totalshooterNameevents
A.J. Greer GOAL 1 .0277
MISS 6 .1666
SHOT 29 .805
Aaron Downey GOAL 1 .025
MISS 4 .1
SHOT 35 .875
Zenon Konopka GOAL 8 .0331
MISS 57 .236
SHOT 176 .7302
Something similar to this. My end goal is to be able to calculate each 'event' attribute as a percentage of the total events per 'shooterName'. Below I added a column '%totalshooterNameevents', which is simply each of 'goal', 'shot', and 'miss' divided by the sum of the three per 'shooterName'.
Update
Try:
dft = df.groupby(['shooterName', 'event'])['event'].agg(['count']).reset_index()
dft['%total'] = dft.groupby('shooterName')['count'].apply(lambda x: x / sum(x))
print(dft)
# Output
shooterName event count %total
0 A.J. Greer GOAL 1 0.027778
1 A.J. Greer MISS 6 0.166667
2 A.J. Greer SHOT 29 0.805556
3 Aaron Downey GOAL 1 0.025000
4 Aaron Downey MISS 4 0.100000
5 Aaron Downey SHOT 35 0.875000
6 Zenon Konopka GOAL 8 0.033195
7 Zenon Konopka MISS 57 0.236515
8 Zenon Konopka SHOT 176 0.730290
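The same percentage can also be computed with transform, which keeps the original index and avoids relying on how apply aligns the result (a sketch, equivalent to the line above):
dft['%total'] = dft['count'] / dft.groupby('shooterName')['count'].transform('sum')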
Without a sample, it's difficult to guess what you want. Try:
import pandas as pd
import numpy as np

# Setup a Minimal Reproducible Example
np.random.seed(2021)
df = pd.DataFrame({'shooterName': np.random.choice(list('AB'), 20),
                   'event': np.random.choice(['shot', 'goal', 'miss'], 20)})

# Create an empty dataframe
dft = pd.DataFrame(index=df['shooterName'].unique())

# Do stuff
grp = df.groupby('shooterName')
dft['count'] = grp.count()
dft = dft.join(grp['event'].value_counts().unstack('event')
                  .div(dft['count'], axis=0))
Output:
>>> dft
count goal miss shot
A 12 0.416667 0.250 0.333333
B 8 0.500000 0.375 0.125000

Data frame transformation using transposing and flattening

I have a data frame that looks like:
tdelta A B label
1 11 21 Lab1
2 24 45 Lab2
3 44 65 Lab3
4 77 22 Lab4
5 12 64 Lab5
6 39 09 Lab6
7 85 11 Lab7
8 01 45 Lab8
And I need to transform this dataset into:
For selected window: 4
A1 A2 A3 A4 B1 B2 B3 B4 L1 label
11 24 44 77 21 45 65 22 Lab1 Lab4
12 39 85 01 64 09 11 45 Lab5 Lab8
So based on the selected window 'w', I need to transpose w rows, with the first corresponding label as my X values and the last corresponding label as my Y value. Here is what I have developed so far:
def data_process(data, window):
    n = len(data)
    A = pd.DataFrame(data['A'])
    B = pd.DataFrame(data['B'])
    lb = pd.DataFrame(data['label'])
    df_A = pd.concat([A.loc[i] for i in range(0, window)], axis=1).reset_index()
    df_B = pd.concat([B.loc[i] for i in range(0, window)], axis=1).reset_index()
    df_lb = pd.concat([lb.loc[0]], axis=1).reset_index()
    X = pd.concat([df_A, df_B, df_lb], axis=1)
    Y = pd.DataFrame(data['label']).shift(-window)
    return X, Y
I think this works for only the first 'window' rows. I need it to work for my entire dataframe.
This is essentially a pivot, with a lot of cleaning up afterwards. For the pivot to work, we need integer and modulus division so that we can group the rows into windows of length w and figure out which column each row then belongs to.
# Number of rows to group together
w = 4
df['col'] = np.arange(len(df)) % w + 1
df['i'] = np.arange(len(df)) // w

# Reshape and flatten the MultiIndex
df = (df.drop(columns='tdelta')
        .pivot(index='i', columns='col')
        .rename_axis(index=None))
df.columns = [f'{x}{y}' for x, y in df.columns]

# Define these columns and remove the intermediate label columns.
df['L1'] = df['label1']
df['label'] = df[f'label{w}']
df = df.drop(columns=[f'label{i}' for i in range(1, w+1)])
print(df)
A1 A2 A3 A4 B1 B2 B3 B4 L1 label
0 11 24 44 77 21 45 65 22 Lab1 Lab4
1 12 39 85 1 64 9 11 45 Lab5 Lab8
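If you want this as a reusable function with the window as a parameter, like the data_process sketch in the question, it might look like this (a sketch, assuming the same column names as above):
import numpy as np
import pandas as pd

def window_pivot(df, w):
    # assign each row a column slot (1..w) and a window number
    out = df.assign(col=np.arange(len(df)) % w + 1,
                    i=np.arange(len(df)) // w)
    out = (out.drop(columns='tdelta')
              .pivot(index='i', columns='col')
              .rename_axis(index=None))
    out.columns = [f'{x}{y}' for x, y in out.columns]
    # first label of each window as a feature, last label as the target
    out['L1'] = out['label1']
    out['label'] = out[f'label{w}']
    return out.drop(columns=[f'label{i}' for i in range(1, w + 1)])

res = window_pivot(df, 4)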

Most frequently occurring numbers across multiple columns using pandas

I have a data frame with numbers in multiple columns, listed by date. What I'm trying to do is find the most frequently occurring numbers across the whole data set, and also grouped by date.
import pandas as pd
import glob

def lotnorm(pdobject):
    # clean up special characters in the column names and make
    # the date column the index as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column: column.replace('#', '')})
    return pdobject

def lotimport():
    lotret = {}
    # list files in data directory with csv filename
    for lotpath in [f for f in glob.glob("data/*.csv")]:
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret
print(lotimport()['ozlotto'])
------------- Output ---------------------
1 2 3 4 5 6 7 8 9
Date
2020-07-07 4 5 7 9 12 13 32 19 35
2020-06-30 1 17 26 28 38 39 44 14 41
2020-06-23 1 3 9 13 17 20 41 28 45
2020-06-16 1 2 13 21 22 27 38 24 33
2020-06-09 8 11 26 27 31 38 39 3 36
... .. .. .. .. .. .. .. .. ..
2005-11-15 7 10 13 17 30 32 41 20 14
2005-11-08 12 18 22 28 33 43 45 23 13
2005-11-01 1 3 11 17 24 34 43 39 4
2005-10-25 7 16 23 29 36 39 42 19 43
2005-10-18 5 9 12 30 33 39 45 7 19
The output I am aiming for is
Number frequency
45 201
32 195
24 187
14 160
48 154
--------------- Updated with append experiment -----------
I tried using append to create a single series from the dataframe, which worked for individual lines of code, but gave a really odd result when I ran it inside a for loop.
temp = lotimport()['ozlotto']['1']
print(temp)
temp = temp.append(lotimport()['ozlotto']['2'], ignore_index=True, verify_integrity=True)
print(temp)
temp = temp.append(lotimport()['ozlotto']['3'], ignore_index=True, verify_integrity=True)
print(temp)
lotcomb = pd.DataFrame()
for i in lotimport()['ozlotto'].columns.tolist():
    print(f"{i} - {type(i)}")
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
print(lotcomb)
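For what it's worth, the odd loop result is most likely because DataFrame.append treats each named column Series as a single new row, with the dates becoming columns. Building one long Series out of the columns first avoids that, e.g. with concat (a sketch):
ozlotto = lotimport()['ozlotto']
lotcomb = pd.concat([ozlotto[c] for c in ozlotto.columns], ignore_index=True)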
This solution might be the one you are looking for.
import numpy as np

# unique values across the whole frame, with their counts
freqvalues = np.unique(df.to_numpy(), return_counts=True)
df2 = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
df2.index.name = "Numbers"
df2
Output:
Frequency
Numbers
1 6
2 5
3 5
5 8
6 4
7 7
8 2
9 7
10 3
11 4
12 2
13 8
14 1
15 4
16 4
17 6
18 4
19 5
20 9
21 3
22 4
23 2
24 4
25 5
26 4
27 6
28 1
29 6
30 3
31 3
... ...
70 6
71 6
72 5
73 5
74 2
75 8
76 5
77 3
78 3
79 2
80 3
81 4
82 6
83 9
84 5
85 4
86 1
87 3
88 4
89 3
90 4
91 4
92 3
93 5
94 1
95 4
96 6
97 6
98 1
99 6
97 rows × 1 columns
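To match the desired top-N view, sort by the counts afterwards:
df2.sort_values('Frequency', ascending=False).head(5)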
Use df.max(axis=0) for the maximum of each column, and df.max(axis=1) for the maximum of each row.
OK, so the final answer I came up with was a mix of a few things, including some of the great input from people in this thread. Essentially I do the following:
Pull in the CSV file, clean up the dates and the column names, and convert it to a pandas dataframe.
Then create a new pandas series and append each column to it, ignoring dates to prevent conflicts.
Once I have the series, I use Vioxini's suggestion to use numpy to get counts of unique values, turn the values into the index, sort the column by count in descending order, and return the top 10 values.
Below is the resulting code; I hope it helps someone else.
import pandas as pd
import glob
import numpy as np

def lotnorm(pdobject):
    # clean up special characters in the column names and make
    # the date column the index as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column: column.replace('#', '')})
    return pdobject

def lotimport():
    lotret = {}
    # list files in data directory with csv filename
    for lotpath in [f for f in glob.glob("data/*.csv")]:
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret

lotcomb = pd.Series([], dtype=object)
for i in lotimport()['ozlotto'].columns.tolist():
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)

freqvalues = np.unique(lotcomb.to_numpy(), return_counts=True)
lotop = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
lotop.index.name = "Numbers"
lotop.sort_values(by=['Frequency'], ascending=False).head(10)
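Note that Series.append was removed in pandas 2.0. On current pandas, the series-building and counting steps can be collapsed with stack and value_counts (a sketch producing the same top-10 view):
ozlotto = lotimport()['ozlotto']
lotop = (ozlotto.stack()
                .value_counts()
                .rename_axis('Numbers')
                .to_frame('Frequency')
                .head(10))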

How to join a series into a dataframe

So I counted the frequency of the column 'address' from the dataframe 'df_two' and saved the data as a dict, then used that dict to create a series 'new_series'. Now I want to join this series into a dataframe, making 'df_three', so that I can do some maths with the column 'new_count' from 'new_series' and the column 'number' from 'df_two'.
I have tried merge / concat, but the items of 'new_count' were changed to NaN.
This is what I got (NaN):
df_three
number address name new_Count
14 12 ab pra NaN
49 03 cd ken NaN
97 07 ef dhi NaN
91 10 fg rav NaN
Input:
new_series
new_count
12 ab 8778
03 cd 6499
07 ef 5923
10 fg 5631
df_two
number address name
14 12 ab pra
49 03 cd ken
97 07 ef dhi
91 10 fg rav
Output:
df_three
number address name new_Count
14 12 ab pra 8778
49 03 cd ken 6499
97 07 ef dhi 5923
91 10 fg rav 5631
It seems you forgot the on parameter:
df = df_two.join(new_series, on='address')
print (df)
number address name new_count
0 14 12 ab pra 8778
1 49 03 cd ken 6499
2 97 07 ef dhi 5923
3 91 10 fg rav 5631
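For completeness, a minimal sketch that reproduces the join, assuming new_series is built from the frequency dict with its name set (join needs a named Series):
import pandas as pd

df_two = pd.DataFrame({'number': [14, 49, 97, 91],
                       'address': ['12 ab', '03 cd', '07 ef', '10 fg'],
                       'name': ['pra', 'ken', 'dhi', 'rav']})
new_series = pd.Series({'12 ab': 8778, '03 cd': 6499,
                        '07 ef': 5923, '10 fg': 5631}, name='new_count')

df_three = df_two.join(new_series, on='address')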

Plot individuals' home ranges with Adehabitat

I am trying to put the names of the individuals of my research in a home-range polygon plot, but after many attempts I still cannot achieve it.
Here is an example of my data: X and Y are coordinates and id identifies the individuals.
X Y id
29 29 4
44 28 7
57 57 5
60 81 11
32 41 4
43 29 7
57 57 5
46 83 11
32 41 4
43 29 7
57 56 5
60 82 11
35 40 4
43 28 7
62 55 5
54 73 11
27 40 4
43 28 7
61 54 5
First, I calculated the home range of my data with MCP:
cp <- mcp((data)[,1],percent=95, unin = c("m"), unout = c("m2"))
And then plotted it:
plot(cp, axes=TRUE, border = rainbow(12))
But I don't know which polygon corresponds to each individual, and if possible I need to include the id of each individual inside its polygon.
Any help would be appreciated!!
Thanks
Juan
Here is an example using the example data from the adehabitatHR package, since you are not really providing a reproducible example.
library(adehabitatHR)
data("puechabonsp")
cp <- mcp(puechabonsp$relocs[, 1], percent=95, unin = c("m"), unout = c("m2"))
One way would be to use ggplot2 and sf:
library(sf)
library(tidyverse)
st_as_sf(cp) %>%
  ggplot(., aes(fill = id)) +
  geom_sf(alpha = 0.5) +
  scale_fill_discrete(name = "Animal id")
