Counting the no. of elements in a column and grouping them - python-3.x

Hope you guys are doing well. I have taken up a small project in Python so that I can learn how to code and do basic data analysis along the way. I need some help with counting the number of elements present in a column of a DataFrame and grouping them.
Below is the DataFrame I am using:
dates open high low close volume % change
372 2010-01-05 15:28:00 5279.2 5280.25 5279.1 5279.5 131450
373 2010-01-05 15:29:00 5279.75 5279.95 5278.05 5279.0 181200
374 2010-01-05 15:30:00 5277.3 5279.0 5275.0 5276.45 240000
375 2010-01-06 09:16:00 5288.5 5289.5 5288.05 5288.45 32750 0.22837324337386275
376 2010-01-06 09:17:00 5288.15 5288.75 5285.05 5286.5 55004
377 2010-01-06 09:18:00 5286.3 5289.0 5286.3 5288.2 37650
I would like to create another DF with the count of elements/entries in the % change column, grouped as x<=0.5, 0.5<x<=1, 1<x<=1.5, 1.5<x<=2, 2<x<=2.5, or x>2.5.
Below is the desired output:
Group       no. of instances
x<=0.5      1
0.5<x<=1    0
1<x<=1.5    0
1.5<x<=2    0
2<x<=2.5    0
x>2.5       0
Looking forward to a reply,
Fudgster

You could get the number of elements in each category by using the bins option of the pandas value_counts() method. This returns a Series with the number of records within each specified range.
Here is the code:
df["% change"].value_counts(bins=[0,0.5,1,1.5,2,2.5])

Related

Finding which rows have duplicates in a .csv, but only if they have a certain amount of duplicates

I am trying to determine which sequential rows have at least 50 duplicates within one column. Then I would like to be able to read which rows have the duplicates in a summarized manner, i.e.:
start end total
9 60 51
200 260 60
I'm trying to keep the start and end separate so I can call on them independently later.
I have this to open the .csv file and read its contents:
df = pd.read_csv("BN4 A4-F4, H4_row1_column1_watershed_label.csv", header=None)
df.groupby(0).filter(lambda x: len(x) > 0)
Which gives me this:
0
0 52.0
1 65.0
2 52.0
3 52.0
4 52.0
... ...
4995 8.0
4996 8.0
4997 8.0
4998 8.0
4999 8.0
5000 rows × 1 columns
I'm having a number of problems with this. 1) I'm not sure I totally understand the second function. It seems like it is supposed to group the numbers in my column together. This code:
df.groupby(0).count()
gives me this:
0
0.0
1.0
2.0
3.0
4.0
...
68.0
69.0
70.0
71.0
73.0
65 rows × 0 columns
Which I assume means that there are a total of 65 different unique identities in my column. This just doesn't tell me what they are or where they are. I thought that's what this one would do
df.groupby(0).filter(lambda x: len(x) > 0)
but if I change the 0 to anything else then it screws up my generated list.
Problem 2) I think in order to get the number of duplicates in a sequence, and which rows they are in, I would probably need to use a for loop, but I'm not sure how to build it. So far, I've been pulling my hair out all day trying to figure it out but I just don't think I know Python well enough yet.
Can I get some help, please?
UPDATE
Thanks! So this is what I have thanks to #piterbarg:
#function to identify which behaviors have at least 49 frames, and give the starting, ending, and number of frames
def behavior():
    df2 = (df
           .reset_index()
           .shift(periods=-1)
           .groupby((df[0].diff() != 0).cumsum())   # if the diff between a row and the prev row is not 0, increase cumulative sum
           .agg({0: 'mean', 'index': ['first', 'last', len]}))   # mean is the behavior category
    df3 = (df2.where(df2[('index', 'len')] > 49)
           .dropna()                 # drop N/A
           .astype(int)              # type = int
           .reset_index(drop=True))
    print(df3)
out:
0 index
mean first last len
0 7 32 87 56
1 19 277 333 57
2 1 785 940 156
3 30 4062 4125 64
4 29 4214 4269 56
5 7 4450 4599 150
6 1 4612 4775 164
7 7 4778 4882 105
8 8 4945 4999 56
The current issue is trying to make it so the dataframe includes the last row of my .csv. If anyone happens to see this, I would love your input!
Let's start by mocking a df:
import numpy as np
import pandas as pd

np.random.seed(314)
df = pd.DataFrame({0: np.random.randint(10, size=5000)})
# make sure we have a couple of large blocks
df.loc[300:400,0] = 5
df.loc[600:660,0] = 4
First we identify where the value changes from one row to the next, and group by each such run of consecutive equal values. For each group we record where it starts, where it finishes, and its size:
df2 = (df.reset_index()
       .groupby((df[0].diff() != 0).cumsum())
       .agg({'index': ['first', 'last', len]})
       )
Then we keep only those groups that are longer than 50:
(df2.where(df2[('index', 'len')] > 50)
    .dropna()
    .astype(int)
    .reset_index(drop=True)
)
output:
index
first last len
0 300 400 101
1 600 660 61
As for your question about what df.groupby(0).filter(lambda x: len(x) > 0) does: as far as I can tell, nothing. It groups by the distinct values in column 0 and then discards those groups whose size is 0, which by definition is none of them, so it returns your full df.
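If the intent was to keep only the values that occur often enough in the column, filter does become useful once the threshold is meaningful. A small sketch, assuming you want values of column 0 that appear more than 50 times in total (note this counts occurrences anywhere in the column, not consecutive runs):
# keep only the rows whose value in column 0 occurs more than 50 times overall
frequent = df.groupby(0).filter(lambda x: len(x) > 50)

# the same information without filter: which values are that frequent, and how often
counts = df[0].value_counts()
print(counts[counts > 50])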
Edit
Your code is not quite right; it should be:
def behavior():
    df2 = (df.reset_index()
           .groupby((df[0].diff() != 0).cumsum())
           .agg({0: 'mean', 'index': ['first', 'last', len]}))
    df3 = (df2.where(df2[('index', 'len')] > 50)
           .dropna()
           .astype(int)
           .reset_index(drop=True))
    print(df3)
Note that we define and print df3, not df2. I also amended the code to return, in the mean column, the value that is repeated (sorry, the names are not very intuitive, but you can change them if you want).
first is the index where the repetition starts, last is the last index, and len is how many elements there are.
#function to identify which behaviors have at least 49 frames, and give the starting, ending, and number of frames
def behavior():
    df2 = (df.reset_index()
           .groupby((df[0].diff() != 0).cumsum())   # if the diff between a row and the prev row is not 0, increase cumulative sum
           .agg({0: 'mean', 'index': ['first', 'last', len]}))   # mean is the behavior category
    # .shift(-1)   # still experimenting with this (see the comment after the output)
    df3 = (df2.where(df2[('index', 'len')] > 49)
           .dropna()                 # drop N/A
           .astype(int)              # type = int
           .reset_index(drop=True))
    print(df3)
yields this:
0 index
mean first last len
0 7 31 86 56
1 19 276 332 57
2 1 784 939 156
3 31 4061 4124 64
4 29 4213 4268 56
5 8 4449 4598 150
6 1 4611 4774 164
7 8 4777 4881 105
8 8 4944 4999 56
Which I love. I did notice that the group with 56x duplicates of '7' actually starts on row 32, and ends on row 87 (just one later in both cases, and the pattern is consistent throughout the sheet). Am I right in believing that this can be fixed with the shift() function somehow? I'm toying around with this still :D

Pandas - number of occurrences of IDs from a column in one dataframe in several columns of a second dataframe

I'm new to python and pandas, and trying to "learn by doing."
I'm currently working with two football/soccer (depending on where you're from!) dataframes:
player_table has several columns, among others 'player_name' and 'player_id'
player_id player_name
0 223 Lionel Messi
1 157 Cristiano Ronaldo
2 962 Neymar
match_table also has several columns, among others 'home_player_1', '..._2', '..._3' and so on, as well as the corresponding 'away_player_1', '...2' , '..._3' and so on. The content of these columns is a player_id, such that you can tell which 22 (2x11) players participated in a given match through their respective unique IDs.
I'll just post a 2 vs. 2 example here, because that works just as well:
match_id home_player_1 home_player_2 away_player_1 away_player_2
0 321 223 852 729 853
1 322 223 858 157 159
2 323 680 742 223 412
What I would like to do now is add a new column to player_table, player_table['appearances'], which gives the number of appearances by counting the number of times each player_id appears in the part of match_table bounded horizontally by (home_player_1, away_player_2) and vertically by (first match, last match).
Desired result:
player_id player_name appearances
0 223 Lionel Messi 3
1 157 Cristiano Ronaldo 1
2 962 Neymar 0
Coming from other programming languages, I think my standard solution would be a nested for loop, but I understand that is frowned upon in Python...
I have tried several solutions but none really work; this one at least seems to give the number of appearances as "home_player_1":
player_table['appearances'] = player_table['player_id'].map(match_table['home_player_1'].value_counts())
Is there a way to expand the map function to include several columns in a dataframe? Or do I have to stack the 22 columns on top of one another in a new dataframe, and then map? Or is map not the appropriate function?
Would really appreciate your support, thanks!
Philipp
Edit: added specific input and desired output as requested
What you could do is use .melt() on the match_table player columns (this turns your wide table into a tall/long table with a single value column). Then do a .value_counts() on that one column. Finally, join it to player_table on the 'player_id' column.
import pandas as pd

player_table = pd.DataFrame({'player_id': [223, 157, 962],
                             'player_name': ['Lionel Messi', 'Cristiano Ronaldo', 'Neymar']})
match_table = pd.DataFrame({
    'match_id': [321, 322, 323],
    'home_player_1': [223, 223, 680],
    'home_player_2': [852, 858, 742],
    'away_player_1': [729, 157, 223],
    'away_player_2': [853, 159, 412]})

player_cols = [x for x in match_table.columns if 'player_' in x]
match_table[player_cols].value_counts(sort=True)

df1 = (match_table[player_cols]
       .melt(var_name='columns', value_name='appearances')['appearances']
       .value_counts(sort=True)
       .reset_index(drop=False)
       .rename(columns={'index': 'player_id'}))

appearances_df = (df1.merge(player_table, how='right', on='player_id')
                  [['player_id', 'player_name', 'appearances']]
                  .fillna(0))
Output:
print(appearances_df)
player_id player_name appearances
0 223 Lionel Messi 3.0
1 157 Cristiano Ronaldo 1.0
2 962 Neymar 0.0
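As a side note on the original "can map be expanded to several columns?" question: an alternative sketch, using the same player_table/match_table built above, is to stack the player columns into a single Series first and then map the counts back, which also keeps the integer dtype:
counts = match_table.filter(like='player_').stack().value_counts()
player_table['appearances'] = (player_table['player_id']
                               .map(counts)
                               .fillna(0)
                               .astype(int))
print(player_table)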

Pandas shift with DatetimeIndex takes too long to run

I have a running-time issue with shifting a large dataframe that has a datetime index.
Example using dummy data:
df = pd.DataFrame({'col1': [0,1,2,3,4,5,6,7,8,9,10,11,12,13]*10**5,
                   'col3': list(np.random.randint(0, 100000, 14*10**5)),
                   'col2': list(pd.date_range('2020-01-01', '2020-08-01', freq='M'))*2*10**5})
df.col3 = df.col3.astype(str)
df.drop_duplicates(subset=['col3', 'col2'], keep='first', inplace=True)
If I shift without using the datetime index, it only takes about 12 s:
%%time
tmp=df.groupby('col3')['col1'].shift(2,fill_value=0)
Wall time: 12.5 s
But when I use the datetime index, as my situation requires, it takes about 40 minutes:
%%time
tmp=df.set_index('col2').groupby('col3')['col1'].shift(2,freq='M',fill_value=0)
Wall time: 40min 25s
In my situation, I need the data from shift(1) through shift(6) and to merge them with the original data by col2 and col3, so I use a for loop and merge.
Is there any solution for this? Thanks for your answer; I will greatly appreciate any response.
Ben's answer solves it:
%%time
tmp = (df1[['col1', 'col3', 'col2']]
       .assign(col2=lambda x: x['col2'] + MonthEnd(2))
       .set_index(['col3', 'col2'])
       .add_suffix(f'_{2}')
       .fillna(0)
       .reindex(pd.MultiIndex.from_frame(df1[['col3', 'col2']]))
       .reset_index())
Wall time: 5.94 s
I also implemented it in the loop:
%%time
res = (pd.concat([df1.assign(col2=lambda x: x['col2'] + MonthEnd(i))
                  .set_index(['col3', 'col2'])
                  .add_suffix(f'_{i}')
                  for i in range(0, 7)],
                 axis=1)
       .fillna(0)
       .reindex(pd.MultiIndex.from_frame(df1[['col3', 'col2']]))
       .reset_index())
Wall time: 1min 44s
Actually, my real data already uses MonthEnd(0), so I just loop over range(1,7). I also applied it to multiple columns, so I don't use astype, and I used reindex because I want a left merge.
The two operations are slightly different, and the results are not the same, because your data (at least the dummy data here) is not ordered, and especially if you have missing dates for some col3 values. That said, the time difference seems enormous, so I think you should go about it a bit differently.
One way is to add X MonthEnd offsets to col2 for X from 0 to 6, concat all of them after set_index on col3 and col2, and add_suffix to keep track of the "shift" value. Then fillna and convert the dtype back to the original one. The rest is mostly cosmetic, depending on your needs.
from pandas.tseries.offsets import MonthEnd
res = (
pd.concat([
df.assign(col2 = lambda x: x['col2'] + MonthEnd(i))
.set_index(['col3', 'col2'])
.add_suffix(f'_{i}')
for i in range(0,7)],
axis=1)
.fillna(0)
# depends on your original data
.astype(df['col1'].dtype)
# if you want a left merge ordered like original df
#.reindex(pd.MultiIndex.from_frame(df[['col3','col2']]))
# if you want col2 and col3 back as columns
# .reset_index()
)
Note that concat does an outer join by default, so you end up with months that were not in your original data; col1_0 is actually the original data (with my random numbers).
print(res.head(10))
col1_0 col1_1 col1_2 col1_3 col1_4 col1_5 col1_6
col3 col2
0 2020-01-31 7 0 0 0 0 0 0
2020-02-29 8 7 0 0 0 0 0
2020-03-31 2 8 7 0 0 0 0
2020-04-30 3 2 8 7 0 0 0
2020-05-31 4 3 2 8 7 0 0
2020-06-30 12 4 3 2 8 7 0
2020-07-31 13 12 4 3 2 8 7
2020-08-31 0 13 12 4 3 2 8
2020-09-30 0 0 13 12 4 3 2
2020-10-31 0 0 0 13 12 4 3
This is an issue with groupby + shift. The problem is that if you specify an axis other than 0 or a frequency, it falls back to a very slow loop over the groups. If neither of those is specified, it is able to use a much faster path, which is why you see an order-of-magnitude difference in performance.
The relevant code for DataFrameGroupBy.shift is:
def shift(self, periods=1, freq=None, axis=0, fill_value=None):
    """..."""
    if freq is not None or axis != 0:
        return self.apply(lambda x: x.shift(periods, freq, axis, fill_value))
Previously, this issue also extended to specifying a fill_value.
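In practice the slow fallback can usually be avoided by keeping the positional (freq-less) shift and handling the date offset separately, which is the idea behind the MonthEnd/concat answer above. A rough sketch, assuming the dummy df from the question:
from pandas.tseries.offsets import MonthEnd

# fast, vectorized path: no freq passed to the grouped shift
shifted_vals = df.groupby('col3')['col1'].shift(2, fill_value=0)

# if the shifted month-end dates themselves are needed, offset col2 separately
shifted_dates = df['col2'] + MonthEnd(2)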

Selecting columns/axes for correlation from Pandas df

I have a pandas dataframe like the one below. I would like to build a correlation matrix that establishes the relationship between product ownership and the profit/cost/rev for a series of customer records.
prod_owned_a prod_owned_b profit cost rev
0 1 0 100 75 175
1 0 1 125 100 225
2 1 0 100 75 175
3 1 1 225 175 400
4 0 1 125 100 225
Ideally, the matrix will have all prod_owned along one axis with profit/cost/rev along another. I would like to avoid including the correlation between prod_owned_a and prod_owned_b in the correlation matrix.
Question: How can I select specific columns for each axis? Thank you!
As long as the order of the columns does not change, you can use slicing:
df.corr().loc[:'prod_owned_b', 'profit':]
# profit cost rev
#prod_owned_a 0.176090 0.111111 0.147442
#prod_owned_b 0.616316 0.666667 0.638915
A more robust solution locates all "prod_*" columns:
prod_cols = df.columns.str.match('prod_')
df.corr().loc[prod_cols, ~prod_cols]
# profit cost rev
#prod_owned_a 0.176090 0.111111 0.147442
#prod_owned_b 0.616316 0.666667 0.638915
Not very optimized, but still:
df.corr().loc[['prod_owned_a', 'prod_owned_b'], ['profit', 'cost', 'rev']]
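For completeness, here is a self-contained sketch of the str.match approach, re-typing the sample data from the question (treat the numbers as illustrative):
import pandas as pd

df = pd.DataFrame({
    'prod_owned_a': [1, 0, 1, 1, 0],
    'prod_owned_b': [0, 1, 0, 1, 1],
    'profit': [100, 125, 100, 225, 125],
    'cost':   [75, 100, 75, 175, 100],
    'rev':    [175, 225, 175, 400, 225],
})
prod_cols = df.columns.str.match('prod_')
print(df.corr().loc[prod_cols, ~prod_cols])   # products on the rows, metrics on the columns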

Use pandas dataframe column values to pivot other columns

I have the following dataframe which I want to reshape:
dir hour board_sign pass
1 5 d 294
1 5 u 342
1 6 d 1368
1 6 u 1268
1 7 d 3880
1 7 u 3817
What I want to do is use the values from "board_sign" as new columns containing the values from the "pass" column, so that the dataframe looks like this:
dir hour d u
1 5 294 342
1 6 1368 1268
1 7 3880 3817
I already tried several functions, such as melt, pivot, stack, and unstack, but it seems none of them gives the wanted result. I also tried pivot_table, but it makes iterating difficult because of the MultiIndex.
It seems like an easy operation, but I just can't get it right.
Is there any other function I can use for this?
Thanks.
Use pivot_table:
df = df.pivot_table(index=['dir', 'hour'], columns='board_sign', values='pass').reset_index()
del df.columns.name
df
dir hour d u
0 1 5 294 342
1 1 6 1368 1268
2 1 7 3880 3817
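If the del df.columns.name line looks surprising, an equivalent way to drop the leftover 'board_sign' axis label is rename_axis. A small sketch, assuming the same df as in the question:
df = (df.pivot_table(index=['dir', 'hour'], columns='board_sign', values='pass')
      .rename_axis(columns=None)   # drop the leftover 'board_sign' columns name
      .reset_index())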
