How can we groupby selected row values from a column and assign it to a new column in pandas df? - python-3.x

Id B
1 6
2 13
1 6
2 6
1 6
2 6
1 10
2 6
2 6
2 6
I want a new column, say C, that holds the per-Id count of rows where B equals 6.
Jan18.loc[Jan18['Enquiry Purpose']==6].groupby(Jan18['Member Reference']).transform('count')
Id B No_of_6
1 6 3
2 13 5
1 6 3
2 6 5
1 6 3
2 6 5
1 10 3
2 6 5
2 6 5
2 6 5

Compare values with Series.eq (the method equivalent of ==), convert the booleans to integers, and use GroupBy.transform to fill a new column with the sum per group:
df['No_of_6'] = df['B'].eq(6).astype(int).groupby(df['Id']).transform('sum')
#alternative
#df['No_of_6'] = df.assign(B= df['B'].eq(6).astype(int)).groupby('Id')['B'].transform('sum')
print (df)
Id B No_of_6
0 1 6 3
1 2 13 5
2 1 6 3
3 2 6 5
4 1 6 3
5 2 6 5
6 1 10 3
7 2 6 5
8 2 6 5
9 2 6 5
More generally, build a boolean mask from your condition(s) and pass it on as below:
mask = df['B'].eq(6)
#alternative
#mask = (df['B'] == 6)
df['No_of_6'] = mask.astype(int).groupby(df['Id']).transform('sum')
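Put together, the boolean-mask pattern looks like this end-to-end (a minimal runnable sketch using the sample data from the question):

```python
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 1, 2, 1, 2, 1, 2, 2, 2],
                   'B':  [6, 13, 6, 6, 6, 6, 10, 6, 6, 6]})

# Build the mask once, then let transform('sum') broadcast the
# per-group count of True values back onto every row.
mask = df['B'].eq(6)
df['No_of_6'] = mask.astype(int).groupby(df['Id']).transform('sum')
```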

A solution using map. Note that this returns NaN for Id groups that contain no 6:
df['No_of_6'] = df.Id.map(df[df.B.eq(6)].groupby('Id').B.count())
Out[113]:
Id B No_of_6
0 1 6 3
1 2 13 5
2 1 6 3
3 2 6 5
4 1 6 3
5 2 6 5
6 1 10 3
7 2 6 5
8 2 6 5
9 2 6 5
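If the NaN for groups without a 6 is unwanted, chaining fillna(0) converts it to an explicit zero count (a sketch using a small frame where some Ids have no 6 at all):

```python
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 1, 3], 'B': [6, 13, 6, 7]})

# Id 2 and 3 have no 6, so map() yields NaN for them;
# fillna(0) turns that into an explicit zero count.
df['No_of_6'] = (df.Id.map(df[df.B.eq(6)].groupby('Id').B.count())
                   .fillna(0)
                   .astype(int))
```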

Related

Remove rows from Dataframe where row above or below has same value in a specific column

Starting Dataframe:
A B
0 1 1
1 1 2
2 2 3
3 3 4
4 3 5
5 1 6
6 1 7
7 1 8
8 2 9
Desired result - eg. Remove rows where column A has values that match the row above or below:
A B
0 1 1
2 2 3
3 3 4
5 1 6
8 2 9
You can use boolean indexing; the following condition returns True when the value of A is NOT equal to the value of A in the previous row, so only the first row of each consecutive run is kept:
new_df = df[df['A'].ne(df['A'].shift())]
A B
0 1 1
2 2 3
3 3 4
5 1 6
8 2 9
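The shift-based mask can be verified end-to-end (a runnable sketch built from the sample frame above):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 3, 1, 1, 1, 2],
                   'B': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# shift() moves A down one row, so ne() flags every row whose value
# differs from the row above; the first row compares against NaN and
# is always kept.
new_df = df[df['A'].ne(df['A'].shift())]
```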

Count total rows of an Id from another column

I have a dataframe
Initialise data as lists:
data = {'Id':['1', '2', '3', '4','5','6','7','8','9','10'], 'reply_id':[2, 2,2, 5,5,6,8,8,1,1]}
Create DataFrame
df = pd.DataFrame(data)
Id reply_id
0 1 2
1 2 2
2 3 2
3 4 5
4 5 5
5 6 6
6 7 8
7 8 8
8 9 1
9 10 1
For every Id, I want the total number of times it occurs in reply_id, in a new column new.
For example, Id=1 occurs 2 times in reply_id, which I want in the new column.
Desired output
Id reply_id new
0 1 2 2
1 2 2 3
2 3 2 0
3 4 5 0
4 5 5 2
5 6 6 1
6 7 8 0
7 8 8 2
8 9 1 0
9 10 1 0
I have tried this line of code:
df['new'] = df.reply_id.eq(df.Id).astype(int).groupby(df.Id).transform('sum')
In this answer, I use Series.value_counts to count the values in reply_id and convert the result to a dict. Then I use Series.map on the Id column to associate each Id with its count. fillna(0) fills in Ids that never appear in reply_id:
df['new'] = (df['Id']
.astype(int)
.map(df['reply_id'].value_counts().to_dict())
.fillna(0)
.astype(int))
Use Series.groupby on the column reply_id, then the aggregation GroupBy.count to create a mapping series counts; finally use Series.map to map the values in the Id column to their respective counts. Since Id is stored as strings in this frame, cast it to int first so it matches the integer index of counts:
counts = df['reply_id'].groupby(df['reply_id']).count()
df['new'] = df['Id'].astype(int).map(counts).fillna(0).astype(int)
Result:
# print(df)
Id reply_id new
0 1 2 2
1 2 2 3
2 3 2 0
3 4 5 0
4 5 5 2
5 6 6 1
6 7 8 0
7 8 8 2
8 9 1 0
9 10 1 0
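Both answers can be checked against the sample data (a runnable sketch; note that Id is stored as strings in the question's frame, hence the astype(int) before mapping):

```python
import pandas as pd

df = pd.DataFrame({'Id': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
                   'reply_id': [2, 2, 2, 5, 5, 6, 8, 8, 1, 1]})

# value_counts() gives {2: 3, 5: 2, 8: 2, 1: 2, 6: 1}; mapping it onto
# the (int-cast) Id column yields the occurrence count per Id.
df['new'] = (df['Id'].astype(int)
               .map(df['reply_id'].value_counts())
               .fillna(0)
               .astype(int))
```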

Counting the pairs that come together with a high value in a dataset

I have a set of data with column headings A, B, C, D, E ... K, and the cells hold values between 0 and 6. I am looking for a way to count and list the pairs or triples of columns that have high values (4, 5, 6) in the same row.
For example, if columns A and B have 5 and 6 respectively in the same row, that row counts as an occurrence of the pair. If the values are 1 and 6, or 1 and 5, etc., the row is skipped. A row counts only if both columns (there can be more than 2) have high values.
Basically, I want to count and list the columns that have high values in the same row. I am open to all types of solutions, and I'd really appreciate it if someone could guide me. Thanks.
Example Output:
Pairs Number of Occurrences (can be (5,6), (4,6), (5,5), (4,5), (6,6))
AB 10
BC 20
CE 30
Here is a sample of my data (this is just part of the actual data, not the complete list). I originally said values between 0 and 6; I have since deleted the 0s, so those cells are now blank.
A B C D E F G H I J K L M
3 3 2 4 2 4 5 4 2 2 4 3 3
2 4 3 3 3 3 6 4 2 3 3 2 4
3 3 2 4 2 4 3 3 3 3 3 3 3
3 3 4 2 4 2 4 3 3 5 1 3 3
2 4 4 2 4 2 3 6 4 2 2 4
2 4 2 4 2 4 3 3 3 3 3 2 4
3 3 2 4 2 4 3 3 3 3 3 3 3
5 1 2 4 2 4 3 3 3 3 3 5 1
2 4 1 5 1 5 3 4 2 3 3 2 4
3 3 2 4 2 4 3 3 3 3 3 3 3
5 1 2 4 2 4 2 3 3 3 3 5 1
3 3 2 4 2 4 3 4 2 4 2 3 3
4 2 3 3 3 3 3 3 3 4 2 4 2
3 3 3 3 3 3 3 3 3 6 0 3 3
2 4 3 3 3 3 3 4 2 5 1 2 4
4 2 2 4 2 4 3 1 5 3 3 4 2
2 4 4 2 4 2 4 3 3 3 3 2 4
3 3 2 4 2 4 3 2 4 4 2 3 3
3 3 4 2 4 2 4 3 3 3 3 3 3
4 2 2 4 2 4 3 3 3 3 3 4 2
2 4 3 3 3 3 3 3 3 4 2 2 4
2 4 2 4 2 4 2 2 4 4 2 2 4
4 2 3 3 3 3 5 4 2 1 5 4 2
3 3 3 3 3 3 3 4 2 3 3 3 3
1 5 2 4 2 4 3 4 2 2 4 1 5
5 1 4 2 4 2 6 1 5 3 3 5 1
4 2 1 5 1 5 3 3 3 2 4 4 2
1 5 2 4 2 4 1 3 3 3 3 1 5
2 4 4 2 4 2 1 2 4 2 4 2 4
4 2 5 1 5 1 2 4 2 3 3 4 2
4 2 1 5 1 5 4 1 5 4 2 4 2
2 4 3 3 3 3 3 3 3 6 0 2 4
4 2 2 4 2 4 3 3 3 3 3 4 2
I made two helper columns that list the pairs of columns, then used this formula to count the pairs of (4,5), (4,6), and (5,6):
= SUMPRODUCT(COUNTIFS(INDEX($A:$M,0,MATCH(O2,$A$1:$M$1,0)),{4,4,5,5,6,6},
INDEX($A:$M,0,MATCH(P2,$A$1:$M$1,0)),{5,6,6,4,4,5}))
EDIT: Based on your most recent comment, the formula is updated to this (the condition ">3" counts any cell holding 4, 5, or 6):
= COUNTIFS(INDEX($A:$M,0,MATCH(O2,$A$1:$M$1,0)),">3",
INDEX($A:$M,0,MATCH(P2,$A$1:$M$1,0)),">3")
See the example below. I didn't do it for every single pair of columns, but it gives you a good start.
Note that your original data is to the left in my spreadsheet; I didn't show it here, just to save space.
Here goes a VBA solution exploiting a Dictionary (which requires adding a reference to the Microsoft Scripting Runtime library):
Option Explicit
Sub main()
Dim col As Range
Dim cell As Range
Dim pairDict As Scripting.Dictionary
Set pairDict = New Scripting.Dictionary
With Worksheets("rates")
With .Range("a1").CurrentRegion
For Each col In .Columns.Resize(, .Columns.Count - 1) 'loop through referenced range columns except the last one
.AutoFilter Field:=col.Column, Criteria1:=">4" 'filter reference range on current column with values > 4
If Application.WorksheetFunction.Subtotal(103, col) > 1 Then ' if any filtered cells except header
For Each cell In Intersect(.Offset(, col.Column).Resize(, .Columns.Count - col.Column), .Resize(.Rows.Count - 1).Offset(1).SpecialCells(xlCellTypeVisible).EntireRow) 'loop through each row of filtered cells from one column right of current one to the last one
If cell.Value > 4 Then pairDict(.Cells(1, col.Column).Value & .Cells(1, cell.Column).Value) = pairDict(.Cells(1, col.Column).Value & .Cells(1, cell.Column).Value) + 1 ' if current cell value is >4 then update dictionary with key=combination of columns pair first row content and value=value+1
Next
End If
.AutoFilter 'remove current filter
Next
End With
.AutoFilterMode = False 'remove filters headers
End With
If pairDict.Count > 0 Then ' if any pair found
Dim key As Variant
For Each key In pairDict.Keys 'loop through each dictionary key
Debug.Print key, pairDict(key) 'print the key (i.e. the pair of matching columns first row content) and the value ( i.e. the number of occurrences found)
Next
End If
End Sub
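For completeness, the same pair counting can be sketched in pandas with itertools.combinations, assuming the sheet has been loaded into a DataFrame and "high" means greater than 3, matching the edited COUNTIFS formula (the small three-column frame here is illustrative, not the asker's actual data):

```python
from itertools import combinations
import pandas as pd

df = pd.DataFrame({'A': [5, 1, 4],
                   'B': [6, 2, 5],
                   'C': [3, 6, 6]})

high = df.gt(3)  # boolean mask: True wherever a cell holds 4, 5, or 6

# For every pair of columns, count the rows where both cells are high.
pair_counts = {a + b: int((high[a] & high[b]).sum())
               for a, b in combinations(df.columns, 2)}
```

The same dictionary comprehension extends to triples by passing 3 to combinations and AND-ing three masks.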

Repeating elements in a dataframe

Hi all, I have the following dataframe:
A | B | C
1 2 3
2 3 4
3 4 5
4 5 6
And I am trying to repeat only the last two rows of the data, so that it looks like this:
A | B | C
1 2 3
2 3 4
3 4 5
3 4 5
4 5 6
4 5 6
I have tried using append, concat and repeat to no avail.
repeated = lambda x:x.repeat(2)
df.append(df[-2:].apply(repeated),ignore_index=True)
This returns the following dataframe, which is incorrect:
A | B | C
1 2 3
2 3 4
3 4 5
4 5 6
3 4 5
3 4 5
4 5 6
4 5 6
You can use numpy.repeat to repeat the last two index values, create df1 by selecting those rows with loc, and then append df1 to the original after first dropping the last 2 rows with iloc:
df1 = df.loc[np.repeat(df.index[-2:].values, 2)]
print (df1)
A B C
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
print (df.iloc[:-2])
A B C
0 1 2 3
1 2 3 4
df = df.iloc[:-2].append(df1,ignore_index=True)
print (df)
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 3 4 5
4 4 5 6
5 4 5 6
If you want to use your own code, add iloc to filter only the last 2 rows before appending:
repeated = lambda x:x.repeat(2)
df = df.iloc[:-2].append(df.iloc[-2:].apply(repeated),ignore_index=True)
print (df)
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 3 4 5
4 4 5 6
5 4 5 6
Use pd.concat and index slicing with .iloc (note that the final sort_values relies on column A being sorted):
pd.concat([df,df.iloc[-2:]]).sort_values(by='A')
Output:
A B C
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
I'm partial to manipulating the index into the pattern we are aiming for then asking the dataframe to take the new form.
Option 1
Use pd.DataFrame.reindex
df.reindex(df.index[:-2].append(df.index[-2:].repeat(2)))
A B C
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
Same thing in multiple lines
i = df.index
idx = i[:-2].append(i[-2:].repeat(2))
df.reindex(idx)
You could also use loc:
i = df.index
idx = i[:-2].append(i[-2:].repeat(2))
df.loc[idx]
Option 2
Reconstruct from values. Only do this if all dtypes are the same.
i = np.arange(len(df))
idx = np.append(i[:-2], i[-2:].repeat(2))
pd.DataFrame(df.values[idx], df.index[idx])
0 1 2
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
Option 3
You can also use an np.array with iloc:
i = np.arange(len(df))
idx = np.append(i[:-2], i[-2:].repeat(2))
df.iloc[idx]
A B C
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
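The index-repetition idea behind all three options can be verified with the sample frame (a runnable sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [2, 3, 4, 5],
                   'C': [3, 4, 5, 6]})

# Keep the leading positions once, repeat the last two positions
# twice each, then take the rows in that order.
i = np.arange(len(df))
idx = np.append(i[:-2], i[-2:].repeat(2))
result = df.iloc[idx].reset_index(drop=True)
```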

pandas moving aggregate string

import pandas as pd
from io import StringIO  # Python 3; the original Python 2 code used the StringIO module
df = pd.read_csv(StringIO('''id months state
1 1 C
1 2 3
1 3 6
1 4 9
2 1 C
2 2 C
2 3 3
2 4 6
2 5 9
2 6 9
2 7 9
2 8 C
'''), delimiter= '\t')
I want to create a column showing the cumulative state of column state, by id.
id months state result
1 1 C C
1 2 3 C3
1 3 6 C36
1 4 9 C369
2 1 C C
2 2 C CC
2 3 3 CC3
2 4 6 CC36
2 5 9 CC369
2 6 9 CC3699
2 7 9 CC36999
2 8 C CC36999C
Basically, the cumulative concatenation of a string column. What is the best way to do it?
So long as the dtype is str, you can do the following:
In [17]:
df['result']=df.groupby('id')['state'].apply(lambda x: x.cumsum())
df
Out[17]:
id months state result
0 1 1 C C
1 1 2 3 C3
2 1 3 6 C36
3 1 4 9 C369
4 2 1 C C
5 2 2 C CC
6 2 3 3 CC3
7 2 4 6 CC36
8 2 5 9 CC369
9 2 6 9 CC3699
10 2 7 9 CC36999
11 2 8 C CC36999C
Essentially we group by the 'id' column and then apply a lambda that returns the cumsum. This performs a cumulative concatenation of the string values and returns a Series with its index aligned to the original df, so you can add it as a column.
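A minimal sketch of the same idea; group_keys=False is added so the result index stays aligned with the original frame on newer pandas versions:

```python
import pandas as pd

df = pd.DataFrame({'id':    [1, 1, 1, 2, 2],
                   'state': ['C', '3', '6', 'C', '3']})

# cumsum on an object/string column concatenates, so each row holds
# the running concatenation of 'state' within its id group.
df['result'] = (df.groupby('id', group_keys=False)['state']
                  .apply(lambda s: s.cumsum()))
```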
