Pandas groupby value and return observation count to dataset


I have a dataset like the following:
id value
a 0
a 0
a 0
a 0
a 1
a 2
a 2
a 2
b 0
b 0
b 1
b 2
b 2
I want to groupby the "id" column and grab the number of observations in the "value" column, and return a new column in the original dataset that counts the number of times the "value" observation occurs within each id.
An example of the output I'm looking for is represented in column "output":
id value output
a 0 4
a 0 4
a 0 4
a 0 4
a 1 1
a 2 3
a 2 3
a 2 3
b 0 2
b 0 2
b 1 1
b 2 2
b 2 2
When grouping on id "a", there are 4 observations of 0, which is provided in the column "output" for each row that contains id of "a" and value of 0.
I have tried applications of groupby and apply, to no avail. Any suggestions would be very helpful. Thank you.

Update: I figured out a solution for anyone who also faces this problem, and it works well.
grouped = df.groupby(['id','value'])
df['output'] = grouped['value'].transform('count')
This computes the number of observations in each (id, value) bucket and assigns that count to every row in the bucket, as shown in the "output" column above.

Group by id and value, then count the rows in each group:
data.groupby(['id' , 'value'])['id'].transform('count')
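For reference, here is a minimal runnable sketch of the transform approach on the question's sample data. The dataframe is rebuilt by hand here, so the construction code is an assumption; only the groupby/transform line comes from the answers above.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand.
df = pd.DataFrame({
    "id": list("aaaaaaaa") + list("bbbbb"),
    "value": [0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 1, 2, 2],
})

# Count the rows in each (id, value) group and broadcast the count
# back to every row of that group.
df["output"] = df.groupby(["id", "value"])["value"].transform("count")
print(df)
```

Note that 'count' skips missing values, so rows with NaN in the counted column would not contribute to the total.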

Related

how to get value of column2 when column 1 is greater 3 and check this value belong to which Bin

I have one dataframe with two columns, A and B. First I need to make empty bins with step 1 from 1 to 11: (1,2), (2,3), ..., (10,11). Then, wherever the column B value in the original dataframe is greater than 3, I need the value of column 'A' from 2 rows before.
Here is example dataframe :
df=pd.DataFrame({'A':[1,8.5,5.2,7,8,9,0,4,5,6],'B':[1,2,2,2,3.1,3.2,3,2,1,2]})
Required output 1:
df_out1=pd.DataFrame({'Value_A':[8.5,5.2]})
Required_output_2:
df_output2:
Bins count
(1,2) 0
(2,3) 0
(3,4) 0
(4,5) 0
(5,6) 1
(6,7) 0
(7,8) 0
(8,9) 1
(9,10) 0
(10,11) 0

You can index a shifted series to get the value of 'A' a fixed number of rows before each row where 'B' satisfies some condition. A shift of 3 reproduces the required output:
out1 = df['A'].shift(3)[df['B'] > 3]
What you want to do with the bins is known as a histogram. You can do this with numpy:
import numpy as np
count, bin_edges = np.histogram(out1, bins=list(range(1, 12)))
out2 = pd.DataFrame({'bin_lo': bin_edges[:-1], 'bin_hi': bin_edges[1:], 'count': count})
Here 'bin_lo' and 'bin_hi' are the lower and upper bounds of the bins.
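Put together as a self-contained script on the question's dataframe, the two steps might look like the sketch below. It only assembles the lines from the answer; np.arange is used for the bin edges as an equivalent shorthand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 8.5, 5.2, 7, 8, 9, 0, 4, 5, 6],
                   'B': [1, 2, 2, 2, 3.1, 3.2, 3, 2, 1, 2]})

# Value of 'A' three rows above each row where 'B' exceeds 3.
out1 = df['A'].shift(3)[df['B'] > 3]

# Bin edges 1, 2, ..., 11 give ten unit-wide bins.
count, bin_edges = np.histogram(out1, bins=np.arange(1, 12))
out2 = pd.DataFrame({'bin_lo': bin_edges[:-1],
                     'bin_hi': bin_edges[1:],
                     'count': count})
print(out1.tolist())
print(out2)
```

np.histogram treats each bin as half-open, [bin_lo, bin_hi), except the last one, which also includes its upper edge.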

How to extract row before and after when flag change from 0 to 1

I have one dataframe. I want to extract the 2 rows before the flag changes from 0 to 1 and get the row where the value of 'B' is minimum, and also extract the two rows after the flag becomes 1 and get the row with the minimum value of 'B'.
df=pd.DataFrame({'A':[1,3,4,7,8,11,1,15,20,15,16,87],
'B':[1,3,4,6,8,11,1,19,20,15,16,87],
'flag':[0,0,0,0,1,1,1,0,0,0,0,0]})
df_out=pd.DataFrame({'A':[4,1],
'B':[4,1],
'flag':[0,1]})

To find indices of both rows of interest, run:
ind1 = df[df.flag.shift(-1).eq(0) & df.flag.shift(-2).eq(1)].index[0]
ind2 = df[df.index > ind1].B.idxmin()
For your data sample the result is 2 and 6.
Then, to retrieve rows with these indices, run:
df.loc[[ind1, ind2]]
The result is:
A B flag
2 4 4 0
6 1 1 1
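As a runnable sketch, the steps above can be combined on the question's data as follows. The intermediate name is_start is introduced here for readability and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'B': [1, 3, 4, 6, 8, 11, 1, 19, 20, 15, 16, 87],
                   'flag': [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]})

# True for a row when the next row has flag 0 and the row after that has flag 1,
# i.e. the row sitting two positions before the flag switches to 1.
is_start = df.flag.shift(-1).eq(0) & df.flag.shift(-2).eq(1)
ind1 = df[is_start].index[0]

# Index of the smallest 'B' among all rows after ind1.
ind2 = df[df.index > ind1].B.idxmin()

result = df.loc[[ind1, ind2]]
print(result)
```

Note that ind2 is taken over every row after ind1, not only the rows right after the flag change. For this sample the two coincide, but other data may need a narrower slice.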

is it possible to manually assign the value of the dummy variable?

I have a data set for automotive sales and I want to change the feature 'aspiration', which contains two unique values, 'std' and 'turbo', to categorical values using pd.get_dummies, with the code below:
dummy_variable_2 = pd.get_dummies(df['aspiration'])
It is automatically assigning 0 to 'std' & 1 to 'turbo'.
I would like to change to 'std' to 1 & 'turbo' to 0.

pd.get_dummies returns a dataframe that contains one column for each unique value in the input. In each column, only the rows holding the corresponding value are set to one.
In your case, the dataframe contains two columns, one for turbo and one for std. If you want the column in which the std rows are set to one, select it like this:
df = pd.DataFrame({"aspiration":["std", "turbo", "std", "std", "std", "turbo"]})
dummies = pd.get_dummies(df)
std = dummies["aspiration_std"]
In this example, the variable dummies looks like:
aspiration_std aspiration_turbo
0 1 0
1 0 1
2 1 0
3 1 0
4 1 0
5 0 1
and std looks like:
0 1
1 0
2 1
3 1
4 1
5 0
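A self-contained version of the same idea is sketched below. Recent pandas versions return boolean columns from get_dummies by default, so dtype=int is passed here (an addition to the original answer) to get the 1/0 values shown above.

```python
import pandas as pd

df = pd.DataFrame({"aspiration": ["std", "turbo", "std", "std", "std", "turbo"]})

# One indicator column per unique value, named <column>_<value>.
# dtype=int forces 1/0 instead of True/False.
dummies = pd.get_dummies(df, dtype=int)

# The column that is 1 for 'std' and 0 for 'turbo'.
std = dummies["aspiration_std"]
print(dummies)
print(std)
```

Picking the aspiration_std column gives the encoding asked for in the question: 'std' maps to 1 and 'turbo' maps to 0.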

Trying to ignore zero values in Excel column

In an Excel spreadsheet I am working on there are two columns of interest, column B and column E. Column B contains some 0 values, and these are carried over to column E by the loop that I am running with respect to column D. I want to write a Python script that ignores these 0s and puts the next most frequent value into column E.
12NC ModifiedSOCwrt12NC SOC
0 232270463903 0 0
1 232270463903 0 0
2 232270463903 0 0
3 232270463903 0 0
4 232270463903 0 RC0603FR-0738KL
5 232270463903 0 RC0603FR-0738KL
6 232270463903 0 RC0603FR-0738KL
I want to run a loop which picks non-zero values from SOC (column B) and carries it over to ModifiedSOCwrt12NC (column E) based on unique values in Column D.
For example, Column B has values = [0, RCK2] in multiple rows which are based on unique values in column D. So the current loop picks the maximum occurrences of values in column B and fills it into column E. If there is a tie between occurrences of 0 and RCK2, it picks 0 as per the ASCII standard (which I don't want to happen). I want the code to pick RCK2 and fill those in column E.

Since your data is not accessible, I have created similar test data, shown below.
We can read the data with pandas:
import pandas as pd
df = pd.read_excel("ExcelTemplate.xlsx")
df
Index SOC Index2 12NC
0 YXGMY 0 ZJIZX 23445
1 NQHQC 0 JKJKT 23445
2 MWTLY 0 EFCYD 23445
3 RPQFE AC VLOJZ 23445
4 GPLUQ AC AKKKG 23445
5 WGYYM AC DSMLO 23445
6 XGTAQ 0 ZHGWS 45667
7 AMWDT 0 YROLO 45667
The following code does the summarization:
First, summarize the data by 12NC and SOC and take the count
Drop the rows where SOC is 0, then sort by 12NC, count and SOC, with the highest count first
Take the first value of SOC for each 12NC
Merge with the original data to create column E
Export back to Excel
# Count the rows for each (12NC, SOC) pair; the count lands in the 'Index' column.
df1 = df.groupby(['12NC', 'SOC'])['Index'].count().reset_index()
# Keep the most frequent non-zero SOC for each 12NC.
best = (
    df1[df1['SOC'] != 0]
    .sort_values(by=['12NC', 'Index', 'SOC'], ascending=[True, False, True])
    .drop_duplicates(subset=['12NC'], keep='first')[['12NC', 'SOC']]
    .rename(columns={'SOC': 'ModifiedSOCwrt12NC'})
)
df = df.merge(best, on=['12NC'], how='left')
df.to_excel("ExcelTemplate_modifies.xlsx", index=False)
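To try the approach without an Excel file, here is a small in-memory sketch. The dataframe mirrors the test data above, and two details are assumptions that differ from the answer's code: the zeros are filtered out before counting, and the count column is named n.

```python
import pandas as pd

# In-memory stand-in for the spreadsheet (hypothetical data mirroring the table above).
df = pd.DataFrame({
    "12NC": [23445] * 6 + [45667] * 2,
    "SOC": [0, 0, 0, "AC", "AC", "AC", 0, 0],
})

# Ignore the 0 placeholders, then count each remaining (12NC, SOC) pair.
nonzero = df[df["SOC"] != 0]
counts = nonzero.groupby(["12NC", "SOC"]).size().reset_index(name="n")

# Most frequent SOC per 12NC; ties are broken alphabetically by SOC.
best = (
    counts.sort_values(["12NC", "n", "SOC"], ascending=[True, False, True])
    .drop_duplicates(subset="12NC")
    .rename(columns={"SOC": "ModifiedSOCwrt12NC"})[["12NC", "ModifiedSOCwrt12NC"]]
)

result = df.merge(best, on="12NC", how="left")
print(result)
```

A 12NC that has only 0 values, like 45667 here, ends up with a missing value in ModifiedSOCwrt12NC, which can be filled afterwards if a default is preferred.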

Matrix with boolean values from a list of paired observations

In the below spreadsheet, the cell values represent an ID for a person. The person in column A likes the person in column B, but it may not be mutual. So, in the first row with data, person 1 likes 2. In the second row with data person 1 likes 3.
A B
1 2
1 3
2 1
2 4
3 4
4 1
I'm looking for a way to build a 4 x 4 matrix with an entry of 1 in (i,j) to indicate that person i likes person j and an entry of 0 to indicate that they don't. The example above should look like this after performing the task:
1 2 3 4
1 0 1 1 0
2 1 0 0 1
3 0 0 0 1
4 1 0 0 0
So, reading the first row of the matrix we would interpret it like this: person 1 does not like person 1 (cell value = 0), person 1 likes person 2 (cell value = 1), person 1 likes person 3 (cell value =1), person 1 does not like person 4 (cell value = 0)
Note that the order of pairing matters, so [4 2] does not equal [2 4].
How could this be done?

Assuming your existing data is in A1:B6, then in A10 enter:
=COUNTIFS($A$1:$A$6, ROW()-9,$B$1:$B$6, COLUMN())
This will return a 1 or a 0 depending on whether person 1 likes person 1. They don't, so you get a 0. It uses ROW()-9 to return 1 and COLUMN() to return 1 to find the match.
Copy this formula over 4 columns and down 4 rows and that ROW()-9 and COLUMN() formula will return the appropriate values for the check into the COUNTIFS() formula which will look for the matching pair.
Personally, if this was something I had to do and my matrix was of indeterminate size, I would probably stick these formulas on a second tab, starting at A1 and use ROW() where I don't have to adjust it by 9. But for a one off on the same tab, to help check the results, the above is fine.
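If the pairs are available outside Excel, the same matrix could be cross-checked with pandas. This is an alternative sketch, not part of the formula-based answer; it assumes the pairs have been loaded into a dataframe with columns A and B.

```python
import pandas as pd

# The pairs from the question: the person in A likes the person in B.
pairs = pd.DataFrame({"A": [1, 1, 2, 2, 3, 4],
                      "B": [2, 3, 1, 4, 4, 1]})

people = list(range(1, 5))

# Count each (A, B) pair, make sure every person appears as a row and a column,
# and cap the counts at 1 in case a pair is listed more than once.
matrix = (
    pd.crosstab(pairs["A"], pairs["B"])
    .reindex(index=people, columns=people, fill_value=0)
    .clip(upper=1)
)
print(matrix)
```

Rows correspond to column A and columns to column B, so the order of pairing is preserved.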
