Pandas "countif" based on column value and multiindex

Pandas "countif" based on column value and multiindex - python-3.x

I have a DataFrame that looks like:
Where I have YEAR and RACEETHN as a multiindex. I want to to count the number of "1" values (note, the data are not only 0 and 1 so I cannot do a sum) for each YEAR and RACEETHN combination for each column variable.
I am able to count where value = 1 for each column by doing this:
(df_3.ACSUPPSV == 1).sum()
(df_3.PSEDSUPPSV == 1).sum()
I want to do this with groupby, but am unable to get it to work. I've tried the following code to test if I could do it on a single column 'ACSUPPSV' and it did no work:
df.groupby(['YEAR', 'RACEETHN']).loc[df.ACSUPPSV == 1, 'ACSUPPSV'].count()
I exported the data to excel and was able to calculate this with a quick "COUNTIF" formula, but I know there must be a way to do this in pandas - the results from excel look like:
Would appreciate if someone had a better way to do this than export to Excel! :)

I think you need agg with custom function for count 1 only:
df_3 = pd.DataFrame({'ACSUPPSV':[1,1,1,1,0,1],
'PSEDSUPPSV':[1,1,0,1,0,0],
'BUDGETSV':[1,0,1,1,1,0],
'YEAR':[2000,2000,2001,2000,2000,2000],
'RACEETHN':list('aaabbb')}).set_index(['YEAR','RACEETHN'])
print (df_3)
ACSUPPSV BUDGETSV PSEDSUPPSV
YEAR RACEETHN
2000 a 1 1 1
a 1 0 1
2001 a 1 1 0
2000 b 1 1 1
b 0 1 0
b 1 0 0
df2 = df_3.groupby(['YEAR', 'RACEETHN']).agg(lambda x: (x == 1).sum())
print (df2)
ACSUPPSV BUDGETSV PSEDSUPPSV
YEAR RACEETHN
2000 a 2 1 2
b 2 2 1
2001 a 1 1 0
Old answer:
df_3[((df_3.ACSUPPSV == 1) & (df_3.PSEDSUPPSV == 1))].groupby(['YEAR', 'RACEETHN']).size()
df_3.query('ACSUPPSV == 1 & PSEDSUPPSV == 1').groupby(['YEAR', 'RACEETHN']).size()
More general:
cols = ['ACSUPPSV','PSEDSUPPSV']
df_3[(df_3[cols] == 1).all(axis=1)].groupby(['YEAR', 'RACEETHN']).size()
For all columns:
df_3[(df_3 == 1).all(axis=1)].groupby(['YEAR', 'RACEETHN']).size()
EDIT:
Or maybe need:
df_3.groupby(['YEAR', 'RACEETHN']).agg(lambda x: (x == 1).sum())

Related

Updating Pandas data fram cells by condition

I have a data frame and want to update specific cells in a column based on a condition on another column.
ID Name Metric Unit Value
2 1 K2 M1 msecond 1
3 1 K2 M2 NaN 10
4 2 K2 M1 usecond 500
5 2 K2 M2 NaN 8
The condition is, if Unit string is msecond, then multiply the corresponding value in Value column by 1000 and store it in the same place. Considering a constant step for row iteration (two-by-two), the following code is not correct
i = 0
while i < len(df_group):
x = df.iloc[i].at["Unit"]
if x == 'msecond':
df.iloc[i].at["Value"] = df.iloc[i].at["Value"] * 1000
i += 2
However, the output is the same as before modifications. How can I fix that? Also what are the alternatives for better coding instead of that while loop?

A much simpler (and more efficient) form would be to use loc:
df.loc[df['Unit'] == 'msecond', 'Value'] *= 100
If you consider it essentially to only update a specific step of indexes:
step = 2
start = 0
df.loc[df['Unit'].eq('msecond') & (df.index % step == start), 'Value'] *= 100

Python and Pandas, find rows that contain value, target column has many sets of ranges

I have a messy dataframe where I am trying to "flag" the rows that contain a certain number in the ids column. The values in this column represent an inclusive range: for example, "row 4" contains the following numbers:
2409,2410,2411,2412,2413,2414,2377,2378,1478,1479,1480,1481,1482,1483,1484 And in "row 0" and "row 1" the range for one of the sets is backwards (1931,1930,1929)
If I want to know which rows have sets that contain "2340" and "1930" for example, how would I do this? I think a loop is needed, sometimes will need to query more than just two numbers. Using Python 3.8.
Example Dataframe
x = ['1331:1332,1552:1551,1931:1928,1965:1973,1831:1811,1927:1920',
'1331:1332,1552:1551,1931:1929,180:178,1966:1973,1831:1811,1927:1920',
'2340:2341,1142:1143,1594:1593,1597:1596,1310,1311',
'2339:2341,1142:1143,1594:1593,1597:1596,1310:1318,1977:1974',
'2409:2414,2377:2378,1478:1484',
'2474:2476',
]
y = [6.48,7.02,7.02,6.55,5.99,6.39,]
df = pd.DataFrame(list(zip(x, y)), columns =['ids', 'val'])
display(df)
Desired Output Dataframe

I would write a function that perform 2 steps:
Given the ids_string that contains the range of ids, list all the ids as ids_num_list
Check if the query_id is in the ids_num_list
def check_num_in_ids_string(ids_string, query_id):
# Convert ids_string to ids_num_list
ids_range_list = ids_string.split(',')
ids_num_list = set()
for ids_range in ids_range_list:
if ':' in ids_range:
lower, upper = sorted(ids_range.split(":"))
num_list = list(range(int(lower), int(upper)+ 1))
ids_num_list.update(num_list)
else:
ids_num_list.add(int(ids_range))
# Check if query number is in the list
if int(query_id) in ids_num_list:
return 1
else:
return 0
# Example usage
query_id_list = ['2340', '1930']
for query_id in query_id_list:
df[f'n{query_id}'] = (
df['ids']
.apply(lambda x : check_num_in_ids_string(x, query_id))
)
which returns you what you require:
ids val n2340 n1930
0 1331:1332,1552:1551,1931:1928,1965:1973,1831:1... 6.48 0 1
1 1331:1332,1552:1551,1931:1929,180:178,1966:197... 7.02 0 1
2 2340:2341,1142:1143,1594:1593,1597:1596,1310,1311 7.02 1 0
3 2339:2341,1142:1143,1594:1593,1597:1596,1310:1... 6.55 1 0
4 2409:2414,2377:2378,1478:1484 5.99 0 0
5 2474:2476 6.39 0 0

I want to count the occurrence of duplicate values in a column in a dataframe and update the count in a new column in python

Example: Let's say I have a df
Id
A
B
C
A
A
B
It should look like:
Id count
A. 1
B. 1
C. 1
A. 2
A. 3
B. 2
Note: I've tried using the for loop method and while loop option but it works for small datasets but takes a lot of time for large datasets.
for i in df:
for j in df:
if i==j:
count+=1

You can groupby with cumcount, like this:
df['counts'] = df.groupby('Id', sort=False).cumcount() + 1
df.head()
Id counts
0 A 1
1 B 1
2 C 1
3 A 2
4 A 3
5 B 2

dups_values = df.pivot_table(index=['values'], aggfunc='size')
print(dups_values)

Replace values in multiple untitled columns to 0, 1, 2 depending on column

EDITED AS PER COMMENTS
Background: Here is what the current dataframe looks like. The row labels are information texts in original excel file. But I hope this small reproduction of data will be enough for a solution? Actual file has about 100 columns and 200 rows.
Column headers and Row #0 values are repeated with pattern shown below -- except the Sales or Validation text changes at every occurrence of column with an existing title.
One more column before sales with text in each row. Mapping of Xs done for this test. Unfortunately, found no elegant way of displaying text as part of output below.
Sales Unnamed: 2 Unnamed: 3 Validation Unnamed: 5 Unnamed: 6
0 Commented No comment Commented No comment
1 x x
2 x x
3 x x
Expected Output: Replacing the X with 0s, 1s and 2s depending on which column they are in (Commented / No Comment)
Sales Unnamed: 2 Unnamed: 3 Validation Unnamed: 5 Unnamed: 6
0 Commented No comment Commented No comment
1 0 1
2 2 0
3 1 2
Possible Code: I assume the loop would look something like this:
while in row 9:
if column value = "commented":
replace all "x" with 1
elif row 9 when column valkue = "no comment":
replace all "x" with 2
else:
replace all "x" with 0
But being a python novice, I am not sure how to convert this to a working code. I'd appreciate all support and help.

Here is one way to do it:
Define a function to replace the x:
import re
def replaceX(col):
cond = ~((col == "x") | (col == "X"))
# Check if the name of the column is undefined
if not re.match(r'Unnamed: \d+', col.name):
return col.where(cond, 0)
else:
# Check what is the value of the first row
if col.iloc[0] == "Commented":
return col.where(cond, 1)
elif col.iloc[0] == "No comment":
return col.where(cond, 2)
return col
Or if your first row don't contain "Commented" or "No comment" for titled columns you can have a solution without regex:
def replaceX(col):
cond = ~((col == "x") | (col == "X"))
# Check what is the value of the first row
if col.iloc[0] == "Commented":
return col.where(cond, 1)
elif col.iloc[0] == "No comment":
return col.where(cond, 2)
return col.where(cond, 0)
Apply this function on the DataFrame:
# Apply the function on every column (axis not specified so equal 0)
df.apply(lambda col: replaceX(col))
Output:
title Unnamed: 2 Unnamed: 3
0 Commented No comment
1
2 0 2
3 1
Documentation:
Apply: apply a function on every columns/rows depending on the axis
Where: check where a condition is met on a series, if it is not met, replace with value specified.

Pandas groupby value and return observation count to dataset

I have a dataset like the following:
id value
a 0
a 0
a 0
a 0
a 1
a 2
a 2
a 2
b 0
b 0
b 1
b 2
b 2
I want to groupby the "id" column and grab the number of observations in the "value" column, and return a new column in the original dataset that counts the number of times the "value" observation occurs within each id.
An example of the output I'm looking for is represented in column "output":
id value output
a 0 4
a 0 4
a 0 4
a 0 4
a 1 1
a 2 3
a 2 3
a 2 3
b 0 2
b 0 2
b 1 1
b 2 2
b 2 2
When grouping on id "a", there are 4 observations of 0, which is provided in the column "output" for each row that contains id of "a" and value of 0.
I have tried applications of groupby and apply, to no avail. Any suggestions would be very helpful. Thank you.

Update: I figured out a solution for anyone who also faces this problem, and it works well.
grouped = df.groupby(['id','value'])
df['output'] = grouped['value'].transform('count')
This will return the count of observations under each bucket and return that count to each observation that meets that criteria, as shown in the "output" column above.

group by id and and value then count value.
data.groupby(['id' , 'value'])['id'].transform('count')

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Pandas "countif" based on column value and multiindex - python-3.x

Related

Updating Pandas data fram cells by condition

Python and Pandas, find rows that contain value, target column has many sets of ranges

I want to count the occurrence of duplicate values in a column in a dataframe and update the count in a new column in python

Replace values in multiple untitled columns to 0, 1, 2 depending on column

Pandas groupby value and return observation count to dataset

Categories

Resources