Replace values in multiple untitled columns with 0, 1, or 2 depending on the column - python-3.x

EDITED AS PER COMMENTS
Background: Here is what the current dataframe looks like. The row labels are informational text in the original Excel file, but I hope this small reproduction of the data is enough for a solution. The actual file has about 100 columns and 200 rows.
Column headers and row #0 values repeat in the pattern shown below, except that the Sales or Validation text changes at every occurrence of a column with an existing title.
There is one more column before Sales with text in each row. The mapping of Xs was done for this test; unfortunately, I found no elegant way of displaying that text as part of the output below.
   Sales  Unnamed: 2  Unnamed: 3  Validation  Unnamed: 5  Unnamed: 6
0         Commented   No comment              Commented   No comment
1    x    x
2                     x           x
3         x                                               x
Expected Output: the Xs replaced with 0s, 1s, and 2s depending on which column they are in (titled column → 0, Commented → 1, No comment → 2):
   Sales  Unnamed: 2  Unnamed: 3  Validation  Unnamed: 5  Unnamed: 6
0         Commented   No comment              Commented   No comment
1    0    1
2                     2           0
3         1                                               2
Possible Code: I assume the loop would look something like this:
for each column, looking at row 9:
    if the column value == "Commented":
        replace all "x" in the column with 1
    elif the column value == "No comment":
        replace all "x" in the column with 2
    else:
        replace all "x" in the column with 0
But being a Python novice, I am not sure how to turn this into working code. I'd appreciate any support and help.

Here is one way to do it:
Define a function to replace the x:
import re

def replaceX(col):
    # Condition: True where the value should be kept (not an "x")
    cond = ~((col == "x") | (col == "X"))
    # Check if the name of the column is undefined
    if not re.match(r'Unnamed: \d+', col.name):
        return col.where(cond, 0)
    else:
        # Check the value of the first row
        if col.iloc[0] == "Commented":
            return col.where(cond, 1)
        elif col.iloc[0] == "No comment":
            return col.where(cond, 2)
    return col
Or, if your first row doesn't contain "Commented" or "No comment" for titled columns, you can have a solution without the regex:
def replaceX(col):
    cond = ~((col == "x") | (col == "X"))
    # Check the value of the first row
    if col.iloc[0] == "Commented":
        return col.where(cond, 1)
    elif col.iloc[0] == "No comment":
        return col.where(cond, 2)
    return col.where(cond, 0)
Apply this function to the DataFrame:
# Apply the function to every column (axis not specified, so it defaults to 0)
df.apply(lambda col: replaceX(col))
Output:
  title Unnamed: 2  Unnamed: 3
0       Commented   No comment
1
2   0               2
3       1
Documentation:
Apply: apply a function to every column/row depending on the axis
Where: keep the values of a Series where a condition is met; where it is not met, replace them with the value specified
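To make where concrete, here is a minimal sketch on toy data (unrelated to the question's dataframe):
import pandas as pd

s = pd.Series(["x", "keep", "X"])
cond = ~((s == "x") | (s == "X"))  # True where the value should be kept
print(s.where(cond, 1))            # "x"/"X" become 1, everything else is kept
# 0       1
# 1    keep
# 2       1
# dtype: object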

Related

Python and Pandas, find rows that contain value, target column has many sets of ranges

I have a messy dataframe where I am trying to "flag" the rows that contain a certain number in the ids column. The values in this column represent an inclusive range: for example, "row 4" contains the following numbers:
2409,2410,2411,2412,2413,2414,2377,2378,1478,1479,1480,1481,1482,1483,1484. In "row 0" and "row 1" the range for one of the sets is backwards (1931, 1930, 1929).
If I want to know which rows have sets that contain "2340" and "1930", for example, how would I do this? I think a loop is needed, since sometimes I will need to query more than just two numbers. Using Python 3.8.
Example Dataframe
import pandas as pd

x = ['1331:1332,1552:1551,1931:1928,1965:1973,1831:1811,1927:1920',
     '1331:1332,1552:1551,1931:1929,180:178,1966:1973,1831:1811,1927:1920',
     '2340:2341,1142:1143,1594:1593,1597:1596,1310,1311',
     '2339:2341,1142:1143,1594:1593,1597:1596,1310:1318,1977:1974',
     '2409:2414,2377:2378,1478:1484',
     '2474:2476',
     ]
y = [6.48, 7.02, 7.02, 6.55, 5.99, 6.39]
df = pd.DataFrame(list(zip(x, y)), columns=['ids', 'val'])
display(df)
Desired Output Dataframe
I would write a function that performs 2 steps:
Given the ids_string that contains the range of ids, list all the ids as ids_num_list
Check if the query_id is in the ids_num_list
def check_num_in_ids_string(ids_string, query_id):
    # Convert ids_string to ids_num_list
    ids_range_list = ids_string.split(',')
    ids_num_list = set()
    for ids_range in ids_range_list:
        if ':' in ids_range:
            # Sort numerically so backwards ranges like "1931:1928" still work
            lower, upper = sorted(int(n) for n in ids_range.split(':'))
            num_list = list(range(lower, upper + 1))
            ids_num_list.update(num_list)
        else:
            ids_num_list.add(int(ids_range))
    # Check if the query number is in the list
    if int(query_id) in ids_num_list:
        return 1
    else:
        return 0
# Example usage
query_id_list = ['2340', '1930']
for query_id in query_id_list:
    df[f'n{query_id}'] = (
        df['ids']
        .apply(lambda x: check_num_in_ids_string(x, query_id))
    )
which returns what you require:
ids val n2340 n1930
0 1331:1332,1552:1551,1931:1928,1965:1973,1831:1... 6.48 0 1
1 1331:1332,1552:1551,1931:1929,180:178,1966:197... 7.02 0 1
2 2340:2341,1142:1143,1594:1593,1597:1596,1310,1311 7.02 1 0
3 2339:2341,1142:1143,1594:1593,1597:1596,1310:1... 6.55 1 0
4 2409:2414,2377:2378,1478:1484 5.99 0 0
5 2474:2476 6.39 0 0
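If you will run many such queries, it may be cheaper to expand each ids string into a set once and reuse it. Here is a sketch under that assumption (expand_ids is a hypothetical helper, reusing the df above):
def expand_ids(ids_string):
    # Same parsing as above, but returns the full set of ids for reuse
    nums = set()
    for part in ids_string.split(','):
        if ':' in part:
            lower, upper = sorted(int(n) for n in part.split(':'))
            nums.update(range(lower, upper + 1))
        else:
            nums.add(int(part))
    return nums

id_sets = df['ids'].apply(expand_ids)  # parse each row only once
for query_id in ['2340', '1930']:
    df[f'n{query_id}'] = id_sets.apply(lambda s: int(int(query_id) in s))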

How to extract row before and after when flag change from 0 to 1

I have one dataframe. I want to extract the 2 rows before the flag changes from 0 to 1 and get the row where the value of 'B' is minimum, and also extract the two rows after the flag change and get the row with the minimum value of 'B'.
df = pd.DataFrame({'A': [1, 3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'B': [1, 3, 4, 6, 8, 11, 1, 19, 20, 15, 16, 87],
                   'flag': [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]})
df_out = pd.DataFrame({'A': [4, 1],
                       'B': [4, 1],
                       'flag': [0, 1]})
To find the indices of both rows of interest, run:
ind1 = df[df.flag.shift(-1).eq(0) & df.flag.shift(-2).eq(1)].index[0]
ind2 = df[df.index > ind1].B.idxmin()
shift(-1) and shift(-2) look one and two rows ahead, so ind1 is the row two positions before the 0 to 1 change; ind2 is then the index of the minimal B among all rows after ind1. For your data sample the results are 2 and 6.
Then, to retrieve rows with these indices, run:
df.loc[[ind1, ind2]]
The result is:
A B flag
2 4 4 0
6 1 1 1
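If you want to restrict each search to exactly the two rows around the change, as the question words it, a windowed variant could look like this (a sketch, assuming a default integer index and a single 0 to 1 flag change):
change = df.index[df.flag.diff().eq(1)][0]            # first row where flag goes 0 -> 1
before = df.loc[change - 2:change - 1, 'B'].idxmin()  # min B among the two rows before
after = df.loc[change + 1:change + 2, 'B'].idxmin()   # min B among the two rows after
print(df.loc[[before, after]])
For the sample data this also returns rows 2 and 6.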

Selecting rows where a numeric column value change sign through openpyxl

I'm learning Python and openpyxl for data analysis on a large xlsx workbook. I have a for loop that can iterate down an entire column. Here's some example data:
ROW: VALUE:
1 1
2 2
3 3
4 4
5 -4
6 -1
7 -6
8 2
9 3
10 -3
I want to print out the row in which the value changes from positive to negative, and vice versa. So in the above example, row number 5, 8, and 10 would print in the console. How can I use an if statement within a for loop to iterate through a column on openpyxl?
So far I can print all of the cells in a column:
import openpyxl
wb = openpyxl.load_workbook('ngt_log.xlsx')
sheet = wb.get_sheet_by_name('sheet1')
for i in range(1, 10508, 1):  # 10508 is the length of the column
    print(i, sheet.cell(row=i, column=6).value)
My idea was to just add an if statement inside of the for loop:
for i in range(1, 10508, 1):  # 10508 is the length of the column
    if (i > 0 and (i+1) < 0) or (i < 0 and (i+1) > 0):
        print((i+1), sheet.cell(row=i, column=6).value)
But that doesn't work. Am I formulating the if statement correctly?
It looks to me as though your statement is contradicting itself:
for i in range(1, 10508, 1):  # 10508 is the length of the column
    if (i > 0 and (i+1) < 0) or (i < 0 and (i+1) > 0):
        print((i+1), sheet.cell(row=i, column=6).value)
You are testing the loop counter i, not the cell values: if i is greater than 0 then i + 1 is never less than 0, and vice versa, so the two comparisons in each clause can never both be true.
You need to get the sheet.cell values first, and then do the comparisons:
end_range = 10508
for i in range(1, end_range):
    # nxt rather than next, to avoid shadowing the built-in next()
    current, nxt = sheet.cell(row=i, column=6).value, sheet.cell(row=i+1, column=6).value
    if current > 0 and nxt < 0 or current < 0 and nxt > 0:
        print(i+1, nxt)
There is no sign() function in the math library as such (math.copysign is the closest built-in, and numpy has numpy.sign), but that would be overkill here anyway. You may also want to figure out what you want to do when a value is 0.
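For example, a small helper that handles zero explicitly (a sketch reusing the sheet object from the question; here a zero is simply never counted as a sign change):
import math

def sign(v):
    # math has no sign(); build one from copysign and treat 0 separately
    return 0 if v == 0 else int(math.copysign(1, v))

prev = sheet.cell(row=1, column=6).value
for i in range(2, 10509):
    cur = sheet.cell(row=i, column=6).value
    # report a change only between strictly positive and strictly negative values
    if 0 not in (sign(prev), sign(cur)) and sign(prev) != sign(cur):
        print(i)
    prev = cur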
You can use a flag to check for positive and negative.
ws = wb['sheet1']  # why people persist in using long deprecated syntax is beyond me
flag = None
for row in ws.iter_rows(max_row=10508, min_col=6, max_col=6):
    cell = row[0]
    sign = cell.value > 0 and "positive" or "negative"
    if flag is not None and sign != flag:
        print(cell.row)
    flag = sign
You can write the rules to select the rows where the sign has changed and put them in a generator expression without using extra memory, like this:
pos = lambda x: x >= 0
keep = lambda s, c, i, v: pos(s[c][i].value) != pos(v.value)  # compare sign with the previous cell
gen = (x + 1 for x, y in enumerate(sheet['f']) if x > 0 and keep(sheet, 'f', x - 1, y))
Then, when you need to know the rows where the sign has changed, you just iterate on gen as below:
for row in gen:
    print(row)  # here you use row, e.g. print the row number

selecting different columns each row

I have a dataframe which has 500K rows, 7 columns for days, and columns for the start and end day.
I search for a value (equal to 0, say) in the range (startDay, endDay).
For example, for id_1, startDay=1 and endDay=7, so I should look for the value in columns D1 to D7.
For id_2, startDay=4 and endDay=7, so I should look for the value in columns D4 to D7.
However, I couldn't manage to search a different column range for each row.
As mentioned above:
if startDay > endDay, I should see -999
else, I need to find the first zero within the day range; for example, id_3's first zero is in column D2 (day 2) and the startDay of id_3 is 1, so I want to see 2 - 1 = 1 (D2 - startDay)
if I cannot find a 0, I want to see 8
Here is my data:
data = {
    'D1': [0, 1, 1, 0, 1, 1, 0, 0, 0, 1],
    'D2': [2, 0, 0, 1, 2, 2, 1, 2, 0, 4],
    'D3': [0, 0, 1, 0, 1, 1, 1, 0, 1, 0],
    'D4': [3, 3, 3, 1, 3, 2, 3, 0, 3, 3],
    'D5': [0, 0, 3, 3, 4, 0, 4, 2, 3, 1],
    'D6': [2, 1, 1, 0, 3, 2, 1, 2, 2, 1],
    'D7': [2, 3, 0, 0, 3, 1, 3, 2, 1, 3],
    'startDay': [1, 4, 1, 1, 3, 3, 2, 2, 5, 2],
    'endDay': [7, 7, 6, 7, 7, 7, 2, 1, 7, 6]
}
data_idx = ['id_1', 'id_2', 'id_3', 'id_4', 'id_5',
            'id_6', 'id_7', 'id_8', 'id_9', 'id_10']
df = pd.DataFrame(data, index=data_idx)
What I want to see:
df_need = pd.DataFrame([0,1,1,0,8,2,8,-999,8,1], index=data_idx)
You can create a boolean array to check, in each row, which 'Dx' columns are at or after 'startDay', at or before 'endDay', and equal to 0. For the first two conditions, you can use np.ufunc.outer with the ufunc being np.less_equal and np.greater_equal, such as:
import numpy as np

arr_bool = (np.less_equal.outer(df.startDay, range(1, 8))      # which columns Dx are at or after startDay
            & np.greater_equal.outer(df.endDay, range(1, 8))   # which columns Dx are at or before endDay
            & (df.filter(regex='D[0-9]').values == 0))         # which values in the Dx columns are 0
Then you can use np.argmax to find the first True per row. Adding 1 and subtracting 'startDay' gives the values you are looking for. You then need to handle the other conditions with np.select, replacing the value with -999 where df.startDay >= df.endDay, or with 8 where there is no True in the row of arr_bool, such as:
df_need = pd.DataFrame((np.argmax(arr_bool, axis=1) + 1 - df.startDay).values,
                       index=data_idx, columns=['need'])
df_need.need = np.select(condlist=[df.startDay >= df.endDay, ~arr_bool.any(axis=1)],
                         choicelist=[-999, 8],
                         default=df_need.need)
print(df_need)
need
id_1 0
id_2 1
id_3 1
id_4 0
id_5 8
id_6 2
id_7 -999
id_8 -999
id_9 8
id_10 1
One note: to get -999 for id_7, I used the condition df.startDay >= df.endDay in np.select, not df.startDay > df.endDay as in your question; if you change it to the strict comparison, you get 8 instead of -999 in this case.
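As a quick illustration of the np.ufunc.outer mechanics used above (toy values, not the question's data):
import numpy as np

# Each element of the first array is compared with every element of the second,
# producing a 2-D boolean grid: one row per left-hand element
print(np.less_equal.outer([1, 4], range(1, 4)))
# [[ True  True  True]
#  [False False False]]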

Pandas "countif" based on column value and multiindex

I have a DataFrame with YEAR and RACEETHN as a MultiIndex. I want to count the number of "1" values (note, the data are not only 0s and 1s, so I cannot just sum) for each YEAR and RACEETHN combination, for each column variable.
I am able to count where value = 1 for each column by doing this:
(df_3.ACSUPPSV == 1).sum()
(df_3.PSEDSUPPSV == 1).sum()
I want to do this with groupby, but am unable to get it to work. I've tried the following code to test if I could do it on a single column 'ACSUPPSV', and it did not work:
df.groupby(['YEAR', 'RACEETHN']).loc[df.ACSUPPSV == 1, 'ACSUPPSV'].count()
I exported the data to Excel and was able to calculate this with a quick "COUNTIF" formula, but I know there must be a way to do this in pandas.
Would appreciate if someone had a better way to do this than exporting to Excel! :)
I think you need agg with a custom function to count only the 1s:
df_3 = pd.DataFrame({'ACSUPPSV': [1, 1, 1, 1, 0, 1],
                     'PSEDSUPPSV': [1, 1, 0, 1, 0, 0],
                     'BUDGETSV': [1, 0, 1, 1, 1, 0],
                     'YEAR': [2000, 2000, 2001, 2000, 2000, 2000],
                     'RACEETHN': list('aaabbb')}).set_index(['YEAR', 'RACEETHN'])
print(df_3)
               ACSUPPSV  BUDGETSV  PSEDSUPPSV
YEAR RACEETHN
2000 a                1         1           1
     a                1         0           1
2001 a                1         1           0
2000 b                1         1           1
     b                0         1           0
     b                1         0           0
df2 = df_3.groupby(['YEAR', 'RACEETHN']).agg(lambda x: (x == 1).sum())
print(df2)
               ACSUPPSV  BUDGETSV  PSEDSUPPSV
YEAR RACEETHN
2000 a                2         1           2
     b                2         2           1
2001 a                1         1           0
Old answer:
df_3[((df_3.ACSUPPSV == 1) & (df_3.PSEDSUPPSV == 1))].groupby(['YEAR', 'RACEETHN']).size()
df_3.query('ACSUPPSV == 1 & PSEDSUPPSV == 1').groupby(['YEAR', 'RACEETHN']).size()
More general:
cols = ['ACSUPPSV','PSEDSUPPSV']
df_3[(df_3[cols] == 1).all(axis=1)].groupby(['YEAR', 'RACEETHN']).size()
For all columns:
df_3[(df_3 == 1).all(axis=1)].groupby(['YEAR', 'RACEETHN']).size()
EDIT:
Or maybe need:
df_3.groupby(['YEAR', 'RACEETHN']).agg(lambda x: (x == 1).sum())
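A vectorized alternative (a sketch against the df_3 above) is to compare first and then group-sum the resulting booleans, which avoids the Python-level lambda:
# eq(1) turns the frame into booleans; summing per index group counts the 1s
df2 = df_3.eq(1).groupby(level=['YEAR', 'RACEETHN']).sum()
print(df2)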
