selecting different columns each row - python-3.x
I have a dataframe with 500K rows. It has seven day columns (D1 to D7) plus a startDay and an endDay column.
For each row, I search for a value (for example, 0) within the range startDay to endDay.
For example, for id_1, startDay=1 and endDay=7, so I should search columns D1 to D7.
For id_2, startDay=4 and endDay=7, so I should search columns D4 to D7.
However, I couldn't manage to search a different column range for each row.
The rules are:
if startDay > endDay, I should see "-999"
otherwise, I need to find the first zero within the day range. For example, id_3's first zero is in column D2 (day 2) and its startDay is 1, so I want to see 2-1=1 (D2 - startDay)
if I cannot find a 0, I want to see "8"
Here is my data:
import pandas as pd
data = {
    'D1':[0,1,1,0,1,1,0,0,0,1],
    'D2':[2,0,0,1,2,2,1,2,0,4],
    'D3':[0,0,1,0,1,1,1,0,1,0],
    'D4':[3,3,3,1,3,2,3,0,3,3],
    'D5':[0,0,3,3,4,0,4,2,3,1],
    'D6':[2,1,1,0,3,2,1,2,2,1],
    'D7':[2,3,0,0,3,1,3,2,1,3],
    'startDay':[1,4,1,1,3,3,2,2,5,2],
    'endDay':[7,7,6,7,7,7,2,1,7,6]
}
data_idx = ['id_1','id_2','id_3','id_4','id_5',
            'id_6','id_7','id_8','id_9','id_10']
df = pd.DataFrame(data, index=data_idx)
What I want to see:
df_need = pd.DataFrame([0,1,1,0,8,2,8,-999,8,1], index=data_idx)
You can create a boolean array that marks, for each row, which 'Dx' columns are at or after 'startDay', at or before 'endDay', and equal to 0. For the first two conditions, you can use np.ufunc.outer, with the ufunc being np.less_equal and np.greater_equal, such as:
import numpy as np
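# .values passes plain arrays, because recent pandas versions do not support ufunc.outer on a Series
arr_bool = ( np.less_equal.outer(df.startDay.values, range(1,8)) # which columns Dx are at or after startDay
            & np.greater_equal.outer(df.endDay.values, range(1,8)) # which columns Dx are at or before endDay
            & (df.filter(regex='D[0-9]').values == 0)) # which values of the columns Dx are 0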
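To see what the two outer comparisons produce before the zero test is added, here is a small sketch on two hypothetical rows. The names start, end and days are illustrative and not part of the question's data.

```python
import numpy as np

# Hypothetical two-row example: row 0 covers days 1-7, row 1 covers days 4-7
start = np.array([1, 4])
end = np.array([7, 7])
days = np.arange(1, 8)  # day numbers matching columns D1..D7

# outer() compares every start/end value with every day number, giving
# one boolean row per dataframe row and one column per day
in_range = np.less_equal.outer(start, days) & np.greater_equal.outer(end, days)
print(in_range.astype(int))
# [[1 1 1 1 1 1 1]
#  [0 0 0 1 1 1 1]]
```

Each row of the mask marks the days that lie inside that row's own range, which is what lets a single vectorized expression handle a different column range per row.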
Then you can use np.argmax to find the position of the first True in each row. Adding 1 and subtracting 'startDay' gives the values you are looking for. Finally, handle the other conditions with np.select, which replaces values with -999 if df.startDay >= df.endDay, or with 8 if there is no True in that row of arr_bool, such as:
df_need = pd.DataFrame( (np.argmax(arr_bool , axis=1) + 1 - df.startDay).values,
                        index=data_idx, columns=['need'])
df_need.need = np.select( condlist = [df.startDay >= df.endDay, ~arr_bool.any(axis=1)],
                          choicelist = [ -999, 8],
                          default = df_need.need)
print (df_need)
       need
id_1      0
id_2      1
id_3      1
id_4      0
id_5      8
id_6      2
id_7   -999
id_8   -999
id_9      8
id_10     1
One note: to get -999 for id_7, I used the condition df.startDay >= df.endDay in np.select rather than df.startDay > df.endDay as in your question. If you change it to the strict comparison, id_7 gets 8 instead of -999, which is the value your expected output shows for that row.
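For reference, the steps above can be put together as one self-contained sketch. The function name first_zero_offset and its parameters are hypothetical. It uses NumPy broadcasting (start[:, None] <= days) in place of np.less_equal.outer, which should be equivalent here, and it applies the strict startDay > endDay rule from the question.

```python
import numpy as np
import pandas as pd


def first_zero_offset(df, n_days=7, missing=8, invalid=-999):
    """Offset from startDay to the first 0 found within [startDay, endDay]."""
    days = np.arange(1, n_days + 1)
    day_cols = [f"D{d}" for d in days]
    start = df["startDay"].to_numpy()
    end = df["endDay"].to_numpy()

    # True where the day lies inside the row's range and the value is 0
    hit = (
        (start[:, None] <= days)
        & (end[:, None] >= days)
        & (df[day_cols].to_numpy() == 0)
    )
    # argmax gives the position of the first True (0 when there is none)
    offset = hit.argmax(axis=1) + 1 - start
    # Strict comparison, as worded in the question
    result = np.select([start > end, ~hit.any(axis=1)],
                       [invalid, missing], default=offset)
    return pd.Series(result, index=df.index, name="need")


data = {
    'D1': [0, 1, 1, 0, 1, 1, 0, 0, 0, 1],
    'D2': [2, 0, 0, 1, 2, 2, 1, 2, 0, 4],
    'D3': [0, 0, 1, 0, 1, 1, 1, 0, 1, 0],
    'D4': [3, 3, 3, 1, 3, 2, 3, 0, 3, 3],
    'D5': [0, 0, 3, 3, 4, 0, 4, 2, 3, 1],
    'D6': [2, 1, 1, 0, 3, 2, 1, 2, 2, 1],
    'D7': [2, 3, 0, 0, 3, 1, 3, 2, 1, 3],
    'startDay': [1, 4, 1, 1, 3, 3, 2, 2, 5, 2],
    'endDay': [7, 7, 6, 7, 7, 7, 2, 1, 7, 6],
}
df = pd.DataFrame(data, index=[f"id_{i}" for i in range(1, 11)])
print(first_zero_offset(df).tolist())
# → [0, 1, 1, 0, 8, 2, 8, -999, 8, 1]
```

Because everything is done on whole arrays, this should scale to the 500K rows mentioned in the question without a Python-level loop.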
Related
how to get value of column2 when column 1 is greater 3 and check this value belong to which Bin
I have one dataframe with two columns, A and B. First I need to make empty bins with step 1 from 1 to 11: (1,2), (2,3), ..., (10,11). Then I need to check the original dataframe: wherever the column B value is greater than 3, get the value of column 'A' 2 rows before that row.
Here is an example dataframe:
df=pd.DataFrame({'A':[1,8.5,5.2,7,8,9,0,4,5,6],'B':[1,2,2,2,3.1,3.2,3,2,1,2]})
Required output 1:
df_out1=pd.DataFrame({'Value_A':[8.5,5.2]})
Required output 2:
df_output2:
Bins     count
(1,2)    0
(2,3)    0
(3,4)    0
(4,5)    0
(5,6)    1
(6,7)    0
(7,8)    0
(8,9)    1
(9,10)   0
(10,11)  0
You can index on a shifted series to get the value of 'A' from a few rows before each row where 'B' satisfies some condition, like
out1 = df['A'].shift(3)[df['B'] > 3]
Here shift(3) takes the value three rows earlier, which reproduces the required output [8.5, 5.2].
The thing you want to do with the bins is known as a histogram. You can easily do this with numpy, like
count, bin_edges = np.histogram(out1, bins=list(range(1, 12)))
out2 = pd.DataFrame({'bin_lo': bin_edges[:-1], 'bin_hi': bin_edges[1:], 'count': count})
Here 'bin_lo' and 'bin_hi' are the lower and upper bounds of the bins.
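As a runnable check of the approach, here is a minimal sketch on the question's example data. The 'Bins' label column is an assumption about how the required output could be laid out. It is not something np.histogram produces by itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 8.5, 5.2, 7, 8, 9, 0, 4, 5, 6],
                   'B': [1, 2, 2, 2, 3.1, 3.2, 3, 2, 1, 2]})

# Value of A three rows above each row where B > 3
out1 = df['A'].shift(3)[df['B'] > 3]
print(out1.tolist())  # → [8.5, 5.2]

# Bin edges 1, 2, ..., 11 give the ten bins (1,2) ... (10,11)
edges = np.arange(1, 12)
count, _ = np.histogram(out1, bins=edges)

# Hypothetical label column mimicking the "Bins / count" layout
out2 = pd.DataFrame({
    'Bins': [f"({lo},{hi})" for lo, hi in zip(edges[:-1], edges[1:])],
    'count': count,
})
print(out2)
```

Note that np.histogram treats each bin as half-open, [lo, hi), except the last one, which also includes its upper edge. That matters only if a value falls exactly on a bin boundary.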
Python and Pandas, find rows that contain value, target column has many sets of ranges
I have a messy dataframe where I am trying to "flag" the rows that contain a certain number in the ids column. The values in this column represent inclusive ranges: for example, "row 4" contains the following numbers:
2409,2410,2411,2412,2413,2414,2377,2378,1478,1479,1480,1481,1482,1483,1484
In "row 0" and "row 1", the range for one of the sets is backwards (1931,1930,1929).
If I want to know which rows have sets that contain "2340" and "1930", for example, how would I do this? I think a loop is needed, as I will sometimes need to query more than just two numbers. Using Python 3.8.
Example Dataframe
x = ['1331:1332,1552:1551,1931:1928,1965:1973,1831:1811,1927:1920',
     '1331:1332,1552:1551,1931:1929,180:178,1966:1973,1831:1811,1927:1920',
     '2340:2341,1142:1143,1594:1593,1597:1596,1310,1311',
     '2339:2341,1142:1143,1594:1593,1597:1596,1310:1318,1977:1974',
     '2409:2414,2377:2378,1478:1484',
     '2474:2476',
     ]
y = [6.48,7.02,7.02,6.55,5.99,6.39,]
df = pd.DataFrame(list(zip(x, y)), columns =['ids', 'val'])
display(df)
Desired Output Dataframe
I would write a function that performs 2 steps:
Given the ids_string that contains the ranges of ids, list all the ids as ids_num_list
Check if the query_id is in the ids_num_list
def check_num_in_ids_string(ids_string, query_id):
    # Convert ids_string to ids_num_list
    ids_range_list = ids_string.split(',')
    ids_num_list = set()
    for ids_range in ids_range_list:
        if ':' in ids_range:
            # Sort the bounds as integers: sorting the raw strings would
            # misorder bounds with different digit counts (e.g. '999:1001')
            lower, upper = sorted(int(n) for n in ids_range.split(':'))
            ids_num_list.update(range(lower, upper + 1))
        else:
            ids_num_list.add(int(ids_range))
    # Check if the query number is in the set
    if int(query_id) in ids_num_list:
        return 1
    else:
        return 0
# Example usage
query_id_list = ['2340', '1930']
for query_id in query_id_list:
    df[f'n{query_id}'] = (
        df['ids']
        .apply(lambda x : check_num_in_ids_string(x, query_id))
    )
which returns what you require:
                                                 ids   val  n2340  n1930
0  1331:1332,1552:1551,1931:1928,1965:1973,1831:1...  6.48      0      1
1  1331:1332,1552:1551,1931:1929,180:178,1966:197...  7.02      0      1
2  2340:2341,1142:1143,1594:1593,1597:1596,1310,1311  7.02      1      0
3  2339:2341,1142:1143,1594:1593,1597:1596,1310:1...  6.55      1      0
4                      2409:2414,2377:2378,1478:1484  5.99      0      0
5                                          2474:2476  6.39      0      0
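If the ranges can be wide, expanding every range into a set of numbers may be unnecessary. A possible variant, sketched here under the assumption that every entry is either a single id or two ids joined by ':', compares the query against the bounds directly. The helper name contains_id is hypothetical.

```python
import pandas as pd


def contains_id(ids_string, query_id):
    """Return 1 if query_id falls inside any inclusive range in ids_string."""
    query = int(query_id)
    for part in ids_string.split(','):
        bounds = [int(n) for n in part.split(':')]
        # A single id such as "1310" yields one bound, so min == max;
        # min/max also handles backwards ranges such as "1931:1928"
        if min(bounds) <= query <= max(bounds):
            return 1
    return 0


x = ['1331:1332,1552:1551,1931:1928,1965:1973,1831:1811,1927:1920',
     '1331:1332,1552:1551,1931:1929,180:178,1966:1973,1831:1811,1927:1920',
     '2340:2341,1142:1143,1594:1593,1597:1596,1310,1311',
     '2339:2341,1142:1143,1594:1593,1597:1596,1310:1318,1977:1974',
     '2409:2414,2377:2378,1478:1484',
     '2474:2476']
y = [6.48, 7.02, 7.02, 6.55, 5.99, 6.39]
df = pd.DataFrame({'ids': x, 'val': y})

for query_id in ['2340', '1930']:
    df[f'n{query_id}'] = df['ids'].apply(lambda s: contains_id(s, query_id))

print(df[['val', 'n2340', 'n1930']])
```

The work per row then depends on the number of ranges rather than on how many ids each range spans.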
How do you search through a row (Row 1) of a CSV file, but search through the next row (Row 2) at the same time?
Imagine there are THREE columns and a certain number of rows in a dataframe. The first column holds random values, the second column holds Names, and the third column holds Ages.
I want to search through every row of this dataframe and find where value 1 appears in the first column. At the same time, if value 1 does exist in that row, I want to know whether value 2 appears in the SAME column in the next row. If so, copy the first row's Value, Name and Age into an empty dataframe. Every time this condition is met, copy that row into the empty dataframe.
EmptyDataframe = pd.DataFrame(columns=['Name','Age'])
csvfile = pd.DataFrame(columns=['Value', 'Name', 'Age'])
row_for_csv_dataframe = next(csvfile.iterrows())
for index, row_for_csv_dataframe in csvfile.iterrows():
    if row_for_csv_dataframe['Value'] == '1':
        # How to code this:
        # if the NEXT row after row_for_csv_dataframe finds the 'Value' == 2
        # then copy 'Age' and 'Name' from row_for_csv_dataframe into the empty DataFrame.
Assuming you have a dataframe data like this:
   Value  Name  Age
0      1  Anne   10
1      2  Bert   20
2      3  Caro   30
3      2  Dora   40
4      1  Emil   50
5      1  Flip   60
6      2  Gabi   70
You could do something like this, although it is probably not the most efficient way:
iterator1 = data.iterrows()
iterator2 = data.iterrows()
next(iterator2)  # advance the second iterator so it runs one row ahead
# 'following' avoids shadowing the built-in next()
for current, following in zip(iterator1, iterator2):
    if current[1].Value == 1 and following[1].Value == 2:
        print(current[1].Value, current[1].Name, current[1].Age)
And you would get this result:
1 Anne 10
1 Flip 60
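A vectorized alternative is to compare the column with a copy of itself shifted up by one row. This is a minimal sketch on the same example data, assuming 'Value' holds integers. If the values were read from a CSV as strings, the comparisons would need '1' and '2' instead.

```python
import pandas as pd

data = pd.DataFrame({
    'Value': [1, 2, 3, 2, 1, 1, 2],
    'Name': ['Anne', 'Bert', 'Caro', 'Dora', 'Emil', 'Flip', 'Gabi'],
    'Age': [10, 20, 30, 40, 50, 60, 70],
})

# shift(-1) aligns each row with the Value of the row below it;
# the last row is compared with NaN, which never equals 2
mask = (data['Value'] == 1) & (data['Value'].shift(-1) == 2)

# Copy the matching rows into a new dataframe
matches = data.loc[mask, ['Value', 'Name', 'Age']].reset_index(drop=True)
print(matches)
```

This selects the same two rows (Anne and Flip) as the iterator version, without looping over rows in Python.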
SUMPRODUCT with a conditional with two ranges to calculate
To calculate a margin (JAN) I need to calculate:
(sales(loja1)*margin(loja1) + sales(loja2)*margin(loja2) + sales(loja3)*margin(loja3)) / SUM(sales(loja1); sales(loja2); sales(loja3))
but I need to do this using SUMPRODUCT. I tried:
=SUMPRODUCT((B3:B11="sales")*(C3:C11);(B3:B11="margin")*C3:C11))/SUMPRODUCT((B3:B11="sales")*(C3:C11))
but it gave an error!
When SUMPRODUCT is used to select cells within a range with text, the result of each evaluation will be either TRUE or FALSE. You need to convert these to 1's and 0's by putting '--' before the comparison so that, when you multiply by another range of cells, you get the expected value.
SUMPRODUCT example: sum of column B where column A is equal to "Sales"
    A        B
1 | Sales    5
2 | Sales    6
3 | Margin   3
4 | Margin   2
Resulting formula:
=SUMPRODUCT(--(A1:A4 = "Sales"),B1:B4)
How SUMPRODUCT works:
First, an array is returned that has TRUE for each value in A1:A4 that equals "Sales", and FALSE for each value that doesn't:
Sales  -> TRUE
Sales  -> TRUE
Margin -> FALSE
Margin -> FALSE
Then the double negative converts TRUE to 1 and FALSE to 0:
1
1
0
0
Next, the first array (now the one with 1's and 0's) is multiplied by your second array (B1:B4) to get a new array:
1st   2nd   New array
1   *  5  =  5
1   *  6  =  6
0   *  3  =  0
0   *  2  =  0
Finally, all the values in the new array are summed to get your result (5+6+0+0 = 11).
Step 1: For your scenario, you need to find the sales amount for each location and multiply it by the margin for the corresponding location.
location 1: sales * margin
=SUMPRODUCT(--(A3:A11="loja1"),--(B3:B11="venda"),(C3:C11)) * SUMPRODUCT(--(A3:A11="loja1"),--(B3:B11="margem"),(C3:C11))
You can use a similar formula for locations 2 and 3 and then sum them all together.
Step 2: To sum the sales for all locations, you can use a similar formula, again using the double negative, i.e. "--":
SUMPRODUCT(--(B3:B11="sales"),(C3:C11))
The resulting formula will be a bit long, but when you divide Step 1 by Step 2, you'll get the desired result.
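The same arithmetic can be checked outside Excel. The following is a rough Python sketch of the two steps, not an Excel feature. The store, label and value arrays are made-up stand-ins for the sheet's columns A, B and C, and sumproduct is a hypothetical helper that imitates SUMPRODUCT.

```python
import numpy as np

# Hypothetical stand-in for A3:C11: store, row label, January value
store = np.array(['loja1', 'loja1', 'loja2', 'loja2', 'loja3', 'loja3'])
label = np.array(['sales', 'margin', 'sales', 'margin', 'sales', 'margin'])
value = np.array([100.0, 0.10, 200.0, 0.20, 300.0, 0.30])


def sumproduct(*arrays):
    """Multiply the arrays element-wise and sum the result."""
    product = np.ones(len(arrays[0]))
    for a in arrays:
        # Boolean masks become 1.0 / 0.0 here, which is the role of '--'
        product = product * np.asarray(a, dtype=float)
    return product.sum()


# Step 1: sales * margin for each store, summed over the stores
numerator = sum(
    sumproduct(store == s, label == 'sales', value)
    * sumproduct(store == s, label == 'margin', value)
    for s in np.unique(store)
)

# Step 2: total sales across all stores
total_sales = sumproduct(label == 'sales', value)

print(numerator / total_sales)  # sales-weighted average margin
```

With these made-up numbers the result is 140 / 600, i.e. each store's margin weighted by its sales.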
Excel split given number into sum of other numbers
I'm trying to write formulae that will split a given number into the sum of 4 other numbers. The other numbers are 100, 150, 170 and 200, so the formula would be
x = a*100+b*150+c*170+d*200
where x is the given number and a, b, c, d are integers.
My spreadsheet is set up so that col B holds the x values, and C, D, E, F are a, b, c, d respectively (see below).
B   | C | D | E | F |
100 | 1 | 0 | 0 | 0
150 | 0 | 1 | 0 | 0
200 | 0 | 0 | 0 | 1
250 | 1 | 1 | 0 | 0
370 | 0 | 0 | 1 | 1
400 | 0 | 0 | 0 | 2
I need formulae for columns C, D, E, F (which are a, b, c, d in the formula).
Your help is greatly appreciated.
UPDATE: Based on the research below, for input numbers greater than 730 and/or for all actually divisible input numbers use the following formulas:
100s: =CHOOSE(MOD(ROUNDUP([#number]/10;0); 20)+1; 0;1;1;0;1;1;0;1;0;0;1;0;0;1;0;0;1;0;1;1)
150s: =CHOOSE(MOD(ROUNDUP([#number]/10;0); 10)+1; 0;0;1;1;0;1;1;0;0;1)
170s: =CHOOSE(MOD(ROUNDUP([#number]/10;0); 5)+1; 0;3;1;4;2)
200s: =CEILING(([#number]-930)/200;1) + CHOOSE(MOD(ROUNDUP([#number]/10;0); 20)+1; 4;1;2;0;2;3;1;3;1;2;4;2;3;0;2;3;0;3;0;1)
MOD(x; 20) will return numbers 0 - 19, CHOOSE(x;a;b;...) will return the n-th argument based on the first argument (1=>second argument, ...)
see more info about CHOOSE
use , instead of ; based on your Windows language & region settings
Let's start with the assumption that you want to preferably use 200s over 170s over 150s over 100s - i.e. 300=200+100 instead of 300=2*150 - and follow the logical conclusions:
the result set can only contain at most 1 100, at most 1 150, at most 4 170s and an unlimited number of 200s (I started with 9 170s because 1700=8x200+100, but in reality there were at most 4)
there are only 20 possible subsets of (100s, 150s, 170s) - 2*2*5 options
930 is the largest input number without any 200s in the result set
based on observation of the data points, the subset repeats periodically for number = 740*k + 10*l, k>1, l>0 - I'm not an expert on reverse-guessing periodic functions from data, but here is my work in progress (charted data points are from the table at the bottom of this answer)
the functions are probably more complicated; if I manage to get them right, I'll update the answer
anyway, for numbers smaller than 740, more tweaking of the formulas or a lookup table is needed (e.g. there is no way to get 730, so the result should be the same as for 740)
Here is my solution based on lookup formulas:
Following is the python script I used to generate the data points, the formulas from the picture and the 60-row table itself in csv format (sorted as needed by the match function):
headers = ("100s", "150s", "170s", "200s")
table = {}
for c200 in range(30, -1, -1):
    for c170 in range(9, -1, -1):
        for c150 in range(1, -1, -1):
            for c100 in range(1, -1, -1):
                nr = 200*c200 + 170*c170 + 150*c150 + 100*c100
                if nr not in table and nr <= 6000:
                    table[nr] = (c100, c150, c170, c200)
print("number\t" + "\t".join(headers))
for r in sorted(table):
    c100, c150, c170, c200 = table[r]
    print("{:6}\t{:2}\t{:2}\t{:2}\t{:2}".format(r, c100, c150, c170, c200))
__________
=IF(E$1<740; 0; INT((E$1-740)/200))
=E$1 - E$2*200
=MATCH(E$3; table[number]; -1)
=INDEX(table[number]; E$4)
=INDEX(table[100s]; E$4)
=INDEX(table[150s]; E$4)
=INDEX(table[170s]; E$4)
=INDEX(table[200s]; E$4) + E$2
__________
number,100s,150s,170s,200s
940,0,0,2,3
930,1,1,4,0
920,0,1,1,3
910,0,0,3,2
900,1,0,0,4
890,0,1,2,2
880,0,0,4,1
870,1,0,1,3
860,0,1,3,1
850,1,1,0,3
840,1,0,2,2
830,0,1,4,0
820,1,1,1,2
810,1,0,3,1
800,0,0,0,4
790,1,1,2,1
780,1,0,4,0
770,0,0,1,3
760,1,1,3,0
750,0,1,0,3
740,0,0,2,2
720,0,1,1,2
710,0,0,3,1
700,1,0,0,3
690,0,1,2,1
680,0,0,4,0
670,1,0,1,2
660,0,1,3,0
650,1,1,0,2
640,1,0,2,1
620,1,1,1,1
610,1,0,3,0
600,0,0,0,3
590,1,1,2,0
570,0,0,1,2
550,0,1,0,2
540,0,0,2,1
520,0,1,1,1
510,0,0,3,0
500,1,0,0,2
490,0,1,2,0
470,1,0,1,1
450,1,1,0,1
440,1,0,2,0
420,1,1,1,0
400,0,0,0,2
370,0,0,1,1
350,0,1,0,1
340,0,0,2,0
320,0,1,1,0
300,1,0,0,1
270,1,0,1,0
250,1,1,0,0
200,0,0,0,1
170,0,0,1,0
150,0,1,0,0
100,1,0,0,0
0,0,0,0,0
Assuming that you want as many of the highest values as possible (so 500 would be 2*200 + 100), try this approach, assuming the numbers to split are in B2 down:
Insert a header row with the 4 numbers, e.g. 100, 150, 170 and 200, in the range C1:F1
Now in F2 use this formula:
=INT(B2/F$1)
and in C2, copied across to E2:
=INT(($B2-SUMPRODUCT(D$1:$G$1,D2:$G2))/C$1)
Now you can copy the formulas in C2:F2 down all columns
That should give the results from your table
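One caveat with taking as many of the largest number as possible at each step: it can leave a remainder that the smaller numbers cannot cover (for example 250 = 1*200 + 50). The following Python sketch is an exhaustive search rather than a spreadsheet formula. It tries the larger numbers first and falls back to fewer of them when needed. The function name split_number is hypothetical.

```python
from itertools import product


def split_number(x, parts=(100, 150, 170, 200)):
    """Return counts (a, b, c, d) with x == a*100 + b*150 + c*170 + d*200, or None."""
    largest_first = tuple(reversed(parts))
    # One descending range of candidate counts per number, largest number
    # first, so the first match uses as many of the bigger numbers as possible
    ranges = [range(x // p, -1, -1) for p in largest_first]
    for counts in product(*ranges):
        if sum(c * p for c, p in zip(counts, largest_first)) == x:
            return tuple(reversed(counts))  # back to (a, b, c, d) order
    return None  # x cannot be written as such a sum


for x in (100, 150, 200, 250, 370, 400, 730):
    print(x, split_number(x))
```

For the six values in the question's table this should reproduce the expected rows, and it returns None for 730, which cannot be formed from these four numbers.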