Compare each row with all previous strings in one column and change the value of another column in Python

I have a csv file named namelist.csv, it includes:
Index String Size Name
1 AAA123000DDD 10 One
2 AAA123DDDQQQ 20 One
3 AAA123000DDD 25 One
4 AAA123D 20 One
5 ABA 15 One
6 FFFrrrSSSBBB 60 Two
7 FFFrrrSSSBBB 30 Two
8 FFFrrrSS 50 Two
9 AAA12 70 Two
I want to compare the rows in column String within each Name group: if the string in a row matches, or is a substring of, the string in any row above it, then remove that earlier row and add its Size value to the Size of the current (substring) row.
Example: take the 3rd row, AAA123000DDD. I compare it to the 1st and 2nd rows and see that it matches the 1st row, so the 1st row is removed and its Size is added to the Size of the 3rd row.
then the table will be like:
Index String Size Name
2 AAA123DDDQQQ 20 One
3 AAA123000DDD 35 One
4 AAA123D 20 One
...
the final result will be:
Index String Size Name
3 AAA123000DDD 35 One
4 AAA123D 40 One
5 ABA 15 One
8 FFFrrrSS 140 Two
9 AAA12 70 Two
I thought of using pandas groupby to group by the Name column, but I don't know how to apply the comparison on the String column and the sum on the Size column.
I am new to Python, so I would very much appreciate any help.

Assuming Name is distinct with String, here's how you would do the aggregation. I kept Name so that it also shows in the final DataFrame.
df_group = df.groupby(['String', 'Name'])['Size'].sum().reset_index()
Edit:
To match the substrings (and using the example above that it appears that a substring will not match with multiple strings), you can make a mapping of substrings to full strings and then group by the full string column as before:
# Collect every distinct string (the column is named 'String', not 'Strings')
all_strings = set(df['String'])
substring_dict = dict()
# Map each string to a string that contains it
for row in df.itertuples():
    for item in all_strings:
        if row.String in item:
            substring_dict[row.String] = item
def match_substring(x):
    return substring_dict[x]
df['full_strings'] = df.String.apply(match_substring)
df_group = df.groupby(['full_strings', 'Name'])['Size'].sum().reset_index()
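The sequential rule described in the question (within each Name group, a row absorbs every earlier surviving row whose String contains it) can also be sketched directly. This is only a minimal sketch under that reading of the question; the helper name collapse_group is hypothetical, and the data is retyped from the question's table.

```python
import pandas as pd

# Data retyped from the question's table
df = pd.DataFrame({
    "Index": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "String": ["AAA123000DDD", "AAA123DDDQQQ", "AAA123000DDD", "AAA123D", "ABA",
               "FFFrrrSSSBBB", "FFFrrrSSSBBB", "FFFrrrSS", "AAA12"],
    "Size": [10, 20, 25, 20, 15, 60, 30, 50, 70],
    "Name": ["One"] * 5 + ["Two"] * 4,
})

def collapse_group(group):
    """Hypothetical helper: walk the rows top to bottom and let each row
    absorb the Size of every earlier surviving row whose String contains it."""
    kept = []  # surviving rows so far, in original order
    for row in group.to_dict("records"):
        survivors = []
        for prev in kept:
            if row["String"] in prev["String"]:
                # Equal string or substring: drop the earlier row, add its Size
                row["Size"] += prev["Size"]
            else:
                survivors.append(prev)
        kept = survivors + [row]
    return pd.DataFrame(kept)

# Apply the rule separately to each Name group, keeping the original group order
result = pd.concat(
    [collapse_group(g) for _, g in df.groupby("Name", sort=False)],
    ignore_index=True,
)
print(result)
```

On the sample data this leaves rows 3, 4, 5, 8 and 9, the same rows as in the desired result. The nested loop is quadratic per group, which is fine for small files but may need rethinking for large ones.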

Related

Extract subsequences from main dataframe based on the locations in another dataframe

I want to extract the subsequences indicated by the first and last locations in data frame 'B'.
The algorithm that I came up with is:
Identify the rows of B that fall in the locations of A
Find the relative position of the locations (i.e. shift the locations to make them start from 0)
Start a for loop using the relative position as a range to extract the subsequences.
The issue with the above algorithm is its runtime. I need an alternative approach that runs faster than the existing one.
Desired output:
first last sequences
3 5 ACA
8 12 CGGAG
105 111 ACCCCAA
115 117 TGT
Used data frames:
import pandas as pd
A = pd.DataFrame({'first.sequence': ['AAACACCCGGAG', 'ACCACACCCCAAATGTGT'],
                  'first': [1, 100], 'last': [12, 117]})
B = pd.DataFrame({'first': [3, 8, 105, 115], 'last': [5, 12, 111, 117]})
One solution could be as follows:
out = pd.merge_asof(B, A, on=['last'], direction='forward',
                    suffixes=('', '_y'))
out.loc[:, ['first', 'last']] = \
    out.loc[:, ['first', 'last']].sub(out.first_y, axis=0)
out = out.assign(sequences=out.apply(
    lambda row: row['first.sequence'][row['first']:row['last'] + 1],
    axis=1)).drop(['first.sequence', 'first_y'], axis=1)
out.update(B)
print(out)
first last sequences
0 3 5 ACA
1 8 12 CGGAG
2 105 111 ACCCCAA
3 115 117 TGT
Explanation
First, use pd.merge_asof on last with direction='forward' to pair each row of B with the row of A whose last is the nearest value at or above it. Rows with first values 3 and 8 are thereby paired with the sequence that starts at 1, and 105 and 115 with the sequence that starts at 100. Now we know which string (sequence) needs slicing, and we also know where that string starts, e.g. at position 1 or 100 instead of the usual 0.
We use this last bit of information to find where each slice should start and end by computing out.loc[:,['first','last']].sub(out.first_y, axis=0). E.g. we "reset" 3 to 2 (minus 1) and 105 to 5 (minus 100).
Now we can use df.apply to get the string slice for each sequence, essentially looping over each row. (If all slices started and ended at the same indices, we could have used Series.str.slice instead.)
Finally, we assign the result to out (as col sequences), drop the cols we no longer need, and we use df.update to "reset" the columns first and last.
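Since the question is about runtime, the per-row slicing can also be written as a plain list comprehension over the merged columns instead of df.apply. This is a sketch of that variant, not a benchmarked result; the names merged and out are chosen here for illustration, and the data frames are the ones from the question.

```python
import pandas as pd

A = pd.DataFrame({'first.sequence': ['AAACACCCGGAG', 'ACCACACCCCAAATGTGT'],
                  'first': [1, 100], 'last': [12, 117]})
B = pd.DataFrame({'first': [3, 8, 105, 115], 'last': [5, 12, 111, 117]})

# Pair each row of B with the row of A whose 'last' is the nearest value >= B's 'last'
merged = pd.merge_asof(B, A, on='last', direction='forward', suffixes=('', '_y'))

# Slice each sequence using positions relative to where that sequence starts
merged['sequences'] = [
    seq[first - start:last - start + 1]
    for seq, first, last, start in zip(
        merged['first.sequence'], merged['first'], merged['last'], merged['first_y'])
]

out = merged[['first', 'last', 'sequences']]
print(out)
```

Because first and last are never overwritten here, no final update step is needed to restore them.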

Optical differences between characters within a string of equal length

I have a data set with strings of different lengths; they are concatenated into a separate column and padded to equal length via LEN(), TRIM() and REPT().
The formulas I used can be seen in the last row for each column (B:E).
Although the final strings are equal in length, the strings in the "Name with equal length" column are not optically identical, i.e. they do not appear to have the same width.
As I want to use this column to create new file names via VBA, I would like the file names to be "optically smooth". (I hope you get what I mean.)
How can I achieve this? Do I have to calculate the pixel differences within (case-sensitive) letters? If so, how can I do this?
Text | Place | Length of String | Needed Spaces | Name with equal length | Length of Name
SaMPLE_TEXT | P 1 | 12 | 2 | SaMPLE_TEXT--P 1_.pdf | 22
SaMPLE_TexT | P 2 | 13 | 1 | SaMPLE_TexT-P 2_.pdf | 22
SaMPLE_text | P 3 | 13 | 1 | SaMPLE_text-P 3_.pdf | 22
sample_TEXT | P 4 | 12 | 2 | sample_TEXT--P 4_.pdf | 22
SaMPLE_TEXT | P 5 | 12 | 2 | SaMPLE_TEXT--P 5_.pdf | 22
=LEN(TRIM(B1))
=MAX($D$1:$D$6)-LEN(TRIM(B2))+1
=TRIM(A2)&REPT("-";D2)&TRIM(B2)&"_.pdf"
=LEN(E2)

How do you search through a row (Row 1) of a CSV file, but search through the next row (Row 2) at the same time?

Imagine a dataframe with three columns and a certain number of rows. The first column holds random values, the second column names, and the third column ages.
I want to go through every row of this dataframe and find where value 1 appears in the first column. At the same time, whenever value 1 does appear, I want to know whether value 2 appears in the same column in the next row.
If so, copy the first row's Value, Name and Age into an empty dataframe. Every time this condition is met, copy that row into the empty dataframe.
EmptyDataframe = pd.DataFrame(columns=['Name', 'Age'])
csvfile = pd.DataFrame(columns=['Value', 'Name', 'Age'])
row_for_csv_dataframe = next(csvfile.iterrows())
for index, row_for_csv_dataframe in csvfile.iterrows():
    if row_for_csv_dataframe['Value'] == '1':
        # How to code this:
        # if the NEXT row after row_for_csv_dataframe finds the 'Value' == 2
        # then copy 'Age' and 'Name' from row_for_csv_dataframe into the empty DataFrame.
        pass
Assuming you have a dataframe data like this:
Value Name Age
0 1 Anne 10
1 2 Bert 20
2 3 Caro 30
3 2 Dora 40
4 1 Emil 50
5 1 Flip 60
6 2 Gabi 70
You could do something like this, although this is probably not the most efficient:
iterator1 = data.iterrows()
iterator2 = data.iterrows()
next(iterator2)  # advance the second iterator so it is one row ahead
# nxt avoids shadowing the built-in next()
for current, nxt in zip(iterator1, iterator2):
    if current[1].Value == 1 and nxt[1].Value == 2:
        print(current[1].Value, current[1].Name, current[1].Age)
And would get this result:
1 Anne 10
1 Flip 60
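The same check can be sketched without an explicit loop by comparing the Value column with a copy of itself shifted up by one row. This is a minimal sketch on the sample data above; the names mask and matches are chosen here for illustration.

```python
import pandas as pd

# Sample data from the answer above
data = pd.DataFrame({
    "Value": [1, 2, 3, 2, 1, 1, 2],
    "Name": ["Anne", "Bert", "Caro", "Dora", "Emil", "Flip", "Gabi"],
    "Age": [10, 20, 30, 40, 50, 60, 70],
})

# shift(-1) aligns every row with the Value of the row below it;
# the last row has no successor, gets NaN there, and so never matches
mask = (data["Value"] == 1) & (data["Value"].shift(-1) == 2)

# Collect the matching rows in a new dataframe instead of printing them
matches = data.loc[mask, ["Value", "Name", "Age"]]
print(matches)
```

This selects the rows for Anne and Flip, the same two rows printed by the loop.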

How to select bunch of rows

I have a dataframe with multiple columns. I want to select a bunch of rows where column B has consecutive 1s, check whether column A has any value equal to 0.04 within those rows, and if so keep that bunch of rows and extract the start and end values of column A for it.
Here is my dataframe
Here is my desired output:
Label the consecutive groups with .diff().abs().cumsum().bfill(), filter out the groups that do not meet the conditions (x['B'].eq(1).any() and x['A'].eq(0.04).any()), then group by the consecutivity column again and use agg to extract the first and last values of A.
df['temp'] = df.B.diff().abs().cumsum().bfill()
df.groupby('temp').filter(lambda x: (x['B'].eq(1).any() and x['A'].eq(0.04).any()))\
.groupby('temp').agg({'A':['first','last']})
Out:
          A
      first  last
temp
3.0   344.0  39.9
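Because the original dataframe is not shown, here is a self-contained sketch on made-up data. It uses a slightly different labelling technique than the answer: runs are labelled with ne(shift).cumsum(), which avoids back-filling the first row into the following group, and only rows with B equal to 1 are kept before filtering. The data and the column name run are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: two runs of consecutive 1s in B, only the first contains A == 0.04
df = pd.DataFrame({
    "A": [1.0, 2.0, 0.04, 3.0, 5.0, 6.0, 7.0],
    "B": [0, 1, 1, 1, 0, 1, 1],
})

# A new run starts whenever B differs from the previous row
df["run"] = df["B"].ne(df["B"].shift()).cumsum()

# Keep only the rows that belong to runs of 1s
ones = df[df["B"].eq(1)]

# Keep runs containing A == 0.04, then take the first and last A of each
result = (
    ones.groupby("run")
        .filter(lambda g: g["A"].eq(0.04).any())
        .groupby("run")["A"]
        .agg(["first", "last"])
)
print(result)
```

Here only the run labelled 2 survives, with first 2.0 and last 3.0. Comparing floats with eq(0.04) works for values stored exactly as 0.04; computed values may need a tolerance instead.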

Excel: Sum columns and rows if criteria is met

I have a sheet with product names in column I and then dates from there on. For each date there are numbers of how many pieces of a certain product have to be made. I'm trying to sum all those numbers based on a product type, i.e.:
I K L M ...
30.8. 31.8. 1.9. ...
MAD23 2 0 45 ...
MMR32 5 7 33 ...
MAD17 17 56 0 ...
MAD: 120 (2+0+45+17+56+0)
MMR: 45 (5+7+33)
What I'm doing now is sum the row first:
=SUM(K6:GN6)
MAD23 = 47
MMR32 = 45
MAD17 = 73
And then sum those numbers in column J based on part of the product name in column I:
=SUMIF(Sheet1!I6:I775;"MAD*";Sheet1!J6:J775)
MAD = 120
MMR = 45
Is it possible to do this with just one formula per criteria?
Just trying it on those three rows, I get
=SUM($K$6:$M$8*(LEFT($I$6:$I$8,LEN(I10)-1)=LEFT(I10,LEN(I10)-1)))
which is an array formula and must be entered with Ctrl+Shift+Enter.
That's assuming that I10 is going to contain some characters followed by a colon and you want to match those with the first characters of I6:I8.
=SUM(IF(MID(Sheet1!I6:I775,1,3)="MAD",Sheet1!k6:gn775,""))
Enter it with Ctrl+Shift+Enter.
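As a cross-check of the expected totals outside Excel, the same sum can be sketched in pandas. The column name product is an assumption, and the numbers are the three sample rows from the question.

```python
import pandas as pd

# The three sample rows from the question; 'product' is a hypothetical column name
sheet = pd.DataFrame({
    "product": ["MAD23", "MMR32", "MAD17"],
    "30.8.": [2, 5, 17],
    "31.8.": [0, 7, 56],
    "1.9.": [45, 33, 0],
})

# Sum each row across the date columns, then group the row sums
# by the first three characters of the product name
totals = (
    sheet.set_index("product")
         .sum(axis=1)
         .groupby(lambda name: name[:3])
         .sum()
)
print(totals)
```

This gives 120 for MAD and 45 for MMR, matching the totals in the question.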
