Item in String - Create new column - string

I have found other answers to this question that select conditionally with operators, but none that use a contains check.
What I am trying to accomplish is the following, inside a dataframe:
A cat ?
B dog ?
C rat ?
How do I set the third column to a value depending on whether the second column contains 'a'?

Use str.contains, which returns True or False for each row:
df['Col2'] = df['Col1'].str.contains('a')
Output:
Col0 Col1 Col2
0 A cat True
1 B dog False
2 C rat True
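If the new column should hold a custom value rather than True/False, the boolean mask can be passed to numpy.where. This is a minimal sketch, assuming the column names Col0/Col1/Col2 from the output above; the labels "has a" and "no a" are placeholders, not part of the original question.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question
df = pd.DataFrame({"Col0": ["A", "B", "C"], "Col1": ["cat", "dog", "rat"]})

# Boolean mask: True where Col1 contains the literal letter 'a'
mask = df["Col1"].str.contains("a", regex=False)

# Map the mask to placeholder labels (hypothetical values)
df["Col2"] = np.where(mask, "has a", "no a")
print(df)
```

Passing regex=False treats the pattern as plain text, which avoids surprises if the search string ever contains characters such as "." or "|".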

Related

subtracting values between columns both ways

I need to find the values that are in column 1 but not in column 2, and vice versa. It could work like this: take the first row of the first column and check whether the same number exists in the second column. If it does, write 0 (the subtraction) in the third column; if it does not, write the searched number or an error, it doesn't matter which. This should work both ways (some numbers can be in col2 but not in col1, and I need to find those as well). So there would probably be two formulas in two columns: one searching from col1 to col2, and the same for col2 to col1. And if there for example in col1 would be twice some value and in col2 just once, then it should show for the first number 0 and for second number error or searched number.
Dataset looks like this:
Col1
Col2.
42646
55
42646
77
33
25
77
Col3
Col4
0
55
0
0
33(or error,NA etc)
25
0
I have tried VLOOKUP, but wasn't successful.
I think this is what you are looking for. For Col3 you can use:
=IF(A2:A6="", "",IF(ISNA(XMATCH(A2:A6,B2:B6)),A2:A6,0))
and for Col4:
=IF(B2:B6="", "", IF(ISNA(XMATCH(B2:B6,A2:A6)),B2:B6,0))
Both formulas return 0 if the value was found and the missing value otherwise; blank cells produce an empty string.
You can put all together using HSTACK:
= HSTACK(IF(A2:A6="", "",IF(ISNA(XMATCH(A2:A6,B2:B6)),A2:A6,0)),
IF(B2:B6="", "", IF(ISNA(XMATCH(B2:B6,A2:A6)),B2:B6,0)))
Or using LET to avoid repetitions.
= LET(A, A2:A6, B, B2:B6, HSTACK(IF(A="","",IF(ISNA(XMATCH(A,B)),A,0)),
IF(B="", "", IF(ISNA(XMATCH(B,A)),B,0))))
You can use XLOOKUP too, but the formula is longer, because the first three input arguments are required:
=IF(ISNA(XLOOKUP(A2:A6,B2:B6, A2:A6)),A2:A6,0)
It's a shame your data doesn't include a sample showing what you meant by:
"And if there for example in col1 would be twice some value and in col2 just once, then it should show for the first number 0 and for second number error or searched number."
Your requirements make this a little tricky, but try:
Formula in C1:
=IF(A1="","",IF(COUNTIF(B:B,A1)-COUNTIF(A$1:A1,A1)<0,A1,0))
Formula in D1:
=IF(B1="","",IF(COUNTIF(A:A,B1)-COUNTIF(B$1:B1,B1)<0,B1,0))
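The counting logic behind the two COUNTIF formulas can be sketched in Python for readers who want to check it outside Excel. This is only an illustration: the function name unmatched and the sample lists are hypothetical, and blank cells are not handled.

```python
from collections import Counter

def unmatched(source, target):
    """Return 0 for each source value that can be paired with an unused
    occurrence in target, otherwise the value itself."""
    available = Counter(target)  # plays the role of COUNTIF(B:B, A1)
    seen = Counter()             # plays the role of COUNTIF(A$1:A1, A1)
    result = []
    for value in source:
        seen[value] += 1
        # Unmatched once this value has occurred more often than in target
        result.append(0 if seen[value] <= available[value] else value)
    return result

# Hypothetical sample data, not taken from the question
col1 = [1, 1, 33, 77]
col2 = [1, 77, 25, 55]
print(unmatched(col1, col2))  # [0, 1, 33, 0]
print(unmatched(col2, col1))  # [0, 0, 25, 55]
```

The second 1 in col1 is reported because col2 contains that value only once, which is the duplicate case described in the question.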

Increase the values in a column values based on values in other column in pandas

I have my source data in the form of a CSV file, as below:
id,col1,col2
123,11|22|33||||||,val1|val3|val2
456,99||77|||88|||||||||6|,val4|val5|val6|val7
I need to add a new column (fnlsrc) whose values are based on col2 and col1: if col1 has 9 values (separated by pipes) and col2 has 3 values (separated by pipes), then fnlsrc must hold 9 values (separated by pipes), i.e. 3 repetitions of col2 (val1|val3|val2|val1|val3|val2|val1|val3|val2). The expected output below should make the requirement clearer:
id,col1,col2,fnlsrc
123,11|22|33||||||,val1|val3|val2,val1|val3|val2|val1|val3|val2|val1|val3|val2
456,99||77|||88|||||||||6|,val4|val5|val6|val7,val4|val5|val6|val7|val4|val5|val6|val7|val4|val5|val6|val7|val4|val5|val6|val7
I have tried the following code, but it adds only one set:
zipped = zip(df['col1'], df['col2'])
for s,t in zipped:
    count = int((s.count('|') + 1)/(t.count('|') + 1))
    for val in range(count):
        df['fnlsrc'] = t
As the new column is based on the other two, I would use pandas' apply() function. I defined a function that calculates the new column value from the other two columns, and it is then applied to each row:
def new_value(x):
    # Find out the number of values in both columns
    col1_numbers = x['col1'].count('|') + 1
    col2_numbers = x['col2'].count('|') + 1
    # Calculate how many times col2 should appear in the new column
    repetition = col1_numbers // col2_numbers
    # Create a list of strings containing the values of the new column
    values = [x['col2']] * repetition
    # Join the list of strings with pipes
    return '|'.join(values)

# Apply the function to every row
df['fnlsrc'] = df.apply(new_value, axis=1)
df
Output:
id col1 col2 fnlsrc
0 123 11|22|33|||||| val1|val3|val2 val1|val3|val2|val1|val3|val2|val1|val3|val2
1 456 99||77|||88|||||||||6| val4|val5|val6|val7 val4|val5|val6|val7|val4|val5|val6|val7|val4|v...
Full output in your input format:
id,col1,col2,fnlsrc
123,11|22|33||||||,val1|val3|val2,val1|val3|val2|val1|val3|val2|val1|val3|val2
456,99||77|||88|||||||||6|,val4|val5|val6|val7,val4|val5|val6|val7|val4|val5|val6|val7|val4|val5|val6|val7|val4|val5|val6|val7
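For reference, the same result can be produced without apply(), using the vectorized string methods. This is a sketch under the assumption that the CSV looks exactly like the sample in the question; the variable names csv_text, n_col1, n_col2 and reps are made up for illustration.

```python
import io
import pandas as pd

# Sample data copied from the question, read as strings
csv_text = """id,col1,col2
123,11|22|33||||||,val1|val3|val2
456,99||77|||88|||||||||6|,val4|val5|val6|val7
"""
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Number of pipe-separated values in each column (pipes + 1)
n_col1 = df["col1"].str.count(r"\|") + 1
n_col2 = df["col2"].str.count(r"\|") + 1

# How many times col2 fits into col1 (integer division)
reps = n_col1 // n_col2

# Repeat col2 that many times and join the copies with pipes
df["fnlsrc"] = ["|".join([c2] * int(r)) for c2, r in zip(df["col2"], reps)]
print(df.to_csv(index=False))
```

Note that str.count takes a regular expression, so the pipe has to be escaped. If the number of values in col1 is not a multiple of the number in col2, the integer division silently drops the remainder, just as in the apply() version.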

removing rows that have same information regardless the order in python Data Frame [duplicate]

I have a data frame with two columns, and I want to remove all duplicates regardless of the order of the values:
col1 col2
A B
B A
C D
E F
F E
The output should be
col1 col2
A B
C D
E F
I have tried using the duplicate function, but it did not remove anything because the values are not in the same order.
One way:
Take the inner numpy array and sort it.
Use the dataframe constructor to recreate the dataframe(sorted by row).
Drop the duplicates.
df = pd.DataFrame(np.sort(df.values), columns = df.columns).drop_duplicates()
OUTPUT:
col1 col2
0 A B
2 C D
3 E F
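The one-liner above rewrites every row in sorted order. If the original cell order of the kept rows matters, a variation is to use the sorted rows only as a comparison key. This is a sketch, assuming the sample frame from the question; the names keys and result are illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question
df = pd.DataFrame({"col1": list("ABCEF"), "col2": list("BADFE")})

# Sort each row so that (A, B) and (B, A) become identical keys
keys = pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index)

# Keep the first occurrence of each key; the kept rows are left as they were
result = df[~keys.duplicated()]
print(result)
```

For this sample the result matches the output above, because the first occurrence of each pair already happens to be in sorted order.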

Sumifs <> not operating as an AND function

I have two criteria columns with data I want to exclude, but the result of my SUMIFS is wrong when I enter the two criteria. When I concatenate the two columns and use a SUMIFS with one criterion (a SUMIF would also work), the result is correct.
I would like to sum col1 where col2 is not "a" and where col3 is not "b". The formula I have used is =SUMIFS(A9:A12,B9:B12,"<>a",C9:C12,"<>b") which returns 0.
=SUMIFS(A9:A12,D9:D12,"<>ab") returns 7, which is correct.
I understood that SUMIFS runs on an AND operator so all conditions must be true, but in the first case with two criteria it excludes all of the numbers because everything in col3 is a "b".
col1 col2 col3 col4
1 a b ab
2 b b bb
3 a b ab
5 d b db
Why am I getting different results? When I write the same formulas as inclusive conditions, such as =SUMIFS(A9:A12,B9:B12,"a",C9:C12,"b") and =SUMIFS(A9:A12,D9:D12,"ab"), both return 4, which is correct. But using <> gives mismatched answers.
All formulas in your question give correct results. SUMIFS only sums a row when every condition is true, and col3<>"b" is false for every row:
col1 col2 col3 col4
1 a b ab // a<>a false, b<>b false -> no summing
2 b b bb // b<>a true , b<>b false -> no summing
3 a b ab // a<>a false, b<>b false -> no summing
5 d b db // d<>a true , b<>b false -> no summing
Try changing the second row to:
2 b e be // b<>a true , b<>e true
You will see that the result will change.
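The difference between the two formulas can be reproduced outside Excel. The sketch below rebuilds the sample table in pandas (the variable names are illustrative) and shows that "col2 is not a AND col3 is not b" is a stricter test than "the concatenation is not ab", which is true as soon as either column differs.

```python
import pandas as pd

# Sample table from the question
df = pd.DataFrame({
    "col1": [1, 2, 3, 5],
    "col2": ["a", "b", "a", "d"],
    "col3": ["b", "b", "b", "b"],
})
df["col4"] = df["col2"] + df["col3"]

# Two criteria, combined with AND as SUMIFS does
both_differ = (df["col2"] != "a") & (df["col3"] != "b")

# One criterion on the concatenation: equivalent to an OR of the two tests
pair_differs = df["col4"] != "ab"

sum_and = int(df.loc[both_differ, "col1"].sum())
sum_pair = int(df.loc[pair_differs, "col1"].sum())
print(sum_and, sum_pair)  # 0 7
```

The inclusive versions agree with each other because "col2 is a AND col3 is b" really is the same condition as "the concatenation is ab"; negating it turns the AND into an OR.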

Excel formulas matching numbers [duplicate]

I have been looking through all different sources and cannot find the exact answer to this. Was hoping someone can help me out.
I have two columns:
COL1 COL2
abc defghe
def iabclmn
ghi zhued
fgh lmnop
I want to know whether each value in COL1 exists anywhere in COL2. So in this case I want it to look like this:
COL1 COL2 COL3
abc defghe TRUE
def iabclmn TRUE
ghi zhued FALSE
fgh lmnop TRUE
Is there a function that can do this? I have over 500 rows, so I cannot just call out specific values.
I know there is an example that does specific values like this, but I want it to be by the entire column:
=ISNUMBER(SEARCH(substring,text))
Thanks!
To do it for full columns with a regular (non-array) formula:
=COUNTIF(B:B,"*"&A1&"*")>0
This will do it:
=SUMPRODUCT(ISNUMBER(SEARCH(A1,$B$1:$B$4))*1)>0
SUMPRODUCT() forces the formula to iterate through column B and count the cells for which SEARCH finds a match, so every match adds 1 to the total.
The >0 tests whether any cell returned TRUE.
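The same "does this value occur inside any cell of the other column" check can be sketched in Python to verify the expected COL3 values. The lists are copied from the question. Lowercasing both sides is an assumption made here to mimic the case-insensitive matching of SEARCH and COUNTIF.

```python
# Sample columns from the question
col1 = ["abc", "def", "ghi", "fgh"]
col2 = ["defghe", "iabclmn", "zhued", "lmnop"]

# Lowercase once so the substring test ignores case
haystack = [text.lower() for text in col2]

# For each COL1 value: is it a substring of at least one COL2 cell?
col3 = [any(value.lower() in text for text in haystack) for value in col1]
print(col3)  # [True, True, False, True]
```

This compares every COL1 value against every COL2 cell, which is fine for a few hundred rows.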
