Comparing strings in same series (row) but different columns - python-3.x

I ran into this problem comparing strings between two columns. What I want to do is: for each row, check whether the string in column A is included in column B, and if so, write a new string 'Yes' in column C.
Column A contains NaN values (blank cells in the csv I imported).
I have tried:
df['C']=df['B'].str.contains(df.loc['A'])
df.loc[df['A'].isin(df['B']), 'C']='Yes'
Neither worked, as I couldn't find the right way to compare the strings.

This uses a list comprehension, so it may not be the fastest solution, but it works and is concise.
df['C'] = pd.Series(['Yes' if a in b else 'No' for a,b in zip(df['A'],df['B'])])
EDIT: If you want to keep the existing values in C instead of overwriting them with 'No', you can do it like this:
df['C'] = pd.Series(['Yes' if a in b else c for a,b,c in zip(df['A'],df['B'], df['C'])])
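Note that the question mentions NaN in column A, and a in b raises a TypeError when a is a float NaN. A minimal guarded variant (the isinstance check is my addition, on the assumption that NaN should count as "not contained"):
import pandas as pd

# Treat a non-string (NaN) in A as "not contained" -- an assumption
df['C'] = ['Yes' if isinstance(a, str) and a in b else 'No'
           for a, b in zip(df['A'], df['B'])]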

import numpy as np
import pandas as pd

df = pd.DataFrame([['ab', 'abc'],
                   ['abc', 'ab']], columns=list('AB'))
df['C'] = np.where(df.apply(lambda x: x.A in x.B, axis=1), 'Yes', 'No')
df

Try regex: https://docs.python.org/2/library/re.html if you have already written the code to identify every cell or value you have to work with.
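For illustration, a hedged sketch of what the regex route could look like, using re.search with re.escape so the value in A is matched literally (the NaN guard is an assumption, not part of the original answer):
import re

# Match the literal text of A anywhere inside B; skip non-strings (NaN)
df['C'] = ['Yes' if isinstance(a, str) and re.search(re.escape(a), b) else 'No'
           for a, b in zip(df['A'], df['B'])]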

Related

Is there any python-Dataframe function in which I can iterate over rows of certain columns?

Want to solve this kind of problem in python:
tran_df['bad_debt']=train_df.frame_apply(lambda x: 1 if (x['second_mortgage']!=0 and x['home_equity']!=0) else x['debt'])
I want to be able to create a new column and iterate over the rows for specific columns.
In Excel it's really easy; I did:
if(AND(col_name1<>0,col_name2<>0),1,col_name5)
Any help will be much appreciated.
To iterate over rows only for certain columns:
for rowIndex, row in df[['col1', 'col2']].iterrows():  # iterate over rows
    ...  # do something with row['col1'], row['col2']
To create a new column:
df['new'] = 0  # initialise as 0
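Putting the two together, a minimal runnable sketch on made-up data (the column names col1 and col2 and the 0/1 logic are illustrative only):
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [3, 0, 4]})
df['new'] = 0  # initialise as 0
for rowIndex, row in df[['col1', 'col2']].iterrows():  # iterate over rows
    if row['col1'] != 0 and row['col2'] != 0:
        df.at[rowIndex, 'new'] = 1  # both columns non-zero
print(df)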
As a rule, iterating over rows in pandas is the wrong approach. Use the np.where function from NumPy to select the right values for the rows:
tran_df['bad_debt'] = np.where(
    (tran_df['second_mortgage'] != 0) & (tran_df['home_equity'] != 0),
    1, tran_df['debt'])
First create the new column with an initial value, then use .loc to locate the rows that match the condition and assign the new value:
tran_df['bad_debt'] = tran_df['debt']
tran_df.loc[(tran_df['second_mortgage'] != 0) & (tran_df['home_equity'] != 0), 'bad_debt'] = 1
Or:
tran_df['bad_debt'] = 1
tran_df.loc[(tran_df['second_mortgage'] == 0) | (tran_df['home_equity'] == 0), 'bad_debt'] = tran_df['debt']
Remember to put round brackets around each condition when combining them with the bitwise operators (& and |).
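For reference, an end-to-end example of the np.where approach on made-up data (the numbers are invented purely for illustration):
import numpy as np
import pandas as pd

tran_df = pd.DataFrame({'second_mortgage': [0, 100, 50],
                        'home_equity': [10, 20, 0],
                        'debt': [5, 7, 9]})
tran_df['bad_debt'] = np.where(
    (tran_df['second_mortgage'] != 0) & (tran_df['home_equity'] != 0),
    1, tran_df['debt'])
print(tran_df)  # bad_debt is 1 only where both columns are non-zero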

Split all column names by specific characters and take the last part as new column names in Pandas

I have a dataframe which has column names like this:
id, xxx>xxx>x, yy>y, zzzz>zzz>zz>z, ...
I need to split at the second > from the right side and take the part after it as the new column names: id, xxx>x, yy>y, zz>z, ....
I have used: 'zzzz>zzz>zz>z'.rsplit('>', 1)[-1] to get z as the expected new column name for the third column.
When I use: df.columns = df.columns.rsplit('>', 1)[-1]:
Out:
ValueError: Length mismatch: Expected axis has 13 elements, new values have 2 elements
How could I do that correctly?
Try:
import pandas as pd

names = pd.Index(['xxx>xxx>x', 'yy>y', 'zzzz>zzz>zz>z'])
names = pd.Index([idx[-1] for idx in names.str.rsplit('>')])
print(names)
# Index(['x', 'y', 'z'], dtype='object')
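Applied to the frame from the question, a plain list comprehension over df.columns does the same in one step (the column names below are hypothetical stand-ins):
import pandas as pd

df = pd.DataFrame(columns=['id', 'xxx>xxx>x', 'yy>y', 'zzzz>zzz>zz>z'])
# Keep only the part after the last '>' in each name
df.columns = [c.rsplit('>', 1)[-1] for c in df.columns]
print(df.columns)
# Index(['id', 'x', 'y', 'z'], dtype='object')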

split multiple values into two columns based on single separator

I am new to pandas. I have a situation where I want to split the length column into two columns, a and b. The values in the length column come in pairs. For each pair I want to compare the two values and put the smaller one in a and the larger one in b, then do the same with the next pair on the same row.
I have a hundred rows. I think I cannot use str.split because there are multiple values with the same delimiter, and I have no idea how to do it.
The output should be like this.
Any help will be appreciated.
length                                            a                     b
{22.562,"35.012","25.456",37.342,24.541,38.241}   22.562,25.456,24.541  35.012,37.342,38.241
{21.562,"37.012",25.256,36.342}                   21.562,25.256         37.012,36.342
{22.256,36.456,26.245,35.342,25.56,"36.25"}       22.256,26.245,25.56   36.456,35.342,36.25
I have tried:
df['a'] = df['length'].str.split(',').str[0::2]
df['b'] = df['length'].str.split(',').str[1::3]
With this code the column b output is perfect, but column a prints the first full pair and then the second; it is not giving only the 0th, 2nd, 4th values.
The problem comes from the fact that your length column is made of sets, not lists.
Here is a way to do what you want by casting your length column as lists:
df['length'] = [list(x) for x in df.length]  # cast the sets as lists
df['a'] = [x[0::2] for x in df.length]  # every other value, starting at index 0
df['b'] = [x[1::2] for x in df.length]  # every other value, starting at index 1
Output:
length a \
0 [35.012, 37.342, 38.241, 22.562, 24.541, 25.456] [35.012, 38.241, 24.541]
1 [25.256, 36.342, 21.562, 37.012] [25.256, 21.562]
2 [35.342, 36.456, 36.25, 22.256, 25.56, 26.245] [35.342, 36.25, 25.56]
b
0 [37.342, 22.562, 25.456]
1 [36.342, 37.012]
2 [36.456, 22.256, 26.245]
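Note that Python sets are unordered, so casting to list can scramble the original pairing, as the output above shows. If the values arrive in order (e.g. as lists), a sketch that also honors the smaller-in-a, larger-in-b requirement from the question (the data is made up for illustration):
import pandas as pd

df = pd.DataFrame({'length': [[22.562, 35.012, 25.456, 37.342, 24.541, 38.241],
                              [21.562, 37.012, 25.256, 36.342]]})
# Pair consecutive values, then send the smaller of each pair to a, the larger to b
df['a'] = df['length'].apply(lambda xs: [min(p) for p in zip(xs[0::2], xs[1::2])])
df['b'] = df['length'].apply(lambda xs: [max(p) for p in zip(xs[0::2], xs[1::2])])
print(df[['a', 'b']])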

Names of the columns we're searching for missing values

Searching for missing values:
columns = ['median', 'p25th', 'p75th']
# Look at the dtypes of the columns
print(____)
# Find how missing values are represented (search for missing values in the median, p25th, and p75th columns)
print(recent_grads["median"].____)
# Replace missing values with NaN, using numpy's np.nan
for column in ____:
    recent_grads.loc[____ == '____', column] = ____
# Names of the columns we're searching for missing values
columns = ['median', 'p25th', 'p75th']
# Take a look at the dtypes
print(recent_grads[columns].dtypes)
# Find how missing values are represented
print(recent_grads["median"].unique())
# Replace missing values with NaN
for column in columns:
    recent_grads.loc[recent_grads[column] == 'UN', column] = np.nan
The right answer is:
for column in columns:
    recent_grads.loc[recent_grads[column] == 'UN', column] = np.nan
# Print .dtypes
print(recent_grads.dtypes)
# Output summary statistics
print(recent_grads.describe())
# Exclude data of type object
print(recent_grads.describe(exclude=["object"]))
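After replacing 'UN' with NaN the columns are still of object dtype; if (as the describe() step suggests) you want them numeric, a follow-up sketch on stand-in data (the values below are invented):
import numpy as np
import pandas as pd

# Stand-in for the exercise's recent_grads DataFrame
recent_grads = pd.DataFrame({'median': ['40000', 'UN'],
                             'p25th': ['25000', '30000'],
                             'p75th': ['60000', 'UN']})
columns = ['median', 'p25th', 'p75th']
for column in columns:
    recent_grads.loc[recent_grads[column] == 'UN', column] = np.nan
    recent_grads[column] = pd.to_numeric(recent_grads[column])
print(recent_grads.dtypes)  # all three columns are now float64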

pandas: iterate rows and break on a condition

I have a column that's unorganized like this;
Name
Jack
James
Riddick
Random value
Another random value
What I'm trying to do is get only the names from this column, but I'm struggling to find a way to differentiate real names from random values. Fortunately the names are all together, and the random values are all together as well. The only thing I can think of is to iterate over the rows until reaching 'Random value' and then break off.
I've tried using lambdas for this, but with no success, as I don't think there's a way to break. And I'm not sure how a comprehension could work in this case.
Here's the example I've been trying to play with;
df['Name'] = df['Name'].map(lambda x: True if x != 'Random value' else break)
But the above doesn't work. Any suggestions on what could work based on what I'm trying to achieve? Thanks.
Find the index of the row containing 'Random value':
index_split = df[df.Name == 'Random value'].index.values[0]
Save the random values for later use if you want:
random_values = df.iloc[index_split + 1:]['Name'].values  # everything after the first 'Random value'
Remove the random values from the Name column:
df = df.iloc[:index_split]
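An alternative sketch that avoids looking up the index explicitly: build a boolean mask that turns True from the first 'Random value' onward and keep the rows before it (cummax carries the first True forward):
import pandas as pd

df = pd.DataFrame({'Name': ['Jack', 'James', 'Riddick',
                            'Random value', 'Another random value']})
mask = (df['Name'] == 'Random value').cummax()  # True from the first match on
names_only = df.loc[~mask, 'Name']
print(names_only.tolist())  # ['Jack', 'James', 'Riddick']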
