creating a new column from an existing categorical column in python - python-3.x

I have a data frame like this:
ID col
1 a
2 b
3 c
4 d
I want to create a new column such that if col is a or c, the new column will contain Y, otherwise N.
So, it will look like the following:
ID col col1
1 a Y
2 b N
3 c Y
4 d N
I am working in python3.

Try this code; it's simple and effective:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Col': ['a', 'b', 'c', 'd']})

# 'Y' when Col is 'a' or 'c', otherwise 'N'
df['Col_2'] = df.apply(lambda row: 'Y' if row.Col in ('a', 'c') else 'N', axis=1)
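For larger frames, a vectorized alternative avoids the per-row apply overhead; a minimal sketch with Series.isin and numpy.where, using the same df:
import numpy as np

# isin builds a boolean mask; np.where maps it to 'Y'/'N' in one pass
df['Col_2'] = np.where(df['Col'].isin(['a', 'c']), 'Y', 'N')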

Related

replace values with randomly selected values in a pandas dataframe

Python 3.6, Pandas 1.1.5 on Windows 10.
Trying to optimize the below for better performance on a large dataset.
Purpose: randomly select a single value if the data contains several values separated by a space.
For example, from:
col1 col2 col3
0 a a b c a c
1 a b c a
2 a b c b b
to:
col1 col2 col3
0 a b c
1 a c a
2 b b b
So far:
import itertools
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a b', 'a b c'],
                   'col2': ['a b c', 'c', 'b'],
                   'col3': ['a c', 'a', 'b']})

# flatten the data into a 1-D list
vals = list(itertools.chain.from_iterable(df.values))
vals_ = []
# randomly select a single value from each cell
for v in vals:
    v = v.split(' ')
    a = np.random.choice(len(v), 1)[0]
    v = v[a]
    vals_.append(v)

# reshape back into the original frame
gf = pd.DataFrame(np.array(vals_).reshape(df.shape),
                  index=df.index,
                  columns=df.columns)
This is not fast on a large dataset. Any leads would be appreciated.
Define a function and apply it to the entire Pandas DataFrame via pd.DataFrame.applymap. The function could be implemented as:
def rndVal(x: str):
    # cells longer than one character may hold several space-separated
    # values: split and pick one at random
    if len(x) > 1:
        x = x.split(' ')
        a = np.random.choice(len(x), 1)[0]
        return x[a]
    else:
        return x
and is applied with
df.applymap(rndVal)
which returns a frame of the same shape with one randomly chosen value per cell.
Regarding performance: running your attempt and applymap on a DataFrame with 300,000 rows, the former requires 18.6 s while this solution takes only 8.4 s.
Pandas fast approach
Stack to reshape, then split and explode the strings, then group by the MultiIndex and draw a sample of size 1 per group, then unstack back to the original shape (GroupBy.sample requires pandas >= 1.1):
(
    df.stack()                          # long format, MultiIndex (row, column)
      .str.split().explode()            # one value per row
      .groupby(level=[0, 1]).sample(1)  # one random draw per original cell
      .unstack()                        # back to the original shape
)
col1 col2 col3
0 a a c
1 b c a
2 a b b
(one possible result; the draws are random)
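If the stack/explode route is still too slow, one more thing worth trying is swapping np.random.choice for the standard library's random.choice, which is much cheaper per call on short lists; a sketch under that assumption, reusing df from above:
import random
import numpy as np
import pandas as pd

# one pass over the flattened cells; random.choice on a short list is cheap
picks = [random.choice(s.split(' ')) for s in df.to_numpy().ravel()]
gf = pd.DataFrame(np.array(picks).reshape(df.shape),
                  index=df.index, columns=df.columns)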

Pandas dataframe deduplicate rows with column logic

I have a pandas dataframe with about 100 million rows. I am interested in deduplicating it but have some criteria that I haven't been able to find documentation for.
I would like to deduplicate the dataframe, ignoring one column that will differ. If that row is a duplicate, except for that column, I would like to only keep the row that has a specific string, say X.
Sample dataframe:
import pandas as pd

df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "00X"],
                        [1, 3, "010"],
                        [1, 2, "002"]])
Desired output:
>>> df_dedup
A B C
0 1 2 00X
1 1 3 010
So, alternatively stated, row index 2 would be removed because row index 0 has the same information in columns A and B, and an X in column C.
As this data is slightly large, I hope to avoid iterating over rows, if possible. The ignore_index argument is the closest thing I've found in the built-in drop_duplicates().
If there is no X in column C, then the row should require that C be identical to be deduplicated.
In the case where rows have matching A and B but multiple different versions with an X in C, the following would be expected.
df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "0X0"],
                        [1, 2, "X00"],
                        [1, 2, "0X0"]])
Output should be:
>>> df_dedup
A B C
0 1 2 0X0
1 1 2 X00
Use DataFrame.duplicated on columns A and B to create a boolean mask m1 marking the rows where the A/B pair is not yet duplicated. Then use Series.str.contains plus Series.duplicated on column C to create a second mask m2 marking the rows where C contains the string X and is not yet duplicated. Finally, filter the rows of df with these masks.
m1 = ~df[['A', 'B']].duplicated()                        # first occurrence of each A/B pair
m2 = df['C'].str.contains('X') & ~df['C'].duplicated()  # first occurrence of each X-bearing C
df = df[m1 | m2]
Result:
#1
A B C
0 1 2 00X
1 1 3 010
#2
A B C
0 1 2 0X0
1 1 2 X00
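Note that m1 keeps the first occurrence of each A/B pair even when the X-bearing row comes later, which happens to work for the samples above. If the real data may list the non-X row first, sorting X rows to the front beforehand keeps the masks order-independent; a sketch, assuming pandas >= 1.1 for the key argument of sort_values:
# put X-bearing rows first so m1 prefers them over plain duplicates
df = df.sort_values('C', key=lambda s: ~s.str.contains('X'))
m1 = ~df[['A', 'B']].duplicated()
m2 = df['C'].str.contains('X') & ~df['C'].duplicated()
df = df[m1 | m2]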
Does column C always have X as the last character of each value? You could try creating a column D with 1 if column C has an X and 0 if it does not. Then sort the values using sort_values and finally use drop_duplicates with keep='last'.
import pandas as pd

df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "00X"],
                        [1, 3, "010"],
                        [1, 2, "002"]])

# flag rows whose C value ends with 'X'
df['D'] = 0
df.loc[df['C'].str[-1] == 'X', 'D'] = 1

# sort so flagged rows come last, then keep the last row per A/B pair
df.sort_values(by=['D'], inplace=True)
df.drop_duplicates(subset=['A', 'B'], keep='last', inplace=True)
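For the sample frame this keeps the 00X and 010 rows; the helper column can then be dropped, e.g.:
# remove the helper flag once deduplication is done
df = df.drop(columns='D')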
This assumes you also want to drop duplicates in the case where there is no X in column C among the duplicates of columns A and B.
Here is another approach. I left 'count' (a helper column) in for transparency.
# use df as defined above
# count the rows in each (A, B) group
df['count'] = df.groupby(['A', 'B']).transform('count').squeeze()
m1 = (df['count'] == 1)                              # unique A/B pairs
m2 = (df['count'] > 1) & df['C'].str.contains('X')   # duplicated pairs: keep the X row
print(df.loc[m1 | m2])  # apply masks m1, m2
A B C count
0 1 2 00X 2
1 1 3 010 1

I need a function to check string values in pandas df columns and cannot find an answer

I made a pandas df from parts of 2 others:
Here is the pseudocode for what I want to do.
I have a 4-column pandas dataframe whose values are all single words, with columns A B C D.
I want to end up with columns A B C D E F.
In pseudocode: for every s in A, if s equals any string (not a substring) in D, write Yes to a new column E, otherwise write No; and if the string in B (same row as s) equals the string in C (same row as the match found in D), write Yes to a new column F, otherwise write No.
I'm not allowed to paste images of the sample data and the expected outcome.
The following code works, but now I need a function to do what is described above:
cols = [1, 2, 3, 5]
df3.drop(df3.columns[cols], axis=1, inplace=True)
df4.drop(df4.columns[1], axis=1, inplace=True)
listi = [df4]
listi.append(df3)
df5 = pd.concat(listi, axis=1)
Hope this helps. I created a sample data frame:
>>> df
A B C D
0 alpha spiderman theta superman
1 beta batman alpha spiderman
2 gamma superman epsilon hulk
Now add column E, which shows whether the item in A appears anywhere in C, and column F, which shows whether the item in B appears anywhere in D:
>>> df['E'] = df.A.isin(df.C).replace({True: "Yes", False: "No"})
>>> df['F'] = df.B.isin(df.D).replace({True: "Yes", False: "No"})
>>> df
A B C D E F
0 alpha spiderman theta superman Yes Yes
1 beta batman alpha spiderman No No
2 gamma superman epsilon hulk No Yes
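The question's pseudocode is stricter than a plain membership test: E should flag whether the value in A appears anywhere in D, and F should compare B on that row with C on the row where the match in D was found. A minimal sketch of that logic, assuming the values in D are unique so each match maps to one row:
# map each D value to the C value on the same row
d_to_c = dict(zip(df['D'], df['C']))

# E: does the value in A appear anywhere in D?
df['E'] = df['A'].isin(df['D']).map({True: 'Yes', False: 'No'})

# F: does B on this row equal C on the row where A was found in D?
df['F'] = ['Yes' if a in d_to_c and d_to_c[a] == b else 'No'
           for a, b in zip(df['A'], df['B'])]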

Search for value in all DataFrame columns (except first column !) and add new column with matching column name

I'd like to do a search on all columns (except the first column!) of a DataFrame and add a new column (like 'Column_Match') with the name of the matching column.
I tried something like this:
df.apply(lambda row: row.astype(str).str.contains('my_keyword').any(), axis=1)
But it's not excluding the first column, and I don't know how to return and add the column name.
Any help much appreciated!
If you want the column name of the first matched value per row, add a fallback column for rows where no value matches using DataFrame.assign, then take the column name with DataFrame.idxmax:
df = pd.DataFrame({
    'B': [4, 5, 4, 5, 5, 4],
    'A': list('abcdef'),
    'C': list('akabbe'),
    'F': list('eakbbb')
})

f = lambda row: row.astype(str).str.contains('e')
# test all columns but the first, add a 'missing' fallback that is always
# True, then take the first True column per row
df['new'] = df.iloc[:, 1:].apply(f, axis=1).assign(missing=True).idxmax(axis=1)
print(df)
B A C F new
0 4 a a e F
1 5 b k a missing
2 4 c a k missing
3 5 d b b missing
4 5 e b b A
5 4 f e b C
If you need all column names of all matched values, create a boolean DataFrame and take the dot product with the column names via DataFrame.dot, then strip the trailing separator with Series.str.rstrip:
# note: run this on the original four-column df, without the 'new' column
f = lambda row: row.astype(str).str.contains('a')
df1 = df.iloc[:, 1:].apply(f, axis=1)
df['new'] = df1.dot(df.columns[1:] + ', ').str.rstrip(', ').replace('', 'missing')
print(df)
B A C F new
0 4 a a e A, C
1 5 b k a F
2 4 c a k C
3 5 d b b missing
4 5 e b b missing
5 4 f e b missing
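Since row-wise apply can be slow on wide frames, the same boolean frame can also be built column-wise, which is usually faster; a sketch with the same 'a' keyword:
# str.contains is vectorized over each whole column here
df1 = df.iloc[:, 1:].astype(str).apply(lambda col: col.str.contains('a'))
df['new'] = df1.dot(df.columns[1:] + ', ').str.rstrip(', ').replace('', 'missing')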

add numeric prefix to pandas dataframe column names

How would I add a variable numeric prefix to DataFrame column names?
If I have a DataFrame df
colA colB
0 A X
1 B Y
2 C Z
How would I rename the columns according to their position? Something like this:
1_colA 2_colB
0 A X
1 B Y
2 C Z
The actual number of columns is too large to rename manually.
Thanks for the help.
Use enumerate for the count, with an f-string inside a list comprehension:
# Python 3.6+
df.columns = [f'{i}_{x}' for i, x in enumerate(df.columns, 1)]
# Python below 3.6
# df.columns = ['{}_{}'.format(i, x) for i, x in enumerate(df.columns, 1)]
print(df)
1_colA 2_colB
0 A X
1 B Y
2 C Z
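A minimal alternative sketch using DataFrame.rename, which returns a new frame instead of assigning to df.columns in place:
# same enumerate idea, expressed as a rename mapping
df = df.rename(columns={x: f'{i}_{x}' for i, x in enumerate(df.columns, 1)})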
