creating a new column from an existing categorical column in python - python-3.x

I have a data frame like this:
ID col
1 a
2 b
3 c
4 d
I want to create a new column such that if col is a or c, the new column will contain Y, otherwise N.
So, it will look like the following:
ID col col1
1 a Y
2 b N
3 c Y
4 d N
I am working in python3.

Try this code; it's simple and effective:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Col': ['a', 'b', 'c', 'd']})

# 'Y' when Col is 'a' or 'c', otherwise 'N'
df['Col_2'] = df.apply(lambda row: 'Y' if row.Col in ('a', 'c') else 'N', axis=1)
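For larger frames, a vectorized alternative avoids the per-row apply overhead; a minimal sketch with Series.isin and numpy.where, using the same df:
import numpy as np

# isin builds a boolean mask; np.where maps it to 'Y'/'N' in one pass
df['Col_2'] = np.where(df['Col'].isin(['a', 'c']), 'Y', 'N')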

Related

replace values with randomly selected values in a pandas dataframe

Python 3.6, Pandas 1.1.5 on Windows 10.
Trying to optimize the below for better performance on a large dataset.
Purpose: randomly select a single value if the data contains several values separated by a space.
For example, from:
col1 col2 col3
0 a a b c a c
1 a b c a
2 a b c b b
to:
col1 col2 col3
0 a b c
1 a c a
2 b b b
So far:
import itertools
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a b', 'a b c'],
                   'col2': ['a b c', 'c', 'b'],
                   'col3': ['a c', 'a', 'b']})

# flatten the data into a 1-D list
vals = list(itertools.chain.from_iterable(df.values))
vals_ = []
# randomly select a single value from each cell
for v in vals:
    v = v.split(' ')
    a = np.random.choice(len(v), 1)[0]
    v = v[a]
    vals_.append(v)

# reshape back into the original frame
gf = pd.DataFrame(np.array(vals_).reshape(df.shape),
                  index=df.index,
                  columns=df.columns)
This is not fast on a large dataset. Any leads would be appreciated.
Define a function and apply it to the entire Pandas DataFrame via pd.DataFrame.applymap. The function could be implemented as:
def rndVal(x: str):
    # cells longer than one character may hold several space-separated
    # values: split and pick one at random
    if len(x) > 1:
        x = x.split(' ')
        a = np.random.choice(len(x), 1)[0]
        return x[a]
    else:
        return x
and is applied with
df.applymap(rndVal)
which returns a frame of the same shape with one randomly chosen value per cell.
Regarding performance: running your attempt and applymap on a DataFrame with 300,000 rows, the former requires 18.6 s while this solution takes only 8.4 s.
Pandas fast approach
Stack to reshape, then split and explode the strings, then group by the MultiIndex and draw a sample of size 1 per group, then unstack back to the original shape (GroupBy.sample requires pandas >= 1.1):
(
    df.stack()                          # long format, MultiIndex (row, column)
      .str.split().explode()            # one value per row
      .groupby(level=[0, 1]).sample(1)  # one random draw per original cell
      .unstack()                        # back to the original shape
)
col1 col2 col3
0 a a c
1 b c a
2 a b b
(one possible result; the draws are random)
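If the stack/explode route is still too slow, one more thing worth trying is swapping np.random.choice for the standard library's random.choice, which is much cheaper per call on short lists; a sketch under that assumption, reusing df from above:
import random
import numpy as np
import pandas as pd

# one pass over the flattened cells; random.choice on a short list is cheap
picks = [random.choice(s.split(' ')) for s in df.to_numpy().ravel()]
gf = pd.DataFrame(np.array(picks).reshape(df.shape),
                  index=df.index, columns=df.columns)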

Pandas dataframe deduplicate rows with column logic

I have a pandas dataframe with about 100 million rows. I am interested in deduplicating it but have some criteria that I haven't been able to find documentation for.
I would like to deduplicate the dataframe, ignoring one column that will differ. If that row is a duplicate, except for that column, I would like to only keep the row that has a specific string, say X.
Sample dataframe:
import pandas as pd

df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "00X"],
                        [1, 3, "010"],
                        [1, 2, "002"]])
Desired output:
>>> df_dedup
A B C
0 1 2 00X
1 1 3 010
So, alternatively stated, row index 2 would be removed because row index 0 has the same information in columns A and B, and an X in column C.
As this data is slightly large, I hope to avoid iterating over rows, if possible. The ignore_index argument is the closest thing I've found in the built-in drop_duplicates().
If there is no X in column C, then the row should require that C be identical to be deduplicated.
In the case where rows have matching A and B but multiple different versions with an X in C, the following would be expected.
df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "0X0"],
                        [1, 2, "X00"],
                        [1, 2, "0X0"]])
Output should be:
>>> df_dedup
A B C
0 1 2 0X0
1 1 2 X00
Use DataFrame.duplicated on columns A and B to create a boolean mask m1 marking the rows where the A/B pair is not yet duplicated. Then use Series.str.contains plus Series.duplicated on column C to create a second mask m2 marking the rows where C contains the string X and is not yet duplicated. Finally, filter the rows of df with these masks.
m1 = ~df[['A', 'B']].duplicated()                        # first occurrence of each A/B pair
m2 = df['C'].str.contains('X') & ~df['C'].duplicated()  # first occurrence of each X-bearing C
df = df[m1 | m2]
Result:
#1
A B C
0 1 2 00X
1 1 3 010
#2
A B C
0 1 2 0X0
1 1 2 X00
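Note that m1 keeps the first occurrence of each A/B pair even when the X-bearing row comes later, which happens to work for the samples above. If the real data may list the non-X row first, sorting X rows to the front beforehand keeps the masks order-independent; a sketch, assuming pandas >= 1.1 for the key argument of sort_values:
# put X-bearing rows first so m1 prefers them over plain duplicates
df = df.sort_values('C', key=lambda s: ~s.str.contains('X'))
m1 = ~df[['A', 'B']].duplicated()
m2 = df['C'].str.contains('X') & ~df['C'].duplicated()
df = df[m1 | m2]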
Does column C always have X as the last character of each value? You could try creating a column D with 1 if column C has an X and 0 if it does not. Then sort the values using sort_values and finally use drop_duplicates with keep='last'.
import pandas as pd

df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "00X"],
                        [1, 3, "010"],
                        [1, 2, "002"]])

# flag rows whose C value ends with 'X'
df['D'] = 0
df.loc[df['C'].str[-1] == 'X', 'D'] = 1

# sort so flagged rows come last, then keep the last row per A/B pair
df.sort_values(by=['D'], inplace=True)
df.drop_duplicates(subset=['A', 'B'], keep='last', inplace=True)
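For the sample frame this keeps the 00X and 010 rows; the helper column can then be dropped, e.g.:
# remove the helper flag once deduplication is done
df = df.drop(columns='D')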
This assumes you also want to drop duplicates in the case where there is no X in column C among the duplicates of columns A and B.
Here is another approach. I left 'count' (a helper column) in for transparency.
# use df as defined above
# count the rows in each (A, B) group
df['count'] = df.groupby(['A', 'B']).transform('count').squeeze()
m1 = (df['count'] == 1)                              # unique A/B pairs
m2 = (df['count'] > 1) & df['C'].str.contains('X')   # duplicated pairs: keep the X row
print(df.loc[m1 | m2])  # apply masks m1, m2
A B C count
0 1 2 00X 2
1 1 3 010 1

I need a function to check string values in pandas df columns and cannot find an answer

I made a pandas df from parts of 2 others:
Here is the pseudocode for what I want to do.
I have a 4-column pandas dataframe whose values are all single words, with columns A B C D.
I want to end up with columns A B C D E F.
In pseudocode: for every s in A, if s equals any string (not a substring) in D, write Yes to a new column E, otherwise write No; and if the string in B (same row as s) equals the string in C (same row as the match found in D), write Yes to a new column F, otherwise write No.
I'm not allowed to paste images of the sample data and the expected outcome.
The following code works, but now I need a function to do what is described above:
cols = [1, 2, 3, 5]
df3.drop(df3.columns[cols], axis=1, inplace=True)
df4.drop(df4.columns[1], axis=1, inplace=True)
listi = [df4]
listi.append(df3)
df5 = pd.concat(listi, axis=1)
Hope this helps. I created a sample data frame:
>>> df
A B C D
0 alpha spiderman theta superman
1 beta batman alpha spiderman
2 gamma superman epsilon hulk
Now add column E, which shows whether the item in A appears anywhere in C, and column F, which shows whether the item in B appears anywhere in D:
>>> df['E'] = df.A.isin(df.C).replace({True: "Yes", False: "No"})
>>> df['F'] = df.B.isin(df.D).replace({True: "Yes", False: "No"})
>>> df
A B C D E F
0 alpha spiderman theta superman Yes Yes
1 beta batman alpha spiderman No No
2 gamma superman epsilon hulk No Yes
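The question's pseudocode is stricter than a plain membership test: E should flag whether the value in A appears anywhere in D, and F should compare B on that row with C on the row where the match in D was found. A minimal sketch of that logic, assuming the values in D are unique so each match maps to one row:
# map each D value to the C value on the same row
d_to_c = dict(zip(df['D'], df['C']))

# E: does the value in A appear anywhere in D?
df['E'] = df['A'].isin(df['D']).map({True: 'Yes', False: 'No'})

# F: does B on this row equal C on the row where A was found in D?
df['F'] = ['Yes' if a in d_to_c and d_to_c[a] == b else 'No'
           for a, b in zip(df['A'], df['B'])]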

Search for value in all DataFrame columns (except first column !) and add new column with matching column name

I'd like to do a search on all columns (except the first column!) of a DataFrame and add a new column (like 'Column_Match') with the name of the matching column.
I tried something like this:
df.apply(lambda row: row.astype(str).str.contains('my_keyword').any(), axis=1)
But it's not excluding the first column, and I don't know how to return and add the column name.
Any help much appreciated!
If you want the column name of the first matched value per row, add a fallback column for rows where no value matches using DataFrame.assign, then take the column name with DataFrame.idxmax:
df = pd.DataFrame({
    'B': [4, 5, 4, 5, 5, 4],
    'A': list('abcdef'),
    'C': list('akabbe'),
    'F': list('eakbbb')
})

f = lambda row: row.astype(str).str.contains('e')
# test all columns but the first, add a 'missing' fallback that is always
# True, then take the first True column per row
df['new'] = df.iloc[:, 1:].apply(f, axis=1).assign(missing=True).idxmax(axis=1)
print(df)
B A C F new
0 4 a a e F
1 5 b k a missing
2 4 c a k missing
3 5 d b b missing
4 5 e b b A
5 4 f e b C
If you need all column names of all matched values, create a boolean DataFrame and take the dot product with the column names via DataFrame.dot, then strip the trailing separator with Series.str.rstrip:
# note: run this on the original four-column df, without the 'new' column
f = lambda row: row.astype(str).str.contains('a')
df1 = df.iloc[:, 1:].apply(f, axis=1)
df['new'] = df1.dot(df.columns[1:] + ', ').str.rstrip(', ').replace('', 'missing')
print(df)
B A C F new
0 4 a a e A, C
1 5 b k a F
2 4 c a k C
3 5 d b b missing
4 5 e b b missing
5 4 f e b missing
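Since row-wise apply can be slow on wide frames, the same boolean frame can also be built column-wise, which is usually faster; a sketch with the same 'a' keyword:
# str.contains is vectorized over each whole column here
df1 = df.iloc[:, 1:].astype(str).apply(lambda col: col.str.contains('a'))
df['new'] = df1.dot(df.columns[1:] + ', ').str.rstrip(', ').replace('', 'missing')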

add numeric prefix to pandas dataframe column names

How would I add a variable numeric prefix to DataFrame column names?
If I have a DataFrame df
colA colB
0 A X
1 B Y
2 C Z
How would I rename the columns according to their position? Something like this:
1_colA 2_colB
0 A X
1 B Y
2 C Z
The actual number of columns is too large to rename manually.
Thanks for the help.
Use enumerate for the count, with an f-string inside a list comprehension:
# Python 3.6+
df.columns = [f'{i}_{x}' for i, x in enumerate(df.columns, 1)]
# Python below 3.6
# df.columns = ['{}_{}'.format(i, x) for i, x in enumerate(df.columns, 1)]
print(df)
1_colA 2_colB
0 A X
1 B Y
2 C Z
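A minimal alternative sketch using DataFrame.rename, which returns a new frame instead of assigning to df.columns in place:
# same enumerate idea, expressed as a rename mapping
df = df.rename(columns={x: f'{i}_{x}' for i, x in enumerate(df.columns, 1)})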
