how to filter out rows in pandas which are just numbers and not fully numeric? - python-3.x

I have a Pandas dataframe column which has data in rows such as below:
col1
abc
ab23
2345
fgh67#
8980
I need to create 2 more columns col 2 and col 3 as such below:
col2 col3
abc 2345
ab23 8980
fgh67#
I have used str.isnumeric(), but thats not helping me in a dataframe column. can someone kindly help?

Use str.isnumeric or to_numeric with check non NaNs for boolean mask and filter by boolean indexing:
m = df['col1'].str.isnumeric()
#alternative
#m = pd.to_numeric(df['col1'], errors='coerce').notnull()
df = pd.concat([df.loc[~m, 'col1'].reset_index(drop=True),
df.loc[m, 'col1'].reset_index(drop=True)], axis=1, keys=('col2','col3'))
print (df)
col2 col3
0 abc 2345
1 ab23 8980
2 fgh67# NaN
If want add new columns to existed DataFrame with align by indices:
df['col2'] = df.loc[~m, 'col1']
df['col3'] = df.loc[m, 'col1']
print (df)
col1 col2 col3
0 abc abc NaN
1 ab23 ab23 NaN
2 2345 NaN 2345
3 fgh67# fgh67# NaN
4 8980 NaN 8980
Or without align:
df['col2'] = df.loc[~m, 'col1'].reset_index(drop=True)
df['col3'] = df.loc[m, 'col1'].reset_index(drop=True)
print (df)
col1 col2 col3
0 abc abc 2345
1 ab23 ab23 8980
2 2345 fgh67# NaN
3 fgh67# NaN NaN
4 8980 NaN NaN

Related

calculate different between consecutive date records at an ID level

I have a dataframe as
col 1 col 2
A 2020-07-13
A 2020-07-15
A 2020-07-18
A 2020-07-19
B 2020-07-13
B 2020-07-19
C 2020-07-13
C 2020-07-18
I want it to become the following in a new dataframe
col_3 diff_btw_1st_2nd_date diff_btw_2nd_3rd_date diff_btw_3rd_4th_date
A 2 3 1
B 6 NaN NaN
C 5 NaN NaN
I tried getting the groupby at Col 1 level , but not getting the intended result. Can anyone help?
Use GroupBy.cumcount for counter pre column col 1 and reshape by DataFrame.set_index with Series.unstack, then use DataFrame.diff, remove first only NaNs columns by DataFrame.iloc, convert timedeltas to days by Series.dt.days per all columns and change columns names by DataFrame.add_prefix:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.set_index(['col 1',df.groupby('col 1').cumcount()])['col 2']
.unstack()
.diff(axis=1)
.iloc[:, 1:]
.apply(lambda x: x.dt.days)
.add_prefix('diff_')
.reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2 3.0 1.0
1 B 6 NaN NaN
2 C 5 NaN NaN
Or use DataFrameGroupBy.diff with counter for new columns by DataFrame.assign, reshape by DataFrame.pivot and remove NaNs by c2 with DataFrame.dropna:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.assign(g = df.groupby('col 1').cumcount(),
c1 = df.groupby('col 1')['col 2'].diff().dt.days)
.dropna(subset=['c1'])
.pivot('col 1','g','c1')
.add_prefix('diff_')
.rename_axis(None, axis=1)
.reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2.0 3.0 1.0
1 B 6.0 NaN NaN
2 C 5.0 NaN NaN
You can assign a cumcount number grouped by col 1, and pivot the table using that cumcount number.
Solution
df["col 2"] = pd.to_datetime(df["col 2"])
# 1. compute date difference in days using diff() and dt accessor
df["diff"] = df.groupby(["col 1"])["col 2"].diff().dt.days
# 2. assign cumcount for pivoting
df["cumcount"] = df.groupby("col 1").cumcount()
# 3. partial transpose, discarding the first difference in nan
df2 = df[["col 1", "diff", "cumcount"]]\
.pivot(index="col 1", columns="cumcount")\
.drop(columns=[("diff", 0)])
Result
# replace column names for readability
df2.columns = [f"d{i+2}-d{i+1}" for i in range(len(df2.columns))]
print(df2)
d2-d1 d3-d2 d4-d3
col 1
A 2.0 3.0 1.0
B 6.0 NaN NaN
C 5.0 NaN NaN
df after assing cumcount is like this
print(df)
col 1 col 2 diff cumcount
0 A 2020-07-13 NaN 0
1 A 2020-07-15 2.0 1
2 A 2020-07-18 3.0 2
3 A 2020-07-19 1.0 3
4 B 2020-07-13 NaN 0
5 B 2020-07-19 6.0 1
6 C 2020-07-13 NaN 0
7 C 2020-07-18 5.0 1

Summing up two columns of pandas dataframe ignoring NaN

I have a pandas dataframe as below:
import pandas as pd
df = pd.DataFrame({'ORDER':["A", "A"], 'col1':[np.nan, np.nan], 'col2':[np.nan, 5]})
df
ORDER col1 col2
0 A NaN NaN
1 A NaN 5.0
I want to create a column 'new' as sum(col1, col2) ignoring Nan only if one of the column as Nan,
If both of the columns have NaN value, it should return NaN as below
I tried the below code and it works fine. Is there any way to achieve the same with just one line of code.
df['new'] = df[['col1', 'col2']].sum(axis = 1)
df['new'] = np.where(pd.isnull(df['col1']) & pd.isnull(df['col2']), np.nan, df['new'])
df
ORDER col1 col2 new
0 A NaN NaN NaN
1 A NaN 5.0 5.0
Do sum with min_count
df['new'] = df[['col1','col2']].sum(axis=1,min_count=1)
Out[78]:
0 NaN
1 5.0
dtype: float64
Use the add function on the two columns, which takes a fill_value argument that lets you replace NaN:
df['col1'].add(df['col2'], fill_value=0)
0 NaN
1 5.0
dtype: float64
Is this ok?
df['new'] = df[['col1', 'col2']].sum(axis = 1).replace(0,np.nan)

Dropping columns with high missing values

I have a situation where I need to drop a lot of my dataframe columns where there are high missing values. I have created a new dataframe that gives me the missing values and the ratio of missing values from my original data set.
My original data set - data_merge2 looks like this :
A B C D
123 ABC X Y
123 ABC X Y
NaN ABC NaN NaN
123 ABC NaN NaN
245 ABC NaN NaN
345 ABC NaN NaN
The count data set looks like this that gives me the missing count and ratio:
missing_count missing_ratio
C 4 0.10
D 4 0.66
The code that I used to create the count dataset looks like :
#Only check those columns where there are missing values as we have got a lot of columns
new_df = (data_merge2.isna()
.sum()
.to_frame('missing_count')
.assign(missing_ratio = lambda x: x['missing_count']/len(data_merge2)*100)
.loc[data_merge2.isna().any()] )
print(new_df)
Now I want to drop the columns from the original dataframe whose missing ratio is >50%
How should I achieve this?
Use:
data_merge2.loc[:,data_merge2.count().div(len(data_merge2)).ge(0.5)]
#Alternative
#df[df.columns[df.count().mul(2).gt(len(df))]]
or DataFrame.drop using new_df DataFrame
data_merge2.drop(columns = new_df.index[new_df['missing_ratio'].gt(50)])
Output
A B
0 123.0 ABC
1 123.0 ABC
2 NaN ABC
3 123.0 ABC
4 245.0 ABC
5 345.0 ABC
Adding another way with query and XOR:
data_merge2[data_merge2.columns ^ new_df.query('missing_ratio>50').index]
Or pandas way using Index.difference
data_merge2[data_merge2.columns.difference(new_df.query('missing_ratio>50').index)]
A B
0 123.0 ABC
1 123.0 ABC
2 NaN ABC
3 123.0 ABC
4 245.0 ABC
5 345.0 ABC

concatenate column values in a loop

I have a csv file with two columns:
col1 col2
----- | -----
link1 unix=number1
link2 unix=number2
link3 unix=number3
What I need:
I need to concatenate each value in col1 with each value in col2 to have the following result:
ecol1 col2 col3
----- | ----- | ----
link1 unix=number1 link1unix=number1
link2 unix=number2 link1unix=number2
link3 unix=number2 link1unix=number3
link4 NAN link2unix=number1
link5 NAN link2unix=number2
link6 NAN link2unix=number3
etc ..
This is my code and it is not working:
i = 0
while True:
df= pd.read_csv('file.csv', skiprows=lambda x: x in range(0,i))
for i, row in df.iterrows():
row = df['col1'] + df['col2']
i+=1
Please help
Use:
import itertools
df['col3']=[''.join(i) for i in list(itertools.product(df['col1'],df['col2']))]
EDIT:
l= [''.join(i) for i in list(itertools.product(df1.col1,df1.col2))]
df=df.reindex(range(len(l)))
df['col3']=l
print(df)
col1 col2 col3
0 a x ax
1 b y ay
2 c z az
3 NaN NaN bx
4 NaN NaN by
5 NaN NaN bz
6 NaN NaN cx
7 NaN NaN cy
8 NaN NaN cz

Pandas Use Value if Not Null, Else Use Value From Next Column

Given the following dataframe:
import pandas as pd
df = pd.DataFrame({'COL1': ['A', np.nan,'A'],
'COL2' : [np.nan,'A','A']})
df
COL1 COL2
0 A NaN
1 NaN A
2 A A
I would like to create a column ('COL3') that uses the value from COL1 per row unless that value is null (or NaN). If the value is null (or NaN), I'd like for it to use the value from COL2.
The desired result is:
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
Thanks in advance!
In [8]: df
Out[8]:
COL1 COL2
0 A NaN
1 NaN B
2 A B
In [9]: df["COL3"] = df["COL1"].fillna(df["COL2"])
In [10]: df
Out[10]:
COL1 COL2 COL3
0 A NaN A
1 NaN B B
2 A B A
You can use np.where to conditionally set column values.
df = df.assign(COL3=np.where(df.COL1.isnull(), df.COL2, df.COL1))
>>> df
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
If you don't mind mutating the values in COL2, you can update them directly to get your desired result.
df = pd.DataFrame({'COL1': ['A', np.nan,'A'],
'COL2' : [np.nan,'B','B']})
>>> df
COL1 COL2
0 A NaN
1 NaN B
2 A B
df.COL2.update(df.COL1)
>>> df
COL1 COL2
0 A A
1 NaN B
2 A A
Using .combine_first, which gives precedence to non-null values in the Series or DataFrame calling it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'COL1': ['A', np.nan,'A'],
'COL2' : [np.nan,'B','B']})
df['COL3'] = df.COL1.combine_first(df.COL2)
Output:
COL1 COL2 COL3
0 A NaN A
1 NaN B B
2 A B A
If we mod your df slightly then you will see that this works and in fact will work for any number of columns so long as there is a single valid value:
In [5]:
df = pd.DataFrame({'COL1': ['B', np.nan,'B'],
'COL2' : [np.nan,'A','A']})
df
Out[5]:
COL1 COL2
0 B NaN
1 NaN A
2 B A
In [6]:
df.apply(lambda x: x[x.first_valid_index()], axis=1)
Out[6]:
0 B
1 A
2 B
dtype: object
first_valid_index will return the index value (in this case column) that contains the first non-NaN value:
In [7]:
df.apply(lambda x: x.first_valid_index(), axis=1)
Out[7]:
0 COL1
1 COL2
2 COL1
dtype: object
So we can use this to index into the series
You can also use mask which replaces the values where COL1 is NaN by column COL2:
In [8]: df.assign(COL3=df['COL1'].mask(df['COL1'].isna(), df['COL2']))
Out[8]:
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A

Resources