How to update NaN values in a dataframe with a dictionary - python-3.x

I'm trying to replace NaN values in one column when the adjacent column starts with a string.
starts_with = {'65S': 'test1', '95S': 'test2', '20S': 'test3'}
Here is the DF
15 65String NaN
16 95String NaN
17 20String NaN
I would like it to look like
15 65String test1
16 95String test2
17 20String test3
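For reproducibility, here is one way the sample frame might be constructed (the column names c1, c2, c3 are an assumption, chosen to match the first answer below):
import pandas as pd
import numpy as np

# Hypothetical reconstruction of the sample frame shown above.
df = pd.DataFrame({'c1': [15, 16, 17],
                   'c2': ['65String', '95String', '20String'],
                   'c3': [np.nan, np.nan, np.nan]})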

Use findall to locate the matching key from the dict, then map back to the dict's value:
df.c3=df.c2.str.findall('|'.join(starts_with.keys())).str[0].map(starts_with)
df
Out[110]:
c1 c2 c3
0 15 65String test1
1 16 95String test2
2 17 20String test3

Use Series.str.extract with ^ to match the start of the string, then Series.map with the dictionary:
starts_with ={'65S':'test1','95S':'test2','20S':"test3"}
pat = '|'.join(r"^{}".format(x) for x in starts_with)
df['C'] = df['B'].str.extract('(' + pat + ')', expand=False).map(starts_with)
print (df)
A B C
0 15 65String test1
1 16 95String test2
2 17 20String test3
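For completeness, an alternative sketch (not taken from the answers above) uses Series.str.startswith with numpy.select; the column names c2/c3 follow the first answer's frame and are otherwise an assumption:
import numpy as np

# One boolean mask per prefix; rows matching no prefix get None (missing).
conditions = [df['c2'].str.startswith(k, na=False) for k in starts_with]
df['c3'] = np.select(conditions, list(starts_with.values()), default=None)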

Related

Apply customized function for extracting numbers from string to multiple columns in Python

Given a dataset as follows:
id name value1 value2 value3
0 1 gz 1 6 st
1 2 sz 7-tower aka 278 3
2 3 wh NaN 67 s6.1
3 4 sh x3 45 34
I'd like to write a customized function to extract the numbers from the value columns.
Here is the pseudocode I have written:
def extract_nums(row):
    return row.str.extract('(\d*\.?\d+)', expand=True)
df[['value1', 'value2', 'value3']] = df[['value1', 'value2', 'value3']].apply(extract_nums)
It raises an error:
ValueError: If using all scalar values, you must pass an index
Code that manipulates the columns one by one works, but it is not concise:
df['value1'] = df['value1'].str.extract('(\d*\.?\d+)', expand=True)
df['value2'] = df['value2'].str.extract('(\d*\.?\d+)', expand=True)
df['value3'] = df['value3'].str.extract('(\d*\.?\d+)', expand=True)
How could I write the code correctly? Thanks.
You can filter the value-like columns, then stack them and use str.extract to pull out the numbers, followed by unstack to reshape:
c = df.filter(like='value').columns
df[c] = df[c].stack().str.extract('(\d*\.?\d+)', expand=False).unstack()
Alternatively you can try str.extract with apply:
c = df.filter(like='value').columns
df[c] = df[c].apply(lambda s: s.str.extract('(\d*\.?\d+)', expand=False))
Result:
id name value1 value2 value3
0 1 gz 1 6 NaN
1 2 sz 7 278 3
2 3 wh NaN 67 6.1
3 4 sh 3 45 34
The following code also works out:
cols = ['value1', 'value2', 'value3']
for col in cols:
    df[col] = df[col].str.extract('(\d*\.?\d+)', expand=True)
Alternative solution with function:
def extract_nums(row):
    return row.str.extract('(\d*\.?\d+)', expand=False)
df[cols] = df[cols].apply(extract_nums)
Out:
id name value1 value2 value3
0 1 gz 1 6 NaN
1 2 sz 7 278 3
2 3 wh NaN 67 6.1
3 4 sh 3 45 34
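One follow-up worth noting: str.extract returns strings, so if numeric values are needed afterwards, a conversion step can be added (a sketch, reusing the cols list from above):
# Convert the extracted strings to numbers; unmatched cells stay NaN.
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')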

pandas groupby and widen dataframe with ordered columns

I have a long-form dataframe that contains multiple samples and time points for each subject. The number of samples and time points can vary, and the days between time points can also vary:
test_df = pd.DataFrame({"subject_id":[1,1,1,2,2,3],
"sample":["A", "B", "C", "D", "E", "F"],
"timepoint":[19,11,8,6,2,12],
"time_order":[3,2,1,2,1,1]
})
subject_id sample timepoint time_order
0 1 A 19 3
1 1 B 11 2
2 1 C 8 1
3 2 D 6 2
4 2 E 2 1
5 3 F 12 1
I need to figure out a way to generalize grouping this dataframe by subject_id and putting all samples and time points on the same row, in time order.
DESIRED OUTPUT:
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8 B 11 A 19
1 2 E 2 D 6 null null
5 3 F 12 null null null null
Pivot gets me close, but I'm stuck on how to proceed from there:
test_df = test_df.pivot(index=['subject_id', 'sample'],
columns='time_order', values='timepoint')
Use DataFrame.set_index with DataFrame.unstack for pivoting, sort the MultiIndex in the columns, flatten it, and finally convert subject_id back to a column:
df = (test_df.set_index(['subject_id', 'time_order'])
.unstack()
.sort_index(level=[1,0], axis=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print (df)
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8.0 B 11.0 A 19.0
1 2 E 2.0 D 6.0 NaN NaN
2 3 F 12.0 NaN NaN NaN NaN
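A small optional cleanup: the unstack turns the integer timepoints into floats (8.0, 11.0, ...); if whole numbers are preferred while keeping the missing entries, a nullable integer cast can be applied (a sketch, assuming the column naming above):
# Cast the timepoint columns back to nullable integers.
timepoint_cols = [c for c in df.columns if c.startswith('timepoint')]
df[timepoint_cols] = df[timepoint_cols].astype('Int64')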
An alternative: take the last, second-to-last and third-to-last row per subject with groupby (this relies on the rows within each subject already being sorted in descending time_order) and concatenate the pieces side by side:
a=test_df.iloc[:,:3].groupby('subject_id').last().add_suffix('1')
b=test_df.iloc[:,:3].groupby('subject_id').nth(-2).add_suffix('2')
c=test_df.iloc[:,:3].groupby('subject_id').nth(-3).add_suffix('3')
pd.concat([a, b,c], axis=1)
sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
subject_id
1 C 8 B 11.0 A 19.0
2 E 2 D 6.0 NaN NaN
3 F 12 NaN NaN NaN NaN

Find Ranges of Null Values in a Column - Pandas

I'm trying to find the ranges of NaN values in a df like this:
[Column_1] [Column_2]
1 A 10
2 B 20
3 C NaN
4 D NaN
5 E NaN
6 F 60
7 G 65
8 H NaN
9 I NaN
10 J NaN
11 K 90
12 L NaN
13 M 100
So, for now what I just did was to list the index of the NaN values with this line:
df[df['Column_2'].isnull()].index.tolist()
But then, I don't know how to express the intervals of these values in terms of Column_1, which in this case would be:
[C-E] [H-J] [L]
Thanks for your insights!
Filter the rows where Column_2 is NaN, then group these rows by consecutive runs of NaN values and collect the corresponding Column_1 values inside a list comprehension:
m = df['Column_2'].isna()
r = [[*g['Column_1']] for _, g in df[m].groupby((~m).cumsum())]
print(r)
[['C', 'D', 'E'], ['H', 'I', 'J'], ['L']]
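If the compact range notation from the question ([C-E] [H-J] [L]) is preferred over lists, the grouped values can be joined into range strings (a sketch building on r above):
# First and last element of each run form the range label.
ranges = [f'{g[0]}-{g[-1]}' if len(g) > 1 else g[0] for g in r]
print(ranges)
['C-E', 'H-J', 'L']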

Pandas print missing value column names and count only

I am using the following code to print the missing value count and the column names.
#Looking for missing data and then handling it accordingly
def find_missing(data):
    # number of missing values
    count_missing = data.isnull().sum().values
    # total records
    total = data.shape[0]
    # percentage of missing
    ratio_missing = count_missing / total
    # return a dataframe to show: feature name, # of missing and % of missing
    return pd.DataFrame(data={'missing_count': count_missing, 'missing_ratio': ratio_missing},
                        index=data.columns.values)
find_missing(data_final).head(5)
What I want to do is to only print those columns where there is a missing value as I have a huge data set of about 150 columns.
The data set looks like this
A B C D
123 ABC X Y
123 ABC X Y
NaN ABC NaN NaN
123 ABC NaN NaN
245 ABC NaN NaN
345 ABC NaN NaN
In the output I would just want to see :
missing_count missing_ratio
C 4 0.66
D 4 0.66
and not the columns A and B as there are no missing values there
Use DataFrame.isna with DataFrame.sum to count missing values per column. We can also use DataFrame.isnull instead of DataFrame.isna.
new_df = (df.isna()
.sum()
.to_frame('missing_count')
.assign(missing_ratio = lambda x: x['missing_count']/len(df))
.loc[df.isna().any()] )
print(new_df)
We can also use pd.concat instead of DataFrame.assign:
count = df.isna().sum()
new_df = (pd.concat([count.rename('missing_count'),
count.div(len(df))
.rename('missing_ratio')],axis = 1)
.loc[count.ne(0)])
Output
missing_count missing_ratio
A 1 0.166667
C 4 0.666667
D 4 0.666667
IIUC, we can assign the missing count and the total count to two variables, do some basic math, and assign the result back to a df.
import numpy as np

a = df.isnull().sum(axis=0)
b = np.round(df.isnull().sum(axis=0) / df.fillna(0).count(axis=0), 2)
missing_df = pd.DataFrame({'missing_vals': a,
                           'missing_ratio': b})
print(missing_df)
missing_vals missing_ratio
A 1 0.17
B 0 0.00
C 4 0.67
D 4 0.67
You can filter out columns that don't have any missing vals:
missing_df = missing_df[missing_df.missing_vals.ne(0)]
print(missing_df)
missing_vals missing_ratio
A 1 0.17
C 4 0.67
D 4 0.67
You can also use concat:
s = df.isnull().sum()
result = pd.concat([s, s/len(df)], axis=1)
result.columns = ["missing_count","missing_ratio"]
print (result)
missing_count missing_ratio
A 1 0.166667
B 0 0.000000
C 4 0.666667
D 4 0.666667
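To tie this back to the question's helper, a hedged rewrite of find_missing that only returns columns with at least one missing value might look like this (the function name is reused from the question; the exact behaviour is an assumption):
def find_missing(data):
    # count of missing values per column
    count_missing = data.isnull().sum()
    out = pd.DataFrame({'missing_count': count_missing,
                        'missing_ratio': count_missing / len(data)})
    # keep only columns that actually have missing values
    return out[out['missing_count'] > 0]

find_missing(data_final)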

Python correlation matrix 3d dataframe

I have in SQL Server a historical return table by date and asset Id like this:
[Date] [Asset] [1DRet]
jan asset1 0.52
jan asset2 0.12
jan asset3 0.07
feb asset1 0.41
feb asset2 0.33
feb asset3 0.21
...
So I need to calculate the correlation matrix for a given date range for all asset combinations: A1,A2 ; A1,A3 ; A2,A3.
I'm using pandas, and in my SQL SELECT WHERE clause I'm filtering the date range and ordering by date.
I'm trying to do it using pandas df.corr(), numpy.corrcoef and SciPy, but I'm not able to do it for my n-variable dataframe.
I've seen some examples, but they're always for a dataframe with one asset per column and one row per day.
This my code block where I'm doing it:
qryRet = "Select * from IndexesValue where Date > '20100901' and Date < '20150901' order by Date"
result = conn.execute(qryRet)
df = pd.DataFrame(data=list(result),columns=result.keys())
df1d = df[['Date','Id_RiskFactor','1DReturn']]
corr = df1d.set_index(['Date','Id_RiskFactor']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
print(corr)
conn.close()
For this I'm receiving this message:
corr.columns = corr.columns.droplevel()
AttributeError: 'Index' object has no attribute 'droplevel'
print(df1d.head())
Date Id_RiskFactor 1DReturn
0 2010-09-02 149 0E-12
1 2010-09-02 150 -0.004242875148
2 2010-09-02 33 0.000590000011
3 2010-09-02 28 0.000099999997
4 2010-09-02 34 -0.000010000000
print(df.head())
Date Id_RiskFactor Value 1DReturn 5DReturn
0 2010-09-02 149 0.040096000000 0E-12 0E-12
1 2010-09-02 150 1.736700000000 -0.004242875148 -0.013014321215
2 2010-09-02 33 2.283000000000 0.000590000011 0.001260000048
3 2010-09-02 28 2.113000000000 0.000099999997 0.000469999999
4 2010-09-02 34 0.615000000000 -0.000010000000 0.000079999998
print(corr.columns)
Index([], dtype='object')
Create a sample DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'daily_return': np.random.random(15),
'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
'date': np.tile(pd.date_range('1-1-2015', periods=5), 3)})
>>> df
daily_return date symbol
0 0.011467 2015-01-01 A
1 0.613518 2015-01-02 A
2 0.334343 2015-01-03 A
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
5 0.431729 2015-01-01 B
6 0.474905 2015-01-02 B
7 0.372366 2015-01-03 B
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
10 0.946504 2015-01-01 C
11 0.337204 2015-01-02 C
12 0.798704 2015-01-03 C
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
I'll assume you've already filtered your DataFrame for the relevant dates. You then want a pivot table where you have unique dates as your index and your symbols as separate columns, with daily returns as the values. Finally, you call corr() on the result.
corr = df.set_index(['date','symbol']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
>>> corr
symbol_2 A B C
symbol_1
A 1.000000 0.188065 -0.745115
B 0.188065 1.000000 -0.688808
C -0.745115 -0.688808 1.000000
You can select the subset of your DataFrame based on dates as follows:
start_date = pd.Timestamp('2015-1-4')
end_date = pd.Timestamp('2015-1-5')
>>> df.loc[df.date.between(start_date, end_date), :]
daily_return date symbol
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
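Putting the date filter and the correlation step together, the chained version might look like this (a sketch, reusing the names defined above):
corr = (df.loc[df.date.between(start_date, end_date)]
          .set_index(['date', 'symbol'])
          .unstack()
          .corr())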
If you want to flatten your correlation matrix:
corr.stack().reset_index()
symbol_1 symbol_2 0
0 A A 1.000000
1 A B 0.188065
2 A C -0.745115
3 B A 0.188065
4 B B 1.000000
5 B C -0.688808
6 C A -0.745115
7 C B -0.688808
8 C C 1.000000
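Finally, if only the unique asset pairs from the question (A1,A2 ; A1,A3 ; A2,A3) are wanted, the self-correlations and mirrored pairs can be dropped from the flattened result (a sketch; the column name 'correlation' is an assumption):
pairs = corr.stack().reset_index(name='correlation')
# keep each unordered pair once and drop the diagonal
pairs = pairs[pairs['symbol_1'] < pairs['symbol_2']]
print(pairs)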
