Doubts about pandas filtering data row by row - python-3.x

How can I solve this issue in pandas? I have a dataframe like the following:
datetime64ns          type(int)  datetime64ns(analysis)
2019-02-02T10:02:05   4
2019-02-02T10:02:01   3
2019-02-02T10:02:02   4          2019-02-02T10:02:02
2019-02-02T10:02:04   3          2019-02-02T10:02:04
The goal is to do the following:
# pseudocode
for each row:
    if datetime(analysis) exists and type == 4:
        set new column type4 = 1
    elif datetime(analysis) exists and type == 2:
        set new column type2 = 1
The idea is to use these new columns afterwards for a groupby count. I'm sure this is possible because I managed to do it in the past, but I lost my .py file. Thanks for the attention.

Need this?
# Mask the type where the analysis datetime is missing, then one-hot
# encode the remaining values into indicator columns prefixed "type".
df = pd.concat([df, pd.get_dummies(df['type(int)'].mask(
    df['datetime64ns(analysis)'].isna()).astype('Int64')).add_prefix('type')], axis=1)
OUTPUT:
datetime64ns type(int) datetime64ns(analysis) type3 type4
0 2019-02-02T10:02:05 4 NaN 0 0
1 2019-02-02T10:02:01 3 NaN 0 0
2 2019-02-02T10:02:02 4 2019-02-02T10:02:02 0 1
3 2019-02-02T10:02:04 3 2019-02-02T10:02:04 1 0
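Since the stated goal is a groupby count, a minimal follow-up sketch (the dummy columns come from the answer above; parsing the first column with pd.to_datetime and grouping by day are assumptions, not part of the original answer):

import pandas as pd

# Total rows per type that had an analysis timestamp.
totals = df[['type3', 'type4']].sum()

# Or count per calendar day of the first column (assumed example).
df['datetime64ns'] = pd.to_datetime(df['datetime64ns'])
per_day = df.groupby(df['datetime64ns'].dt.date)[['type3', 'type4']].sum()
print(totals)
print(per_day)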

Related

How to unmerge cells and create a standard dataframe when reading excel file?

I would like to convert this dataframe into a standard one. So far, reading the Excel file the standard way gives me the following result:
df = pd.read_excel(folder + 'abcd.xlsx', sheet_name="Sheet1")
   Unnamed: 0   Unnamed: 1  T12006      T22006      T32006  \
0  Casablanca       Global     100   97.272520   93.464538
1         NaN  Résidentiel     100   95.883979   92.414063
2         NaN  Appartement     100   95.425152   91.674379
3         NaN       Maison     100  101.463607  104.039383
4         NaN        Villa     100  102.451320  101.996932
Thank you
You can try the .fillna() method with the parameter method='ffill'. According to the pandas documentation: "ffill: propagate last valid observation forward to next valid".
So, your code would be like:
df.fillna(method='ffill', inplace=True)
And rename the first two columns (assigning into df.columns.values in place is unreliable, so rename is safer):
df = df.rename(columns={'Unnamed: 0': 'City', 'Unnamed: 1': 'Type'})
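Putting the whole answer together, a minimal sketch (folder, the file name, and the sheet name are taken from the question; forward-filling only the merged first column is a design choice here, so NaNs elsewhere stay untouched):

import pandas as pd

# Read the sheet; merged cells come back as NaN below the first value.
df = pd.read_excel(folder + 'abcd.xlsx', sheet_name="Sheet1")

# Propagate each city label down through the rows it was merged over.
df['Unnamed: 0'] = df['Unnamed: 0'].ffill()

# Give the two unnamed columns meaningful names.
df = df.rename(columns={'Unnamed: 0': 'City', 'Unnamed: 1': 'Type'})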

Access substring in a dataframe column to create a new column

I have a dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(5, 1)), columns=list('A'))
df.insert(0, 'n', ['this-text in presence 20-30%, and another string',
                   'id XDTV/HGF, publication',
                   'this-text, 37$degree',
                   'this-text K0.5, coefficient 0.007',
                   ' '])
>>> df
n A
0 this-text in presence 20-30%, and another string 2
1 id XDTV/HGF, publication 1
2 this-text, 37$degree 4
3 coefficient 0.007,this-text K0.5 1
4 2
I'd like to create a new column
>>> df
new A
0 this-text 2
1 1
2 this-text 4
3 this-text 1
4 2
I could save the column n in a list and check if each item of the list contains the substring this-text. But I'd like to know if there are better ways of doing this.
Suggestions will be really helpful.
Try str.findall or str.extract:
df['new'] = df.n.str.findall('this-text').str[0]
# or: df['new'] = df.n.str.extract('(this-text)')[0]
df
Out[373]:
n A new
0 this-text in presence 20-30%, and another string 7 this-text
1 id XDTV/HGF, publication 4 NaN
2 this-text, 37$degree 6 this-text
3 this-text K0.5, coefficient 0.007 0 this-text
4 7 NaN
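A hedged alternative (not from the original answer): since the target substring is fixed, a vectorized contains test works too.

import numpy as np

# Emit the literal substring where 'n' contains it, NaN elsewhere.
df['new'] = np.where(df['n'].str.contains('this-text', regex=False),
                     'this-text', np.nan)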

Combining the respective columns from 2 separate DataFrames using pandas

I have 2 large DataFrames with the same set of columns but different values. I need to combine the values in the respective columns (A and B here; there may be more in the actual data) into single values in the same columns (see the required output below). I have a quick way of doing this using np.vectorize and df.to_numpy(), but I am looking for a way to implement it strictly with pandas. The criteria are, first, readability of the code, then time complexity.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 4, 3, 2, 1]})
print(df1)
A B
0 1 5
1 2 4
2 3 3
3 4 2
4 5 1
and,
df2 = pd.DataFrame({'A':[10,20,30,40,50], 'B':[50,40,30,20,10]})
print(df2)
A B
0 10 50
1 20 40
2 30 30
3 40 20
4 50 10
I have one way of doing it which is quite fast -
# This function might change into something more complex
def conc(a, b):
    return str(a) + '_' + str(b)

conc_v = np.vectorize(conc)
required = pd.DataFrame(conc_v(df1.to_numpy(), df2.to_numpy()),
                        columns=df1.columns)
print(required)
#Required Output
A B
0 1_10 5_50
1 2_20 4_40
2 3_30 3_30
3 4_40 2_20
4 5_50 1_10
Looking for an alternate way (strictly pandas) of solving this.
Criteria here is first readability of code
Another simple way is to use add and radd (with '_' to match the separator in the required output):
df1.astype(str).add(df2.astype(str).radd('_'))
      A     B
0  1_10  5_50
1  2_20  4_40
2  3_30  3_30
3  4_40  2_20
4  5_50  1_10
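An even shorter strictly-pandas form, assuming both frames share the same shape, index, and columns (this variant is an addition, not part of the original answer):

# Elementwise string concatenation; the scalar separator broadcasts.
required = df1.astype(str) + '_' + df2.astype(str)
print(required)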

Pandas data frame concat returns same data as first dataframe

I have this dataframe:
PNN_sh NN_shap PNN_corr NN_corr
1 25005 1 25005
2 25012 2 25001
3 25011 3 25009
4 25397 4 25445
5 25006 5 25205
Then I made 2 dataframes from this one:
NN_sh = data[['PNN_sh', 'NN_shap']]
NN_corr = data[['PNN_corr', 'NN_corr']]
Thereafter, I sorted them and saved the results in new dataframes:
NN_sh_sort = NN_sh.sort_values(by=['NN_shap'])
NN_corr_sort = NN_corr.sort_values(by=['NN_corr'])
Now I want to combine 2 columns from the 2 dataframes above:
all_pd = pd.concat([NN_sh_sort['PNN_sh'], NN_corr_sort['PNN_corr']], axis=1, join='inner')
But what I got is the first column duplicated into the second one:
PNN_sh PNN_corr
1 1
5 5
3 3
2 2
4 4
The second column should be
PNN_corr
2
1
3
5
4
Any idea how to fix it? Thanks in advance
Pass ignore_index=True to sort_values():
NN_sh_sort = NN_sh.sort_values(by=['NN_shap'], ignore_index=True)
NN_corr_sort = NN_corr.sort_values(by=['NN_corr'], ignore_index=True)
Then the result after concat will be:
PNN_sh PNN_corr
0 1 2
1 5 1
2 3 3
3 2 5
4 4 4
I think when you sort you are preserving the original indices of the example DataFrames. Therefore, concat is joining each PNN_corr value that was originally in the same row (at the same index). Try resetting the index of each DataFrame after sorting (drop=True discards the old index instead of keeping it as a column), then join/concat:
NN_sh_sort = NN_sh.sort_values(by=['NN_shap']).reset_index(drop=True)
NN_corr_sort = NN_corr.sort_values(by=['NN_corr']).reset_index(drop=True)
all_pd = pd.concat([NN_sh_sort['PNN_sh'], NN_corr_sort['PNN_corr']], axis=1, join='inner')
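A minimal sketch of why the original code misbehaves: concat with axis=1 aligns rows by index label, not by position, so sorted rows re-pair with their pre-sort partners unless the index is reset (the data below is made up purely to illustrate):

import pandas as pd

s1 = pd.Series([10, 30, 20], index=[0, 2, 1])  # index shuffled by a sort
s2 = pd.Series(['a', 'b', 'c'], index=[0, 1, 2])

# Rows pair by label: 30 (label 2) lines up with 'c', not with 'b'.
print(pd.concat([s1, s2], axis=1))

# Resetting the index makes the pairing positional instead.
print(pd.concat([s1.reset_index(drop=True), s2], axis=1))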

Python 3.x - Merge pandas data frames

I am using Python for the Titanic disaster competition on Kaggle. The dataset (df) contains 3 attributes for each passenger - 'Gender' (1/0), 'Age' and 'Pclass' (1/2/3). I want to obtain the median age for each Gender-Pclass combination.
The end result should be a dataframe like this -
Gender Class
1 1
0 2
1 3
0 1
1 2
0 3
Median age will be calculated later
I tried to create the data frame as follows -
unique_gender = pd.DataFrame(df.Gender.unique())
unique_class = pd.DataFrame(df.Class.unique())
reqd_df = pd.merge(unique_gender, unique_class, how = 'outer')
But the output obtained is -
0
0 3
1 1
2 2
3 0
Can someone please help me get the desired output?
You want df.groupby(['Gender', 'Pclass'])['Age'].median() (per JohnE).
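A minimal sketch of that answer under the question's column names (the sample values below are made up; only the groupby call comes from the answer):

import pandas as pd

# Hypothetical sample standing in for the Titanic data.
df = pd.DataFrame({
    'Gender': [1, 0, 1, 0, 1, 0],
    'Pclass': [1, 2, 3, 1, 2, 3],
    'Age':    [22, 38, 26, 35, 30, 27],
})

# Median age per Gender-Pclass combination; reset_index turns the
# grouped result back into a flat dataframe like the one requested.
medians = df.groupby(['Gender', 'Pclass'])['Age'].median().reset_index()
print(medians)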
