Python 3.x - Merge pandas data frames

I am using Python for Titanic disaster competition on Kaggle. The dataset (df) contains 3 attributes corresponding to each passenger - 'Gender'(1/0), 'Age' and 'Pclass'(1/2/3). I want to obtain median age corresponding to each Gender-Pclass combination.
The end result should be a dataframe as -
Gender Class
1 1
0 2
1 3
0 1
1 2
0 3
Median age will be calculated later
I tried to create the data frame as follows -
unique_gender = pd.DataFrame(df.Gender.unique())
unique_class = pd.DataFrame(df.Class.unique())
reqd_df = pd.merge(unique_gender, unique_class, how = 'outer')
But the output obtained is -
0
0 3
1 1
2 2
3 0
Can someone please help me get the desired output?

You want df.groupby(['Gender', 'Pclass'])['Age'].median() (per JohnE). Column names are case-sensitive, so use them exactly as they appear in your dataframe.
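A minimal sketch of that groupby, assuming the columns are named 'Gender', 'Pclass' and 'Age' as described in the question. The sample values and the 'MedianAge' column name are made up for illustration and are not taken from the Titanic dataset.

```python
import pandas as pd

# Hypothetical sample data; the real Kaggle frame has many more rows.
df = pd.DataFrame({
    "Gender": [1, 1, 0, 0, 1, 0],
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [30.0, 40.0, 20.0, 24.0, 50.0, 10.0],
})

# One row per Gender-Pclass combination that occurs in the data,
# with the median age of that group.
median_age = (
    df.groupby(["Gender", "Pclass"])["Age"]
    .median()
    .reset_index(name="MedianAge")
)
print(median_age)
```

Note that only combinations actually present in the data appear in the result, so there is no need to build the Gender/Class grid by hand first.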

Related

Get the difference between two dates when string value changes

I want to get the number of days between changes of string values (i.e., the symbol column) in one column, grouped by their respective id. I want a separate column for datediff like the one below.
id date symbol datediff
1 2022-08-26 a 0
1 2022-08-27 a 0
1 2022-08-28 a 0
2 2022-08-26 a 0
2 2022-08-27 a 0
2 2022-08-28 a 0
2 2022-08-29 b 3
3 2022-08-29 c 0
3 2022-08-30 b 1
For id = 1, datediff = 0 since symbol stayed as a. For id = 2, datediff = 3 since symbol changed from a to b after 3 days. Hence, what I'm looking for is code that computes the date difference at the point where an id changes its symbol.
I am currently using this code:
df['date'] = pd.to_datetime(df['date'])
diff = ['0 days 00:00:00']
for st, i in zip(df['symbol'], df.index):
    if i > 0:  # cannot evaluate previous from index 0
        if df['symbol'][i] != df['symbol'][i-1]:
            diff.append(df['date'][i] - df['date'][i-1])
        else:
            diff.append('0 days 00:00:00')
The output becomes:
id date symbol datediff
1 2022-08-26 a 0
1 2022-08-27 a 0
1 2022-08-28 a 0
2 2022-08-26 a 0
2 2022-08-27 a 0
2 2022-08-28 a 0
2 2022-08-29 b 1
3 2022-08-29 c 0
3 2022-08-30 b 1
It also computes the difference across two different ids, but I want the computation to be done separately for each id.
I only see questions about date differences when numeric values change, not when strings change. Thank you!
IIUC: my solution assumes that, within one id, the symbol changes at most once and the changed symbol comes last (as in the example given in the question).
First use df.groupby on id and symbol and get the minimum date for each combination. Then find the difference between those dates within each id. This gives the datediff. Finally, merge the result with the original dataframe.
df1 = df.groupby(['id', 'symbol'], sort=False).agg({'date': 'min'}).reset_index()
df1['datediff'] = abs(df1.groupby('id')['date'].diff().dt.days.fillna(0))
df1 = df1.drop(columns='date')
df_merge = pd.merge(df, df1, on=['id', 'symbol'])
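The steps above can be sketched end to end on the sample data from the question. This is only an illustration under the same assumption (at most one symbol change per id); the names first_seen and result are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "date": pd.to_datetime([
        "2022-08-26", "2022-08-27", "2022-08-28",
        "2022-08-26", "2022-08-27", "2022-08-28", "2022-08-29",
        "2022-08-29", "2022-08-30",
    ]),
    "symbol": ["a", "a", "a", "a", "a", "a", "b", "c", "b"],
})

# First date on which each (id, symbol) pair appears, in original order.
first_seen = df.groupby(["id", "symbol"], sort=False, as_index=False)["date"].min()

# Days between consecutive first appearances within the same id;
# the first symbol of each id has no predecessor, so it gets 0.
first_seen["datediff"] = (
    first_seen.groupby("id")["date"].diff().dt.days.fillna(0).astype(int)
)

# Attach datediff back to every original row.
result = df.merge(first_seen.drop(columns="date"), on=["id", "symbol"], how="left")
print(result)
```

Because the difference is taken inside groupby('id'), dates from different ids are never compared with each other.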

How to add text element to series data in Python

I have a pandas Series in Python defined as:
scores_data = (pd.Series([F1[0], auc, ACC[0], FPR[0], FNR[0], TPR[0], TNR[0]])).round(4)
I want to append the text 'Featues' at location 0 to the series data.
I tried scores_data.loc[0] but that replaced the data at location 0.
Thanks for your help.
You can't directly insert a value in a Series (like you could in a DataFrame with insert).
You can use concat:
s = pd.Series([1,2,3,4])
s2 = pd.concat([pd.Series([0], index=[-1]), s])
output:
-1 0
0 1
1 2
2 3
3 4
dtype: int64
Or create a new Series from the values:
pd.Series([0]+s.to_list())
output:
0 0
1 1
2 2
3 3
4 4
dtype: int64
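Applied to the case in the question, a hedged sketch might look like the following. The score values are placeholders, and the label text is an assumption standing in for the text you want to insert.

```python
import pandas as pd

# Placeholder scores standing in for F1, AUC, accuracy, ...
scores_data = pd.Series([0.9123, 0.8765, 0.9001]).round(4)

# Prepend a text element; ignore_index renumbers the result 0..n.
labelled = pd.concat([pd.Series(["Features"]), scores_data], ignore_index=True)
print(labelled)
```

Mixing text and numbers turns the Series into object dtype, so numeric methods such as round no longer apply to it directly.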

Doubts about filtering pandas data row by row

How can I solve this pandas issue? I have a dataframe with the following structure:
datetime64ns type(int) datetime64ns(analysis)
2019-02-02T10:02:05 4 NaN
2019-02-02T10:02:01 3 NaN
2019-02-02T10:02:02 4 2019-02-02T10:02:02
2019-02-02T10:02:04 3 2019-02-02T10:02:04
The goal is the following:
# pseudocode
for all the rows:
    if datetime(analysis) exists and type=4:
        insert in a new column type4=1
    elseif datetime(analysis) exists and type=2:
        insert in a new column type2=1
The idea is to then do a group-by count on these values. I'm sure it is possible because I managed to do it in the past, but I lost my .py file. Thanks for your attention.
Is this what you need?
df = pd.concat([df, pd.get_dummies(df['type(int)'].mask(
    df['datetime64ns(analysis)'].isna()).astype('Int64')).add_prefix('type')], axis=1)
OUTPUT:
datetime64ns type(int) datetime64ns(analysis) type3 type4
0 2019-02-02T10:02:05 4 NaN 0 0
1 2019-02-02T10:02:01 3 NaN 0 0
2 2019-02-02T10:02:02 4 2019-02-02T10:02:02 0 1
3 2019-02-02T10:02:04 3 2019-02-02T10:02:04 1 0
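For reference, here is a self-contained sketch of the same idea on the sample rows, followed by the count the question asks for. The variable names has_analysis, flags, out and counts are hypothetical, and dtype=int is passed only so the indicator columns show 0/1 regardless of pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "datetime64ns": pd.to_datetime([
        "2019-02-02T10:02:05", "2019-02-02T10:02:01",
        "2019-02-02T10:02:02", "2019-02-02T10:02:04",
    ]),
    "type(int)": [4, 3, 4, 3],
    "datetime64ns(analysis)": pd.to_datetime([
        None, None, "2019-02-02T10:02:02", "2019-02-02T10:02:04",
    ]),
})

# Keep the type only where an analysis timestamp exists; other rows become NA.
has_analysis = df["datetime64ns(analysis)"].notna()
typed = df["type(int)"].where(has_analysis).astype("Int64")

# One indicator column per remaining type; NA rows get 0 everywhere.
flags = pd.get_dummies(typed, dtype=int).add_prefix("type")
out = pd.concat([df, flags], axis=1)

# Number of analysed rows per type.
counts = flags.sum()
print(out)
print(counts)
```

A type that never occurs together with an analysis timestamp produces no column at all, so a type2 column would only appear if such rows exist.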

Calculation using shifting is not working in a for loop

The problem consists of calculating the column "accumulated" of a dataframe from the columns "accumulated" and "weekly". The formula is: accumulated in t = weekly in t + accumulated in t-1.
The desired result should be:
weekly accumulated
2 0
1 1
4 5
2 7
The result I'm obtaining is:
weekly accumulated
2 0
1 1
4 4
2 2
What I have tried is:
for key, value in df_dic.items():
    df_aux = df_dic[key]
    df_aux['accumulated'] = 0
    df_aux['accumulated'] = (df_aux.weekly + df_aux.accumulated.shift(1))
    #df_aux["accumulated"] = df_aux.iloc[:,2] + df_aux.iloc[:,3].shift(1)
    df_aux.iloc[0,3] = 0 #I put this because I want to force the first cell to be 0.
Here df_aux.iloc[0,3] is the first row of the column "accumulated".
What am I doing wrong?
Thank you
EDIT: df_dic is a dictionary with 5 dataframes. It looks like {0: df1, 1: df2, 2: df3}. All the dataframes have the same size and the same column names, so I use the for loop to do the same calculation on every dataframe inside the dictionary.
EDIT2: I tried doing the computation outside the for loop and it is still not working.
What I'm doing is:
df_auxp = df_dic[0]
df_auxp['accumulated'] = 0
df_auxp['accumulated'] = df_auxp["weekly"] + df_auxp["accumulated"].shift(1)
df_auxp.iloc[0,3] = df_auxp.iloc[0,3].fillna(0)
Maybe it has something to do with how I access the dictionary...
To solve for 3 dataframes
import pandas as pd
df1 = pd.DataFrame({'weekly':[2,1,4,2]})
df2 = pd.DataFrame({'weekly':[3,2,5,3]})
df3 = pd.DataFrame({'weekly':[4,3,6,4]})
print (df1)
print (df2)
print (df3)
for d in [df1,df2,df3]:
    d['accumulated'] = d['weekly'].cumsum() - d.iloc[0,0]
    print (d)
The output of this will be as follows:
Original dataframes:
df1
weekly
0 2
1 1
2 4
3 2
df2
weekly
0 3
1 2
2 5
3 3
df3
weekly
0 4
1 3
2 6
3 4
Updated dataframes:
df1:
weekly accumulated
0 2 0
1 1 1
2 4 5
3 2 7
df2:
weekly accumulated
0 3 0
1 2 2
2 5 7
3 3 10
df3:
weekly accumulated
0 4 0
1 3 3
2 6 9
3 4 13
To solve for 1 dataframe
You need to use cumsum and then subtract the value from the first row. That will give you the desired result. Here's how to do it.
import pandas as pd
df = pd.DataFrame({'weekly':[2,1,4,2]})
print (df)
df['accumulated'] = df['weekly'].cumsum() - df.iloc[0,0]
print (df)
Original dataframe:
weekly
0 2
1 1
2 4
3 2
Updated dataframe:
weekly accumulated
0 2 0
1 1 1
2 4 5
3 2 7
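The shift-based attempt in the question adds each weekly value to a column that is still all zeros, so it never accumulates; cumsum carries the running total forward instead. A sketch of the same fix applied to a dictionary of dataframes, as in the question, could look like this (the two sample frames are assumptions):

```python
import pandas as pd

# Hypothetical dictionary shaped like df_dic in the question.
df_dic = {
    0: pd.DataFrame({"weekly": [2, 1, 4, 2]}),
    1: pd.DataFrame({"weekly": [3, 2, 5, 3]}),
}

for key, frame in df_dic.items():
    # Running total of weekly, shifted so the first row starts at 0.
    frame["accumulated"] = frame["weekly"].cumsum() - frame["weekly"].iloc[0]

print(df_dic[0])
print(df_dic[1])
```

Selecting the first value by column name rather than by position keeps the code working when the dataframes contain other columns before "weekly".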

How to remove Initial rows in a dataframe in python

I have 4 dataframes with weekly sales values for a year for 4 products. Some of the initial rows are 0 because there were no sales yet. There are some other 0 values in between the weeks as well.
I want to remove those initial 0 values while keeping the in-between 0s.
For example
Week Sales(prod 1)
1 0
2 0
3 100
4 120
5 55
6 0
7 60
Week Sales(prod 2)
1 0
2 0
3 0
4 120
5 0
6 30
7 60
I want to remove rows 1 and 2 from the first table and rows 1, 2 and 3 from the second.
A few assumptions based on your example dataframe:
- The DataFrame is created using pandas.
- Week always starts at 1.
- Only the leading weeks that have 0 sales are removed.
Solution:
Python libraries Required
- pandas, more_itertools
Example DataFrame (df):
Week Sales
1 0
2 0
3 0
4 120
5 0
6 30
7 60
Python Code:
import pandas as pd
import more_itertools as mit
filter_col = 'Sales'
filter_val = 0
## function which returns the index labels to be removed
def return_initial_week_index_with_zero_sales(df, filter_col, filter_val):
    index_wzs = [False]
    # only act when the very first row has 0 sales
    if df[filter_col].iloc[0] == filter_val:
        index_list = df[df[filter_col] == filter_val].index.tolist()
        # the first run of consecutive weeks is the leading block of zeros
        index_wzs = [list(group) for group in mit.consecutive_groups(index_list)]
    return index_wzs[0]
## calling the above function and removing those weeks from the dataframe
df = df.set_index('Week')
weeks_to_be_removed = return_initial_week_index_with_zero_sales(df, filter_col, filter_val)
if weeks_to_be_removed:
    print('Initial weeks with 0 sales are {}'.format(weeks_to_be_removed))
    df = df.drop(index=weeks_to_be_removed)
else:
    print('No initial week has 0 sales')
df.reset_index(inplace=True)
Result (df):
Week Sales
4 120
5 0
6 30
7 60
I hope it helps, you can modify the function as per your requirement.
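As an alternative that needs only pandas, the leading zeros can also be dropped with a cumulative mask. This is a different technique from the function above, shown as a sketch on the same example data; the names keep and trimmed are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Week": [1, 2, 3, 4, 5, 6, 7],
    "Sales": [0, 0, 0, 120, 0, 30, 60],
})

# True from the first non-zero sale onwards, False for the leading zeros.
keep = df["Sales"].ne(0).cummax()

# Drop the leading zero rows; zeros after the first sale are kept.
trimmed = df[keep].reset_index(drop=True)
print(trimmed)
```

If a dataframe has no leading zeros, the mask is True everywhere and the data is returned unchanged.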
