I have a dataset given below:
weekid type amount
1 A 10
1 B 20
1 C 30
1 D 40
1 F 50
2 A 70
2 E 80
2 B 100
I am trying to convert it to another pandas DataFrame based on the unique type values, starting from:
import pandas as pd
import numpy as np
df=pd.read_csv(INPUT_FILE)
for t in df["type"].unique():
    # TODO: build one column per type
My aim is to get a data given below:
weekid type_A type_B type_C type_D type_E type_F
1 10 20 30 40 0 50
2 70 100 0 0 80 0
Is there a function that turns the unique values into columns and fills the missing values with 0 for each weekid group? How can this conversion be done efficiently?
You can use the following:
df = df.pivot(index='weekid', columns='type', values='amount').fillna(0)
Given your input this yields:
type       A      B     C     D     E     F
weekid
1       10.0   20.0  30.0  40.0   0.0  50.0
2       70.0  100.0   0.0   0.0  80.0   0.0
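If you also want the `type_A`-style column names from the question, `pivot_table` with `fill_value=0` does the filling in one step. A minimal self-contained sketch (`wide` is just an illustrative name):

```python
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({'weekid': [1, 1, 1, 1, 1, 2, 2, 2],
                   'type':   ['A', 'B', 'C', 'D', 'F', 'A', 'E', 'B'],
                   'amount': [10, 20, 30, 40, 50, 70, 80, 100]})

# pivot_table fills the holes directly via fill_value, so no separate
# fillna step is needed
wide = df.pivot_table(index='weekid', columns='type',
                      values='amount', fill_value=0)

# rename columns to the type_A, type_B, ... style from the question
wide = wide.add_prefix('type_').reset_index()
print(wide)
```

One difference worth knowing: `pivot` raises on duplicate (weekid, type) pairs, while `pivot_table` aggregates them (mean by default).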
I have a dataframe:
df:
batch  Code  time
a      100   2019-08-01 00:59:12.000
a      120   2019-08-01 00:59:32.000
a      130   2019-08-01 00:59:42.000
a      120   2019-08-01 00:59:52.000
b      100   2019-08-01 00:44:11.000
b      140   2019-08-02 00:14:11.000
b      150   2019-08-03 00:47:11.000
c      150   2019-09-01 00:44:11.000
d      100   2019-08-01 00:10:00.000
d      100   2019-08-01 00:10:05.000
d      130   2019-08-01 00:10:10.000
d      130   2019-08-01 00:10:20.000
I want to get, per batch, the number of seconds between the time of the first code 100 and the last code 130.
If a batch has no code 100 followed by a code 130 (one of them is missing), put NaN.
So the output should be:
df2:
batch  duration
a      30
b      NaN
c      NaN
d      20
What is the best way to do it?
Use:
# convert values to datetimes
df['time'] = pd.to_datetime(df['time'])

# get first Code 100 per batch
s1 = df[df['Code'].eq(100)].drop_duplicates('batch').set_index('batch')['time']
# get last Code 130 per batch
s2 = df[df['Code'].eq(130)].drop_duplicates('batch', keep='last').set_index('batch')['time']

# subtract and convert the timedeltas to seconds
df = (s2.sub(s1)
        .dt.total_seconds()
        .reindex(df['batch'].unique())
        .reset_index(name='duration'))
print(df)
batch duration
0 a 30.0
1 b NaN
2 c NaN
3 d 20.0
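For reference, the same steps run end to end on the question's data, rebuilt by hand (the column name `Code` is an assumption; adjust it to your real header):

```python
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({
    'batch': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'd', 'd', 'd', 'd'],
    'Code':  [100, 120, 130, 120, 100, 140, 150, 150, 100, 100, 130, 130],
    'time':  ['2019-08-01 00:59:12', '2019-08-01 00:59:32',
              '2019-08-01 00:59:42', '2019-08-01 00:59:52',
              '2019-08-01 00:44:11', '2019-08-02 00:14:11',
              '2019-08-03 00:47:11', '2019-09-01 00:44:11',
              '2019-08-01 00:10:00', '2019-08-01 00:10:05',
              '2019-08-01 00:10:10', '2019-08-01 00:10:20']})
df['time'] = pd.to_datetime(df['time'])

# first Code 100 per batch
s1 = df[df['Code'].eq(100)].drop_duplicates('batch').set_index('batch')['time']
# last Code 130 per batch
s2 = df[df['Code'].eq(130)].drop_duplicates('batch', keep='last').set_index('batch')['time']

# batches missing either timestamp come out as NaN after the subtraction
out = (s2.sub(s1)
       .dt.total_seconds()
       .reindex(pd.Index(df['batch'].unique(), name='batch'))
       .reset_index(name='duration'))
print(out)
```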
As an alternative, take the first code-100 and last code-130 timestamp per batch with groupby, then merge onto the full batch list:
batches = pd.DataFrame(df['batch'].unique(), columns=['batch'])
sub = df[df['Code'].isin([100, 130])]
first_100 = sub[sub['Code'].eq(100)].groupby('batch')['time'].first()
last_130 = sub[sub['Code'].eq(130)].groupby('batch')['time'].last()
final = (last_130.sub(first_100)
         .dt.total_seconds()
         .rename('duration')
         .reset_index()
         .merge(batches, on='batch', how='right'))
final
   batch  duration
0      a      30.0
1      b       NaN
2      c       NaN
3      d      20.0
I have a df as shown below
df:
ID Number_of_Cars Age_in_days Total_amount Total_N Type
1 2 100 10000 100 A
2 5 10 1000 2 B
3 1 1000 1000 200 B
4 1 20 0 0 C
5 3 1000 100000 20 A
6 6 100 10000 20 C
7 4 200 10000 200 A
From the above df I would like to prepare df1 as shown below:
df1:
ID Avg_Monthly_Amount Avg_Monthly_N Type
1 3000 30 A
2 3000 6 B
3 30 6 B
4 0 0 C
5 3000 0.6 A
6 3000 6 C
7 1500 30 A
Explanation:
Avg_Monthly_Amount = Total_amount / Age_in_days * 30 (average amount per month)
Avg_Monthly_N = Total_N / Age_in_days * 30 (average N per month)
To prepare df1, I tried the code below:
df['Avg_Monthly_Amount'] = df['Total_amount'] / df['Age_in_days'] * 30
df['Avg_Monthly_N'] = df['Total_N'] / df['Age_in_days'] * 30
From df and df1 (or from df alone) I would like to prepare the dataframe df2 below.
I could not write proper code to generate df2.
Explanation:
Aggregate the above numbers at the Type level
Example:
There are 3 customers (ID = 1, 5, 7) with Type = A, hence for Type = A, Number_Of_Type = 3
Avg_Cars for Type = A, is (2+3+4)/3 = 3
Avg_age_in_years for Type = A is ((100+1000+200)/3)/365
Avg_amount_monthly for Type = A is the mean of Avg_Monthly_Amount for Type = A in df1
Avg_N_monthly for Type = A is the mean of Avg_Monthly_N for Type = A in df1
Final expected output (df2)
Type Number_Of_Type Avg_Cars Avg_age_in_years Avg_amount_monthly Avg_N_monthly
A 3 3 1.19 2500 20.2
B 2 3 1.38 1515 6
C 2 3.5 0.16 1500 3
You don't need to prepare a separate df1 from your original dataframe df.
Your dataframe df:
ID Number_of_Cars Age_in_days Total_amount Total_N Type
1 2 100 10000 100 A
2 5 10 1000 2 B
3 1 1000 1000 200 B
4 1 20 0 0 C
5 3 1000 100000 20 A
6 6 100 10000 20 C
7 4 200 10000 200 A
After you have created/imported df:
df['Avg_Monthly_Amount'] = df['Total_amount'] / df['Age_in_days'] * 30
df['Avg_Monthly_N'] = df['Total_N'] / df['Age_in_days'] * 30
df['Age_in_year']=df['Age_in_days']/365
Then:
df2 = (df.groupby('Type')
         .agg({'Type': 'count', 'Number_of_Cars': 'mean', 'Age_in_year': 'mean',
               'Avg_Monthly_Amount': 'mean', 'Avg_Monthly_N': 'mean'})
         .rename(columns={'Type': 'Number_Of_Type'}))
Now if you print df2 (or simply evaluate df2 in a Jupyter notebook) you get your desired output.
Output:
Number_Of_Type Number_of_Cars Age_in_year Avg_Monthly_Amount Avg_Monthly_N
Type
A 3 3.0 1.187215 2500.0 20.2
B 2 3.0 1.383562 1515.0 6.0
C 2 3.5 0.164384 1500.0 3.0
I have a time series data given below:
date product price amount
11/01/2019 A 10 20
11/02/2019 A 10 20
11/03/2019 A 25 15
11/04/2019 C 40 50
11/05/2019 C 50 60
My data is actually high-dimensional; I have shown a simplified version with just two value columns, {price, amount}. I am trying to transform it into relative changes along the time index, as illustrated below:
date product price amount
11/01/2019 A NaN NaN
11/02/2019 A 0 0
11/03/2019 A 15 -5
11/04/2019 C NaN NaN
11/05/2019 C 10 10
I am trying to get the relative change of each product along its time index. If no previous date exists for a given product, I add NaN.
Is there a function to do this?
Group by product and use .diff()
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
Output:
date product price amount
0 2019-11-01 A NaN NaN
1 2019-11-02 A 0.0 0.0
2 2019-11-03 A 15.0 -5.0
3 2019-11-04 C NaN NaN
4 2019-11-05 C 10.0 10.0
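For reference, the one-liner run end to end on the question's data (dates rebuilt here as ISO strings):

```python
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({'date': pd.to_datetime(['2019-11-01', '2019-11-02', '2019-11-03',
                                           '2019-11-04', '2019-11-05']),
                   'product': ['A', 'A', 'A', 'C', 'C'],
                   'price': [10, 10, 25, 40, 50],
                   'amount': [20, 20, 15, 50, 60]})

# .diff() runs inside each product group; the first row of every group has
# no predecessor and becomes NaN automatically
df[['price', 'amount']] = df.groupby('product')[['price', 'amount']].diff()
print(df)
```

If you want the relative change as a ratio rather than a difference, `df.groupby('product')[['price', 'amount']].pct_change()` follows the same pattern.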
Can't be this hard. I have:
df=pd.DataFrame({'id':[1,2,3],'name':['j','l','m'], 'mnt':['f','p','p'],'nt':['b','w','e'],'cost':[20,30,80],'paid':[12,23,45]})
I need:
import numpy as np
df1 = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3], 'name': ['j', 'l', 'm', 'j', 'l', 'm'],
                    't': ['f', 'p', 'p', 'b', 'w', 'e'],
                    'paid': [12, 23, 45, np.nan, np.nan, np.nan],
                    'cost': [20, 30, 80, np.nan, np.nan, np.nan]})
I have 45 columns to unpivot like this.
I tried:
(df.set_index(['id', 'name'])
.rename_axis(['paid'], axis=1)
.stack().reset_index())
EDIT: I think the simplest approach here is to melt with DataFrame.melt and then set the missing values based on the variable column:
df2 = df.melt(['id', 'name','cost','paid'], value_name='t')
df2.loc[df2.pop('variable').eq('nt'), ['cost','paid']] = np.nan
print (df2)
id name cost paid t
0 1 j 20.0 12.0 f
1 2 l 30.0 23.0 p
2 3 m 80.0 45.0 p
3 1 j NaN NaN b
4 2 l NaN NaN w
5 3 m NaN NaN e
Use lreshape, which works with a dictionary of lists specifying which columns are 'grouped' together:
df2 = pd.lreshape(df, {'t':['mnt','nt'], 'mon':['cost','paid']})
print (df2)
id name t mon
0 1 j f 20
1 2 l p 30
2 3 m p 80
3 1 j b 12
4 2 l w 23
5 3 m e 45
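The same NaN-preserving reshape from the melt answer can also be written as a plain concat, which generalizes when many column groups have to be stacked. A minimal sketch (the column subsets are picked by hand):

```python
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({'id': [1, 2, 3], 'name': ['j', 'l', 'm'],
                   'mnt': ['f', 'p', 'p'], 'nt': ['b', 'w', 'e'],
                   'cost': [20, 30, 80], 'paid': [12, 23, 45]})

# one block per source column, renamed to the common target name 't';
# columns missing from a block (cost/paid for 'nt') become NaN on concat
part1 = df[['id', 'name', 'mnt', 'cost', 'paid']].rename(columns={'mnt': 't'})
part2 = df[['id', 'name', 'nt']].rename(columns={'nt': 't'})
df2 = pd.concat([part1, part2], ignore_index=True)
print(df2)
```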
I have a historical return table in SQL Server, keyed by date and asset ID, like this:
[Date] [Asset] [1DRet]
jan asset1 0.52
jan asset2 0.12
jan asset3 0.07
feb asset1 0.41
feb asset2 0.33
feb asset3 0.21
...
So I need to calculate the correlation matrix over a given date range for all asset combinations: A1,A2; A1,A3; A2,A3.
I'm using pandas, and in my SQL SELECT WHERE clause I'm filtering the date range and ordering by date.
I'm trying to do it with pandas df.corr(), numpy.corrcoef and SciPy, but I'm not able to make it work for my n-variable dataframe.
The examples I've found always assume a dataframe with one asset per column and one row per day.
This is the code block where I'm doing it:
qryRet = "Select * from IndexesValue where Date > '20100901' and Date < '20150901' order by Date"
result = conn.execute(qryRet)
df = pd.DataFrame(data=list(result),columns=result.keys())
df1d = df[['Date','Id_RiskFactor','1DReturn']]
corr = df1d.set_index(['Date','Id_RiskFactor']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
print(corr)
conn.close()
For this I'm receiving the following message:
corr.columns = corr.columns.droplevel()
AttributeError: 'Index' object has no attribute 'droplevel'
**print(df1d.head())**
Date Id_RiskFactor 1DReturn
0 2010-09-02 149 0E-12
1 2010-09-02 150 -0.004242875148
2 2010-09-02 33 0.000590000011
3 2010-09-02 28 0.000099999997
4 2010-09-02 34 -0.000010000000
**print(df.head())**
Date Id_RiskFactor Value 1DReturn 5DReturn
0 2010-09-02 149 0.040096000000 0E-12 0E-12
1 2010-09-02 150 1.736700000000 -0.004242875148 -0.013014321215
2 2010-09-02 33 2.283000000000 0.000590000011 0.001260000048
3 2010-09-02 28 2.113000000000 0.000099999997 0.000469999999
4 2010-09-02 34 0.615000000000 -0.000010000000 0.000079999998
**print(corr.columns)**
Index([], dtype='object')
Create a sample DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'daily_return': np.random.random(15),
                   'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
                   'date': np.tile(pd.date_range('1-1-2015', periods=5), 3)})
>>> df
daily_return date symbol
0 0.011467 2015-01-01 A
1 0.613518 2015-01-02 A
2 0.334343 2015-01-03 A
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
5 0.431729 2015-01-01 B
6 0.474905 2015-01-02 B
7 0.372366 2015-01-03 B
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
10 0.946504 2015-01-01 C
11 0.337204 2015-01-02 C
12 0.798704 2015-01-03 C
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
I'll assume you've already filtered your DataFrame for the relevant dates. You then want a pivot table where you have unique dates as your index and your symbols as separate columns, with daily returns as the values. Finally, you call corr() on the result.
corr = df.set_index(['date','symbol']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
>>> corr
symbol_2 A B C
symbol_1
A 1.000000 0.188065 -0.745115
B 0.188065 1.000000 -0.688808
C -0.745115 -0.688808 1.000000
You can select the subset of your DataFrame based on dates as follows:
start_date = pd.Timestamp('2015-1-4')
end_date = pd.Timestamp('2015-1-5')
>>> df.loc[df.date.between(start_date, end_date), :]
daily_return date symbol
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
If you want to flatten your correlation matrix:
corr.stack().reset_index()
symbol_1 symbol_2 0
0 A A 1.000000
1 A B 0.188065
2 A C -0.745115
3 B A 0.188065
4 B B 1.000000
5 B C -0.688808
6 C A -0.745115
7 C B -0.688808
8 C C 1.000000
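To give the flattened value column a proper name instead of 0, you can rename the stacked Series before resetting the index. A self-contained run of the same pipeline (the fixed seed is only for reproducibility):

```python
import pandas as pd
import numpy as np

# reproducible sample in the same shape as above
rng = np.random.default_rng(0)
df = pd.DataFrame({'daily_return': rng.random(15),
                   'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
                   'date': np.tile(pd.date_range('2015-01-01', periods=5), 3)})

# wide matrix: one column per symbol, then pairwise correlations
corr = df.set_index(['date', 'symbol']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'

# name the value column while flattening
flat = (corr.stack()
        .rename('corr')
        .reset_index())
print(flat)
```

On the original error: `print(corr.columns)` showing `Index([], dtype='object')` means corr() produced an empty frame, most likely because the SQL driver returned the returns as Decimal objects (note values like `0E-12`), which corr() silently drops as non-numeric. Casting first with `pd.to_numeric` should fix it.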