I have a DataFrame and want to create new columns based on the values of an existing column. In each new column, the value should be the number of times a Plate is repeated over time.
This is my DataFrame:
Val_Tra.head():
Plate EURO
Timestamp
2013-11-01 00:00:00 NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 0.0
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad 6.0
The EURO column looks like this:
Veh_Tra.EURO.value_counts():
5 1590144
6 745865
4 625512
0 440834
3 243800
2 40664
7 14207
1 4301
This is my desired output:
Plate EURO_1 EURO_2 EURO_3 EURO_4 EURO_5 EURO_6 EURO_7
Timestamp
2013-11-01 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 1.0 NaN NaN NaN NaN NaN NaN
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad NaN NaN 1.0 NaN NaN NaN NaN
Basically, what I want is the number of times each Plate value repeats for a specific EURO type over a specific time period.
Any suggestions would be much appreciated. Thank you.
This looks more like a get_dummies problem:
s=df.dropna().EURO.astype(int).astype(str).str.get_dummies().add_prefix('EURO')
df=pd.concat([df,s],axis=1,sort=True)
df
Out[259]:
Plate EURO EURO0 EURO6
2013-11-0100:00:00 NaN NaN NaN NaN
2013-11-0101:00:00 dcc2f657e897ffef752003469c688381 0.0 1.0 0.0
2013-11-0102:00:00 a5ac0c2f48ea80707621e530780139ad 6.0 0.0 1.0
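As a minimal sketch of the same idea, the snippet below builds a small frame shaped like the one in the question (the sample values are copied from it, the variable names are made up here) and uses pd.get_dummies with a prefix, so the new columns come out as EURO_0, EURO_6 and so on. Rows where EURO is missing are dropped before encoding and come back as NaN after the join.

```python
import numpy as np
import pandas as pd

# Small sample shaped like the frame in the question.
df = pd.DataFrame(
    {
        "Plate": [
            np.nan,
            "dcc2f657e897ffef752003469c688381",
            "a5ac0c2f48ea80707621e530780139ad",
        ],
        "EURO": [np.nan, 0.0, 6.0],
    },
    index=pd.to_datetime(
        ["2013-11-01 00:00", "2013-11-01 01:00", "2013-11-01 02:00"]
    ),
)

# One indicator column per EURO class; rows with a missing EURO are dropped first.
dummies = pd.get_dummies(df["EURO"].dropna().astype(int), prefix="EURO", dtype=float)

# Joining on the index restores the dropped rows, filled with NaN.
result = df.join(dummies)
print(result)
```

Summing these indicator columns per Plate and time window (for example with groupby) would then presumably give the repetition counts the question asks for.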
Given a dataframe df as follows:
date value 20211003 20211010 20211017
0 2021-9-19 3613.9663 NaN NaN NaN
1 2021-9-26 3613.0673 NaN NaN NaN
2 2021-10-3 3568.1668 NaN NaN NaN
3 2021-10-10 3592.1666 3510.221000 NaN NaN
4 2021-10-17 3572.3662 3465.737012 3534.220800 NaN
5 2021-10-24 3582.6036 3479.107035 3539.856801 3514.420400
6 2021-10-31 3547.3361 3421.161235 3481.911001 3456.474600
7 2021-11-7 3491.5677 3370.140147 3439.284539 3416.621024
8 2021-11-14 3539.1002 3319.289523 3391.930037 3370.079953
9 2021-11-21 3560.3734 3261.343723 3333.984237 3312.134153
10 2021-11-28 3564.0894 3255.328902 3338.967086 3305.054247
11 2021-12-5 3607.4320 3313.274702 3396.912886 3363.000047
12 2021-12-12 3666.3479 3371.220502 3450.172564 3412.234440
13 2021-12-19 3632.3638 NaN 3466.930383 3428.683490
14 2021-12-26 3618.0535 NaN NaN 3370.737690
Suppose the columns after the value column (20211003, 20211010 and 20211017) are rolling forecasts of value. Instead of 10 values per column, I need to keep only 3. The slicing rule is: going from left to right, keep the last 3 values (counting from the bottom) of each date column. So the row 2021-11-28 in column 20211003 is the starting point, and the starting row moves down by one for each subsequent column. The expected result looks like this:
date value 20211003 20211010 20211017
0 2021-9-19 3613.9663 NaN NaN NaN
1 2021-9-26 3613.0673 NaN NaN NaN
2 2021-10-3 3568.1668 NaN NaN NaN
3 2021-10-10 3592.1666 NaN NaN NaN
4 2021-10-17 3572.3662 NaN NaN NaN
5 2021-10-24 3582.6036 NaN NaN NaN
6 2021-10-31 3547.3361 NaN NaN NaN
7 2021-11-7 3491.5677 NaN NaN NaN
8 2021-11-14 3539.1002 NaN NaN NaN
9 2021-11-21 3560.3734 NaN NaN NaN
10 2021-11-28 3564.0894 3255.328902 NaN NaN
11 2021-12-5 3607.4320 3313.274702 3396.912886 NaN
12 2021-12-12 3666.3479 3371.220502 3450.172564 3412.23444
13 2021-12-19 3632.3638 NaN 3466.930383 3428.68349
14 2021-12-26 3618.0535 NaN NaN 3370.73769
How could I achieve that in Pandas? Thanks.
Reference:
Iterate over multiple columns and replace the values in these columns after a row (increment) with null values
df.iloc[:, :2].join(df.iloc[:, 2:].apply(lambda x:x.dropna().tail(3)))
date value 20211003 20211010 20211017
0 2021-9-19 3613.9663 NaN NaN NaN
1 2021-9-26 3613.0673 NaN NaN NaN
2 2021-10-3 3568.1668 NaN NaN NaN
3 2021-10-10 3592.1666 NaN NaN NaN
4 2021-10-17 3572.3662 NaN NaN NaN
5 2021-10-24 3582.6036 NaN NaN NaN
6 2021-10-31 3547.3361 NaN NaN NaN
7 2021-11-7 3491.5677 NaN NaN NaN
8 2021-11-14 3539.1002 NaN NaN NaN
9 2021-11-21 3560.3734 NaN NaN NaN
10 2021-11-28 3564.0894 3255.328902 NaN NaN
11 2021-12-5 3607.4320 3313.274702 3396.912886 NaN
12 2021-12-12 3666.3479 3371.220502 3450.172564 3412.23444
13 2021-12-19 3632.3638 NaN 3466.930383 3428.68349
14 2021-12-26 3618.0535 NaN NaN 3370.73769
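The one-liner above can be unpacked as follows. This is a sketch on a made-up six-row frame (the column names f1 and f2 and all values are hypothetical): for each forecast column, dropna().tail(3) keeps the last three non-null values, and because apply realigns the results on the original index, every other position becomes NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the frame in the question.
df = pd.DataFrame(
    {
        "date": ["d0", "d1", "d2", "d3", "d4", "d5"],
        "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "f1": [np.nan, 10.0, 11.0, 12.0, 13.0, np.nan],
        "f2": [np.nan, np.nan, 20.0, 21.0, 22.0, 23.0],
    }
)

forecast_cols = df.columns[2:]

# Keep only the last 3 non-null values of each forecast column.
trimmed = df[forecast_cols].apply(lambda col: col.dropna().tail(3))

# Join back onto the untouched leading columns; unmatched rows become NaN.
result = df[["date", "value"]].join(trimmed)
print(result)
```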
I am trying to subtract or compare only the time component of two datetime64 columns, but have been unsuccessful. I have tried using strftime with an exception block to catch NaTs, but no luck. Any help is much appreciated. I have attached the Python code below.
Column A Column B
1/1/1900 10:00 NaT
1/1/1900 10:30 NaT
1/1/1900 11:00 NaT
1/1/1900 9:00 2/6/2021 23:59
1/1/1900 11:00 2/6/2021 8:59
1/1/1900 9:30 2/6/2021 16:00
def convert(x):
    try:
        return x.strftime("%H:%M:%S")
    except ValueError:
        return x
df['B'].apply(convert)-df['A'].apply(convert)
I get the error TypeError: unsupported operand type(s) for -: 'NaTType' and 'str'
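Convert both columns to pandas datetimes using pd.to_datetime. Subtracting them then gives a timedelta Series, whose parts you can inspect with Series.dt.components. The days part reflects the difference in dates, while the hours and minutes reflect the difference in the times of day (modulo 24 hours):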
df['Column A'] = pd.to_datetime(df['Column A'])
df['Column B'] = pd.to_datetime(df['Column B'])
In [213]: (df['Column A'] - df['Column B']).dt.components
Out[213]:
days hours minutes seconds milliseconds microseconds nanoseconds
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 -44232.0 9.0 1.0 0.0 0.0 0.0 0.0
4 -44231.0 2.0 1.0 0.0 0.0 0.0 0.0
5 -44232.0 17.0 30.0 0.0 0.0 0.0 0.0
From the above, you can extract hours, minutes, etc. separately:
In [215]: (df['Column A'] - df['Column B']).dt.components.hours
Out[215]:
0 NaN
1 NaN
2 NaN
3 9.0
4 2.0
5 17.0
Name: hours, dtype: float64
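If only the time of day should be compared, one possible approach (an alternative sketch, not the method used above) is to subtract each value's own midnight via Series.dt.normalize, which leaves a timedelta holding just the time component. The sample rows are taken from the question; the explicit format string is an assumption about how the dates are written.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Column A": ["1/1/1900 10:00", "1/1/1900 9:00", "1/1/1900 11:00"],
        "Column B": [None, "2/6/2021 23:59", "2/6/2021 8:59"],
    }
)

# Assumed month/day/year layout; missing values become NaT.
a = pd.to_datetime(df["Column A"], format="%m/%d/%Y %H:%M")
b = pd.to_datetime(df["Column B"], format="%m/%d/%Y %H:%M")

# Subtracting midnight of the same day leaves only the time of day.
time_a = a - a.dt.normalize()
time_b = b - b.dt.normalize()

# Signed difference between the two times of day; NaT propagates.
diff = time_b - time_a
print(diff)
```

Unlike the components view, the result here is signed, so a Column B time earlier than the Column A time shows up as a negative timedelta.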
I'm working with Python 3 on Mac OS 10.11.06 (el capitan).
I have a .csv dataset consisting of about 3,700 time series sets (of unequal lengths). The data are currently formatted as follows:
Current Format
trade_date price_usd ticker
0 2016-01-01 434.33000 BTC
1 2016-01-02 433.44000 BTC
2 2016-01-03 430.01000 BTC
3 2016-01-04 433.09000 BTC
4 2016-01-05 431.96000 BTC
... ... ... ...
2347227 2020-10-19 74.13000 BRAIN
2347228 2020-10-20 71.97000 BRAIN
2347229 2020-10-21 76.64000 BRAIN
2347230 2020-10-22 80.90000 BRAIN
2347231 2020-10-19 0.15004 DAOFI
Ignoring the default numerical index for the moment, notice that the datetime column, trade_date, is such that the sequence of values repeats with each new ticker group. My goal is to transform the data such that each ticker name becomes a column header under which its corresponding daily prices are listed in correct order with the datetime value on which it was recorded (i.e. the datetime index does not repeat and the daily price values for the ticker symbols are the rows):
Target Format
trade_date ticker1 ticker2 ... tickerN
day1 t1p1 t2p1 ... tNp1
day2 t1p2 t2p2 ... etc...
.
.
.
dayK
Thus far I've tried various approaches, including experiments with various methods, e.g. stack()/unstack(), groupby(), etc., as well as custom functions that attempt to iterate through the values to assign them to a new DF in which I created a structured frame into which to drop the values, but to no avail (see failed attempt below).
New, empty target data frame with ticker symbol as col and trade_date range as index:
BTC ETH XRP MKR LTC USDT BCH XLM EOS BNB ... MTLX INDEX WOA HAUT THRM YFED NMT DOKI BRAIN DAOFI
2016-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-02 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-05 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Failed attempt to populate the above ...
for element in crypto_df['ticker']:
    if element == new_df.column and crypto['trade_date'] == new_df.index:
        df['ticker'] = element
new_df.head()
My ultimate goal is to produce a multi-series time series forecast using FBProphet because of its ability to handle multiple time series forecasts in a "single" model.
One last thought I've just had is that one could maybe create separate data frames for each ticker, then rejoin along the datetime index, creating the separate columns in the new DF along the way, but that seems a bit round-about (I've literally just done this for a couple thousand .csv files with equities data, for example)... But I'd still like to find a more direct solution, if there is one? Surely this scenario will arise again in the future!
Thanks for any thoughts ...
You can set_index and unstack:
print(df.set_index(["trade_date", "ticker"]).unstack("ticker"))
price_usd
ticker BRAIN BTC DAOFI
trade_date
2016-01-01 NaN 434.33 NaN
2016-01-02 NaN 433.44 NaN
2016-01-03 NaN 430.01 NaN
2016-01-04 NaN 433.09 NaN
2016-01-05 NaN 431.96 NaN
2020-10-19 74.13 NaN 0.15004
2020-10-20 71.97 NaN NaN
2020-10-21 76.64 NaN NaN
2020-10-22 80.90 NaN NaN
First use .groupby(), then use .unstack():
import pandas as pd
from io import StringIO
text = """
trade_date price_usd ticker
2016-01-01 434.33000 BTC
2016-01-02 433.44000 BTC
2016-01-02 430.01000 Google
2016-01-03 433.09000 BTC
2016-01-03 431.96000 Google
"""
df = pd.read_csv(StringIO(text), sep=r'\s+', header=0)  # raw string avoids an invalid-escape warning
df.groupby(['trade_date', 'ticker'])['price_usd'].mean().unstack()
Resulting dataframe:
ticker         BTC  Google
trade_date
2016-01-01  434.33     NaN
2016-01-02  433.44  430.01
2016-01-03  433.09  431.96
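When each (trade_date, ticker) pair occurs only once, DataFrame.pivot is another way to express the same long-to-wide reshape. The sketch below reuses the sample rows from this answer; note that pivot raises an error on duplicate pairs, in which case an aggregating approach such as the groupby above is needed.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "trade_date": ["2016-01-01", "2016-01-02", "2016-01-02", "2016-01-03", "2016-01-03"],
        "price_usd": [434.33, 433.44, 430.01, 433.09, 431.96],
        "ticker": ["BTC", "BTC", "Google", "BTC", "Google"],
    }
)
df["trade_date"] = pd.to_datetime(df["trade_date"])

# One row per date, one column per ticker; missing combinations become NaN.
wide = df.pivot(index="trade_date", columns="ticker", values="price_usd")
print(wide)
```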
My df looks like this,
param per per_date per_num
0 XYZ 1.0 2018-10-01 11.0
1 XYZ 2.0 2017-08-01 15.25
2 XYZ 1.0 2019-10-01 11.25
3 XYZ 2.0 2019-08-01 15.71
4 XYZ 3.0 2020-10-01 11.50
5 XYZ NaN NaN NaN
6 MMG 1.0 2021-10-01 11.75
7 MMG 2.0 2014-01-01 14.00
8 MMG 3.0 2021-10-01 12.50
9 MMG 1.0 2014-01-01 15.00
10 LKG NaN NaN NaN
11 LKG NaN NaN NaN
I need my output like this,
param per_1 per_date_1 per_num_1 per_2 per_date_2 per_num_2 per_3 per_date_3 per_num_3
0 XYZ 1 2018-10-01 11.0 2 2017-08-01 15.25 NaN NaN NaN
1 XYZ 1 2019-10-01 11.25 2 2019-08-01 15.71 3 2020-10-01 11.50
2 XYZ NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 MMG 1 2021-10-01 11.75 2 2014-01-01 14.00 3 2021-10-01 12.50
5 MMG 1 2014-01-01 15.00 NaN NaN NaN NaN NaN NaN
6 LKG NaN NaN NaN NaN NaN NaN NaN NaN NaN
As you can see, the per column has repeating values, and the transposed column names are created from these values. Also, a new record is created each time the per values start again at 1. How can I achieve this?
The main problem here is the NaNs, for example in the last LKG group. First replace the missing values with a per-group counter (a cumulative sum of the missing-value mask within each param) and assign the result to a new column per1:
s = df['per'].isna().groupby(df['param']).cumsum()
df = df.assign(per1=df['per'].fillna(s).astype(int))
print (df)
param per per_date per_num per1
0 XYZ 1.0 2018-10-01 11.00 1
1 XYZ 2.0 2017-08-01 15.25 2
2 XYZ 1.0 2019-10-01 11.25 1
3 XYZ 2.0 2019-08-01 15.71 2
4 XYZ 3.0 2020-10-01 11.50 3
5 XYZ NaN NaN NaN 1
6 MMG 1.0 2021-10-01 11.75 1
7 MMG 2.0 2014-01-01 14.00 2
8 MMG 3.0 2021-10-01 12.50 3
9 MMG 1.0 2014-01-01 15.00 1
10 LKG NaN NaN NaN 1
11 LKG NaN NaN NaN 2
Then build a group key by comparing per1 with 1 and taking the cumulative sum, add it to a MultiIndex together with param and per1, and reshape with unstack:
g = df['per1'].eq(1).cumsum()
df = df.set_index(['param', 'per1',g]).unstack(1).sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index(level=1, drop=True).reset_index()
print (df)
param per_1 per_date_1 per_num_1 per_2 per_date_2 per_num_2 per_3 \
0 LKG NaN NaN NaN NaN NaN NaN NaN
1 MMG 1.0 2021-10-01 11.75 2.0 2014-01-01 14.00 3.0
2 MMG 1.0 2014-01-01 15.00 NaN NaN NaN NaN
3 XYZ 1.0 2018-10-01 11.00 2.0 2017-08-01 15.25 NaN
4 XYZ 1.0 2019-10-01 11.25 2.0 2019-08-01 15.71 3.0
5 XYZ NaN NaN NaN NaN NaN NaN NaN
per_date_3 per_num_3
0 NaN NaN
1 2021-10-01 12.5
2 NaN NaN
3 NaN NaN
4 2020-10-01 11.5
5 NaN NaN
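To see how the two helper Series drive the reshape, here is a minimal sketch on a shortened, made-up version of the input (only the param and per columns; the names missing_counter and record_id are chosen here for illustration). Each time per1 equals 1 the cumulative sum increases, so every run 1, 2, 3 gets its own record number.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "param": ["XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "LKG", "LKG"],
        "per": [1.0, 2.0, 1.0, 2.0, 3.0, np.nan, np.nan, np.nan],
    }
)

# Running count of missing values within each param group.
missing_counter = df["per"].isna().groupby(df["param"]).cumsum()

# Fill the gaps with that counter so every row has an integer position.
per1 = df["per"].fillna(missing_counter).astype(int)

# A new output record starts wherever the position is 1.
record_id = per1.eq(1).cumsum()

print(pd.DataFrame({"param": df["param"], "per1": per1, "record_id": record_id}))
```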
I tried to add columns from df5 to df_prog, but for some reason they remain empty. I do not understand what I'm doing wrong. Code:
df5['Kol1_1Y']
223520 14.0
223521 65.0
223522 13.0
223523 39.0
223524 13.0
223525 3.0
223526 10.0
223527 19.0
223528 16.0
223529 29.0
Name: Kol1_1Y, dtype: float64
df_prog['Kol1_1Y'] = df5['Kol1_1Y']
df_prog['Kol2_1Y'] = df5['Kol2_1Y']
df_prog['Kol1_3M'] = df5['Kol1_3M']
df_prog['Kol2_3M'] = df5['Kol2_3M']
df_prog.to_excel(r"C:\python\progGB.xlsx")
df_prog
0 RESPR PREVPR Kol1_1Y Kol2_1Y Kol1_3M Kol2_3M
0 0.4944 0.4944 1.4894 NaN NaN NaN NaN
1 0.7073 0.7073 3.2020 NaN NaN NaN NaN
2 0.3965 0.3965 -0.3989 NaN NaN NaN NaN
3 0.4501 0.4501 -0.1826 NaN NaN NaN NaN
4 0.0271 0.0271 -6.1202 NaN NaN NaN NaN
5 0.2488 0.2488 -2.8447 NaN NaN NaN NaN
6 0.5190 0.5190 0.0176 NaN NaN NaN NaN
7 0.6667 0.6667 2.2334 NaN NaN NaN NaN
8 0.7708 0.7708 4.5216 NaN NaN NaN NaN
9 0.7074 0.7074 2.9906 NaN NaN NaN NaN
Pandas assignment with = aligns on the index. In your case the column names match, but the indexes differ (df_prog is indexed from 0, df5 from 223520), so every assigned value is NaN. To ignore the index, assign a NumPy ndarray instead (the lengths must match), for example:
df_prog['Kol1_1Y'] = df5['Kol1_1Y'].values
df_prog['Kol2_1Y'] = df5['Kol2_1Y'].values
df_prog['Kol1_3M'] = df5['Kol1_3M'].values
df_prog['Kol2_3M'] = df5['Kol2_3M'].values
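The effect is easy to reproduce on a small scale. The sketch below uses three-row stand-ins for df_prog and df5 (values copied from the question, helper column names made up) and compares label-aligned assignment with two positional alternatives: a NumPy array and a Series whose index has been reset.

```python
import pandas as pd

# Stand-ins for the frames in the question: the indexes share no labels.
df_prog = pd.DataFrame({"RESPR": [0.4944, 0.7073, 0.3965]})
df5 = pd.DataFrame({"Kol1_1Y": [14.0, 65.0, 13.0]}, index=[223520, 223521, 223522])

# Label-aligned assignment: nothing matches, so the column is all NaN.
df_prog["aligned"] = df5["Kol1_1Y"]

# Positional assignment through a NumPy array (lengths must match).
df_prog["by_position"] = df5["Kol1_1Y"].to_numpy()

# Alternative: give the Series a fresh 0..n-1 index so the labels line up.
df_prog["reset"] = df5["Kol1_1Y"].reset_index(drop=True)

print(df_prog)
```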