What is an efficient way of splitting a pandas DataFrame column in 2?

I have a pandas DataFrame that looks like this:
x1y1 x2y2
0 [694.0, 427.0] [1178.0, 601.0]
1 [621.0, 415.0] [736.0, 456.0]
2 [551.0, 404.0] [669.0, 461.0]
3 [514.0, 421.0] [569.0, 463.0]
4 [181.0, 406.0] [320.0, 462.0]
5 [738.0, 415.0] [873.0, 474.0]
6 [1158.0, 446.0] [1209.0, 513.0]
7 [613.0, 176.0] [692.0, 272.0]
8 [2.0, 295.0] [50.0, 368.0]
9 [817.0, 305.0] [870.0, 373.0]
10 [1130.0, 410.0] [1174.0, 500.0]
11 [1155.0, 420.0] [1199.0, 497.0]
12 [990.0, 417.0] [1053.0, 524.0]
13 [952.0, 409.0] [1003.0, 515.0]
14 [905.0, 412.0] [944.0, 503.0]
15 [34.0, 432.0] [84.0, 485.0]
16 [1091.0, 1.0] [1172.0, 78.0]
17 [859.0, 49.0] [975.0, 146.0]
18 [710.0, 76.0] [827.0, 145.0]
19 [68.0, 62.0] [181.0, 115.0]
20 [1076.0, 252.0] [1142.0, 297.0]
21 [1058.0, 298.0] [1103.0, 372.0]
22 [642.0, 336.0] [675.0, 366.0]
23 [777.0, 382.0] [800.0, 408.0]
24 [264.0, 241.0] [331.0, 292.0]
I want to split it efficiently into a DataFrame with x1, y1, x2, y2 as columns, without for loops or iterating over the rows. Is there some way to do so?

The idea is to use numpy.hstack to build a 2D array and pass it to the DataFrame constructor:
import numpy as np
import pandas as pd

df3 = pd.DataFrame(np.hstack((df['x1y1'].tolist(),
                              df['x2y2'].tolist())),
                   columns=['x1', 'y1', 'x2', 'y2'])
print(df3)
x1 y1 x2 y2
0 694.0 427.0 1178.0 601.0
1 621.0 415.0 736.0 456.0
2 551.0 404.0 669.0 461.0
3 514.0 421.0 569.0 463.0
4 181.0 406.0 320.0 462.0
5 738.0 415.0 873.0 474.0
6 1158.0 446.0 1209.0 513.0
7 613.0 176.0 692.0 272.0
8 2.0 295.0 50.0 368.0
9 817.0 305.0 870.0 373.0
10 1130.0 410.0 1174.0 500.0
11 1155.0 420.0 1199.0 497.0
12 990.0 417.0 1053.0 524.0
13 952.0 409.0 1003.0 515.0
14 905.0 412.0 944.0 503.0
15 34.0 432.0 84.0 485.0
16 1091.0 1.0 1172.0 78.0
17 859.0 49.0 975.0 146.0
18 710.0 76.0 827.0 145.0
19 68.0 62.0 181.0 115.0
20 1076.0 252.0 1142.0 297.0
21 1058.0 298.0 1103.0 372.0
22 642.0 336.0 675.0 366.0
23 777.0 382.0 800.0 408.0
24 264.0 241.0 331.0 292.0
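An alternative sketch under the same assumption (each cell holds a two-element list): expand each column separately with the DataFrame constructor and concat the results, which keeps the x/y pairing explicit:
df_x1y1 = pd.DataFrame(df['x1y1'].tolist(), columns=['x1', 'y1'], index=df.index)
df_x2y2 = pd.DataFrame(df['x2y2'].tolist(), columns=['x2', 'y2'], index=df.index)
df3 = pd.concat([df_x1y1, df_x2y2], axis=1)
Both versions are fully vectorized; neither iterates over rows in Python.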

Related

Convert monthly cumulative values to current month values in Pandas

For the following data df1 with missing January data, cumul_val1 and cumul_val2 are the monthly cumulative values of value1 and value2 respectively.
date cumul_val1 cumul_val2
0 2020-05-31 48702.97 45919.59
1 2020-06-30 69403.68 62780.21
2 2020-07-31 83631.36 75324.61
3 2020-08-31 98485.95 88454.14
4 2020-09-30 117072.67 103484.20
5 2020-10-31 133293.80 116555.76
6 2020-11-30 150834.45 129492.36
7 2020-12-31 176086.22 141442.95
8 2021-02-28 17363.14 13985.87
9 2021-03-31 36007.05 27575.82
10 2021-04-30 50305.00 40239.76
11 2021-05-31 66383.32 54318.08
12 2021-06-30 88635.35 72179.07
13 2021-07-31 101648.18 84895.41
14 2021-08-31 114192.81 98059.73
15 2021-09-30 130331.78 112568.07
16 2021-10-31 143040.71 124933.62
17 2021-11-30 158130.73 137313.96
18 2021-12-31 179433.41 147602.08
19 2022-02-28 15702.61 14499.38
20 2022-03-31 31045.96 27764.95
21 2022-04-30 39768.15 39154.31
22 2022-05-31 50738.38 52133.62
I now want to convert them into monthly values. For example, the value of value1 on 2021-04-30 is calculated as 50305.00 - 36007.05. Note that the January value is missing, so the current-month value for February is the cumulative value itself, and the current-month value for March is the March cumulative value minus the February cumulative value.
How can I achieve this?
The expected result:
date cumul_val1 cumul_val2 month_val1 month_val2
0 2020-05-31 48702.97 45919.59 NaN NaN
1 2020-06-30 69403.68 62780.21 20700.71 16860.62
2 2020-07-31 83631.36 75324.61 14227.68 12544.40
3 2020-08-31 98485.95 88454.14 14854.59 13129.53
4 2020-09-30 117072.67 103484.20 18586.72 15030.06
5 2020-10-31 133293.80 116555.76 16221.13 13071.56
6 2020-11-30 150834.45 129492.36 17540.65 12936.60
7 2020-12-31 176086.22 141442.95 25251.77 11950.59
8 2021-02-28 17363.14 13985.87 17363.14 13985.87
9 2021-03-31 36007.05 27575.82 18643.91 13589.95
10 2021-04-30 50305.00 40239.76 14297.96 12663.94
11 2021-05-31 66383.32 54318.08 16078.32 14078.32
12 2021-06-30 88635.35 72179.07 22252.03 17860.99
13 2021-07-31 101648.18 84895.41 13012.83 12716.34
14 2021-08-31 114192.81 98059.73 12544.63 13164.32
15 2021-09-30 130331.78 112568.07 16138.97 14508.34
16 2021-10-31 143040.71 124933.62 12708.94 12365.55
17 2021-11-30 158130.73 137313.96 15090.02 12380.34
18 2021-12-31 179433.41 147602.08 21302.68 10288.12
19 2022-02-28 15702.61 14499.38 15702.61 14499.38
20 2022-03-31 31045.96 27764.95 15343.35 13265.57
21 2022-04-30 39768.15 39154.31 8722.19 11389.36
22 2022-05-31 50738.38 52133.62 10970.22 12979.31
Note: to simplify the question, I added alternative sample data df2 without missing months:
date cumul_val monthly_val
0 2020-09-30 32144142.46 NaN
1 2020-10-31 36061223.45 3917080.99
2 2020-11-30 40354684.50 4293461.05
3 2020-12-31 44360036.58 4005352.08
4 2021-01-31 4130729.28 4130729.28
5 2021-02-28 7985781.64 3855052.36
6 2021-03-31 12306556.74 4320775.10
7 2021-04-30 16873032.10 4566475.36
8 2021-05-31 21730065.01 4857032.91
9 2021-06-30 26816787.85 5086722.84
10 2021-07-31 31785276.80 4968488.95
11 2021-08-31 37030178.38 5244901.58
12 2021-09-30 42879767.13 5849588.75
13 2021-10-31 48392250.79 5512483.66
14 2021-11-30 53655448.65 5263197.86
15 2021-12-31 59965790.04 6310341.39
16 2022-01-31 5226910.15 5226910.15
17 2022-02-28 9481147.06 4254236.91
18 2022-03-31 14205738.71 4724591.65
19 2022-04-30 19096746.32 4891007.61
20 2022-05-31 24033460.77 4936714.45
21 2022-06-30 28913566.31 4880105.54
22 2022-07-31 34099663.15 5186096.84
23 2022-08-31 39082926.81 4983263.66
24 2022-09-30 44406354.61 5323427.80
25 2022-10-31 48889431.89 4483077.28
26 2022-11-30 52956747.09 4067315.20
27 2022-12-31 57184652.60 4227905.51
Had there been no gap in the data, the problem would have been an easy .diff(). However, since there are gaps, we need to fill those gaps with 0, calculate the diff, then keep only the original months.
idx = pd.to_datetime(df["date"])
month_val = (
    df[["cumul_val1", "cumul_val2"]]
    # Fill the gap months with 0
    .set_index(idx)
    .reindex(pd.date_range(idx.min(), idx.max(), freq="M"), fill_value=0)
    # Take the diff
    .diff()
    # Keep only the original months
    .loc[idx]
    # Beat into shape for the subsequent concat
    .set_axis(["month_val1", "month_val2"], axis=1)
    .set_index(df.index)
)
result = pd.concat([df, month_val], axis=1)
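One compatibility note (an assumption about your pandas version, not part of the original answer): from pandas 2.2 the "M" frequency alias is deprecated in favor of "ME" for month-end, so the reindex line would become:
.reindex(pd.date_range(idx.min(), idx.max(), freq="ME"), fill_value=0)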
Edit: the OP clarified that for the first entry of the year, be it Jan or Feb, the monthly value is the same as the cumulative value. In that case, use this:
import numpy as np

cumul_cols = ["cumul_val1", "cumul_val2"]
monthly_cols = [f"month_val{i+1}" for i in range(len(cumul_cols))]
# Make sure `date` is of type Timestamp and the dataframe is sorted. Your data
# may satisfy both conditions already.
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date")
# True if the current row is in the same year as the previous row.
# Repeat the result for each cumul_val column.
is_same_year = np.tile(
    df["date"].dt.year.diff().eq(0).to_numpy()[:, None],
    (1, len(cumul_cols)),
)
month_val = np.where(
    is_same_year,
    df[cumul_cols].diff(),
    df[cumul_cols],
)
month_val[0, :] = np.nan
df[monthly_cols] = month_val
It would be much easier to use your date as a PeriodIndex with monthly frequency:
# set up the date as a monthly period Index
df2 = df.assign(date=pd.to_datetime(df['date']).dt.to_period('M')).set_index('date')
# subtract the previous month
df2.sub(df2.shift(freq='1M'), fill_value=0).reindex_like(df2)
Output:
cumul_val1 cumul_val2
date
2020-05 48702.97 45919.59
2020-06 20700.71 16860.62
2020-07 14227.68 12544.40
2020-08 14854.59 13129.53
2020-09 18586.72 15030.06
2020-10 16221.13 13071.56
2020-11 17540.65 12936.60
2020-12 25251.77 11950.59
2021-02 17363.14 13985.87
2021-03 18643.91 13589.95
2021-04 14297.95 12663.94
2021-05 16078.32 14078.32
2021-06 22252.03 17860.99
2021-07 13012.83 12716.34
2021-08 12544.63 13164.32
2021-09 16138.97 14508.34
2021-10 12708.93 12365.55
2021-11 15090.02 12380.34
2021-12 21302.68 10288.12
2022-02 15702.61 14499.38
2022-03 15343.35 13265.57
2022-04 8722.19 11389.36
2022-05 10970.23 12979.31
If you want to assign back to the original DataFrame:
df[['month_val1', 'month_val2']] = (
    df2.sub(df2.shift(freq='1M'), fill_value=0)
       .reindex_like(df2)
       .to_numpy()
)
Updated df:
date cumul_val1 cumul_val2 month_val1 month_val2
0 2020-05-31 48702.97 45919.59 48702.97 45919.59
1 2020-06-30 69403.68 62780.21 20700.71 16860.62
2 2020-07-31 83631.36 75324.61 14227.68 12544.40
3 2020-08-31 98485.95 88454.14 14854.59 13129.53
4 2020-09-30 117072.67 103484.20 18586.72 15030.06
5 2020-10-31 133293.80 116555.76 16221.13 13071.56
6 2020-11-30 150834.45 129492.36 17540.65 12936.60
7 2020-12-31 176086.22 141442.95 25251.77 11950.59
8 2021-02-28 17363.14 13985.87 17363.14 13985.87
9 2021-03-31 36007.05 27575.82 18643.91 13589.95
10 2021-04-30 50305.00 40239.76 14297.95 12663.94
11 2021-05-31 66383.32 54318.08 16078.32 14078.32
12 2021-06-30 88635.35 72179.07 22252.03 17860.99
13 2021-07-31 101648.18 84895.41 13012.83 12716.34
14 2021-08-31 114192.81 98059.73 12544.63 13164.32
15 2021-09-30 130331.78 112568.07 16138.97 14508.34
16 2021-10-31 143040.71 124933.62 12708.93 12365.55
17 2021-11-30 158130.73 137313.96 15090.02 12380.34
18 2021-12-31 179433.41 147602.08 21302.68 10288.12
19 2022-02-28 15702.61 14499.38 15702.61 14499.38
20 2022-03-31 31045.96 27764.95 15343.35 13265.57
21 2022-04-30 39768.15 39154.31 8722.19 11389.36
22 2022-05-31 50738.38 52133.62 10970.23 12979.31
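Note one difference from the expected result: with fill_value=0 the first row's monthly value equals its cumulative value instead of NaN. If NaN is preferred there, a small follow-up (a sketch, assuming the rows are sorted by date as shown):
df.loc[df.index[0], ['month_val1', 'month_val2']] = float('nan')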
Fortunately pandas offers a diff function for this:
df = pd.DataFrame(
    [['2020-05-31', 48702.97, 45919.59],
     ['2020-06-30', 69403.68, 62780.21],
     ['2020-07-31', 83631.36, 75324.61]],
    columns=['date', 'cumul_val1', 'cumul_val2'],
)
df['val1'] = df['cumul_val1'].diff()
df['val2'] = df['cumul_val2'].diff()
print(df)
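For reference, this prints (the values match the expected result above):
         date  cumul_val1  cumul_val2      val1      val2
0  2020-05-31    48702.97    45919.59       NaN       NaN
1  2020-06-30    69403.68    62780.21  20700.71  16860.62
2  2020-07-31    83631.36    75324.61  14227.68  12544.40
Note that a plain diff() does not reset at year boundaries, so it only covers the gap-free case; the answers above handle the missing January.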

How to groupby and sum on different columns in pandas

I have a dataframe df that I want to group by the columns Letter and From. Example df below:
Letter Price From RT To
0 A 4 2020-06-04 11 2020-06-05
1 B 12 2020-06-04 11 2020-06-05
2 A 20 2020-06-04 11 2020-06-05
3 A 5 2020-06-04 11 2020-06-05
4 B 89 2020-06-05 11 2020-06-06
5 A 56 2020-06-05 11 2020-06-06
6 B 1 2020-06-06 11 2020-06-07
In standard SQL I would write the following query to achieve the desired result:
SELECT
    Letter,
    From,
    SUM(Price),
    MAX(RT)
FROM
    some.table
GROUP BY
    Letter, From
I tried the following, but it didn't work:
df.groupby(['Letter','From']).sum(['RT','To'])
Try this:
df['From'] = pd.to_datetime(df['From'])
df['To'] = pd.to_datetime(df['To'])
df = df.groupby(by=['Letter', 'From'], as_index=False).agg({
    'Price': 'sum',
    'RT': 'max',
})
print(df)
Letter From Price RT
0 A 2020-06-04 29 11
1 A 2020-06-05 56 11
2 B 2020-06-04 12 11
3 B 2020-06-05 89 11
4 B 2020-06-06 1 11
The columns to aggregate can be selected right after the groupby, slicing just as you would slice a standard df. See the docs for the parameters you can pass to sum. But to aggregate each column with a different function you actually need aggregate (or its alias agg):
df.groupby(['Letter', 'From'])[['Price', 'RT']].agg({'Price': 'sum', 'RT': 'max'})
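A minimal alternative sketch using named aggregation (available since pandas 0.25), applied to the original df; it names the output columns in one step:
out = df.groupby(['Letter', 'From'], as_index=False).agg(
    Price=('Price', 'sum'),
    RT=('RT', 'max'),
)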

Year On Year Growth Using Pandas - Traverse N rows Back

I have a lot of parameters for which I have to calculate year-on-year growth.
Type 2006-Q1 2006-Q2 2006-Q3 2006-Q4 2007-Q1 2007-Q2 2007-Q3 2007-Q4 2008-Q1 2008-Q2 2008-Q3 2008-Q4
MonMkt_IntRt 3.44 3.60 3.99 4.40 4.61 4.73 5.11 4.97 4.92 4.89 5.29 4.51
RtlVol 97.08 97.94 98.25 99.15 99.63 100.29 100.71 101.18 102.04 101.56 101.05 99.49
IntRt 4.44 5.60 6.99 7.40 8.61 9.73 9.11 9.97 9.92 9.89 7.29 9.51
GMR 9.08 9.94 9.25 9.15 9.63 10.29 10.71 10.18 10.04 10.56 10.05 9.49
I need to calculate the growth, i.e. in column 2007-Q1 I need to find the growth from 2006-Q1. The formula is (2007-Q1 / 2006-Q1) - 1.
I have gone through the link below and tried to write the code:
Calculating year over year growth by group in Pandas
df = pd.read_csv('c:/Econometric/EconoModel.csv')
df.set_index('Type', inplace=True)
df.sort_index(axis=1, inplace=True)
df_t = df.T
df_output = (df_t / df_t.shift(4)) - 1
The output is as below
Type          2006-Q1  2006-Q2  2006-Q3  2006-Q4  2007-Q1  2007-Q2  2007-Q3  2007-Q4  2008-Q1  2008-Q2  2008-Q3  2008-Q4
MonMkt_IntRt      NaN      NaN      NaN      NaN   0.3398   0.3159   0.2806   0.1285   0.0661   0.0340   0.0363  -0.0912
RtlVol            NaN      NaN      NaN      NaN   0.0261   0.0240   0.0249   0.0204   0.0242   0.0126   0.0033  -0.0166
IntRt             NaN      NaN      NaN      NaN   0.6666   0.5375   0.3919   0.2310   0.1579   0.0195   0.0856  -0.2688
GMR               NaN      NaN      NaN      NaN   0.0077  -0.0310   0.1124   0.1704   0.0571  -0.0240  -0.0140  -0.0127
Use iloc to shift data slices; the .values on both sides makes the division positional instead of label-aligned. See an example on a test df:
df = pd.DataFrame({i: [0 + i, 1 + i, 2 + i] for i in range(12)})
print(df)
0 1 2 3 4 5 6 7 8 9 10 11
0 0 1 2 3 4 5 6 7 8 9 10 11
1 1 2 3 4 5 6 7 8 9 10 11 12
2 2 3 4 5 6 7 8 9 10 11 12 13
df.iloc[:, 3:12] = df.iloc[:, 3:12].values / df.iloc[:, 0:9].values - 1
print(df)
   0  1  2    3    4     5     6     7         8         9        10        11
0  0  1  2  inf  3.0  1.50  1.00  0.75  0.600000  0.500000  0.428571  0.375000
1  1  2  3  3.0  1.5  1.00  0.75  0.60  0.500000  0.428571  0.375000  0.333333
2  2  3  4  1.5  1.0  0.75  0.60  0.50  0.428571  0.375000  0.333333  0.300000
I could not find any issue with your code. I simply added axis=1 to the DataFrame.shift() method, since you are comparing across columns. I executed the following code and it gives the result you expected:
def getSampleDataframe():
    df_economy_model = pd.DataFrame(
        {
            'Type': ['MonMkt_IntRt', 'RtlVol', 'IntRt', 'GMR'],
            '2006-Q1': [3.44, 97.08, 4.44, 9.08],
            '2006-Q2': [3.6, 97.94, 5.6, 9.94],
            '2006-Q3': [3.99, 98.25, 6.99, 9.25],
            '2006-Q4': [4.4, 99.15, 7.4, 9.15],
            '2007-Q1': [4.61, 99.63, 8.61, 9.63],
            '2007-Q2': [4.73, 100.29, 9.73, 10.29],
            '2007-Q3': [5.11, 100.71, 9.11, 10.71],
            '2007-Q4': [4.97, 101.18, 9.97, 10.18],
            '2008-Q1': [4.92, 102.04, 9.92, 10.04],
            '2008-Q2': [4.89, 101.56, 9.89, 10.56],
            '2008-Q3': [5.29, 101.05, 7.29, 10.05],
            '2008-Q4': [4.51, 99.49, 9.51, 9.49],
        })  # Your data
    return df_economy_model

df_cd_americas = getSampleDataframe()
df_cd_americas.set_index('Type', inplace=True)
df_yearly_growth = (df_cd_americas / df_cd_americas.shift(4, axis=1)) - 1
print(df_cd_americas)
print(df_yearly_growth)
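As a side note (a sketch, not part of the original answers): DataFrame.pct_change does the shift-and-divide in one call, so the same year-over-year growth can be written as:
df_yearly_growth = df_cd_americas.pct_change(periods=4, axis=1)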

pandas summation per year

python3, pandas version 0.23.4
Let's say we have a pandas DataFrame as follows:
import numpy as np
import pandas as pd

np.random.seed(45)
# dd: a list of 20 dates, defined elsewhere in the original question
gf = pd.DataFrame({'A': np.random.randint(0, 10, 20)},
                  index=pd.to_datetime(dd).sort_values(ascending=False))
Now, I would like to total the data in column A with respect to each year. I could do:
gf_perYear = gf.groupby(by=gf.index.year)
gf_perYear.sum()
A
2012 11
2013 8
2014 15
2015 44
2016 13
2017 11
However, I am wondering if there is a way to get the result posted in a new column, right by the last day of each year, as shown below:
A sum_per_year
2017-12-15 3 11
2017-11-27 0
2017-07-24 5
2017-06-28 3
2016-11-07 4 13
2016-06-03 9
2015-12-18 8 44
2015-10-16 1
2015-09-18 5
2015-07-15 9
2015-04-09 6
2015-03-18 8
2015-02-18 7
2014-10-21 8 15
2014-09-16 5
2014-01-29 2
2013-01-04 8 8
2012-12-28 1 11
2012-08-21 6
2012-03-02 4
You can use transform:
gf_perYear = gf.groupby(by=gf.index.year)
gf['new'] = gf_perYear.transform('sum')
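If the total should appear only on each year's first row, as in the expected output, a hedged sketch building on the same groupby (the other rows become NaN rather than truly blank):
gf['sum_per_year'] = gf.groupby(gf.index.year)['A'].transform('sum')
# Blank out every row except the first occurrence of each year
gf.loc[gf.index.year.duplicated(), 'sum_per_year'] = np.nan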

How can I add an X axis showing plot data seconds to a matplotlib pyplot price volume graph?

The code below plots a price volume chart using data from a tab-separated csv file. Each row contains values for the columns IDX, TRD, TIMESTAMPMS, VOLUME and PRICE. As is, the X axis shows the IDX value. I would like the X axis to display seconds computed from the millisecond timestamp attached to each row. How can this be obtained?
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import pandas as pd
data = pd.read_csv('secondary-2018-08-12-21-32-56.csv', index_col=0, sep='\t')
print(data.head(50))
fig, ax = plt.subplots(nrows=2, sharex=True, figsize=(10,5))
ax[0].plot(data.index, data['PRICE'])
ax[1].bar(data.index, data['VOLUME'])
plt.show()
The drawn graph (image not included here) shows the price curve in the top panel and the volume bars in the bottom one. Here are the data as displayed by the print(data.head(50)) instruction:
TRD TIMESTAMPMS VOLUME PRICE
IDX
1 4 1534102380000 0.363583 6330.41
2 20 1534102381000 5.509219 6329.13
3 3 1534102382000 0.199049 6328.69
4 5 1534102383000 1.055055 6327.36
5 2 1534102384000 0.006343 6328.26
6 4 1534102385000 0.167502 6330.38
7 1 1534102386000 0.002039 6326.69
8 0 1534102387000 0.000000 6326.69
9 4 1534102388000 0.163813 6327.62
10 2 1534102389000 0.007060 6326.66
11 4 1534102390000 0.015489 6327.64
12 5 1534102391000 0.035618 6328.35
13 2 1534102392000 0.006003 6330.12
14 5 1534102393000 0.172913 6328.77
15 1 1534102394000 0.019972 6328.03
16 3 1534102395000 0.007429 6328.03
17 1 1534102396000 0.000181 6328.03
18 3 1534102397000 1.041483 6328.03
19 2 1534102398000 0.992897 6328.74
20 3 1534102399000 0.061871 6328.11
21 2 1534102400000 0.000123 6328.77
22 4 1534102401000 0.028650 6330.25
23 2 1534102402000 0.035504 6330.01
24 3 1534102403000 0.982527 6330.11
25 5 1534102404000 0.298366 6329.11
26 2 1534102405000 0.071119 6330.06
27 3 1534102406000 0.025547 6330.02
28 2 1534102407000 0.003413 6330.11
29 4 1534102408000 0.431217 6330.05
30 3 1534102409000 0.021627 6330.23
31 1 1534102410000 0.009661 6330.28
32 1 1534102411000 0.004209 6330.27
33 1 1534102412000 0.000603 6328.07
34 6 1534102413000 0.655872 6330.31
35 1 1534102414000 0.000452 6328.09
36 7 1534102415000 0.277340 6328.07
37 8 1534102416000 0.768351 6328.04
38 1 1534102417000 0.078893 6328.20
39 2 1534102418000 0.000446 6326.24
40 2 1534102419000 0.317381 6326.83
41 2 1534102420000 0.100009 6326.24
42 2 1534102421000 0.000298 6326.25
43 6 1534102422000 0.566820 6330.00
44 1 1534102423000 0.000060 6326.30
45 2 1534102424000 0.047524 6326.30
46 4 1534102425000 0.748773 6326.61
47 3 1534102426000 0.007656 6330.23
48 1 1534102427000 0.000019 6326.32
49 1 1534102428000 0.000014 6326.34
50 0 1534102429000 0.000000 6326.34
I believe you need data.set_index('TIMESTAMPMS') to get the axis to autoscale.
I don't know if I understood you correctly; try this:
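# Convert epoch milliseconds to epoch seconds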
data['TIMESTAMPMS'] = data['TIMESTAMPMS']/1000
ax[0].plot(data['TIMESTAMPMS'], data['PRICE'])
ax[1].bar(data['TIMESTAMPMS'], data['VOLUME'])
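If elapsed seconds from the first sample are wanted instead of epoch seconds (an assumption about the intent), a small variant:
# Assumes TIMESTAMPMS still holds the raw milliseconds
data['SECONDS'] = (data['TIMESTAMPMS'] - data['TIMESTAMPMS'].iloc[0]) / 1000
ax[0].plot(data['SECONDS'], data['PRICE'])
ax[1].bar(data['SECONDS'], data['VOLUME'])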
