I have data in the following format:
| | Measurement 1 | | Measurement 2 | |
|------|---------------|------|---------------|------|
| | Mean | Std | Mean | Std |
| Time | | | | |
| 0 | 17 | 1.10 | 21 | 1.33 |
| 1 | 16 | 1.08 | 21 | 1.34 |
| 2 | 14 | 0.87 | 21 | 1.35 |
| 3 | 11 | 0.86 | 21 | 1.33 |
I am using the following code to generate a matplotlib line graph from this data, which shows the standard deviation as a filled in area, see below:
def seconds_to_minutes(x, pos):
    minutes = f'{round(x/60, 0)}'
    return minutes
fig, ax = plt.subplots()
mean_temperature_over_time['Measurement 1']['mean'].plot(kind='line', yerr=mean_temperature_over_time['Measurement 1']['std'], alpha=0.15, ax=ax)
mean_temperature_over_time['Measurement 2']['mean'].plot(kind='line', yerr=mean_temperature_over_time['Measurement 2']['std'], alpha=0.15, ax=ax)
ax.set(title="A Line Graph with Shaded Error Regions", xlabel="x", ylabel="y")
formatter = FuncFormatter(seconds_to_minutes)
ax.xaxis.set_major_formatter(formatter)
ax.grid()
ax.legend(['Mean 1', 'Mean 2'])
Output:
This seems like a very messy solution, and only actually produces shaded output because I have so much data. What is the correct way to produce a line graph from the dataframe I have with shaded error regions? I've looked at Plot yerr/xerr as shaded region rather than error bars, but am unable to adapt it for my case.
What's wrong with the linked solution? It seems pretty straightforward.
Allow me to rearrange your dataset so it's easier to load into a pandas DataFrame:
Time Measurement Mean Std
0 0 1 17 1.10
1 1 1 16 1.08
2 2 1 14 0.87
3 3 1 11 0.86
4 0 2 21 1.33
5 1 2 21 1.34
6 2 2 21 1.35
7 3 2 21 1.33
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
for i, m in df.groupby("Measurement"):
    ax.plot(m.Time, m.Mean)
    ax.fill_between(m.Time, m.Mean - m.Std, m.Mean + m.Std, alpha=0.35)
And here's the result with some randomly generated data (the plot image is not reproduced here):
EDIT
Since the issue is apparently iterating over your particular dataframe format, let me show how you could do it (I'm new to pandas, so there may be better ways). If I understood your screenshot correctly, you should have something like:
Measurement 1 2
Mean Std Mean Std
Time
0 17 1.10 21 1.33
1 16 1.08 21 1.34
2 14 0.87 21 1.35
3 11 0.86 21 1.33
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 4 columns):
(1, Mean) 4 non-null int64
(1, Std) 4 non-null float64
(2, Mean) 4 non-null int64
(2, Std) 4 non-null float64
dtypes: float64(2), int64(2)
memory usage: 160.0 bytes
df.columns
MultiIndex(levels=[[1, 2], [u'Mean', u'Std']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
names=[u'Measurement', None])
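And you should be able to iterate over the top-level "Measurement" columns and obtain the same plot (Time is the index here, so m.index supplies the x values):
for name in df.columns.get_level_values("Measurement").unique():
    m = df[name]  # sub-frame holding the 'Mean' and 'Std' columns of this measurement
    ax.plot(m.index, m['Mean'])
    ax.fill_between(m.index,
                    m['Mean'] - m['Std'],
                    m['Mean'] + m['Std'], alpha=0.35)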
Or you could restack it to the format above with
(df.stack("Measurement") # stack "Measurement" columns row by row
.reset_index() # make "Time" a normal column, add a new index
.sort_values("Measurement") # group values from the same Measurement
.reset_index(drop=True)) # drop sorted index and make a new one
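Putting the restacking route together, here is a minimal self-contained sketch. The frame construction, the legend labels and the axis labels are my own assumptions for illustration, not something taken from the question. It sorts by both Measurement and Time so the x values stay ordered within each group, and it selects the Agg backend only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# Rebuild the sample frame: two-level columns (Measurement, statistic), Time as index.
columns = pd.MultiIndex.from_product([[1, 2], ["Mean", "Std"]],
                                     names=["Measurement", None])
df = pd.DataFrame(
    [[17, 1.10, 21, 1.33],
     [16, 1.08, 21, 1.34],
     [14, 0.87, 21, 1.35],
     [11, 0.86, 21, 1.33]],
    index=pd.Index([0, 1, 2, 3], name="Time"),
    columns=columns,
)

# Restack to the long format: one row per (Time, Measurement) pair.
long_df = (df.stack("Measurement")
             .reset_index()
             .sort_values(["Measurement", "Time"])
             .reset_index(drop=True))

fig, ax = plt.subplots()
for name, m in long_df.groupby("Measurement"):
    ax.plot(m["Time"], m["Mean"], label=f"Mean {name}")
    # Shade one standard deviation either side of the mean.
    ax.fill_between(m["Time"], m["Mean"] - m["Std"], m["Mean"] + m["Std"], alpha=0.35)
ax.set(title="A Line Graph with Shaded Error Regions", xlabel="Time", ylabel="y")
ax.legend()
```

Calling plt.show() or fig.savefig(...) afterwards displays or stores the figure.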
I am trying to use "groupby" to plot all features of a dataframe in separate subplots, one per serial number, while skipping any feature 'Ft' whose values are all zero for that serial number. For example, 'Ft1' should be skipped for S/N 'KLM10015' because all of its values are zero. The dataframe has 5514 rows and 565 columns, but the code should also work for dataframes of other sizes.
The x-axis of each subplot represents the "Date", the y-axis represents the feature values (Ft), and the title is the serial number (S/N).
This is an example of the dataframe which I have:
df =
S/N Ft1 Ft12 Ft17 ---- Ft1130 Ft1140 Ft1150
DATE
2021-01-06 KLM10015 0 12 14 ---- 17 52 47
2021-01-07 KLM10015 0 10 48 ---- 19 20 21
2021-01-11 KLM10015 0 0 45 ---- 0 19 0
2021-01-17 KLM10015 0 1 0 ---- 16 44 66
| | | | | | | |
| | | | | | | |
| | | | | | | |
2021-02-09 KLM10018 1 11 0 ---- 25 27 19
2021-12-13 KLM10018 12 0 19 ---- 78 77 18
2021-12-16 kLM10018 14 17 14 ---- 63 19 0
2021-07-09 KLM10018 18 0 77 ---- 65 34 98
2021-07-15 KLM10018 0 88 82 ---- 63 31 22
Code:
list_ID = ["Ft1", "Ft12", "Ft17", ..., "Ft1130", "Ft1140", "Ft1150"]  # "..." marks the elided feature names
def plot_fun(dataframe):
    for item in list_ID:
        fig = plt.figure(figsize=(35, 20))
        ax1 = fig.add_subplot(221)
        ax2 = fig.add_subplot(222)
        ax3 = fig.add_subplot(223)
        ax4 = fig.add_subplot(224)
        dataframe.groupby('S/N')[item].plot(legend='True', ax=ax1)
        dataframe.groupby('S/N')[item].plot(legend='True', ax=ax2)
        dataframe.groupby('S/N')[item].plot(legend='True', ax=ax3)
        dataframe.groupby('S/N')[item].plot(legend='True', ax=ax4)
        plt.show()
plot_fun(df)
I would really appreciate your help. Thanks a lot.
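One possible approach, offered as a sketch rather than a definitive answer: loop over the groups, give each serial number its own axes, and keep only the feature columns that contain at least one non-zero value within that group. The miniature dataframe below is a hypothetical stand-in for the real 5514 x 565 one, and the figure size and the `plotted` bookkeeping dict are my own choices. The Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical miniature of the dataframe in the question.
df = pd.DataFrame(
    {
        "S/N": ["KLM10015"] * 3 + ["KLM10018"] * 3,
        "Ft1": [0, 0, 0, 1, 12, 14],
        "Ft12": [12, 10, 0, 11, 0, 17],
        "Ft17": [14, 48, 45, 0, 19, 14],
    },
    index=pd.to_datetime(["2021-01-06", "2021-01-07", "2021-01-11",
                          "2021-02-09", "2021-12-13", "2021-12-16"]),
)
df.index.name = "DATE"

groups = list(df.groupby("S/N"))
fig, axes = plt.subplots(len(groups), 1, figsize=(10, 4 * len(groups)), squeeze=False)

plotted = {}  # serial number -> features that were actually drawn
for ax, (serial, g) in zip(axes.ravel(), groups):
    features = g.drop(columns="S/N")
    # Keep a feature only if it has at least one non-zero value for this serial number.
    nonzero = features.loc[:, (features != 0).any(axis=0)]
    nonzero.plot(ax=ax, legend=True)
    ax.set(title=serial, xlabel="Date", ylabel="Feature value")
    plotted[serial] = list(nonzero.columns)

fig.tight_layout()
print(plotted)
```

With several hundred features per serial number the legend becomes unreadable, so in practice it may be better to pass legend=False or to plot the features in batches.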
I need to create two columns from transaction-level data: one for this month's sales and one for the previous month's sales. How can I do this?
Data format:-
Date | bill amount
2019-07-22 | 500
2019-07-25 | 200
2020-11-15 | 100
2020-11-06 | 900
2020-12-09 | 50
2020-12-21 | 600
Required format:-
Year_month |This month Sales | Prev month sales
2019_07 | 700 | -
2020_11 | 1000 | -
2020_12 | 650 | 1000
The relatively tricky bit is figuring out what the previous month is. We do it by finding the beginning of the month for each date and then rolling back by one month. Note that this also takes care of the January -> December of the previous year case.
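As a quick illustration of that rollback, here is a small sketch using dateutil's relativedelta. The helper name prev_month_start is made up for this example.

```python
from datetime import datetime
from dateutil.relativedelta import relativedelta

def prev_month_start(d):
    """Return the first day of the month preceding the month of d."""
    month_start = datetime(d.year, d.month, 1)
    return month_start + relativedelta(months=-1)

print(prev_month_start(datetime(2020, 12, 21)))  # 2020-11-01 00:00:00
print(prev_month_start(datetime(2021, 1, 15)))   # 2020-12-01 00:00:00 (year rolls back)
```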
We start by creating a sample dataframe and importing some useful modules
import pandas as pd
from io import StringIO
from datetime import datetime
from dateutil.relativedelta import relativedelta
data = StringIO(
"""
date|amount
2019-07-22|500
2019-07-25|200
2020-11-15|100
2020-11-06|900
2020-12-09|50
2020-12-21|600
""")
df = pd.read_csv(data,sep='|')
df['date'] = pd.to_datetime(df['date'])
df
we get
date amount
0 2019-07-22 500
1 2019-07-25 200
2 2020-11-15 100
3 2020-11-06 900
4 2020-12-09 50
5 2020-12-21 600
Then we figure out the month start and the previous month start using datetime utilities
df['month_start'] = df['date'].apply(lambda d:datetime(year = d.year, month = d.month, day = 1))
df['prev_month_start'] = df['month_start'].apply(lambda d:d+relativedelta(months = -1))
Then we summarize monthly sales using groupby on month start
ms_df = df.drop(columns = 'date').groupby('month_start').agg({'prev_month_start':'first','amount':'sum'}).reset_index()
ms_df
so we get
month_start prev_month_start amount
0 2019-07-01 2019-06-01 700
1 2020-11-01 2020-10-01 1000
2 2020-12-01 2020-11-01 650
Then we join (merge) ms_df on itself by mapping 'prev_month_start' to 'month_start'
ms_df2 = ms_df.merge(ms_df, left_on='prev_month_start', right_on='month_start', how = 'left', suffixes = ('','_prev'))
We are more or less there but now make it pretty by getting rid of superfluous columns, adding labels, etc
ms_df2['label'] = ms_df2['month_start'].dt.strftime('%Y_%m')
ms_df2 = ms_df2.drop(columns = ['month_start','prev_month_start','month_start_prev','prev_month_start_prev'])
columns = ['label','amount','amount_prev']
ms_df2 = ms_df2[columns]
and we get
| | label | amount | amount_prev |
|---:|--------:|---------:|--------------:|
| 0 | 2019_07 | 700 | nan |
| 1 | 2020_11 | 1000 | nan |
| 2 | 2020_12 | 650 | 1000 |
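For comparison, a more compact variant of the same idea, sketched with pandas monthly periods instead of explicit month-start columns. This is an alternative I am suggesting, not part of the steps above, and the variable names are assumptions. Subtracting 1 from a monthly PeriodIndex steps each period back by one month, so a reindex looks up the previous month's total directly.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2019-07-22", "2019-07-25", "2020-11-15",
                            "2020-11-06", "2020-12-09", "2020-12-21"]),
    "amount": [500, 200, 100, 900, 50, 600],
})

# Total sales per calendar month, indexed by monthly Period.
monthly = df.groupby(df["date"].dt.to_period("M"))["amount"].sum()

# Look up each month's predecessor; months with no sales come back as NaN.
prev = monthly.reindex(monthly.index - 1)

result = pd.DataFrame({
    "label": monthly.index.strftime("%Y_%m"),
    "amount": monthly.to_numpy(),
    "amount_prev": prev.to_numpy(),
})
print(result)
```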
Using @piterbarg's data, we can use resample, combined with shift and concat, to get your desired data:
import pandas as pd
from io import StringIO
data = StringIO(
"""
date|amount
2019-07-22|500
2019-07-25|200
2020-11-15|100
2020-11-06|900
2020-12-09|50
2020-12-21|600
"""
)
df = pd.read_csv(data, sep="|", parse_dates=["date"])
df
date amount
0 2019-07-22 500
1 2019-07-25 200
2 2020-11-15 100
3 2020-11-06 900
4 2020-12-09 50
5 2020-12-21 600
Get the sum for current sales:
data = df.resample(on="date", rule="1M").amount.sum().rename("This_month")
data
date
2019-07-31 700
2019-08-31 0
2019-09-30 0
2019-10-31 0
2019-11-30 0
2019-12-31 0
2020-01-31 0
2020-02-29 0
2020-03-31 0
2020-04-30 0
2020-05-31 0
2020-06-30 0
2020-07-31 0
2020-08-31 0
2020-09-30 0
2020-10-31 0
2020-11-30 1000
2020-12-31 650
Freq: M, Name: This_month, dtype: int64
Now, we can shift the month to get values for previous month, and drop rows that have 0 as total sales to get your final output:
(pd.concat([data, data.shift().rename("previous_month")], axis=1)
.query("This_month!=0")
.fillna(0))
This_month previous_month
date
2019-07-31 700 0.0
2020-11-30 1000 0.0
2020-12-31 650 1000.0
I want to create a new column in a pandas dataframe based on the values of another column across multiple rows.
For example, my dataframe df:
A | B
------------
10 | 1
20 | 1
30 | 1
10 | 1
10 | 2
15 | 3
10 | 3
I want to create a variable C that is based on the values of A, conditioned on the value of B across multiple rows. When rows i, i+1, ... share the same value of B, the value of C in each of those rows is the sum of A over those rows. In this case, my output dataframe will be:
A | B | C
--------------------
10 | 1 | 70
20 | 1 | 70
30 | 1 | 70
10 | 1 | 70
10 | 2 | 10
15 | 3 | 25
10 | 3 | 25
I haven't got any idea the best way to achieve this. Can anyone help?
Thanks in advance
Recreate the data:
import pandas as pd
A = [10,20,30,10,10,15,10]
B = [1,1,1,1,2,3,3]
df = pd.DataFrame({'A':A, 'B':B})
df
A B
0 10 1
1 20 1
2 30 1
3 10 1
4 10 2
5 15 3
6 10 3
Then I'll create a lookup Series from the df:
lookup = df.groupby('B')['A'].sum()
lookup
A
B
1 70
2 10
3 25
And then I'll use that lookup on the df using apply:
df.loc[:,'C'] = df.apply(lambda row: lookup[lookup.index == row['B']].values[0], axis=1)
df
A B C
0 10 1 70
1 20 1 70
2 30 1 70
3 10 1 70
4 10 2 10
5 15 3 25
6 10 3 25
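The row-wise apply works, but the same lookup can probably be applied more directly with Series.map, which matches each value of B against the index of the lookup Series. A minimal sketch using the data above:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 10, 10, 15, 10],
                   "B": [1, 1, 1, 1, 2, 3, 3]})

# Sum of A per value of B, indexed by B.
lookup = df.groupby("B")["A"].sum()

# Map every row's B onto its group total.
df["C"] = df["B"].map(lookup)
print(df["C"].tolist())  # [70, 70, 70, 70, 10, 25, 25]
```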
Use groupby() to group the rows on B, then transform('sum') on A so that each group's total is broadcast back to every row of that group.
df['C'] = df.groupby('B')['A'].transform('sum')
I have a pandas data frame with several thousand observations and I would like to create "leakage-free" variables in Python. So I am looking for a way to calculate e.g. a group-specific mean of a variable without the single observation in row i.
For example:
| Group | Price | leakage-free Group Mean |
-------------------------------------------
| 1 | 20 | 25 |
| 1 | 40 | 15 |
| 1 | 10 | 30 |
| 2 | ... | ... |
I would like to do that with several variables and I would like to create mean, median and variance in such a way, so a computationally fast method might be good. If a group has only one row I would like to enter 0s in the leakage-free Variable.
As I am rather a beginner in Python, some piece of code might be very helpful. Thank You!!
With one-liner:
df = pd.DataFrame({'Group': [1,1,1,2], 'Price':[20,40,10,30]})
df['lfgm'] = df.groupby('Group')['Price'].transform(lambda x: (x.sum()-x)/(len(x)-1)).fillna(0)
print(df)
Output:
Group Price lfgm
0 1 20 25.0
1 1 40 15.0
2 1 10 30.0
3 2 30 0.0
Update:
For median and variance (not one-liners unfortunately):
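df = pd.DataFrame({'Group': [1,1,1,1,2], 'Price':[20,100,10,70,30]})
def f(x):
    x = x.copy()  # avoid mutating the group handed over by groupby
    for i in x.index:
        z = x.loc[x.index!=i, 'Price']  # all prices in the group except row i
        x.at[i, 'mean'] = z.mean()
        x.at[i, 'median'] = z.median()
        x.at[i, 'var'] = z.var()
    return x[['mean', 'median', 'var']]
# group_keys=False keeps the original row index so the join lines up
df = df.join(df.groupby('Group', group_keys=False).apply(f))
print(df)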
Output:
Group Price mean median var
0 1 20 60.000000 70.0 2100.000000
1 1 100 33.333333 20.0 1033.333333
2 1 10 63.333333 70.0 1633.333333
3 1 70 43.333333 20.0 2433.333333
4 2 30 NaN NaN NaN
Use:
grp = df.groupby('Group')
n = grp['Price'].transform('count')
mean = grp['Price'].transform('mean')
df['new_col'] = (mean*n - df['Price'])/(n-1)
print(df)
Group Price new_col
0 1 20 25.0
1 1 40 15.0
2 1 10 30.0
Note: This solution will be faster than using apply; you can check by timing each snippet with %%timeit.
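The same sums-based idea can be extended to the variance. The sketch below is my own extension under a few assumptions: the column names loo_mean and loo_var are made up, the variance is the sample variance (ddof=1, as pandas' var uses by default), and single-row groups are filled with 0 as the question asks. It relies on the identity that the sum of squared deviations equals the sum of squares minus m times the squared mean. The median has no comparable shortcut.

```python
import pandas as pd

df = pd.DataFrame({"Group": [1, 1, 1, 1, 2], "Price": [20, 100, 10, 70, 30]})

grp = df.groupby("Group")["Price"]
n = grp.transform("count")                                     # group size
s = grp.transform("sum")                                       # group sum
ss = (df["Price"] ** 2).groupby(df["Group"]).transform("sum")  # group sum of squares

m = n - 1                       # number of other rows in the group
loo_sum = s - df["Price"]       # sum without the current row
loo_ss = ss - df["Price"] ** 2  # sum of squares without the current row

# Leave-one-out mean; undefined when the row is alone in its group.
loo_mean = (loo_sum / m).where(m > 0)
# Leave-one-out sample variance; needs at least two other rows.
loo_var = ((loo_ss - m * loo_mean ** 2) / (m - 1)).where(m > 1)

df["loo_mean"] = loo_mean.fillna(0)
df["loo_var"] = loo_var.fillna(0)
print(df)
```

For prices that are large relative to their spread, the sum-of-squares formula can lose precision, in which case the loop-based version is the safer reference.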
If I have one-dimensional data such as
v1 =: 5 $ i.5
v1
0 1 2 3 4
then v1 -"1 0 v1 gives me the Euclidean vectors like
v1 -"1 0 v1
0 1 2 3 4
_1 0 1 2 3
_2 _1 0 1 2
_3 _2 _1 0 1
_4 _3 _2 _1 0
We can also find Euclidean distance matrix easily.
This is how I find the Euclidean vectors and distance matrix of a 2D vector
v2=: <"1 ? 5 2 $ 10
v2
┌───┬───┬───┬───┬───┐
│4 0│4 5│5 7│8 3│6 0│
└───┴───┴───┴───┴───┘
direction_vector=: <"1 @: (-"0 @:(-/"2 @: (>"0 @: (diff))))
distance =: +/"1 @: *: @: (>"2 @:(direction_vector))
m =: 3 : '(i.(#y)) distance"0 _ y'
m v2
0 25 50 25 4
25 0 5 20 29
50 5 0 25 50
25 20 25 0 13
4 29 50 13 0
However, my problem is that I am not sure how to find the Euclidean vectors and distance of 2D data in a smart and clean way
As you can see from the table below, my algorithm spends more than 1/3 of its time calculating the direction vectors of the data. 14.5 seconds is not bad, but the problem arises when I have a bigger dataset.
Time (seconds)
+----------------+------+--------+---------+-----+----+------+
|name |locale|all |here |here%|cum%|rep |
+----------------+------+--------+---------+-----+----+------+
|direction_vector|base |6.239582| 5.105430| 35.2| 35 |773040|
|move |base |9.741510| 1.753868| 12.1| 47 | 3390|
|script |base |1.969949| 1.443148| 9.9| 57 | 18|
|distance |base |5.650358| 1.318022| 9.1| 66 |579780|
|enclosestrings |pcsv |1.491832| 1.255603| 8.6| 75 | 1|
|diff |base |1.134585| 1.134585| 7.8| 83 |773186|
|makedsv |pcsv |1.728721| 0.236883| 1.6| 84 | 4|
|norm |base |0.221794| 0.221794| 1.5| 86 | 3390|
|xpt |base |0.194896| 0.194896| 1.3| 87 | 3390|
|ypt |base |0.193579| 0.193579| 1.3| 89 | 3390|
|writedsv |pcsv |2.067408| 0.186687| 1.3| 90 | 4|
|cosd |base |0.172359| 0.172359| 1.2| 91 | 113|
|[rest] | | | 1.300733| 9.0|100 | |
|[total] | | |14.517587|100.0|100 | |
+----------------+------+--------+---------+-----+----+------+
I think I can definitely simplify direction_vector one by using rank, but I got stuck. I tried "2 1 "1 1 "1 2 "_ 1 "1 0 ..., but none of them gave me a clear result.
Can anyone help me on this issue? Thank you!
I would begin with noticing that v2 u/ v2 makes a table out of the items of v2 by applying u between those items. Also, you can simplify this a bit by using u/~ v2 which is the same as v2 u/ v2. The next question is what is u, but before we go there, boxing really slows things down and you don't actually have to box the vector for this to work since the items can already be written like this:
[ v2=: 5 2 $ 4 0 4 5 5 7 8 3 6 0
4 0
4 5
5 7
8 3
6 0
and this makes the items the vectors, which is what you want to be able to use u/~ v2
Now, back to the question of what we want u to be. Since u is being fed the items of v2 to make the table, you would like to subtract the items from each other as vectors (rank 1), then square the differences, and then add them together. Translating this into J you get +/@:*:@:-"1 as u
+/@:*:@:-"1/~ v2
0 25 50 25 4
25 0 5 20 29
50 5 0 25 50
25 20 25 0 13
4 29 50 13 0
If you time this I hope that you will find it much faster than your solution because it does not require boxing. The key area that you were concerned about with rank was that it be applied after the Table adverb /
Hope this helps even though it is a slightly different approach and let me know what your timings look like.
New answer so that I could format the result properly.
Looking at your result I think that there is further room for improvement at least by my measurements. Getting away from the tacit approach and moving to a sequential application with the change of +/ to +/"1 gives roughly twice the speed.
[ v2=: 5 2 $ 4 0 4 5 5 7 8 3 6 0
4 0
4 5
5 7
8 3
6 0
(10000) 6!:2 '+/@:*:@:-"1/~ v2'
8.4647e_6
(10000) 6!:2 '+/"1 *: -"1/~ v2'
3.1289e_6
(+/"1 *: -"1/~ v2) -: (+/@:*:@:-"1/~ v2)
1
+/"1 *: -"1/~ v2
0 25 50 25 4
25 0 5 20 29
50 5 0 25 50
25 20 25 0 13
4 29 50 13 0