pandas - get difference to previous n-th rows - python-3.x

Assume I have the following data frame in pandas, with accumulated values over time for all ids:
id date value
1 01.01.1999 2
2 01.01.1999 3
3 01.01.1999 5
1 03.01.1999 5
2 03.01.1999 8
3 03.01.1999 7
And I want to have the following, the difference for each id to the previous date:
id date value
1 01.01.1999 2
2 01.01.1999 3
3 01.01.1999 5
1 03.01.1999 3
2 03.01.1999 5
3 03.01.1999 2
This is basically a per-id difference. The only thing I can come up with is something like this:
df["value"].diff().fillna(0)
But this ignores the id and date columns. Any help?

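If I understand correctly, you want to group by id and take the diff within each group. The first row of each id has no previous row, so its NaN is filled with the original value: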
df['value'] = df.groupby('id')['value'].diff().fillna(df['value'])
print(df)
id date value
0 1 01.01.1999 2.0
1 2 01.01.1999 3.0
2 3 01.01.1999 5.0
3 1 03.01.1999 3.0
4 2 03.01.1999 5.0
5 3 03.01.1999 2.0
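A minimal, self-contained sketch of the same groupby-and-diff idea, using the sample data from the question. The helper name per_id_increment and the increment column are made up for illustration, and the sketch assumes the rows are already in date order within each id. The periods argument of diff is what lets you compare against the n-th previous row rather than the immediately previous one.

```python
import pandas as pd

# Sample data from the question: accumulated values per id and date.
df = pd.DataFrame({
    "id": [1, 2, 3, 1, 2, 3],
    "date": ["01.01.1999"] * 3 + ["03.01.1999"] * 3,
    "value": [2, 3, 5, 5, 8, 7],
})


def per_id_increment(frame, periods=1):
    """Difference of `value` to the row `periods` steps earlier within the same id.

    Hypothetical helper for illustration; assumes rows are date-ordered per id.
    """
    diffs = frame.groupby("id")["value"].diff(periods=periods)
    # Rows without a predecessor come back as NaN; keep their original value.
    return diffs.fillna(frame["value"]).astype(frame["value"].dtype)


df["increment"] = per_id_increment(df)
print(df)
```

Writing to a new column instead of overwriting value keeps the accumulated series available for checking the result.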

Related

pandas - rolling sum last seven days over different rows

Starting from this data frame:
id date value
1 01.01. 2
2 01.01. 3
1 01.03. 5
2 01.03. 3
1 01.09. 5
2 01.09. 2
1 01.10. 5
2 01.10. 2
I would like to get a weekly sum of value:
id date value
1 01.01. 2
2 01.01. 3
1 01.03. 7
2 01.03. 6
1 01.09. 10
2 01.09. 5
1 01.10. 15
2 01.10. 8
I use this command, but it is not working:
df['value'] = df.groupby('id')['value'].rolling(7).sum()
Any ideas?
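You can combine groupby with apply and a time-based rolling window. rolling(7) counts 7 rows, whereas rolling('7D') covers 7 calendar days and needs a datetime column.
df['date'] = pd.to_datetime(df['date'], format='%m.%d.')
df['value'] = (df.groupby('id', as_index=False, group_keys=False)
                 .apply(lambda g: g.rolling('7D', on='date')['value'].sum()))
Note that for 1900-01-10 the rolling window runs from 1900-01-04 through 1900-01-10, so the 1900-01-03 row is not included.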
print(df)
id date value
0 1 1900-01-01 2.0
1 2 1900-01-01 3.0
2 1 1900-01-03 7.0
3 2 1900-01-03 6.0
4 1 1900-01-09 10.0
5 2 1900-01-09 5.0
6 1 1900-01-10 10.0
7 2 1900-01-10 4.0
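The per-group 7-day window can also be spelled out with an explicit loop, which makes the index alignment easier to follow. This is only a sketch: the year 1999 is appended as an arbitrary assumption so the dates parse unambiguously, and the weekly_sum column name is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 1, 2, 1, 2, 1, 2],
    "date": ["01.01.", "01.01.", "01.03.", "01.03.",
             "01.09.", "01.09.", "01.10.", "01.10."],
    "value": [2, 3, 5, 3, 5, 2, 5, 2],
})
# Assumption: the year is not given in the question, so 1999 is appended here.
df["date"] = pd.to_datetime(df["date"] + "1999", format="%m.%d.%Y")

parts = []
for _, group in df.groupby("id"):
    # A '7D' window needs a sorted datetime index; it covers (t - 7 days, t].
    rolled = group.set_index("date")["value"].rolling("7D").sum()
    # Put the original row labels back so the result aligns with df.
    parts.append(pd.Series(rolled.to_numpy(), index=group.index))

df["weekly_sum"] = pd.concat(parts)
print(df)
```

Restoring the original row labels before assigning is the design point here: assignment to a column aligns on the index, so the per-group results land on the right rows regardless of the order in which the groups were processed.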

sumproduct in different columns between dates

I'm trying to sum a product of two columns between two dates. I have a start date in Sheet1!F1 and an end date in Sheet1!F2, and I need to multiply column B by column E.
I can do SUMPRODUCT(Sheet1!B2:B14,Sheet1!E2:E14), which results in 48 for the example table below. However, I need to add date parameters so that, for example, choosing the dates 2/1/15 to 6/1/15 results in 20.
A B C D E
Date Value1 Value2 Value3 Value4
1/1/2015 1 2 3 4
2/1/2015 1 2 3 4
3/1/2015 1 2 3 4
4/1/2015 1 2 3 4
5/1/2015 1 2 3 4
6/1/2015 1 2 3 4
7/1/2015 1 2 3 4
8/1/2015 1 2 3 4
9/1/2015 1 2 3 4
10/1/2015 1 2 3 4
11/1/2015 1 2 3 4
12/1/2015 1 2 3 4
Try,
=SUMPRODUCT((Sheet1!A2:A14>=Sheet1!F1)*(Sheet1!A2:A14<=Sheet1!F2)*Sheet1!B2:B14*Sheet1!E2:E14)
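To see why the formula works, it may help to reproduce the arithmetic outside the spreadsheet: each date comparison yields TRUE/FALSE, which acts as 1/0 when multiplied, so rows outside the range contribute nothing to the sum. The sketch below mirrors that in Python with pandas. It assumes the dates in the table are month/day/year, and the variable names are made up for illustration.

```python
import pandas as pd

# Columns A, B and E of the example table (first of each month in 2015).
sheet = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=12, freq="MS"),
    "Value1": [1] * 12,
    "Value4": [4] * 12,
})
start = pd.Timestamp("2015-02-01")  # stands in for Sheet1!F1
end = pd.Timestamp("2015-06-01")    # stands in for Sheet1!F2

# The two comparisons become a 0/1 mask, just like in the SUMPRODUCT formula.
in_range = ((sheet["Date"] >= start) & (sheet["Date"] <= end)).astype(int)

unfiltered = int((sheet["Value1"] * sheet["Value4"]).sum())
total = int((in_range * sheet["Value1"] * sheet["Value4"]).sum())
print(unfiltered, total)
```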

Pandas how to turn each group into a dataframe using groupby

I have a dataframe that looks like this:
A B
1 2
1 3
1 4
2 5
2 6
3 7
3 8
If I call df.groupby('A'), how do I turn each group into its own sub-dataframe? For A=1 it should look like this:
A B
1 2
1 3
1 4
for A=2,
A B
2 5
2 6
for A=3,
A B
3 7
3 8
Use get_group to pull out a single group:
g=df.groupby('A')
g.get_group(1)
Out[367]:
A B
0 1 2
1 1 3
2 1 4
You are close. You need to convert the groupby object to a dictionary of DataFrames:
dfs = dict(tuple(df.groupby('A')))
print (dfs[1])
A B
0 1 2
1 1 3
2 1 4
print (dfs[2])
A B
3 2 5
4 2 6
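Iterating over a groupby object yields (key, sub-dataframe) pairs, which is why dict(tuple(...)) works. A small self-contained sketch of the same idea with a dict comprehension follows; the name dfs matches the answer above, and everything else is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 3, 3],
    "B": [2, 3, 4, 5, 6, 7, 8],
})

# Each iteration step gives the group key and the rows belonging to it.
dfs = {key: group for key, group in df.groupby("A")}

for key, group in dfs.items():
    print(f"A={key}")
    print(group)
```

Each value keeps the original row labels; call reset_index(drop=True) on a group if a fresh 0-based index is preferred.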

Repeating elements in a dataframe

Hi all, I have the following dataframe:
A | B | C
1 2 3
2 3 4
3 4 5
4 5 6
And I am trying to only repeat the last two rows of the data so that it looks like this:
A | B | C
1 2 3
2 3 4
3 4 5
3 4 5
4 5 6
4 5 6
I have tried using append, concat and repeat to no avail.
repeated = lambda x:x.repeat(2)
df.append(df[-2:].apply(repeated),ignore_index=True)
This returns the following dataframe, which is incorrect:
A | B | C
1 2 3
2 3 4
3 4 5
4 5 6
3 4 5
3 4 5
4 5 6
4 5 6
You can use numpy.repeat to repeat the index labels of the last two rows and build df1 with loc. Then drop the last two rows of the original with iloc and append df1 to what remains:
df1 = df.loc[np.repeat(df.index[-2:].values, 2)]
print (df1)
A B C
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
print (df.iloc[:-2])
A B C
0 1 2 3
1 2 3 4
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df.iloc[:-2], df1], ignore_index=True)
print (df)
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 3 4 5
4 4 5 6
5 4 5 6
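If you want to stay close to your own code, slice with iloc so that only the last two rows are repeated and the rest is kept once:
repeated = lambda x: x.repeat(2)
# pd.concat replaces df.append, which was removed in pandas 2.0
df = pd.concat([df.iloc[:-2], df.iloc[-2:].apply(repeated)], ignore_index=True)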
print (df)
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 3 4 5
4 4 5 6
5 4 5 6
Use pd.concat and index slicing with .iloc:
pd.concat([df,df.iloc[-2:]]).sort_values(by='A')
Output:
A B C
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
I'm partial to manipulating the index into the pattern we are aiming for then asking the dataframe to take the new form.
Option 1
Use pd.DataFrame.reindex
df.reindex(df.index[:-2].append(df.index[-2:].repeat(2)))
A B C
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
Same thing in multiple lines
i = df.index
idx = i[:-2].append(i[-2:].repeat(2))
df.reindex(idx)
Could also use loc
i = df.index
idx = i[:-2].append(i[-2:].repeat(2))
df.loc[idx]
Option 2
Reconstruct from values. Only do this if all dtypes are the same.
i = np.arange(len(df))
idx = np.append(i[:-2], i[-2:].repeat(2))
pd.DataFrame(df.values[idx], df.index[idx])
0 1 2
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
Option 3
You can also pass a NumPy array of positions to iloc
i = np.arange(len(df))
idx = np.append(i[:-2], i[-2:].repeat(2))
df.iloc[idx]
A B C
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
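The positional approach from Option 3 generalises to a small helper. The sketch below is only illustrative: the function name repeat_tail and its parameters are made up here and are not part of pandas.

```python
import numpy as np
import pandas as pd


def repeat_tail(frame, n_rows=2, times=2):
    """Return a copy of `frame` in which the last `n_rows` rows appear `times` times each.

    Hypothetical helper for illustration only.
    """
    positions = np.arange(len(frame))
    split = len(frame) - n_rows
    # Keep the head once, then repeat each tail position in place.
    order = np.concatenate([positions[:split], positions[split:].repeat(times)])
    return frame.iloc[order].reset_index(drop=True)


df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 3, 4, 5], "C": [3, 4, 5, 6]})
result = repeat_tail(df)
print(result)
```

Working with positions rather than labels means the helper does not depend on the index being unique or sorted, and dtypes and column names are preserved.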

how to calculate standard deviation from different columns in shell script

I have a datafile with 10 columns as given below
ifile.txt
2 4 4 2 1 2 2 4 2 1
3 3 1 5 3 3 4 5 3 3
4 3 3 2 2 1 2 3 4 2
5 3 1 3 1 2 4 5 6 8
I want to add an 11th column that shows the standard deviation of each row across its 10 columns, i.e. STDEV(2 4 4 2 1 2 2 4 2 1) and so on.
I am able to do this by transposing the file, running the following command, and transposing the result back:
awk '{x[NR]=$0; s+=$1} END{a=s/NR; for (i in x){ss += (x[i]-a)^2} sd = sqrt(ss/NR); print sd}'
Can anybody suggest a simpler way to do this directly along each row?
You can do the same with one pass as well.
awk '{for(i=1;i<=NF;i++){s+=$i;ss+=$i*$i}m=s/NF;$(NF+1)=sqrt(ss/NF-m*m);s=ss=0}1' ifile.txt
Do you mean something like this?
awk '{for(i=1;i<=NF;i++)s+=$i;M=s/NF;
for(i=1;i<=NF;i++)sd+=(($i-M)^2);$(NF+1)=sqrt(sd/NF);M=sd=s=0}1' file
2 4 4 2 1 2 2 4 2 1 1.11355
3 3 1 5 3 3 4 5 3 3 1.1
4 3 3 2 2 1 2 3 4 2 0.916515
5 3 1 3 1 2 4 5 6 8 2.13542
You just use the fields instead of transposing and using the rows.
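The two awk answers use different but equivalent formulas for the population standard deviation: the two-pass form sqrt(mean((x - m)^2)) and the one-pass form sqrt(mean(x^2) - m^2). As a cross-check, here is a short Python sketch with NumPy that computes both per row on the sample data; the variable names are illustrative.

```python
import numpy as np

# Contents of ifile.txt from the question.
data = np.array([
    [2, 4, 4, 2, 1, 2, 2, 4, 2, 1],
    [3, 3, 1, 5, 3, 3, 4, 5, 3, 3],
    [4, 3, 3, 2, 2, 1, 2, 3, 4, 2],
    [5, 3, 1, 3, 1, 2, 4, 5, 6, 8],
], dtype=float)

mean = data.mean(axis=1)

# Two-pass form: average squared deviation from the row mean.
two_pass = np.sqrt(((data - mean[:, None]) ** 2).mean(axis=1))

# One-pass form: mean of squares minus square of the mean.
one_pass = np.sqrt((data ** 2).mean(axis=1) - mean ** 2)

# Append the standard deviation as an 11th column.
with_sd = np.column_stack([data, two_pass])
print(np.round(two_pass, 5))
```

Both forms divide by the number of fields, so they give the population standard deviation, not the sample one. The one-pass form can lose precision when the values are large relative to their spread, which is not an issue for this data.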
