Python Pandas: How to insert a new column which is the sum of the next 'n' (possibly fractional) values of another column? - python-3.x

I've got a DataFrame, let's call it 'test', storing the data below:
Week Stock(In Number of Weeks) Demand (In Units)
0 W01 2.4 37
1 W02 3.6 33
2 W03 2.0 46
3 W04 5.8 45
4 W05 4.6 56
5 W06 3.0 38
6 W07 5.0 45
7 W08 7.5 54
8 W09 4.3 35
9 W10 2.2 38
10 W11 2.0 50
11 W12 6.0 37
I want to insert a new column in this dataframe whose value, for every row, is the sum of the next 'Stock(In Number of Weeks)' rows of the column "Demand (In Units)".
That is, in the case of this dataframe,
for the 0th row, the new column should be the sum of 2.4 rows of "Demand (In Units)", which is 37 + 33 + 0.4*46
for the 1st row, the value should be 33 + 46 + 45 + 0.6*56
for the 2nd row, it should be 46 + 45
.
.
.
for the 7th row, it should be 54 + 35 + 38 + 50 + 37 (since the number of rows left is smaller than 7.5, all the remaining rows get summed up)
.
.
.
and so on.
Effectively, I want my dataframe to have a new column as follows:
Week Stock(In Number of Weeks) Demand (In Units) Stock (In Units)
0 W01 2.4 37 88.4
1 W02 3.6 33 157.6
2 W03 2.0 46 91.0
3 W04 5.8 45 266.0
4 W05 4.6 56 214.0
5 W06 3.0 38 137.0
6 W07 5.0 45 222.0
7 W08 7.5 54 214.0
8 W09 4.3 35 160.0
9 W10 2.2 38 95.4
10 W11 2.0 50 87.0
11 W12 6.0 37 37.0
Can somebody suggest some way to achieve this?
I can achieve it by iterating over each row, but that would be very slow for the millions of rows I want to process at a time.
The code which I am using right now is:
for i in range(len(test)):
    if int(np.floor(test.loc[i, 'Stock(In Number of Weeks)'])) >= len(test[i:]):
        number_of_full_rows = len(test[i:])
        fraction_of_last_row = 0
        y = 0
    else:
        number_of_full_rows = int(np.floor(test.loc[i, 'Stock(In Number of Weeks)']))
        fraction_of_last_row = test.loc[i, 'Stock(In Number of Weeks)'] - number_of_full_rows
        y = test.loc[i + number_of_full_rows, 'Demand (In Units)'] * fraction_of_last_row
    x = np.sum(test[i:i + number_of_full_rows]['Demand (In Units)'])
    test.loc[i, 'Stock (In Units)'] = x + y

I tried with some test data:
def func(r, col):
    n = int(r['Stock(In Number of Weeks)'])        # whole weeks
    f = float(r['Stock(In Number of Weeks)'] - n)  # fractional part
    i = r.name                                     # row index value
    z = np.zeros(len(df))                          # initialize all zeros
    v = np.hstack((np.ones(n), np.array([f])))     # vector of ones plus the fractional part
    e = min(len(v), len(z) - i)                    # clip at the end of the frame
    z[i:i + e] = v[:e]                             # place the weights starting at this row's index
    r['Stock (In Units)'] = col @ z                # scalar product with the demand column
    return r

df = df.apply(lambda r: func(r, df['Demand (In Units)'].values), axis=1)
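For completeness, here is a minimal way to reproduce the sample frame from the question and run the function above (a sketch: the column names and values are copied from the question, and it assumes the func defined above is already in scope):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Week': ['W%02d' % i for i in range(1, 13)],
    'Stock(In Number of Weeks)': [2.4, 3.6, 2.0, 5.8, 4.6, 3.0, 5.0, 7.5, 4.3, 2.2, 2.0, 6.0],
    'Demand (In Units)': [37, 33, 46, 45, 56, 38, 45, 54, 35, 38, 50, 37],
})

df = df.apply(lambda r: func(r, df['Demand (In Units)'].values), axis=1)
print(df['Stock (In Units)'])  # should match the expected column: 88.4, 157.6, 91.0, ...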

Related

Finding which rows have duplicates in a .csv, but only if they have a certain amount of duplicates

I am trying to determine which sequential rows have at least 50 duplicates within one column. Then I would like to be able to read which rows have the duplicates in a summarized manner, i.e.:
start end total
9 60 51
200 260 60
I'm trying to keep the start and end separate so I can call on them independently later.
I have this to open the .csv file and read its contents:
df = pd.read_csv("BN4 A4-F4, H4_row1_column1_watershed_label.csv", header=None)
df.groupby(0).filter(lambda x: len(x) > 0)
Which gives me this:
0
0 52.0
1 65.0
2 52.0
3 52.0
4 52.0
... ...
4995 8.0
4996 8.0
4997 8.0
4998 8.0
4999 8.0
5000 rows × 1 columns
I'm having a number of problems with this. 1) I'm not sure I totally understand the second function. It seems like it is supposed to group the numbers in my column together. This code:
df.groupby(0).count()
gives me this:
0
0.0
1.0
2.0
3.0
4.0
...
68.0
69.0
70.0
71.0
73.0
65 rows × 0 columns
Which I assume means that there are a total of 65 different unique identities in my column. This just doesn't tell me what they are or where they are. I thought that's what this one would do
df.groupby(0).filter(lambda x: len(x) > 0)
but if I change the 0 to anything else then it screws up my generated list.
Problem 2) I think in order to get the number of duplicates in a sequence, and which rows they are in, I would probably need to use a for loop, but I'm not sure how to build it. So far, I've been pulling my hair out all day trying to figure it out but I just don't think I know Python well enough yet.
Can I get some help, please?
UPDATE
Thanks! So this is what I have thanks to @piterbarg:
# function to identify which behaviors have at least 49 frames, and give the starting, ending, and number of frames
def behavior():
    df2 = (df
           .reset_index()
           .shift(periods=-1)
           .groupby((df[0].diff() != 0).cumsum())  # if the diff between a row and the prev row is not 0, increase the cumulative sum
           .agg({0: 'mean', 'index': ['first', 'last', len]}))  # mean is the behavior category
    df3 = (df2.where(df2[('index', 'len')] > 49)
              .dropna()               # drop N/A
              .astype(int)            # type = int
              .reset_index(drop=True))
    print(df3)
out:
0 index
mean first last len
0 7 32 87 56
1 19 277 333 57
2 1 785 940 156
3 30 4062 4125 64
4 29 4214 4269 56
5 7 4450 4599 150
6 1 4612 4775 164
7 7 4778 4882 105
8 8 4945 4999 56
The current issue is trying to make it so the dataframe includes the last row of my .csv. If anyone happens to see this, I would love your input!
Let's start by mocking a df:
import numpy as np
import pandas as pd

np.random.seed(314)
df = pd.DataFrame({0: np.random.randint(10, size=5000)})
# make sure we have a couple of large blocks
df.loc[300:400, 0] = 5
df.loc[600:660, 0] = 4
First we identify where the changes in the consecutive numbers occur and group by each such run. We record where each group starts, where it finishes, and its size:
df2 = (df.reset_index()
         .groupby((df[0].diff() != 0).cumsum())
         .agg({'index': ['first', 'last', len]})
       )
Then we only pick those groups that are longer than 50
(df2.where(df2[('index', 'len')] > 50)
    .dropna()
    .astype(int)
    .reset_index(drop=True)
)
output:
index
first last len
0 300 400 101
1 600 660 61
As for your question about what df.groupby(0).filter(lambda x: len(x) > 0) does: as far as I can tell, it does nothing. It groups by the different values in column 0 and then discards those groups whose size is 0, which by definition is none of them. So it returns your full df.
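For contrast, a filter with a real threshold would actually drop groups. Note, though, that groupby(0) groups by value, not by consecutive runs, so this counts every occurrence of a value across the whole column rather than within a single run (a sketch, not a solution to the consecutive-duplicates problem):

# keep only rows whose value in column 0 appears at least 50 times anywhere in the column
df_frequent = df.groupby(0).filter(lambda x: len(x) >= 50)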
Edit
Your code is not quite right; it should be:
def behavior():
    df2 = (df.reset_index()
             .groupby((df[0].diff() != 0).cumsum())
             .agg({0: 'mean', 'index': ['first', 'last', len]}))
    df3 = (df2.where(df2[('index', 'len')] > 50)
              .dropna()
              .astype(int)
              .reset_index(drop=True))
    print(df3)
Note that we define and use df3, not df2. I also amended the code to report the value that is repeated, in the mean column (sorry, the names are not very intuitive, but you can change them if you want).
first is the index where the repetition starts, last is the last index, and len is how many elements there are.
# function to identify which behaviors have at least 49 frames, and give the starting, ending, and number of frames
def behavior():
    df2 = (df.reset_index()
             .groupby((df[0].diff() != 0).cumsum())  # if the diff between a row and the prev row is not 0, increase the cumulative sum
             .agg({0: 'mean', 'index': ['first', 'last', len]}))  # mean is the behavior category
    df3 = (df2.where(df2[('index', 'len')] > 49)
              .dropna()               # drop N/A
              .astype(int)            # type = int
              .reset_index(drop=True))
    print(df3)
yields this:
0 index
mean first last len
0 7 31 86 56
1 19 276 332 57
2 1 784 939 156
3 31 4061 4124 64
4 29 4213 4268 56
5 8 4449 4598 150
6 1 4611 4774 164
7 8 4777 4881 105
8 8 4944 4999 56
Which I love. I did notice that the group with 56x duplicates of '7' actually starts on row 32, and ends on row 87 (just one later in both cases, and the pattern is consistent throughout the sheet). Am I right in believing that this can be fixed with the shift() function somehow? I'm toying around with this still :D

Locate dataframe rows where values are outside bounds specified for each column

I have a dataframe with k columns and n rows, k ~= 10, n ~= 1000. I have a (2, k) array representing bounds on values for each column, e.g.:
# For 5 columns
bounds = ([0.1, 1, 0.1, 5, 10],
[10, 1000, 1, 1000, 50])
# Example df
a b c d e
0 5 3 0.3 17 12
1 12 50 0.5 2 31
2 9 982 0.2 321 21
3 1 3 1.2 92 48
# Expected output with bounds given above
a b c d e
0 5 3 0.3 17 12
2 9 982 0.2 321 21
Crucially, the bounds on each column are different.
I would like to identify and exclude all rows of the dataframe where any column value falls outside the bounds for that respective column, preferably using array operations rather than iterating over the dataframe. The best I can think of so far involves iterating over the columns (which isn't too bad but still seems less than ideal):
for i in range(len(df.columns)):
    col = df.columns[i]
    df = df[(bounds[0][i] < df[col]) & (df[col] < bounds[1][i])]
Is there a better way to do this? Or alternatively, to select only the rows where all column values are within the respective bounds?
One way using pandas.DataFrame.apply with pandas.Series.between:
bounds = dict(zip(df.columns, zip(*bounds)))
new_df = df[~df.apply(lambda x: ~x.between(*bounds[x.name])).any(axis=1)]
print(new_df)
Output:
a b c d e
0 5 3 0.3 17 12
2 9 982 0.2 321 21
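If the bounds arrays are kept in the same order as the columns, a fully vectorised alternative is to broadcast the comparison over the whole values array (a sketch, assuming strict inequalities as in the question and that bounds is still the original (2, k) structure, not the dict built above):

import numpy as np

lower = np.array(bounds[0])   # lower bounds, one per column
upper = np.array(bounds[1])   # upper bounds, one per column
mask = ((df.to_numpy() > lower) & (df.to_numpy() < upper)).all(axis=1)
new_df = df[mask]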

Fill in missing values in DataFrame Column which is incrementing by 10

Say some values in the 'Counts' column are missing. These numbers are meant to increase by 10 with each row, so '35' and '55' need to be put in place. I want to fill in these missing values.
Counts
0 25
1 NaN
2 45
3 NaN
4 65
So my output should be :
Counts
0 25
1 35
2 45
3 55
4 65
Thanks,
We can use interpolate:
df = df.interpolate()
Counts
0 25.0
1 35.0
2 45.0
3 55.0
4 65.0
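interpolate returns floats; if the original integer look is wanted, a cast can follow (a small addition, assuming no NaNs remain after interpolation):

df['Counts'] = df['Counts'].interpolate().astype(int)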
Since you know the pattern, you can simply recreate it:
start = df.iloc[0]['Counts']   # first row
end = df.iloc[-1]['Counts']    # last row
df['Counts'] = np.where(df['Counts'].notnull(), df['Counts'],
                        np.arange(start, end + 1, 10))

Grouping data based on month-year in pandas and then dropping all entries except the latest one- Python

Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write code that groups the data by month-year, keeps the entry with the latest date in each month-year, and drops the rest. The data runs until the year 2020.
I was only able to fetch the count by month-year. I am not able to write proper code that groups the data by month-year and indicator and gets the correct results.
Use Series.dt.to_period for monthly periods, get the index of the maximal date per group with DataFrameGroupBy.idxmax, and then pass it to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('m'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('m'))['Date'].idxmax()]
print (df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90
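An equivalent approach (a sketch, not from the answer above) is to sort by date and take the last row of each monthly period, which avoids the idxmax lookup:

df['Date'] = pd.to_datetime(df['Date'])
dfs = df.sort_values('Date')
out = dfs.groupby(dfs['Date'].dt.to_period('M')).tail(1)   # last (latest) row per month-year
print(out)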

How to find exponential formula coefficients?

I have the following pairs of values:
X Y
1 2736
2 3124
3 3560
4 4047
5 4594
6 5205
7 5890
8 6658
9 7518
10 8480
18 21741
32 108180
35 152237
36 170566
37 191068
38 214087
39 239838
40 268679
When I put these pairs in Excel, I get an exponential formula:
Y = 2559*e^(0.1167*X)
with an accuracy of 99.98%.
Is there a way to ask Excel to provide a formula in the following format:
Y = (A/B)*C^X-D
If not, is it possible to convert the above formula to the wanted one?
Note, that I am not familiar with Matlab.
You already have it!
A = 2559
B = 1
C = exp(0.1167)
D = 0
You'll see that it is equivalent to your formula Y = 2559*e^(0.1167*X), because e^(0.1167*X) = (e^0.1167)^X
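A quick check in Python confirms the two forms coincide (assuming the coefficients from the Excel fit):

import numpy as np

A, B, D = 2559, 1, 0
C = np.exp(0.1167)
for x in (1, 10, 40):
    excel_form = 2559 * np.exp(0.1167 * x)        # Y = 2559*e^(0.1167*X)
    wanted_form = (A / B) * C ** x - D            # Y = (A/B)*C^X - D
    print(x, excel_form, wanted_form)             # the two values are identical for every x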
