Cumulatively Reduce Values in Column by Results of Another Column - python-3.x

I am dealing with a dataset that shows duplicated stock per part and location. Orders from multiple customers are coming in, and the stock was simply attached to each row via a VLOOKUP. I need help writing some sort of looping function in Python that cumulatively decreases the stock quantity by the order quantity.
Currently data looks like this:
   SKU  Plant  Order  Stock
0  5455   989      2     90
1  5455   989     15     90
2  5455   990     10     80
3  5455   990     20     80
I want to accomplish this:
   SKU  Plant  Order  Stock
0  5455   989      2     88
1  5455   989     15     73
2  5455   990     10     70
3  5455   990     20     50

Try:
df.Stock -= df.groupby(['SKU','Plant'])['Order'].cumsum()
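For reference, here is a minimal, self-contained sketch of that one-liner; the DataFrame below simply re-creates the sample data from the question:

import pandas as pd

df = pd.DataFrame({'SKU':   [5455, 5455, 5455, 5455],
                   'Plant': [989, 989, 990, 990],
                   'Order': [2, 15, 10, 20],
                   'Stock': [90, 90, 80, 80]})

# Subtract the running total of orders within each (SKU, Plant) group
# from the duplicated stock figure.
df.Stock -= df.groupby(['SKU', 'Plant'])['Order'].cumsum()
print(df)

This works because cumsum() restarts for every (SKU, Plant) group, so each row is reduced by its own order plus all earlier orders against the same stock.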

Related

How do I create series in Excel using criteria from values in other cells?

I have an Excel spreadsheet populated as below:
Latitude  Longitude  Altitude  Value
      10         10         1    100
      10         10         5    105
      10         10        20    120
      10          5         1    150
      10          5         5    155
      10          5        20    170
      15         10         1    500
      15         10         5    505
      15         10        20    520
      15          5         1    550
      15          5         5    555
      15          5        20    570
Using this data, I would like to create a chart in Excel with Value on the X-axis, Altitude on the Y-axis, and a series for each unique combination of Latitude and Longitude.
This should result in 4 series being plotted on the chart, each with 3 points (one per Altitude). I feel like this should be easy, but I'm struggling to do it myself or to find anything with the grand old Google.
Any help you could provide this Excel noob would be greatly appreciated!
If you re-arrange your data like this:
long-lat:  10-10             10-5              15-10             15-5
           value  altitude   value  altitude   value  altitude   value  altitude
           100    1          150    1          500    1          550    1
           105    5          155    5          505    5          555    5
           120    20         170    20         520    20         570    20
you can insert the four series individually into an XY scatter ("points (x/y)") chart:
Here is a screenshot of how the curves are defined:
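If you would rather do the rearrangement programmatically, here is a rough pandas sketch of the same reshape; the DataFrame just re-creates the spreadsheet data, and the column names are assumptions:

import pandas as pd

df = pd.DataFrame({
    'Latitude':  [10]*6 + [15]*6,
    'Longitude': [10, 10, 10, 5, 5, 5] * 2,
    'Altitude':  [1, 5, 20] * 4,
    'Value':     [100, 105, 120, 150, 155, 170,
                  500, 505, 520, 550, 555, 570],
})

# One column of Values per (Latitude, Longitude) pair, indexed by
# Altitude, i.e. one ready-made series per column.
wide = df.pivot_table(index='Altitude', columns=['Latitude', 'Longitude'],
                      values='Value')
print(wide)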

Selecting columns/axes for correlation from Pandas df

I have a pandas dataframe like the one below. I would like to build a correlation matrix that establishes the relationship between product ownership and the profit/cost/rev for a series of customer records.
   prod_owned_a  prod_owned_b  profit  cost  rev
0             1             0     100    75  175
1             0             1     125   100  225
2             1             0     100    75  175
3             1             1     225   175  400
4             0             1     125   100  225
Ideally, the matrix will have all prod_owned along one axis with profit/cost/rev along another. I would like to avoid including the correlation between prod_owned_a and prod_owned_b in the correlation matrix.
Question: How can I select specific columns for each axis? Thank you!
As long as the order of the columns does not change, you can use slicing:
df.corr().loc[:'prod_owned_b', 'profit':]
# profit cost rev
#prod_owned_a 0.176090 0.111111 0.147442
#prod_owned_b 0.616316 0.666667 0.638915
A more robust solution locates all "prod_*" columns:
prod_cols = df.columns.str.match('prod_')
df.corr().loc[prod_cols, ~prod_cols]
# profit cost rev
#prod_owned_a 0.176090 0.111111 0.147442
#prod_owned_b 0.616316 0.666667 0.638915
Not very optimized, but it still works:
df.corr().loc[['prod_owned_a', 'prod_owned_b'], ['profit', 'cost', 'rev']]
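For completeness, a runnable sketch of the column-mask approach; the frame below re-creates the sample data from the question:

import pandas as pd

df = pd.DataFrame({'prod_owned_a': [1, 0, 1, 1, 0],
                   'prod_owned_b': [0, 1, 0, 1, 1],
                   'profit':       [100, 125, 100, 225, 125],
                   'cost':         [75, 100, 75, 175, 100],
                   'rev':          [175, 225, 175, 400, 225]})

# Boolean mask of the ownership columns; its negation picks profit/cost/rev.
prod_cols = df.columns.str.match('prod_')
print(df.corr().loc[prod_cols, ~prod_cols])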

Mark sudden changes in prices in a dataframe time series and color them

I have a Pandas dataframe of prices for different months and years (timeseries), 80 columns. I want to be able to detect significant changes in prices either up or down and color them differently in a dataframe. Is that possible and what would be the best approach?
Jan-2001  Feb-2001  Jan-2002  Feb-2002  ....
     100        30        10       ...
     110        25         1       ...
      40         5        50       ...
      70        11         4       ...
     120        35         2       ...
Here in the first column 40 and 70 should be marked, in the second column 5 and 11 should be marked, in the third column not really sure but probably 1, 50, 4, 2...
Your question involves two problems, as I see it:
- Printing the highlighting: how to display it depends on the output target you're aiming for, be it STDOUT, a file, or some specific program.
- Identifying outliers based on the column data. It's hard to tell whether you want outliers relative to the entire column or relative to the preceding values in the column (a rolling outlier, i.e. the previous data is used to decide whether the next value is out of whack).
Below I provide a method that scores the data with a z-score based on the mean and standard deviation of each entire column. You will have to tweak the > and < thresholds to reach your desired state; there are many intricacies to this concept, and I would suggest taking a look at a few resources on the subject.
For your data (note that the last Jan-2002 value has been changed to 20000 to create an obvious outlier):
Jan-2001,Feb-2001,Jan-2002
100,30,10
110,25,1
40,5,50
70,11,4
120,35,20000
I am aware of methods to highlight cells, but not in the terminal. The styling approach at https://pandas.pydata.org/pandas-docs/stable/style.html works in a few environments (e.g. notebooks and HTML output).
To get at the original question, identifying the outliers in your data, you could use something like the code below, which flags values based on the standard deviation and z-score of each column.
Sample Code:
import pandas as pd

df = pd.read_csv("full.txt")
original = df.columns
print(df)
for col in original:
    col_std = col + "_std"
    col_zscore = col + "_zscore"
    # Column-wide standard deviation, plus the z-score of every value
    # computed against its own column.
    df[col_std] = df[col].std()
    df[col_zscore] = (df[col] - df[col].mean()) / df[col].std(ddof=0)
    # Print the values whose z-score crosses either threshold.
    print(df[col].loc[(df[col_zscore] > 1.5) | (df[col_zscore] < -0.5)])
print(df)
Output 1: # prints the original dataframe
   Jan-2001  Feb-2001  Jan-2002
0       100        30        10
1       110        25         1
2        40         5        50
3        70        11         4
4       120        35     20000
Output 2: # Identifies the outliers
2 40
3 70
Name: Jan-2001, dtype: int64
2 5
3 11
Name: Feb-2001, dtype: int64
0 10
1 1
3 4
4 20000
Name: Jan-2002, dtype: int64
Output 3: # Prints the full dataframe created, with the std and zscore of each item based on its column
   Jan-2001  Feb-2001  Jan-2002  Jan-2001_std  Jan-2001_zscore  Feb-2001_std  Feb-2001_zscore  Jan-2002_std  Jan-2002_zscore
0       100        30        10     32.710854         0.410152     12.735776         0.772524     20.755722        -0.183145
1       110        25         1     32.710854         0.751945     12.735776         0.333590     20.755722        -0.667942
2        40         5        50     32.710854        -1.640606     12.735776        -1.422147     20.755722         1.971507
3        70        11         4     32.710854        -0.615227     12.735776        -0.895426     20.755722        -0.506343
4       120        35         2     32.710854         1.093737     12.735776         1.211459     20.755722        -0.614076
Resources for zscore are here:
https://statistics.laerd.com/statistical-guides/standard-score-2.php
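As a rough sketch of the styling route mentioned above (recent pandas versions; rendering needs an HTML-capable environment such as a Jupyter notebook, and the thresholds follow the example):

import pandas as pd

df = pd.read_csv("full.txt")

def highlight_outliers(col, upper=1.5, lower=-0.5):
    # Color any cell whose column z-score crosses either threshold.
    z = (col - col.mean()) / col.std(ddof=0)
    return ['background-color: red' if (v > upper or v < lower) else ''
            for v in z]

styled = df.style.apply(highlight_outliers, axis=0)
styled.to_html("highlighted.html")  # or display `styled` directly in a notebook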

Excel - running % of running total in pivot table

I have a table like:
periodo quintil pos
201611 1 10
201611 2 20
201611 3 30
201611 4 40
201611 5 50
201612 1 9
201612 2 19
201612 3 29
201612 4 39
201612 5 49
I need to create a pivot table like:
periodo quintil running_pos running_%
201611
1 10 7%
2 30 20%
3 60 40%
4 100 67%
5 150 100%
201612
1 9 6%
2 28 19%
3 57 39%
4 96 66%
5 145 100%
Since the running total is not a new field but just a different way of showing an existing field (pos, "shown as" a running total over quintil), the problem arises when I try to build the running % of that running total.
How can I also add this field (the running % of the running total)?
(In the Spanish version of Excel nothing is named anything like a translation of "running totals"....)
To display what you want in a pivot table:
- Drag pos to the Values area three times
- For the first, use a plain Sum
- For the second, use "Show Values As > Running Total In"
- For the third, use "Show Values As > % Running Total In"
Here are the results with minimal formatting
Here are the value settings for the third column:
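Outside Excel, the same two fields are easy to compute; a minimal pandas sketch, re-creating the source table inline:

import pandas as pd

df = pd.DataFrame({
    'periodo': [201611]*5 + [201612]*5,
    'quintil': list(range(1, 6)) * 2,
    'pos':     [10, 20, 30, 40, 50, 9, 19, 29, 39, 49],
})

# Running total of pos within each periodo, and that running total as a
# share of the periodo's grand total.
df['running_pos'] = df.groupby('periodo')['pos'].cumsum()
df['running_%'] = df['running_pos'] / df.groupby('periodo')['pos'].transform('sum')
print(df)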

Find total no of links to and from node based on data in csv

I have a csv with the following info
Src Rx LinkId Weight
===================================
2 1 4000 10
2 1 4056 15
3 1 4100 10
3 1 4156 15
28 1 10650 8
113 2 15051 205
113 3 15058 205
1 4 3952 9
1 4 3951 5
1 4 3950 34
2 4 4052 9
47 4 18672 44
47 4 18670 38
69 4 4701 11
69 4 4700 21
70 4 4801 11
The linkId is unique. Each row represents the link between two devices. For example, source 2 and rx 1 means that a link goes from 2 to 1.
I intend to compute the total weight of all the links originating from each device and coming into each device like so:
Device  Out weight  In weight
=============================
2       34          205
1       48          58
and so on.
I would like to know whether this is possible in Excel and, if so, how.
Using a pivot table may be the best solution here: select the table, insert a pivot table, and put Src (or Rx) in the Rows area with Weight in the Values area to get each device's total.
Alternatively, you can add one column each for the in- and out-weights and use SUMIF, e.g. =SUMIF(Src, device, Weight) for the out-weight and =SUMIF(Rx, device, Weight) for the in-weight (where Src, Rx and Weight are the data ranges and device is the device number), and then use the totals at the bottom of each column.
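If the data ever needs to be processed outside Excel, a minimal pandas sketch of the same aggregation (the file name links.csv is an assumption; columns as in the question):

import pandas as pd

df = pd.read_csv('links.csv')  # columns: Src, Rx, LinkId, Weight

# Sum of weights leaving each device, and sum of weights entering it.
out_w = df.groupby('Src')['Weight'].sum().rename('Out weight')
in_w = df.groupby('Rx')['Weight'].sum().rename('In weight')

# Align both on the union of device ids; devices with no in- or
# out-links get 0.
totals = pd.concat([out_w, in_w], axis=1).fillna(0)
totals.index.name = 'Device'
print(totals)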
