How do I replicate rows a specific number of times according to a condition? - python-3.x

I'm trying to create a dataframe for a game simulation to calculate how many points each player would make according to a set of parameters.
I have this dataframe:
  PLAYER TYPE  Quantity in my base  STRENGTH  POWER  Number of Matches (min)  Number of Matches (max)
0 A            2                    15        200    3                        5
1 B            3                    80        20     0                        2
df
Each row in this df represents one type of player. The column "Quantity in my base" holds the number of times each type of player appears in my base, and the "Number of Matches" columns hold the minimum and maximum number of matches each type of player is expected to play in one day.
I need to replicate the rows for each type of player, with their respective "Strength" and "Power", a number of times equal to "Quantity in my base" times a random number between the min and max number of matches for that type. I'm doing this so that, in the new data frame, each row represents one match for each specific player in my base.
For instance, if
PLAYERTYPE Quantity_in_my_base Rand_Num_Matches Number_of_rows
0 A1 2 4 8
1 A2 3 3 9
Number of rows to be replicated
Then I want to create a second df like this:
PLAYERTYPE STRENGTH POWER
0 A 15 200
1 A 15 200
2 A 15 200
3 A 15 200
4 A 15 200
5 A 15 200
6 A 15 200
7 A 15 200
8 A 15 200
9 A 15 200
10 A 15 200
11 A 15 200
12 A 15 200
13 A 15 200
14 A 15 200
15 A 15 200
16 A 15 200
New df
But I want to do this for players A1, A2 and B1, B2, B3 and so on, so that each one is replicated according to their respective random number.
Thank you so much!

You could use .repeat():
# assumes df has a per-row repeat count in a column such as 'Number_of_rows'
repeat_df = df.loc[df.index.repeat(df['Number_of_rows'])]
repeat_df[['PLAYERTYPE', 'STRENGTH', 'POWER']]
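A fuller sketch of the same .repeat() idea, starting from the question's original columns. It assumes each individual player draws their own random number of matches (the question can also be read as one draw per player type); the names `players`, `matches`, `Rand_Num_Matches` and the fixed seed are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "PLAYER TYPE": ["A", "B"],
    "Quantity in my base": [2, 3],
    "STRENGTH": [15, 80],
    "POWER": [200, 20],
    "Number of Matches (min)": [3, 0],
    "Number of Matches (max)": [5, 2],
})

rng = np.random.default_rng(seed=0)  # seed fixed only to make the run repeatable

# Step 1: one row per individual player (A1, A2, B1, B2, B3).
players = df.loc[df.index.repeat(df["Quantity in my base"])].reset_index(drop=True)

# Step 2: draw a random number of matches for each player, min and max inclusive.
players["Rand_Num_Matches"] = rng.integers(
    players["Number of Matches (min)"].to_numpy(),
    players["Number of Matches (max)"].to_numpy(),
    endpoint=True,
)

# Step 3: one row per match played by each player.
matches = players.loc[
    players.index.repeat(players["Rand_Num_Matches"]),
    ["PLAYER TYPE", "STRENGTH", "POWER"],
].reset_index(drop=True)

print(matches)
```

A player who draws 0 matches simply contributes no rows to `matches`.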

Related

I have a table with a numeric column AGE. I want to group consecutive rows that have the same number and count the groups.

AGE  CARD  SCORE
10   1     20000
10   1     3000
25   0     2000
10   1     20000
18   1     3000
10   0     2000
12   1     20000
10   1     3000
10   0     2000
I want to count Age 10 as 4.
The first two rows (a group) should count as 1, each 10 that appears on its own in a separate row counts individually, and the last two rows (another group of age 10) should count as 1.
Assuming that data is in a table named "Table1":
=COUNT(1/FREQUENCY(IF(Table1[AGE]=10,ROW(Table1[AGE])),IF(Table1[AGE]<>10,ROW(Table1[AGE]))))
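For readers who want to check the expected count outside Excel, the same "count runs of consecutive 10s" logic can be sketched in pandas. This is only an illustration of the counting rule, not a translation of the formula; the variable names are made up for the example.

```python
import pandas as pd

age = pd.Series([10, 10, 25, 10, 18, 10, 12, 10, 10], name="AGE")

# A run of 10s starts wherever the value is 10 and the previous row is not 10
# (the first row has no previous value, so it counts as a start if it is 10).
run_starts = age.eq(10) & age.shift().ne(10)

count = int(run_starts.sum())
print(count)  # 4
```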

Returning total sum of value for each month. VBA

I need to produce a total value for each month of the year from a generated report. Data is split into columns, one with a value and the other with a date.
I need to return a total for each month.
Data is output as such:
100 21/01/2019
200 21/06/2019
150 01/01/2019
300 14/09/2019
8 08/05/2019
I need it to return as
1 2 3 4 5 6 7 8 9 10 11 12
250 0 0 0 8 200 0 0 300 0 0 0
With a further column for the following year. The original data and dates can be removed as this can be reproduced when running the next report.
You could try the below:
Add a helper column next to your date to get the month of the date:
=MONTH(B3)
and use:
=SUMPRODUCT(($C$3:$C$7=F2)*($A$3:$A$7))
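The helper-column-plus-SUMPRODUCT approach amounts to grouping by month and summing. A minimal pandas sketch of that logic on the question's sample data follows; the column names `value` and `date` are assumptions, and it covers a single year only.

```python
import pandas as pd

report = pd.DataFrame({
    "value": [100, 200, 150, 300, 8],
    "date": ["21/01/2019", "21/06/2019", "01/01/2019", "14/09/2019", "08/05/2019"],
})
report["date"] = pd.to_datetime(report["date"], format="%d/%m/%Y")

# Sum values per month number, then fill in the months that have no data with 0.
monthly = (
    report.groupby(report["date"].dt.month)["value"]
    .sum()
    .reindex(range(1, 13), fill_value=0)
)

print(monthly.tolist())  # [250, 0, 0, 0, 8, 200, 0, 0, 300, 0, 0, 0]
```

To keep a separate column per year, the grouping key would also need to include `report["date"].dt.year`.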

Mark sudden changes in prices in a dataframe time series and color them

I have a Pandas dataframe of prices for different months and years (timeseries), 80 columns. I want to be able to detect significant changes in prices either up or down and color them differently in a dataframe. Is that possible and what would be the best approach?
Jan-2001 Feb-2001 Jan-2002 Feb-2002 ....
100 30 10 ...
110 25 1 ...
40 5 50
70 11 4
120 35 2
Here in the first column 40 and 70 should be marked, in the second column 5 and 11 should be marked, in the third column not really sure but probably 1, 50, 4, 2...
Your question involves two problems, as far as I can see:
1. Printing the highlighting depends on the output method you're trying to get to, be it STDOUT, a file, or something program-specific.
2. Identifying outliers based on the column data. It's hard to tell whether you want this based on the entire dataset or on the previous data in the column, like a rolling outlier, i.e. the preceding data is used to decide whether the next value is out of whack.
In the instance below I provide a method that goes at the data with std dev/z-scoring based on the mean of the entire column. You will have to tweak the > and < thresholds to get to your desired state. There are many intricacies to this concept, and I would suggest taking a look at a few resources on the subject.
For your data:
Jan-2001,Feb-2001,Jan-2002
100,30,10
110,25,1
40,5,50
70,11,4
120,35,20000
I am aware of methods to highlight, but not in the terminal. The pandas styling method (https://pandas.pydata.org/pandas-docs/stable/style.html) works in a few programs.
To get at the original item, identification of outliers in your data, you could use something like below to identify based on standard deviation and zscore.
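Sample Code:
import pandas as pd

df = pd.read_csv("full.txt")
original = df.columns  # keep the price columns before adding z-score columns
print(df)
for col in original:
    col_zscore = col + "_zscore"
    # z-score of each value relative to its column's mean (population std, ddof=0)
    df[col_zscore] = (df[col] - df[col].mean()) / df[col].std(ddof=0)
    # values more than 1.5 std devs above or 0.5 std devs below the column mean
    print(df[col].loc[(df[col_zscore] > 1.5) | (df[col_zscore] < -.5)])
print(df)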
Output 1: # prints the original dataframe
Jan-2001 Feb-2001 Jan-2002
100 30 10
110 25 1
40 5 50
70 11 4
120 35 20000
Output 2: # Identifies the outliers
2 40
3 70
Name: Jan-2001, dtype: int64
2 5
3 11
Name: Feb-2001, dtype: int64
0 10
1 1
3 4
4 20000
Name: Jan-2002, dtype: int64
Output 3: # Prints the full dataframe created, with zscore of each item based on the column
Jan-2001 Feb-2001 Jan-2002 Jan-2001_std Jan-2001_zscore \
0 100 30 10 32.710854 0.410152
1 110 25 1 32.710854 0.751945
2 40 5 50 32.710854 -1.640606
3 70 11 4 32.710854 -0.615227
4 120 35 2 32.710854 1.093737
Feb-2001_std Feb-2001_zscore Jan-2002_std Jan-2002_zscore
0 12.735776 0.772524 20.755722 -0.183145
1 12.735776 0.333590 20.755722 -0.667942
2 12.735776 -1.422147 20.755722 1.971507
3 12.735776 -0.895426 20.755722 -0.506343
4 12.735776 1.211459 20.755722 -0.614076
Resources for zscore are here:
https://statistics.laerd.com/statistical-guides/standard-score-2.php
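To connect the z-score idea back to the coloring part of the question, here is a hedged sketch of a column-wise function that returns CSS strings instead of printing. The function name `highlight_outliers`, the colors, and the reuse of the 1.5 / -0.5 thresholds are illustrative assumptions, not a recommended configuration.

```python
import pandas as pd

prices = pd.DataFrame({
    "Jan-2001": [100, 110, 40, 70, 120],
    "Feb-2001": [30, 25, 5, 11, 35],
    "Jan-2002": [10, 1, 50, 4, 20000],
})

UP = "background-color: lightgreen"
DOWN = "background-color: salmon"

def highlight_outliers(col, upper=1.5, lower=-0.5):
    """Return one CSS string per cell, based on the column's z-scores."""
    z = (col - col.mean()) / col.std(ddof=0)
    styles = pd.Series("", index=col.index)
    styles[z > upper] = UP    # unusually high within this column
    styles[z < lower] = DOWN  # unusually low within this column
    return styles

# Plain DataFrame of CSS strings, handy for checking the result in a terminal.
css = prices.apply(highlight_outliers)
print(css)
```

In a notebook or when exporting to HTML, the same function can be handed to the pandas Styler with `prices.style.apply(highlight_outliers, axis=0)`, which colors the flagged cells.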

AGGREGAT with critiera and duplicates in array

I have the following Excel spreadsheet:
A B C D E
1 ProdID Price Unique ProdID 1. Biggest 2. Biggest
2 2606639 40 2606639 50 50
3 2606639 50 4633523 45 35
4 2606639 20 3911436 25 25
5 2606639 50
6 4633523 45
7 4633523 20
8 4633523 35
9 3911436 20
10 3911436 25
11 3911436 25
12 3911436 15
In Cells D2:E4 I want to show the 1. biggest and 2. biggest price of each ProdID in Column A. Therefore, I use the following formula:
D2 =AGGREGAT(14,6,$B$2:$B$12/($A$2:$A$12=$C2),1)
E2 =AGGREGAT(14,6,$B$2:$B$12/($A$2:$A$12=$C2),2)
This formula works as long as the prices are unique in Column B, as you can see for the second ProdID (4633523).
However, once a price is not unique in Column B (for example 50 for ProdID 2606639 and 25 for ProdID 3911436), the functions in Cells D2:E4 do not show the right results.
Do you have an idea how to solve this issue with the AGGREGAT formula and without using an array formula?
You could count the number of occurrences of the first ProdID-price combination and use that in the last argument of the AGGREGAT function. So instead of
=AGGREGAT(14,6,$B$2:$B$12/($A$2:$A$12=$C2),2)
you would have
=AGGREGAT(14,6,$B$2:$B$12/($A$2:$A$12=$C2),2+COUNTIFS(A:A,C2,B:B,D2)-1)
Of course you can just write "1+COUNTIFS...", but I put it this way to make it clearer that it uses position 2 plus the number of occurrences of the ProdID/biggest-price combination after the first occurrence.
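As a cross-check of what the adjusted formula should return, the "second biggest price, ignoring repeats of the biggest" rule can be sketched in pandas on the question's data. The names `prices` and `top_two` are made up for this illustration.

```python
import pandas as pd

prices = pd.DataFrame({
    "ProdID": [2606639] * 4 + [4633523] * 3 + [3911436] * 4,
    "Price": [40, 50, 20, 50, 45, 20, 35, 20, 25, 25, 15],
})

# For each ProdID, take the two largest distinct prices.
top_two = {
    int(prod_id): [int(p) for p in sorted(group.unique(), reverse=True)[:2]]
    for prod_id, group in prices.groupby("ProdID")["Price"]
}

print(top_two)
```

For ProdID 2606639 this gives 50 and 40 rather than 50 and 50, which is the result the COUNTIFS offset is meant to produce.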

Using Excel to allocate values based off their rank while remaining within constraints

I am trying to create a resource calculator that can tell me how many people I need to put on each section depending on the current work waiting and work coming in, prioritizing sections which have the most work waiting first.
Upper Limit Allocation Prod Ranking
12 [to calc] 28% 1
15 18% 2
5 17% 3
4 8% 4
2 6% 5
3 .2% 6
4 .2% 6
Similar to the other question, I have a constraint that I only have so much to allocate. For this example we will use 38 as the amount that is to be allocated.
I have used the formula from the other answer:
=MIN(A2,$E$1-SUMIF($D$2:$D$8,"<"&D2,$B$2:$B$8))
Where E1 contains the total to be allocated.
I have two issues with this formula:
1) I require a minimum of at least 1 person in each of these sections.
I have tried using a MAX function to simply set this value; however, this leads to the allocated resources going over the total amount.
What equation would I need to use to account for the total available to allocate, the minimum requirement for each fund, and the maximum limit for each fund?
2) It only returns whole integers. Would there be a way to retrieve more precise results, maybe by changing it to a % distribution?
UL Alloc Rank Capacity Lower Limit
2 1 15 93 1
3 1 15
4 1 15
6 6 8
1 1 15
2 1 15
4 4 9
2 2 7
4 4 4
15 15 2
12 12 10
12 12 1
1 1 11
13 13 5
6 6 6
5 1 15
5 5 3
1 1 14
2 2 13
3 3 12
3 1 15
Reference: Using the Excel's Rank() function to calculate allocations based on ranking and constraints
Simply subtract the 100 on all sides and add them separately:
=MIN(A2-100,($E$1-100*COUNTA($A$2:$A$8))-(SUMIF($D$2:$D$8,"<"&D2,$B$2:$B$8)-COUNTIF($D$2:$D$8,"<"&D2)*100))+100
What is returned depends on your entries in Column A and in E1. You can change Column A based on a percentage distribution and the formula will return the corresponding values.
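The idea behind the formula (reserve the minimum for every section first, then hand out what is left in rank order up to each upper limit) can be sketched in Python. This is an illustration under assumptions, not a translation of the Excel formula: the function name `allocate` is hypothetical, the lower limit is taken as 1 from the question, and tied ranks are served in list order, whereas the SUMIF version treats tied rows identically.

```python
def allocate(upper_limits, ranks, total, lower=1):
    """Give every section `lower`, then distribute the rest by rank."""
    n = len(upper_limits)
    pool = total - lower * n  # what remains after reserving the minimums
    if pool < 0:
        raise ValueError("total is too small to give every section the minimum")

    allocation = [lower] * n
    # Visit sections from best rank (1) to worst; sorted() keeps ties in list order.
    for i in sorted(range(n), key=lambda i: ranks[i]):
        extra = max(min(upper_limits[i] - lower, pool), 0)
        allocation[i] += extra
        pool -= extra
    return allocation


upper_limits = [12, 15, 5, 4, 2, 3, 4]
ranks = [1, 2, 3, 4, 5, 6, 6]
result = allocate(upper_limits, ranks, total=38, lower=1)
print(result)       # [12, 15, 5, 3, 1, 1, 1]
print(sum(result))  # 38
```

Every section ends up with at least the minimum, no section exceeds its upper limit, and the total never goes over the amount available.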
Edit:
If you put your lower threshold in F2 and your constraint in E2, you can use this formula:
=MIN(A2-$F$2,($E$2-$F$2*COUNTA($A$2:$A$8))-(SUMIF($D$2:$D$8,"<"&D2,$B$2:$B$8)-COUNTIF($D$2:$D$8,"<"&D2)*$F$2))+$F$2
