Flatten data by count - Excel

I have some data like this:
# of jobs  count
---------  -----
1          2
2          3
3          1
4          1
They represent a multiset {1, 1, 2, 2, 2, 3, 4}. I want to compute statistics (median, mean, mode) on the multiset. Is there any way to generate this list from the 4x2 data above? That way I could use the built-in functions (MEDIAN/AVERAGE/MODE). Currently I'm using SUMPRODUCT to calculate the mean, and a combination of MAX/MATCH/INDEX to get the mode, but I can't figure out a way to calculate the median.
Note:
Of course the real data is much more than 4 rows, but the idea should be the same.
The first column is sorted integers, if that helps.
It’s OK to use some auxiliary cells to hold intermediate data.
It doesn’t have to be a formula; if pivot table is a better tool, please advise.

With access to CONCAT you could use:
=FILTERXML(CONCAT("<t><s>",REPT(A2:A5&"</s><s>",B2:B5),"</s></t>"),"//s[node()]")
This would return {1, 1, 2, 2, 2, 3, 4} and you could directly apply the other functions, e.g.:
=MEDIAN(FILTERXML.....) etc.
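To cross-check the spreadsheet result outside Excel, the same expansion can be sketched in Python. This is only an illustration, assuming the two columns have been copied into a list of (value, count) pairs; the names `rows` and `expand` are made up here and are not part of the answer above.

```python
from statistics import mean, median, mode

# (value, count) pairs copied from the sample table in the question
rows = [(1, 2), (2, 3), (3, 1), (4, 1)]

def expand(pairs):
    """Repeat each value `count` times, mirroring what REPT does in the formula."""
    return [value for value, count in pairs for _ in range(count)]

multiset = expand(rows)
print(multiset)          # [1, 1, 2, 2, 2, 3, 4]
print(median(multiset))  # 2
print(mode(multiset))    # 2
print(mean(multiset))    # 15/7, roughly 2.142857
```

The printed median, mode and mean should agree with what MEDIAN, MODE and AVERAGE return for the expanded list in Excel.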

Related

How can I apply a function to multiple columns of grouped rows and make a column of the output?

I have a large excel file containing stock data loaded from and sorted from an API. Below is a sample of dummy data that is applicable towards solving this problem:
I would like to create a function that scores stocks grouped by their Industry. For this sample I would like to give a stock a score of 5 if its ['Growth'] value is below its group's mean Growth (or quantile, or percentile) and a 10 if it is above the group's mean Growth. The scores of all values in a column should be returned in a list.
Current Unfinished Code:
import numpy as np
import pandas as pd
data = pd.read_excel('dummydata.xlsx')  # read_excel already returns a DataFrame
Desired input:
data['Growth'].apply(score) # Scores stock
Desired Output:
[5, 10, 10, 5, 10, 5]
If I can create a function for this sample then I will be able to make similar ones for different columns with slightly different conditions and aggregates (like percentile or quantile) that affect the score. I'd say the main problem here is accessing these grouped values and comparing them.
I don't think it's possible to convert from a Series to a list in the apply call. I may be wrong on that but if the desired output was changed slightly to
data['Growth'].apply(score).tolist()
then you can use a lambda function to do this.
score = lambda x: 5 if x < data['Growth'].mean() else 10
data['Growth'].apply(score).tolist() # outputs [5, 10, 10, 5, 10, 5]
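Note that the lambda above compares each value against the mean of the whole column, not against the mean of the stock's own industry. A minimal sketch of the grouped version follows, assuming the data has an 'Industry' column next to 'Growth'; the sample frame below is made up, since the original dummy data is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dummydata.xlsx; real column names may differ.
data = pd.DataFrame({
    "Industry": ["Tech", "Tech", "Tech", "Energy", "Energy", "Energy"],
    "Growth":   [1.0, 6.0, 8.0, 2.0, 3.0, 10.0],
})

# Mean Growth of each row's own industry, aligned with the original rows.
group_mean = data.groupby("Industry")["Growth"].transform("mean")

# 5 if below the group mean, otherwise 10.
scores = np.where(data["Growth"] < group_mean, 5, 10).tolist()
print(scores)  # [5, 10, 10, 5, 5, 10]
```

For a different aggregate, the string "mean" could be swapped for "median", or for a function such as `lambda s: s.quantile(0.75)`.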

How to get values of one column based on another column using specific match values

I have a CSV file with the columns [Voltage, Bus, Load, load_Values, transmission, transmission_Values]. Each column whose name ends in Values contains the numerical value for the corresponding name column. The CSV file looks like this:
Voltage     Bus  Load     load_Values  transmission     transmission_Values
Voltage(1)  2    load(1)  3            transmission(1)  2
Voltage(2)  2    load(2)  4            transmission(2)  3
Voltage(5)  3    load(3)  5            transmission(3)  5
I have to fetch the value of Bus based on transmission and load. For example:
To get the value of Bus, I first need to fetch the value of transmission(2), which is 3. Based on this value, I need to get the value of load(3), which is 5. Next, based on this value, I have to get the Bus value for Voltage(5), which is 3.
I tried to get the value of a single column based on its corresponding column value.
total=df[df['load']=='load(1)']['load_Values']
next_total= df[df['transmission']=='transmission['total']']['transmission_Values']
v_total= df[df['Voltage']=='Voltage(5)']['Voltage_Values']
How can I get all these values automatically? For example, if I have 1100 values in every column, how can I fetch the values for all 1100 rows?
This is what the dataset looks like:
To get the value of VRES_LD, which is a new column, I have to look in the I__ND_LD column for I__ND_LD(1) and take the corresponding value stored in I__ND_LD_Values, which is 10. Based on that value 10, I have to look in the I__BS_ND column for I__BS_ND(10), whose value in I__BS_ND_Values is 5.0. Based on this value, I have to find the value of V_BS(5), which is 0.986009. This value should then be stored in the new column VRES_LD. Please let me know if this makes it clear.
I generalized your solution so you can work with as many values as you want.
I changed the name "Load_Value" to "load_value_name" to avoid confusion since there is a variable named "load_value" in lowercase.
You can start with as many values as you want; in our example we start with "1":
start_values = [1]
load_value_name = [f"^I__ND_LD({n})" for n in start_values]
#Output: but you'll have more than one if needed
['^I__ND_LD(1)']
Then we fetch all the values:
load_values = df[df['I__ND_LD'].isin(load_value_name)]['I__ND_LD_Values'].values.astype(int)
#output: again, more if needed
array([10])
let's get the bus names:
bus_names = [f"^I__BS_ND({n})" for n in load_values]
bus_values = df[df['I__BS_ND'].isin(bus_names)]['I__BS_ND_Values'].values.astype(int)
#output
array([5])
And finally voltage:
voltage_names = [f"^V_BS({n})" for n in bus_values]
voltage_values = df[df['V_BS'].isin(voltage_names)]['V_BS_Values'].values
#output
array([0.98974069])
Notes:
Instead of rounding I downcast to int, and the .isin() method looks for all occurrences, so you can fetch all of the values.
If I understand correctly, you should be able to create key/value tables and use merge. The step to voltage is a little unclear, but the basic idea below should work, I think:
df = pd.DataFrame({'voltage': {0: 'Voltage(1)', 1: 'Voltage(2)', 2: 'Voltage(5)'},
'bus': {0: 2, 1: 2, 2: 3},
'load': {0: 'load(1)', 1: 'load(2)', 2: 'load(3)'},
'load_values': {0: 3, 1: 4, 2: 5},
'transmission': {0: 'transmission(1)',
1: 'transmission(2)',
2: 'transmission(3)'},
'transmission_values': {0: 2, 1: 3, 2: 5}})
load = df[['load', 'load_values']].copy()
trans = df[['transmission','transmission_values']].copy()
load['load'] = load['load'].str.extract(r'(\d+)').astype(int)  # \d+ also handles multi-digit indices
trans['transmission'] = trans['transmission'].str.extract(r'(\d+)').astype(int)
(df[['bus']].merge(trans, how='left', left_on='bus', right_on='transmission')
.merge(load, how='left', left_on='transmission_values', right_on='load'))
resulting in:
   bus  transmission  transmission_values  load  load_values
0    2             2                    3   3.0          5.0
1    2             2                    3   3.0          5.0
2    3             3                    5   NaN          NaN
I think you need to do 3 things.
1.You need to put a number inside a string. You do it like this:
n_cookies = 3
f"I want {n_cookies} cookies"
#Output
I want 3 cookies
2.Let's say the values you need to fetch are:
transmission_values = [2,5,20]
You then need to build the names to fetch:
load_values_to_fetch = [f"transmission({n})" for n in transmission_values]
#output
['transmission(2)', 'transmission(5)', 'transmission(20)']
3.Get all the voltage values from the df. Use .isin() method:
voltage_value= df[df['Voltage'].isin(load_values_to_fetch )]['Voltage_Values'].values
I hope I understood the problem correctly. Try and let us know because I can't try the code without data
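The chain of lookups described in the question (transmission -> load -> Voltage -> Bus) can also be sketched with plain dictionaries. This is only an illustration built on the three sample rows from the question; the helper names `index_of` and `bus_for_transmission` are made up here, and a missing link in the chain simply yields None.

```python
import re

# Sample rows from the question: (Voltage, Bus, Load, load_Values,
# transmission, transmission_Values)
rows = [
    ("Voltage(1)", 2, "load(1)", 3, "transmission(1)", 2),
    ("Voltage(2)", 2, "load(2)", 4, "transmission(2)", 3),
    ("Voltage(5)", 3, "load(3)", 5, "transmission(3)", 5),
]

def index_of(label):
    """Extract the integer inside the parentheses, e.g. 'load(3)' -> 3."""
    return int(re.search(r"\((\d+)\)", label).group(1))

# One lookup table per name/value column pair, keyed by the integer index.
bus_by_voltage = {index_of(r[0]): r[1] for r in rows}
value_by_load = {index_of(r[2]): r[3] for r in rows}
value_by_transmission = {index_of(r[4]): r[5] for r in rows}

def bus_for_transmission(n):
    """Follow transmission(n) -> load(...) -> Voltage(...) -> Bus."""
    load_index = value_by_transmission.get(n)
    voltage_index = value_by_load.get(load_index)
    return bus_by_voltage.get(voltage_index)

result = {n: bus_for_transmission(n) for n in value_by_transmission}
print(result)  # {1: None, 2: 3, 3: None}
```

Only transmission(2) resolves fully in the sample, matching the walk-through in the question (3, then load(3) = 5, then Voltage(5) with Bus 3); the other two chains point at Voltage(4) and load(5), which the three sample rows do not contain.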

How to use dataframe column in for loop

I am trying to implement a formula to create a new column in Dataframe using existing column but that column is a summation from 0 to a number present in some other column.
I was trying something like this:
dataset['B']=sum([1/i for i in range(dataset['A'])])
I know something like this would work
dataset['B']=sum([1/i for i in range(10)])
but I want to make this 10 dynamic based on some different column.
I keep on getting this error.
TypeError: 'Series' object cannot be interpreted as an integer
First of all, I should admit that I could not understand your question completely. What I understood is that you want to iterate over the rows of a DataFrame and create a new column by doing some operation on each value.
If that is so, then I would recommend the following link.
Regarding TypeError: 'Series' object cannot be interpreted as an integer:
range() takes integers as input, i.e. [i for i in range(10)] gives [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]. In your code the whole column dataset['A'] (a Series) is passed to range() instead of a single integer, which raises this error; a float value in the column would also be rejected by range(). Moreover, range() starts at zero, so 1/i would raise a ZeroDivisionError for the first value. As a result, you might have to rewrite the code as [1/i for i in range(1, row_value_of_dataset['A'])].
It would be highly appreciated if you could post an example of what your DataFrame looks like and what your desired output is. Then it would be easier to post a solution.
BTW forgot to post what I understood from your question:
# assume the data:
>>> import pandas as pd
>>> data = pd.DataFrame({'A': (1, 2, 3, 4)})
# the data
>>> data
   A
0  1
1  2
2  3
3  4
# doing the operation on each of the rows
>>> data['B'] = data.apply(lambda row: sum([1/i for i in range(1, row.A)]), axis=1)
# column B is the newly added data
>>> data
   A         B
0  1  0.000000
1  2  1.000000
2  3  1.500000
3  4  1.833333
Perhaps explicitly use cumsum, or even apply?
Anyway, you are trying to move an array/list item directly into a dataframe, which seems to view this as a dictionary. Try something like this (I've not tested it):
array_x = [(x, 1/x) for x in dataset['A'].tolist()]
df = pd.DataFrame(data=np.asarray(array_x))
df.columns = ['A', 'B']
Here the idea is to break the Series apart into a list and feed the list into a dataframe. The same thing can be done without going Series -> list -> dataframe, which is not very efficient.
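The cumulative-sum idea mentioned above can be sketched as follows: compute the running sums of 1/i once, then look up the entry for each value of column A. This is a plain-Python illustration rather than tested pandas code; `partial_harmonic_sums` and `column_a` are names introduced here, and the summation range matches `range(1, A)` from the earlier answer.

```python
from itertools import accumulate

def partial_harmonic_sums(max_n):
    """Return a table where table[n] == sum(1/i for i in range(1, n))."""
    # Two leading zeros so that table[0] and table[1] are both empty sums.
    terms = [0.0, 0.0] + [1 / i for i in range(1, max_n)]
    return list(accumulate(terms))

column_a = [1, 2, 3, 4]                 # stand-in for dataset['A']
table = partial_harmonic_sums(max(column_a))
column_b = [table[a] for a in column_a]
print(column_b)
```

With a DataFrame, the same lookup could be applied per row, for example via `dataset['A'].map(lambda a: table[a])`, which avoids recomputing the sum from scratch for every row.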

Sum values based on first occurrence of other column using excel formula

Let's say I have the following two columns in excel spreadsheet
A B
1 10
1 10
1 10
2 20
3 5
3 5
and I would like to sum the values from B-column that represents the first occurrence of the value in A-column using a formula. So I expect to get the following result:
result = B1+B4+B5 = 35
i.e., sum column B once for each unique value in column A. In my case if Ai = Aj, then Bi = Bj, where i, j are row positions. That is, if two rows have the same value in column A, then their corresponding values in column B are the same. I can have the data sorted by column A, but I would prefer a formula that works regardless of sorting.
I found this post that refers to the same problem, but I am not able to understand the proposed solution.
Use SUMPRODUCT and COUNTIF:
=SUMPRODUCT(B1:B6/COUNTIF(A1:A6,A1:A6))
Here is the step-by-step explanation:
COUNTIF(A1:A6, A1:A6) produces an array with the frequency of each value in A1:A6. In our case it is: {3, 3, 3, 1, 2, 2}
Then we do the following division: {10, 10, 10, 20, 5, 5}/{3, 3, 3, 1, 2, 2}. The result is: {3.33, 3.33, 3.33, 20, 2.5, 2.5}. Each value is divided by the size of its group, so the entries of a group add up to a single copy of that group's value.
Summing the result we get: (3.33+3.33+3.33) + 20 + (2.5+2.5) = 10 + 20 + 5 = 35.
With this trick we get the same result as if we had summed only the first element of each group from column A.
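The arithmetic of the COUNTIF division can be checked with a short Python sketch. It is only a cross-check of the reasoning, using the six sample rows from the question; `col_a`, `col_b` and the other names are introduced here for illustration.

```python
from collections import Counter

col_a = [1, 1, 1, 2, 3, 3]
col_b = [10, 10, 10, 20, 5, 5]

# Equivalent of COUNTIF(A1:A6, A1:A6): how often each key occurs.
counts = Counter(col_a)

# Equivalent of SUMPRODUCT(B1:B6 / COUNTIF(...)).
share_sum = sum(b / counts[a] for a, b in zip(col_a, col_b))

# Direct approach: add B only at the first occurrence of each A value.
seen = set()
first_sum = 0
for a, b in zip(col_a, col_b):
    if a not in seen:
        seen.add(a)
        first_sum += b

print(share_sum, first_sum)  # both are 35, up to floating-point rounding
```

The two sums agree only because equal keys in column A always carry equal values in column B, as stated in the question.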
To make this dynamic, so it grows and shrinks with the data set use this:
=SUMPRODUCT($B$1:INDEX(B:B,MATCH(1E+99,B:B))/COUNTIF($A$1:INDEX(A:A,MATCH(1E+99,B:B)),$A$1:INDEX(A:A,MATCH(1E+99,B:B))))
... or just SUMPRODUCT, assuming the data is sorted by column A and sits in rows 2 to 7 below a header row:
=SUMPRODUCT(B2:B7, --(A2:A7<>A1:A6))

Optimal distance with dynamic programming

Okay, so honestly this is a homework question, but I really did my best to find the solution, and I think I partially did.
The question:
We are given a series of cities whose positions are given by a single coordinate, and we are supposed to place a given number of hospitals in cities so that the sum of each city's distance to its nearest hospital is minimal.
That is, if we are given the cities at 1, 3, 5, 7, 9, 11, 13 and we are going to place 3 hospitals, the hospitals will be at 3, 7, 11 (actually there could be multiple best solutions for this one; I did not check).
We are advised to use dynamic programming and to first check the case in which we place only one hospital.
I've figured out how to find each subsequent hospital's location. I create a table of cities against cities. In each cell I put the smaller of two values: the distance from the current row's city to the closest hospital already built, or its distance to the city of the corresponding column.
For example, if we have already placed a hospital at 1, it would look like:
*  |  1 |  3 |  5 |  7 |  9 | 11 | 13 |
1  |  0 |  0 |  0 |  0 |  0 |  0 |  0 |
3  |  2 |  0 |  2 |  2 |  2 |  2 |  2 |
5  |  4 |  2 |  0 |  2 |  4 |  4 |  4 |
...
then sum the columns and find the next hospital.
The problem is, I cannot figure out the first hospital that I'm supposed to build!!
When I manually add one of the elements of the actual solution, I get the right answer, so my partial solution should be correct.
BTW the complexity should be O(CityNum^2); hospitalNum is a constant. So I can't use brute force.
An example input and output (from the homework assg):
Input:
10 5 (10 is city num, 5 is hospital num)
1 2 3 6 7 9 11 22 44 50 (coordinates)
Output:
9 (sum of minimum distances)
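One common way to attack this kind of problem, not necessarily the one the assignment intends, is to sort the cities, note that each hospital then serves a contiguous block of cities, and that a single hospital serving a block is best placed at the block's median city. A minimal sketch under those assumptions follows; the function name `min_total_distance` and its helper are made up here. With a constant number of hospitals the loops run in O(CityNum^2).

```python
def min_total_distance(cities, hospitals):
    """Minimal sum of distances from each city to its nearest hospital."""
    xs = sorted(cities)
    n = len(xs)

    # prefix[k] = xs[0] + ... + xs[k-1], used for O(1) block costs.
    prefix = [0]
    for x in xs:
        prefix.append(prefix[-1] + x)

    def block_cost(i, j):
        """Cost of serving xs[i..j] (inclusive) from the median city."""
        m = (i + j) // 2
        left = xs[m] * (m - i) - (prefix[m] - prefix[i])
        right = (prefix[j + 1] - prefix[m + 1]) - xs[m] * (j - m)
        return left + right

    inf = float("inf")
    # best[j] = minimal cost of serving the first j cities with the
    # hospitals placed so far (initially none, so only j == 0 is feasible).
    best = [0] + [inf] * n
    for _ in range(hospitals):
        new = [0] + [inf] * n
        for j in range(1, n + 1):
            # The last hospital serves the block xs[i..j-1].
            new[j] = min(best[i] + block_cost(i, j - 1) for i in range(j))
        best = new
    return best[n]

print(min_total_distance([1, 2, 3, 6, 7, 9, 11, 22, 44, 50], 5))  # 9
print(min_total_distance([1, 3, 5, 7, 9, 11, 13], 3))             # 8
```

The first call reproduces the expected output 9 from the sample input. The sketch returns only the minimal sum; recovering the hospital positions would need an extra table recording the best split point for each entry.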
