How can I apply a function to multiple columns of grouped rows and make a column of the output? - python-3.x

I have a large Excel file containing stock data loaded from an API and sorted. Below is a sample of dummy data that is applicable to this problem:
I would like to create a function that scores stocks grouped by their Industry. For this sample, a stock should score 5 if its 'Growth' value is less than its group's mean Growth and 10 if it is above the group's mean Growth. The scores for all values in the column should be returned as a list.
Current Unfinished Code:
import numpy as np
import pandas as pd
data = pd.read_excel('dummydata.xlsx')  # read_excel already returns a DataFrame
Desired input:
data['Growth'].apply(score) # Scores stock
Desired Output:
[5, 10, 10, 5, 10, 5]
If I can create a function for this sample then I will be able to make similar ones for different columns with slightly different conditions and aggregates (like percentile or quantile) that affect the score. I'd say the main problem here is accessing these grouped values and comparing them.

I don't think it's possible to convert from a Series to a list in the apply call. I may be wrong on that but if the desired output was changed slightly to
data['Growth'].apply(score).tolist()
then you can use a lambda function to do this.
score = lambda x: 5 if x < data['Growth'].mean() else 10
data['Growth'].apply(score).tolist() # outputs [5, 10, 10, 5, 10, 5]
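The lambda above compares each value against the overall mean. If the score should instead be relative to each Industry's own mean, a grouped version might look like the sketch below; the 'Industry' column name and the sample values are assumptions, since the dummy data is not shown.
import numpy as np
import pandas as pd
# Stand-in for the Excel data; replace with pd.read_excel('dummydata.xlsx')
data = pd.DataFrame({
    'Industry': ['Tech', 'Tech', 'Energy', 'Energy', 'Retail', 'Retail'],
    'Growth':   [0.05, 0.20, 0.30, 0.10, 0.25, 0.02],
})
# Mean Growth of each row's own Industry, broadcast back to the rows
group_mean = data.groupby('Industry')['Growth'].transform('mean')
# 5 if below the group's mean, 10 otherwise
scores = np.where(data['Growth'] < group_mean, 5, 10).tolist()
print(scores)
The same pattern covers the other conditions mentioned in the question: swap 'mean' for transform(lambda s: s.quantile(0.5)) or any other per-group aggregate.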

Related

Python average exclude some rows

When I use .mean() in Python, it calculates the mean over all rows.
What if I want to calculate the mean excluding the first few rows of the data?
What I understand from your question is that you have a 2D numpy array and want to calculate the mean of the rows except the first few.
You can use slicing and assign the result to another variable:
b = a[n:]
Here the new array runs from the nth element to the last, including index n.
Example:
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = a[1:]
print(a.mean(), b.mean())
Output:
2.5 3.5

Flatten data by count

I have some data like this:
# of jobs   count
---------   -----
        1       2
        2       3
        3       1
        4       1
They represent a multiset {1, 1, 2, 2, 2, 3, 4}. I want to compute statistics (median, mean, mode) on the multiset. Is there any way to generate this list from the 4x2 data above? That way I can use the built-in functions (MEDIAN/AVERAGE/MODE). Currently I'm using SUMPRODUCT to calculate the mean, and a combination of MAX/MATCH/INDEX to get the mode, but I can't figure out a way to calculate the median.
Note:
Of course the real data is much more than 4 rows, but the idea should be the same.
The first column is sorted integers, if that helps.
It’s OK to use some auxiliary cells to hold intermediate data.
It doesn’t have to be a formula; if pivot table is a better tool, please advise.
With access to CONCAT you could use:
=FILTERXML(CONCAT("<t><s>",REPT(A2:A5&"</s><s>",B2:B5),"</s></t>"),"//s[node()]")
This would return {1, 1, 2, 2, 2, 3, 4} and you could directly apply the other functions, e.g.:
=MEDIAN(FILTERXML.....) etc.
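If the data is pulled into Python instead of staying in Excel, the same flattening can be sketched with NumPy's repeat; the two arrays below are just the example columns typed out by hand.
import numpy as np
import pandas as pd
jobs = np.array([1, 2, 3, 4])    # '# of jobs' column
count = np.array([2, 3, 1, 1])   # 'count' column
# Expand the counts back into the underlying multiset {1, 1, 2, 2, 2, 3, 4}
multiset = np.repeat(jobs, count)
print(np.median(multiset), multiset.mean(), pd.Series(multiset).mode()[0])
# 2.0 2.142857142857143 2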

How to assign values from a list to a pandas dataframe and control the distribution/frequency each list element has in the dataframe

I am building a dataframe and need to assign values from a defined list to a new column in the dataframe. I have found an answer which gives a method to assign elements from a list randomly to a new column in a dataframe here (How to assign random values from a list to a column in a pandas dataframe?).
But I want to be able to control the distribution of the elements in my list within the new dataframe by either assigning a frequency of occurrences or some other method to control how many times each list element appears in the dataframe.
For example, if I have a list my_list = [50, 40, 30, 20, 10], how can I specify that for a dataframe (df) with n rows, 50 is assigned to 10% of the rows, 40 to 20%, 30 to 30%, 20 to 35%, and 10 to 5% of the rows?
Any other method to control for the distribution of list elements is welcome, the above is a simple explanation to illustrate how one way to be able to control frequency may look.
You can use the choice function from numpy.random, providing a probability distribution.
>>> a = np.random.choice([50, 40, 30, 20, 10], size=100, p=[0.1, 0.2, 0.3, 0.35, 0.05])
>>> pd.Series(a).value_counts().sort_index(ascending=False)
50     9
40    25
30    19
20    38
10     9
dtype: int64
Just put the desired size into the size parameter (the dataframe's length).
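To attach the result directly to a new dataframe column, the same call can take its size from the frame; df and the column name 'value' below are placeholder names for illustration.
import numpy as np
import pandas as pd
my_list = [50, 40, 30, 20, 10]
probs = [0.10, 0.20, 0.30, 0.35, 0.05]
df = pd.DataFrame({'id': range(1000)})  # placeholder frame with n rows
# Draw one value per row according to the requested frequencies
df['value'] = np.random.choice(my_list, size=len(df), p=probs)
print(df['value'].value_counts(normalize=True).sort_index(ascending=False))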

Using numpy present value function

I am trying to compute the present value using numpy's pv function in pandas dataframe.
I also have two lists: one contains the periods [6, 18, 24] and the other contains the pmt
values [100, 200, 300].
The present value should be computed for each value in the pmt list against each value in the period list.
Let's say that in the table below the columns represent period and the rows represent pmt.
I am trying to compute these values with a single line of code instead of writing multiple lines. How can I do that?
Currently I have hard-coded the period as follows.
PRESENT_VALUE6 = np.pv(pmt=-PMT_REMAINING_PERIOD,rate=(INTEREST_RATE/12),nper=6,fv=0,when=0)
PRESENT_VALUE18 = np.pv(pmt=-PMT_REMAINING_PERIOD,rate=(INTEREST_RATE/12),nper=18,fv=0,when=0)
PRESENT_VALUE30 = np.pv(pmt=-PMT_REMAINING_PERIOD,rate=(INTEREST_RATE/12),nper=30,fv=0,when=0)
I want Python to iterate nper over the list; currently, when I do that, it produces the following instead of the expected result.
The expected result is:
I don't know what interest rate you used in your example; I set it to 10% below:
from itertools import product
import numpy as np
import pandas as pd
INTEREST_RATE = 0.1
# Build a Cartesian product between PMT and Period
pmt = [100, 200, 300]
period = [6, 18, 24]
df = pd.DataFrame(product(pmt, period), columns=['PMT', 'Period'])
# Calculate the PV
df['PV'] = np.pv(INTEREST_RATE / 12, nper=df['Period'], pmt=-df['PMT'])
# Final pivot
df.pivot(index='PMT', columns='Period')
Result:
                 PV
Period            6            18           24
PMT
100      582.881717   1665.082618  2167.085483
200     1165.763434   3330.165236  4334.170967
300     1748.645151   4995.247853  6501.256450
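One caveat: np.pv and the other financial functions were deprecated in NumPy 1.18 and removed in later releases, so on a recent NumPy the same calculation would go through the numpy-financial package. A minimal sketch, assuming numpy-financial is installed:
from itertools import product
import numpy_financial as npf
import pandas as pd
INTEREST_RATE = 0.1
df = pd.DataFrame(product([100, 200, 300], [6, 18, 24]), columns=['PMT', 'Period'])
# Same vectorised PV calculation, via the maintained package instead of np.pv
df['PV'] = npf.pv(INTEREST_RATE / 12, nper=df['Period'], pmt=-df['PMT'])
print(df.pivot(index='PMT', columns='Period', values='PV'))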

How to use dataframe column in for loop

I am trying to implement a formula to create a new column in a DataFrame using an existing column, where the new value is a summation from 0 up to the number held in another column.
I was trying something like this:
dataset['B']=sum([1/i for i in range(dataset['A'])])
I know something like this would work
dataset['B']=sum([1/i for i in range(10)])
but I want to make this 10 dynamic, based on another column.
I keep getting this error:
TypeError: 'Series' object cannot be interpreted as an integer
First of all, I should admit that I could not understand your question completely. However, what I understood is that you want to iterate over the rows of a DataFrame and make a new column by doing some operation on each value.
If that is so, then I would recommend the following link.
Regarding TypeError: 'Series' object cannot be interpreted as an integer:
range() takes an integer as input, i.e. [i for i in range(10)] gives you [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]. In your code you are passing the whole dataset['A'] Series to range(), which is why Python complains that a Series cannot be interpreted as an integer. Moreover, notice that range() starts at zero, so 1/i would raise a division-by-zero error anyway. You would therefore have to rewrite the expression per row as [1/i for i in range(1, row_value_of_dataset['A'])].
It would be highly appreciated if you could give an example of what your DataFrame looks like and what your desired output is; then it would be easier to post a solution.
By the way, here is what I understood from your question:
#assume the data:
>>>import pandas as pd
>>>data = pd.DataFrame({'A': (1, 2, 3, 4)})
#the data
>>>data
   A
0  1
1  2
2  3
3  4
#doing operation on each of the rows
>>>data['B']=data.apply(lambda row: sum([1/i for i in range(1, row.A)] ), axis=1)
# Column B is the newly added data
>>>data
   A         B
0  1  0.000000
1  2  1.000000
2  3  1.500000
3  4  1.833333
Perhaps explicitly use cumsum, or even apply?
Anyway, one approach is to build the values as a list first and then move that list into a dataframe. Try something like this; I've not tested it:
array_x = [(x, 1 / x) for x in dataset['A'].tolist()]  # or dataset.A.tolist()
df = pd.DataFrame(data=np.asarray(array_x))
df.columns = ['A', 'B']
Here the idea is to break the Series apart into a list and feed that list into a dataframe. This can also be done directly, without the Series -> list -> dataframe round trip, which is not very efficient.
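To illustrate the cumsum suggestion from the answer above: since column B for A = n is just the partial harmonic sum 1/1 + ... + 1/(n-1), the whole column can be built without apply. A small sketch on the same toy frame, with the column names taken from the earlier example:
import numpy as np
import pandas as pd
dataset = pd.DataFrame({'A': [1, 2, 3, 4]})
# harmonic[k] = 1/1 + 1/2 + ... + 1/k, with harmonic[0] = 0
n_max = int(dataset['A'].max())
harmonic = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, n_max))))
# B for a row with A = n is the sum of 1/i for i in range(1, n), i.e. harmonic[n - 1]
dataset['B'] = harmonic[dataset['A'] - 1]
print(dataset)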
