How to change Pandas Column Values in List Format

I'm trying to multiply each value in a column by 0.01 but the column values are in list format. How do I apply it to each element of the list in each row? For example, my data looks like this:
ID Amount
156 [14587, 38581, 55669]
798 [67178, 98635]
And I'm trying to multiply each element in the lists by 0.01.
ID Amount
156 [145.87, 385.81, 556.69]
798 [671.78, 986.35]
I've tried the following code but got an error message saying "can't multiply sequence by non-int of type 'float'".
df['Amount'] = df3['Amount'].apply(lambda x: x*0.00000001 in x)

You need another loop / list comprehension in apply:
df['Amount'] = df.Amount.apply(lambda lst: [x * 0.01 for x in lst])
df
ID Amount
0 156 [145.87, 385.81, 556.69]
1 798 [671.78, 986.35]
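If the lists are long, a vectorized alternative is to explode the column, scale the flat values, and collect them back into lists. A sketch, assuming pandas >= 0.25 for Series.explode:
flat = df['Amount'].explode().astype(float) * 0.01
df['Amount'] = flat.groupby(level=0).agg(list)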

Related

how to calculate percentile value of number in dataframe column grouped by index

I have a dataframe like this:
df:
Score
group
A 100
A 34
A 40
A 30
C 24
C 60
C 35
For every group in the data, I want to find out the percentile value of Score 35.
(i.e. the percentile at which 35 falls in the grouped data)
I tried different tricks but none of them worked.
scipy.stats.percentileofscore(df['Score'], 35, kind='weak')
--> This works, but it doesn't give me the percentile grouped by index
df.groupby('group')['Score'].percentileofscore()
--> 'SeriesGroupBy' object has no attribute 'percentileofscore'
scipy.stats.percentileofscore(df.groupby('group')[['Score']], 35, kind='strict')
--> TypeError: '<' not supported between instances of 'str' and 'int'
My ideal output looks like this:
df:
Score Percentile
group
A 50
C 33
Can anyone suggest to me what works well here?
Inverse quantile function for a sequence at point X is the proportion of values less than X in the sequence, right? So:
In [158]: df["Score"].lt(35).groupby(df["group"]).mean().mul(100)
Out[158]:
group
A 50.000000
C 33.333333
Name: Score, dtype: float64
get a True/False Series of whether < 35 or not on "Score"
group this Series over "group"
take the mean
since True == 1 and False == 0, it will effectively give the proportion!
multiply by 100 to get percentages
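For reference, the percentileofscore function from the question can also be applied per group with GroupBy.apply; a sketch, assuming scipy is installed:
from scipy.stats import percentileofscore
# kind='strict' counts only values strictly below 35, matching the output above
df.groupby('group')['Score'].apply(lambda s: percentileofscore(s, 35, kind='strict'))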
To answer in a more general-purpose way: you're looking to do a custom aggregation on the group, which pandas lets you do with the agg method.
You can define the function yourself or use one from a library:
def percentileofscore(ser: pd.Series) -> float:
    # percentage of values strictly below 35 (the 'strict' definition)
    return 100 * (ser < 35).sum() / ser.size
df.groupby("group").agg(percentileofscore)
Output:
Score
group
A 50.000000
C 33.333333
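The hardcoded 35 can be lifted out by building the aggregator in a small factory function; a sketch:
def percentile_of(score):
    # returns an aggregator giving the percentage of values strictly below `score`
    def agg(ser):
        return 100 * (ser < score).sum() / ser.size
    return agg
df.groupby('group').agg(percentile_of(35))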

extract values from a list that fall in intervals specified by a pandas dataframe

I have a huge list of length 103237 and a dataframe of shape (8173, 6). I want to extract the values from the list that fall between the values given by two columns (1 and 2) of the dataframe. For example:
lst = [182,73,137,1,938]
###dataframe
0 1 2 3 4
John 150 183 NY US
Peter 30 50 SE US
Stef 900 969 NY US
Expected output list:
lst = [182,938]
Since 182 falls between 150 and 183 in the first row of the dataframe, and 938 falls between 900 and 969 in row 3, I want the new list to contain 182 and 938 from the original list. To solve this problem I converted my dataframe to a numpy array:
nn = df.values
new_list = []
for item in lst:
    for i in range(nn.shape[0]):
        if item >= nn[i][1] and item <= nn[i][2]:
            new_list.append(item)
But the above code takes a long time since it is O(n^2), and it doesn't scale well to my list of 103237 items. How can I do this more efficiently?
Consider the following: given a value item, you can ask whether it lies inside any interval with the following line
((df[1] <= item) & (df[2] >= item)).any()
The expressions (df[1] <= item) and (df[2] >= item) each return a boolean Series. The '&' combines them into a single boolean Series indicating, row by row, whether item is inside that specific interval. Appending .any() at the end returns True if there is any True value in the Series, i.e. if some interval contains the number.
So for a single item, the above line gives the answer.
To scan over all items you can do the following:
new_list = []
for item in lst:
    if ((df[1] <= item) & (df[2] >= item)).any():
        new_list.append(item)
or with a list comprehension:
new_list = [item for item in lst if ((df[1] <= item) & (df[2] >= item)).any()]
Edit: if this code is still too slow you can accelerate it even further with numba, but I believe the pandas vectorization (i.e. using df[1] <= item) is good enough.
You can iterate the list and compare each element with all pairs of columns 1 and 2 to see if any pair includes the element.
[e for e in lst if (df[1].le(e) & df[2].ge(e)).any()]
I did a test with 110000 elements in the list and 9000 rows in the dataframe and the code takes 32s to run on a macbook pro.
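If that is still too slow, the intervals can be merged into disjoint ones and searched with numpy's searchsorted, bringing the cost down to roughly O((n + m) log m). A sketch, assuming df and lst as in the question, with numeric bounds in columns 1 and 2:
import numpy as np
starts = df[1].to_numpy()
ends = df[2].to_numpy()
# sort the intervals by start and merge any that overlap
order = np.argsort(starts)
merged = []
for s, e in zip(starts[order], ends[order]):
    if merged and s <= merged[-1][1]:
        merged[-1][1] = max(merged[-1][1], e)
    else:
        merged.append([s, e])
m_starts = np.array([s for s, _ in merged])
m_ends = np.array([e for _, e in merged])
# for each value, the only candidate is the last merged interval whose start <= value
vals = np.asarray(lst)
idx = np.searchsorted(m_starts, vals, side='right') - 1
inside = (idx >= 0) & (vals <= m_ends[idx.clip(min=0)])
new_list = [v for v, ok in zip(lst, inside) if ok]
For the example data this yields [182, 938], the same as the quadratic loop.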

how to iterate over columns and save each iteration's result in a result dataframe - python

I have one dataframe with multiple columns; I need to calculate the same thing for all of them. Is there any way to do this? I have many columns, so I cannot do it one by one.
df=pd.DataFrame({r'A':[1,24,69,67],r'A\0001\delta':[1,46,454,67],r'A\0002\delta':[1,46,454,67],r'A\00100\delta':[1,46,70,67]})
I want to calculate:
diff = df[r'A\0001\delta'].diff()
If the diff is greater than 60, the row should be saved in a result dataframe. I want to do the same for more than 100 columns and collect all matching rows in the result dataframe.
At least one diff greater than 60 on a row:
>>> df.loc[df.diff().gt(60).any(axis=1)]
A A\0001\delta A\0002\delta A\00100\delta
2 69 454 454 70
All diffs greater than 60 on a row:
>>> df.loc[df.diff().gt(60).all(axis=1)]
Empty DataFrame
Columns: [A, A\0001\delta, A\0002\delta, A\00100\delta]
Index: []
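If only the \delta columns should be tested and the plain A column left out, the relevant names can be selected first; a sketch using DataFrame.filter with a regex matching the naming pattern from the question:
delta_cols = df.filter(regex=r'delta$').columns
df.loc[df[delta_cols].diff().gt(60).any(axis=1)]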

Take the mean of n numbers in a DataFrame column and "drag" formula down similar to Excel

I'm trying to take the mean of n numbers in a pandas DataFrame column and "drag" the formula down each row to get the respective mean.
Let's say there are 6 rows of data with "Numbers" in column A and "Averages" in column B. I want to take the average of A1:A2, then "drag" that formula down to get the average of A2:A3, A3:A4, etc.
list = [55,6,77,75,9,127,13]
finallist = pd.DataFrame(list)
finallist.columns = ['Numbers']
Below gives me the average of rows 0:2 in the Numbers column. So calling out the rows with .iloc[0:2] works, but when I try to shift down a row it doesn't work:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2])
Below I'm trying to take the average of the first two rows, then shift down by 1 as you move down the rows, but I get a value of NaN:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2].shift(1))
I expected the .iloc[0:2].shift(1) to shift the mean function down 1 row but still apply to 2 total rows, but I got a value of NaN.
What's happening in your shift(1) approach is that you're actually shifting the index in your data "down" once, so this code:
df['Numbers'].iloc[0:2].shift(1)
Produces the output:
0 NaN
1 55.0
Then you take the average of these two, which evaluates to NaN, and then you assign that single value to every element of the Averages Series here:
df['Averages'] = statistics.mean(df['Numbers'].iloc[0:2].shift(1))
You can instead use rolling() combined with mean() to get a sliding average across the entire data frame like this:
import pandas as pd
values = [55,6,77,75,9,127,13]
df = pd.DataFrame(values)
df.columns = ['Numbers']
df['Averages'] = df['Numbers'].rolling(2, min_periods=1).mean()
This produces the following output:
Numbers Averages
0 55 55.0
1 6 30.5
2 77 41.5
3 75 76.0
4 9 42.0
5 127 68.0
6 13 70.0
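Note that rolling(2) aligns each mean with the bottom row of its window, so row i averages rows i-1 and i. If the Excel formula being dragged is =AVERAGE(A1:A2) in B1, i.e. row i should average rows i and i+1, shift the result up by one; a sketch:
df['Averages'] = df['Numbers'].rolling(2).mean().shift(-1)
This leaves NaN in the last row, since there is no following value to average.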

How to find the min and max elements in a list that has nan in between

I have a dataframe that has a column named "score". I am extracting all the elements from that column into a list. It has 'nan's in between. I wish to identify the min and max of elements before every 'nan' occurs.
I was looking into converting the column into a list, and traverse the list until I encounter an "nan". But how do I traverse back to find the min and max elements right before nan?
This is the code I wrote to convert a column of a dataframe into a list and then identify the "nan".
import math
score_list = description_df['score'].tolist()
for i in score_list:
    print(i)
    if math.isnan(i):
        print("\n")
Suppose my data looks like this,
11.03680137760893
5.351482041139766
10.10019513222711
nan
0.960990030082931
nan
6.46983084276682
32.46794015293125
nan
Then, I should be able to identify 11.03680137760893 as the max and 5.351482041139766 as the min before the first nan; 0.960990030082931 as both the min and the max between the first and second nan; and 32.46794015293125 as the max and 6.46983084276682 as the min between the second and third nan.
You can create groups by testing for missing values with Series.isna and taking the cumulative sum with Series.cumsum, aggregate each group with GroupBy.agg using min and max, and finally drop the all-NaN rows with DataFrame.dropna:
df = df.groupby(df['score'].isna().cumsum())['score'].agg(['min','max']).dropna()
print (df)
min max
score
0 5.351482 11.036801
1 0.960990 0.960990
2 6.469831 32.467940
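To see how the grouping works, here is the intermediate key for the sample data: each nan starts a new group, so every group holds the values between consecutive nans (the nan row itself joins the new group but is ignored by min and max, and all-NaN groups are removed by dropna):
df['score'].isna().cumsum()
0    0
1    0
2    0
3    1
4    1
5    2
6    2
7    2
8    3
Name: score, dtype: int64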
You can keep two variables called min and max, reset them to default values each time you find a nan, and print (or store) them:
import math
import sys
score_list = description_df['score'].tolist()
max = sys.float_info.min
min = sys.float_info.max
for i in score_list:
    print(i)
    if math.isnan(i):
        print("max =", max, "min =", min, "\n")
        max = sys.float_info.min
        min = sys.float_info.max
    else:
        if i > max:
            max = i
        if i < min:
            min = i
Note that a trailing group is only printed when the list ends with a nan, which is the case in the sample data above.
