How to get the median of different intervals of dataframe based on label name? [duplicate] - python-3.x

This question already has answers here:
How to groupby consecutive values in pandas DataFrame
(4 answers)
Closed 3 years ago.
So I have a DataFrame with two columns, one with label names (df['Labels']) and the other with int values (df['Volume']).
import pandas as pd

df = pd.DataFrame({'Labels': ['A','A','A','A','B','B','B','B','B','B','A','A','A','A','A','A','A','A','C','C','C','C','C'],
                   'Volume': [10,40,20,20,50,60,40,50,50,60,10,10,10,10,20,20,10,20,80,90,90,80,100]})
I would like to identify the intervals where my labels change and then calculate the median of the 'Volume' column for each of these intervals. I then want to replace every value in 'Volume' with the median of its interval.
In case of label A, I would like to have the median for both intervals.
Here is what my DataFrame should look like:
df2 = pd.DataFrame({'Labels': ['A','A','A','A','B','B','B','B','B','B','A','A','A','A','A','A','A','A','C','C','C','C','C'],
                    'Volume': [20,20,20,20,50,50,50,50,50,50,10,10,10,10,10,10,10,10,90,90,90,90,90]})

You want to group by the blocks of consecutive labels and transform with the median:
blocks = df['Labels'].ne(df['Labels'].shift()).cumsum()
df['group_median'] = df['Volume'].groupby(blocks).transform('median')
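For intuition, here is a minimal sketch on a shorter frame (the small frame is illustrative, not the question's data) showing how blocks gives each consecutive run its own id, which is what keeps the two separate A intervals apart:
import pandas as pd

small = pd.DataFrame({'Labels': ['A', 'A', 'B', 'B', 'A', 'C'],
                      'Volume': [10, 40, 50, 60, 10, 80]})

blocks = small['Labels'].ne(small['Labels'].shift()).cumsum()
print(blocks.tolist())
# [1, 1, 2, 2, 3, 4] -- every run of identical labels gets its own id

print(small['Volume'].groupby(blocks).transform('median').tolist())
# [25.0, 25.0, 55.0, 55.0, 10.0, 80.0]
Assigning the transformed series back to df['Volume'] instead of a new column (as in the answer below) gives exactly the df2 from the question.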

Use Series.ne with Series.shift to mark where the label changes, Series.cumsum to turn those marks into group ids, and then group by them and use transform:
df['Volume']=df.groupby(df['Labels'].ne(df['Labels'].shift()).cumsum())['Volume'].transform('median')
print(df)
Labels Volume
0 A 20
1 A 20
2 A 20
3 A 20
4 B 50
5 B 50
6 B 50
7 B 50
8 B 50
9 B 50
10 A 10
11 A 10
12 A 10
13 A 10
14 A 10
15 A 10
16 A 10
17 A 10
18 C 90
19 C 90
20 C 90
21 C 90
22 C 90

Related

Dynamically updating row values based on a condition in pandas

I am running a simulation test where I want to dynamically change some values in the rows of each column based on a certain set of conditions.
The Problem Statement
My dataset has 400 rows and my first test case is to update 5% of the rows in each column, so 5% of 400 = 20 rows need to be updated.
These 20 rows should only be updated for the top 5 categories present in my dataset, so 4 rows per category need to be updated.
My dataframe looks like this:
A B C D Category
1 10 3 4 X
4 9 6 9 Y
9 3 7 10 XX
10 1 9 7 YY
10 1 9 7 ZZ
10 1 9 7 YZZ
10 1 9 7 YZZ
10 1 9 7 YYYY
......400 rows
The conditions are:
While updating the rows I want to make sure that the 20 rows (5% of the overall dataset) are updated only where one of the top 5 categories is encountered. In my case the top 5 categories are X, Y, XX, YY and ZZ. These rows should be updated to the value 7 wherever the previous value was 1, 2, 3, 4, 5 or 6.
The resultant dataframe should look like this:
A B C D Category
7 10 7 7 X
7 9 7 9 Y
9 7 7 10 XX
10 7 9 7 YY
10 7 9 7 ZZ
10 1 9 7 YZZ
10 1 9 7 YZZ
10 1 9 7 YYYY
......400 rows
In the resultant dataframe there is no impact on the categories outside the top 5, in this case YZZ and YYYY. I can't show all the updated rows, but for example in the dataframe above two rows in column A have been updated from a previous value <= 6 to the new value 7, and the other rows are likewise updated to 7 wherever the condition is met.
How can I achieve this?
You can try the following logic:
# get only the desired categories
m = df['Category'].isin(['X', 'Y', 'XX', 'YY', 'ZZ'])
# select 20 random rows from the above
idx = df[m].sample(n=20).index
# replace values between 1 and 6 (inclusive) by 7, in the numeric columns only
cols = df.columns.drop('Category')
df.loc[idx, cols] = df.loc[idx, cols].mask(df.loc[idx, cols].ge(1) & df.loc[idx, cols].le(6), 7)
If you would rather have exactly 4 rows per category, use this variant for the random sampling (requires pandas >= 1.1 for groupby sampling):
idx = df[m].groupby('Category').sample(n=4).index
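Putting it together, here is a minimal end-to-end sketch on a toy frame (the random data, the 40-row size and the 2-row sample are illustrative stand-ins for the real 400-row dataset):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'A': rng.integers(1, 11, 40),
    'B': rng.integers(1, 11, 40),
    'C': rng.integers(1, 11, 40),
    'D': rng.integers(1, 11, 40),
    'Category': rng.choice(['X', 'Y', 'XX', 'YY', 'ZZ', 'YZZ', 'YYYY'], 40),
})

# 5% of 40 rows = 2 rows, drawn only from the top 5 categories
m = df['Category'].isin(['X', 'Y', 'XX', 'YY', 'ZZ'])
idx = df[m].sample(n=2, random_state=0).index

# bump values between 1 and 6 (inclusive) up to 7 in the numeric columns
cols = df.columns.drop('Category')
sub = df.loc[idx, cols]
df.loc[idx, cols] = sub.mask(sub.ge(1) & sub.le(6), 7)
print(df.loc[idx])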

want to calculate the count of pass instances of data set using python pandas

x = []
y1 = []
r1 = len(df)
L1 = len(df.columns)
for i in range(r1):
    ll = df.loc[i, 'LL']
    ul = df.loc[i, 'UL']
    count1 = 0
    for j in range(5, L1):
        if isinstance(df.iloc[i, j], str):
            df.iloc[i, j] = 0
        if ll <= df.iloc[i, j] <= ul:
            count1 = count1 + 1
    if count1 == (L1 - 5):
        x.append('Pass')
    else:
        x.append('Fail')
    y1.append(count1)
se = pd.Series(x)
se1 = pd.Series(y1)
# summary statistics over the daily columns (before the new columns are appended)
min1 = df.iloc[:, 5:].min(axis=1)
mean1 = df.iloc[:, 5:].astype(float).mean(axis=1, skipna=True)
median1 = df.iloc[:, 5:].astype(float).median(axis=1, skipna=True)
max1 = df.iloc[:, 5:].max(axis=1)
count1 = df.iloc[:, 5:].count(axis=1)
df['Min'] = min1.values
df['Mean'] = mean1.values
df['Median'] = median1.values
df['Max'] = max1.values
df['Pass Count'] = se1.values
df['Result'] = se.values
yield1 = []
for i in range(len(se1)):
    yd1 = (se1[i] / (L1 - 3)) * 100
    yield1.append(yd1)
se2 = pd.Series(yield1)
df['Yield'] = se2.values
df1 = df.loc[:, ['PARAMETER', 'Min', 'Mean', 'Median', 'Max', 'Result', 'Pass Count', 'Yield']]
df1
Below is my data set; it is daily sensor data. The daily data should be within the lower limit (LL) and upper limit (UL). I want to count how many days the sensor data is within LL and UL.
I am not able to calculate the number of days the sensor data is within LL and UL using pandas. How can I calculate it?
A few key ideas:
- you need a list of the columns that go into the calculation (daycols)
- transpose these columns and test them against LL and UL, which gives a boolean array
- sum this boolean array and you have your desired count
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""sensor location,LL,UL,day1,day2,day3,day4,day5,day6,day7,number of days sensor data within LL and UL
A,1,10,12,6,9,4,9,7,15,5
B,1,12,4,15,7,1,11,1,7,6
C,1,15,13,13,13,10,7,13,13,7
D,1,10,12,1,14,12,15,4,4,3
E,1,20,11,15,8,14,1,14,14,7"""))

daycols = [d for d in df.columns if "day" in d and "number" not in d]
df = df.assign(
    # True counts as 1, so summing the boolean array gives the answer
    daysBetween=lambda dfa: ((dfa.loc[:, daycols].T >= dfa["LL"]) &
                             (dfa.loc[:, daycols].T <= dfa["UL"])).sum()
)
print(df.to_string(index=False))
print(df.to_string(index=False))
output
sensor location LL UL day1 day2 day3 day4 day5 day6 day7 number of days sensor data within LL and UL daysBetween
A 1 10 12 6 9 4 9 7 15 5 5
B 1 12 4 15 7 1 11 1 7 6 6
C 1 15 13 13 13 10 7 13 13 7 7
D 1 10 12 1 14 12 15 4 4 3 3
E 1 20 11 15 8 14 1 14 14 7 7
speed up
If you have many columns, you can use slicing to identify them and turn them into positional indexes so iloc can be used. Additionally, the transpose is not necessary if you align the comparisons explicitly on the rows (axis=0).
dayi = [df.columns.get_loc(c) for c in df.columns[3:-1]]
df = df.assign(
    # True counts as 1, so summing the boolean array gives the answer
    daysBetween=lambda dfa: (dfa.iloc[:, dayi].ge(dfa["LL"], axis=0) &
                             dfa.iloc[:, dayi].le(dfa["UL"], axis=0)).sum(axis=1)
)
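If you prefer to sidestep label alignment entirely, an equivalent sketch with plain NumPy broadcasting (reusing daycols from above) would be:
import numpy as np

vals = df[daycols].to_numpy()
df["daysBetween"] = ((vals >= df["LL"].to_numpy()[:, None]) &
                     (vals <= df["UL"].to_numpy()[:, None])).sum(axis=1)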

Grouping data based on month-year in pandas and then dropping all entries except the latest one- Python

Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write code that groups the data by a particular month-year, keeps the entry with the latest date in that month-year, and drops the rest. The data runs until the year 2020.
I was only able to fetch the count by month-year. I am not able to write proper code that groups the data by month-year and indicator and gets the correct results.
Use Series.dt.to_period for monthly periods, get the index of the maximal date per group with DataFrameGroupBy.idxmax, and then pass it to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('m'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('m'))['Date'].idxmax()]
print (df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90
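An alternative with the same result, shown here as a sketch in case it reads more naturally (it assumes Date has already been converted with pd.to_datetime as above), is to sort by date and keep the last row per month:
out = (df.assign(month=df['Date'].dt.to_period('M'))
         .sort_values('Date')
         .drop_duplicates('month', keep='last')
         .drop(columns='month'))
print(out)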

pandas random shuffling dataframe with constraints

I have a dataframe that I need to randomise in a very specific way with a particular rule, and I'm a bit lost. A simplified version is here:
idx type time
1 a 1
2 a 1
3 a 1
4 b 2
5 b 2
6 b 2
7 a 3
8 a 3
9 a 3
10 b 4
11 b 4
12 b 4
13 a 5
14 a 5
15 a 5
16 b 6
17 b 6
18 b 6
19 a 7
20 a 7
21 a 7
If we consider this as containing seven "bunches", I'd like to randomly shuffle by those bunches, i.e. keeping the rows that share a time value together. However, the constraint is that after shuffling, a particular bunch type (a or b in this case) cannot appear more than n (e.g. 2) times in a row. So an example of a correct result looks like this:
idx type time
21 a 7
20 a 7
19 a 7
7 a 3
8 a 3
9 a 3
17 b 6
16 b 6
18 b 6
6 b 2
5 b 2
4 b 2
2 a 1
3 a 1
1 a 1
14 a 5
13 a 5
15 a 5
12 b 4
11 b 4
10 b 4
I was thinking I could create a separate "order" array from 1 to 7, np.random.shuffle() it, and then sort the dataframe by time in that order. I can think of ways to do that part, but I'm especially struggling with the rule restricting the number of repeats.
Roughly, I know I should use a while loop: shuffle in that way, loop over the frame tracking the number of consecutive types, break out and reshuffle if the count exceeds my n, and end the loop once a pass completes without breaking out. But this got messy and didn't work.
Any ideas?
See if this works.
import pandas as pd
import numpy as np

n = [['a', 1], ['a', 1], ['a', 1],
     ['b', 2], ['b', 2], ['b', 2],
     ['a', 3], ['a', 3], ['a', 3]]
df = pd.DataFrame(n)
df.columns = ['type', 'time']
print(df)

order = np.unique(np.array(df['time']))
print("Before Shuffling", order)
np.random.shuffle(order)
print("Shuffled", order)
n = 2
for i in order:
    print(df[df['time'] == i].iloc[0:n])
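Note that the snippet above shuffles the order of the bunches but does not enforce the "no more than n of the same type in a row" rule. One simple way to add that is rejection sampling: reshuffle until an ordering passes the check. A rough sketch (the helper name, max_tries and the seed handling are my own choices for illustration, not part of the answer above):
import numpy as np
import pandas as pd

def shuffle_bunches(df, n=2, max_tries=1000, seed=None):
    """Shuffle whole 'time' bunches so that no more than n consecutive
    bunches share the same 'type', reshuffling until the rule holds."""
    rng = np.random.default_rng(seed)
    # one (type, time) pair per bunch
    bunches = df.drop_duplicates('time')[['type', 'time']].to_numpy()
    for _ in range(max_tries):
        order = rng.permutation(len(bunches))
        types = bunches[order, 0]
        # length of the longest run of identical consecutive types
        run = longest = 1
        for prev, cur in zip(types, types[1:]):
            run = run + 1 if cur == prev else 1
            longest = max(longest, run)
        if longest <= n:
            # reassemble the frame bunch by bunch in the accepted order
            times = bunches[order, 1]
            return pd.concat([df[df['time'] == t] for t in times])
    raise RuntimeError("no ordering satisfying the constraint was found")
Applied to the frame in the question, rows that share a time stay together and no type appears in more than n consecutive bunches.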

Mark sudden changes in prices in a dataframe time series and color them

I have a Pandas dataframe of prices for different months and years (timeseries), 80 columns. I want to be able to detect significant changes in prices either up or down and color them differently in a dataframe. Is that possible and what would be the best approach?
Jan-2001 Feb-2001 Jan-2002 Feb-2002 ....
100 30 10 ...
110 25 1 ...
40 5 50
70 11 4
120 35 2
Here in the first column 40 and 70 should be marked, in the second column 5 and 11 should be marked, in the third column not really sure but probably 1, 50, 4, 2...
Your question involves two problems as I see it.
1. Printing the highlighting depends on the output target you are trying to reach, be it STDOUT, a file, or some specific program.
2. Identification of outliers based on the column data. It is hard to tell whether you want it based on the entire dataset or on the previous data in the column, like a rolling outlier, i.e. the preceding data is used to decide whether the next value is out of whack.
In the instance below I provide a method that scores the data with standard deviation / z-scoring based on the mean of the entire column. You will have to tweak the > and < thresholds to get to your desired state; there are many intricacies to this concept and I would suggest taking a look at a few resources on the subject.
For your data:
Jan-2001,Feb-2001,Jan-2002
100,30,10
110,25,1
40,5,50
70,11,4
120,35,20000
I am aware of methods to highlight, but not in the terminal. The styling method described at https://pandas.pydata.org/pandas-docs/stable/style.html works in a few environments, such as Jupyter/HTML output.
For the original item, identifying the outliers in your data, you could use something like the code below, based on standard deviation and z-score.
Sample Code:
df = pd.read_csv("full.txt")
original = df.columns
print(df)
for col in df.columns:
col_zscore = col + "_zscore"
df[col_zscore] = (df[col] - df[col].mean())/df[col].std(ddof=0)
print(df[col].loc[(df[col_zscore] > 1.5) | (df[col_zscore] < -.5)])
print(df)
Output 1: # prints the original dataframe
Jan-2001 Feb-2001 Jan-2002
100 30 10
110 25 1
40 5 50
70 11 4
120 35 20000
Output 2: # Identifies the outliers
2 40
3 70
Name: Jan-2001, dtype: int64
2 5
3 11
Name: Feb-2001, dtype: int64
0 10
1 1
3 4
4 20000
Name: Jan-2002, dtype: int64
Output 3: # Prints the full dataframe created, with zscore of each item based on the column
Jan-2001 Feb-2001 Jan-2002 Jan-2001_std Jan-2001_zscore \
0 100 30 10 32.710854 0.410152
1 110 25 1 32.710854 0.751945
2 40 5 50 32.710854 -1.640606
3 70 11 4 32.710854 -0.615227
4 120 35 2 32.710854 1.093737
Feb-2001_std Feb-2001_zscore Jan-2002_std Jan-2002_zscore
0 12.735776 0.772524 20.755722 -0.183145
1 12.735776 0.333590 20.755722 -0.667942
2 12.735776 -1.422147 20.755722 1.971507
3 12.735776 -0.895426 20.755722 -0.506343
4 12.735776 1.211459 20.755722 -0.614076
Resources for zscore are here:
https://statistics.laerd.com/statistical-guides/standard-score-2.php
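For the highlighting half of the question, here is a minimal pandas Styler sketch along the same z-score lines (the highlight_outliers helper, the yellow color and the 1.5 / -0.5 thresholds are illustrative choices; this renders in Jupyter/HTML output, not in a plain terminal):
def highlight_outliers(col, upper=1.5, lower=-0.5):
    """Color cells whose column z-score falls outside [lower, upper]."""
    z = (col - col.mean()) / col.std(ddof=0)
    return ['background-color: yellow' if (v > upper or v < lower) else ''
            for v in z]

# 'original' holds the price columns captured before the z-score columns
# were added; Styler.apply with axis=0 styles one column at a time
styled = df[original].style.apply(highlight_outliers, axis=0)
styled  # displays with colored cells in a notebook; styled.to_html() writes HTML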
