How do I assign the top 2, middle 2 and bottom 2 values as "Extra" in the given data frame? - python-3.x

In the data frame below, I want to insert a new Group column and assign the top two, middle two and bottom two values as "Extra".
df
A_No B_Wt
39 184.66
40 193.11
46 197.82
2 203.82
12 205.27
9 208.11
3 208.49
14 208.70
Output
A_No B_Wt Group
39 184.66 Extra
40 193.11 Extra
46 197.82
2 203.82 Extra
12 205.27 Extra
9 208.11
3 208.49 Extra
14 208.70 Extra

I believe you can join the positions of the top 2, middle 2 and bottom 2 rows with np.r_ and then set the values in a new column:
import numpy as np

lend = len(df)
mid = lend // 2
# positions of the first two rows, the two middle rows, and the last two rows
pos = np.r_[0:2, mid - 1:mid + 1, lend - 2:lend]
df.loc[df.index[pos], 'Group'] = 'Extra'
print(df)
A_No B_Wt Group
0 39 184.66 Extra
1 40 193.11 Extra
2 46 197.82 NaN
3 2 203.82 Extra
4 12 205.27 Extra
5 9 208.11 NaN
6 3 208.49 Extra
7 14 208.70 Extra
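If you need to reuse this or vary the group size, the same logic generalizes to a small helper (a sketch; the function name and the n parameter are my own additions):
import numpy as np

def mark_extra(df, n=2, label='Extra', col='Group'):
    """Label the first n, middle n and last n rows of df in column col."""
    out = df.copy()
    mid = len(out) // 2
    # n // 2 rows from the midpoint up, the remaining n - n // 2 just before it
    pos = np.r_[0:n, mid - (n - n // 2):mid + n // 2, len(out) - n:len(out)]
    out.loc[out.index[pos], col] = label
    return out

print(mark_extra(df))  # reproduces the output above for n=2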

Related

Excel MERGE two tables

I have SET 1
CLASS Student TEST SCORE
A 1 1 46
A 1 2 50
A 1 3 45
A 2 1 45
A 2 2 47
A 2 3 31
A 3 1 34
A 3 2 45
B 1 1 36
B 2 1 31
B 2 2 41
B 3 1 50
C 1 1 42
C 3 1 31
and SET 2
CLASS SIZE YEARS
A 39 7
B 20 12
C 31 6
and wish to COMBINE to make SET 3
CLASS STUDENT TEST SCORE SIZE YEARS
A 1 1 46 39 7
A 1 2 50 39 7
A 1 3 45 39 7
A 2 1 45 39 7
A 2 2 47 39 7
A 2 3 31 39 7
A 3 1 34 39 7
A 3 2 45 39 7
B 1 1 36 20 12
B 2 1 31 20 12
B 2 2 41 20 12
B 3 1 50 20 12
C 1 1 42 31 6
C 3 1 31 31 6
So basically I want to add the SIZE and YEARS columns from SET 2, merged on CLASS, onto SET 1. How can you do this in Excel? I need to match on CLASS.
Define both sets as tables and "left join" them in Power Query. There you can choose the columns of the resulting table.
https://learn.microsoft.com/en-us/power-query/merge-queries-left-outer
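For comparison, since the rest of this page is pandas-focused, the same left join is a one-liner there (a sketch assuming SET 1 and SET 2 have been loaded as DataFrames set1 and set2):
import pandas as pd

set1 = pd.DataFrame({'CLASS': ['A', 'A', 'B', 'C'],
                     'STUDENT': [1, 1, 2, 1],
                     'TEST': [1, 2, 1, 1],
                     'SCORE': [46, 50, 31, 42]})   # abbreviated SET 1
set2 = pd.DataFrame({'CLASS': ['A', 'B', 'C'],
                     'SIZE': [39, 20, 31],
                     'YEARS': [7, 12, 6]})

# left join on CLASS keeps every SET 1 row and appends SIZE and YEARS
set3 = set1.merge(set2, on='CLASS', how='left')
print(set3)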
If you have Set 1 at the top left of a worksheet "Set1" and Set 2 at the top left of a worksheet "Set2", then you can use the formula
=VLOOKUP(A2;'Set2'!$A$2:$C$4;2;FALSE), where $A$2:$C$4 is the range of Set2, and A2 is the class value from Set1, which is what is used to do the lookup in Set2. The next argument, 2, means to take the second column from Set2 (SIZE), and the FALSE at the end means that you only want exact matches on the CLASS. You can auto-fill with this formula, and use a similar formula with column index 3 for YEARS: =VLOOKUP(A2;'Set2'!$A$2:$C$4;3;FALSE). If you look up the help for VLOOKUP within Excel, that should help you understand how it works.
Your first set of data is essentially your primary set, to which you just want to add attribute columns. I built this example in Google Sheets, which should help explain. Using spill formulas, only a few cells need their own formulas; you can see them highlighted in yellow. When you use this in Excel, make sure you change the column references, but this will get you the answer.
Note that you need spill-range (dynamic array) support in Excel for this to work. To test, see whether the =UNIQUE() formula is available.
This solution may work for you if both sets start in the same column. In my example image, both of them start at column A. You can get all the data with a single VLOOKUP formula.
Formula in cell E2 is:
=VLOOKUP($A2;$A$22:$R$25;COLUMN($B22);FALSE)
Notice the mixed references in the first and third arguments and the absolute references in the second one. The third argument is critical because it is the relative position between both sets, which is why it's easier if both sets start at the same column. If not, you'll need to adjust this argument by subtracting or adding, depending on the case.
Anyway, with a single formula you can get any number of columns. The only disadvantage of this formula is that you need to drag it to the right manually until you have all the columns (10, 30 or whatever). You'll notice you are done because the formula will raise an error:
this error means you are trying to get a reference outside of your column area.

Creating aggregate columns in a pandas dataframe

I have a pandas dataframe as below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ORDER':["A", "A", "B", "B"], 'var1':[2, 3, 1, 5],'a1_bal':[1,2,3,4], 'a1c_bal':[10,22,36,41], 'b1_bal':[1,2,33,4], 'b1c_bal':[11,22,3,4], 'm1_bal':[15,2,35,4]})
df
ORDER var1 a1_bal a1c_bal b1_bal b1c_bal m1_bal
0 A 2 1 10 1 11 15
1 A 3 2 22 2 22 2
2 B 1 3 36 33 3 35
3 B 5 4 41 4 4 4
I want to create new columns as below:
a1_final_bal = sum(a1_bal, a1c_bal)
b1_final_bal = sum(b1_bal, b1c_bal)
m1_final_bal = m1_bal (since we only have the m1_bal field and no m1c_bal, it will remain as it is)
I don't want to hardcode this step because there might be more such columns, such as "c_bal", "m2_bal", "m2c_bal", etc.
My final data should look something like below
ORDER var1 a1_bal a1c_bal b1_bal b1c_bal m1_bal a1_final_bal b1_final_bal m1_final_bal
0 A 2 1 10 1 11 15 11 12 15
1 A 3 2 22 2 22 2 24 24 2
2 B 1 3 36 33 3 35 39 36 35
3 B 5 4 41 4 4 4 45 8 4
You could try something like this. I am not sure if it's exactly what you are looking for, but I think it should work.
dfforgroup = df.set_index(['ORDER', 'var1'])     # creates a MultiIndex
dfforgroup.columns = dfforgroup.columns.str[:2]  # keeps the first two letters of the remaining columns
# group the columns by their first two letters and sum them up
df2 = (dfforgroup.groupby(dfforgroup.columns, axis=1).sum()
                 .reset_index()
                 .drop(columns=['ORDER', 'var1'])
                 .add_suffix('_final_bal'))
df = pd.concat([df, df2], axis=1)                # concatenate the new columns to the original df
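Newer versions of pandas deprecate DataFrame.groupby(axis=1); if that is a concern, a plain loop over the two-letter prefixes gives the same result (a sketch under that assumption, using the df from the question):
bal_cols = [c for c in df.columns
            if c.endswith('_bal') and not c.endswith('_final_bal')]
for prefix in sorted({c[:2] for c in bal_cols}):         # 'a1', 'b1', 'm1', ...
    members = [c for c in bal_cols if c.startswith(prefix)]
    df[prefix + '_final_bal'] = df[members].sum(axis=1)  # row-wise sum of the group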

Fill in missing values in a DataFrame column that increments by 10

Say some values in the 'Counts' column are missing. These numbers are meant to increase by 10 with each row, so 35 and 55 need to be put in place. I want to fill in these missing values.
Counts
0 25
1 NaN
2 45
3 NaN
4 65
So my output should be :
Counts
0 25
1 35
2 45
3 55
4 65
Thanks,
We have interpolate:
df = df.interpolate()
Counts
0 25.0
1 35.0
2 45.0
3 55.0
4 65.0
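If you want the integer dtype back afterwards (assuming every gap was actually filled), you can chain a cast:
df['Counts'] = df['Counts'].interpolate().astype(int)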
Since you know the pattern, you can simply recreate it (this assumes the first and last values are present and the column steps through the full 10-by-10 grid):
import numpy as np

start = df.iloc[0]['Counts']   # first row
end = df.iloc[-1]['Counts']    # last row
df['Counts'] = np.where(df['Counts'].notnull(), df['Counts'],
                        np.arange(start, end + 1, 10))

Select rows with the same value in one column but different values in another column

I have some duplicates in my data that I need to correct.
This is a sample of a dataframe:
import pandas as pd

test = pd.DataFrame({'event_id': ['1', '1', '2', '3', '5', '6', '9', '3', '9', '10'],
                     'user_id': [0, 0, 0, 1, 1, 3, 3, 4, 4, 4],
                     'index': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
I need to select all the rows that have equal values in event_id but differing values in user_id. I tried this (based on a similar question but with no accepted answer):
test.groupby('event_id').filter(lambda g: len(g) > 1).drop_duplicates(subset=['event_id', 'user_id'], keep="first")
out:
event_id user_id index
0 1 0 10
3 3 1 40
6 9 3 70
7 3 4 80
8 9 4 90
But I do not need the first row: event_id 1 is duplicated, but always with the same user_id (0).
The second part of the question is: what is the best way to correct the duplicate records? How could I add a suffix (_new) to event_id, but only in these rows:
event_id user_id index
3 3_new 1 40
6 9_new 3 70
7 3 4 80
8 9 4 90
Umm, I'll try to fix your code:
test.groupby('event_id').filter(
    lambda x: (len(x['event_id']) == x['user_id'].nunique()) & (len(x['event_id']) > 1))
Out[85]:
event_id user_id index
3 3 1 40
6 9 3 70
7 3 4 80
8 9 4 90
To correct the duplicated rows, you can create a new sub-key; personally, I do not recommend modifying your original columns.
test['subkey'] = test.groupby('event_id').cumcount()  # 0 for the first occurrence, 1 for the second, ...
Try:
test[test.duplicated(['event_id'], keep=False) &
     ~test.duplicated(['event_id', 'user_id'], keep=False)]
Output:
event_id user_id index
3 3 1 40
6 9 3 70
7 3 4 80
8 9 4 90
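For the second part, one option (a sketch; the mask names are my own) is to reuse the same duplicated() masks and tag only the first occurrence of each affected event_id:
# rows whose event_id repeats but whose (event_id, user_id) pair does not
dup = (test.duplicated(['event_id'], keep=False) &
       ~test.duplicated(['event_id', 'user_id'], keep=False))

# of those rows, keep only the first occurrence of each event_id
first = dup & ~test.duplicated(['event_id'])

test.loc[first, 'event_id'] = test.loc[first, 'event_id'] + '_new'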

Mark sudden changes in prices in a dataframe time series and color them

I have a pandas dataframe of prices for different months and years (a time series) with 80 columns. I want to detect significant changes in prices, either up or down, and color them differently in the dataframe. Is that possible, and what would be the best approach?
Jan-2001 Feb-2001 Jan-2002 Feb-2002 ....
100 30 10 ...
110 25 1 ...
40 5 50
70 11 4
120 35 2
Here in the first column 40 and 70 should be marked, in the second column 5 and 11 should be marked, in the third column not really sure but probably 1, 50, 4, 2...
Your question involves two problems, as I see it:
Printing the highlighting depends on the output method you're trying to get to, be it STDOUT, a file, or some specific program.
Identifying outliers based on the column data. It's hard to tell whether you want it based on the entire dataset or on the previous data in the column, like a rolling outlier, i.e. the preceding data is used to decide whether the next value is out of whack.
In the instance below I provide a method that uses standard deviation / z-scoring based on the mean of the data in the entire column. You will have to tweak the > and < thresholds to get to your desired state; there are many intricacies to this concept, and I would suggest taking a look at a few resources on the subject.
For your data:
Jan-2001,Feb-2001,Jan-2002
100,30,10
110,25,1
40,5,50
70,11,4
120,35,20000
I am aware of methods to highlight, but not in the terminal. The styling approach at https://pandas.pydata.org/pandas-docs/stable/style.html works in a few programs.
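As a minimal sketch of that styling approach (assuming a notebook/HTML renderer; the helper name and the symmetric 1.5 threshold are my own choices):
def highlight_outliers(col, z=1.5):
    """Return one CSS string per cell, coloring values whose |z-score| exceeds z."""
    zscores = (col - col.mean()) / col.std(ddof=0)
    return ['background-color: yellow' if abs(v) > z else '' for v in zscores]

df.style.apply(highlight_outliers)  # applied column-wise; renders in Jupyter/HTML, not the terminal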
To get at the original item, identifying the outliers in your data, you could use something like the code below, based on standard deviation and z-score.
Sample code:
import pandas as pd

df = pd.read_csv("full.txt")  # the CSV shown above
original = df.columns         # snapshot of the data columns before z-score columns are added
print(df)
for col in original:
    col_zscore = col + "_zscore"
    df[col_zscore] = (df[col] - df[col].mean()) / df[col].std(ddof=0)
    print(df[col].loc[(df[col_zscore] > 1.5) | (df[col_zscore] < -.5)])
print(df)
Output 1: # prints the original dataframe
Jan-2001 Feb-2001 Jan-2002
100 30 10
110 25 1
40 5 50
70 11 4
120 35 20000
Output 2: # Identifies the outliers
2 40
3 70
Name: Jan-2001, dtype: int64
2 5
3 11
Name: Feb-2001, dtype: int64
0 10
1 1
3 4
4 20000
Name: Jan-2002, dtype: int64
Output 3: # prints the full dataframe created, with the z-score of each item based on its column
   Jan-2001  Feb-2001  Jan-2002  Jan-2001_zscore  Feb-2001_zscore  Jan-2002_zscore
0       100        30        10         0.410152         0.772524        -0.500781
1       110        25         1         0.751945         0.333590        -0.501907
2        40         5        50        -1.640606        -1.422147        -0.495777
3        70        11         4        -0.615227        -0.895426        -0.501531
4       120        35     20000         1.093737         1.211459         1.999995
Resources for z-scores are here:
https://statistics.laerd.com/statistical-guides/standard-score-2.php
