I have two DataFrames, as shown below.
df1:
ID Age_days N_30 N_31_90
1 201 60 15
2 20 0 15
3 800 0 0
4 100 0 0
5 600 0 6
df2:
ID Salary Speed Group
1 2000 60 A
2 600 0 C
3 9000 0 B
4 1000 0 D
5 4000 0 A
6 8000 0 A
I would like to merge the Salary column of df2 into df1 based on the ID column values.
Expected Output:
ID Age_days N_30 N_31_90 Salary
1 201 60 15 2000
2 20 0 15 600
3 800 0 0 9000
4 100 0 0 1000
5 600 0 6 4000
I tried the code below:
df3 = pd.merge(df1, df2[['ID', 'Salary']], on='ID')
This gives the expected output. Is there another method that would reduce the execution time?
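One alternative that is sometimes suggested for pulling in a single column is to build an ID-to-Salary lookup and map it onto df1. The sketch below rebuilds the sample frames and assumes that ID is unique in df2. Whether it is actually faster than pd.merge depends on the data, so it is worth timing both on the real frames.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Age_days": [201, 20, 800, 100, 600],
    "N_30": [60, 0, 0, 0, 0],
    "N_31_90": [15, 15, 0, 0, 6],
})
df2 = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Salary": [2000, 600, 9000, 1000, 4000, 8000],
    "Speed": [60, 0, 0, 0, 0, 0],
    "Group": ["A", "C", "B", "D", "A", "A"],
})

# Build an ID -> Salary lookup once, then map it onto df1's ID column.
# Assumption: ID is unique in df2. IDs missing from df2 would become NaN.
salary_by_id = df2.set_index("ID")["Salary"]
df3 = df1.assign(Salary=df1["ID"].map(salary_by_id))
print(df3)
```

Unlike an inner merge, this keeps every row of df1 even when its ID has no match in df2.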
tOP
boT 0 29.99 30 60 60 89 90 1000
0.250 Hold Hold 40 40 60 60 80 80
0.290 Hold Hold 40 40 60 60 80 80
0.300 Hold Hold 60 60 80 80 110 110
0.340 Hold Hold 60 60 80 80 110 110
0.350 Hold Hold 76 76 110 110 150 150
0.399 Hold Hold 76 76 110 110 150 150
0.400 Hold Hold 90 90 130 130 180 180
0.449 Hold Hold 90 90 130 130 180 180
0.450 Hold Hold 100 100 160 160 210 210
0.490 Hold Hold 100 100 160 160 210 210
0.500 Hold Hold 130 130 190 190 250 250
0.540 Hold Hold 130 130 190 190 250 250
0.550 Hold Hold 140 140 210 210 280 280
0.590 Hold Hold 140 140 210 210 280 280
0.600 Hold Hold 250 250 375 375 500 500
This is the data table I used, and this is the formula:
=INDEX('sheet1'!$B$3:$I$18,MATCH(boT,'sheet1'!A$4:$A$18,0),MATCH($K$4,'sheet1'!$B$3:$I$3,0))
I thought a match type of 0 would give an exact match, but with a bot of 0.450 and a top of 60 the result should be 100 and I am getting 90, so the result is one level below each time it hits one of the breakpoints.
I am selecting values in a pandas DataFrame.
I would like to select values from the columns 'One_T', 'Two_T', 'Three_T' (the total counts) based on the ratios in the columns 'One_R', 'Two_R', 'Three_R'.
The comparison is done on the ratio columns ('One_R', 'Two_R', 'Three_R'), and the values are taken from the total columns ('One_T', 'Two_T', 'Three_T').
For each row, I would like to find which of 'One_R', 'Two_R', 'Three_R' is highest and put the value from the corresponding 'One_T', 'Two_T', 'Three_T' column in a new column, 'Highest'.
For example, in the first row One_R is higher than Two_R and Three_R,
so the value in One_T fills the Highest column.
The initial DataFrame is test in the code below, and the desired result is result.
test = pd.DataFrame([[150,30,140,20,120,19],[170,31,130,30,180,22],[230,45,100,50,140,40],
[140,28,80,10,60,10],[100,25,80,27,50,23]], index=['2019-01-01','2019-02-01','2019-03-01','2019-04-01','2019-05-01'],
columns=['One_T','One_R','Two_T','Two_R','Three_T','Three_R'])
One_T One_R Two_T Two_R Three_T Three_R
2019-01-01 150 30 140 20 120 19
2019-02-01 170 31 130 30 180 22
2019-03-01 230 45 100 50 140 40
2019-04-01 140 28 80 10 60 10
2019-05-01 100 25 80 27 50 23
result = pd.DataFrame([[150,30,140,20,120,19,150],[170,31,130,30,180,22,170],[230,45,100,50,140,40,100],
[140,28,80,10,60,10,140],[100,25,80,27,50,23,80]], index=['2019-01-01','2019-02-01','2019-03-01','2019-04-01','2019-05-01'],
columns=['One_T','One_R','Two_T','Two_R','Three_T','Three_R','Highest'])
One_T One_R Two_T Two_R Three_T Three_R Highest
2019-01-01 150 30 140 20 120 19 150
2019-02-01 170 31 130 30 180 22 170
2019-03-01 230 45 100 50 140 40 100
2019-04-01 140 28 80 10 60 10 140
2019-05-01 100 25 80 27 50 23 80
Is there any way to do this?
Thank you for your time and consideration.
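You can solve this using df.filter to select the columns with the _R suffix, then idxmax to find the column holding the highest ratio in each row. Then replace _R with _T and use df.lookup (note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this requires an older version):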
s = test.filter(like='_R').idxmax(1).str.replace('_R','_T')
test['Highest'] = test.lookup(s.index,s)
print(test)
One_T One_R Two_T Two_R Three_T Three_R Highest
2019-01-01 150 30 140 20 120 19 150
2019-02-01 170 31 130 30 180 22 170
2019-03-01 230 45 100 50 140 40 100
2019-04-01 140 28 80 10 60 10 140
2019-05-01 100 25 80 27 50 23 80
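On pandas versions without DataFrame.lookup, the same selection can be sketched with NumPy positional indexing. This is one possible substitute rather than the method from the answer above. The names rows and cols are introduced here for illustration.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame(
    [[150, 30, 140, 20, 120, 19],
     [170, 31, 130, 30, 180, 22],
     [230, 45, 100, 50, 140, 40],
     [140, 28, 80, 10, 60, 10],
     [100, 25, 80, 27, 50, 23]],
    index=['2019-01-01', '2019-02-01', '2019-03-01', '2019-04-01', '2019-05-01'],
    columns=['One_T', 'One_R', 'Two_T', 'Two_R', 'Three_T', 'Three_R'])

# For each row, the name of the _T column paired with the largest _R value
s = test.filter(like='_R').idxmax(axis=1).str.replace('_R', '_T')

# Translate those column names into positions and pick one cell per row
rows = np.arange(len(test))
cols = test.columns.get_indexer(s)
test['Highest'] = test.to_numpy()[rows, cols]
print(test)
```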
I have a numeric dataset and I want to calculate the z-score for the 'KM' column and replace the original values with the z-score values. I'm new to Python; please help.
KM CC Doors Gears Quarterly_Tax Weight Guarantee_Period
46986 2000 3 5 210 1165 3
72937 2000 3 5 210 1165 3
38500 2000 3 5 210 1170 3
31461 1800 3 6 100 1185 12
32189 1800 3 6 100 1185 3
23000 1800 3 6 100 1185 3
18739 1800 3 6 100 1185 3
34000 1800 3 5 100 1185 3
21716 1600 3 5 85 1105 18
64359 1600 3 5 85 1105 3
67660 1600 3 5 85 1105 3
43905 1600 3 5 100 1170 3
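Something like this should do it for you. Note that stats.zscore works on the whole column at once, so pass it the column directly instead of calling it per element with apply:
from scipy import stats
df["KM"] = stats.zscore(df["KM"])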
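If SciPy is not available, the z-score can also be computed with pandas alone. The sketch below uses the first six KM values from the sample purely for illustration, and assumes the population standard deviation (ddof=0) is wanted, which is the default used by scipy.stats.zscore.

```python
import pandas as pd

# First few KM values from the sample data, used only for illustration
df = pd.DataFrame({"KM": [46986, 72937, 38500, 31461, 32189, 23000]})

km = df["KM"]
# z = (x - mean) / std; ddof=0 selects the population standard deviation
df["KM"] = (km - km.mean()) / km.std(ddof=0)
print(df)
```

pandas defaults to ddof=1 in std, so leaving the argument out would give slightly different values from SciPy.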
Using columns 1 and 2 as keys, I want to delete all rows in which the value increments by one.
Input:
1000 1001 140
1000 1002 140
1000 1003 140
1000 1004 140
1000 1005 140
1000 1006 140
1000 1201 140
1000 1202 140
1000 1203 140
1000 1204 140
1000 1205 140
2000 1002 140
2000 1003 140
2000 1004 140
2000 1005 140
2000 1006 140
Desired output:
1000 1001 140
1000 1006 140
1000 1201 140
1000 1205 140
2000 1002 140
2000 1006 140
I have tried
awk '{if (a[$1] < $2)a[$1]=$2;}END{for(i in a){print i,a[i];}}' <file>
But for some reason, it keeps only the maximum value.
Your problem statement doesn't describe your output. You want to print the first and last row of each contiguous range. Like this:
$ awk '$1 > A || $2 > B + 1 {
if(row){print row}; print}
{A=$1; B=$2; row=$0}
END {print}' dat
1000 1001 140
1000 1006 140
1000 1201 140
1000 1205 140
2000 1002 140
2000 1006 140
The basic problem is just to determine if a line is only 1 more than the prior one. The only way to do that is to have both lines to compare. By storing the value of each line as it's read, you can compare the current line to the prior one.
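The same compare-with-the-previous-line idea can be sketched in Python. The function name range_endpoints is hypothetical, and the sketch assumes a run continues only while column 1 is unchanged and column 2 is exactly one more than on the prior row, which is slightly stricter than the awk condition above.

```python
def range_endpoints(lines):
    """Return the first and last row of each contiguous run."""
    out = []
    prev = None      # (key, value, line) of the previous row
    run_len = 0      # number of rows in the current run
    for line in lines:
        key, val = line.split()[:2]
        val = int(val)
        contiguous = prev is not None and key == prev[0] and val == prev[1] + 1
        if not contiguous:
            if run_len > 1:
                out.append(prev[2])   # last row of the run that just ended
            out.append(line)          # first row of the new run
            run_len = 0
        run_len += 1
        prev = (key, val, line)
    if run_len > 1:
        out.append(prev[2])           # last row of the final run
    return out


rows = (
    [f"1000 {n} 140" for n in range(1001, 1007)]
    + [f"1000 {n} 140" for n in range(1201, 1206)]
    + [f"2000 {n} 140" for n in range(1002, 1007)]
)
for row in range_endpoints(rows):
    print(row)
```

A run consisting of a single row is emitted once here, since it is both the first and the last row of its range.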