Delete rows according to condition - linux

Using columns 1 and 2 as keys, I want to delete all rows where the value increments by one over the previous row.
Input:
1000 1001 140
1000 1002 140
1000 1003 140
1000 1004 140
1000 1005 140
1000 1006 140
1000 1201 140
1000 1202 140
1000 1203 140
1000 1204 140
1000 1205 140
2000 1002 140
2000 1003 140
2000 1004 140
2000 1005 140
2000 1006 140
Desired output:
1000 1001 140
1000 1006 140
1000 1201 140
1000 1205 140
2000 1002 140
2000 1006 140
I have tried:
awk '{if (a[$1] < $2)a[$1]=$2;}END{for(i in a){print i,a[i];}}' <file>
But for some reason, it keeps only the maximum value.

Your problem statement doesn't quite describe your desired output: what you want is to print the first and last row of each contiguous range. Like this:
$ awk '$1 > A || $2 > B + 1 {
    if (row) { print row }; print }
  { A = $1; B = $2; row = $0 }
  END { print }' dat
1000 1001 140
1000 1006 140
1000 1201 140
1000 1205 140
2000 1002 140
2000 1006 140
The basic problem is just to determine whether a line is only 1 more than the prior one. The only way to do that is to have both lines available to compare, so by storing each line as it's read, you can compare the current line to the prior one.
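For reference, here is a minimal Python sketch of the same store-the-previous-line idea (assuming the input file is named dat, as in the awk example):
# Print the first and last row of each contiguous run, where a run
# ends when column 1 increases or column 2 jumps by more than 1.
prev_key = None
prev_row = None
with open("dat") as f:
    for line in f:
        k1, k2 = map(int, line.split()[:2])
        if prev_key is None or k1 > prev_key[0] or k2 > prev_key[1] + 1:
            if prev_row is not None:
                print(prev_row, end="")  # last row of the previous run
            print(line, end="")          # first row of the new run
        prev_key, prev_row = (k1, k2), line
if prev_row is not None:
    print(prev_row, end="")              # last row of the final run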

Related

Retaining bad_lines identified by pandas in the output file instead of skipping those lines

I have to convert text files into CSVs after processing the contents of the text file as a pandas DataFrame.
Below is the code I am using. out_txt is my input text file and out_csv is my output CSV file.
df = pd.read_csv(out_txt, sep='\s', header=None, on_bad_lines='warn', encoding = "ANSI")
df = df.replace(r'[^\w\s]|_]/()|~"{}="', '', regex=True)
df.to_csv(out_csv, header=None)
If "on_bad_lines = 'warn'" is not decalred the csv files are not created. But if i use this condition those bad lines are getting skipped (obviously) with the warning
Skipping line 6: Expected 8 fields in line 7, saw 9. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
I would like to retain these bad lines in the CSV. Below are the contents of my input text file, with the detected bad lines among them; from this content I would also like to remove characters like #, &, (, ).
75062 220 8 6 110 220 250 <1
75063 260 5 2 584 878 950 <1
75064 810 <2 <2 456 598 3700 <1
75065 115 5 2 96 74 5000 <1
75066 976 <5 2 5 68 4200 <1
75067 22 210 4 348 140 4050 <1
75068 674 5 4 - 54 1130 3850 <1
75069 414 5 y) 446 6.6% 2350 <1
75070 458 <5 <2 548 82 3100 <1
75071 4050 <5 2 780 6430 3150 <1
75072 115 <7 <1 64 5.8% 4050 °#&4«x<i1
75073 456 <7 4 46 44 3900 <1
75074 376 <7 <2 348 3.8% 2150 <1
75075 378 <6 y) 30 40 2000 <1
I would split on whitespace afterwards with str.split rather than in read_csv:
df = (
    pd.read_csv(out_txt, header=None, encoding='ANSI')
      .replace(r'[^\w\s]|_]/()|~"{}="', '', regex=True)
      .squeeze().str.split(expand=True)
)
Another variant (skipping everything that comes in between the numbers):
df = (
    pd.read_csv(out_txt, header=None, encoding='ANSI')
      [0].str.findall(r"\b(\d+)\b")
      .str.join(' ').str.split(expand=True)
)
Output:
print(df)
0 1 2 3 4 5 6 7
0 375020 1060 115 38 440 350 7800 1
1 375021 920 80 26 310 290 5000 1
2 375022 1240 110 28 460 430 5900 1
3 375023 830 150 80 650 860 6200 1
4 375024 185 175 96 800 1020 2400 1
5 375025 680 370 88 1700 1220 172 1
6 375026 550 290 72 2250 1460 835 2
7 375027 390 120 60 1620 1240 158 1
8 375028 630 180 76 820 1360 180 1
9 375029 460 280 66 380 790 3600 1
10 375030 660 260 62 11180 1040 300 1
11 375031 530 200 84 1360 1060 555 1
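As a self-contained check, the first variant can be exercised on a couple of the sample lines, with io.StringIO standing in for the real file and only the first alternative of the question's regex (which already matches all punctuation); the actual code reads out_txt with encoding='ANSI':
import io
import pandas as pd

# Two of the sample lines, each read as a single raw string column.
sample = io.StringIO(
    "75062 220 8 6 110 220 250 <1\n"
    "75063 260 5 2 584 878 950 <1\n"
)
df = (
    pd.read_csv(sample, header=None)
      .replace(r'[^\w\s]', '', regex=True)  # strips punctuation such as < # & ( )
      .squeeze().str.split(expand=True)     # then splits on whitespace
)
print(df)  # eight clean columns per row, with '<1' reduced to '1'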

How to iterate through rows and add number increments based on whether a number in a column increases

I want to iterate through the rows in my data frame and replace the current number with an incrementing number (starting at 1000) every time the number changes.
Here is my dataframe column with the desired number next to it:
deal_pos_code
657 > 1000
680 > 1001
694 > 1002
694 > 1002
694 > 1002
694 > 1002
695 > 1003
695 > 1003
695 > 1003
695 > 1003
696 > 1004
696 > 1004
Update
I am new but this is what I have so far:
cv = df['deal_pos_code'].iloc[0]
nv = 1000
for i, row in mbn.iterrows():
    if mbn.row['deal_pos_code'] == cv:
        row['deal_pos_code'] = nv
    else:
        nv += 1
        cv = row['deal_pos_code']
        row['deal_pos_code'] = nv
I am getting an attribute error:
AttributeError: 'DataFrame' object has no attribute 'row'
Update: the bottom lines of the table have been fixed.
You can use diff to compare each row of your 'deal' column with the previous row; wherever the difference is not 0 (i.e. the value changes, in your case), cumsum() starts a new group number, and add(999) makes the numbering start at 1000:
df['pos_code'] = (df['deal'].diff() != 0).cumsum().add(999)
df
deal pos_code
0 657 1000
1 680 1001
2 694 1002
3 694 1002
4 694 1002
5 694 1002
6 695 1003
7 695 1003
8 695 1003
9 695 1003
10 696 1004
11 696 1004
As @Shubham pointed out, there's probably a typo in your last two rows.
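For a self-contained run of this approach on the question's values (using 'deal' as the column name, as in the answer):
import pandas as pd

df = pd.DataFrame({'deal': [657, 680, 694, 694, 694, 694,
                            695, 695, 695, 695, 696, 696]})
# diff() is NaN for the first row and non-zero wherever the value
# changes; each such change becomes True, cumsum() numbers the groups
# 1, 2, 3, ..., and add(999) shifts the numbering to start at 1000.
df['pos_code'] = (df['deal'].diff() != 0).cumsum().add(999)
print(df)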

Choosing the values in the column based on the maximum values of other column

I am selecting values in a pandas DataFrame.
I would like to choose values from the columns 'One_T', 'Two_T', 'Three_T' (the total counts) based on the ratio columns 'One_R', 'Two_R', 'Three_R'.
That is, the comparison is done on the columns 'One_R', 'Two_R', 'Three_R', and the values are taken from the columns 'One_T', 'Two_T', 'Three_T'.
I would like to find the highest value among the columns 'One_R', 'Two_R', 'Three_R' and put the corresponding value from 'One_T', 'Two_T', 'Three_T' into a new column 'Highest'.
For example, in the first row One_R is higher than Two_R and Three_R, so 'Highest' is filled with the value from One_T.
The initial DataFrame is test in the code below, and the desired result is result.
test = pd.DataFrame([[150,30,140,20,120,19],[170,31,130,30,180,22],[230,45,100,50,140,40],
                     [140,28,80,10,60,10],[100,25,80,27,50,23]],
                    index=['2019-01-01','2019-02-01','2019-03-01','2019-04-01','2019-05-01'],
                    columns=['One_T','One_R','Two_T','Two_R','Three_T','Three_R'])
One_T One_R Two_T Two_R Three_T Three_R
2019-01-01 150 30 140 20 120 19
2019-02-01 170 31 130 30 180 22
2019-03-01 230 45 100 50 140 40
2019-04-01 140 28 80 10 60 10
2019-05-01 100 25 80 27 50 23
result = pd.DataFrame([[150,30,140,20,120,19,150],[170,31,130,30,180,22,170],[230,45,100,50,140,40,100],
                       [140,28,80,10,60,10,140],[100,25,80,27,50,23,80]],
                      index=['2019-01-01','2019-02-01','2019-03-01','2019-04-01','2019-05-01'],
                      columns=['One_T','One_R','Two_T','Two_R','Three_T','Three_R','Highest'])
One_T One_R Two_T Two_R Three_T Three_R Highest
2019-01-01 150 30 140 20 120 19 150
2019-02-01 170 31 130 30 180 22 170
2019-03-01 230 45 100 50 140 40 100
2019-04-01 140 28 80 10 60 10 140
2019-05-01 100 25 80 27 50 23 80
Is there any way to do this?
Thank you for your time and consideration.
You can solve this using df.filter to select columns with the _R suffix, then idxmax. Then replace _R with _T and use df.lookup:
s = test.filter(like='_R').idxmax(1).str.replace('_R','_T')
test['Highest'] = test.lookup(s.index,s)
print(test)
One_T One_R Two_T Two_R Three_T Three_R Highest
2019-01-01 150 30 140 20 120 19 150
2019-02-01 170 31 130 30 180 22 170
2019-03-01 230 45 100 50 140 40 100
2019-04-01 140 28 80 10 60 10 140
2019-05-01 100 25 80 27 50 23 80
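Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; on current pandas an equivalent sketch indexes the underlying NumPy array by position instead:
import numpy as np

# Pick the _R column holding each row's maximum, rename it to its _T
# counterpart, then fetch those cells by (row, column) position.
s = test.filter(like='_R').idxmax(axis=1).str.replace('_R', '_T')
test['Highest'] = test.to_numpy()[np.arange(len(test)),
                                  test.columns.get_indexer(s)]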

Please suggest approaches and code to solve the defined problem statement

x y z amount absolute_amount
121 abc def 500 500
131 fgh xyz -800 800
121 abc xyz 900 900
131 fgh ijk 800 800
141 obc pqr 500 500
151 mbr pqr -500 500
141 obc pqr -500 500
151 mbr pqr 900 900
I need to find the duplicate rows in the dataset where x and y are the same, with the conditions being:
sum(amount) != 0
abs(sum(amount)) != absolute_amount
I tried grouping them; the code I used in R works, but I need it to work in Python:
logic1 <- tablename %>%
  group_by('x', 'y') %>%
  filter(n() > 1 && sum(`amount`) != 0 && abs(sum(`amount`)) != absolute_amount)
Expected output
x y z amount absolute_amount
121 abc def 500 500
121 abc xyz 900 900
151 mbr pqr -500 500
151 mbr pqr 900 900
Use groupby.transform('sum') to get the group sum aligned to each row, then apply the two conditions you have:
c=df.groupby(['x','y'])['amount'].transform('sum')
df[c.ne(0) & c.abs().ne(df.absolute_amount)]
x y z amount absolute_amount
0 121 abc def 500 500
2 121 abc xyz 900 900
5 151 mbr pqr -500 500
7 151 mbr pqr 900 900
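A self-contained run on the question's data reproduces the expected output:
import pandas as pd

df = pd.DataFrame({
    'x': [121, 131, 121, 131, 141, 151, 141, 151],
    'y': ['abc', 'fgh', 'abc', 'fgh', 'obc', 'mbr', 'obc', 'mbr'],
    'z': ['def', 'xyz', 'xyz', 'ijk', 'pqr', 'pqr', 'pqr', 'pqr'],
    'amount': [500, -800, 900, 800, 500, -500, -500, 900],
    'absolute_amount': [500, 800, 900, 800, 500, 500, 500, 900],
})
# Group sum aligned back to every row: (131, fgh) and (141, obc) sum
# to 0 and drop out; (121, abc) sums to 1400 and (151, mbr) to 400,
# neither of which matches any row's absolute_amount.
c = df.groupby(['x', 'y'])['amount'].transform('sum')
print(df[c.ne(0) & c.abs().ne(df.absolute_amount)])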

sumproduct using different criteria

I have the Excel table below and I would like to calculate the total per company per department per year. I used:
=SUMPRODUCT(--($A$2:$A$9=A12),--($B$2:$B$9=B12)*$C$2:$F$9)
but it doesn't seem to work.
A B C D E F
1 COMPANY DEPART. QUARTER 1 QUARTER 2 QUARTER 3 QUARTER 4
2 AB PRO 123 223 3354 556
3 CD PIV 222 235 223 568
4 CD PRO 236 254 184 223
5 AB STA 254 221 96 265
6 EF PIV 254 112 485 256
7 CD STA 558 185 996 231
8 GH PRO 548 696 698 895
9 AB PRO 148 254 318 229
10
11 TOTAL PER COMPANY PER DEPARTMENT PER YEAR:
12 AB PRO =
Assuming that in row 12, column A = AB and column B = PRO, then:
=SUMPRODUCT((A2:A9=A12)*(B2:B9=B12) *C2:F9)
Example:
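For the sample data, rows 2 and 9 match COMPANY = AB and DEPART. = PRO, so the formula sums all four quarters of both rows: 123 + 223 + 3354 + 556 + 148 + 254 + 318 + 229 = 5205.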
