Please suggest approaches and code to solve the defined problem statement - python-3.x

x y z amount absolute_amount
121 abc def 500 500
131 fgh xyz -800 800
121 abc xyz 900 900
131 fgh ijk 800 800
141 obc pqr 500 500
151 mbr pqr -500 500
141 obc pqr -500 500
151 mbr pqr 900 900
I need to find the duplicate rows in the dataset where x and y are the same, with the conditions being:
sum(amount) != 0
abs(sum(amount)) != absolute_amount
I tried grouping them; the code I used in R works, but I need it to work in Python:
logic1 <- tablename %>%
  group_by(x, y) %>%
  filter(n() > 1 & sum(amount) != 0 & abs(sum(amount)) != absolute_amount)
Expected output
x y z amount absolute_amount
121 abc def 500 500
121 abc xyz 900 900
151 mbr pqr -500 500
151 mbr pqr 900 900

Use transform with groupby.sum() to broadcast each group's sum back to its rows, and then compare against your two conditions:
c = df.groupby(['x', 'y'])['amount'].transform('sum')
df[c.ne(0) & c.abs().ne(df.absolute_amount)]
x y z amount absolute_amount
0 121 abc def 500 500
2 121 abc xyz 900 900
5 151 mbr pqr -500 500
7 151 mbr pqr 900 900
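For completeness, here is a self-contained sketch of the same approach that also carries over the n() > 1 duplicate check from the R code as a size transform (the answer above can omit it because every (x, y) pair in the sample occurs twice anyway):
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "x": [121, 131, 121, 131, 141, 151, 141, 151],
    "y": ["abc", "fgh", "abc", "fgh", "obc", "mbr", "obc", "mbr"],
    "z": ["def", "xyz", "xyz", "ijk", "pqr", "pqr", "pqr", "pqr"],
    "amount": [500, -800, 900, 800, 500, -500, -500, 900],
    "absolute_amount": [500, 800, 900, 800, 500, 500, 500, 900],
})

g = df.groupby(["x", "y"])["amount"]
s = g.transform("sum")     # per-group sum, broadcast to every row
n = g.transform("size")    # per-group row count, i.e. the n() > 1 check

print(df[(n > 1) & s.ne(0) & s.abs().ne(df["absolute_amount"])])
This prints the four expected rows (indices 0, 2, 5 and 7).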

Related

Retaining bad_lines identified by pandas in the output file instead of skipping those lines

I have to convert text files into CSVs after processing the contents of each text file as a pandas DataFrame.
Below is the code I am using; out_txt is my input text file and out_csv is my output CSV file.
df = pd.read_csv(out_txt, sep='\s', header=None, on_bad_lines='warn', encoding = "ANSI")
df = df.replace(r'[^\w\s]|_]/()|~"{}="', '', regex=True)
df.to_csv(out_csv, header=None)
If on_bad_lines='warn' is not declared, the CSV files are not created. But if I use this option, the bad lines are (obviously) skipped with the warning:
Skipping line 6: Expected 8 fields in line 7, saw 9. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
I would like to retain these bad lines in the CSV. I have highlighted the detected bad lines in the image below (my input text file).
Below are the contents of the text file that gets saved. From this content I would like to remove characters like #, &, (, ).
75062 220 8 6 110 220 250 <1
75063 260 5 2 584 878 950 <1
75064 810 <2 <2 456 598 3700 <1
75065 115 5 2 96 74 5000 <1
75066 976 <5 2 5 68 4200 <1
75067 22 210 4 348 140 4050 <1
75068 674 5 4 - 54 1130 3850 <1
75069 414 5 y) 446 6.6% 2350 <1
75070 458 <5 <2 548 82 3100 <1
75071 4050 <5 2 780 6430 3150 <1
75072 115 <7 <1 64 5.8% 4050 °#&4«x<i1
75073 456 <7 4 46 44 3900 <1
75074 376 <7 <2 348 3.8% 2150 <1
75075 378 <6 y) 30 40 2000 <1
I would read each line whole and split on whitespace later with str.split, rather than having read_csv split:
df = (
    pd.read_csv(out_txt, header=None, encoding='ANSI')
      .replace(r'[^\w\s]|_]/()|~"{}="', '', regex=True)
      .squeeze()
      .str.split(expand=True)
)
Another variant (skipping everything that comes in between the numbers):
df = (
    pd.read_csv(out_txt, header=None, encoding='ANSI')
      [0].str.findall(r"\b(\d+)\b")   # keep only the runs of digits
      .str.join(" ")
      .str.split(expand=True)
)
Output:
print(df)
0 1 2 3 4 5 6 7
0 375020 1060 115 38 440 350 7800 1
1 375021 920 80 26 310 290 5000 1
2 375022 1240 110 28 460 430 5900 1
3 375023 830 150 80 650 860 6200 1
4 375024 185 175 96 800 1020 2400 1
5 375025 680 370 88 1700 1220 172 1
6 375026 550 290 72 2250 1460 835 2
7 375027 390 120 60 1620 1240 158 1
8 375028 630 180 76 820 1360 180 1
9 375029 460 280 66 380 790 3600 1
10 375030 660 260 62 11180 1040 300 1
11 375031 530 200 84 1360 1060 555 1
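As a minimal, self-contained illustration of the same idea (the cleanup pattern here only strips the characters the question mentions and is an assumption, not the answer's exact regex): because each line is read as a single field, read_csv never has to guess a column count, so nothing is skipped, and the split into columns happens afterwards:
import io
import pandas as pd

# Two of the lines shown above; the first has an extra token.
raw = io.StringIO(
    "75068 674 5 4 - 54 1130 3850 <1\n"
    "75069 414 5 y) 446 6.6% 2350 <1\n"
)

lines = pd.read_csv(raw, header=None)[0]                  # each whole line as one string
cleaned = lines.str.replace(r"[#&()]", "", regex=True)    # drop only #, &, ( and ) here
df = cleaned.str.split(expand=True)                       # split on whitespace afterwards
print(df)   # both rows are kept; the shorter one is padded with None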

How to define selection using index function in J

Let's assume I have the following tensor t:
]m=: 100 + 4 4 $ i.16
100 101 102 103
104 105 106 107
108 109 110 111
112 113 114 115
]t=: (m ,: m+100) , m+200
100 101 102 103
104 105 106 107
108 109 110 111
112 113 114 115
200 201 202 203
204 205 206 207
208 209 210 211
212 213 214 215
300 301 302 303
304 305 306 307
308 309 310 311
312 313 314 315
I would like to select the diagonal plane of it, so:
100 105 110 115
200 205 210 215
300 305 310 315
How do I define a function that acts on indices (here, for any plane, choosing ix(row) = ix(column))? Also, how do I define functions working on values and indices together? I would be interested in having something like this:
(f t) { t
Thanks!
Transpose (x |: y) with a boxed left argument runs the selected axes together into a single axis. You can use this to produce a rather idiomatic solution:
(< 0 1) |: m
100 105 110 115
(<0 1) |:"2 t
100 105 110 115
200 205 210 215
300 305 310 315
where you use the rank conjunction " to apply the diagonal selection to each 2-cell of t.
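For comparison only (this is a NumPy illustration, not part of the J answer), the same diagonal-plane selection looks like this in Python:
import numpy as np

m = 100 + np.arange(16).reshape(4, 4)
t = np.stack([m, m + 100, m + 200])      # shape (3, 4, 4), like the tensor t above

print(t.diagonal(axis1=1, axis2=2))      # diagonal of the last two axes of each plane
# [[100 105 110 115]
#  [200 205 210 215]
#  [300 305 310 315]]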
You can convert an array of values to its corresponding array of indices with (#:i.)@$ m.
To get an example f "working on values and indices together", you can then plug it in as a dyad that takes values on the left and indices on the right:
f=.(2|[) +. ([:=/"1]) NB. odd value or diagonal index
]r=.([ f (#:i.)@$) m NB. values f indices
1 1 0 1
0 1 0 1
0 1 1 1
0 1 0 1
r #&, m NB. flatten lists & get values where bit is set
100 101 103 105 107 109 110 111 113 115
Everything can be wrapped into an adverb that is then applied to f:
sel=. 1 : '#~&, [ u (#:i.)@$'
f sel m
100 101 103 105 107 109 110 111 113 115
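Again purely as a NumPy cross-check (not part of the J answer), the "odd value or diagonal index" predicate can be written with an explicit index array and a boolean mask:
import numpy as np

m = 100 + np.arange(16).reshape(4, 4)

rows, cols = np.indices(m.shape)        # one index array per axis
mask = (m % 2 == 1) | (rows == cols)    # odd value or on the diagonal

print(mask.astype(int))                 # same bit pattern as r above
print(m[mask])                          # 100 101 103 105 107 109 110 111 113 115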

Pandas Convert Multiple Columns to Rows

I have a pandas dataframe as follows:
code title amount_1 amount_2 currency_1 currency_2
0 246 ex 500 550 USD GBP
1 300 am 200 250 USD GBP
2 315 ple 300 325 USD GBP
I'd like to get this into the format
code title amount currency
246 ex 500 USD
246 ex 550 GBP
All of the currencies are the same. How can I get this format? I've tried using melt and reset_index, but neither seemed to do exactly what I need.
Thank you
Use wide_to_long:
df1 = pd.wide_to_long(df,
                      stubnames=['amount', 'currency'],
                      i=['code', 'title'],
                      j='measure', sep='_').reset_index()
print(df1)
code title measure amount currency
0 246 ex 1 500 USD
1 246 ex 2 550 GBP
2 300 am 1 200 USD
3 300 am 2 250 GBP
4 315 ple 1 300 USD
5 315 ple 2 325 GBP
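A self-contained version of the same approach, with the sample frame built inline; dropping the helper measure column afterwards leaves exactly the four requested columns:
import pandas as pd

df = pd.DataFrame({
    "code": [246, 300, 315],
    "title": ["ex", "am", "ple"],
    "amount_1": [500, 200, 300],
    "amount_2": [550, 250, 325],
    "currency_1": ["USD", "USD", "USD"],
    "currency_2": ["GBP", "GBP", "GBP"],
})

out = (pd.wide_to_long(df, stubnames=["amount", "currency"],
                       i=["code", "title"], j="measure", sep="_")
         .reset_index()
         .drop(columns="measure"))
print(out)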

Delete rows according to condition

Using columns 1 and 2 as the key, I want to delete all rows in which the value increments by one.
Input:
1000 1001 140
1000 1002 140
1000 1003 140
1000 1004 140
1000 1005 140
1000 1006 140
1000 1201 140
1000 1202 140
1000 1203 140
1000 1204 140
1000 1205 140
2000 1002 140
2000 1003 140
2000 1004 140
2000 1005 140
2000 1006 140
Desired output:
1000 1001 140
1000 1006 140
1000 1201 140
1000 1205 140
2000 1002 140
2000 1006 140
I have tried
awk '{if (a[$1] < $2)a[$1]=$2;}END{for(i in a){print i,a[i];}}' <file>
But for some reason, it keeps only the maximum value.
Your problem statement doesn't describe your output. You want to print the first and last row of each contiguous range. Like this:
$ awk '$1 > A || $2 > B + 1 {
           if (row) { print row }; print
       }
       { A = $1; B = $2; row = $0 }
       END { print }' dat
1000 1001 140
1000 1006 140
1000 1201 140
1000 1205 140
2000 1002 140
2000 1006 140
The basic problem is just to determine whether a line is only 1 more than the prior one. The only way to do that is to have both lines available to compare. By storing the value of each line as it is read, you can compare the current line to the prior one.
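If you would rather do this in pandas, a rough equivalent (the file name and column names here are placeholders) is to mark the start of each run and keep every run's first and last row:
import pandas as pd

df = pd.read_csv("file.txt", sep=r"\s+", header=None, names=["k1", "k2", "v"])

# A new run starts when the key changes or column 2 does not increase by exactly 1.
new_run = (df["k1"] != df["k1"].shift()) | (df["k2"] != df["k2"].shift() + 1)
run_id = new_run.cumsum()

# First and last row of every run (a run of length 1 contributes a single row).
g = df.groupby(run_id)
out = pd.concat([g.head(1), g.tail(1)])
out = out[~out.index.duplicated()].sort_index()
print(out.to_string(index=False, header=False))
On the input above this prints the same six rows as the awk command.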

sumproduct using different criteria

I have the Excel table below and I would like to calculate the total per company per department per year. I used:
=SUMPRODUCT(--($A$2:$A$9=A12),--($B$2:$B$9=B12)*$C$2:$F$9)
but it doesn't seem to work.
     A        B        C          D          E          F
1    COMPANY  DEPART.  QUARTER 1  QUARTER 2  QUARTER 3  QUARTER 4
2    AB       PRO      123        223        3354       556
3    CD       PIV      222        235        223        568
4    CD       PRO      236        254        184        223
5    AB       STA      254        221        96         265
6    EF       PIV      254        112        485        256
7    CD       STA      558        185        996        231
8    GH       PRO      548        696        698        895
9    AB       PRO      148        254        318        229
10
11   TOTAL PER COMPANY PER DEPARTMENT PER YEAR:
12   AB       PRO      =
Assuming that in row 12, column A = AB and column B = PRO, then:
=SUMPRODUCT((A2:A9=A12)*(B2:B9=B12)*C2:F9)
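As a cross-check outside Excel (illustrative only, assuming the table is loaded into a pandas DataFrame with the column names shown), the same aggregation is:
import pandas as pd

df = pd.DataFrame({
    "COMPANY": ["AB", "CD", "CD", "AB", "EF", "CD", "GH", "AB"],
    "DEPART.": ["PRO", "PIV", "PRO", "STA", "PIV", "STA", "PRO", "PRO"],
    "QUARTER 1": [123, 222, 236, 254, 254, 558, 548, 148],
    "QUARTER 2": [223, 235, 254, 221, 112, 185, 696, 254],
    "QUARTER 3": [3354, 223, 184, 96, 485, 996, 698, 318],
    "QUARTER 4": [556, 568, 223, 265, 256, 231, 895, 229],
})

quarters = ["QUARTER 1", "QUARTER 2", "QUARTER 3", "QUARTER 4"]
mask = (df["COMPANY"] == "AB") & (df["DEPART."] == "PRO")
print(df.loc[mask, quarters].to_numpy().sum())   # 5205 for this sample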