I have to convert text files to CSVs after processing the contents of each text file as a pandas DataFrame.
Below is the code I am using; out_txt is my input text file and out_csv is my output CSV file.
df = pd.read_csv(out_txt, sep='\s', header=None, on_bad_lines='warn', encoding = "ANSI")
df = df.replace(r'[^\w\s]|_]/()|~"{}="', '', regex=True)
df.to_csv(out_csv, header=None)
If "on_bad_lines = 'warn'" is not decalred the csv files are not created. But if i use this condition those bad lines are getting skipped (obviously) with the warning
Skipping line 6: Expected 8 fields in line 7, saw 9. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
I would like to retain these bad lines in the CSV. I have highlighted the bad lines detected in the image below (my input text file).
Below are the contents of the text file that is getting saved. From this content I would like to remove characters like #, &, (, ).
75062 220 8 6 110 220 250 <1
75063 260 5 2 584 878 950 <1
75064 810 <2 <2 456 598 3700 <1
75065 115 5 2 96 74 5000 <1
75066 976 <5 2 5 68 4200 <1
75067 22 210 4 348 140 4050 <1
75068 674 5 4 - 54 1130 3850 <1
75069 414 5 y) 446 6.6% 2350 <1
75070 458 <5 <2 548 82 3100 <1
75071 4050 <5 2 780 6430 3150 <1
75072 115 <7 <1 64 5.8% 4050 °#&4«x<i1
75073 456 <7 4 46 44 3900 <1
75074 376 <7 <2 348 3.8% 2150 <1
75075 378 <6 y) 30 40 2000 <1
I would split on \s afterwards with str.split rather than inside read_csv:
df = (
    pd.read_csv(out_txt, header=None, encoding='ANSI')   # read each line as a single column
    .replace(r'[^\w\s]|_]/()|~"{}="', '', regex=True)    # strip the unwanted characters
    .squeeze().str.split(expand=True)                    # split on whitespace into separate columns
)
Another variant (skipping everything that comes in-between the numbers):
df = (
    pd.read_csv(out_txt, header=None, encoding='ANSI')
    [0].str.findall(r"\b(\d+)\b")    # keep only the runs of digits
    .str.join(' ')                   # re-assemble one string per row
    .str.split(expand=True)          # split into columns
)
Output:
print(df)
0 1 2 3 4 5 6 7
0 375020 1060 115 38 440 350 7800 1
1 375021 920 80 26 310 290 5000 1
2 375022 1240 110 28 460 430 5900 1
3 375023 830 150 80 650 860 6200 1
4 375024 185 175 96 800 1020 2400 1
5 375025 680 370 88 1700 1220 172 1
6 375026 550 290 72 2250 1460 835 2
7 375027 390 120 60 1620 1240 158 1
8 375028 630 180 76 820 1360 180 1
9 375029 460 280 66 380 790 3600 1
10 375030 660 260 62 11180 1040 300 1
11 375031 530 200 84 1360 1060 555 1
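Once the frame is parsed with either variant, you can write it back out much as in your original snippet (a minimal sketch reusing your out_csv variable; header=False and index=False keep the auto-generated column names and row index out of the file):
df.to_csv(out_csv, header=False, index=False)   # write the cleaned, split frame to CSV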
I want to be able to iterate through the rows in my data frame and replace the current number with an incremental number (starting at 1000) that increases by 1 every time the number changes.
Here is my dataframe column with the desired number next to it:
deal_pos_code
657 > 1000
680 > 1001
694 > 1002
694 > 1002
694 > 1002
694 > 1002
695 > 1003
695 > 1003
695 > 1003
695 > 1003
696 > 1004
696 > 1004
Update
I am new but this is what I have so far:
cv = df['deal_pos_code'].iloc[0]
nv = 1000
for i, row in mbn.iterrows():
    if mbn.row['deal_pos_code'] == cv:
        row['deal_pos_code'] = nv
    else:
        nv += 1
        cv = row['deal_pos_code']
        row['deal_pos_code'] = nv
I am getting an attribute error:
AttributeError: 'DataFrame' object has no attribute 'row'
Update: the bottom lines of the table have been fixed.
You can check the difference between each row of your 'deal' column and the previous row using diff; wherever it is not 0 (i.e. the value increases, in your case), take the cumulative sum with cumsum() and then add(999):
df['pos_code'] = (df['deal'].diff() != 0).cumsum().add(999)
df
deal pos_code
0 657 1000
1 680 1001
2 694 1002
3 694 1002
4 694 1002
5 694 1002
6 695 1003
7 695 1003
8 695 1003
9 695 1003
10 696 1004
11 696 1004
As @Shubham pointed out, there's probably a typo in your last two rows.
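If you do want to keep your explicit loop, here is a sketch of a corrected version (assuming your frame is called mbn, as in your snippet). Note that assigning to the row yielded by iterrows() does not write back to the DataFrame, so df.at is used instead; the vectorized diff/cumsum approach above is still preferable:
cv = mbn['deal_pos_code'].iloc[0]    # value currently being tracked
nv = 1000                            # code to assign
for i, row in mbn.iterrows():
    if row['deal_pos_code'] != cv:   # value changed: move on to the next code
        nv += 1
        cv = row['deal_pos_code']
    mbn.at[i, 'deal_pos_code'] = nv  # write back into the DataFrame itself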
I am trying to choose values in a Pandas DataFrame.
I would like to choose the values in the columns 'One_T', 'Two_T', 'Three_T' (the total counts) based on the ratio columns 'One_R', 'Two_R', 'Three_R'.
The comparison is made on the columns 'One_R', 'Two_R', 'Three_R', and the values to pick come from the columns 'One_T', 'Two_T', 'Three_T'.
I would like to find the highest value among the columns 'One_R', 'Two_R', 'Three_R' and put the corresponding value from 'One_T', 'Two_T', 'Three_T' into a new column 'Highest'.
For example, the first row has a higher value in One_R than in Two_R or Three_R.
So the value from One_T should fill the new column named Highest.
The initial DataFrame is test in the code below, and the desired outcome is result.
test = pd.DataFrame([[150,30,140,20,120,19],[170,31,130,30,180,22],[230,45,100,50,140,40],
[140,28,80,10,60,10],[100,25,80,27,50,23]], index=['2019-01-01','2019-02-01','2019-03-01','2019-04-01','2019-05-01'],
columns=['One_T','One_R','Two_T','Two_R','Three_T','Three_R'])
One_T One_R Two_T Two_R Three_T Three_R
2019-01-01 150 30 140 20 120 19
2019-02-01 170 31 130 30 180 22
2019-03-01 230 45 100 50 140 40
2019-04-01 140 28 80 10 60 10
2019-05-01 100 25 80 27 50 23
result = pd.DataFrame([[150,30,140,20,120,19,150],[170,31,130,30,180,22,170],[230,45,100,50,140,40,100],
[140,28,80,10,60,10,140],[100,25,80,27,50,23,80]], index=['2019-01-01','2019-02-01','2019-03-01','2019-04-01','2019-05-01'],
columns=['One_T','One_R','Two_T','Two_R','Three_T','Three_R','Highest'])
One_T One_R Two_T Two_R Three_T Three_R Highest
2019-01-01 150 30 140 20 120 19 150
2019-02-01 170 31 130 30 180 22 170
2019-03-01 230 45 100 50 140 40 100
2019-04-01 140 28 80 10 60 10 140
2019-05-01 100 25 80 27 50 23 80
Is there any way to do this?
Thank you for your time and consideration.
You can solve this using df.filter to select columns with the _R suffix, then idxmax. Then replace _R with _T and use df.lookup:
s = test.filter(like='_R').idxmax(1).str.replace('_R','_T')
test['Highest'] = test.lookup(s.index,s)
print(test)
One_T One_R Two_T Two_R Three_T Three_R Highest
2019-01-01 150 30 140 20 120 19 150
2019-02-01 170 31 130 30 180 22 170
2019-03-01 230 45 100 50 140 40 100
2019-04-01 140 28 80 10 60 10 140
2019-05-01 100 25 80 27 50 23 80
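Note that DataFrame.lookup has since been deprecated and removed in recent pandas versions; if lookup is not available to you, here is a sketch of an equivalent using numpy integer indexing (same s as above):
import numpy as np
s = test.filter(like='_R').idxmax(axis=1).str.replace('_R', '_T')
col_pos = test.columns.get_indexer(s)                             # integer position of the chosen column per row
test['Highest'] = test.to_numpy()[np.arange(len(test)), col_pos]  # pick that column's value in each row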
x y z amount absolute_amount
121 abc def 500 500
131 fgh xyz -800 800
121 abc xyz 900 900
131 fgh ijk 800 800
141 obc pqr 500 500
151 mbr pqr -500 500
141 obc pqr -500 500
151 mbr pqr 900 900
I need to find the duplicate rows in the dataset where x and y are the same, with the conditions being:
sum(amount) !=0
abs(sum(amount)) != absolute_amount
I tried grouping them; the code I used in R works, but I need it to work in Python:
logic1 <- tablename %>%
group_by('x','y')%>%
filter(n()>1 && sum(`amount`) != 0 && abs(sum(`amount`)) != absolute_amount)
Expected output
x y z amount absolute_amount
121 abc def 500 500
121 abc xyz 900 900
151 mbr pqr -500 500
151 mbr pqr 900 900
Use transform with groupby sum to get the group-wise sum broadcast back to each row, then apply the two conditions you have:
c=df.groupby(['x','y'])['amount'].transform('sum')
df[c.ne(0) & c.abs().ne(df.absolute_amount)]
x y z amount absolute_amount
0 121 abc def 500 500
2 121 abc xyz 900 900
5 151 mbr pqr -500 500
7 151 mbr pqr 900 900
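One difference from your R filter: the n() > 1 condition is not checked above. If you also want to keep only groups that contain more than one row, you can add a size transform (a sketch building on the same groupby):
g = df.groupby(['x','y'])['amount']
c = g.transform('sum')     # group-wise sum of amount, aligned to each row
n = g.transform('size')    # number of rows in each (x, y) group
df[(n > 1) & c.ne(0) & c.abs().ne(df.absolute_amount)]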
I have the Excel table below and I would like to calculate the total per company per department per year. I used:
=SUMPRODUCT(--($A$2:$A$9=A12),--($B$2:$B$9=B12)*$C$2:$F$9)
It doesn't seem to work.
A B C D E F
1 COMPANY DEPART. QUARTER 1 QUARTER 2 QUARTER 3 QUARTER 4
2 AB PRO 123 223 3354 556
3 CD PIV 222 235 223 568
4 CD PRO 236 254 184 223
5 AB STA 254 221 96 265
6 EF PIV 254 112 485 256
7 CD STA 558 185 996 231
8 GH PRO 548 696 698 895
9 AB PRO 148 254 318 229
10
11 TOAL PER COMPANY PER DEPARTAMENT PER YEAR:
12 AB PRO =
Assuming that in row 12, col A = AB and col B = PRO, multiply the conditions into the value range rather than passing them as separate arguments (the comma-separated form fails because the arrays have different shapes, 8x1 vs 8x4, and SUMPRODUCT requires its arguments to have identical dimensions):
=SUMPRODUCT((A2:A9=A12)*(B2:B9=B12) *C2:F9)
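As a quick sanity check against the sample data above: the rows matching AB / PRO are rows 2 and 9, so the formula should return (123 + 223 + 3354 + 556) + (148 + 254 + 318 + 229) = 4256 + 949 = 5205.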