I have the Excel table below and I would like to calculate the total per company per department per year. I used:
=SUMPRODUCT(--($A$2:$A$9=A12),--($B$2:$B$9=B12)*$C$2:$F$9)
but it doesn't seem to work.
A B C D E F
1 COMPANY DEPART. QUARTER 1 QUARTER 2 QUARTER 3 QUARTER 4
2 AB PRO 123 223 3354 556
3 CD PIV 222 235 223 568
4 CD PRO 236 254 184 223
5 AB STA 254 221 96 265
6 EF PIV 254 112 485 256
7 CD STA 558 185 996 231
8 GH PRO 548 696 698 895
9 AB PRO 148 254 318 229
10
11 TOTAL PER COMPANY PER DEPARTMENT PER YEAR:
12 AB PRO =
Assuming that in Row 12, Col A = AB and Col B = PRO, then:
=SUMPRODUCT((A2:A9=A12)*(B2:B9=B12)*C2:F9)
Your original formula doesn't work because SUMPRODUCT's comma-separated arguments must all have the same dimensions: the first array here is 8 rows × 1 column, while the second, --($B$2:$B$9=B12)*$C$2:$F$9, evaluates to 8 × 4. Multiplying all three terms inside a single argument lets them broadcast correctly.
Example:
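With the sample data, rows 2 and 9 match AB/PRO, so the formula returns 123+223+3354+556 + 148+254+318+229 = 5205.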
I have to convert text files into CSVs after processing the contents of the text file as a pandas dataframe.
Below is the code I am using. out_txt is my input text file and out_csv is my output CSV file.
df = pd.read_csv(out_txt, sep='\s', header=None, on_bad_lines='warn', encoding='ANSI')
df = df.replace(r'[^\w\s]|_]/()|~"{}="', '', regex=True)
df.to_csv(out_csv, header=None)
If "on_bad_lines = 'warn'" is not decalred the csv files are not created. But if i use this condition those bad lines are getting skipped (obviously) with the warning
Skipping line 6: Expected 8 fields in line 7, saw 9. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
I would like to retain these bad lines in the CSV. Below are the contents of my input text file; the bad lines are the ones containing stray characters. In this content I would like to remove characters like #, &, (, ).
75062 220 8 6 110 220 250 <1
75063 260 5 2 584 878 950 <1
75064 810 <2 <2 456 598 3700 <1
75065 115 5 2 96 74 5000 <1
75066 976 <5 2 5 68 4200 <1
75067 22 210 4 348 140 4050 <1
75068 674 5 4 - 54 1130 3850 <1
75069 414 5 y) 446 6.6% 2350 <1
75070 458 <5 <2 548 82 3100 <1
75071 4050 <5 2 780 6430 3150 <1
75072 115 <7 <1 64 5.8% 4050 °#&4«x<i1
75073 456 <7 4 46 44 3900 <1
75074 376 <7 <2 348 3.8% 2150 <1
75075 378 <6 y) 30 40 2000 <1
I would read each line whole and split on \s afterwards with str.split rather than in read_csv; since this data contains no commas, the default separator leaves every ragged line intact as a single string, so nothing gets skipped:
df = (
    pd.read_csv(out_txt, header=None, encoding='ANSI')
      .replace(r'[^\w\s]|_]/()|~"{}="', '', regex=True)
      .squeeze().str.split(expand=True)
)
Another variant (skipping everything that comes in between the numbers):
df = (
    pd.read_csv(out_txt, header=None, encoding='ANSI')
      [0].str.findall(r"\b(\d+)\b")          # one list of numbers per line
      .str.join(' ').str.split(expand=True)  # expand each list into columns
)
Output:
print(df)
0 1 2 3 4 5 6 7
0 375020 1060 115 38 440 350 7800 1
1 375021 920 80 26 310 290 5000 1
2 375022 1240 110 28 460 430 5900 1
3 375023 830 150 80 650 860 6200 1
4 375024 185 175 96 800 1020 2400 1
5 375025 680 370 88 1700 1220 172 1
6 375026 550 290 72 2250 1460 835 2
7 375027 390 120 60 1620 1240 158 1
8 375028 630 180 76 820 1360 180 1
9 375029 460 280 66 380 790 3600 1
10 375030 660 260 62 11180 1040 300 1
11 375031 530 200 84 1360 1060 555 1
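Either way, the cleaned frame can then be written back out as in the question (a sketch; out_csv is the question's output path, and index=False keeps the row index out of the file):
df.to_csv(out_csv, header=False, index=False)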
I have two dataframes that need to be mapped (or joined?) based on some condition. These are the dataframes:
df_1
img_names img_array
0 1_rel 253
1 1_rel_right 255
2 1_rel_top 250
3 4_rel 180
4 4_rel_right 182
5 4_rel_top 189
6 7_rel 217
7 7_rel_right 183
8 7_rel_top 196
df_2
List_No time
0 1 38
1 4 23
2 7 32
After mapping I would like to get the following dataframe:
df_3
img_names img_array List_No time
0 1_rel 253 1 38
1 1_rel_right 255 1 38
2 1_rel_top 250 1 38
3 4_rel 180 4 23
4 4_rel_right 182 4 23
5 4_rel_top 189 4 23
6 7_rel 217 7 32
7 7_rel_right 183 7 32
8 7_rel_top 196 7 32
Basically, each row of df_2 is repeated 3 times to match the number of rows in df_1, and the mapping (if we can say so) is done via the split string in each row of df_1's img_names column. The names in img_names may differ, but each of them always starts with some number (1, 4, 7 in this case) followed by an underscore, etc. So I need to split out the corresponding number in each row and map it to the rows of List_No.
I hope the example above is clear.
Thank you.
Looks like you can just extract the digit parts and merge:
df_1['List_No'] = df_1['img_names'].str.split('_').str[0].astype(int)
df_3 = df_1.merge(df_2, on='List_No')
Output:
img_names img_array List_No time
0 1_rel 253 1 38
1 1_rel_right 255 1 38
2 1_rel_top 250 1 38
3 4_rel 180 4 23
4 4_rel_right 182 4 23
5 4_rel_top 189 4 23
6 7_rel 217 7 32
7 7_rel_right 183 7 32
8 7_rel_top 196 7 32
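If some prefixes could be missing from df_2, a left merge (a sketch) would keep those df_1 rows with NaN time instead of dropping them:
df_3 = df_1.merge(df_2, on='List_No', how='left')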
An alternative to #QuangHoang's answer (which I believe you should pick, as it is more robust). This uses the map method, and assumes every List_No extracted from df_1 is present in df_2:
df_1.assign(
    List_No=df_1.img_names.str.extract(r"(\d+)", expand=False).astype(int),
    time=lambda x: x.List_No.map(df_2.set_index("List_No")["time"]),
)
(\d+ keeps multi-digit prefixes intact, and set_index makes List_No the lookup key for map rather than df_2's positional index.)
img_names img_array List_No time
0 1_rel 253 1 38
1 1_rel_right 255 1 38
2 1_rel_top 250 1 38
3 4_rel 180 4 23
4 4_rel_right 182 4 23
5 4_rel_top 189 4 23
6 7_rel 217 7 32
7 7_rel_right 183 7 32
8 7_rel_top 196 7 32
I've got a dataset containing data values associated with times (amongst other categories), and I'd like to add an accumulated value column - that is, the sum of all values up to and including the time. So, taking something like this:
ID YEAR VALUE
0 A 2018 144
1 B 2018 147
2 C 2018 164
3 D 2018 167
4 A 2019 167
5 B 2019 109
6 C 2019 183
7 D 2019 121
8 A 2020 136
9 B 2020 187
10 C 2020 170
11 D 2020 188
and adding a column like this:
ID YEAR VALUE CUMULATIVE_VALUE
0 A 2018 144 144
1 B 2018 147 147
2 C 2018 164 164
3 D 2018 167 167
4 A 2019 167 311
5 B 2019 109 256
6 C 2019 183 347
7 D 2019 121 288
8 A 2020 136 447
9 B 2020 187 443
10 C 2020 170 517
11 D 2020 188 476
Where e.g. in row 7 the CUMULATIVE_VALUE is the sum of the two VALUEs for ID="D" in years 2018 and 2019 (and not 2020).
I've looked at cumsum() but can't see how I could use it in this specific case, so the best I've come up with is this:
import numpy as np
import pandas as pd

np.random.seed(0)
ids = ["A", "B", "C", "D"]
years = [2018, 2019, 2020]
df = pd.DataFrame({"ID": np.tile(ids, 3),
                   "YEAR": np.repeat(years, 4),
                   "VALUE": np.random.randint(100, 200, 12)})
print(df)

df["CUMULATIVE_VALUE"] = None
for id in ids:
    for year in years:
        df.loc[(df.ID == id) & (df.YEAR == year), "CUMULATIVE_VALUE"] = \
            df[(df.ID == id) & (df.YEAR <= year)].VALUE.sum()
print(df)
but I'm sure there must be a better and more efficient way of doing it. Anyone?
You can use groupby to group by ID and take the cumulative sum with cumsum:
df['CUMULATIVE_VALUE'] = df.groupby('ID').VALUE.cumsum()
ID YEAR VALUE CUMULATIVE_VALUE
0 A 2018 144 144
1 B 2018 147 147
2 C 2018 164 164
3 D 2018 167 167
4 A 2019 167 311
5 B 2019 109 256
6 C 2019 183 347
7 D 2019 121 288
8 A 2020 136 447
9 B 2020 187 443
10 C 2020 170 517
11 D 2020 188 476
In case the years are not sorted, sort first:
df = df.sort_values(['ID', 'YEAR']).reset_index(drop=True)
df['cumsum'] = df.groupby('ID')['VALUE'].cumsum()
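As a quick sanity check against the df constructed in the question: after sorting and grouping as above, the D rows come out as 167, 288 and 476, matching the expected output:
print(df.loc[df.ID == 'D', ['YEAR', 'VALUE', 'cumsum']])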
I have to use groupby() on a dataframe in Python 3.x. The column name is origin; based upon the origin, I have to find the destination with the maximum number of occurrences.
Sample df is like:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay origin dest
0 2013 1 1 517 515 2 830 819 11 EWR IAH
1 2013 1 1 533 529 4 850 830 20 LGA IAH
2 2013 1 1 542 540 2 923 850 33 JFK MIA
3 2013 1 1 544 545 -1 1004 1022 -18 JFK BQN
4 2013 1 1 554 600 -6 812 837 -25 LGA ATL
5 2013 1 1 554 558 -4 740 728 12 EWR ORD
6 2013 1 1 555 600 -5 913 854 19 EWR FLL
7 2013 1 1 557 600 -3 709 723 -14 LGA IAD
8 2013 1 1 557 600 -3 838 846 -8 JFK MCO
9 2013 1 1 558 600 -2 753 745 8 LGA ORD
You can use the following to count the occurrences of the other column per origin:
df.groupby(['origin'])['dest'].size().reset_index()
origin dest
0 EWR 3
1 JFK 3
2 LGA 4
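If what you need is the single most frequent destination per origin (which is how I read the question), a mode-based variant could work; a sketch:
df.groupby('origin')['dest'].agg(lambda s: s.value_counts().idxmax())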
You can use aggregate functions to make your life simpler and plot graphs onto the result as well:
fun = {'dest': {'Count': 'count'}}
df = df.groupby(['origin', 'dest']).agg(fun).reset_index()
df.columns = df.columns.droplevel(1)
df
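Note that nested renaming dicts in agg were removed in pandas 1.0, so on recent versions the same table can be produced with named aggregation (a sketch):
df = df.groupby(['origin', 'dest']).agg(Count=('dest', 'count')).reset_index()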
This is a sample of the data:
Polling_Booth INC SAD BSP PS_NO
1 89 47 2 1
2 97 339 6 1
3 251 485 8 1
4 356 355 25 2
5 290 333 9 2
6 144 143 4 3
7 327 196 1 4
8 370 235 1 5
And this is what I'm trying to achieve
Polling_Booth INC SAD BSP PS_NO OP_INC OP_SAD OP_BSP
1 89 47 2 1
2 97 339 6 1
3 251 485 8 1 437 871 16
4 356 355 25 2
5 290 333 9 2 646 688 34
6 144 143 4 3 144 143 4
7 327 196 1 4 327 196 1
8 370 235 1 5 370 235 1
This is achieved by adding up rows which have the same PS_NO. This is what I have tried:
=IF(E2=E3,SUM(B2,B3),0) #same for all the rows
Any help would be much appreciated. Thanks
You could get it to look like your table by adding another condition to check whether it's the last occurrence of the PS_NO in column E, and setting the result to an empty string if not:
=IF(COUNTIF($E$2:$E2,$E2)=COUNTIF($E$2:$E$10,$E2),SUMIF($E$2:$E$10,$E2,B$2:B$10),"")
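For example, with the sample data the formula in row 4 (the last PS_NO = 1 row) returns 89+97+251 = 437 for column B, the OP_INC value shown above; filling it across and down produces the other OP_ columns.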
If the data is sorted by PS_NO, you can do it more easily with
=IF($E3<>$E2,SUMIF($E$2:$E$10,$E2,B$2:B$10),"")
which I think is what you were trying to do in your question.