How to join all Cities belonging to their state in python - python-3.x

I have a dataframe that contains some states, and a list of a few of their cities. I want to add those cities to the dataframe and group each city with its state name.
For example:
# I have entered some random city names for illustration
city = ['Akola','Aurangabad','Dhule','Jalgaon','Mumbai','Mumbai Suburban','Nagpur']
State Cases Active Recovered Death
0 Maharashtra 77793 2933 41402 1458 33681 1352 2710 123
1 Andhra Pradesh 4223 143 1613 67 2539 73 71 3
2 Karnataka 4320 257 2653 157 1610 96 57 4
3 Goa 166 87 109 87 57 0
4 Tamil Nadu 27256 1384 12134 786 14902 586 220 12
I want to add those cities to the dataframe in a new column, like this:
State Cases Active Recovered Death |CITY
0 Maharashtra 77793 2933 41402 1458 33681 1352 2710 123 |AKOLA
1 Maharashtra 77793 2933 41402 1458 33681 1352 2710 123 |DHULE
2 Maharashtra 77793 2933 41402 1458 33681 1352 2710 123 |MUMBAI
3 Andhra Pradesh 4223 143 1613 67 2539 73 71 3 |JALGAON
4 Andhra Pradesh 4223 143 1613 67 2539 73 71 3 |NAGPUR
5 Karnataka 4320 257 2653 157 1610 96 57 4
6 Goa 166 87 109 87 57 0
7 Tamil Nadu 27256 1384 12134 786 14902 586 220 12 |AURANGABAD
8 Tamil Nadu 27256 1384 12134 786 14902 586 220 12 |MUMBAI SUBURBAN
# The data itself is wrong, so please focus on the format.
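One possible approach, sketched under assumptions: the question does not say which city belongs to which state, so the `city_to_state` mapping below is hypothetical, and the frame is cut down to a single `Cases` column. A left merge on `State` repeats each state row once per city and leaves `CITY` empty for states without any listed city.

```python
import pandas as pd

# Reduced version of the states frame from the question (one numeric column only).
states = pd.DataFrame({
    "State": ["Maharashtra", "Karnataka", "Goa"],
    "Cases": [77793, 4320, 166],
})

# Hypothetical mapping: the question does not provide a city -> state lookup.
city_to_state = {
    "Akola": "Maharashtra",
    "Dhule": "Maharashtra",
    "Mumbai": "Maharashtra",
}

cities = pd.DataFrame({
    "CITY": [city.upper() for city in city_to_state],
    "State": list(city_to_state.values()),
})

# Left merge: one output row per (state, city) pair; states without cities get NaN.
result = states.merge(cities, on="State", how="left")
print(result)
```

If the blank cells should be empty strings rather than NaN, `result["CITY"].fillna("")` would do that.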

Related

Sum up Specific columns in a Dataframe from sqlite

I'm relatively new to DataFrames in Python and am running into an issue I can't track down.
I have a DataFrame with the following column layout.
print(list(df.columns.values)) returns:
['iccid', 'system', 'last_updated', '01.01', '02.01', '03.01', '04.01', '05.01', '12.01', '18.01', '19.01', '20.01', '21.01', '22.01', '23.01', '24.01', '25.01', '26.01', '27.01', '28.01', '29.01', '30.01', '31.01']
Normally I should have one column for each day of a specific month; in the example above it's December 2022. Sometimes days are missing, which isn't an issue.
I first tried to get all the relevant columns by filtering them:
# Filter out the columns that are not related to the data
data_columns = [col for col in df.columns if '.' in col]
Now comes the issue:
Sometimes the column "system" can also be empty, so I need to put the iccid into the system value:
df.loc[df['system'] == 'Nicht benannt!', 'system'] = df.loc[df['system'] == 'Nicht benannt!', 'iccid'].iloc[0]
df.loc[df['system'] == '', 'system'] = df.loc[df['system'] == '', 'iccid'].iloc
grouped = df.groupby('system').sum(numeric_only=False)
Then I tried to create the 'data_usage' column I need.
grouped['data_usage'] = grouped[data_columns[-1]]
grouped.reset_index(inplace=True)
With that line I should normally get only the value of the last column in the dataframe (a workaround that also didn't work as expected).
What I'm trying to get is the sum of all columns that contain a date in their name, added as a new column named data_usage.
The issue is that systems without an initial system value end up with a data_usage of 120000 (the value represents megabytes used), while the sqlite file shows that the system used only 9000 MB in total in that particular month.
For example, I have this row in the sqlite file:
iccid | system | last_updated | 06.02 | 08.02
8931080320014183316 | Nicht benannt! | 2023-02-06 | 1196 | 1391
and in the dataframe I get the following result:
8931080320014183316 48129.0
I can't find the issue and would be very happy if someone could point me in the right direction.
Here is some example data, as requested:
iccid | system | last_updated | 01.12 | 02.12 | 03.12 | 04.12 | 05.12 | 06.12 | 07.12 | 08.12 | 09.12 | 10.12 | 11.12 | 12.12 | 13.12 | 14.12 | 15.12 | 16.12 | 17.12 | 18.12 | 19.12 | 20.12 | 21.12 | 22.12 | 23.12 | 28.12 | 29.12 | 30.12 | 31.12
8945020184547971966 | U-O-51 | 2022-12-01 | 2 | 32 | 179 | 208 | 320 | 509 | 567 | 642 | 675 | 863 | 1033 | 1055 | 1174 | 2226 | 2277 | 2320 | 2466 | 2647 | 2679 | 2713 | 2759 | 2790 | 2819 | 2997 | 3023 | 3058 | 3088
8945020855461807911 | L-O-382 | 2022-12-01 | 1 | 26 | 54 | 250 | 385 | 416 | 456 | 481 | 506 | 529 | 679 | 772 | 802 | 832 | 858 | 915 | 940 | 1019 | 1117 | 1141 | 1169 | 1193 | 1217 | 1419 | 1439 | 1461 | 1483
8945020855461809750 | C-O-27 | 2022-12-01 | 1 | 123 | 158 | 189 | 225 | 251 | 456 | 489 | 768 | 800 | 800 | 800 | 800 | 800 | 800 | 2362 | 2386 | 2847 | 2925 | 2960 | 2997 | 3089 | 3116 | 3448 | 3469 | 3543 | 3586
8931080019070958450 | L-O-123 | 2022-12-02 | 0 | 21 | 76 | 313 | 479 | 594 | 700 | 810 | 874 | 1181 | 1955 | 2447 | 2527 | 2640 | 2897 | 3008 | 3215 | 3412 | 3554 | 3639 | 3698 | 3782 | 3850 | 4741 | 4825 | 4925 | 5087
8931080453114183282 | Nicht benannt! | 2022-12-02 | 0 | 6 | 45 | 81 | 95 | 98 | 101 | 102 | 102 | 102 | 102 | 102 | 102 | 103 | 121 | 121 | 121 | 121 | 149 | 164 | 193 | 194 | 194 | 194 | 194 | 194 | 194
8931080894314183290 | C-O-16 N | 2022-12-02 | 0 | 43 | 145 | 252 | 386 | 452 | 532 | 862 | 938 | 1201 | 1552 | 1713 | 1802 | 1855 | 2822 | 3113 | 3185 | 3472 | 3527 | 3745 | 3805 | 3880 | 3938 | 4221 | 4265 | 4310 | 4373
8931080465814183308 | L-O-83 | 2022-12-02 | 0 | 61 | 169 | 275 | 333 | 399 | 468 | 858 | 1094 | 1239 | 1605 | 1700 | 1928 | 2029 | 3031 | 4186 | 4333 | 4365 | 4628 | 4782 | 4842 | 4975 | 5265 | 5954 | 5954 | 5954 | 5954
8931082343214183316 | Nicht benannt! | 2022-12-02 | 0 | 52 | 182 | 506 | 602 | 719 | 948 | 1129 | 1314 | 1646 | 1912 | 1912 | 1912 | 1912 | 2791 | 3797 | 3944 | 4339 | 4510 | 4772 | 4832 | 5613 | 5688 | 6151 | 6482 | 6620 | 6848
8931087891314183324 | L-O-119 | 2022-12-02 | 0 | 19 | 114 | 239 | 453 | 573 | 685 | 800 | 1247 | 1341 | 1341 | 1341 | 1341 | 1341 | 1341 | 1341 | 1341 | 1341 | 1341 | 1341 | 1341 | 1341 | 1423 | 2722 | 3563 | 4132 | 4385
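A likely cause, offered as a sketch rather than a confirmed diagnosis: `.iloc[0]` picks the first matching iccid and assigns that single value to every unnamed row, so all unnamed systems collapse into one group and their values are added together. Assigning the iccid row by row through a boolean mask avoids that. The frame below reuses three rows and the first three date columns from the example data; everything else is an assumption.

```python
import pandas as pd

# Three rows and three date columns taken from the example data above.
df = pd.DataFrame({
    "iccid": ["8945020184547971966", "8931080453114183282", "8931082343214183316"],
    "system": ["U-O-51", "Nicht benannt!", "Nicht benannt!"],
    "last_updated": ["2022-12-01", "2022-12-02", "2022-12-02"],
    "01.12": [2, 0, 0],
    "02.12": [32, 6, 52],
    "03.12": [179, 45, 182],
})

# Same filter as in the question: date columns contain a dot.
data_columns = [col for col in df.columns if "." in col]

# Row-wise replacement: each unnamed or empty system gets its OWN iccid.
unnamed = df["system"].isin(["Nicht benannt!", ""])
df.loc[unnamed, "system"] = df.loc[unnamed, "iccid"]

# Sum the date columns per row, then aggregate per system.
df["data_usage"] = df[data_columns].sum(axis=1)
grouped = df.groupby("system", as_index=False)["data_usage"].sum()
print(grouped)
```

One more thing worth checking: the sample values never decrease from day to day, which suggests the columns may hold running totals. If that is the case, summing them overstates usage, and the last available column (or `df[data_columns].max(axis=1)`) may be closer to the monthly figure.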

How can I select only the rows In file 1 that match column values in file 2?

I have multiple measurements per 'Subject' in file 1. I only want to use the single highest-quality measurement per Subject. My second file holds the exact list of which measurement is best for each Subject, in the column 'seriesnumber': the number in that column corresponds to the best measurement for a Subject. I need to extract only these rows from file 1.
I have tried awk, join, and merge to accomplish this, but got errors and strange, incomplete files.
join code:
join -j2 file1 file2
awk code:
awk ' FILENAME=="file1" {arr[$2]=$0; next}
FILENAME=="file2" {print arr[$2]} ' file1 file2 > newfile
File 1 Example
Subject Seriesnumber
19-1-1001 2 8655 661 15250 60747 8005 3919 7393 2264 1479 1663 22968 4180 1712 689 781 4255 90 1260 7233 154 15643 63421 7361 4384 6932 2062 4526 1742 686 4575 100 1684 0 1194 0 0 5 0 0 147 699 315 305 317 565 1361200 1338210 1338690 304258 308180 612438 250614 255920 506534 66645 802424 1206450 1187010 1185180 1816840 1 1 21 17 38 1765590
19-1-1001 10 8992 507 15722 64032 8728 3929 7208 2075 1529 1529 22503 3993 1819 710 764 3870 87 1247 7361 65 16128 66226 8165 4384 6669 1805 4405 1752 779 4039 103 1705 0 1280 0 0 10 0 0 186 685 300 318 320 598 1370490 1347160 1347520 306588 307188 613775 251704 256521 508225 65808 808802 1208880 1189150 1187450 1827880 1 1 22 26 48 1778960
19-1-1103 2 3303 317 12146 57569 7008 3617 6910 2018 811 1593 18708 4708 1429 408 668 3279 14 1289 2351 85 13730 60206 6731 4137 7034 2038 4407 1483 749 3576 85 1668 0 948 0 0 7 0 0 129 602 288 291 285 748 1250030 1238540 1238820 301810 301062 602872 215029 218080 433108 61555 781150 1107360 1098510 1097220 1635560 1 1 32 47 79 1555850
19-1-1103 9 3236 286 12490 59477 7000 3558 6782 2113 894 1752 19338 4818 1724 387 649 3345 56 1314 2077 133 13885 60414 6628 4078 7063 2031 4269 1709 610 3707 112 1947 0 990 0 0 8 0 0 245 604 279 280 284 693 1269820 1258050 1258320 306856 309614 616469 215658 220876 436534 61859 796760 1124870 1115990 1114510 1630740 1 1 32 42 74 1556790
19-10-1010 2 3344 608 14744 59165 8389 4427 6962 2008 716 1496 21980 4008 1474 769 652 3715 61 1400 3049 1072 15767 61919 8325 4824 7117 1936 4001 1546 684 3935 103 1434 0 1624 0 0 3 0 0 316 834 413 520 517 833 1350760 1337040 1336840 311985 312592 624577 246800 251133 497933 65699 809736 1200320 1189410 1188280 1731270 1 1 17 13 30 1606700
19-10-1010 6 3242 616 15205 61330 8019 4520 6791 2093 735 1558 22824 3981 1546 653 614 3672 96 1227 2992 1070 16450 64189 8489 4407 6953 2099 4096 1668 680 4116 99 1449 0 2161 0 0 19 0 0 263 848 387 525 528 824 1339090 1325830 1325780 309464 311916 621380 239958 244616 484574 65493 810887 1183120 1172600 1171430 1720000 1 1 16 26 42 1587100
File 2 Example
Subject seriesnumber
19-10-1010 2
19-10-1166 2
19-102-10005 2
19-102-10006 2
19-103-10009 2
19-103-10010 2
19-104-10013 11
19-104-10014 2
19-105-10017 6
19-105-10018 6
The desired output would look something like this,
where I no longer have duplicate entries per subject. The second column will look different because the preferred series number differs per subject.
19-10-1010 2 3344 608 14744 59165 8389 4427 6962 2008 716 1496 21980 4008 1474 769 652 3715 61 1400 3049 1072 15767 61919 8325 4824 7117 1936 4001 1546 684 3935 103 1434 0 1624 0 0 3 0 0 316 834 413 520 517 833 1350760 1337040 1336840 311985 312592 624577 246800 251133 497933 65699 809736 1200320 1189410 1188280 1731270 1 1 17 13 30 1606700
19-10-1166 2 3699 312 15373 61787 8026 4248 6385 1955 608 2194 21394 4260 1563 886 609 3420 25 1101 3415 417 16909 63040 7236 4264 5933 1852 4156 1213 654 4007 53 1336 5 1597 0 0 18 0 0 110 821 300 514 466 854 1193020 1179470 1179420 282241 273236 555477 204883 203228 408111 61343 740736 1036210 1026080 1024910 1563950 1 1 39 40 79 1415890
19-102-10005 2 8733 514 13024 50735 7729 3775 4955 1575 1045 1141 20415 3924 1537 990 651 3515 134 1259 8571 232 13487 51374 7150 4169 5192 1664 3760 1620 596 3919 189 1958 0 1479 0 0 36 0 0 203 837 459 409 439 1072 1224350 1200010 1200120 287659 290445 578104 216976 220545 437521 57457 737161 1095770 1074440 1073050 1637570 1 1 31 22 53 1618600
19-102-10006 2 8347 604 13735 42231 7266 3836 6473 2057 1099 1007 18478 3769 1351 978 639 3332 125 1197 8207 454 13774 43750 6758 4274 6148 1921 3732 1584 614 3521 180 1611 0 1241 0 0 25 0 0 254 813 410 352 372 833 1092800 1069450 1069190 244104 245787 489891 202201 205897 408098 59170 634640 978807 958350 957462 1485600 1 1 19 19 38 1472020
19-103-10009 2 4222 596 14702 52038 7428 4065 6598 2166 835 1854 22613 3397 1387 879 568 3729 93 1315 3414 222 14580 52639 7316 3997 6447 1986 4067 1529 596 3778 113 1689 0 2097 0 0 23 0 0 260 761 326 400 359 772 1204670 1190100 1189780 256560 260381 516941 237316 243326 480642 60653 681040 1070620 1059370 1058440 1605990 1 1 25 23 48 1593730
19-103-10010 2 5254 435 14688 47120 7772 3130 5414 1711 741 1912 20643 3594 1449 882 717 3663 41 999 6465 605 14820 49390 6361 3826 5527 1523 3513 1537 639 3596 80 1261 0 1475 0 0 18 0 0 283 827 383 414 297 627 1135490 1117320 1116990 243367 245896 489263 221809 227084 448893 55338 639719 1009370 994519 993639 1568140 1 1 14 11 25 1542210
19-104-10013 2 7276 341 11836 53018 7912 3942 6105 2334 795 2532 21239 4551 1258 1176 430 3636 83 1184 8811 396 12760 53092 7224 4361 6306 1853 4184 1278 543 3921 175 1814 0 2187 0 0 8 0 0 266 783 381 382 357 793 1011640 987712 987042 206633 228397 435031 170375 191222 361597 61814 601948 879229 859619 859103 1586150 1 1 224 162 386 1557120
19-104-10014 2 5964 355 13297 55439 8599 4081 5628 1730 970 1308 20196 4519 1363 992 697 3474 62 1232 6830 472 14729 59478 7006 4443 6156 1825 4492 1726 827 4017 122 1804 0 1412 0 0 17 0 0 259 672 299 305 319 779 1308470 1288970 1288910 284018 285985 570003 258525 257355 515880 62485 746108 1166160 1149700 1148340 1826660 1 1 33 24 57 1630580
19-105-10017 2 7018 307 13848 53855 8345 3734 6001 2095 899 1932 20712 4196 1349 645 823 4212 72 1475 3346 1119 13970 55202 7411 3975 5672 1737 3778 1490 657 4089 132 1689 0 1318 0 0 23 0 0 234 745 474 367 378 760 1122360 1104380 1104520 235806 233881 469687 217939 220736 438675 61471 639143 985718 970903 969619 1583800 1 1 51 51 102 1558470
19-105-10018 2 16454 1098 12569 52521 8215 3788 5858 1805 788 1147 21028 3496 1492 665 634 3796 39 1614 10700 617 12813 52098 8091 3901 5367 1646 3544 1388 723 3938 47 1819 0 1464 0 0 42 0 0 330 832 301 319 400 788 1148940 1114080 1113560 225179 227218 452397 237056 237295 474351 59172 614884 1019300 986820 986144 1607900 1 1 19 28 47 1591480
19-105-10020 2 4096 451 13042 48597 7601 3228 5665 1582 778 1670 19769 3612 1187 717 617 3672 103 962 2627 467 13208 48466 6619 3461 5217 1360 3575 1388 718 3783 90 1370 0 862 0 0 6 0 0 216 673 386 439 401 682 1081580 1068850 1068890 233290 235396 468686 209666 214472 424139 54781 619447 958522 948737 947554 1493740 1 1 16 11 27 1452900
For file1 containing (I shortened the long lines):
Subject Seriesnumber
19-1-1001 2 8655 661 15250 60747 800
19-1-1001 10 8992 507 15722 64032 872
19-1-1103 2 3303 317 12146 57569 700
19-1-1103 9 3236 286 12490 59477 700
19-10-1010 2 3344 608 14744 59165 838
19-10-1010 6 3242 616 15205 61330 801
and file2 containing:
Subject seriesnumber
19-10-1010 2
19-10-1166 2
19-102-10005 2
19-102-10006 2
19-103-10009 2
19-103-10010 2
19-104-10013 11
19-104-10014 2
19-105-10017 6
19-105-10018 6
The following awk command will output:
$ awk 'NR==FNR{a[$1, $2];next} ($1, $2) in a' file2 file1
19-10-1010 2 3344 608 14744 59165 838
Note that the first file argument to awk is file2, not file1 (a small optimization). How it works:
NR == FNR - true when the overall record number equals the per-file record number, i.e. only while awk is reading the first file passed to it.
a[$1, $2] - records the key $1,$2 as an index in the associative array a
next - skips the rest of the script and continues with the next line
($1, $2) in a - checks whether the key $1,$2 is in the associative array a
because of next, this is run only for the second file passed to awk
if this expression is true, the line is printed (a pattern with no action prints the line by default).
Alternatively, you could do the following, but it stores the whole of file1 in memory, whereas the code above stores only the $1, $2 keys:
awk 'NR==FNR{arr[$1, $2]=$0; next} ($1, $2) in arr{print arr[$1, $2]}' file1 file2
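For comparison, the same two-pass idea can be sketched in Python. This is only an illustration of the technique, not part of the awk answer: the function name `select_rows` is made up, and the sample lines are the shortened file1 and a few rows of file2 from above. The keys from file2 go into a set, then file1 is filtered by its first two whitespace-separated fields.

```python
def select_rows(file2_lines, file1_lines):
    """Keep lines of file1 whose (Subject, Seriesnumber) pair appears in file2."""
    wanted = set()
    for line in file2_lines:
        fields = line.split()
        if len(fields) >= 2:
            wanted.add((fields[0], fields[1]))
    # Only the keys are held in memory; file1 is streamed line by line.
    return [line for line in file1_lines if tuple(line.split()[:2]) in wanted]


file1 = [
    "Subject Seriesnumber",
    "19-1-1001 2 8655 661 15250 60747 800",
    "19-1-1001 10 8992 507 15722 64032 872",
    "19-10-1010 2 3344 608 14744 59165 838",
    "19-10-1010 6 3242 616 15205 61330 801",
]
file2 = [
    "Subject seriesnumber",
    "19-10-1010 2",
    "19-10-1166 2",
    "19-104-10013 11",
]

selected = select_rows(file2, file1)
for line in selected:
    print(line)
```

With real files, the two lists could be replaced by open file objects, since the function only iterates over lines.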

Unable to extract table data

import requests
from bs4 import BeautifulSoup
URL = 'https://www.mohfw.gov.in/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('table')
table_body = table.find_all('tr')
print(table_body)
This is my code. I'm unable to extract the table data even after fetching the HTML content. What am I doing wrong?
The data in the table is stored inside an HTML comment (<!-- ... -->). To parse it, you can use this example:
import requests
from bs4 import BeautifulSoup, Comment
url = 'https://www.mohfw.gov.in/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# The rows are inside a comment node, so parse the comment text as HTML again
soup2 = BeautifulSoup(soup.table.find(string=lambda t: isinstance(t, Comment)), 'html.parser')
for row in soup2.select('tr'):
    tds = [td.get_text(strip=True) for td in row.select('td')]
    print('{:<5}{:<60}{:<10}{:<10}{:<10}'.format(*tds))
Prints:
1 Andaman and Nicobar Islands 47 133 0
2 Andhra Pradesh 18159 19393 492
3 Arunachal Pradesh 387 153 3
4 Assam 6818 12888 48
5 Bihar 7549 14018 197
6 Chandigarh 164 476 11
7 Chhattisgarh 1260 3451 21
8 Dadra and Nagar Haveli and Daman and Diu 179 371 2
9 Delhi 17407 97693 3545
10 Goa 1272 1817 19
11 Gujarat 11289 32103 2089
12 Haryana 5495 18185 322
13 Himachal Pradesh 382 984 11
14 Jammu and Kashmir 5488 6446 222
15 Jharkhand 2069 2513 42
16 Karnataka 30661 19729 1032
17 Kerala 5376 4862 37
18 Ladakh 176 970 1
19 Madhya Pradesh 5562 14127 689
20 Maharashtra 114947 158140 11194
21 Manipur 635 1129 0
22 Meghalaya 309 66 2
23 Mizoram 112 160 0
24 Nagaland 525 391 0
25 Odisha 4436 10877 79
26 Puducherry 774 947 22
27 Punjab 2587 6277 230
28 Rajasthan 6666 19970 538
29 Sikkim 155 88 0
30 Tamil Nadu 46717 107416 2236
31 Telangana 13327 27295 396
32 Tripura 676 1604 3
33 Uttarakhand 937 2995 50
34 Uttar Pradesh 15720 26675 1046
35 West Bengal 13679 21415 1023
Cases being reassigned to states 531
Total# 342473 635757 25602
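To see why the first attempt finds no rows, here is a small offline sketch that uses only the standard library's `html.parser` instead of BeautifulSoup. The `PAGE` string and both class names are made up for illustration. The first pass sees the commented-out rows as one opaque comment, not as tags; feeding that comment text to a second parser exposes the `<tr>` and `<td>` elements.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: the table rows live inside an HTML comment.
PAGE = """
<table>
<!--
<tr><td>1</td><td>Goa</td><td>1272</td></tr>
<tr><td>2</td><td>Kerala</td><td>5376</td></tr>
-->
</table>
"""


class CommentCollector(HTMLParser):
    """Collects the raw text of every HTML comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


class RowCollector(HTMLParser):
    """Collects cell text, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and self.rows:
            self.rows[-1].append(data.strip())


# First pass: the rows are invisible as tags, they only show up as a comment.
first_pass = CommentCollector()
first_pass.feed(PAGE)

# Second pass: parse the comment body as HTML in its own right.
second_pass = RowCollector()
for text in first_pass.comments:
    second_pass.feed(text)

print(second_pass.rows)
```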

How do I create a loop for reading lines then printing it out with a format (Python 3.x)

Hi, I'm trying to create a readline loop and then print each record out individually with a format, but every time I do, each value just repeats itself across the row.
Here is my code:
LF = open('fees.txt', 'r')
print('Now the final table\n')
print("Airline", format("1st bag",">15"),format("2nd bag",">15"), \
      format("Change Fee",">15"),format("Other Fee",">15"), \
      format("Feel Like",">15"),'\n')
line = LF
while line != '':
    line = str(line)
    line = LF.readline()
    line = line.rstrip('\n')
    print(line, format(line,'>10'),format(line,'>15'), format(line,'>15'), \
          format(line,'>15'), format(line,'>15'),'\n')
LF.close()
print('===================================================\n')
and the result always turns out like this:
Now the final table
Airline 1st bag 2nd bag Change Fee Other Fee Feel Like
Southwest Southwest Southwest Southwest Southwest Southwest
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
Yes! Yes! Yes! Yes! Yes! Yes!
JetBlue JetBlue JetBlue JetBlue JetBlue JetBlue
20 20 20 20 20 20
35 35 35 35 35 35
75 75 75 75 75 75
125 125 125 125 125 125
Yikes Yikes Yikes Yikes Yikes Yikes
Alaska Airlines Alaska Airlines Alaska Airlines Alaska Airlines Alaska Airlines Alaska Airlines
25 25 25 25 25 25
25 25 25 25 25 25
125 125 125 125 125 125
155 155 155 155 155 155
Ooof Ooof Ooof Ooof Ooof Ooof
Delta Delta Delta Delta Delta Delta
25 25 25 25 25 25
35 35 35 35 35 35
200 200 200 200 200 200
150 150 150 150 150 150
We Lost Track We Lost Track We Lost Track We Lost Track We Lost Track We Lost Track
United United United United United United
25 25 25 25 25 25
35 35 35 35 35 35
200 200 200 200 200 200
250 250 250 250 250 250
Whaaaaat? Whaaaaat? Whaaaaat? Whaaaaat? Whaaaaat? Whaaaaat?
Am. Airlines Am. Airlines Am. Airlines Am. Airlines Am. Airlines Am. Airlines
25 25 25 25 25 25
35 35 35 35 35 35
200 200 200 200 200 200
205 205 205 205 205 205
Arrrgh! Arrrgh! Arrrgh! Arrrgh! Arrrgh! Arrrgh!
Spirit Spirit Spirit Spirit Spirit Spirit
30 30 30 30 30 30
40 40 40 40 40 40
100 100 100 100 100 100
292 292 292 292 292 292
Really?? Really?? Really?? Really?? Really?? Really??
===================================================
How do I fix it so that the output looks like this?
Southwest 0 0 0 0 Yes !
Jetblue 20 35 75 125 Yikes!
and so on.
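Judging from the output, fees.txt seems to hold one field per line, with six lines per airline, and the loop prints the same line six times instead of six different lines. A minimal sketch under that assumption: read all the lines first, then take them six at a time. The `io.StringIO` object stands in for the real file, and `format_records` is a made-up helper name.

```python
import io

# Stand-in for open('fees.txt'): one field per line, six lines per airline.
sample = io.StringIO(
    "Southwest\n0\n0\n0\n0\nYes!\n"
    "JetBlue\n20\n35\n75\n125\nYikes\n"
)

FIELDS_PER_RECORD = 6  # assumed: name, 1st bag, 2nd bag, change fee, other fee, feel like


def format_records(lines, fields_per_record=FIELDS_PER_RECORD):
    """Group consecutive lines into records and format each as one table row."""
    values = [line.rstrip("\n") for line in lines if line.strip()]
    rows = []
    for start in range(0, len(values), fields_per_record):
        record = values[start:start + fields_per_record]
        name, rest = record[0], record[1:]
        # Left-align the airline name, right-align the remaining fields.
        rows.append(format(name, "<15") + "".join(format(v, ">15") for v in rest))
    return rows


table = format_records(sample)
for row in table:
    print(row)
```

With the real file, `with open('fees.txt') as LF: table = format_records(LF)` would replace the `StringIO` stand-in and also close the file automatically.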

Have one query regarding sum if formula

I am working in Excel using the SUMIF formula; my data is as follows:
Region Opr Qty Cost Combo(col B&A)
192 114 50 500 104192
192 104 453 548 104192
192 114 125 54654 114192
192 114 155 1545 114192
192 124 12 1553 124192
192 134 12222 1554545 134192
192 174 256 15478 174192
192 104 12 1555 104192
192 104 210 1156 104192
192 114 47 448953 114192
192 114 29 59479 114192
192 124 124 32451 124192
192 134 114 290240 134192
4192 10 210 115656 104192
4192 10 47 44896 104192
4192 11 29 12866 114192
4192 11 549 290240 114192
4192 12 124 59480 124192
4192 13 114 61343 134192
4192 17 310 45339 174192
4192 10 56 32451 104192
4192 10 103 82483 104192
4192 11 685 111380 114192
4192 11 646 201858 114192
4192 12 26 6489 124192
4192 13 87 44543 134192
If you look at the last column, it gives the same combination result even though the operator and region are not always the same. I want to do a SUMIF against Region, which is returning wrong values.
You can try SUMPRODUCT:
=SUMPRODUCT(((B2:B27&A2:A27)*1<>E2:E27)*1)
If the concatenation of column B and column A is not equal to the Combo, the row counts as 1; SUMPRODUCT then adds all the 1s together.
Change the ranges accordingly.
The *1 converts any text to a number.
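The same check can be sketched outside Excel to see what the formula counts. This Python snippet is only an illustration: it uses five (Region, Opr, Combo) rows copied from the data above, and the variable names are made up. It counts rows where Opr followed by Region differs from Combo, and it also lists the concatenated keys that are shared by more than one region, which is where a SUMIF on the Combo column mixes regions together.

```python
from collections import defaultdict

# (Region, Opr, Combo) rows copied from the data in the question.
rows = [
    (192, 114, 104192),
    (192, 104, 104192),
    (192, 114, 114192),
    (4192, 10, 104192),
    (4192, 11, 114192),
]

# Equivalent of the SUMPRODUCT: count rows where B&A differs from the Combo.
mismatches = sum(int(f"{opr}{region}") != combo for region, opr, combo in rows)

# Concatenated keys that more than one region maps to (e.g. 104&192 and 10&4192).
regions_per_key = defaultdict(set)
for region, opr, _combo in rows:
    regions_per_key[f"{opr}{region}"].add(region)
ambiguous = sorted(key for key, regions in regions_per_key.items() if len(regions) > 1)

print(mismatches, ambiguous)
```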
