How to separate paragraphs in a text file into multiple text files? - Linux

I want to split this text file into 3 text files so that each paragraph becomes its own file. (My OS is Ubuntu 12.04.)
Input
2008 2 2 1120 31.2 L 34.031 48.515 16.7 INS 5 0.3 4.0LINS 1
GAP=145 0.67 4.1 2.9 6.6 0.2283E-01 -0.1718E+00 0.1289E+02E
ACTION:UPD 08-12-28 13:25 OP:moh STATUS: ID:20080202112031 L I
2008-02-02-1120-39S.IN____006 6
STAT SP IPHASW D HRMM SECON CODA AMPLIT PERI AZIMU VELO AIN AR TRES W DIS CAZ7
SNGE BZ EPg 1120 57.69 91 0.0210 159 318
SNGE BZ AML 1121 24.50 2880.9 0.55 159 318
SHGR BZ EPN5 1121 5.17 52 -0.0510 215 173
GHVR BZ EPn 1121 10.84 52 0.3610 256 78
GHVR BZ ESg 1121 43.50 91 -0.0210 256 78
CHTH BZ EPn 1121 18.26 52 0.1210 317 48
CHTH BZ AML 1122 8.01 494.0 0.68 317 48
DAMV BZ EPn 1121 23.36 52 -0.49 9 362 60
DAMV BZ AML 1122 7.03 382.0 0.48 362 60
2008 211 1403 46.2 L 27.659 55.544 14.1 INS 4 0.1 4.0LINS 1
GAP=171 0.38 1.7 1.2 3.3 -0.8271E-01 -0.3724E-01 0.4284E+00E
2008-02-11-1403-37S.INSN__048 6
ACTION:NEW 08-12-28 13:25 OP:moh STATUS: ID:20080211140346 L I
STAT SP IPHASW D HRMM SECON CODA AMPLIT PERI AZIMU VELO AIN AR TRES W DIS CAZ7
BNDS BZ EPg 14 3 58.14 90 -0.0710 68.3 115
BNDS BN AML 14 4 26.39 8461.0 0.52 68.3 115
GHIR BZ EPn 14 4 26.40 52 0.0310 261 286
GHIR BN ESg 14 4 59.85 90 -0.0110 261 286
GHIR BN AML 14 5 25.22 1122.4 0.56 261 286
GHIR BE AML 14 5 43.83 769.3 0.64 261 286
KRBR BZ EPn 14 4 29.25 52 -0.1110 284 24
KRBR BN ESg 14 5 6.28 90 0.0010 284 24
KRBR BN AML 14 5 18.89 552.4 0.64 284 24
KRBR BE AML 14 5 19.22 574.0 0.60 284 24
ZHSF BZ EPn 14 5 3.24 52 0.25 8 555 66
2008 213 2055 31.5 L 31.713 51.180 14.1 INS 9 0.5 4.2LINS 1
GAP=127 1.21 4.6 6.5 9.6 0.7570E+01 -0.1161E+02 0.9944E+01E
ACTION:UPD 08-12-28 13:25 OP:moh STATUS: ID:20080213205531 L I
2008-02-13-2054-59S.NSN___048 6
STAT SP IPHASW D HRMM SECON CODA AMPLIT PERI AZIMU VELO AIN AR TRES W DIS CAZ7
NASN BZ EPg 2056 3.15 90 -0.6410 195 51
SHGR BZ EPg 2056 8.57 90 -0.3810 229 282
SHGR BN AML 2056 49.27 2371.2 0.77 229 282
SHGR BE AML 2056 51.00 2484.4 0.77 229 282
GHVR BZ EPn 2056 18.39 52 1.0110 307 1
GHVR BE AML 2057 11.42 734.2 0.85 307 1
ASAO BZ EPn 2056 20.35 52 -0.36 9 332 341
ASAO BE ESg 2057 5.23 90 0.27 9 332 341
ASAO BN AML 2057 15.86 723.3 0.64 332 341
GHIR BZ EPn 2056 31.68 52 0.48 9 418 155
GHIR BN AML 2057 51.30 259.1 0.79 418 155
DAMV BZ EPn 2056 33.90 52 -0.27 9 441 9
DAMV BN AML 2057 43.30 237.4 0.65 441 9
THKV BZ EPn 2056 37.71 52 0.33 8 467 357
THKV BE AML 2057 51.62 205.7 0.72 467 357
ZNJK BZ EPn 2056 53.12 52 -0.35 7 596 338
BNDS BZ EPn 2057 3.72 52 -0.06 7 680 133
output1.txt
2008 2 2 1120 31.2 L 34.031 48.515 16.7 INS 5 0.3 4.0LINS 1
GAP=145 0.67 4.1 2.9 6.6 0.2283E-01 -0.1718E+00 0.1289E+02E
ACTION:UPD 08-12-28 13:25 OP:moh STATUS: ID:20080202112031 L I
2008-02-02-1120-39S.IN____006 6
STAT SP IPHASW D HRMM SECON CODA AMPLIT PERI AZIMU VELO AIN AR TRES W DIS CAZ7
SNGE BZ EPg 1120 57.69 91 0.0210 159 318
SNGE BZ AML 1121 24.50 2880.9 0.55 159 318
SHGR BZ EPN5 1121 5.17 52 -0.0510 215 173
GHVR BZ EPn 1121 10.84 52 0.3610 256 78
GHVR BZ ESg 1121 43.50 91 -0.0210 256 78
CHTH BZ EPn 1121 18.26 52 0.1210 317 48
CHTH BZ AML 1122 8.01 494.0 0.68 317 48
DAMV BZ EPn 1121 23.36 52 -0.49 9 362 60
DAMV BZ AML 1122 7.03 382.0 0.48 362 60
output2.txt
2008 211 1403 46.2 L 27.659 55.544 14.1 INS 4 0.1 4.0LINS 1
GAP=171 0.38 1.7 1.2 3.3 -0.8271E-01 -0.3724E-01 0.4284E+00E
2008-02-11-1403-37S.INSN__048 6
ACTION:NEW 08-12-28 13:25 OP:moh STATUS: ID:20080211140346 L I
STAT SP IPHASW D HRMM SECON CODA AMPLIT PERI AZIMU VELO AIN AR TRES W DIS CAZ7
BNDS BZ EPg 14 3 58.14 90 -0.0710 68.3 115
BNDS BN AML 14 4 26.39 8461.0 0.52 68.3 115
GHIR BZ EPn 14 4 26.40 52 0.0310 261 286
GHIR BN ESg 14 4 59.85 90 -0.0110 261 286
GHIR BN AML 14 5 25.22 1122.4 0.56 261 286
GHIR BE AML 14 5 43.83 769.3 0.64 261 286
KRBR BZ EPn 14 4 29.25 52 -0.1110 284 24
KRBR BN ESg 14 5 6.28 90 0.0010 284 24
KRBR BN AML 14 5 18.89 552.4 0.64 284 24
KRBR BE AML 14 5 19.22 574.0 0.60 284 24
ZHSF BZ EPn 14 5 3.24 52 0.25 8 555 66
output3.txt
2008 213 2055 31.5 L 31.713 51.180 14.1 INS 9 0.5 4.2LINS 1
GAP=127 1.21 4.6 6.5 9.6 0.7570E+01 -0.1161E+02 0.9944E+01E
ACTION:UPD 08-12-28 13:25 OP:moh STATUS: ID:20080213205531 L I
2008-02-13-2054-59S.NSN___048 6
STAT SP IPHASW D HRMM SECON CODA AMPLIT PERI AZIMU VELO AIN AR TRES W DIS CAZ7
NASN BZ EPg 2056 3.15 90 -0.6410 195 51
SHGR BZ EPg 2056 8.57 90 -0.3810 229 282
SHGR BN AML 2056 49.27 2371.2 0.77 229 282
SHGR BE AML 2056 51.00 2484.4 0.77 229 282
GHVR BZ EPn 2056 18.39 52 1.0110 307 1
GHVR BE AML 2057 11.42 734.2 0.85 307 1
ASAO BZ EPn 2056 20.35 52 -0.36 9 332 341
ASAO BE ESg 2057 5.23 90 0.27 9 332 341
ASAO BN AML 2057 15.86 723.3 0.64 332 341
GHIR BZ EPn 2056 31.68 52 0.48 9 418 155
GHIR BN AML 2057 51.30 259.1 0.79 418 155
DAMV BZ EPn 2056 33.90 52 -0.27 9 441 9
DAMV BN AML 2057 43.30 237.4 0.65 441 9
THKV BZ EPn 2056 37.71 52 0.33 8 467 357
THKV BE AML 2057 51.62 205.7 0.72 467 357
ZNJK BZ EPn 2056 53.12 52 -0.35 7 596 338
BNDS BZ EPn 2057 3.72 52 -0.06 7 680 133

Here is one idea, just one method: iterate over your file row by row.
Accumulate rows in a buffer while the row is not empty; when you reach an empty row ("" or '\n'), write the buffer out to a separate file:
buffer=""
id=1
while IFS= read -r row; do
    if [ -z "$row" ]; then
        # empty row: write the buffered paragraph to its own file
        [ -n "$buffer" ] && printf '%s' "$buffer" > "fileName_${id}.txt" && id=$((id+1))
        buffer=""
    else
        buffer+="${row}"$'\n'
    fi
done < test
# flush the last paragraph in case the file does not end with a blank line
[ -n "$buffer" ] && printf '%s' "$buffer" > "fileName_${id}.txt"
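If a scripting language is an option, the same idea can be sketched in Python (a minimal sketch, assuming the input file is named test and paragraphs are separated by blank lines; the output names output1.txt, output2.txt, ... match the example above):
# Split "test" into output1.txt, output2.txt, ... one file per paragraph.
with open("test") as f:
    content = f.read()
# Paragraphs are runs of text separated by blank lines.
paragraphs = [p for p in content.split("\n\n") if p.strip()]
for i, paragraph in enumerate(paragraphs, start=1):
    with open("output{}.txt".format(i), "w") as out:
        out.write(paragraph.rstrip("\n") + "\n")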

Related

Sum up Specific columns in a Dataframe from sqlite

I'm relatively new to DataFrames in Python and am running into an issue I can't track down.
I have a DataFrame with the following column layout.
print(list(df.columns.values)) returns:
['iccid', 'system', 'last_updated', '01.01', '02.01', '03.01', '04.01', '05.01', '12.01', '18.01', '19.01', '20.01', '21.01', '22.01', '23.01', '24.01', '25.01', '26.01', '27.01', '28.01', '29.01', '30.01', '31.01']
Normally I should have a column for each day of a specific month; in the example above it's December 2022. Sometimes days are missing, which isn't an issue.
I first tried to get all the relevant columns by filtering them:
# Filter out the columns that are not related to the data
data_columns = [col for col in df.columns if '.' in col]
Now comes the issue:
Sometimes the "system" column can be empty, so I need to put the iccid into the system value:
df.loc[df['system'] == 'Nicht benannt!', 'system'] = df.loc[df['system'] == 'Nicht benannt!', 'iccid'].iloc[0]
df.loc[df['system'] == '', 'system'] = df.loc[df['system'] == '', 'iccid'].iloc[0]
grouped = df.groupby('system').sum(numeric_only=False)
Then I tried to create the needed 'data_usage' column:
grouped['data_usage'] = grouped[data_columns[-1]]
grouped.reset_index(inplace=True)
With that line I should normally get only the value of the last column in the DataFrame (a workaround that also didn't work as expected).
What I'm trying to get is the sum of all columns that contain a date in their name, added as a new column named data_usage.
The issue I'm having is that systems without an initial system value get a data_usage of 120000 (the value represents the megabytes used), while the sqlite file shows the system only used 9000 MB in total that particular month.
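For reference, here is a minimal sketch of what I'm ultimately trying to compute, reusing the data_columns filter from above (this states the goal; it is not the code producing the wrong numbers):
# Sum all day columns per row, then aggregate the usage per system.
df['data_usage'] = df[data_columns].sum(axis=1)
grouped = df.groupby('system', as_index=False)['data_usage'].sum()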
For example, I have this row in the sqlite file:
iccid                system          last_updated  06.02  08.02
8931080320014183316  Nicht benannt!  2023-02-06    1196   1391
and in the DataFrame I get the following result:
8931080320014183316 48129.0
I can't find the issue and would be very happy if someone could point me in the right direction.
Here is some example data, as requested:
iccid                system          last_updated  01.12 02.12 03.12 04.12 05.12 06.12 07.12 08.12 09.12 10.12 11.12 12.12 13.12 14.12 15.12 16.12 17.12 18.12 19.12 20.12 21.12 22.12 23.12 28.12 29.12 30.12 31.12
8945020184547971966  U-O-51          2022-12-01    2     32    179   208   320   509   567   642   675   863   1033  1055  1174  2226  2277  2320  2466  2647  2679  2713  2759  2790  2819  2997  3023  3058  3088
8945020855461807911  L-O-382         2022-12-01    1     26    54    250   385   416   456   481   506   529   679   772   802   832   858   915   940   1019  1117  1141  1169  1193  1217  1419  1439  1461  1483
8945020855461809750  C-O-27          2022-12-01    1     123   158   189   225   251   456   489   768   800   800   800   800   800   800   2362  2386  2847  2925  2960  2997  3089  3116  3448  3469  3543  3586
8931080019070958450  L-O-123         2022-12-02    0     21    76    313   479   594   700   810   874   1181  1955  2447  2527  2640  2897  3008  3215  3412  3554  3639  3698  3782  3850  4741  4825  4925  5087
8931080453114183282  Nicht benannt!  2022-12-02    0     6     45    81    95    98    101   102   102   102   102   102   102   103   121   121   121   121   149   164   193   194   194   194   194   194   194
8931080894314183290  C-O-16 N        2022-12-02    0     43    145   252   386   452   532   862   938   1201  1552  1713  1802  1855  2822  3113  3185  3472  3527  3745  3805  3880  3938  4221  4265  4310  4373
8931080465814183308  L-O-83          2022-12-02    0     61    169   275   333   399   468   858   1094  1239  1605  1700  1928  2029  3031  4186  4333  4365  4628  4782  4842  4975  5265  5954  5954  5954  5954
8931082343214183316  Nicht benannt!  2022-12-02    0     52    182   506   602   719   948   1129  1314  1646  1912  1912  1912  1912  2791  3797  3944  4339  4510  4772  4832  5613  5688  6151  6482  6620  6848
8931087891314183324  L-O-119         2022-12-02    0     19    114   239   453   573   685   800   1247  1341  1341  1341  1341  1341  1341  1341  1341  1341  1341  1341  1341  1341  1423  2722  3563  4132  4385

How can I select only the rows in file 1 that match column values in file 2?

I have multiple measurements per 'Subject' in file 1. I only want to use the highest-quality, single measurement per Subject. In my second file I have the exact list of which measurement is best for each Subject; this information is contained in the column 'seriesnumber'. The number in the 'seriesnumber' column in file 2 corresponds to the best measurement for a Subject. I need to extract only these rows from my file 1.
I have tried awk, join, and merge to accomplish this, but got errors and strange, incomplete files.
join code:
join -j2 file1 file2
awk code:
awk ' FILENAME=="file1" {arr[$2]=$0; next}
FILENAME=="file2" {print arr[$2]} ' file1 file2 > newfile
File 1 Example
Subject Seriesnumber
19-1-1001 2 8655 661 15250 60747 8005 3919 7393 2264 1479 1663 22968 4180 1712 689 781 4255 90 1260 7233 154 15643 63421 7361 4384 6932 2062 4526 1742 686 4575 100 1684 0 1194 0 0 5 0 0 147 699 315 305 317 565 1361200 1338210 1338690 304258 308180 612438 250614 255920 506534 66645 802424 1206450 1187010 1185180 1816840 1 1 21 17 38 1765590
19-1-1001 10 8992 507 15722 64032 8728 3929 7208 2075 1529 1529 22503 3993 1819 710 764 3870 87 1247 7361 65 16128 66226 8165 4384 6669 1805 4405 1752 779 4039 103 1705 0 1280 0 0 10 0 0 186 685 300 318 320 598 1370490 1347160 1347520 306588 307188 613775 251704 256521 508225 65808 808802 1208880 1189150 1187450 1827880 1 1 22 26 48 1778960
19-1-1103 2 3303 317 12146 57569 7008 3617 6910 2018 811 1593 18708 4708 1429 408 668 3279 14 1289 2351 85 13730 60206 6731 4137 7034 2038 4407 1483 749 3576 85 1668 0 948 0 0 7 0 0 129 602 288 291 285 748 1250030 1238540 1238820 301810 301062 602872 215029 218080 433108 61555 781150 1107360 1098510 1097220 1635560 1 1 32 47 79 1555850
19-1-1103 9 3236 286 12490 59477 7000 3558 6782 2113 894 1752 19338 4818 1724 387 649 3345 56 1314 2077 133 13885 60414 6628 4078 7063 2031 4269 1709 610 3707 112 1947 0 990 0 0 8 0 0 245 604 279 280 284 693 1269820 1258050 1258320 306856 309614 616469 215658 220876 436534 61859 796760 1124870 1115990 1114510 1630740 1 1 32 42 74 1556790
19-10-1010 2 3344 608 14744 59165 8389 4427 6962 2008 716 1496 21980 4008 1474 769 652 3715 61 1400 3049 1072 15767 61919 8325 4824 7117 1936 4001 1546 684 3935 103 1434 0 1624 0 0 3 0 0 316 834 413 520 517 833 1350760 1337040 1336840 311985 312592 624577 246800 251133 497933 65699 809736 1200320 1189410 1188280 1731270 1 1 17 13 30 1606700
19-10-1010 6 3242 616 15205 61330 8019 4520 6791 2093 735 1558 22824 3981 1546 653 614 3672 96 1227 2992 1070 16450 64189 8489 4407 6953 2099 4096 1668 680 4116 99 1449 0 2161 0 0 19 0 0 263 848 387 525 528 824 1339090 1325830 1325780 309464 311916 621380 239958 244616 484574 65493 810887 1183120 1172600 1171430 1720000 1 1 16 26 42 1587100
File 2 Example
Subject seriesnumber
19-10-1010 2
19-10-1166 2
19-102-10005 2
19-102-10006 2
19-103-10009 2
19-103-10010 2
19-104-10013 11
19-104-10014 2
19-105-10017 6
19-105-10018 6
The desired output would look something like this, where I no longer have duplicate entries per subject. The second column will look different because the preferred series number differs per subject.
19-10-1010 2 3344 608 14744 59165 8389 4427 6962 2008 716 1496 21980 4008 1474 769 652 3715 61 1400 3049 1072 15767 61919 8325 4824 7117 1936 4001 1546 684 3935 103 1434 0 1624 0 0 3 0 0 316 834 413 520 517 833 1350760 1337040 1336840 311985 312592 624577 246800 251133 497933 65699 809736 1200320 1189410 1188280 1731270 1 1 17 13 30 1606700
19-10-1166 2 3699 312 15373 61787 8026 4248 6385 1955 608 2194 21394 4260 1563 886 609 3420 25 1101 3415 417 16909 63040 7236 4264 5933 1852 4156 1213 654 4007 53 1336 5 1597 0 0 18 0 0 110 821 300 514 466 854 1193020 1179470 1179420 282241 273236 555477 204883 203228 408111 61343 740736 1036210 1026080 1024910 1563950 1 1 39 40 79 1415890
19-102-10005 2 8733 514 13024 50735 7729 3775 4955 1575 1045 1141 20415 3924 1537 990 651 3515 134 1259 8571 232 13487 51374 7150 4169 5192 1664 3760 1620 596 3919 189 1958 0 1479 0 0 36 0 0 203 837 459 409 439 1072 1224350 1200010 1200120 287659 290445 578104 216976 220545 437521 57457 737161 1095770 1074440 1073050 1637570 1 1 31 22 53 1618600
19-102-10006 2 8347 604 13735 42231 7266 3836 6473 2057 1099 1007 18478 3769 1351 978 639 3332 125 1197 8207 454 13774 43750 6758 4274 6148 1921 3732 1584 614 3521 180 1611 0 1241 0 0 25 0 0 254 813 410 352 372 833 1092800 1069450 1069190 244104 245787 489891 202201 205897 408098 59170 634640 978807 958350 957462 1485600 1 1 19 19 38 1472020
19-103-10009 2 4222 596 14702 52038 7428 4065 6598 2166 835 1854 22613 3397 1387 879 568 3729 93 1315 3414 222 14580 52639 7316 3997 6447 1986 4067 1529 596 3778 113 1689 0 2097 0 0 23 0 0 260 761 326 400 359 772 1204670 1190100 1189780 256560 260381 516941 237316 243326 480642 60653 681040 1070620 1059370 1058440 1605990 1 1 25 23 48 1593730
19-103-10010 2 5254 435 14688 47120 7772 3130 5414 1711 741 1912 20643 3594 1449 882 717 3663 41 999 6465 605 14820 49390 6361 3826 5527 1523 3513 1537 639 3596 80 1261 0 1475 0 0 18 0 0 283 827 383 414 297 627 1135490 1117320 1116990 243367 245896 489263 221809 227084 448893 55338 639719 1009370 994519 993639 1568140 1 1 14 11 25 1542210
19-104-10013 2 7276 341 11836 53018 7912 3942 6105 2334 795 2532 21239 4551 1258 1176 430 3636 83 1184 8811 396 12760 53092 7224 4361 6306 1853 4184 1278 543 3921 175 1814 0 2187 0 0 8 0 0 266 783 381 382 357 793 1011640 987712 987042 206633 228397 435031 170375 191222 361597 61814 601948 879229 859619 859103 1586150 1 1 224 162 386 1557120
19-104-10014 2 5964 355 13297 55439 8599 4081 5628 1730 970 1308 20196 4519 1363 992 697 3474 62 1232 6830 472 14729 59478 7006 4443 6156 1825 4492 1726 827 4017 122 1804 0 1412 0 0 17 0 0 259 672 299 305 319 779 1308470 1288970 1288910 284018 285985 570003 258525 257355 515880 62485 746108 1166160 1149700 1148340 1826660 1 1 33 24 57 1630580
19-105-10017 2 7018 307 13848 53855 8345 3734 6001 2095 899 1932 20712 4196 1349 645 823 4212 72 1475 3346 1119 13970 55202 7411 3975 5672 1737 3778 1490 657 4089 132 1689 0 1318 0 0 23 0 0 234 745 474 367 378 760 1122360 1104380 1104520 235806 233881 469687 217939 220736 438675 61471 639143 985718 970903 969619 1583800 1 1 51 51 102 1558470
19-105-10018 2 16454 1098 12569 52521 8215 3788 5858 1805 788 1147 21028 3496 1492 665 634 3796 39 1614 10700 617 12813 52098 8091 3901 5367 1646 3544 1388 723 3938 47 1819 0 1464 0 0 42 0 0 330 832 301 319 400 788 1148940 1114080 1113560 225179 227218 452397 237056 237295 474351 59172 614884 1019300 986820 986144 1607900 1 1 19 28 47 1591480
19-105-10020 2 4096 451 13042 48597 7601 3228 5665 1582 778 1670 19769 3612 1187 717 617 3672 103 962 2627 467 13208 48466 6619 3461 5217 1360 3575 1388 718 3783 90 1370 0 862 0 0 6 0 0 216 673 386 439 401 682 1081580 1068850 1068890 233290 235396 468686 209666 214472 424139 54781 619447 958522 948737 947554 1493740 1 1 16 11 27 1452900
For a file1 containing (long lines truncated for brevity):
Subject Seriesnumber
19-1-1001 2 8655 661 15250 60747 800
19-1-1001 10 8992 507 15722 64032 872
19-1-1103 2 3303 317 12146 57569 700
19-1-1103 9 3236 286 12490 59477 700
19-10-1010 2 3344 608 14744 59165 838
19-10-1010 6 3242 616 15205 61330 801
and a file2 containing:
Subject seriesnumber
19-10-1010 2
19-10-1166 2
19-102-10005 2
19-102-10006 2
19-103-10009 2
19-103-10010 2
19-104-10013 11
19-104-10014 2
19-105-10017 6
19-105-10018 6
The following awk will output:
$ awk 'NR==FNR{a[$1, $2];next} ($1, $2) in a' file2 file1
19-10-1010 2 3344 608 14744 59165 838
Note that the first file argument to awk is file2 not file1 (small optimization)! How it works:
NR == FNR - true when the overall record number equals the current file's record number, i.e. only while reading the first file passed to awk.
a[$1, $2] - remember the index ($1, $2) in associative array a.
next - skip the rest of the script and restart with the next line.
($1, $2) in a - check whether ($1, $2) is in associative array a;
because of next, this runs only for the second file passed to awk;
if the expression evaluates to true, the line is printed (awk's default action for a pattern with no action block).
Alternatively you could do the following, but it stores the whole of file1 in memory, which is memory-consuming; the code above stores only the ($1, $2) keys.
awk 'NR==FNR{arr[$1, $2]=$0; next} ($1, $2) in arr{print arr[$1, $2]}' file1 file2
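If pandas is an option instead of awk, the same selection can be sketched in Python (a sketch assuming both files are whitespace-delimited and named file1 and file2; the lowercase seriesnumber header in file2 is renamed to match):
import pandas as pd

# Read both whitespace-delimited files (file names as in the question).
file1 = pd.read_csv("file1", sep=r"\s+")
file2 = pd.read_csv("file2", sep=r"\s+")

# Align the key column names, then keep only file1 rows whose
# (Subject, Seriesnumber) pair appears in file2.
file2 = file2.rename(columns={"seriesnumber": "Seriesnumber"})
best = file1.merge(file2, on=["Subject", "Seriesnumber"], how="inner")
best.to_csv("newfile", sep=" ", index=False)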

Why doesn't the seaborn plot show a confidence interval?

I am using sns.lineplot to show the confidence intervals in a plot.
sns.lineplot(x = threshold, y = mrl_array, err_style = 'band', ci=95)
plt.show()
I'm getting the following plot, which doesn't show the confidence interval:
What's the problem?
There is probably only a single observation per x value.
If there is only one observation per x value, then there is no confidence interval to plot.
Bootstrapping is performed per x value, but there needs to be more than one observation for this to take effect.
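A quick way to verify this is to count observations per x value; a minimal sketch, assuming threshold and mrl_array are the array-likes from the question:
import pandas as pd
# If every x value occurs only once, there is nothing to bootstrap.
counts = pd.Series(mrl_array, index=threshold).groupby(level=0).size()
print((counts > 1).any())  # False -> no confidence band will be drawn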
ci: Size of the confidence interval to draw when aggregating with an estimator. 'sd' means to draw the standard deviation of the data. Setting to None will skip bootstrapping.
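Note also that in seaborn 0.12 and later the ci parameter is deprecated in favor of errorbar, so the call from the question would become (assuming a recent seaborn version):
sns.lineplot(x=threshold, y=mrl_array, err_style='band', errorbar=('ci', 95))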
Note the following examples from seaborn.lineplot.
This is also the case for sns.relplot with kind='line'.
The question specifies sns.lineplot, but this answer applies to any seaborn plot that displays a confidence interval, such as seaborn.barplot.
Data
import seaborn as sns
# load data
flights = sns.load_dataset("flights")
year month passengers
0 1949 Jan 112
1 1949 Feb 118
2 1949 Mar 132
3 1949 Apr 129
4 1949 May 121
# only May flights
may_flights = flights.query("month == 'May'")
year month passengers
4 1949 May 121
16 1950 May 125
28 1951 May 172
40 1952 May 183
52 1953 May 229
64 1954 May 234
76 1955 May 270
88 1956 May 318
100 1957 May 355
112 1958 May 363
124 1959 May 420
136 1960 May 472
# standard deviation for each year of May data
may_flights.set_index('year')[['passengers']].std(axis=1)
year
1949 NaN
1950 NaN
1951 NaN
1952 NaN
1953 NaN
1954 NaN
1955 NaN
1956 NaN
1957 NaN
1958 NaN
1959 NaN
1960 NaN
dtype: float64
# flights in wide format
flights_wide = flights.pivot(index="year", columns="month", values="passengers")
month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
year
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
# standard deviation for each year
flights_wide.std(axis=1)
year
1949 13.720147
1950 19.070841
1951 18.438267
1952 22.966379
1953 28.466887
1954 34.924486
1955 42.140458
1956 47.861780
1957 57.890898
1958 64.530472
1959 69.830097
1960 77.737125
dtype: float64
Plots
may_flights has one observation per year, so no CI is shown.
sns.lineplot(data=may_flights, x="year", y="passengers")
sns.barplot(data=may_flights, x='year', y='passengers')
flights_wide shows there are twelve observations for each year, so the CI shows when all of flights is plotted.
sns.lineplot(data=flights, x="year", y="passengers")
sns.barplot(data=flights, x='year', y='passengers')

SVG issue: make the transparent path white

In my project I need an SVG icon, but I am not good at SVG. So far I have the following SVG image:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20010904//EN"
"http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">
<svg version="1.0" xmlns="http://www.w3.org/2000/svg"
width="768.000000pt" height="766.000000pt" viewBox="0 0 768.000000 766.000000"
preserveAspectRatio="xMidYMid meet">
<metadata>
Created by potrace 1.15, written by Peter Selinger 2001-2017
</metadata>
<g fill="#b7e68f" transform="translate(0.000000,766.000000) scale(0.100000,-0.100000)"
stroke="none">
<path d="M3580 7650 c-14 -4 -48 -8 -76 -9 -53 -1 -200 -20 -344 -45 -324 -57
-682 -173 -977 -318 -473 -231 -843 -512 -1200 -906 -279 -309 -514 -682 -676
-1072 -152 -366 -252 -776 -278 -1138 -5 -81 -14 -166 -19 -190 -12 -52 -13
-289 -2 -296 5 -3 12 -61 15 -128 18 -341 111 -754 249 -1108 122 -311 312
-658 489 -890 187 -245 384 -456 604 -646 53 -46 228 -179 310 -236 271 -189
657 -380 970 -481 410 -132 755 -186 1186 -187 534 0 941 75 1422 264 481 188
888 453 1266 821 204 200 290 301 500 590 57 79 185 301 262 455 44 87 79 161
79 165 0 3 9 25 19 48 57 125 142 377 181 538 47 197 52 220 85 449 30 204 39
650 17 850 -36 331 -86 571 -177 845 -376 1142 -1274 2047 -2412 2432 -270 91
-567 153 -878 185 -128 13 -582 19 -615 8z m422 -1601 c35 -12 81 -34 103 -50
22 -15 367 -355 766 -756 791 -793 765 -763 790 -904 10 -57 10 -77 -5 -134
-23 -93 -49 -139 -112 -202 -72 -72 -145 -105 -250 -111 -94 -5 -169 13 -242
58 -26 16 -208 191 -405 390 -198 198 -361 360 -363 360 -2 0 -5 -606 -6
-1347 l-3 -1348 -32 -67 c-85 -180 -286 -276 -466 -223 -121 35 -239 146 -279
263 -10 29 -14 329 -18 1379 l-5 1342 -366 -363 c-201 -199 -381 -373 -400
-387 -148 -111 -375 -92 -508 42 -90 91 -134 226 -112 345 24 133 5 111 799
907 406 406 757 751 782 767 93 60 225 75 332 39z"/>
</g>
</svg>
Currently the arrow is transparent; in my case I want to make it red. How do I modify this SVG image to support that?
In your example, the SVG circle and arrow are drawn as a single path.
It is necessary to split that one path into two separate paths, one for the circle and one for the arrow.
Then it is possible to paint them different colors.
#circle {
fill:#b7e68f;
}
#arrow {
fill:red;
}
<svg xmlns:svg="http://www.w3.org/2000/svg" xmlns="http://www.w3.org/2000/svg" version="1" width="768" height="766" viewBox="0 0 768 766" preserveAspectRatio="xMidYMid meet">
<metadata>
Created by potrace 1.15, written by Peter Selinger 2001-2017
</metadata>
<g transform="translate(0.000000,766.000000) scale(0.100000,-0.100000)" stroke="none">
<path id="arrow" d="m3580 7650c-14-4-48-8-76-9-53-1-200-20-344-45-324-57-682-173-977-318C1710 7047 1340 6766 983 6372 704 6063 469 5690 307 5300 155 4934 55 4524 29 4162 24 4081 15 3996 10 3972-2 3920-3 3683 8 3676c5-3 12-61 15-128 18-341 111-754 249-1108 122-311 312-658 489-890 187-245 384-456 604-646 53-46 228-179 310-236C1946 479 2332 288 2645 187 3055 55 3400 1 3831 0c534 0 941 75 1422 264 481 188 888 453 1266 821 204 200 290 301 500 590 57 79 185 301 262 455 44 87 79 161 79 165 0 3 9 25 19 48 57 125 142 377 181 538 47 197 52 220 85 449 30 204 39 650 17 850-36 331-86 571-177 845-376 1142-1274 2047-2412 2432-270 91-567 153-878 185-128 13-582 19-615 8z"/>
</g>
<path id="circle" d="m358 1c-1.4 0.4-4.8 0.8-7.6 0.9-5.3 0.1-20 2-34.4 4.5-32.4 5.7-68.2 17.3-97.7 31.8C171 61.3 134 89.4 98.3 128.8 70.4 159.7 46.9 197 30.7 236c-15.2 36.6-25.2 77.6-27.8 113.8-0.5 8.1-1.4 16.6-1.9 19-1.2 5.2-1.3 28.9-0.2 29.6 0.5 0.3 1.2 6.1 1.5 12.8 1.8 34.1 11.1 75.4 24.9 110.8 12.2 31.1 31.2 65.8 48.9 89 18.7 24.5 38.4 45.6 60.4 64.6 5.3 4.6 22.8 17.9 31 23.6 27.1 18.9 65.7 38 97 48.1 41 13.2 75.5 18.6 118.6 18.7 53.4 0 94.1-7.5 142.2-26.4 48.1-18.8 88.8-45.3 126.6-82.1 20.4-20 29-30.1 50-59 5.7-7.9 18.5-30.1 26.2-45.5 4.4-8.7 7.9-16.1 7.9-16.5 0-0.3 0.9-2.5 1.9-4.8 5.7-12.5 14.2-37.7 18.1-53.8 4.7-19.7 5.2-22 8.5-44.9 3-20.4 3.9-65 1.7-85-3.6-33.1-8.6-57.1-17.7-84.5C710.9 149.3 621.1 58.8 507.3 20.3 480.3 11.2 450.6 5 419.5 1.8 406.7 0.5 361.3-0.1 358 1Zm42.2 160.1c3.5 1.2 8.1 3.4 10.3 5 2.2 1.5 36.7 35.5 76.6 75.6 79.1 79.3 76.5 76.3 79 90.4 1 5.7 1 7.7-0.5 13.4-2.3 9.3-4.9 13.9-11.2 20.2-7.2 7.2-14.5 10.5-25 11.1-9.4 0.5-16.9-1.3-24.2-5.8-2.6-1.6-20.8-19.1-40.5-39-19.8-19.8-36.1-36-36.3-36-0.2 0-0.5 60.6-0.6 134.7l-0.3 134.8-3.2 6.7c-8.5 18-28.6 27.6-46.6 22.3C365.6 591 353.8 579.9 349.8 568.2c-1-2.9-1.4-32.9-1.8-137.9l-0.5-134.2-36.6 36.3c-20.1 19.9-38.1 37.3-40 38.7-14.8 11.1-37.5 9.2-50.8-4.2-9-9.1-13.4-22.6-11.2-34.5 2.4-13.3 0.5-11.1 79.9-90.7 40.6-40.6 75.7-75.1 78.2-76.7 9.3-6 22.5-7.5 33.2-3.9z" stroke-width="0.1"/>
</svg>
To adjust the size of the image and make it adaptive:
Remove width="768" and height="766" from the svg header.
Wrap the svg in a container: <div class="container">
.container {
width:15%;
height:15%;
}
#circle {
fill:#b7e68f;
}
#arrow {
fill:red;
}
<div class="container">
<svg xmlns:svg="http://www.w3.org/2000/svg" xmlns="http://www.w3.org/2000/svg" version="1" viewBox="0 0 768 766" preserveAspectRatio="xMidYMid meet">
<metadata>
Created by potrace 1.15, written by Peter Selinger 2001-2017
</metadata>
<g transform="translate(0.000000,766.000000) scale(0.100000,-0.100000)" stroke="none">
<path id="arrow" d="m3580 7650c-14-4-48-8-76-9-53-1-200-20-344-45-324-57-682-173-977-318C1710 7047 1340 6766 983 6372 704 6063 469 5690 307 5300 155 4934 55 4524 29 4162 24 4081 15 3996 10 3972-2 3920-3 3683 8 3676c5-3 12-61 15-128 18-341 111-754 249-1108 122-311 312-658 489-890 187-245 384-456 604-646 53-46 228-179 310-236C1946 479 2332 288 2645 187 3055 55 3400 1 3831 0c534 0 941 75 1422 264 481 188 888 453 1266 821 204 200 290 301 500 590 57 79 185 301 262 455 44 87 79 161 79 165 0 3 9 25 19 48 57 125 142 377 181 538 47 197 52 220 85 449 30 204 39 650 17 850-36 331-86 571-177 845-376 1142-1274 2047-2412 2432-270 91-567 153-878 185-128 13-582 19-615 8z"/>
</g>
<path id="circle" d="m358 1c-1.4 0.4-4.8 0.8-7.6 0.9-5.3 0.1-20 2-34.4 4.5-32.4 5.7-68.2 17.3-97.7 31.8C171 61.3 134 89.4 98.3 128.8 70.4 159.7 46.9 197 30.7 236c-15.2 36.6-25.2 77.6-27.8 113.8-0.5 8.1-1.4 16.6-1.9 19-1.2 5.2-1.3 28.9-0.2 29.6 0.5 0.3 1.2 6.1 1.5 12.8 1.8 34.1 11.1 75.4 24.9 110.8 12.2 31.1 31.2 65.8 48.9 89 18.7 24.5 38.4 45.6 60.4 64.6 5.3 4.6 22.8 17.9 31 23.6 27.1 18.9 65.7 38 97 48.1 41 13.2 75.5 18.6 118.6 18.7 53.4 0 94.1-7.5 142.2-26.4 48.1-18.8 88.8-45.3 126.6-82.1 20.4-20 29-30.1 50-59 5.7-7.9 18.5-30.1 26.2-45.5 4.4-8.7 7.9-16.1 7.9-16.5 0-0.3 0.9-2.5 1.9-4.8 5.7-12.5 14.2-37.7 18.1-53.8 4.7-19.7 5.2-22 8.5-44.9 3-20.4 3.9-65 1.7-85-3.6-33.1-8.6-57.1-17.7-84.5C710.9 149.3 621.1 58.8 507.3 20.3 480.3 11.2 450.6 5 419.5 1.8 406.7 0.5 361.3-0.1 358 1Zm42.2 160.1c3.5 1.2 8.1 3.4 10.3 5 2.2 1.5 36.7 35.5 76.6 75.6 79.1 79.3 76.5 76.3 79 90.4 1 5.7 1 7.7-0.5 13.4-2.3 9.3-4.9 13.9-11.2 20.2-7.2 7.2-14.5 10.5-25 11.1-9.4 0.5-16.9-1.3-24.2-5.8-2.6-1.6-20.8-19.1-40.5-39-19.8-19.8-36.1-36-36.3-36-0.2 0-0.5 60.6-0.6 134.7l-0.3 134.8-3.2 6.7c-8.5 18-28.6 27.6-46.6 22.3C365.6 591 353.8 579.9 349.8 568.2c-1-2.9-1.4-32.9-1.8-137.9l-0.5-134.2-36.6 36.3c-20.1 19.9-38.1 37.3-40 38.7-14.8 11.1-37.5 9.2-50.8-4.2-9-9.1-13.4-22.6-11.2-34.5 2.4-13.3 0.5-11.1 79.9-90.7 40.6-40.6 75.7-75.1 78.2-76.7 9.3-6 22.5-7.5 33.2-3.9z" />
</svg>
</div>

Pandas Computing On Multidimensional Data

I have two data frames storing tracking data of offensive and defensive players during an NFL game. My goal is to calculate the maximum distance between an offensive player and the nearest defender during the course of the play.
As a simple example, I've made up some data where there are only three offensive players and two defensive players. Here is the data:
Defense
GameTime PlayId PlayerId x-coord y-coord
0 1 1 117 20.2 20.0
1 2 1 117 21.0 19.1
2 3 1 117 21.3 18.3
3 4 1 117 22.0 17.5
4 5 1 117 22.5 17.2
5 6 1 117 23.0 16.9
6 7 1 117 23.6 16.7
7 8 2 117 25.1 34.1
8 9 2 117 25.9 34.2
9 10 2 117 24.1 34.5
10 11 2 117 22.7 34.2
11 12 2 117 21.5 34.5
12 13 2 117 21.1 37.3
13 14 3 117 21.2 44.3
14 15 3 117 20.4 44.6
15 16 3 117 21.9 42.7
16 17 3 117 21.1 41.9
17 18 3 117 20.1 41.7
18 19 3 117 20.1 41.3
19 1 1 555 40.1 17.0
20 2 1 555 40.7 18.3
21 3 1 555 41.0 19.6
22 4 1 555 41.5 18.4
23 5 1 555 42.6 18.4
24 6 1 555 43.8 18.0
25 7 1 555 44.2 15.8
26 8 2 555 41.2 37.1
27 9 2 555 42.3 36.5
28 10 2 555 45.6 36.3
29 11 2 555 47.9 35.6
30 12 2 555 47.4 31.3
31 13 2 555 46.8 31.5
32 14 3 555 47.3 40.3
33 15 3 555 47.2 40.6
34 16 3 555 44.5 40.8
35 17 3 555 46.5 41.0
36 18 3 555 47.6 41.4
37 19 3 555 47.6 41.5
Offense
GameTime PlayId PlayerId x-coord y-coord
0 1 1 751 30.2 15.0
1 2 1 751 31.0 15.1
2 3 1 751 31.3 15.3
3 4 1 751 32.0 15.5
4 5 1 751 31.5 15.7
5 6 1 751 33.0 15.9
6 7 1 751 32.6 15.7
7 8 2 751 51.1 30.1
8 9 2 751 51.9 30.2
9 10 2 751 51.1 30.5
10 11 2 751 49.7 30.6
11 12 2 751 49.5 30.9
12 13 2 751 49.1 31.3
13 14 3 751 12.2 40.3
14 15 3 751 12.4 40.5
15 16 3 751 12.9 40.7
16 17 3 751 13.1 40.9
17 18 3 751 13.1 41.1
18 19 3 751 13.1 41.3
19 1 1 419 41.3 15.0
20 2 1 419 41.7 15.3
21 3 1 419 41.8 15.4
22 4 1 419 42.9 15.6
23 5 1 419 42.6 15.6
24 6 1 419 44.8 16.0
25 7 1 419 45.2 15.8
26 8 2 419 62.2 30.1
27 9 2 419 63.3 30.5
28 10 2 419 62.6 31.0
29 11 2 419 63.9 30.6
30 12 2 419 67.4 31.3
31 13 2 419 66.8 31.5
32 14 3 419 30.3 40.3
33 15 3 419 30.2 40.6
34 16 3 419 30.5 40.8
35 17 3 419 30.5 41.0
36 18 3 419 31.6 41.4
37 19 3 419 31.6 41.5
38 1 1 989 10.1 15.0
39 2 1 989 10.2 15.5
40 3 1 989 10.4 15.4
41 4 1 989 10.5 15.8
42 5 1 989 10.6 15.9
43 6 1 989 10.1 15.5
44 7 1 989 10.9 15.3
45 8 2 989 25.8 30.1
46 9 2 989 25.2 30.1
47 10 2 989 21.8 30.2
48 11 2 989 25.8 30.2
49 12 2 989 25.6 30.5
50 13 2 989 25.5 31.0
51 14 3 989 50.3 40.3
52 15 3 989 50.3 40.2
53 16 3 989 50.2 40.4
54 17 3 989 50.1 40.8
55 18 3 989 50.6 41.2
56 19 3 989 51.4 41.6
The data is essentially multidimensional, with GameTime, PlayId, and PlayerId as independent variables and x-coord and y-coord as dependent variables. How can I go about calculating the maximum distance from the nearest defender during the course of a play?
My guess is I would have to create columns containing the distance from each defender for each offensive player, but I don't know how to name those columns or how to account for an unknown number of defensive/offensive players (the full data set contains thousands of players).
Here is a possible solution; I think there is a way to make it more efficient:
Assuming you have a dataframe called offense_df and a dataframe called defense_df, first merge them on GameTime and PlayId:
import pandas as pd
from scipy.spatial import distance

merged_dataframe = pd.merge(offense_df, defense_df, on=['GameTime','PlayId'], suffixes=('_off','_def'))
The merged dataframe pairs every offensive player with every defender at each time step, which gives you the answer to your question; basically it will look like the following:
GameTime PlayId PlayerId_off x-coord_off y-coord_off PlayerId_def x-coord_def y-coord_def
0 1 1 751 30.2 15.0 117 20.2 20.0
1 1 1 751 30.2 15.0 555 40.1 17.0
2 1 1 419 41.3 15.0 117 20.2 20.0
3 1 1 419 41.3 15.0 555 40.1 17.0
4 1 1 989 10.1 15.0 117 20.2 20.0
The next two lines create a single coordinate column for the offensive player (coord_off) and for the defender (coord_def), each containing an (x, y) tuple; this simplifies the computation of the distance.
merged_dataframe['coord_off'] = merged_dataframe.apply(lambda x: (x['x-coord_off'], x['y-coord_off']),axis=1)
merged_dataframe['coord_def'] = merged_dataframe.apply(lambda x: (x['x-coord_def'], x['y-coord_def']),axis=1)
We compute the distance to every defender at a given (GameTime, PlayId):
merged_dataframe['distance_to_def'] = merged_dataframe.apply(lambda x: distance.euclidean(x['coord_off'],x['coord_def']),axis=1)
For each (GameTime, PlayId, PlayerId_off) we take the distance to the nearest defender:
smallest_dist = merged_dataframe.groupby(['GameTime','PlayId','PlayerId_off'])['distance_to_def'].min()
Finally we take the maximum distance (of these minimum distances) for each PlayerId.
smallest_dist.groupby('PlayerId_off').max()
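As a possibly more efficient variant (a sketch, not part of the original answer), the tuple columns and row-wise apply calls can be replaced with a vectorized numpy computation on the merged dataframe:
import numpy as np

# Vectorized Euclidean distance between each offensive player and defender.
merged_dataframe['distance_to_def'] = np.hypot(
    merged_dataframe['x-coord_off'] - merged_dataframe['x-coord_def'],
    merged_dataframe['y-coord_off'] - merged_dataframe['y-coord_def'])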
