I have a df like this,
ID Machine 17-Dec 18-Jan 18-Feb 18-Mar 18-Apr 18-May
160 Car 348 280 274 265 180 224
163 Var 68248 72013 55441 64505 71097 78006
165 Assus 1337 1279 1536 1461 1555 1700
215 Owen 118 147 104 143 115 153
I calculate the mean and standard deviation like this:
df['Avg'] = np.mean(all_np_values, axis=1)
df['Std.Dev'] = np.std(all_np_values, axis=1)
Then I get the following data frame.
ID Machine 17-Dec 18-Jan 18-Feb 18-Mar 18-Apr 18-May Mean Std.Dev
160 Car 348 280 274 265 180 224 261.83 51.70
163 Var 68248 72013 55441 64505 71097 78006 68218.33 7018.24
165 Assus 1337 1279 1536 1461 1555 1700 1478 140.44
215 Owen 118 147 104 143 115 153 130 18.40
Now I want a final dataframe like the one below, where I look at 18-May and mark Yes or No depending on whether its value is more than 2 standard deviations above or below the mean.
ID Machine 17-Dec 18-Jan 18-Feb 18-Mar 18-Apr 18-May Mean Std.Dev Above Below
160 Car 348 280 274 265 180 224 261.83 51.70 No No
163 Var 68248 72013 55441 64505 71097 78006 68218.33 7018.24 No No
165 Assus 1337 1279 1536 1461 1555 1700 1478 140.44 No No
215 Owen 118 147 104 143 115 153 130 18.40 No No
I tried to do the following,
for value in df['18-May']:
    if value > (df['Avg'] + 2 * df['Std.Dev']):
        df['Above'] = 'Yes'
    else:
        df['Above'] = 'No'
This gives me an error:
ValueError: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
I understand the error after reading some older posts. My understanding is that the comparison returns a Series of boolean values rather than a single True or False.
I am not sure how to use a mask to create the 'Yes' and 'No' values in new 'Above' and 'Below' columns. How can I add that to my code above?
Any thoughts would be helpful.
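One possible approach, sketched here under assumptions rather than taken from the original post, is to drop the loop and compare whole columns at once with np.where. The sketch rebuilds the sample frame from the table above, uses np.std with its default population formula (which matches the 51.70 shown for Car), and stores the mean in the 'Avg' column as in the attempted code. The names month_cols, upper and lower are introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "ID": [160, 163, 165, 215],
    "Machine": ["Car", "Var", "Assus", "Owen"],
    "17-Dec": [348, 68248, 1337, 118],
    "18-Jan": [280, 72013, 1279, 147],
    "18-Feb": [274, 55441, 1536, 104],
    "18-Mar": [265, 64505, 1461, 143],
    "18-Apr": [180, 71097, 1555, 115],
    "18-May": [224, 78006, 1700, 153],
})

# Hypothetical helper name: the columns that hold the monthly values.
month_cols = ["17-Dec", "18-Jan", "18-Feb", "18-Mar", "18-Apr", "18-May"]
values = df[month_cols].to_numpy()

df["Avg"] = values.mean(axis=1)
df["Std.Dev"] = values.std(axis=1)  # population std (ddof=0), like np.std

# Row-wise thresholds; these are Series aligned with df, not scalars.
upper = df["Avg"] + 2 * df["Std.Dev"]
lower = df["Avg"] - 2 * df["Std.Dev"]

# Element-wise comparison yields a boolean mask; np.where maps it to labels.
df["Above"] = np.where(df["18-May"] > upper, "Yes", "No")
df["Below"] = np.where(df["18-May"] < lower, "Yes", "No")

print(df[["Machine", "18-May", "Avg", "Std.Dev", "Above", "Below"]])
```

The ValueError goes away because no Series is ever passed to a plain if statement; each row gets its own label from the mask.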
Let's assume I have the following tensor t:
]m=: 100 + 4 4 $ i.16
100 101 102 103
104 105 106 107
108 109 110 111
112 113 114 115
]t=: (m ,: m+100) , m+200
100 101 102 103
104 105 106 107
108 109 110 111
112 113 114 115
200 201 202 203
204 205 206 207
208 209 210 211
212 213 214 215
300 301 302 303
304 305 306 307
308 309 310 311
312 313 314 315
I would like to select the diagonal plane of it, that is:
100 105 110 115
200 205 210 215
300 305 310 315
How do I define a function that acts on indices? (Here, for every plane, select the elements whose row index equals their column index.) Also, how do I define functions that work on values and indices together? I would be interested in having something like this:
(f t) { t
Thanks!
Transpose x|:y with boxed arguments runs the axes together to produce a single axis. You can use this to produce a rather idiomatic solution:
(< 0 1) |: m
100 105 110 115
(<0 1) |:"2 t
100 105 110 115
200 205 210 215
300 305 310 315
where the rank conjunction " applies the diagonal selection to each 2-cell (matrix) of t.
You can convert an array of values to its corresponding array of indices with (#:i.)@$ m
To get an example f "working on values and indices together", you can then plug it in as a dyad that takes values on the left and indices on the right:
f=.(2|[) +. ([:=/"1]) NB. odd value or diagonal index
]r=.([ f (#:i.)@$) m NB. values f indices
1 1 0 1
0 1 0 1
0 1 1 1
0 1 0 1
r #&, m NB. flatten lists & get values where bit is set
100 101 103 105 107 109 110 111 113 115
Everything wrapped into an adverb that can be applied to f:
sel=.1 : '#~&, [ u (#:i.)@$'
f sel m
100 101 103 105 107 109 110 111 113 115
I'm attempting to reverse engineer an SVG animation in JavaScript to better understand it, and I'm seeing the following SVG code representing an "Up" motion. However, it doesn't look like any SVG code I am used to working with. Can you help identify how this SVG is structured, or how I can adjust the following code so that I can open it in image editing software?
d 601 9aAaAaAnBkNnUaNaN"/D 18 10bAaAnAnBuXaN"/F 22 10W7AaAaBaEaGiAW-6NiNnXaNbUaNaNaN"/D 30 10bAaEuUnU"/D 114 10bAaAnAnBuXaN"/F 117 10W7AaAaBaBaAaGkAn0NkNnKaNaNaUaNaU"/D 125 10eAaGnAnUnUnU"/D 66 12eBnAnAnNnUaN"/F 70 12W6AaAaAbEaGkAn2NuKaNaUaNaNaNaN"/D 76 12gEuNnNnN"/D 593 12eBnAnAuKaN"/F 596 12eAaAeUbAnEbKeAnAbJnAiAxNxAkAnUaXnNbNaNaU"/D 604 13bEuK"/D 166 14eEnAkKaN"/D 608 14aAnN"/F 169 15eAeAaAaBaAaAnAnBn0NkNnNnKbNgNaU"/D 222 15aAbBaGxKnKaN"/D 175 16gEnAnUnNnN"/D 308 16aAaAaAaBuAnAnEaAaAaEnAuNnNuNnUnNaUbw-7bN"/D 314 16gAaGkNuX"/D 268 17eAaAaAaAaAaGnBnNuUuUaUiNaNaU"/D 501 17bAaAuAnAnKaN"/D 548 17eAnAnAnAnKaN"/F 552 17jEW-6AkNaNaNbN"/D 557 17gExK"/D 209 18bAaEeAaBnAuAeAW8NnNnKgGnBn1NkGaAuNnNnXaUnw-6aN"/F 216 18jAaEgAxEaAaAW-8NkNbNaUnNkUaAeK"/D 260 18eAaBnAuAnEaBnAnBeAaEnAkNnXnNnNnXaNbXaNaN"/D 364 18eAaBuNkNaN"/D 509 18bAaBnAaEnAeAaAaBnAaEnAxXaNuNuNW-8AnAuUaUaNbNW6AeKnNnUaN"/D 159 19bAaBaAaAeAa0UaNaNeGnAnAnAiNn3NnNnNnXaN"/F 213 19aAnN"/D 214 19bBkNaN"/D 356 19eAnAnAnAnBbNW9NeAaAbAaAaBaBnBkBxUnNaNeUaNnUuNW-9AkAnAnEaAeAaBnAkUkNnXaNaNaw-6aNaN"/D 460 19eAaBuAnNnK"/F 502 19W6BaAaEkNW-6AuAnUaUaNaNaN"/F 312 20gAgAnAnBnAaAaEaAaNaGuAkAW-7NuNnKaAbNaKnNnNnKaNaNbN"/F 358 20W8AbAaAkAW-9AuUaNaNaN"/D 403 20eBnBnNnK"/F 407 20jAbAaAaAaGnAkAW-8NnNnUaUaUaNaN"/D 412 20gEnNnNuN"/D 451 20gAnAnAnBnI"/F 455 20jBaAaEbNaNaGuAnAn1NnXaUaNaNaN"/D 551 20W6AeAaAaAaEnNuNuNW-9AuAnKaNaNeN"/F 557 20gAaBnNnNkN"/D 17 21jAW6NjAaAaBaw6nJnBnAxNnUnNaNbNaNaKnKnNnNW-8AuAnAnGaBbAaAbAnAuAuNnNnw-7nIaUaN"/D 113 21eAa0NeAaAaBaBnEuKnNuNW-7AuAnBnBnBaAaAeBnAnAxNnw-6nNnKaKaUaU"/F 263 21W8BnBbBnNuEbNaNbAaBnNuAnEaBnAW-9NaKnNkUaNaBeUnNnAnUnKaNbN"/D 320 21ew9uAnNnKnNnNaUaNaN"/F 511 21aAnN"/D 596 21gAgNjAbBaEaBnBnJnBuAuNnNnXaNbNbNbNnNnNnNnNkNW-8AuAuKaNeN"/D 65 22aAa2NeAaBaw6uUnUnNuNW-6AkAnAnAnBuw-6aUaN"/F 462 22bAuN"/D 462 23bAaAnAuK"/F 464 23aAnN"/F 512 23aEuNaU"/F 21 24W8AaAaEaEnAnAuAW-7NnNuUnXaNaNbN"/D 417 24aAaBaAnBnAiAW-9NuNnKaNaBaAaAW8NeNaX"/F 549 24W9AbAbAaBxAnBnAW-6NuNuUnKaNbN"/F 596 
24W8AeAaAaAaAaAuAuAuAnExNuNuNnNnNnKnUbNbN"/F 71 25W6AbAaBaBaAnAnAnAkAiNkNnNnNnKaNaNaNeN"/F 119 25W7AbAaEaBnAuAnAkAxNkNnNnUaUaUaNbN"/D 265 25bAuN"/F 359 25W9AbBaAnBkAnAaBuAW-7NaUnNkNnKaNaNeN"/D 403 25aAnN"/D 449 25bGaAa1NaNbNaAbJnNuNW-6AW-8NnNnNnX"/D 269 26bAaAnAuK"/D 361 26bAuN"/D 365 26bAaAnAuK"/F 161 27a3AgGnBuAkAW-6NkNnNnXnNaN"/D 262 27aAaBkUaN"/D 357 27bAaBnAuUnNaN"/F 497 27aAnN"/F 211 28eAa1GnBnNnAkAxNkNnNaNnX"/F 500 28W8AbAbAnGkAW-6NuNnNnUnNbNaN"/D 592 28aEaAaAaAbEnAkNnNnw-8"/D 158 29eGaAuNuX"/D 272 29bAaAaAnAnAuNnKaN"/D 546 29aBbAbEnAuNnNnI"/D 559 29gJnAxNnUaUaN"/F 418 30aGuAkAW-8NuKW9NjN"/F 403 31aAnN"/F 460 31W6AbAuAuAxAiNuNnUW8N"/D 65 32aAaAaAeAnAnAuNuI"/D 81 32aGnBxNnUeNaNaN"/D 129 32aEnAnAxNnNeNaNbN"/D 177 32aw6uAuNnNnUeNbU"/F 275 32aAnN"/F 274 33aAnN"/D 222 34aBnAkUeN
I am using sns.lineplot to show the confidence intervals in a plot.
sns.lineplot(x = threshold, y = mrl_array, err_style = 'band', ci=95)
plt.show()
I'm getting the following plot, which doesn't show the confidence interval:
What's the problem?
There is probably only a single observation per x value.
If there is only one observation per x value, then there is no confidence interval to plot.
Bootstrapping is performed per x value, but there needs to be more than one observation for this to take effect.
ci: Size of the confidence interval to draw when aggregating with an estimator. 'sd' means to draw the standard deviation of the data. Setting to None will skip bootstrapping.
Note the following examples from seaborn.lineplot.
This is also the case for sns.relplot with kind='line'.
The question specifies sns.lineplot, but this answer applies to any seaborn plot that displays a confidence interval, such as seaborn.barplot.
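A quick way to check whether this applies to your data is to count the observations per x value before plotting. The following is only a sketch: observations_per_x is a hypothetical helper, and the two small frames reuse a few rows from the flights data shown below so that the snippet runs without downloading anything.

```python
import pandas as pd

def observations_per_x(data: pd.DataFrame, x: str) -> pd.Series:
    """Return how many rows exist for each distinct value of column x."""
    return data.groupby(x).size()

# One row per year, like the May-only subset: nothing to aggregate.
single = pd.DataFrame({"year": [1949, 1950, 1951],
                       "passengers": [121, 125, 172]})

# Two rows per year (Jan and Feb): an interval can be estimated.
repeated = pd.DataFrame({"year": [1949, 1949, 1950, 1950],
                         "passengers": [112, 118, 115, 126]})

single_has_repeats = bool((observations_per_x(single, "year") > 1).any())
repeated_has_repeats = bool((observations_per_x(repeated, "year") > 1).any())

print(single_has_repeats)    # False: no band would be drawn
print(repeated_has_repeats)  # True: a band can be drawn
```

If every count is 1, the line plot has a single point per x value and no interval can be computed from it.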
Data
import seaborn as sns
# load data
flights = sns.load_dataset("flights")
year month passengers
0 1949 Jan 112
1 1949 Feb 118
2 1949 Mar 132
3 1949 Apr 129
4 1949 May 121
# only May flights
may_flights = flights.query("month == 'May'")
year month passengers
4 1949 May 121
16 1950 May 125
28 1951 May 172
40 1952 May 183
52 1953 May 229
64 1954 May 234
76 1955 May 270
88 1956 May 318
100 1957 May 355
112 1958 May 363
124 1959 May 420
136 1960 May 472
# standard deviation for each year of May data
may_flights.set_index('year')[['passengers']].std(axis=1)
year
1949 NaN
1950 NaN
1951 NaN
1952 NaN
1953 NaN
1954 NaN
1955 NaN
1956 NaN
1957 NaN
1958 NaN
1959 NaN
1960 NaN
dtype: float64
# flight in wide format
flights_wide = flights.pivot(index="year", columns="month", values="passengers")
month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
year
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
# standard deviation for each year
flights_wide.std(axis=1)
year
1949 13.720147
1950 19.070841
1951 18.438267
1952 22.966379
1953 28.466887
1954 34.924486
1955 42.140458
1956 47.861780
1957 57.890898
1958 64.530472
1959 69.830097
1960 77.737125
dtype: float64
Plots
may_flights has one observation per year, so no CI is shown.
sns.lineplot(data=may_flights, x="year", y="passengers")
sns.barplot(data=may_flights, x='year', y='passengers')
flights_wide shows there are twelve observations for each year, so the CI shows when all of flights is plotted.
sns.lineplot(data=flights, x="year", y="passengers")
sns.barplot(data=flights, x='year', y='passengers')
Date Issue Redemption App Date Issue Redemption App
21-Nov 891 200 523 28-Nov 660 179 302
22-Nov 607 125 423 29-Nov 712 165 420
23-Nov 456 165 422 30-Nov 499 128 331
24-Nov 510 115 391 1-Dec 596 170 392
25-Nov 525 120 400 2-Dec 573 169 397
26-Nov 585 158 396 3-Dec 450 120 350
27-Nov 582 88 410 4-Dec 650 150 360
Try creating your chart with the x and y axis data, then use the "add data" function in the chart menu.
I'm currently generating an extremely large data set on a remote HPC (high-performance computer). We are talking about 3 TB at the moment, and it could reach up to 10 TB once I'm done.
Each of the 450 000 files ranges from a few KB to about 100 MB and contains lines of integers with no repetitive or predictable patterns. The files are split among 150 folders (I use the path to classify them according to the input parameters). That would be fine, but my research group is technically limited to 1 TB of disk space on the remote server, although the admins are willing to turn a blind eye until the situation gets sorted out.
What would you recommend to compress such a dataset?
One limitation is that tasks can't run for more than 48 hours at a time on this computer, so slow but efficient compression methods are possible only if 48 hours is enough. I really have no other options, as neither I nor my group owns enough disk space on other machines.
EDIT: Just to clarify, this is a remote computer that runs some variation of Linux. All standard compression tools are available. I don't have superuser rights.
EDIT2: As requested by Sergio, here is a sample output (the first 10 lines of a file):
27 42 46 63 95 110 205 227 230 288 330 345 364 367 373 390 448 471 472 482 509 514 531 533 553 617 636 648 667 682 703 704 735 740 762 775 803 813 882 915 920 936 939 942 943 979 1018 1048 1065 1198 1219 1228 1513 1725 1888 1944 2085 2190 2480 5371 5510 5899 6788 7728 9514 10382 11946 13063 13808 16070 23301 23511 24538
93 94 106 143 157 164 168 181 196 293 299 334 369 372 439 457 508 527 547 557 568 570 573 592 601 668 701 704 799 838 848 870 875 882 890 913 953 959 1022 1024 1037 1046 1169 1201 1288 1615 1684 1771 2043 2204 2348 2387 2735 3149 4319 4890 4989 5321 5588 6453 7475 9277 9649 9654 11433 16966
1463
183 469 514 597 792
25 50 143 152 205 244 253 424 433 446 461 476 486 545 552 570 632 642 647 665 681 682 718 735 746 772 792 811 830 851 891 903 925 1037 1115 1147 1171 1612 1979 2749 3074 3158 6042 12709 20571 20859
24 30 86 312 726 875 1023 1683 1799
33 36 42 65 110 112 122 227 241 262 274 284 305 328 353 366 393 414 419 449 462 488 489 514 635 690 732 744 767 772 812 820 843 844 855 889 893 925 936 939 981 1015 1020 1060 1064 1130 1174 1304 1393 1477 1939 2004 2200 2205 2208 2216 2234 3284 4456 5209 6810 6834 8067 10811 10895 12771 15291
157 761 834 875 1001 2492
21 141 146 169 181 256 266 337 343 367 397 402 405 433 454 466 513 527 656 684 708 709 732 743 811 883 913 938 947 986 987 1013 1053 1190 1215 1288 1289 1333 1513 1524 1683 1758 2033 2684 3714 4129 6015 7395 8273 8348 9483 23630
1253
All integers are separated by one whitespace, and each line corresponds to a given element. I use implicit line numbers to store this information, because my data is associative, i.e. the 0th element is associated with elements 27 42 46 63 95 110, etc. I believe that there is no extra information whatsoever.
A few points that may help:
It looks like your numbers are sorted. If this is always the case, then it will be more efficient to compress the differences between adjacent numbers rather than the numbers themselves (since the differences will be somewhat smaller on average)
There are good ways of encoding small integer values in binary format, that are probably better than encoding them in text format. See the technique used by Google in their protocol buffers: (https://developers.google.com/protocol-buffers/docs/encoding)
Once you have applied the above techniques, then zipping / some standard form of compression should improve everything even further.
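The first two points can be combined in a short sketch. It assumes every line is sorted in increasing order and contains only non-negative integers, and it stores the number of values followed by the gaps between neighbours as base-128 varints (7 payload bits per byte, the scheme described in the protocol buffers encoding page linked above). The function names encode_varint, encode_line and decode_line are made up for this illustration.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("varints here only support non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F          # lowest 7 bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_line(numbers: list[int]) -> bytes:
    """Store the count, then the gaps between consecutive sorted values."""
    out = bytearray(encode_varint(len(numbers)))
    previous = 0
    for value in numbers:
        out += encode_varint(value - previous)  # fails loudly if unsorted
        previous = value
    return bytes(out)

def decode_line(data: bytes, pos: int = 0) -> tuple[list[int], int]:
    """Inverse of encode_line; returns the values and the next read position."""
    def read_varint(pos: int) -> tuple[int, int]:
        result = shift = 0
        while True:
            byte = data[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                return result, pos
            shift += 7

    count, pos = read_varint(pos)
    numbers, previous = [], 0
    for _ in range(count):
        gap, pos = read_varint(pos)
        previous += gap
        numbers.append(previous)
    return numbers, pos

line = [24, 30, 86, 312, 726, 875, 1023, 1683, 1799]  # a line from the sample
text = " ".join(map(str, line)) + "\n"
packed = encode_line(line)
decoded, _ = decode_line(packed)
print(len(text), len(packed), decoded == line)
```

Because each encoded line carries its own count, lines can be concatenated in one binary file and read back in order, which preserves the implicit line numbers. A general-purpose compressor can then be run over the binary files as suggested in the last point.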
There is some research at this LINK that breaks down the pros and cons of using gzip, bzip2, and lzma. Hopefully it lets you make an informed decision about the best approach.
All your numbers seem to be increasing within each line. A rather common approach in database technology is to store only the first value and then the difference between each value and its predecessor, turning a line like
24 30 86 312 726 875 1023 1683 1799
into something like
24 6 56 226 414 149 148 660 116
Other lines of your example would show even more benefit, as their differences are smaller. This also works when the numbers occasionally decrease, but you then have to be able to deal with negative differences.
The second thing to do would be to change the encoding. While compression will reduce this overhead, you are currently using 8 bits per character, whereas you need only 4 bits (the digits 0-9 plus a space as separator). Implementing your own "4-bit character set" will already cut your storage requirements to half of the current size. In the end, this amounts to a kind of binary encoding of numbers of arbitrary length.
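A minimal sketch of such a 4-bit character set follows, assuming the files contain nothing but digits, single spaces and newlines. The alphabet order, the padding code 0xF and the names pack_nibbles and unpack_nibbles are arbitrary choices made for this illustration.

```python
ALPHABET = "0123456789 \n"  # 12 symbols, so each one fits in 4 bits
PAD = 0xF                   # unused code, marks an empty trailing nibble

def pack_nibbles(text: str) -> bytes:
    """Pack two characters into every byte; raises ValueError on other chars."""
    codes = [ALPHABET.index(ch) for ch in text]
    if len(codes) % 2:
        codes.append(PAD)  # odd length: fill the last low nibble
    return bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))

def unpack_nibbles(data: bytes) -> str:
    """Inverse of pack_nibbles."""
    chars = []
    for byte in data:
        for code in (byte >> 4, byte & 0x0F):
            if code != PAD:
                chars.append(ALPHABET[code])
    return "".join(chars)

text = "24 30 86 312 726 875 1023 1683 1799\n"  # a line from the sample
packed = pack_nibbles(text)
restored = unpack_nibbles(packed)
print(len(text), len(packed), restored == text)
```

The same packing can be applied to the difference-encoded lines instead of the raw ones, as long as negative differences do not occur or a sign symbol is added to the alphabet.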