I quantized a ResNet-18 model of mine using graph quantization, which led to a forward pass of about 100 ms (on my CPU) and a size reduction of 40 MB (from the initial 85 MB down to 45 MB).
I then pruned this model from the initial 25M parameters down to 14.5M, and its size shrank from 85 MB to 58 MB. I then quantized the resulting model, hoping for further improvements, but instead got diminishing returns.
That is, I don't see the inference speed I expected: instead of being faster, the new model is simply worse than the first one (the model quantized from the unpruned network).
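For context, the pipeline looked roughly like the following. This is only a minimal sketch of the idea, not my exact code: it assumes torchvision's quantization-ready ResNet-18, eager-mode post-training static quantization with the fbgemm backend, and L1 unstructured pruning; the 0.4 pruning amount and the skipped calibration loop are placeholders.

import torch
import torch.nn.utils.prune as prune
from torchvision.models.quantization import resnet18

# Quantization-ready ResNet-18 (ships with QuantStub/DeQuantStub and fuse_model()).
model = resnet18(pretrained=False, quantize=False).eval()

# --- Unstructured magnitude pruning ---
# Zeroes out ~40% of each conv layer's weights. The tensors stay dense (same shapes),
# so the amount of arithmetic per forward pass does not shrink.
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.4)
        prune.remove(module, "weight")  # bake the zeroed weights in permanently

# --- Post-training static quantization (eager mode, PyTorch 1.6) ---
model.fuse_model()                                                # fuse conv + bn + relu
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # x86 CPU backend
prepared = torch.quantization.prepare(model)
# ... feed a few representative batches through `prepared` here to calibrate ...
quantized = torch.quantization.convert(prepared)

(As far as I understand, torch.nn.utils.prune only masks weights to zero; it does not remove channels or change layer shapes.)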
Is this expected behavior? Here are the two models for comparison:
https://gofile.io/d/zCzHAF
And here are their runtime benchmark results:
Quantized (no pruning done beforehand):
1>[ RUN ] EmbedderModelForwardFixture.ModelEmbedderBench (10 runs, 10 iterations per run)
1>[ DONE ] EmbedderModelForwardFixture.ModelEmbedderBench (5672.357520 ms)
1>[ RUNS ] Average time: 567235.752 us (~31674.053 us)
1> Fastest time: 530900.462 us (-36335.290 us / -6.406 %)
1> Slowest time: 640024.562 us (+72788.810 us / +12.832 %)
1> Median time: 561095.762 us (1st quartile: 548392.562 us | 3rd quartile: 577176.062 us)
1>
1> Average performance: 1.76294 runs/s
1> Best performance: 1.88359 runs/s (+0.12066 runs/s / +6.84409 %)
1> Worst performance: 1.56244 runs/s (-0.20050 runs/s / -11.37282 %)
1> Median performance: 1.78223 runs/s (1st quartile: 1.82351 | 3rd quartile: 1.73257)
1>
1>[ITERATIONS] Average time: 56723.575 us (~3167.405 us)
1> Fastest time: 53090.046 us (-3633.529 us / -6.406 %)
1> Slowest time: 64002.456 us (+7278.881 us / +12.832 %)
1> Median time: 56109.576 us (1st quartile: 54839.256 us | 3rd quartile: 57717.606 us)
1>
1> Average performance: 17.62935 iterations/s
1> Best performance: 18.83592 iterations/s (+1.20657 iterations/s / +6.84409 %)
1> Worst performance: 15.62440 iterations/s (-2.00495 iterations/s / -11.37282 %)
1> Median performance: 17.82227 iterations/s (1st quartile: 18.23511 | 3rd quartile: 17.32574)
Pruned-Quantized model:
1>[ RUN ] EmbedderModelForwardFixture.ModelEmbedderBench (10 runs, 10 iterations per run)
1>[ DONE ] EmbedderModelForwardFixture.ModelEmbedderBench (5778.116020 ms)
1>[ RUNS ] Average time: 577811.602 us (~26296.168 us)
1> Fastest time: 537246.562 us (-40565.040 us / -7.020 %)
1> Slowest time: 617859.662 us (+40048.060 us / +6.931 %)
1> Median time: 585275.362 us (1st quartile: 554311.262 us | 3rd quartile: 594753.362 us)
1>
1> Average performance: 1.73067 runs/s
1> Best performance: 1.86134 runs/s (+0.13067 runs/s / +7.55054 %)
1> Worst performance: 1.61849 runs/s (-0.11218 runs/s / -6.48174 %)
1> Median performance: 1.70860 runs/s (1st quartile: 1.80404 | 3rd quartile: 1.68137)
1>
1>[ITERATIONS] Average time: 57781.160 us (~2629.617 us)
1> Fastest time: 53724.656 us (-4056.504 us / -7.020 %)
1> Slowest time: 61785.966 us (+4004.806 us / +6.931 %)
1> Median time: 58527.536 us (1st quartile: 55431.126 us | 3rd quartile: 59475.336 us)
1>
1> Average performance: 17.30668 iterations/s
1> Best performance: 18.61343 iterations/s (+1.30675 iterations/s / +7.55054 %)
1> Worst performance: 16.18491 iterations/s (-1.12177 iterations/s / -6.48174 %)
1> Median performance: 17.08597 iterations/s (1st quartile: 18.04041 | 3rd quartile: 16.81369)
To put it simply, averaged over the benchmark runs, each model takes:
r18_default : 805.72 ms (mean)
quantized_model : 560 ms (mean)
r18_pruned : 7,466.78 ms
pruned_then_quantized: 578 ms (mean)
Not only is the pruned-then-quantized model not faster, it has actually become slower! You can also see that the pruned model itself is extremely slow, roughly 10x slower than the default model.
Note:
In case it matters, training (pruning and fine-tuning the model) was done with PyTorch 1.5.1, and the final graph quantization was done on Windows with PyTorch 1.6.
Note 2:
This is tested and evaluated using libtorch (1.6) on a Windows 10 machine, but there shouldn't be any discrepancy between Python and C++.
I am running a CatBoost classifier with the following settings:
model = CatBoostClassifier(iterations=1000, learning_rate=0.05, depth=7, loss_function='MultiClass', calc_feature_importance=True)
I have 5 classes, and while fitting the model it starts from negative values and increases, as shown below:
0: learn: -1.5036342 test: -1.5039740 best: -1.5039740 (0) total: 18s remaining: 4h 59m 46s
1: learn: -1.4185548 test: -1.4191364 best: -1.4191364 (1) total: 37.8s remaining: 5h 14m 24s
2: learn: -1.3475387 test: -1.3482641 best: -1.3482641 (2) total: 56.3s remaining: 5h 12m 1s
3: learn: -1.2868831 test: -1.2877465 best: -1.2877465 (3) total: 1m 15s remaining: 5h 12m 32s
4: learn: -1.2342138 test: -1.2351585 best: -1.2351585 (4) total: 1m 34s remaining: 5h 13m 56s
Is this normal behaviour? In most machine learning algorithms, log loss is positive and decreases with training. What am I missing here?
Yes, this is normal behaviour.
When you specify loss_function='MultiClass' in your model's parameters, it uses another loss function, not LogLoss, for optimisation. The definition can be found here.
To understand the sign of that function, think of the best-case and worst-case scenarios. In the best case, the value of the target function for object a_i is entirely concentrated in the correct class t, so the fraction inside the log (in the formula on the linked page) equals 1 and the log is 0. As you diverge from that best case, the fraction decreases towards 0 and the log itself becomes more and more negative.
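Written out (with observation weights dropped for brevity), the MultiClass value that gets printed is the average log-softmax of the true class t_i over the N objects:

\[
\text{MultiClass} \;=\; \frac{1}{N}\sum_{i=1}^{N} \log\!\left(\frac{e^{\,a_{i,t_i}}}{\sum_{j} e^{\,a_{i,j}}}\right)
\]

This is at most 0 and increases towards 0 as the fit improves; with 5 classes, an uninformative model scores about \(\log(1/5) \approx -1.61\), which is why your log starts near -1.5 and climbs.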
Here I fit a survival model with an rcs() term. Is anova() from the rms package the right way to test the linearity of the association? And how should I interpret the p-value of the Nonlinear term (0.094 here): does it support adding an rcs() term to the Cox model?
library(rms)
data(pbc)
d <- pbc
rm(pbc, pbcseq)
d$status <- ifelse(d$status != 0, 1, 0)
dd = datadist(d)
options(datadist='dd')
# rcs model
m2 <- cph(Surv(time, status) ~ rcs(albumin, 4), data=d)
anova(m2)
Wald Statistics          Response: Surv(time, status)

Factor        Chi-Square  d.f.  P
albumin            82.80     3  <.0001
 Nonlinear          4.73     2  0.094
TOTAL              82.80     3  <.0001
The proper way to test this is to compare the log-likelihood (a.k.a. deviance) across two models, a full one and a reduced one:
m2 <- cph(Surv(time, status) ~ rcs(albumin, 4), data=d)
anova(m2)
m <- cph(Surv(time, status) ~ albumin, data=d)
p.val <- 1 - pchisq((m2$loglik[2] - m$loglik[2]), 2)
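For reference, the likelihood-ratio (deviance) comparison being appealed to here has the standard form

\[
\chi^2_{\mathrm{LR}} \;=\; 2\,\bigl(\ell_{\text{full}} - \ell_{\text{reduced}}\bigr) \;\sim\; \chi^2_{\Delta\mathrm{df}}, \qquad \Delta\mathrm{df} = 2,
\]

with two degrees of freedom because the reduced model drops the two nonlinear spline coefficients.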
You can see the difference in inference between the less accurate Wald statistic (which in your case was not significant anyway, since the p-value was > 0.05) and this more accurate method in the example Harrell uses on his ?cph help page. Using his example:
> anova(f)
Wald Statistics          Response: S

Factor        Chi-Square  d.f.  P
age                57.75     3  <.0001
 Nonlinear          8.17     2  0.0168
sex                18.75     1  <.0001
TOTAL              75.63     4  <.0001
You would incorrectly conclude that the nonlinear term was "significant" at the conventional 0.05 level, despite the fact that the code creating the model is entirely linear in age (on the log-hazard scale):
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
Create a reduced model and compare:
f0 <- cph(S ~ age + sex, x=TRUE, y=TRUE)
anova(f0)
#-------------
Wald Statistics          Response: S

Factor        Chi-Square  d.f.  P
age                56.64     1  <.0001
sex                16.26     1  1e-04
TOTAL              75.85     2  <.0001
The difference in deviance is not significant, with a difference of 2 degrees of freedom:
1-pchisq((f$loglik[2]- f0$loglik[2]),2)
[1] 0.1243212
I don't know why Harrell leaves this example in, because I've taken his RMS course and know that he endorses the cross-model comparison of deviance as the more accurate approach.
I'm new to Stack Overflow and looking for tips and help tuning Pharo 5.0 in a Windows environment.
My PC runs Windows 10 on an i5-4670K at 4 GHz, with a Plextor 512 GB SSD as the C: drive where Pharo 5.0 is installed and running.
Below is the code I used to understand Pharo's behavior, both in performance and in computational precision.
| x y |
x := 0.
y := 400000000.
[1 to: 2000000000 do: [ :i |
x := x + 0.2]] timeToRun
This took 2 minutes 8.281 seconds to execute. If I knock one zero off the number of iterations, it takes only 1.762 seconds to finish. Ten times as many iterations used more than 70 times the execution time. It seems that I hit some system boundary, such that the execution time grows by much more than a factor of 10 between the two cases.
I'd appreciate any tips and help fine-tuning the Pharo 5.0 VM so I can reduce such unexpected and undesirable behavior.
P.S. During the execution, Windows Task Manager didn't report any change in disk activity; the execution is essentially all RAM and CPU operations. By the way, unless you have a much faster PC, please don't try adding another zero to the number of iterations: it took so long that I had to break the execution.
Welcome to SO (and to the smalltalk tag!)
First note that the temporary y plays no role here, so we can simplify the snippet a little to
| x |
x := 0.
[1 to: 200000000 * 10 do: [ :i |
x := x + 0.2]] timeToRun
which you compared against
| x |
x := 0.
[1 to: 200000000 * 1 do: [ :i |
x := x + 0.2]] timeToRun
The reason the first version is not just 10 times slower than the second is that in the former the block variable i crosses from the SmallInteger domain into the LargeInteger one. So every time the block increments i once i has passed the SmallInteger boundary, the addition i := i + 1 involves LargeInteger arithmetic, which is slower than SmallInteger arithmetic.
And how many times does that LargeInteger arithmetic take place? To calculate that, we only need to subtract SmallInteger maxVal from 200000000 * 10:
(200000000 * 10) - SmallInteger maxVal = 926,258,177
meaning that Pharo is performing that number of operations on the large integer i.
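For reference, that count falls straight out of the SmallInteger range (assuming the 32-bit Pharo 5 image, where SmallIntegers are 31-bit tagged integers):

\[
2\,000\,000\,000 \;-\; \underbrace{(2^{30} - 1)}_{\texttt{SmallInteger maxVal}\;=\;1\,073\,741\,823} \;=\; 926\,258\,177
\]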
Note that if we had had instead
| x |
x := 0.
[
10 timesRepeat: [1 to: 200000000 * 1 do: [ :i | x := x + 0.2]]
] timeToRun
we would have spent only about 10 times the time of a single 200000000-iteration run.
ADDENDUM
Please do not take the explanation above as suggesting that LargeInteger arithmetic performs poorly in Pharo. On the contrary, Pharo does a great job of making such computations efficient.
Behind the scenes Pharo uses primitives for these operations while presenting the programmer with a single, consistent view and API for Integer arithmetic. In fact, if you try to do the same in another dialect with no VM support for LargeIntegers, you will have to wait much (much) longer for the result to be calculated.
I was trying to benchmark parMap vs. map with a very simple example:
import Control.Parallel.Strategies
import Criterion.Main
sq x = x^2
a = whnf sum $ map sq [1..1000000]
b = whnf sum $ parMap rseq sq [1..1000000]
main = defaultMain [
bench "1" a,
bench "2" b
]
My results seem to indicate zero speedup from parMap, and I was wondering why that might be.
benchmarking 1
Warning: Couldn't open /dev/urandom
Warning: using system clock for seed instead (quality will be lower)
time 177.7 ms (165.5 ms .. 186.1 ms)
0.997 R² (0.992 R² .. 1.000 R²)
mean 185.1 ms (179.9 ms .. 194.1 ms)
std dev 8.265 ms (602.3 us .. 10.57 ms)
variance introduced by outliers: 14% (moderately inflated)
benchmarking 2
time 182.7 ms (165.4 ms .. 199.5 ms)
0.993 R² (0.976 R² .. 1.000 R²)
mean 189.4 ms (181.1 ms .. 195.3 ms)
std dev 8.242 ms (5.896 ms .. 10.16 ms)
variance introduced by outliers: 14% (moderately inflated)
The problem is that parMap sparks a parallel computation for each individual list element. It doesn't chunk the list at all, as you seem to think from your comments; that would require the parListChunk strategy.
So parMap has a high overhead per element, and since each spark merely squares one number, the useful work is overwhelmed by that overhead.
I have two implementations of the same algorithm:
average(List) -> sum(List) / len(List).
sum([]) -> 0;
sum([Head | Tail]) -> Head + sum(Tail).
len([]) -> 0;
len([_ | Tail]) -> 1 + len(Tail).
average1(List) -> average_acc(List, 0, 0).
average_acc([], Sum, Length) -> Sum / Length;
average_acc([H | T], Sum, Length) -> average_acc(T, Sum + H, Length + 1).
and here is the output from tracing the GC events gc_start and gc_end (GC started and stopped);
here every successive value for a process is the sum of the preceding value and the last GC time:
average: 5189
average: 14480
average: 15118
average1: 594
Why such a big difference?
P.S. I use wall-clock time.
You should not use wall-clock time (the timestamp trace flag) to measure what GC takes, because even though GC itself is not rescheduled within an Erlang scheduler thread, the thread itself can be rescheduled by the underlying OS. You should use the cpu_timestamp flag instead.
Your average/1 uses the sum/1 and len/1 implementations, neither of which is tail recursive, so you allocate on the order of 2*N stack frames; that makes a big difference in performance, and in how much live data the GC has to handle, compared with average1/1, which uses the tail-recursive average_acc/3. So average1/1 performs exactly like a loop in other languages.
Edit:
Edited after Yola stated that he is using the timestamp flag for trace messages.