GAN diverges after 3rd warm restart

GAN diverges after 3rd warm restart - pytorch

I've been working with a ClusterGAN, originally based on the one here: https://github.com/zhampel/clusterGAN
While trying to stabilize training on my data I integrated a bunch of current strategies, like making it fully convolutional, including spectral norm, using warm restarts, and adding instance noise. Though I was able to solve problems I was having with mode collapse, and I can get useful output images, it will eventually diverge and I'm wondering what might be going on (see chart below). To be clear, the output of the model is actually quite usable around epoch 69 (in this run), right before the last restart, but I'm still curious what might be wrong, partly for my own understanding, but also to inform future exploration. Is this overfitting, perhaps?
Worth noting, perhaps, is that a larger version of the model—just using more filters—did tend to diverge later in the training, but would still diverge eventually (and didn't necessarily produce clearly "better" images). I'm running on mobile, so I'd like to make the model as small as possible, both to improve inference speed and keep app size down.
UPDATE: Increasing the initial learning rate stabilized training for this number of epochs (though I'm curious whether it would still diverge if I ran it through another warm restart cycle).

Related

What technique(s) could I use to determine lag time?

I am trying to determine the lag time between seeing a dip (or rise) in a predictor metric and when we would see the resulting dip (or rise) in a known response metric. I am not sure where to start, could someone put me on the right path?
For context, I would like to use R or Python and am familiar with statistics and machine learning. I am just searching for what method or modeling technique would be best to use and less about the code.

Is SRM in Google Optimize (Bayesian Model) a thing

So checking for Sample Ratio Mismatch is good for data quality.
But in Google Optimize i can't influence the sample size or do something against it.
My problem is, out of 15 A/B Tests I only got 2 Experiment with no SRM.
(Used this tool https://www.lukasvermeer.nl/srm/microsite/)
In the other hand the bayesian model deals with things like different sample sizes and I dont need to worry about, but the opinions on this topic are different.
Is SRM really a problem in Google Optimize or can I ignore it?

SRM affects Bayesian experiments just as much as it affects Frequentist. SRM happens when you expect a certain traffic split, but end up with a different one. Google Optimize is a black box, so it's impossible to tell if the uneven sample sizes you are experiencing are intentional or not.
Lots of things can cause a SRM, for example if your variation's javascript code has a bug in some browsers those users may not be tracked properly. Another common cause is if your variation causes page load times to increase, more people will abandon the page and you'll see a smaller sample size than expected.
That lack of statistical rigor and transparency is one of the reasons I built Growth Book, which is an open source A/B testing platform with a Bayesian stats engine and automatic SRM checks for every experiment.

Did I test the ArrayFire performance incorrectly?

I cannot figure out what's wrong. I mean, the speed is way too fast, like 1 million items vs 10 million items basically have the same 0.0005 second computation on my machine. So fast, it looks like it wasn't doing anything. But the result of the data is actually correct.
It is mind boggling because if I make similar computation on sequential loop without storing the result in an array, it is not only number of cores slower, but, like 1000 times slower than ArrayFire.
So, maybe I wasn't using the timer correctly?
Do you think they didn't actually compute the data right away? Maybe it just sets up some kind of shadow marker? And when I call the myArray.host(), it will start doing all the actual computations?
From their website, it says there is some kind of JIT to bundle the computations.
ArrayFire uses Just In Time compilation to combine many light weight functions into a single kernel launch. This along with our easy-to-use API allows users to not only quickly prototype their algorithms, but also get the best out of the underlying hardware.
I start/stop my timer right before/after few ArrayFire computations. And it is just insanely fast. Maybe I test it wrong? What's the proper way to test ArrayFire performance?
Never mind, I found out what to do,
Based on the examples, I should be using af::timeit(function) instead of using the af::timer. Using af::timeit will be very slow, but, the result scale more reasonably when I increase the size 10x. It doens't actually compute right away, that's why using af::timer myself wouldn't work.
thank you

Unknown events in nodejs/v8 flamegraph using perf_events

I try to do some nodejs profiling using Linux perf_events as described by Brendan Gregg here.
Workflow is following:
run node >0.11.13 with --perf-basic-prof, which creates /tmp/perf-(PID).map file where JavaScript symbol mapping are written.
Capture stacks using perf record -F 99 -p `pgrep -n node` -g -- sleep 30
Fold stacks using stackcollapse-perf.pl script from this repository
Generate svg flame graph using flamegraph.pl script
I get following result (which look really nice at the beginning):
Problem is that there are a lot of [unknown] elements, which I suppose should be my nodejs function calls. I assume that whole process fails somwhere at point 3, where perf data should be folded using mappings generated by node/v8 executed with --perf-basic-prof. /tmp/perf-PID.map file is created and some mapping are written to it during node execution.
How to solve this problem?
I am using CentOS 6.5 x64, and already tried this with node 0.11.13, 0.11.14 (both prebuild, and compiled as well) with no success.

FIrst of all, what "[unknown]" means is the sampler couldn't figure out the name of the function, because it's a system or library function.
If so, that's OK - you don't care, because you're looking for things responsible for time in your code, not system code.
Actually, I'm suggesting this is one of those XY questions.
Even if you get a direct answer to what you asked, it is likely to be of little use.
Here are the reasons why:
1. CPU Profiling is of little use in an I/O bound program
The two towers on the left in your flame graph are doing I/O, so they probably take a lot more wall-time than the big pile on the right.
If this flame graph were derived from wall-time samples, rather than CPU-time samples, it could look more like the second graph below, which tells you where time actually goes:
What was a big juicy-looking pile on the right has shrunk, so it is nowhere near as significant.
On the other hand, the I/O towers are very wide.
Any one of those wide orange stripes, if it's in your code, represents a chance to save a lot of time, if some of the I/O could be avoided.
2. Whether the program is CPU- or I/O-bound, speedup opportunities can easily hide from flame graphs
Suppose there is some function Foo that really is doing something wasteful, that if you knew about it, you could fix.
Suppose in the flame graph, it is a dark red color.
Suppose it is called from numerous places in the code, so it's not all collected in one spot in the flame graph.
Rather it appears in multiple small places shown here by black outlines:
Notice, if all those rectangles were collected, you could see that it accounts for 11% of time, meaning it is worth looking at.
If you could cut its time in half, you could save 5.5% overall.
If what it's doing could actually be avoided entirely, you could save 11% overall.
Each of those little rectangles would shrink down to nothing, and pull the rest of the graph, to its right, with it.
Now I'll show you the method I use. I take a moderate number of random stack samples and examine each one for routines that might be speeded up.
That corresponds to taking samples in the flame graph like so:
The slender vertical lines represent twenty random-time stack samples.
As you can see, three of them are marked with an X.
Those are the ones that go through Foo.
That's about the right number, because 11% times 20 is 2.2.
(Confused? OK, here's a little probability for you. If you flip a coin 20 times, and it has a 11% chance of coming up heads, how many heads would you get? Technically it's a binomial distribution. The most likely number you would get is 2, the next most likely numbers are 1 and 3. (If you only get 1 you keep going until you get 2.) Here's the distribution:)
(The average number of samples you have to take to see Foo twice is 2/0.11 = 18.2 samples.)
Looking at those 20 samples might seem a bit daunting, because they run between 20 and 50 levels deep.
However, you can basically ignore all the code that isn't yours.
Just examine them for your code.
You'll see precisely how you are spending time,
and you'll have a very rough measurement of how much.
Deep stacks are both bad news and good news -
they mean the code may well have lots of room for speedups, and they show you what those are.
Anything you see that you could speed up, if you see it on more than one sample, will give you a healthy speedup, guaranteed.
The reason you need to see it on more than one sample is, if you only see it on one sample, you only know its time isn't zero. If you see it on more than one sample, you still don't know how much time it takes, but you do know it's not small.
Here are the statistics.

Generally speaking it is a bad idea to disagree with a subject matter expert but (with the greatest respect) here we go!
SO urges the answer to do the following:
"Please be sure to answer the question. Provide details and share your research!"
So the question was, at least my interpretation of it is, why are there [unknown] frames in the perf script output (and how do I turn these [unknown] frames in to meaningful names)?
This question could be about "how to improve the performance of my system?" but I don't see it that way in this particular case. There is a genuine problem here about how the perf record data has been post processed.
The answer to the question is that although the prerequisite set up is correct: the correct node version, the correct argument was present to generate the function names (--perf-basic-prof), the generated perf map file must be owned by root for perf script to produce the expected output.
That's it!
Writing some new scripts today I hit apon this directing me to this SO question.
Here's a couple of additional references:
https://yunong.io/2015/11/23/generating-node-js-flame-graphs/
https://github.com/jrudolph/perf-map-agent/blob/d8bb58676d3d15eeaaf3ab3f201067e321c77560/bin/create-java-perf-map.sh#L22
[ non-root files can sometimes be forced ] http://www.spinics.net/lists/linux-perf-users/msg02588.html

How large is the average delay from key-presses

I am currently helping someone with a reaction time experiment. For this experiment reaction times on the keyboard are measured. For this experiment it might be important to know, how much error could be introduced because of the delay between the key-press and the processing in the software.
Here are some factors that I found out using google already:
The USB-bus is polled at 125Hz at minimum and 1000Hz at maximum (depending on settings, see this link).
There might be some additional keyboard buffers in Windows that might delay the keypresses further, but I do not know about the logic behind those.
Unfortunately it is not possible to control the low level logic of the experiment. The experiment is written in E-Prime a software that is often used for this kind of experiments. However the company that offers E-Prime also offers additional hardware, that they advertise for precise reaction-timing. Hence they seem to be aware about this effect (but do not tell how large it is).
Unfortunately it is necessary to use a standart keyboard, so I need to provide ways to reduce the latency.

any latency from key presses can be attributed to the debounce routine (i usually use 30ms to be safe) and not to the processing algorithms themselves (unless you are only evaluating the first press).

If you are running an experiment where millisecond timing is important you may want to use http://www.blackboxtoolkit.com/ to find sources of error.
Your needs also depend on the nature of your study. I've run RT experiments in Eprime with a keyboard. Since any error should be consistent on average across participants, for some designs it is not a big problem. If you need to sync up the data though with something else (like Eye tracking or EEG) or want to draw conclusions about RT where specific magnitude is important then E-Primes serial resp box (or another brand, though I have had compatibility issues in the past with other brand boxes and eprime) is a must.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string