True random number generator (TRNG), Haskell and an empirical / formal method - haskell

I want to verify the numbers produced by a hardware true random number generator (TRNG), but I'm new to this.
First, I want to test the TRNG empirically, i.e. check whether its output really behaves like true random numbers (TRNs). I don't know whether this can also be checked with formal methods.
Are there specific lectures on this topic? Any tips? Are there tools for this kind of empirical testing?

I'd suggest that you not try to duplicate existing tools, since it would be a lot of work. Marsaglia's Diehard tests should work, or you can use dieharder, which is a GPL reimplementation. From the webpage:
The primary point of dieharder (like diehard before it) is to make it easy to time and test (pseudo)random number generators, both software and hardware, for a variety of purposes in research and cryptography. The tool is built entirely on top of the GSL's random number generator interface and uses a variety of other GSL tools (e.g. sort, erfc, incomplete gamma, distribution generators) in its operation.
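To give a feel for what such a battery does, here is a minimal sketch of just one classic check, the frequency (monobit) test as described in NIST SP 800-22. It is not part of dieharder and is no substitute for a full test suite; the function names are my own. A very small p-value suggests the bits are biased, but a large one proves nothing about randomness by itself.

```python
import math

def bytes_to_bits(data):
    """Expand raw bytes into a list of 0/1 values, least significant bit first."""
    return [(byte >> i) & 1 for byte in data for i in range(8)]

def monobit_p_value(bits):
    """Frequency (monobit) test: p-value for 'ones and zeros are equally likely'."""
    n = len(bits)
    if n == 0:
        raise ValueError("need at least one bit")
    # Map 0 -> -1 and 1 -> +1 and sum; an unbiased source keeps this near 0.
    s = sum(1 if b else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))
```

To use it on real data you would dump the device output to a file, read the bytes, and pass `bytes_to_bits(data)` to `monobit_p_value`. A common convention is to treat p < 0.01 as a failure.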

Related

Calculations of factorials

While working on a problem in analytic number theory, I want to run some simple computer experiments to test a few theoretical conjectures. The algorithms are very simple: they involve only standard arithmetic operations and factorials, but I would like to compute values that depend on a parameter. As far as I understand, the problem with doing this in WolframAlpha is that I cannot write an expression depending on a parameter and then change the parameter's value by typing it only once, which is exactly what I need. I am new to programming; long ago I used some old languages like Algol, but I am not familiar with the current options for simple computer experiments. So my goal is to evaluate some simple expressions for multiple values of a parameter, preferably by installing some simple software or by using an online tool. How could this be done?
If my question is considered off topic, I would much appreciate any further recommendations before it is closed.
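One possible way to do this, sketched in Python: define the expression once as a function of the parameter and evaluate it over a range of values. The expression below, (2n)! / (n!)^2, is only a placeholder I picked for illustration, since the question does not state the actual formula. Python integers have arbitrary precision, and `Fraction` keeps the results exact.

```python
from fractions import Fraction
from math import factorial

def expression(n):
    """Placeholder expression (an assumption, not from the question): (2n)! / (n!)^2."""
    return Fraction(factorial(2 * n), factorial(n) ** 2)

def tabulate(f, values):
    """Evaluate f at every parameter value and return (parameter, result) pairs."""
    return [(n, f(n)) for n in values]

if __name__ == "__main__":
    # Change the range here once to re-run the whole experiment.
    for n, value in tabulate(expression, range(1, 11)):
        print(n, value)
```

Swapping in a different formula means editing only the body of `expression`.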

Dynamic Topic Modeling with Gensim / which code?

I want to use Dynamic Topic Modeling by Blei et al. (http://www.cs.columbia.edu/~blei/papers/BleiLafferty2006a.pdf) for a large corpus of nearly 3800 patent documents.
Does anybody have experience using DTM in the gensim package?
I identified two models:
models.ldaseqmodel – Dynamic Topic Modeling in Python
models.wrappers.dtmmodel – Dynamic Topic Models (DTM)
Which one did you use, or if you used both, which one is "better"? Put differently, which one do you prefer?
Both packages work fine, and are pretty much functionally identical. Which one you might want to use depends on your use case. There are small differences in the functions each model comes with, and small differences in the naming, which might be a little confusing, but for most DTM use cases, it does not matter very much which you pick.
Are the model outputs identical?
Not exactly. They are, however, very close to identical (98%+). I believe most of the differences come from slightly different handling of the probabilities in the generative process. So far, I have not come across a case where a difference in the sixth or seventh digit after the decimal point had any significant meaning. Interpreting the topics your model finds matters much more than one version finding a topic loading for some word that is higher by 0.00002.
The big difference between the two models: dtmmodel is a python wrapper for the original C++ implementation from blei-lab, which means python will run the binaries, while ldaseqmodel is fully written in python.
Why use dtmmodel?
the C++ code is faster than the python implementation
supports the Document Influence Model from Gerrish/Blei 2010 (potentially interesting for your research; see this paper for an implementation)
Why use ldaseqmodel?
easier to install (simple import statement vs downloading binaries)
can use sstats from a pretrained LDA model - useful with LdaMulticore
easier to understand the workings of the code
I mostly use ldaseqmodel, but that's for convenience. Native DIM support would be great to have, though.
What should you do?
Try each of them out, say, on a small sample set and see what the models return. 3800 documents isn't a huge corpus (assuming the patents aren't hundreds of pages each), and I assume that after preprocessing (removing stopwords, images and metadata) your dictionary won't be too large either (lots of standard phrases and legalese in patents, I'd assume). Pick the one that works best for you or has the capabilities you need.
A full analysis might take hours anyway. If you let your code run overnight, there is little practical difference: do you care whether it finishes at 3am or 5am? If runtime is critical, I would expect dtmmodel to be more useful.
For implementation examples, you might want to take a look at these notebooks: ldaseqmodel and dtmmodel
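When trying either model on a small sample, the part that usually needs preparing by hand is the time information. To my knowledge both models expect the documents ordered by time, plus a list giving the number of documents in each time slice; treat that as an assumption and check it against the gensim documentation for your version. A minimal, gensim-free sketch of that preparation step (the function name and input layout are hypothetical):

```python
from collections import Counter

def order_and_slice(docs_with_period):
    """Sort (period, tokens) pairs by period and count documents per period.

    Returns the token lists in chronological order and the per-period
    document counts, also in chronological order.
    """
    ordered = sorted(docs_with_period, key=lambda pair: pair[0])
    counts = Counter(period for period, _ in ordered)
    time_slice = [counts[period] for period in sorted(counts)]
    texts = [tokens for _, tokens in ordered]
    return texts, time_slice
```

For example, three patents from 2000, 2001 and 2001 would give a slice list of `[1, 2]`; the ordered texts would then be turned into a dictionary and bag-of-words corpus as usual.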

What statistical distribution is used to benchmark an algorithm?

I have benchmarked my algorithm by running it 1000 times. Now I have all the timing values, and I would like to know the mean, standard deviation and median. The problem is that I don't know which statistics are appropriate for estimating these parameters. I'm not sure whether it is valid to assume a normal distribution.
Learn about statistics. There are lots of books, guides, papers and introductions out there (1,2,3, 4)
There are also lots of libraries which implement standard statistical methods:
Java Commons Math,
C++ Libs,
and there are certainly lots of others for the language you use...
One last hint: for a quick initial result I often use Excel and its chart functions. It supports some statistical methods you can play around with to see in which direction to continue.
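As an illustration of the library route, here is a minimal sketch using Python's standard `statistics` module (Python is my choice for the example, not something the question specifies). Mean, median, standard deviation and percentiles are descriptive statistics of the sample, so computing them does not require assuming a normal distribution; the assumption only matters once you start building confidence intervals or running tests on top of them.

```python
import statistics

def summarize(times):
    """Descriptive statistics for a list of measured run times (needs >= 2 values)."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    cuts = statistics.quantiles(times, n=20)
    return {
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "stdev": statistics.stdev(times),  # sample standard deviation
        "min": min(times),
        "max": max(times),
        "p95": cuts[-1],
    }
```

Benchmark timings are often skewed by a few slow outliers, so reporting the median and a high percentile next to the mean tends to be more informative than the mean alone.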
That really depends on the distribution your workload produces, so there is no generic answer.
But there is a trick: go one step further and run several iterations, each consisting of N calls, and compute, say, the average time or throughput for each whole iteration. Then, for large N and consistent workload behavior across the calls, the iteration scores fall under the Central Limit Theorem and are approximately normally distributed, even if the individual call times are not.
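A minimal sketch of that batching idea, assuming you already have the individual call times in a list (the function name and the batch size are my own choices):

```python
import statistics

def batch_means(times, batch_size):
    """Split times into consecutive batches of batch_size calls and average each.

    A trailing partial batch is dropped so every score averages the same
    number of calls.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    full = len(times) - len(times) % batch_size
    return [
        statistics.mean(times[i:i + batch_size])
        for i in range(0, full, batch_size)
    ]
```

With 1000 measurements and `batch_size=50` you get 20 iteration scores; their mean and standard deviation are then the quantities for which a normal approximation is more defensible.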

How can the reliability of Software be checked through analysis?

How can we analyze software reliability? How can the reliability of an application or product be checked?
First try to define "software reliability" and the way to quantify it.
If you accomplish this task, you will probably be able to "check" this characteristic.
The most effective way to check reliability is going to be to run your software and gather statistics on its actual reliability. There are too many variables in play, both at the hardware and software levels, to realistically analyze reliability prior to execution, with the possible exception of groups with massive resources like NASA.
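As a sketch of what "gathering statistics" could look like in the simplest case: record how long the software ran before each failure and compute the mean time between failures (MTBF). The second function assumes a constant failure rate (exponential model), which is a strong assumption of mine that real software may well violate.

```python
import math

def mean_time_between_failures(uptimes):
    """Average observed run time before a failure, in the unit of the inputs."""
    if not uptimes:
        raise ValueError("need at least one observed uptime")
    return sum(uptimes) / len(uptimes)

def survival_probability(t, mtbf):
    """Probability of running for time t without failure.

    Assumes a constant failure rate: R(t) = exp(-t / MTBF).
    """
    return math.exp(-t / mtbf)
```

For instance, observed uptimes of 10, 20 and 30 hours give an MTBF of 20 hours, and under the exponential assumption the chance of surviving 20 hours is about 37%.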
There are various methods for determining whether a piece of software meets a specification, but most of the really productive ones do this by construction, i.e., by constraining the way in which the software is written so that it can be easily shown to be correct. Check out VDM, Z and the B toolkit for schemes for doing this sort of thing. Note that these tend to be expensive ways to program if you're not in a safety-critical systems environment.
Proving the correctness of the specification itself is really non-trivial!
Reliability is about continuity of correct service.
The best approach to assessing the reliability of software is dynamic analysis, in other words: testing.
To reduce testing time, you may want to apply input profiles that differ from the operational one.
Apply various input distributions and measure how long your software runs without failure. Then determine how far your input distributions are from the operational profile, and draw a conclusion about how long the software would have run under the operational profile.
This involves modeling techniques such as Markov chains or stochastic Petri nets.
For further digging, useful keywords are: fault forecasting and statistical testing.

Why do you use a random number generator/extractor?

I am dealing with some computer security issues at the school at the moment and I am interested in general programming public preferences, customs, ideas etc. If you have to use a random number generator or extractor, which one do you choose? Why do you choose it? The mathematical properties, already implemented as a package or for what reason? Do you write your own or use some package?
If computational time is no object, then you can't go wrong with Blum Blum Shub (http://en.wikipedia.org/wiki/Blum_blum_shub). Informally speaking, it's at least as secure (hard to predict) as integer factorization.
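For reference, the generator itself is tiny: pick two primes p and q that are both congruent to 3 mod 4, set M = p*q, repeatedly square the state mod M, and output the least significant bit each time. Below is a toy sketch with deliberately tiny primes, so it offers no security at all; the function name is my own, and the code does not check that p and q are actually prime.

```python
import math

def bbs_bits(seed, p, q, count):
    """Toy Blum Blum Shub: return `count` bits from x -> x^2 mod (p*q)."""
    if p % 4 != 3 or q % 4 != 3:
        raise ValueError("p and q must both be congruent to 3 mod 4")
    m = p * q
    if seed <= 1 or math.gcd(seed, m) != 1:
        raise ValueError("seed must be greater than 1 and coprime to p*q")
    x = seed
    bits = []
    for _ in range(count):
        x = (x * x) % m
        bits.append(x & 1)  # least significant bit of the new state
    return bits
```

With p=11, q=23 and seed 3 the states are 9, 81, 236, 36, 31, 202, so the first six bits are 1, 1, 0, 0, 1, 0. A real deployment would need primes hundreds of digits long, which is where the computational cost comes from.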
/dev/random, or the equivalent on your platform.
It returns bits from an entropy pool fed by device drivers. No need to worry about mathematical properties.
If you're after a cryptographically secure PRNG, then repeated application of a secure hash to a large seed array is generally the way to go. Don't invent your own algorithm, though, go for a version of Fortuna or something else reasonably well reviewed.
The keys for encrypting phone calls between the presidents of the USA and the USSR were said to be generated from cosmic rays. We checked it in the physics lab at our university: their energies yield a true Gaussian distribution. ;-) So for the best encryption you should use these, because such a random sequence cannot be replayed. Unless, of course, your adversary covertly builds a particle accelerator near your random number generator.
Ah... about computers... Well, acquire a stream that comes from something physical, not computed. /dev/random is the easiest solution, but a hand-made Geiger counter attached to USB would give the best randomness ever.
For a little school project, I'd use whatever the OS provides for random number generation.
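In Python, for example, "whatever the OS provides" is exposed through the standard `secrets` module, which draws from the operating system's randomness source. A minimal sketch (the two helper names are my own):

```python
import secrets

def random_key(num_bytes=32):
    """Return num_bytes of random bytes from the operating system's source."""
    return secrets.token_bytes(num_bytes)

def roll_die():
    """Return an unbiased integer from 1 to 6."""
    return secrets.randbelow(6) + 1
```

Most languages have an equivalent call, so there is rarely a reason to read the device file by hand.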
For a serious security application (eg: COMSEC-level encryption), I use a hardware random number generator. Pure algorithms with no hardware access by definition don't produce random numbers.
HotBits.
