Descriptor size of ORB - feature-detection

I am currently working on the ORB algorithm for feature extraction. By default the descriptor size is 32 bytes, as specified in the paper, but I need to check the performance of the descriptor with a reduced size, say 16 bytes. How can I do this?

OpenCV has a fixed implementation of the ORB descriptor: the binary test pattern is pre-generated (you can view the source at /modules/features2d/src/orb.cpp). I would recommend re-implementing ORB with a new test pattern. For how the test pattern is trained, please refer to the BRIEF and ORB papers.
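If re-implementing the test pattern is too heavy a first step, a rough experiment is possible in Python with OpenCV: compute the standard 32-byte descriptors and simply truncate them to 16 bytes before matching. This only keeps the first half of the pre-generated test pattern rather than a properly retrained 16-byte pattern, so treat it as a crude baseline; the image file names below are placeholders.

```python
# Rough sketch: compare matching with full 32-byte ORB descriptors vs. the
# same descriptors truncated to 16 bytes (first 128 of the 256 binary tests).
import cv2

img1 = cv2.imread("img1.png", cv2.IMREAD_GRAYSCALE)  # placeholder image paths
img2 = cv2.imread("img2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)  # des1 has shape (N, 32), dtype uint8
kp2, des2 = orb.detectAndCompute(img2, None)

des1_16 = des1[:, :16]  # keep only the first 16 bytes of each descriptor
des2_16 = des2[:, :16]

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
print("32-byte matches:", len(matcher.match(des1, des2)))
print("16-byte matches:", len(matcher.match(des1_16, des2_16)))
```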

Related

GPT-2 generation of text longer than 1024 tokens

I know the context length supported by GPT-2 is 1024, but I assume there's some technique they used to train on and generate text longer than that in their results. Also, I have seen many GPT-2-based repos training on text longer than 1024 tokens. But when I tried generating text longer than 1024 tokens using run_generation.py, it throws a runtime error: The size of tensor a (1025) must match the size of tensor b (1024) at non-singleton dimension 3. I have the following questions:
Shouldn't it be possible to generate longer text since a sliding window is used?
Can you please explain what's necessary to generate longer text? What changes will I have to make to the run_generation.py code?
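As an illustration of the sliding-window idea mentioned in the question, here is a minimal sketch using the Hugging Face transformers package with greedy decoding. The loop count, prompt, and decoding strategy are my own assumptions, not the actual logic in run_generation.py: the point is simply that each step feeds only the most recent tokens that fit in the 1024-token window.

```python
# Minimal sliding-window generation sketch (an assumption about how it could be
# done, not run_generation.py's actual code), using Hugging Face transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

max_ctx = model.config.n_positions  # 1024 for GPT-2
input_ids = tokenizer.encode("Once upon a time", return_tensors="pt")

with torch.no_grad():
    for _ in range(1500):  # deliberately generate past the 1024-token limit
        context = input_ids[:, -(max_ctx - 1):]  # keep only what fits in the window
        next_logits = model(context).logits[:, -1, :]
        next_id = torch.argmax(next_logits, dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=1)

print(tokenizer.decode(input_ids[0]))
```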

How would I be able to acquire data using a picture?

I need to find a way to acquire data from a picture for a new project I am working on, which involves tracking eye movements.
Have you checked out OpenCV? If that's not an option, consider any number of the image libraries available in Python -- just about all of them support a number of encoding representations, such as RGBA, HSV, Bayer, and YUV. Note that the four encodings I've mentioned have different channels and are uncompressed, which gives you the full data from each frame you're analyzing.
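For example, a minimal sketch with OpenCV's Python bindings (the file name is a placeholder, and this only pulls out raw pixel data; actual eye tracking needs further processing on top of it):

```python
# Minimal sketch: load a picture and inspect its raw, uncompressed pixel data
# in a couple of encodings (file name is a placeholder).
import cv2

frame = cv2.imread("eye_photo.jpg")              # BGR uint8 array of shape (H, W, 3)
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)     # same pixels, HSV channels
yuv = cv2.cvtColor(frame, cv2.COLOR_BGR2YUV)     # same pixels, YUV channels

print(frame.shape, frame.dtype)
print("pixel (100, 100):", frame[100, 100], hsv[100, 100], yuv[100, 100])
```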

Paragraph Vector or Doc2vec model size

I am using the deeplearning4j Java library to build a paragraph vector (doc2vec) model of dimension 100. I am training on a text file with around 17 million lines; the size of the file is 330 MB.
I can train the model and compute paragraph vectors, which give reasonably good results.
The problem is that when I save the model to disk with WordVectorSerializer.writeParagraphVectors (a dl4j method) it takes around 20 GB of space, and around 30 GB when I use the native Java serializer.
I'm thinking maybe the model size is too big for that much data. Is a model size of 20 GB reasonable for 330 MB of text data?
Comments are also welcome from people who have used doc2vec/paragraph vectors in other libraries or languages.
Thank you!
I'm not familiar with the dl4j implementation, but model size is dominated by the number of unique word-vectors/doc-vectors, and the chosen vector size.
(330MB / 17 million) means each of your documents averages only 20 bytes – very small for Doc2Vec!
But if for example you're training up a 300-dimensional doc-vector for each doc, and each dimension is (as typical) a 4-byte float, then (17 million * 300 dims * 4 bytes/dim) = 20.4GB. And then there'd be more space for word-vectors and model inner-weights/vocabulary/etc, so the storage sizes you've reported aren't implausible.
With the sizes you've described, there's also a big risk of overfitting: if using 300 dimensions, you'd be modeling docs of <20 bytes of source material as (300*4=) 1200-byte doc-vectors.
To some extent, that makes the model tend towards a giant, memorized-inputs lookup table, and thus less-likely to capture generalizable patterns that help understand training docs, or new docs. Effective learning usually instead looks somewhat like compression: modeling the source materials as something smaller but more salient.
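A quick back-of-the-envelope check of those numbers (plain arithmetic, no dl4j involved; 300 dimensions is the example used above, while the question actually trains 100):

```python
# Rough size estimate for the doc-vectors alone (pure arithmetic, not a dl4j call).
docs = 17_000_000
dims = 300                 # example dimensionality from the answer above
bytes_per_float = 4

doc_vector_bytes = docs * dims * bytes_per_float
print(f"doc-vectors alone: {doc_vector_bytes / 1e9:.1f} GB")   # ~20.4 GB

print(f"avg source text per doc: {330e6 / docs:.1f} bytes")    # ~19 bytes
```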

What is Web Audio API's bit depth?

What is the bit depth of Web Audio API's audio context?
For example, if you want to create a custom curve to use with a WaveShaperNode, what is the appropriate Float32Array size?
I have seen developers using 65536, which corresponds to 16-bit audio, but I can't find any info in the spec.
Internally the system uses Float32, which has a 23-bit significand. Using floating point avoids most clipping problems while still providing good precision. This means there is technically little point in ever creating a waveshaping curve larger than 8388608 (2^23) samples; in practice, a 16-bit curve is already quite high-resolution (signal-to-noise ratio of roughly 96 dB). A lot of the reason for 32-bit audio processing was to avoid clipping problems rather than to improve the SNR of the input/output, and floating point helps with that dramatically. Incidentally, the WaveShaperNode specifically clips its output to [-1, +1] (most nodes don't).
So in short: just use a 16-bit-sized curve (65536 samples), but make sure your signal stays in the [-1, +1] range.
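For what it's worth, here is a small sketch (in Python/NumPy, purely to illustrate the numbers; in a browser you would build the equivalent Float32Array in your own script) of a 65536-point curve kept inside [-1, +1], along with the ~96 dB figure mentioned above. The tanh shape is just an arbitrary example.

```python
# Illustration only: generate the values for a 2**16-point shaping curve that
# stays within [-1, +1]; in a real page these floats would populate the
# Float32Array assigned to WaveShaperNode.curve.
import numpy as np

n = 2 ** 16                                    # 65536 points, "16-bit" resolution
x = np.linspace(-1.0, 1.0, n)                  # the input range the node maps over
curve = np.tanh(3.0 * x).astype(np.float32)    # example soft-clip shape, bounded by [-1, 1]

print(curve.min(), curve.max())
print(f"~{20 * np.log10(2 ** 16):.1f} dB")     # the ~96 dB dynamic-range figure for 16 bits
```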

FFTW for exponential frequency axis

I have a group of related questions regarding FFTW and audio analysis on Linux.
What is the easiest-to-use, most comprehensive audio library on Linux/Ubuntu that will let me decode any of a variety of audio formats (MP3, etc.) and acquire a buffer of raw 16-bit PCM values? GStreamer?
I intend to take that raw buffer and feed it to FFTW to acquire frequency-domain data (without complex or phase information). I think I should use one of their "r2r" methods, probably the DHT. Is this correct?
It seems that FFTW's output frequency axis is discretized in linear increments based on the buffer length. It further seems that I can't change this discretization within FFTW, so I must do it after the DHT. Instead of a linear frequency axis, I need an exponential axis that follows 2^(i/12). I think I'll have to take the DHT output and run it through some custom anti-aliasing function. Is there a Linux library to do such anti-aliasing? If not, would a basic cosine-based anti-aliasing function work?
Thanks.
This is an age-old problem with FFTs and audio: ideally we want a logarithmic frequency scale for audio, but the DFT/FFT has a linear one. You will need to choose an FFT size that gives sufficient resolution at the low end of your frequency range, and then accumulate bins across the frequency range of interest to give yourself a pseudo-logarithmic representation. There are more complex schemes, but essentially they all boil down to the same thing.
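As a concrete sketch of that bin-accumulation step, here it is with NumPy's FFT standing in for FFTW (the mapping logic is the same either way). The band edges follow the 2^(i/12) spacing asked about; the 27.5 Hz reference and the FFT size are my own example choices.

```python
# Sketch of accumulating linear FFT bins into exponentially spaced (semitone)
# bands; NumPy's FFT stands in for FFTW, but the mapping is the same.
import numpy as np

sample_rate = 44100
n_fft = 16384                               # long window for low-frequency resolution
signal = np.random.randn(n_fft)             # placeholder for a frame of decoded PCM

magnitude = np.abs(np.fft.rfft(signal))     # magnitude only: no phase information
freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)

f0 = 27.5                                   # example reference frequency (A0)
edges = f0 * 2.0 ** (np.arange(109) / 12.0) # band edges spaced as 2**(i/12)

band_energy = [
    magnitude[(freqs >= lo) & (freqs < hi)].sum()
    for lo, hi in zip(edges[:-1], edges[1:])
]
print(len(band_energy), band_energy[:5])
```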
I've seen libsndfile used all over the place:
http://www.mega-nerd.com/libsndfile/
It's LGPL too. It can read pretty much every open-source and lossless audio format you would care about. It doesn't do MP3, however, because of licensing costs.
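If Python is an option for the decoding step, the soundfile package wraps libsndfile and will hand you the PCM buffer directly (the file name below is a placeholder; as noted above, MP3 would need a different decoder):

```python
# Sketch: read a file into raw 16-bit PCM using the `soundfile` package,
# which is a Python binding around libsndfile (file name is a placeholder).
import soundfile as sf

pcm, sample_rate = sf.read("input.flac", dtype="int16")  # NumPy int16 array
print(sample_rate, pcm.shape, pcm.dtype)
```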
