What is the "n" parameter in the JPEG spec's DQT segment? - jpeg

I'm in the process of writing a JPEG file decoder to learn about the workings of JPEG files. In working through ITU-T81, which specifies JPEG, I ran into the following regarding the DQT segment for quantization tables:
In many of JPEG's segments, there is an n parameter which you read from the segment, and which indicates how many iterations of the following item there are. However, in the DQT case, it just says "multiple", and it's not defined how many multiples there are. One can possibly infer it from Lq, but the way this multiple is defined is a bit of an anomaly compared to the other segments.
For anyone who is familiar with this specification, what is the right way to determine how many multiples, or n, of (Pq, Tq, Q0..Q63) there should be?

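There is no explicit count; you infer it from the length field Lq. Lq counts its own two bytes, and each table takes one byte for the Pq/Tq field plus 64 entries of one byte each (Pq = 0) or two bytes each (Pq = 1). Keep reading tables until you have consumed Lq - 2 bytes; the number of tables read is n.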
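As a rough sketch of that length-driven reading: T.81 gives Lq = 2 + sum of (65 + 64 * Pq) over the tables, so a decoder can simply loop until the segment's bytes run out. The function name parse_dqt and the choice to pass in only the bytes that follow the two-byte Lq field are assumptions made for this example.

```python
def parse_dqt(payload: bytes) -> dict:
    """Parse the bytes of a DQT segment that follow the 2-byte Lq field.

    Returns {Tq: [Q0..Q63]}, with entries in the order stored in the file.
    """
    tables = {}
    pos = 0
    while pos < len(payload):
        # High nibble is Pq (precision), low nibble is Tq (table id).
        pq, tq = payload[pos] >> 4, payload[pos] & 0x0F
        pos += 1
        entry_size = 2 if pq else 1        # Pq = 1 means 16-bit entries
        end = pos + 64 * entry_size
        if end > len(payload):
            raise ValueError("truncated quantization table")
        raw = payload[pos:end]
        tables[tq] = [
            int.from_bytes(raw[i:i + entry_size], "big")
            for i in range(0, len(raw), entry_size)
        ]
        pos = end
    return tables
```

The number of tables, n, is then just len(parse_dqt(payload)) for a well-formed segment in which each Tq appears once.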

Related

Recognizing license plate characters using template characters in Python

For a university project I have to recognize characters from a license plate. I have to do this using python 3. I am not allowed to use OCR functions or use functions that use deep learning or neural networks. I have reached the point where I am able to segment the characters from a license plate and transform them to a uniform format. A few examples of segmented characters are here.
The format of the segmented characters is very dependent on the input. However, I can easily convert this to uniform dimensions using opencv. Additionally, I have a set of template characters and numbers that I can use to predict what character / number it is.
I therefore need a metric to express the similarity between the segmented character and the reference image. In this way, I can say that the reference image with the highest similarity score matches the segmented character. I have tried the following ways to compute the similarity.
For these operations I have made sure that the reference characters and the segmented characters have the same dimensions.
A bitwise XOR-operator
Inverting the reference characters and comparing them pixel by pixel. If a pixel matches increment the similarity score, if a pixel does not match decrement the similarity score.
Hashing both the segmented character and the reference character using 'imagehash', then comparing the hashes to see which ones are most similar.
None of these methods gives me an accurate prediction for all characters. Most characters are predicted correctly. However, the program consistently confuses characters like 8-B, D-0, 7-Z, P-R.
Does anybody have an idea how to predict the segmented characters? I.e. defining a better similarity score.
Edit: Unfortunately, cv2.matchTemplate and cv2.matchShapes are not allowed for this assignment...
The general procedure for comparing two images consists of extracting features from the two images and then comparing those features. What you are actually doing in the first two methods is treating the value of every pixel as a feature. The similarity measure is therefore a distance computation in a space of very high dimension. These methods are, however, sensitive to noise and require very large datasets in order to obtain acceptable results.
For this reason, usually one attempts to reduce the space dimensionality. I'm not familiar with the third method, but it seems to go in this direction.
A way to reduce the space dimensionality consists in defining some custom features meaningful for the problem you are facing.
A possibility for the character classification problem could be to define features that measure the response of the input image on strategic subshapes of the characters (an upper horizontal line, a lower one, a circle in the upper part of the image, a diagonal line, etc.).
You could define a minimal set of shapes that, combined together, can generate every character. Then you should retrieve one feature for each shape, by measuring the response (i.e., integrating the signal of the input image inside the shape) of the original image on that particular shape. Finally, you should determine the class which the image belongs to by taking the nearest reference point in this, smaller, space of the features.
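A minimal sketch of this idea, with one big simplification: instead of hand-picked subshapes it uses the cells of a regular grid as the "shapes", and the response for each cell is the fraction of foreground pixels inside it. The names zone_features and classify, the grid size, and the 0/1 list-of-lists image format are all assumptions for illustration; only plain Python is used.

```python
def zone_features(img, rows=4, cols=4):
    """Fraction of foreground pixels in each cell of a rows x cols grid.

    img is a list of equal-length rows of 0/1 pixels.
    """
    h, w = len(img), len(img[0])
    feats = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            cell = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            feats.append(sum(cell) / len(cell) if cell else 0.0)
    return feats


def classify(img, templates, rows=4, cols=4):
    """Return the label of the template nearest to img in feature space."""
    target = zone_features(img, rows, cols)

    def dist(label):
        ref = zone_features(templates[label], rows, cols)
        return sum((a - b) ** 2 for a, b in zip(target, ref))

    return min(templates, key=dist)
```

To separate pairs like 8-B or D-0, the grid cells would probably need to be replaced or supplemented by shapes that capture the differing strokes, such as a straight left edge.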

How to find what time a part of audio starts and ends in another audio?

I have two audio files in which a sentence is read (like singing a song) by two different people. So they have different lengths. They are just vocal, no instrument in it.
A1: Audio File 1
A2: Audio File 2
Sample sentence : "Lorem ipsum dolor sit amet, ..."
I know the time at which every word starts and ends in A1, and I need to automatically find the time at which every word starts and ends in A2. (Any language, preferably Python or C#.)
Times are saved in XML. So, I can split A1 file by word. So, how to find sound of a word in another audio that has different duration (of word) and different voice?
So from what I read, it seems you would want to use Dynamic Time Warping (DTW). I'll leave the explanation to Wikipedia, but it is generally used to match speech patterns while tolerating differences in speed and pronunciation.
Sadly, I am more well versed in C, Java and Python, so I will be suggesting Python libraries.
fastdtw
pydtw
mlpy
rpy2
With rpy2 you can actually use R's library and use their implementation of DTW in your python code. Sadly, I couldn't find any good tutorials for this but there are good examples if you choose to use R.
Please let me know if that doesn't help, Cheers!
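For intuition about what those libraries compute, here is a bare-bones sketch of the textbook DTW recurrence in plain Python. It works on two sequences of numbers with |x - y| as the local cost; for real speech you would more likely feed it per-frame features (for example MFCC vectors) with a vector distance, which is an assumption beyond what is shown here. The function name dtw_distance is made up for the example.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping cost between two
    sequences of numbers, using |x - y| as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

To recover word boundaries rather than just a distance, you would also need to backtrack through the cost table to get the alignment path, then map the known A1 timestamps through that path.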
My approach for this would be to record the dB volume at a constant interval (such as every 100 milliseconds) and store these values in a list or array. I found a way of doing this in Java here: Decibel values at specific points in wav file. It is possible in other languages too. Meanwhile, keep track of the maximum volume:
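double max = 0;
// volumes: the stored readings, one per interval (name assumed here)
for (double currentVolume : volumes) {
    if (currentVolume > max) {
        max = currentVolume;
    }
}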
Then divide the maximum volume by an adjustable divisor; in my example I went for 7. Say the maximum volume is 21 dB: 21/7 = 3 dB. Let's call this measure X.
Next, pick a second threshold, such as 1, and multiply it by X. Whenever the volume rises above this new value (1*X), we consider that to be the start of a word. When it drops below that value, we consider it to be the end of a word.
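Put together, the thresholding idea might look roughly like this in Python. The function name find_word_spans and its parameter names are invented for the sketch; divisor and multiplier correspond to the 7 and 1 from the example, and volumes is assumed to hold one non-negative loudness reading per interval.

```python
def find_word_spans(volumes, interval_ms=100, divisor=7, multiplier=1):
    """Return (start_ms, end_ms) spans where the volume exceeds the threshold.

    volumes: one loudness reading per interval_ms.
    """
    if not volumes:
        return []
    threshold = multiplier * max(volumes) / divisor
    spans, start = [], None
    for i, v in enumerate(volumes):
        if v > threshold and start is None:
            start = i                      # volume rose above the threshold
        elif v <= threshold and start is not None:
            spans.append((start * interval_ms, i * interval_ms))
            start = None
    if start is not None:                  # still loud when the data ends
        spans.append((start * interval_ms, len(volumes) * interval_ms))
    return spans
```

Note that this finds loud stretches, not words as such: a pause inside a word or two words run together will split or merge spans, so the divisor and multiplier usually need tuning per recording.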
Without knowing how sophisticated your understanding of the problem space is it isn't easy to know whether to point you in a direction or provide detail on why this problem is non-trivial.
I'd suggest that you start with something like https://cloud.google.com/speech/ and try to convert the speech blocks to text and then perform a similarity comparison on these.
If you really want to try to do the processing yourself you could look at doing some spectrographic analysis. Take the wave form data and perform an FFT to get frequency distributions and look for marker patterns that align your samples.
With only single-word comparisons between different speakers you are probably not going to be able to apply any kind of neural network, unless you are able to train it on the two speakers' entire speech set and then use the network to compare the individual word chunks.
It's been a few years since I did any of this so maybe it's easier these days but my recollection is that although this sounds conceptually simple it might prove to be more difficult than you realise.
The Dynamic Time Warping looks like the most promising suggestion.
the secret sauce of the approach below: pointA - pointB is zero if both points have the same value ... the approach uses this to find the sample offset at which the raw audio curves of the two input files give a difference of zero, or close to zero in a relative sense if the two recordings differ even slightly
open both files and pluck out the raw audio curve of each ... define two variables, bestSum and currentSum, and set bestSum to MAX_INT_VALUE (any arbitrarily high value) ... set currentSum to zero, then iterate across both files simultaneously, reading the integer value of the current raw audio level from file A and from file B ... subtract one from the other and add the absolute value of the difference to currentSum (without the absolute value, positive and negative differences would cancel out) ... continue this loop until you reach the end of one file ... after the loop, if currentSum < bestSum, update bestSum to currentSum and store the current offset
create an outer loop which repeats all of the above, each time shifting one file in time by one more step, resetting currentSum to zero and relaunching the inner loop ... your common audio is at the offset with the minimum total sum, that is, the offset at which you encountered bestSum
do not start coding until you have gained intuition that above makes perfect sense
I highly encourage you to plot out the curve of the raw audio for one file to confirm you are accessing this sequence of integers ... do this before attempting above algorithm
it will help to visualize above by viewing each input source audio as a curve and you simply keep one curve steady as you slide the other audio curve left or right until you see the curve shapes match or get very close to matching
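here is a small sketch of that sliding comparison in Python, under a few assumptions of mine: both curves are already loaded as lists of integers, the score is the mean (not the sum) of absolute differences so that offsets with different overlap lengths can be compared fairly, and a min_overlap parameter stops tiny overlaps from winning by accident ... the names best_offset, max_offset and min_overlap are made up for the example

```python
def best_offset(a, b, max_offset, min_overlap=1):
    """Slide b against a and return (offset, score) with the lowest mean
    absolute difference over the overlapping samples.

    An offset of k pairs a[i] with b[i - k], i.e. b starts k samples into a.
    """
    best = (None, float("inf"))
    for offset in range(-max_offset, max_offset + 1):
        lo = max(0, offset)
        hi = min(len(a), len(b) + offset)
        if hi - lo < min_overlap:
            continue                       # not enough overlap to judge
        total = sum(abs(a[i] - b[i - offset]) for i in range(lo, hi))
        score = total / (hi - lo)
        if score < best[1]:
            best = (offset, score)
    return best
```

keep in mind this only works well when the two files contain (nearly) the same waveform ... two different speakers reading the same sentence will not line up sample by sample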

Adding noise to genomic data having discrete values (A, G, T, C)

Since genomic sequences vary greatly in length, I have been trying to work on using denoising autoencoders to get a compact representation for any given sequence. My expected input is a sequence of nucleotides (letters - A, G, T, C), for example, "AAAAGGAATTTCTCTGGGG....".
For images, adding noise is easy since the input space is continuous. But in a discrete scenario such as this, what would be a good strategy for adding noise to my input?
My first thought is to randomly replace some of the nucleotides with "N", which means that the nucleotide at that position couldn't be identified accurately during sequencing. But changing even one nucleotide leads to a completely different sequence altogether, unlike images where adding a small noise doesn't change how the image looks visually. Please let me know if this is right or there's a better way that I am not aware of.
I'm not sure if this will help you or further complicate your issue, but in biology people normally use FASTQ files to store biological sequences and their corresponding Phred quality scores. A Phred quality score is a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing.
For example, if Phred assigns a quality score of 30 to a base, the chances that this base is called incorrectly are 1 in 1000.
So you can add noise to the Phred quality scores (i.e. the probabilities that the base calling is correct) without changing the sequence.
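A small sketch of what that could look like. The conversion follows the standard Phred definition, P = 10^(-Q/10), which matches the 30 -> 1 in 1000 example above. The Gaussian jitter, the function names, and the clamping bounds q_min and q_max are arbitrary assumptions for illustration, not something prescribed by FASTQ.

```python
import random


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def jitter_qualities(qualities, sigma=2.0, q_min=0, q_max=41, seed=None):
    """Add Gaussian noise to each Phred score, then round and clamp.

    sigma, q_min and q_max are illustrative choices, not fixed by any spec.
    """
    rng = random.Random(seed)
    return [
        min(q_max, max(q_min, round(q + rng.gauss(0, sigma))))
        for q in qualities
    ]
```

The noisy scores (or the probabilities derived from them) could then be fed to the autoencoder alongside the unchanged sequence.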
Also see this paragraph about current work done on compressing FASTQ files.

approximate histogram for streaming string values (card catalog algorithm?)

I have a large list (or stream) of UTF-8 strings sorted lexicographically. I would like to create a histogram with approximately equal values for the counts, varying the bin width as necessary to keep the counts even. In the literature, these are sometimes called equi-height, or equi-depth histograms.
I'm not looking to do the usual word-count bar chart, I'm looking for something more like an old fashioned library card catalog where you have a set of drawers (bins), and one might hold SAM-SOLD, and the next bin SOLE-STE, while all of Y-ZZZ fits in a single bin. I want to calculate where to put the cutoffs for each bin.
Is there (A) a known algorithm for this, similar to approximate histograms for numeric values? or (B) suggestions on how to encode the strings in a way that a standard numeric histogram algorithm would work. The algorithm should not require prior knowledge of string population.
The best way I can think to do it so far is to simply wait until I have some reasonable amount of data, then form logical bins by:
number_of_strings / bin_count = number_of_strings_in_each_bin
Then, starting at 0, step forward by number_of_strings_in_each_bin to get the bin endpoints.
This has two weaknesses for my use-case. First, it requires two iterations over a potentially very large number of strings, one for the count, one to find the endpoints. More importantly, a good histogram implementation can give an estimate of where in a bin a value falls, and this would be really useful.
Thanks.
If we can't make any assumptions about the data, you are going to have to make a pass to determine bin size.
This means that you have to either start with a bin size rather than bin number or live with a two-pass model. I'd just use linear interpolation to estimate positions between bins, then do a binary search from there.
Of course, if you can make some assumptions about the data, here are some that might help:
For example, you might not know the exact size, but you might know that it will fall in some interval [a, b]. If you want at most n bins, make the bin size b/n (rounded up); a bin size of a/n would instead give you at least n bins.
Alternatively, if you're not particular about exactly equal-sized bins, you could do it in one pass by sampling every m elements on your pass and dump it into an array, where m is something reasonable based on context.
Then, to find the bin endpoints, you'd take every (size/n/m)-th element of that array, where size is the total number of elements seen and n is the number of bins.
The solution I came up with addresses the lack of up-front information about the population by using reservoir sampling. Reservoir sampling lets you efficiently take a random sample of a given size, from a population of an unknown size. See Wikipedia for more details. Reservoir sampling provides a random sample regardless of whether the stream is ordered or not.
We make one pass through the data, gathering a sample. For the sample we have explicit information about the number of elements as well as their distribution.
For the histogram, I used a Guava RangeMap. I picked the endpoints of the ranges to provide an even number of results in each range (sample_size / number_of_bins). The Integer in the map merely stores the order of the ranges, from 1 to n. This allows me to estimate the proportion of records that fall within two values: If there are 100 equal sized bins, and the values fall in bin 25 and bin 75, then I can estimate that approximately 50% of the population falls between those values.
This approach has the advantage of working for any Comparable data type.
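For readers who want to try the idea without Guava, here is a rough Python equivalent. It is a sketch of the same approach, not the original code: reservoir_sample is the standard "Algorithm R", and a sorted list of cut points with bisect stands in for the RangeMap. All function names are made up for the example.

```python
import bisect
import random


def reservoir_sample(stream, k, seed=None):
    """Algorithm R: uniform random sample of up to k items from a stream."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)          # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def bin_cut_points(sample, bins):
    """Values at which each bin after the first one starts."""
    ordered = sorted(sample)
    return [ordered[len(ordered) * b // bins] for b in range(1, bins)]


def bin_index(cut_points, value):
    """0-based index of the bin that value falls into."""
    return bisect.bisect_right(cut_points, value)
```

With cut points in hand, the share of the population between two values can be estimated as (bin_index(cuts, hi) - bin_index(cuts, lo)) / bins, which mirrors the bin 25 / bin 75 example above.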

What exactly is a "Sample"?

From the OpenAL documentation it looks as if a sample is one single floating point value, like, let's say, 1.94422.
Is that correct? Or is a sample an array of a lot of values? What are audio programming dudes talking about when they say "Sample"? Is it the smallest possible snippet of an audio file?
I imagine an uncompressed audio file to look like a giant array with millions of floating point values, where every value is a point in a graph that forms the sound wave. So every little point is a sample?
Exactly. A sample is a value.
When you convert an analog signal to its digital representation, you convert a continuous function to a discrete and quantized one.
It means that you have a grid of vertical and horizontal lines, and all the possible values lie on the intersections of the lines. The gap between vertical lines represents the time between two consecutive samples; the gap between horizontal lines is the minimum difference you can represent.
On every vertical line you have a sample, which (in linear encoding) is equal to n times k, where n is an integer and k is the quantum, the minimum difference mentioned above.
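To make the n times k relationship concrete, here is a tiny sketch of linear quantization. It assumes a signed integer encoding with the given number of bits and an input range of [-1.0, 1.0]; the function name quantize and those conventions are choices made for this example.

```python
def quantize(x, bits=16):
    """Map x in [-1.0, 1.0] to the nearest step n and its value n * k."""
    half = 2 ** (bits - 1)
    k = 1.0 / half                          # quantum: smallest representable step
    n = round(x / k)
    n = max(-half, min(half - 1, n))        # clamp to the signed integer range
    return n, n * k
```

For example, with 3 bits the quantum is 0.25, so an input of 0.3 lands on n = 1 and is stored as 0.25.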
"I imagine an uncompressed audio file to look like a giant array with millions of floating point values, where every value is a point in a graph that forms the sound wave. So every little point is a sample?"
Yes, that is right. A sample is the value calculated by your A/D converter for that particular point in time. There's a sample for each channel (e.g. left and right in stereo mode). Together, the samples for one point in time form a frame.
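As an illustration of samples versus frames, here is a short sketch that splits raw interleaved audio into frames. It assumes little-endian signed 16-bit PCM, which is a common layout for uncompressed WAV data but by no means the only one; the function name pcm16_frames is made up for the example.

```python
import struct


def pcm16_frames(data: bytes, channels: int = 2):
    """Split interleaved little-endian 16-bit PCM into frames.

    Each frame is a tuple holding one sample per channel.
    """
    count = len(data) // 2                  # number of 16-bit samples
    samples = struct.unpack(f"<{count}h", data[:count * 2])
    usable = count - count % channels       # drop a trailing partial frame
    return [tuple(samples[i:i + channels]) for i in range(0, usable, channels)]
```

In stereo, each frame is then a (left, right) pair, and the sample rate tells you how many such frames make up one second.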
According to the Wikipedia article on signal processing:
A sample refers to a value or set of values at a point in time and/or space.
So yes, it could just be a single floating point value. Although, as Johannes pointed out, if there are multiple channels of audio (e.g. right/left), you would expect one value for each channel.
In audio programming, the term "sample" does indeed refer to a single measurement value. Among audio engineers and producers, however, the term "sample" normally refers to an entire snippet of sound taken (or sampled) from a famous song or movie or some other original audio source.
