What is the meaning of XOR sum in this paper? - sensors

In this formulation, j is an aggregator node in a wireless sensor network, $B_{j}$ is the final aggregated result computed from the right-hand side, and $B_{i}^{'}$ is the data sent by sensor node i. According to the author, the ⊕ sign represents the XOR sum, so I take the right-hand side to mean the XOR sum of all data received by aggregator j. Here is my confusion: if the aggregator needs to sum all of the received data, why does the author use the XOR sum for the aggregator to compute the aggregated result?

XOR sum is just repeatedly applied XOR. See this: What is an XOR sum?
By "sum all of received data", the author probably simply means "XOR-sum all received data."
The XOR-sum carries some information about the data that were summed, though the information it yields is different from that of a typical sum.
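For a concrete picture, here is a tiny Java sketch (mine, not from the paper) of an aggregator folding received values together with XOR; the names are made up:

public class XorSumDemo {
    // XOR-sum: fold all received values together with ^ (bitwise XOR).
    static int xorSum(int[] received) {
        int acc = 0;                  // 0 is the identity element for XOR
        for (int v : received) {
            acc ^= v;                 // repeatedly applied XOR
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] data = {0b1010, 0b0110, 0b0011};                     // example readings
        System.out.println(Integer.toBinaryString(xorSum(data))); // prints 1111
    }
}

Unlike an arithmetic sum, the XOR-sum is its own inverse: XOR-ing the result with any one input removes that input's contribution again, which is one reason aggregation schemes sometimes prefer it.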
Feel free to comment if you have any questions

Related

Data Structure Question: Is there a link between the size of a list in a chaining implementation of hash maps and its load factor?

For example, if I have n keys and m slots in the hash map, the average size of a linked list starting from a slot would be n/m. Am I correct in thinking this? Again, I'm talking about an average. Thanks in advance!
I'm trying to learn data structures.
As you say, the average size of a single list is generally going to be the table's load factor, but this assumes that the simple uniform hashing assumption (SUHA) holds for your hash table (more specifically, for its hash function(s) and expected input keys): simply put, we assume that the hash function distributes elements to buckets uniformly, and independently of one another.
To expand a little, and in different words:
We assume that if we choose a new item randomly (imagine sampling an item from the probability distribution that characterizes our inputs), then there is an equal chance that the item we end up with will be mapped to any of the m buckets. (A chance of 1/m.)
Furthermore, that this probability is unaffected given the presence (or absence) of any other elements in any of the buckets.
This is helpful because from this we can conclude that the probability for an item to be sorted into a given bucket is always 1/m, regardless of any other circumstances; from this it directly follows that the expected (average) length of a single bucket's list will be n/m (we insert n elements into the table, and for each one, sort it into this given list at a probability of 1/m).
To see that this is important, we might imagine a case in which it doesn't hold: for instance, if we're facing some kind of "attack" and our inputs are engineered to all hash into the same bucket, or even just with a high probability. In this case SUHA no longer holds, and clearly neither does the link you've asked about between the length of a list and the load factor.
This is part of the reason it is important to choose a good hash function for your use case: without one, the assumption may not hold, which can have a harmful effect on your lookup times.
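To make the n/m figure concrete, here is a small Java sketch (mine, not tied to any particular hash map implementation) that throws n keys into m buckets with a uniform hash and reports the spread of chain lengths:

import java.util.Arrays;
import java.util.Random;

public class LoadFactorDemo {
    public static void main(String[] args) {
        int n = 1_000_000;                    // keys
        int m = 4_096;                        // buckets (slots)
        int[] chain = new int[m];
        Random rng = new Random(42);

        // Imitate SUHA: every key is equally likely to land in any bucket.
        for (int i = 0; i < n; i++) {
            chain[rng.nextInt(m)]++;
        }

        double loadFactor = (double) n / m;   // the average chain length, n/m
        int longest = Arrays.stream(chain).max().getAsInt();
        System.out.printf("load factor n/m = %.1f, longest chain = %d%n",
                loadFactor, longest);
        // With a uniform hash the longest chain stays close to n/m;
        // with an adversarial hash (all keys to one bucket) it would be n.
    }
}

The average chain length is n/m by construction; what SUHA buys you is that no individual chain strays far from that average, so lookups stay O(1 + n/m) in expectation.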

Additional bit per sample for 2nd channel in FLAC__CHANNEL_ASSIGNMENT_MID_SIDE

I'm trying to understand the following section (link) of libFLAC
case FLAC__CHANNEL_ASSIGNMENT_MID_SIDE:
FLAC__ASSERT(decoder->private_->frame.header.channels == 2);
if(channel == 1)
bps++;
Why are the warm-up samples for the 2nd channel stored with an additional bit per sample compared to the 1st channel? Also, I don't find any mention of this in the format spec - am I missing something?
I also asked on hydrogenaudio, and got an answer there: https://hydrogenaud.io/index.php?topic=121900.0
From the spec (not the one on xiph.org):
The side channel needs one extra bit of bit depth as the subtraction can produce sample values twice as large as the maximum possible in any given bit depth. The mid channel in mid-side stereo does not need one extra bit, as it is shifted left one bit. The left shift of the mid channel does not lead to non-lossless behavior, because an uneven sample in the mid subframe must always be accompanied by a corresponding uneven sample in the side subframe, which means the lost least significant bit can be restored by taking it from the sample in the side subframe.
And from user ktf:
Simply put: if you have 8-bit audio, and the left sample is +127 and the right sample -128, the diff channel becomes 127 - -128 = 255. So, you need 9 bits to code for that number.
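To see the arithmetic end to end, here is a minimal Java sketch (mine, not code from libFLAC) of the mid/side round trip the quoted text describes, using ktf's 8-bit example:

public class MidSideDemo {
    public static void main(String[] args) {
        int left = 127, right = -128;          // 8-bit samples from ktf's example

        // Encode: mid drops its LSB (fits in bps bits); side keeps the full range.
        int side = left - right;               // 255 here: needs bps + 1 = 9 bits
        int mid  = (left + right) >> 1;        // halved, so bps bits are enough

        // Decode: the dropped LSB of (left + right) equals the LSB of side,
        // because left+right and left-right always have the same parity.
        int sum  = (mid << 1) | (side & 1);    // restore left + right exactly
        int outL = (sum + side) >> 1;
        int outR = (sum - side) >> 1;

        System.out.println("side = " + side);                      // 255
        System.out.println("decoded = " + outL + ", " + outR);     // 127, -128
    }
}

The side value of 255 is why the second subframe needs bps + 1 bits, while the mid channel gets away with bps bits because its dropped least significant bit can always be recovered from the side channel.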

Verilog XOR operation with "z" input

Say I have a 4-bit test_expr formed by concatenating four 1-bit input ports. The input port corresponding to the LSB of test_expr is connected at the TB top to a wire, but the wire is undriven.
I have an assert that fires if(^test_expr === 1'bx). I am using Synopsys VCS and Verdi to simulate my design.
Now when I pull out test_expr in Verdi, I see a "z" on the LSB, and my assert also fires.
Does this mean that XOR operation with a "z" bit always gives out an "x"? Or is this something that is simulator dependent and can vary?
Thanks
Does this mean that XOR operation with a "z" bit always gives out an "x"?
Yes. Inputs to logic gates must always be driven; a floating ("z") input to a gate will result in erroneous, unpredictable behavior ("x").
The xor operation gives an output of x when any of the inputs is z. Please find the truth table for the basic gate operations here. As Matthew Taylor has pointed out in a previous comment, that is logically the correct output for a floating input.
To answer your comment: Z means 'high impedance'. Hence, when any of the inputs is not a clean 0 or 1 (be it X or Z), we can only expect the result to be the unknown value 'x', not 'z'.
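If it helps to see the rule spelled out, here is a small Java sketch (purely illustrative, not simulator code) of the 4-state XOR behaviour described above: any x or z operand yields x, and a gate never outputs z:

public class FourStateXor {
    enum Bit { ZERO, ONE, X, Z }

    // Verilog's xor table: a z operand is treated like x,
    // and any x operand makes the result x.
    static Bit xor(Bit a, Bit b) {
        if (a == Bit.X || a == Bit.Z || b == Bit.X || b == Bit.Z) {
            return Bit.X;             // never z: gates do not output high impedance
        }
        return (a == b) ? Bit.ZERO : Bit.ONE;
    }

    public static void main(String[] args) {
        System.out.println(xor(Bit.ONE, Bit.Z));    // X, as the assertion observed
        System.out.println(xor(Bit.ONE, Bit.ZERO)); // ONE
    }
}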

Detection of similar sequences in ordered event lists

I have logs from a bunch (millions) of small experiments.
Each log contains a list (tens to hundreds) of entries. Each entry is a timestamp and an event ID (there are several thousand event IDs, each of which may occur many times across the logs):
1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 alpha
1403973098 delta
I know that one event may trigger other events later.
I am researching this dataset. I am looking for "stable" sequences of events that occur often enough in the experiments.
Is there a way to do this without writing too much code and without using proprietary software? The solution should be scalable enough, and work on large datasets.
I think that this task is similar to what bioinformatics does — finding sequences in DNA and such. Only my alphabet contains many more than four letters... (Update, thanks to #JayInNyc: proteomics deals with larger alphabets than mine.)
(Note, BTW, that I do not know beforehand how stable and similar I want my sequences, what is the minimal sequence length etc. I'm researching the dataset, and will have to figure this out on the go.)
Anyway, any suggestions on the approaches/tools/libraries I could use?
Update: Some answers to the questions in comments:
Stable sequences: found often enough across the experiments. (How often is enough? Don't know yet. It looks like I need to compute a ranking of the chains and discard the rarest.)
Similar sequences: sequences that look similar. "Are the sequences 'A B C D E' and 'A B C E D' (minor difference in sequence) similar according to you? Are the sequences 'A B C D E' and 'A B C 1 D E' (sequence of occurrence of selected events is same) also similar according to you?" — Yes to both questions. More drastic mutations are probably also OK. Again, I'd like to be able to calculate a top and discard the most dissimilar...
Timing: I can discard timing information for now (but not order). But it would be cool to have it in a similarity index formula.
Update 2: Expected output.
In the end I would like to have a rating of the most popular, longest, most stable chains. A combination of all three factors should feed into the calculation of the rating score.
A chain in such rating is, obviously, rather a cluster of similar enough chains.
A synthetic example of a chain-cluster:
alpha
beta
gamma
[garbage]
[garbage]
delta
another:
alpha
beta
gamma|zeta|epsilon
delta
(or whatever variant did not come to my mind right now.)
So, the end output would be something like that (numbers are completely random in this example):
Chain cluster ID | Times found | Time stab. factor | Chain stab. factor | Length | Score
A | 12345 | 123 | 3 | 5 | 100000
B | 54321 | 12 | 30 | 3 | 700000
I have thought about this setup for the past day or so: how to do it in a sane, scalable way in bash, etc. The answer is really driven by the relational information you want to draw from the data and the apparent size of the dataset you currently have. The cleanest solution will be to load your datasets into a relational database (MariaDB would be my recommendation).
Since your data already exists in a fairly clean format, you have two options for getting it into a database: (1) if the files hold the data in a usable row-by-column layout, you can simply use LOAD DATA INFILE to bring your data into the database; or (2) parse the files with bash in a while read line; do loop, transform each line into the table format you want, and use mysql batch mode to load the information directly into mysql in a single pass. The general form of the bash command would be mysql -uUser -hHost database -Bse "your insert command".
Once in a relational database, you then have the proper tool for the job of being able to run flexible queries against your data in a sane manner instead of continually writing/re-writing bash snippets to handle your data in a different way each time. That is probably the best scalable solution you are looking for. A little more work up-front, but a lot better setup going forward.
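If you would rather script the load from Java than bash, a rough JDBC sketch for the two-column log format could look like this (the connection string, table, and column names are made up; adjust them to your schema):

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class LoadLogs {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details and table name.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:mariadb://localhost/experiments", "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO events (ts, event_id) VALUES (?, ?)");
             BufferedReader in = Files.newBufferedReader(Path.of(args[0]))) {

            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.trim().split("\\s+");   // "1403973044 alpha"
                ps.setLong(1, Long.parseLong(f[0]));
                ps.setString(2, f[1]);
                ps.addBatch();
            }
            ps.executeBatch();        // one round trip for the whole batch
        }
    }
}

Run it once per log file (or loop over the files); the batched insert keeps the database round trips to a minimum.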
Wikipedia defines an algorithm as 'a precise list of precise steps': 'I am looking for "stable" sequences of events that occur often enough in the experiments.' "Stable" and "often enough" without definitions make the task of giving you an algorithm impossible.
So I give you the trivial one to calculate the frequency of sequences of length 2. I will ignore the time stamp. Here is the awk code (pW stands for previous Word, pcs stands for pair counters):
#!/usr/bin/awk -f
BEGIN { getline; pW=$2; }
{ pcs[pW, $2]++; pW=$2; }
END {
for (i in pcs)
print i, pcs[i];
}
I duplicated your sample to show something meaningful looking
1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 alpha
1403973098 delta
1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 beta
1403973098 delta
1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 beta
1403973098 delta
1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 beta
1403973098 delta
Running the code above on it gives:
gammaalpha 1
alphabeta 4
gammabeta 3
deltaalpha 3
betagamma 4
alphadelta 1
betadelta 3
which can be interpreted as: alpha followed by beta, and beta followed by gamma, are the most frequent length-two sequences, each occurring 4 times in the sample. I guess that would be your definition of a stable sequence occurring often enough.
What's next?
(1) You can easily adapt the code above to sequences of length N, and to find sequences occurring often enough you can sort (-k2nr) the output on the second column; a sketch of one way to do this in another language follows this list.
(2) To put a limit on N you can stipulate that no event triggers itself, which provides you with a cut-off point. Or you can place a limit on the timestamp, i.e. the difference between consecutive events.
(3) So far those sequences were really strings and I used exact matching between them (CLRS terminology). Nothing prevents you from using your favourite similarity measure instead:
{ pcs[CLIFY(pW, $2)]++; pW=$2; }
CLIFY would be a function which takes k consecutive events and puts them into a bin, i.e. maybe you want ABCDE and ABDCE to go to the same bin. CLIFY could of course take as an additional argument the set of bins so far.
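For item (1) above, here is a hedged Java sketch of the same counting idea generalised to length-N windows (the class name and the choice of N = 3 are mine; it reads the same timestamp-and-event format):

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class NGramCount {
    public static void main(String[] args) throws Exception {
        int n = 3;                                      // sequence length N
        Map<String, Integer> counts = new HashMap<>();
        Deque<String> window = new ArrayDeque<>();      // last N event IDs

        try (BufferedReader in = Files.newBufferedReader(Path.of(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String event = line.trim().split("\\s+")[1];   // drop the timestamp
                window.addLast(event);
                if (window.size() > n) window.removeFirst();
                if (window.size() == n) {
                    counts.merge(String.join(" ", window), 1, Integer::sum);
                }
            }
        }

        // Most frequent first, like sort -k2nr on the awk output.
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
              .limit(20)
              .forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
    }
}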
The choice of awk here is just for convenience. On its own it wouldn't fly for millions of logs, but you can easily run such scripts in parallel.
It is unclear what you want to use this for, but a Google search for Markov chains and Mark V Shaney would probably help.

Minimal and maximal magnitude in Fortran

I'm trying to rewrite the minpack Fortran 77 library in Java (for my own needs), and I came across this in the minpack.f source code:
integer mcheps(4)
integer minmag(4)
integer maxmag(4)
double precision dmach(3)
equivalence (dmach(1),mcheps(1))
equivalence (dmach(2),minmag(1))
equivalence (dmach(3),maxmag(1))
...
data dmach(1) /2.22044604926d-16/
data dmach(2) /2.22507385852d-308/
data dmach(3) /1.79769313485d+308/
dpmpar = dmach(i)
return
What are the minmag and maxmag functions, and why do dmach(2) and dmach(3) have these values?
There is an explanation in comments:
c dpmpar(1) = b**(1 - t), the machine precision,
c dpmpar(2) = b**(emin - 1), the smallest magnitude,
c dpmpar(3) = b**emax*(1 - b**(-t)), the largest magnitude.
What are the smallest and largest magnitude? There must be a way to compute these values at runtime; hard-coding machine constants in source code is bad style.
EDIT:
I suppose that the static fields Double.MIN_VALUE and Double.MAX_VALUE are the values I was looking for.
minmag and maxmag (and mcheps too) are not functions; they are declared to be rank-1 integer arrays with 4 elements each. Likewise dmach is a rank-1, 3-element array of double precision values. It is very likely, but not certain, that each integer value occupies 4 bytes and each d-p value 8 bytes. Bear this in mind as the answer progresses.
So an expression such as mcheps(1) is not a function call but a reference to the 1st element of an array.
equivalence is an old FORTRAN feature, now deprecated both by language standards and by software engineering practices. A statement such as
equivalence (dmach(1),mcheps(1))
states that the first element of dmach is located, in memory, at the same address as the first element of mcheps. By implication, this also means that the 24 bytes of dmach occupy the same addresses as the 16 bytes of mcheps, and another 8 bytes too. I'll leave you to draw a picture of what is going on. Note that it is conceivable that the code originally (and perhaps still) uses 8 byte integers so that the elements of the equivalenced arrays match 1:1.
Note that equivalence gives, essentially, more than one name, and more than one interpretation, to the same memory locations. mcheps(1) is the name of an integer stored in 4 bytes of memory which form part of the storage for dmach(1). Equivalencing used to be used to implement all sorts of 'clever' tricks back in the days when every byte was precious.
Then the data statements assign values to the elements of dmach. To me those values look to be just what the comment tells us they are.
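To make that concrete, for IEEE double precision $b = 2$ and $t = 53$ (assuming that is the format the hard-coded constants target, which the values strongly suggest):
$b^{1-t} = 2^{-52} \approx 2.22044604926 \times 10^{-16}$, the machine precision,
$b^{e_{min}-1} = 2^{-1022} \approx 2.22507385852 \times 10^{-308}$, the smallest normalised magnitude, and
$b^{e_{max}}(1 - b^{-t}) = 2^{1024}(1 - 2^{-53}) \approx 1.79769313485 \times 10^{308}$, the largest finite magnitude,
which are exactly the three values in the data statements.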
EDIT: The comment indicates that those magnitudes are the smallest normalised and largest finite double precision numbers on the platform for which the code was last compiled. I think that in Java they are probably called doubles. I don't know Java, so I don't know what facilities it has for returning the value of the largest and smallest doubles; if you don't know this either, hit the 'net or ask another SO question, to which you'll probably get responses along the lines of "search the net".
Most of this you should be able to ignore entirely. As you write, a better approach would be to find out those values at run-time by enquiry using intrinsic functions. Fortran 90 (and later) has such functions; I imagine Java has too, but that's your domain, not mine.
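For what it's worth, the Java standard library exposes these values directly; here is a small sketch of my own (not from minpack), with one caveat about Double.MIN_VALUE:

public class MachineConstants {
    public static void main(String[] args) {
        // dpmpar(1): machine precision, 2^-52 for IEEE doubles.
        System.out.println(Math.ulp(1.0));        // 2.220446049250313E-16

        // dpmpar(2): smallest normalised magnitude, 2^-1022.
        // Note: Double.MIN_VALUE is the smallest *subnormal* (4.9E-324),
        // so Double.MIN_NORMAL is the closer match to dmach(2).
        System.out.println(Double.MIN_NORMAL);    // 2.2250738585072014E-308

        // dpmpar(3): largest finite magnitude, (2 - 2^-52) * 2^1023.
        System.out.println(Double.MAX_VALUE);     // 1.7976931348623157E308
    }
}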
