Need to construct DFA (deterministic finite automaton) - regular-language

I need to construct a DFA which recognises all the strings made solely from 0s and 1s, so that thay have an even number of zeros and number of ones divisible by 3. I found an automaton for the case of even number of 0s and even number of 1s:
I tried going from here by adding some states, changing branches, etc.. However I remained unsuccessful usually losing track of what's the automaton doing beacuse of branches and states I'd add. Any help would be greatly appreciated.

You need states which record the divisibility by 2 and by 3, which means you need 6 states. Just call them 0|0, 1|0, 0|1, 1|1, 0|2, 1|2. The first digit tells you that when you reach the state, you have an even or odd number of zeros, the 2nd digit tells you that when you reach the state, you have a number of 1s that, when divided by 3, give the given modulus.
Your state diagram contains:
0|x --0--> 1|x
1|x --0--> 0|x
y|0 --1--> y|1
y|1 --1--> y|2
y|2 --1--> y|0
The start state is 0|0, which is also the only stop state.
The important bit to understand is that each state records the the modulus of the number of zeros or ones read when divided by 2, respectively 3. The 0|0 is then modulus 0 in both cases, which is the accepting criteria. This all works, because the number of different states to keep track of is finite. The name DFA tells us already that it would not work for an infinite number of states to track.

One way to view it is that the problem asks for the intersection of 2 languages: one containing an even number of zeroes and another having the number of ones divisible by 3. A way to approach this is to make DFAs for both the languages and then make another DFA which keeps track of the pair of states in each DFA when an input symbol is read.
I have used 'e' and 'o' to denote that the number of zeroes is even or odd respectively. The second digit in each of the states defines the remainder obtained by dividing the number of ones by 3.

Related

Bitwise operations Python

This is a first run-in with not only bitwise ops in python, but also strange (to me) syntax.
for i in range(2**len(set_)//2):
parts = [set(), set()]
for item in set_:
parts[i&1].add(item)
i >>= 1
For context, set_ is just a list of 4 letters.
There's a bit to unpack here. First, I've never seen [set(), set()]. I must be using the wrong keywords, as I couldn't find it in the docs. It looks like it creates a matrix in pythontutor, but I cannot say for certain. Second, while parts[i&1] is a slicing operation, I'm not entirely sure why a bitwise operation is required. For example, 0&1 should be 1 and 1&1 should be 0 (carry the one), so binary 10 (or 2 in decimal)? Finally, the last bitwise operation is completely bewildering. I believe a right shift is the same as dividing by two (I hope), but why i>>=1? I don't know how to interpret that. Any guidance would be sincerely appreciated.
[set(), set()] creates a list consisting of two empty sets.
0&1 is 0, 1&1 is 1. There is no carry in bitwise operations. parts[i&1] therefore refers to the first set when i is even, the second when i is odd.
i >>= 1 shifts right by one bit (which is indeed the same as dividing by two), then assigns the result back to i. It's the same basic concept as using i += 1 to increment a variable.
The effect of the inner loop is to partition the elements of _set into two subsets, based on the bits of i. If the limit in the outer loop had been simply 2 ** len(_set), the code would generate every possible such partitioning. But since that limit was divided by two, only half of the possible partitions get generated - I couldn't guess what the point of that might be, without more context.
I've never seen [set(), set()]
This isn't anything interesting, just a list with two new sets in it. So you have seen it, because it's not new syntax. Just a list and constructors.
parts[i&1]
This tests the least significant bit of i and selects either parts[0] (if the lsb was 0) or parts[1] (if the lsb was 1). Nothing fancy like slicing, just plain old indexing into a list. The thing you get out is a set, .add(item) does the obvious thing: adds something to whichever set was selected.
but why i>>=1? I don't know how to interpret that
Take the bits in i and move them one position to the right, dropping the old lsb, and keeping the sign. Sort of like this
Except of course that in Python you have arbitrary-precision integers, so it's however long it needs to be instead of 8 bits.
For positive numbers, the part about copying the sign is irrelevant.
You can think of right shift by 1 as a flooring division by 2 (this is different from truncation, negative numbers are rounded towards negative infinity, eg -1 >> 1 = -1), but that interpretation is usually more complicated to reason about.
Anyway, the way it is used here is just a way to loop through the bits of i, testing them one by one from low to high, but instead of changing which bit it tests it moves the bit it wants to test into the same position every time.

Automaton that recognizes a language without 3 consecutive zeros

I would create a finite state automaton that recognizes the language of strings of 0 and 1 that don't contain 3 consecutive zeros.
I tried to do the following automaton but isn't complete as, for example, it doesn't recognize the string: 1001110
How can I change it? For the rest, my reasoning to solve the exercise is that correct?
Thanks so much
i made this in paint so not looking nice but,try below automation.
Your starting state, q0, is a state that is not reached by reading a zero. As you correctly modeled in your automaton, from the state q0 you must allow the automaton to read up to two zeros, hence you need states q1 (reached by reading exactly one consecutive zero) and q2 (reached by reading exactly two consecutive zeros).
Whenever you read a 1 you will be in a state that was not reached by reading a zero.
Now a question is, how many states do you need?
It is permissible to have more than one end state in a finite automaton. In this case you must have more than one end state, because any time you read a 1 you must reach a state that permits two subsequent consecutive zeros to be read, whereas every time you read a zero you much reach a state that does not permit two subsequent consecutive zeros to be read, and your language has strings that end in zero as well as strings that end in 1.

Security: longer keys versus more available characters

I apologize if this has been answered before, but I was not able to find anything. This question was inspired by a comment on another security-related question here on SO:
How to generate a random, long salt for use in hashing?
The specific comment is as follows (sixth comment of accepted answer):
...Second, and more importantly, this will only return hexadecimal
characters - i.e. 0-9 and A-F. It will never return a letter higher
than an F. You're reducing your output to just 16 possible characters
when there could be - and almost certainly are - many other valid
characters.
– AgentConundrum Oct 14 '12 at 17:19
This got me thinking. Say I had some arbitrary series of bytes, with each byte being randomly distributed over 2^(8). Let this key be A. Now suppose I transformed A into its hexadecimal string representation, key B (ex. 0xde 0xad 0xbe 0xef => "d e a d b e e f").
Some things are readily apparent:
len(B) = 2 len(A)
The symbols in B are limited to 2^(4) discrete values while the symbols in A range over 2^(8)
A and B represent the same 'quantities', just using different encoding.
My suspicion is that, in this example, the two keys will end up being equally as secure (otherwise every password cracking tool would just convert one representation to another for quicker attacks). External to this contrived example, however, I suspect there is an important security moral to take away from this; especially when selecting a source of randomness.
So, in short, which is more desirable from a security stand point: longer keys or keys whose values cover more discrete symbols?
I am really interested in the theory behind this, so an extra bonus gold star (or at least my undying admiration) to anyone who can also provide the math / proof behind their conclusion.
If the number of different symbols usable in your password is x, and the length is y, then the number of different possible passwords (and therefore the strength against brute-force attacks) is x ** y. So you want to maximize x ** y. Both adding to x or adding to y will do that, Which one makes the greater total depends on the actual numbers involved and what your practical limits are.
But generally, increasing x gives only polynomial growth while adding to y gives exponential growth. So in the long run, length wins.
Let's start with a binary string of length 8. The possible combinations are all permutations from 00000000 and 11111111. This gives us a keyspace of 2^8, or 256 possible keys. Now let's look at option A:
A: Adding one additional bit.
We now have a 9-bit string, so the possible values are between 000000000 and 111111111, which gives us a keyspace size of 2^9, or 512 keys. We also have option B, however.
B: Adding an additional value to the keyspace (NOT the keyspace size!):
Now let's pretend we have a trinary system, where the accepted numbers are 0, 1, and 2. Still assuming a string of length 8, we have 3^8, or 6561 keys...clearly much higher.
However! Trinary does not exist!
Let's look at your example. Please be aware I will be clarifying some of it, which you may have been confused about. Begin with a 4-BYTE (or 32-bit) bitstring:
11011110 10101101 10111110 11101111 (this is, btw, the bitstring equivalent to 0xDEADBEEF)
Since our possible values for each digit are 0 or 1, the base of our exponent is 2. Since there are 32 bits, we have 2^32 as the strength of this key. Now let's look at your second key, DEADBEEF. Each "digit" can be a value from 0-9, or A-F. This gives us 16 values. We have 8 "digits", so our exponent is 16^8...which also equals 2^32! So those keys are equal in strength (also, because they are the same thing).
But we're talking about REAL passwords, not just those silly little binary things. Consider an alphabetical password with only lowercase letters of length 8: we have 26 possible characters, and 8 of them, so the strength is 26^8, or 208.8 billion (takes about a minute to brute force). Adding one character to the length yields 26^9, or 5.4 trillion combinations: 20 minutes or so.
Let's go back to our 8-char string, but add a character: the space character. now we have 27^8, which is 282 billion....FAR LESS than adding an additional character!
The proper solution, of course, is to do both: for instance, 27^9 is 7.6 trillion combinations, or about half an hour of cracking. An 8-character password using upper case, lower case, numbers, special symbols, and the space character would take around 20 days to crack....still not nearly strong enough. Add another character, and it's 5 years.
As a reference, I usually make my passwords upwards of 16 characters, and they have at least one Cap, one space, one number, and one special character. Such a password at 16 characters would take several (hundred) trillion years to brute force.

Encoding name strings into an unique number

I have a large set of names (millions in number). Each of them has a first name, an optional middle name, and a lastname. I need to encode these names into a number that uniquely represents the names. The encoding should be one-one, that is a name should be associated with only one number, and a number should be associated with only one name.
What is a smart way of encoding this? I know it is easy to tag each alphabet of the name according to its position in the alphabet set (a-> 1, b->2.. and so on) and so a name like Deepa would get -> 455161, but again here I cannot make out if the '16' is really 16 or a combination of 1 and 6.
So, I am looking for a smart way of encoding the names.
Furthermore, the encoding should be such that the number of digits in the output numeral for any name should have fixed number of digits, i.e., it should be independent of the length. Is this possible?
Thanks
Abhishek S
To get the same width numbers, can't you just zero-pad on the left?
Some options:
Sort them. Count them. The 10th name is number 10.
Treat each character as a digit in a base 26 (case insensitive, no
digits) or 52 (case significant, no digits) or 36 (case insensitive
with digits) or 62 (case sensitive with digits) number. Compute the
value in an int. EG, for a name of "abc", you'd have 0 * 26^2 + 1 *
26^1 + 2 * 20^0. Sometimes Chinese names may use digits to indicate tonality.
Use a "perfect hashing" scheme: http://en.wikipedia.org/wiki/Perfect_hash_function
This one's mostly suggested in fun: use goedel numbering :). So
"abc" would be 2^0 * 3^1 * 5^2 - it's a product of powers of primes.
Factoring the number gives you back the characters. The numbers
could get quite large though.
Convert to ASCII, if you aren't already using it. Then treat each
ordinal of a character as a digit in a base-256 numbering system.
So "abc" is 0*256^2 + 1*256^1 + 2*256^0.
If you need to be able to update your list of names and numbers from time to time, #2, #4 and #5 should work. #1 and #3 would have problems. #5 is probably the most future-proofed, though you may find you need unicode at some point.
I believe you could do unicode as a variant of #5, using powers of 2^32 instead of 2^8 == 256.
What you are trying to do there is actually hashing (at least if you have a fixed number of digits). There are some good hashing algorithms with few collisions. Try out sha1 for example, that one is well tested and available for modern languages (see http://en.wikipedia.org/wiki/Sha1) -- it seems to be good enough for git, so it might work for you.
There is of course a small possibility for identical hash values for two different names, but that's always the case with hashing and can be taken care of. With sha1 and such you won't have any obvious connection between names and IDs, which can be a good or a bad thing, depending on your problem.
If you really want unique ids for sure, you will need to do something like NealB suggested, create IDs yourself and connect names and IDs in a Database (you could create them randomly and check for collisions or increment them, starting at 0000000000001 or so).
(improved answer after giving it some thought and reading the first comments)
You can use the BigInteger for encoding arbitrary strings like this:
BigInteger bi = new BigInteger("some string".getBytes());
And for getting the string back use:
String str = new String(bi.toByteArray());
I've been looking for a solution to a problem very similar to the one you proposed and this is what I came up with:
def hash_string(value):
score = 0
depth = 1
for char in value:
score += (ord(char)) * depth
depth /= 256.
return score
If you are unfamiliar with Python, here's what it does.
The score is initially 0 and the depth are set to 1
For every character add the ord value * the depth
The ord function returns the UTF-8 value (0-255) for each character
Then it's multiplied by the 'depth'.
Finally the depth is divided by 256.
Essentially, the way that it works is that the initial characters add more to the score while later characters contribute less and less. If you need an integer, multiply the end score by 2**64. Otherwise you will have a decimal value between 0-256. This encoding scheme works for binary data as well as there are only 256 possible values in a byte/char.
This method works great for smaller string values, however, for longer strings you will notice that the decimal value requires more precision than a regular double (64-bit) can provide. In Java, you can use the 'BigDecimal' and in Python use the 'decimal' module for added precision. A bonus to using this method is that the values returned are in sorted order so they can be searched 'efficiently'.
Take a look at https://en.wikipedia.org/wiki/Huffman_coding. That is the standard approach.
You can translate it, if every character (plus blank, at least) will occupy a position.
Therefore ABC, which is 1,2,3 has to be translated to
1*(2*26+1)² + 2*(53) + 3
This way, you could encode arbitrary strings, but if the length of the input isn't limited (and how should it?), you aren't guaranteed to have an upper limit for the length.

Is there a formal definition of character difference across a string and if so how is it calculated?

Overview
I'm looking to analyse the difference between two characters as part of a password strength checking process.
I'll explain what I'm trying to achieve and why and would like to know if what I'm looking to do is formally defined and whether there are any recommended algorithms for achieving this.
What I'm looking to do
Across a whole string, I'm looking to compare the current character with the previous character and determine how different they are.
As this relates to password strength checking, the difference between one character and it's predecessor in a string might be defined as being how predictable character N is from knowing character N - 1. There might be a formal definition for this of which I'm not aware.
Example
A password of abc123 could be arguably less secure than azu590. Both contain three letters followed by three numbers, however in the case of the former the sequence is more predictable.
I'm assuming that a password guesser might try some obvious sequences such that abc123 would be tried much before azu590.
Considering the decimal ASCII values for the characters in these strings, and given that b is 1 different from a and c is 1 different again from b, we could derive a simplistic difference calculation.
Ignoring cases where two consecutive characters are not in the same character class, we could say that abc123 has an overall character to character difference of 4 whereas azu590 has a similar difference of 25 + 5 + 4 + 9 = 43.
Does this exist?
This notion of character to character difference across a string might be defined, similar to the Levenshtein distance between two strings. I don't know if this concept is defined or what it might be called. Is it defined and if so what is it called?
My example approach to calculating the character to character difference across a string is a simple and obvious approach. It may be flawed, it may be ineffective. Are there any known algorithms for calculating this character to character difference effectively?
It sounds like you want a Markov Chain model for passwords. A Markov Chain has a number of states and a probability of transitioning between the states. In your case the states are the characters in the allowed character set and the probability of a transition is proportional to the frequency that those two letters appear consecutively. You can construct the Markov Chain by looking at the frequency of the transitions in an existing text, for example a freely available word list or password database.
It is also possible to use variations on this technique (Markov chain of order m) where you for example consider the previous two characters instead of just one.
Once you have created the model you can use the probability of generating the password from the model as a measure of its strength. This is the product of the probabilities of each state transition.
For general signals/time-series data, this is known as Autocorrelation.
You could try adapting the Durbin–Watson statistic and test for positive auto-correlation between the characters. A naïve way may be to use the unicode code-points of each character, but I'm sure that will not be good enough.

Resources