AQL to validate path to node - arangodb

We're required to have some AQL that validates a specific path to an entity. The current solution performs very poorly, due to needing to scan whole collections.
For example, here we have 3 entity 'types': a, b, c (though they all live in a single collection) and specific edge collections between them, and we want to establish whether there is a connection between _key "123" and _key "234" that goes exactly through a -> b -> c.
FOR a IN entities
  FILTER a._key == "123"
  FOR b IN 1..1 OUTBOUND a edges_a_to_b
    FOR c IN 1..1 INBOUND b edges_c_to_b
      FILTER c._key == "234"
      ...
This can fan out very quickly!
We have another solution, where we use SHORTEST_PATH and specify the appropriate DIRECTION and edge collections, which is much faster (over 100 times). But we worry that this approach does not quite cover our general case: the order of the edge collections is not enforced, and we may have to traverse the same edge collection more than once, which that syntax does not allow.
Is there another way, possibly involving paths in the traversal?
Thanks!
Dan.

If I understand correctly, you always know the exact path that is required between your two vertices.
So to take your example a -> b -> c, a valid result will have:
path.vertices == [a, b, c]
So we can filter on this path, which only works if you use a single traversal instead of multiple chained ones.
So what we try to do is the following pattern:
FOR c, e, path IN <pathlength> <direction> <start> <edge-collections>
FILTER path.vertices[0] == a // This needs to be formulated correctly
FILTER path.vertices[1] == b // This needs to be formulated correctly
FILTER path.vertices[2] == c // This needs to be formulated correctly
LIMIT 1 // We only need exactly one path, so LIMIT 1 is enough
[...]
With this hint it is possible to write the query in the following way:
FOR a IN entities
  FILTER a._key == "123"
  FOR c, e, path IN 2 OUTBOUND a edges_a_to_b, INBOUND edges_c_to_b
    FILTER path.vertices[1].type == "b" /* or whatever identifies b */
    FILTER path.vertices[2]._key == "234"
    LIMIT 1 /* This will stop as soon as the first match is found, so very important! */
    /* [...] */
This will allow the optimizer to apply the filter conditions as early as possible, and it will use (almost) the same algorithm as the shortest-path implementation.
The trick is to use one traversal instead of multiple ones, to save internal overhead and allow for better optimization.
Also take into account that it might be better to search in the opposite direction:
e.g. instead of a -> b -> c check for c <- b <- a which might be faster.
This depends on the number of edges per node.
I assume a doctor has many surgeries, but a single patient most likely has only a small amount of surgeries so it is better to start at the patient and check backwards instead of starting at the doctor and check forwards.
Please let me know if this helps already; otherwise we can talk about more details and see if we can find some further optimizations.
Disclaimer: I am part of the Core-Dev team at ArangoDB

Related

What is wrong with "keccak256(abi.encode(_a ^ _b))" considering merkle tree security?

In this article MERKLE PROOFS FOR OFFLINE DATA INTEGRITY there is a paragraph:
Warning: Cryptography is harder than it looks. The initial version of
this article had the hash function hash(a^b). That was a bad idea
because it meant that if you knew the legitimate values of a and b you
could use b' = a^b^a' to prove any desired a' value. With this
function you'd have to calculate b' such that hash(a') ^ hash(b') is
equal to a known value (the next branch on the way to root), which is
a lot harder.
I can see that this code seems to be insecure:
function pairHash(uint _a, uint _b) internal pure returns(uint) {
    return uint(keccak256(abi.encode(_a ^ _b)));
}
Could somebody explain why, and provide an example (real code appreciated)? Or maybe just a URL to a corresponding article (the closest one I've found is this one).
Thanks in advance
Let's say we have two values a and b and we create the pair as hash(a^b). The issue is that we can create the same pair by passing two different values a' and b' such that b' = a^b^a' for every a'.
The proof is easily done knowing that the xor operation is commutative and x^x = 0:
hash(a'^b') = hash(a'^(a^b^a')) = hash(a'^a'^a^b) = hash(a^b)
The most basic example is a tree with a and b as its only leaves, so with root equal to hash(a^b). Let's say the leaves are token amounts for an airdrop, and we are supposed to claim the tokens by passing a as the value and [b] as the Merkle proof. However, someone can pass any value a' with proof [a^b^a'], as this still verifies against the Merkle root.
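For concreteness, here is a small self-contained demonstration of the forgery in Python (sha256 stands in for keccak256; the attack depends only on the XOR, not on the particular hash function):

import hashlib

def pair_hash(a, b):
    # Mimics the flawed pairHash: hash the XOR of the two children.
    return hashlib.sha256((a ^ b).to_bytes(32, "big")).digest()

a, b = 1000, 2000            # legitimate leaf values
root = pair_hash(a, b)       # committed Merkle root of the two-leaf tree

a_forged = 999999            # any value the attacker wants to "prove"
b_forged = a ^ b ^ a_forged  # crafted sibling, so a' ^ b' == a ^ b

assert pair_hash(a_forged, b_forged) == root
print("forged (value, proof) pair verifies against the same root")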
(PS: the issue from the codearena audit is a different one)

Letter substitutions termination

Given:
A string S of length l containing only characters from 'a' to 'z'
A set of ordered substitution rules R (in the form X->Y) where X, Y are single letters from 'a' to 'z' (e.g. 'a'->'e' could be a valid rule, but 'ce'->'abc' would never be a valid rule)
When a rule r in R is applied to S, all letters of S which are equal to the left side of r are replaced by the letter on the right side of r. If applying r causes any replacement in S, r is called a triggered rule.
Flowchart (algorithm):
(1) Apply all rules in R, one after another (following the order of the rules in R), to S.
(2) While (any rule was triggered during (1)): repeat (1).
(3) Terminate.
The question is: is there any way to determine whether, for a given string S and set R, the algorithm terminates (or runs forever)?
Example 1 (manually executed):
S = 'abcdef' R = { 'a'->'b' , 'b' -> 'c' }
(the order implied is the order of appearance, left to right, of each rule)
After running the algorithm on S and R:
(1.1): 'abcdef' --> 'bbcdef' --> 'cccdef'
(2.1): repeat (1) because there are 2 replacements during the (1.1)
(1.2): 'cccdef'
(2.2): continue to (3) because there is no replacement during the (1.2)
(3) : terminate the algorithm
=> The algorithm terminates with the given S and R.
Example 2:
S = 'abcdef' R = { 'a'->'b' , 'b' -> 'a' }
(the order implied is the order of appearance, left to right, of each rule)
After running the algorithm on S and R:
(1.1): 'abcdef' --> 'bbcdef' --> 'abcdef'
(2.1): repeat (1) because there are 2 replacements during the (1.1)
(1.2): 'abcdef' --> 'bbcdef' --> 'abcdef'
(2.2): repeat (1) because there are 2 replacements during the (1.2)
(1.3): ... and so on, just like (1.1), forever.
The step (3) (terminate) is never reached.
=> The algorithm won't terminate with the given S and R.
I worked on this and found no efficient algorithm for the question "does the algorithm halt?".
The first idea that came to my mind was to "find cycles" of letters which appear in triggered rules, but the number of rules may be too large for this idea to be practical.
The second was to propose a "threshold" for the number of repeats: if the threshold is exceeded, we conclude the algorithm does not terminate. The "threshold" could be chosen arbitrarily (as long as it is big enough), but this approach is not really compelling.
So I am wondering whether there is an upper bound for the "threshold" which ensures that we always get the right answer. I came up with threshold = 26, where 26 is the number of letters from 'a' to 'z', but I can't prove whether it is correct (or not). (I hope it would be something like the Bellman-Ford algorithm, which detects a negative cycle in a fixed number of steps.)
How about you? Please help me find the answer (this is not homework).
Thank you for reading.
One simple way to think about solving this is to consider a string of length 1 and see if the problem can loop for any given starting letter. Since the string's length is never changing, and applying a rule applies to each character in S independently, it suffices to consider just a string of length 1.
Now, start with a state diagram with 26 states - 1 for each letter of the alphabet. Now, for your state transitions, consider this process:
Apply the transitions from R one at a time, in order, until you reach the end of R. If a full pass of R never fires a rule for a particular state (letter), that letter is stable: once you reach it, you terminate. Otherwise, after applying the entire sequence of R, you end up at a (possibly new) letter; this is your new state.
Note that all state transitions are deterministic because we apply the entire sequence of R, not just the individual transitions. If we applied the individual transitions, we might get confused, because we might have a -> b, b->a, a->c. When looking at the individual operations, we might think there are two possible transitions from a (either to b or to c), but really, considering the entire sequence, we see definitively that a transitions to c.
You will be done creating your state diagram after considering the next state of each starting letter. Creating the entire state diagram in this manner requires 26 * |R| operations. If the string S contains any letter whose trajectory in the state diagram reaches a loop, the process fails to halt; otherwise it halts.
Alternatively, you can simply declare non-termination if the algorithm is still running after 26 iterations through the entire sequence R.
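Here is a minimal sketch of this halting check in Python (the function names are mine). It tracks whether a full pass fires any rule, which also covers the case where a letter returns to itself while rules keep firing (Example 2), and it doubles as a confirmation of the asker's threshold-of-26 intuition:

import string

def full_pass(c, R):
    # Apply one full pass of R to a single letter; report whether any rule fired.
    fired = False
    for x, y in R:
        if c == x and y != x:
            c = y
            fired = True
    return c, fired

def letter_halts(c, R):
    # 26 letters and deterministic passes: if no pass is quiet within 26
    # iterations, the trajectory has revisited a letter and cycles forever.
    for _ in range(len(string.ascii_lowercase)):
        c, fired = full_pass(c, R)
        if not fired:
            return True
    return False

def halts(S, R):
    # Rules act on each character of S independently, so per-letter checks suffice.
    return all(letter_halts(c, R) for c in set(S))

print(halts("abcdef", [("a", "b"), ("b", "c")]))  # True  (Example 1)
print(halts("abcdef", [("a", "b"), ("b", "a")]))  # False (Example 2)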

Data Structure for Subsequence Queries

In a program I need to efficiently answer queries of the following form:
Given a set of strings A and a query string q return all s ∈ A such that q is a subsequence of s
For example, given A = {"abcdef", "aaaaaa", "ddca"} and q = "acd" exactly "abcdef" should be returned.
The following is what I have considered so far:
For each possible character, make a sorted list of all string/locations where it appears. For querying interleave the lists of the involved characters, and scan through it looking for matches within string boundaries.
This would probably be more efficient for words instead of characters, since the limited number of different characters will make the return lists very dense.
For each n-prefix q might have, store the list of all matching strings. n might realistically be close to 3. For query strings longer than that we brute force the initial list.
This might speed things up a bit, but one could easily imagine some n-subsequences being present close to all strings in A, which means worst case is the same as just brute forcing the entire set.
Do you know of any data structures, algorithms or preprocessing tricks which might be helpful for performing the above task efficiently for large A's? (My strings s will be around 100 characters.)
Update: Some people have suggested using LCS to check if q is a subsequence of s. I just want to point out that this can be done using a simple function such as:
def isSub(q, s):
    i, j = 0, 0
    while i != len(q) and j != len(s):
        if q[i] == s[j]:
            i += 1
            j += 1
        else:
            j += 1
    return i == len(q)
Update 2: I've been asked to give more details on the nature of q, A and its elements. While I'd prefer something that works as generally as possible, I assume A will have length around 10^6 and will need to support insertion. The elements s will be shorter with an average length of 64. The queries q will only be 1 to 20 characters and be used for a live search, so the query "ab" will be sent just before the query "abc". Again, I'd much prefer the solution to use the above as little as possible.
Update 3: It has occurred to me that a data structure with O(n^{1-epsilon}) lookups would allow you to solve OVP, and thereby disprove the SETH conjecture. That is probably the reason for our suffering. The only options, then, are to disprove the conjecture, use approximation, or take advantage of the dataset. I imagine quadlets and tries would do the last in different settings.
It could be done by building an automaton. You can start with an NFA (nondeterministic finite automaton, which is like a nondeterministic directed graph) which allows edges labeled with an epsilon character, meaning that during processing you can jump from one node to another without consuming any character. I'll illustrate with a reduced A. Let's say your A is:
A = {'ab', 'bc'}
If you build an NFA for the string ab, you should get something like this:
+--(1)--+
e | a| |e
(S)--+--(2)--+--(F)
| b| |
+--(3)--+
The above drawing is not the best-looking automaton, but there are a few points to consider:
S is the starting state and F is the accepting state.
If you can reach state F, your string qualifies as a subsequence.
The rule for propagating within an automaton is that you can consume e (epsilon) to jump forward; therefore you can be in more than one state at each point in time. This is called the e-closure.
Now, given b and starting at state S, I can jump one epsilon, reach 2, consume b and reach 3. Given the end of the string, I consume an epsilon and reach F; thus b qualifies as a subsequence of ab. So do a and ab, as you can verify yourself using the above automaton.
The good thing about NFAs is that they have one start state and one final state, so two NFAs can easily be connected using epsilons. There are various algorithms that can convert an NFA to a DFA. A DFA is a directed graph which follows a precise path given a character; in particular, it is always in exactly one state at any point in time. (For any NFA, there is a corresponding DFA whose states correspond to sets of states in the NFA.)
So, for A = {'ab', 'bc'}, we would need to build an NFA for ab, then an NFA for bc, then join the two NFAs and build the DFA of the entire big NFA.
EDIT
The NFA for subsequences of abc would be a?b?c?, so you can build your NFA from that pattern.
Now, consider adding the string acd. To query whether ab is a subsequence of a member of {'abc', 'acd'}, you can use the NFA (a?b?c?)|(a?c?d?). Once you have the NFA you can convert it to a DFA, where each state records whether the input is a subsequence of abc, of acd, or of both.
I used the link below to make the NFA graphic from a regular expression:
http://hackingoff.com/images/re2nfa/2013-08-04_21-56-03_-0700-nfa.svg
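As a quick illustration of the a?b?c? idea, Python's re engine can stand in for the hand-built NFA (a sketch of the per-string pattern, not of the answer's single combined DFA over all of A):

import re

def has_subsequence(s, q):
    # q is a subsequence of s iff q matches s's "every character optional"
    # pattern (for s = "abc" that is a?b?c?, exactly the NFA sketched above).
    pattern = "".join(re.escape(c) + "?" for c in s)
    return re.fullmatch(pattern, q) is not None

A = ["abcdef", "aaaaaa", "ddca"]
print([s for s in A if has_subsequence(s, "acd")])  # ['abcdef']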
EDIT 2
You're right! Consider the case where you have 10,000 unique characters in A. By unique I mean A is something like {'abc', 'def'}, i.e. the intersection of the elements of A is the empty set. Then your DFA would be worst case in terms of states, i.e. 2^10000. But I'm not sure when that would be possible, given that there can never be 10,000 unique characters. Even if you have 10,000 characters in A, there will still be repetitions, and that might reduce the number of states a lot, since e-closures might eventually merge. I cannot really estimate how much it might reduce. But even with 10 million states, you will only consume less than 10 MB worth of space to construct the DFA. You can even use the NFA and find e-closures at run time, but that would add to the run-time complexity. You can search different papers on how large regexes are converted to DFAs.
EDIT 3
For regex (a?b?c?)|(e?d?a?)|(a?b?m?)
If you convert the above NFA to a DFA, you get:
It actually has a lot fewer states than the NFA.
Reference:
http://hackingoff.com/compilers/regular-expression-to-nfa-dfa
EDIT 4
After fiddling with that website some more, I found that the worst case would be something like A = {'aaaa', 'bbbbb', 'cccc', ...}. But even in this case there are fewer DFA states than NFA states.
Tests
There have been four main proposals in this thread:
Shivam Kalra suggested creating an automaton based on all the strings in A. This approach has appeared in the literature, normally under the name "Directed Acyclic Subsequence Graph" (DASG).
J Random Hacker suggested extending my 'prefix list' idea to all 'n choose 3' triplets in the query string, and merging them all using a heap.
In the note "Efficient Subsequence Search in Databases" Rohit Jain, Mukesh K. Mohania and Sunil Prabhakar suggest using a Trie structure with some optimizations and recursively search the tree for the query. They also have a suggestion similar to the triplet idea.
Finally there is the 'naive' approach, which wanghq suggested optimizing by storing an index for each element of A.
To get a better idea of what's worth putting continued effort into, I have implemented the above four approaches in Python and benchmarked them on two sets of data. The implementations could all be made a couple of orders of magnitude faster with a well-done implementation in C or Java, and I haven't included the optimizations suggested for the 'trie' and 'naive' versions.
Test 1
A consists of random paths from my filesystem. q are 100 random [a-z] strings of average length 7. As the alphabet is large (and Python is slow) I was only able to use duplets for method 3.
Construction times in seconds as a function of A size: (plot omitted)
Query times in seconds as a function of A size: (plot omitted)
Test 2
A consists of randomly sampled [a-b] strings of length 20. q are 100 random [a-b] strings of average length 7. As the alphabet is small we can use quadlets for method 3.
Construction times in seconds as a function of A size: (plot omitted)
Query times in seconds as a function of A size: (plot omitted)
Conclusions
The double logarithmic plot is a bit hard to read, but from the data we can draw the following conclusions:
Automatons are very fast at querying (constant time), however they are impossible to create and store for |A| >= 256. It might be that a closer analysis could yield a better time/memory balance, or some tricks applicable to the remaining methods.
The dup-/trip-/quadlet method is about twice as fast as my trie implementation and four times as fast as the 'naive' implementation. I used only a linear amount of lists for the merge, instead of n^3 as suggested by j_random_hacker. It might be possible to tune the method better, but in general it was disappointing.
My trie implementation consistently does better than the naive approach by around a factor of two. By incorporating more preprocessing (like "where are the next 'c's in this subtree") or perhaps merging it with the triplet method, it seems like today's winner.
If you can live with an order of magnitude less performance, the naive method does comparatively fine for very little cost.
As you point out, it might be that all strings in A contain q as a subsequence, in which case you can't hope to do better than O(|A|). (That said, you might still be able to do better than the time taken to run LCS on (q, A[i]) for each string i in A, but I won't focus on that here.)
TTBOMK there are no magic, fast ways to answer this question (in the way that suffix trees are the magic, fast way to answer the corresponding question involving substrings instead of subsequences). Nevertheless if you expect the set of answers for most queries to be small on average then it's worth looking at ways to speed up these queries (the ones yielding small-size answers).
I suggest filtering based on a generalisation of your heuristic (2): if some database sequence A[i] contains q as a subsequence, then it must also contain every subsequence of q. (The reverse direction is not true unfortunately!) So for some small k, e.g. 3 as you suggest, you can preprocess by building an array of lists telling you, for every length-k string s, the list of database sequences containing s as a subsequence. I.e. c[s] will contain a list of the ID numbers of database sequences containing s as a subsequence. Keep each list in numeric order to enable fast intersections later.
Now the basic idea (which we'll improve in a moment) for each query q is: find all k-sized subsequences of q, look up each in the array of lists c[], and intersect these lists to find the set of sequences in A that might possibly contain q as a subsequence. Then for each candidate sequence A[i] in this (hopefully small) intersection, perform an O(n^2) LCS calculation with q to see whether it really does contain q. (A sketch of this pipeline follows the observations below.)
A few observations:
The intersection of 2 sorted lists of size m and n can be found in O(m+n) time. To find the intersection of r lists, perform r-1 pairwise intersections in any order. Since taking intersections can only produce sets that are smaller or of the same size, time can be saved by intersecting the smallest pair of lists first, then the next smallest pair (this will necessarily include the result of the first operation), and so on. In particular: sort lists in increasing size order, then always intersect the next list with the "current" intersection.
It is actually faster to find the intersection a different way, by adding the first element (sequence number) of each of the r lists into a heap data structure, then repeatedly pulling out the minimum value and replenishing the heap with the next value from the list that the most recent minimum came from. This will produce a list of sequence numbers in nondecreasing order; any value that appears fewer than r times in a row can be discarded, since it cannot be a member of all r sets.
If a k-string s has only a few sequences in c[s], then it is in some sense discriminating. For most datasets, not all k-strings will be equally discriminating, and this can be used to our advantage. After preprocessing, consider throwing away all lists having more than some fixed number (or some fixed fraction of the total) of sequences, for 3 reasons:
They take a lot of space to store
They take a lot of time to intersect during query processing
Intersecting them will usually not shrink the overall intersection much
It is not necessary to consider every k-subsequence of q. Although this will produce the smallest intersection, it involves merging (|q| choose k) lists, and it might well be possible to produce an intersection that is nearly as small using just a fraction of these k-subsequences. E.g. you could limit yourself to trying all (or a few) k-substrings of q. As a further filter, consider just those k-subsequences whose sequence lists in c[s] are below some value. (Note: if your threshold is the same for every query, you might as well delete all such lists from the database instead, since this will have the same effect, and saves space.)
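A minimal sketch of this pipeline in Python, with plain set intersection standing in for the sorted-list/heap merges described above (K and all names are mine):

from collections import defaultdict
from itertools import combinations

K = 3  # length of the indexed subsequences

def build_index(A):
    # c[s] -> sorted list of ids of strings in A containing s as a subsequence
    c = defaultdict(list)
    for i, t in enumerate(A):
        for s in {"".join(x) for x in combinations(t, K)}:
            c[s].append(i)
    return c

def is_sub(q, s):
    it = iter(s)
    return all(ch in it for ch in q)  # classic linear subsequence check

def query(q, A, c):
    if len(q) < K:
        return [t for t in A if is_sub(q, t)]   # brute-force short queries
    keys = {"".join(x) for x in combinations(q, K)}
    lists = sorted((c.get(s, []) for s in keys), key=len)
    candidates = set(lists[0])                  # smallest list first
    for lst in lists[1:]:
        candidates &= set(lst)
        if not candidates:
            break
    return [A[i] for i in sorted(candidates) if is_sub(q, A[i])]

A = ["abcdef", "aaaaaa", "ddca"]
print(query("acd", A, build_index(A)))  # ['abcdef']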
One thought:
if q tends to be short, maybe reducing A and q to sets of characters will help?
So for the example, that reduces to { (a,b,c,d,e,f), (a), (a,c,d) }. Looking up possible candidates for any q should be faster than the original problem (that's a guess actually, I'm not sure how exactly; maybe sort them and "group" similar ones in Bloom filters?), then use brute force to weed out false positives.
If the A strings are lengthy, you could make the characters unique based on their occurrence, so that would be {(a1,b1,c1,d1,e1,f1),(a1,a2,a3,a4,a5,a6),(a1,c1,d1,d2)}. This is fine, because if you search for "ddca" you only want to match the second d to a second d. The size of your alphabet would go up (bad for Bloom-filter or bitmap-style operations) and would be different every time you get new A's, but the number of false positives would go down.
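A quick sketch of that occurrence-numbering filter (the helper name is hypothetical):

def numbered(s):
    # 'ddca' -> {'d1', 'd2', 'c1', 'a1'}: tag each char with its occurrence count.
    seen, out = {}, set()
    for c in s:
        seen[c] = seen.get(c, 0) + 1
        out.add(f"{c}{seen[c]}")
    return out

A = ["abcdef", "aaaaaa", "ddca"]
q = "ddca"
# q can only be a subsequence of s if q's numbered set is a subset of s's;
# survivors still need the brute-force order check.
print([s for s in A if numbered(q) <= numbered(s)])  # ['ddca']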
First let me make sure my understanding/abstraction is correct. (Note: in this answer, A is the query string and B is a candidate string.) The following two requirements should be met:
if A is a subsequence of B, then all characters in A should appear in B;
for those characters in B, their positions should be in ascending order.
Note that a char in A might appear more than once in B.
To solve 1), a map/set can be used. The key is the character in string B, and the value doesn't matter.
To solve 2), we need to maintain the position of each character. Since a character might appear more than once, the positions should be a collection.
So the structure is like:
Map<Character, List<Integer>>
e.g.
abcdefab
a: [0, 6]
b: [1, 7]
c: [2]
d: [3]
e: [4]
f: [5]
Once we have the structure, how do we know if the characters are in the right order as they are in string A? If A is acd, we should check the a at position 0 (but not 6), the c at position 2 and the d at position 3.
The strategy here is to choose the position that's after and closest to the previously chosen position. TreeSet is a good candidate for this operation.
public E higher(E e)
Returns the least element in this set strictly greater than the given element, or null if there is no such element.
The runtime complexity is O(s * (n1 + n2) * log(m)), where:
s: number of strings in the set
n1: number of chars in string B
n2: number of chars in query string A
m: number of duplicates of a character in string B (e.g. there are 5 a's)
Below is the implementation with some test data.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class SubsequenceStr {

    public static void main(String[] args) {
        String[] testSet = new String[] {
            "abcdefgh", // right one
            "adcefgh",  // has all chars, but not the right order
            "bcdefh",   // missing one char
            "",         // empty
            "acdh",     // exact match
            "acd",
            "acdehacdeh"
        };
        List<String> subsequenceStrs = subsequenceStrs(testSet, "acdh");
        for (String str : subsequenceStrs) {
            System.out.println(str);
        }
        // duplicates in query
        subsequenceStrs = subsequenceStrs(testSet, "aa");
        for (String str : subsequenceStrs) {
            System.out.println(str);
        }
        subsequenceStrs = subsequenceStrs(testSet, "aaa");
        for (String str : subsequenceStrs) {
            System.out.println(str);
        }
    }

    public static List<String> subsequenceStrs(String[] strSet, String q) {
        System.out.println("find strings whose subsequence string is " + q);
        List<String> results = new ArrayList<String>();
        for (String str : strSet) {
            // index each character's positions in ascending order
            char[] chars = str.toCharArray();
            Map<Character, TreeSet<Integer>> charPositions =
                new HashMap<Character, TreeSet<Integer>>();
            for (int i = 0; i < chars.length; i++) {
                TreeSet<Integer> positions = charPositions.get(chars[i]);
                if (positions == null) {
                    positions = new TreeSet<Integer>();
                    charPositions.put(chars[i], positions);
                }
                positions.add(i);
            }
            // greedily pick, for each query char, the closest position
            // strictly after the previously chosen one
            char[] qChars = q.toCharArray();
            int lowestPosition = -1;
            boolean isSubsequence = false;
            for (int i = 0; i < qChars.length; i++) {
                TreeSet<Integer> positions = charPositions.get(qChars[i]);
                if (positions == null || positions.size() == 0) {
                    break;
                } else {
                    Integer position = positions.higher(lowestPosition);
                    if (position == null) {
                        break;
                    } else {
                        lowestPosition = position;
                        if (i == qChars.length - 1) {
                            isSubsequence = true;
                        }
                    }
                }
            }
            if (isSubsequence) {
                results.add(str);
            }
        }
        return results;
    }
}
Output:
find strings whose subsequence string is acdh
abcdefgh
acdh
acdehacdeh
find strings whose subsequence string is aa
acdehacdeh
find strings whose subsequence string is aaa
As always, I might be totally wrong :)
You might want to have a look at the book Algorithms on Strings, Trees, and Sequences by Dan Gusfield. As it turns out, part of it is available on the internet. You might also want to read Gusfield's Introduction to Suffix Trees. This book covers many approaches to your kind of question and is considered one of the standard publications in this field.
Get a fast longest common subsequence (LCS) algorithm implementation. Actually, it suffices to determine the length of the LCS. Notice that Gusfield's book has very good algorithms and also points to more sources for such algorithms.
Return all s ∈ A with length(LCS(s,q)) == length(q)
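For completeness, a sketch of that criterion in Python (the asker's isSub above is the simpler linear check; the LCS form is shown only to illustrate this answer):

def lcs_len(a, b):
    # Standard O(|a|*|b|) dynamic program, keeping one row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

A = ["abcdef", "aaaaaa", "ddca"]
q = "acd"
print([s for s in A if lcs_len(s, q) == len(q)])  # ['abcdef']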

Help finding paths

% link(Origin,Destination,Speed,Length).
link(paris,milano,140,360).
link(paris,london,200,698).
link(berlin,atena,110,714).
link(atena,paris,90,370).
I need to write this route predicate so I get a Path from city X to city Y.
route(Origin,Destination,TrainType,Path,Length,Duration).
I am new to Prolog, so I wrote something like this. I know it's not correct:
route(Origin,Destination,TrainType,Path,Length,Duration) :-
    link(Origin,City,Len),
    City \= Origin,
    NewLen is Length + Len,
    route(City,Destination,TrainType,Path,NewLen,Duration).
Your predicate lacks a base case, that tells it when to stop. Right now, your predicate will always call itself until it fails (or worse, loops indefinitely, depending on your link predicate). The following gives you a reverse path:
route(Goal, Goal, Path, Path). % base case
route(From, To, Path0, Path) :-
    direct_link(From, Via),
    route(Via, To, [From|Path0], Path).
where a direct link means you can get from A to B; I'm assuming railways are bidirectional:
direct_link(A,B) :- link(A, B, _Speed, _Length).
direct_link(B,A) :- link(B, A, _Speed, _Length).
You will need to call reverse on the path and add arguments to keep track of length and duration. I'll leave that as an exercise.
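For comparison, here is the same accumulator idea sketched in Python, with a cycle check added (which the Prolog sketch above omits); the adjacency map is hypothetical and mirrors the link/4 facts, built symmetrically since railways are assumed bidirectional:

def routes(links, origin, destination, path=None):
    # Yield every acyclic path from origin to destination,
    # accumulating the path walked so far.
    path = (path or []) + [origin]
    if origin == destination:
        yield path
        return
    for via in links.get(origin, ()):
        if via not in path:  # avoid revisiting a city
            yield from routes(links, via, destination, path)

links = {
    "paris":  {"milano", "london", "atena"},
    "milano": {"paris"},
    "london": {"paris"},
    "berlin": {"atena"},
    "atena":  {"berlin", "paris"},
}
print(next(routes(links, "berlin", "london")))  # ['berlin', 'atena', 'paris', 'london']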

Regular expression for strings with even number of a's and odd no of b's

I'm having a problem solving the following:
It's an assignment; I solved it, but my solution seems too long and vague. Can anybody help me, please?
Regular expression for the strings with an even number of a's and an odd number of b's, where the character set = {a,b}.
One way to do this is to pass it through two regular expressions making sure they both match (assuming you want to use regular expressions at all, see below for an alternative):
^b*(ab*ab*)*$
^a*ba*(ba*ba*)*$
Anything else (and, in fact, even that) is most likely just an attempt to be clever, one that's generally a massive failure.
The first regular expression ensures there is an even number of a's, with b's allowed anywhere in the mix (before, after and in between).
The second is similar but ensures that there is an odd number of b's, by virtue of the starting a*ba*.
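For instance, in Python the two patterns can be combined like this:

import re

even_a = re.compile(r"^b*(ab*ab*)*$")     # even number of a's
odd_b  = re.compile(r"^a*ba*(ba*ba*)*$")  # odd number of b's

def is_valid(s):
    # The language is the intersection of the two regular languages,
    # so require both patterns to match.
    return bool(even_a.match(s)) and bool(odd_b.match(s))

print([s for s in ["b", "aab", "ab", "ba", "abb"] if is_valid(s)])  # ['b', 'aab']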
A far better way to do it is to ignore regular expressions altogether and simply run through the string as follows:
def isValid(s):
    evenA = True
    oddB = False
    for c in s:
        if c == 'a':
            evenA = not evenA
        elif c == 'b':
            oddB = not oddB
        else:
            return False
    return evenA and oddB
Though regular expressions are a wonderful tool, they're not suited for everything and they become far less useful as their readability and maintainability degrades.
For what it's worth, a single-regex answer is:
(aa|bb|(ab|ba)(aa|bb)*(ba|ab))*(b|(ab|ba)(bb|aa)*a)
but, if I caught anyone on my team actually using a monstrosity like that, they'd be sent back to do it again.
This comes from a paper by one Greg Bacon. See here for the actual inner workings.
Even-Even = (aa+bb+(ab+ba)(aa+bb)*(ab+ba))*
(Even-Even has an even number of a's and an even number of b's)
Even a's and odd b's = Even-Even b Even-Even
This should work.
This regular expression accepts all strings with an even number of a's and an even number of b's:
r1 = ((ab+ba)(aa+bb)*(ab+ba)+(aa+bb))*
Now, to get the regular expression for an even number of a's and an odd number of b's:
r2 = (b+a(aa+bb)*(ab+ba))((ab+ba)(aa+bb)*(ab+ba)+(aa+bb))*
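r1 and r2 can be cross-checked mechanically against the parity definition for all short strings, e.g.:

import re
from itertools import product

r1 = "(?:(?:ab|ba)(?:aa|bb)*(?:ab|ba)|aa|bb)*"
r2 = re.compile("(?:b|a(?:aa|bb)*(?:ab|ba))" + r1 + r"\Z")

for n in range(11):
    for t in product("ab", repeat=n):
        s = "".join(t)
        expected = s.count("a") % 2 == 0 and s.count("b") % 2 == 1
        assert bool(r2.match(s)) == expected, s
print("r2 agrees with the parity definition for all strings up to length 10")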
(bb)*a(aa)*ab(bb)*
ab(bb)*a(aa)*
b(aa)*(bb)*
...
There can be many such regular expressions. Do you have any other condition, like "starting with a" or something of the kind (other than odd b's and even a's)?
For an even number of a's and b's, we have the regex:
E = { (ab+ba)(aa+bb)*(ab+ba) }*
For an even number of a's and an odd number of b's, all we need to do is add an extra b to the above expression E.
The required regex will be:
E = { ((ab+ba)(aa+bb)*(ab+ba))* b ((ab+ba)(aa+bb)*(ab+ba))* }
I would do as follows:
regex even matches the symbol a, then a sequence of b's, then the symbol a again, then another sequence of b's, such that there is an even number of b's:
even -> (a (bb)* a (bb)* | a b (bb)* a b (bb)*)
regex odd does the same with an odd total number of b's:
odd -> (a b (bb)* a (bb)* | a (bb)* a b (bb)*)
A string of even number of a's and odd number of b's either:
starts with an odd number of b's, and is followed by an even number of odd patterns amongst even patterns;
or starts with an even number of b's, and is followed by an odd number of odd patterns amongst even patterns.
Note that even has no effect on the evenness/oddness of the a's/b's in the string.
regex ->
(
b (bb)* even* (odd even* odd)* even*
|
(bb)* even* odd even* (odd even* odd)* even*
)
Of course one can replace every occurrence of even and odd in the final regex to get a single regex.
It is easy to see that a string satisfying this regex will indeed have an even number of a's (as symbol a occurs only in even and odd subregexes, and these each use exactly two a's) and an odd number of b's (first case : 1 b + even number of b's + even number of odd; second case : even number of b's + odd number of odd).
A string with an even number of a's and an odd number of b's will satisfy this regex as it starts with zero or more b's, then is followed by [one a, zero or more b's, one more a and zero or more b's], zero or more times.
A high-level piece of advice: construct a deterministic finite automaton for the language (very easy: encode the parity of the number of a's and b's in the states, with q0 encoding an even number of a's and an even number of b's, and transition accordingly), and then convert the DFA into a regular expression (either by using well-known algorithms for this or "from scratch").
The idea here is to exploit the well-understood equivalence between the DFA (an algorithmic description of regular languages) and the regular expressions (an algebraic description of regular languages).
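A minimal sketch of that DFA (four states, encoded here as a pair of parities):

def accepts(s):
    # State = (parity of a's, parity of b's); start state (0, 0) is q0.
    # Accepting state: even a's, odd b's = (0, 1).
    state = (0, 0)
    for ch in s:
        if ch == 'a':
            state = (state[0] ^ 1, state[1])
        elif ch == 'b':
            state = (state[0], state[1] ^ 1)
        else:
            return False  # outside the {a, b} alphabet
    return state == (0, 1)

print([s for s in ["b", "aab", "ab", "bbb", "abab"] if accepts(s)])  # ['b', 'aab', 'bbb']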
The regular expression is given below:
(aa|bb)*((ab|ba)(aa|bb)*(ab|ba)(aa|bb)*b)*
The structured way to do it is to make one transition diagram and build the regular expression from it.
The regex in this case will be
(a((b(aa)*b)*a+b(aa)*ab)+b((a(bb)*a)*b+a(bb)*ba))b(a(bb)*a)*
It looks complicated but it covers all possible cases that may arise.
the answer is (aa+ab+ba+bb)* b (aa+ab+ba+bb)*
(bb)* b (aa)* + (aa)* b (bb)*
This is the answer which handles all kinds of strings with odd b's and even a's.
If it is an even number of a's followed by an odd number of b's,
(aa)*b(bb)* should work.
If it is in any order,
(aa)*b(bb)* + b(bb)*(aa)* should work.
All strings that have an even number of a's and an odd number of b's:
(((aa+bb)*b(aa+bb)*) + (A + ((a+b)b(a+b))*))*
Here A is the null string; A can be neglected.
If there is any error, please point it out.
