In SML, how can I count the number of appearances of chars in a string using recursion?
Output should be in the form of (char, #AppearancesOfChar).
What I managed to do is
fun frequency(x) = if x = [] then [] else [(hd x, 1)] @ frequency(tl x)
which will return tuples of the form (char, 1). I can also eliminate duplicates in this list; what I fail to do now is to write a function like
fun count(s: string, l: (char * int) list)
which 'iterates' through the string, incrementing the matching tuple component. How can I do this recursively? Sorry for the noob question, I am new to functional programming, but I hope the question is at least understandable :)
I'd break the problem into two: Increasing the frequency of a single character, and iterating over the characters in a string and inserting each of them. Increasing the frequency depends on whether you have already seen the character before.
fun increaseFrequency (c, []) = [(c, 1)]
  | increaseFrequency (c, ((c1, count)::freqs)) =
      if c = c1
      then (c1, count+1)::freqs
      else (c1, count)::increaseFrequency (c, freqs)
This provides a function with the following type declaration:
val increaseFrequency = fn : ''a * (''a * int) list -> (''a * int) list
So given a character and a list of frequencies, it returns an updated list of frequencies where either the character has been inserted with frequency 1, or its existing frequency has been increased by 1, by performing a linear search through each tuple until either the right one is found or the end of the list is met. All other character frequencies are preserved.
The simplest way to iterate over the characters in a string is to explode it into a list of characters and insert each character into an accumulating list of frequencies that starts with the empty list:
fun frequencies s =
  let fun freq [] freqs = freqs
        | freq (c::cs) freqs = freq cs (increaseFrequency (c, freqs))
  in freq (explode s) [] end
But this isn't a very efficient way to iterate a string one character at a time. Alternatively, you can visit each character by indexing without converting to a list:
fun foldrs f e s =
  let val len = size s
      fun loop i e' =
        if i = len
        then e'
        else loop (i+1) (f (String.sub (s, i), e'))
  in loop 0 e end
fun frequencies s = foldrs increaseFrequency [] s
You might also consider using a more efficient representation of sets than lists to reduce the linear-time insertions.
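For instance, the list of pairs could be swapped for a dictionary keyed by character, so each update is a hash lookup instead of a linear scan. A minimal sketch of that idea, written in Python rather than SML purely for illustration (the function name char_frequencies is my own):
from collections import OrderedDict  # plain dict also keeps insertion order on Python 3.7+

def char_frequencies(s):
    # Dictionary keyed by character: each insertion/update is a
    # constant-time-ish lookup instead of a walk over a list of pairs.
    freqs = OrderedDict()
    for c in s:
        freqs[c] = freqs.get(c, 0) + 1
    return list(freqs.items())

# char_frequencies("hello")  -> [('h', 1), ('e', 1), ('l', 2), ('o', 1)]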
I have some Scala code that computes similarity between a set of strings and gives all the unique strings.
val filtered = z.reverse.foldLeft((List.empty[String], z.reverse)) {
  case ((acc, zt), zz) =>
    (if (zt.tail.exists(tt => similarity(tt, zz) < threshold)) acc
     else zz :: acc,
     zt.tail)
}._1
I'll try to explain what is going on here:
This uses a fold over the reversed input data, starting from an empty list (to accumulate results) and the (reverse of the) remaining input data (to compare against; I labeled it zt for "z-tail").
The fold then cycles through the data, checking each entry against the tail of the remaining data (so it doesn't get compared to itself or to any earlier entry).
If there is a match, just the existing accumulator (labelled acc) is allowed through; otherwise, the current entry (zz) is added to the accumulator. This updated accumulator is paired with the tail of the "remaining" Strings (zt.tail), to ensure a shrinking set to compare against.
Finally, we end up with a pair of lists: the required remaining Strings, and an empty list (no Strings left to compare against), so we take the first of these as our result.
The problem is that in the first iteration, if the 1st, 4th and 8th strings are similar, I get only the 1st string. Instead, I should get a set of (1st, 4th, 8th); then if the 2nd, 5th, 14th and 21st strings are similar, I should get a set of (2nd, 5th, 14th, 21st).
If I understand you correctly - you want the result to be of type List[List[String]] and not the List[String] you are getting now - where each item is a list of similar Strings (right?).
If so - I can't see a trivial change to your implementation that would achieve this, as the similar values are lost (when you take the true branch of the if and just return acc, you skip an item and you'll never "see" it again).
Two possible solutions I can think of:
Based on your idea, but using a 3-tuple of the form (acc, zt, scanned) as the foldLeft result type, where the added scanned is the list of already-scanned items. This way we can refer back to them when we find an element that doesn't have preceding similar elements:
val filtered = z.reverse.foldLeft((List.empty[List[String]], z.reverse, List.empty[String])) {
  case ((acc, zt, scanned), zz) =>
    val hasSimilarPreceeding = zt.tail.exists { tt => similarity(tt, zz) < threshold }
    val similarFollowing = scanned.collect { case tt if similarity(tt, zz) < threshold => tt }
    (if (hasSimilarPreceeding) acc else (zz :: similarFollowing) :: acc, zt.tail, zz :: scanned)
}._1
A probably-slower but much simpler solution would be to just group each string by its set of similar strings:
val alternative = z.groupBy(s => z.collect {
  case other if similarity(s, other) < threshold => other
}.toSet).values.toList
All of this assumes that the function:
f(a: String, b: String): Boolean = similarity(a, b) < threshold
is commutative and transitive, i.e.:
f(a, b) && f(a, c) means that f(b, c)
f(a, b) if and only if f(b, a)
To test both implementations I used:
// strings are similar if they start with the same character
def similarity(s1: String, s2: String) = if (s1.head == s2.head) 0 else 100
val threshold = 1
val z = List("aa", "ab", "c", "a", "e", "fa", "fb")
And both options produce the same results:
List(List(aa, ab, a), List(c), List(e), List(fa, fb))
I encountered this problem in an interview and I was stuck on the best way to go about it. The question is as follows:
Given a string sequence of words and a string sequence pattern, return true if the sequence of words matches the pattern otherwise false.
Definition of match: A word that is substituted for a variable must always follow that substitution. For example, if "f" is substituted as "monkey" then any time we see another "f" then it must match "monkey" and any time we see "monkey" again it must match "f".
Examples
input: "ant dog cat dog", "a d c d"
output: true
This is true because every variable maps to exactly one word and vice versa.
a -> ant
d -> dog
c -> cat
d -> dog
input: "ant dog cat dog", "a d c e"
output: false
This is false because if we substitute "d" as "dog" then you can not also have "e" be substituted as "dog".
a -> ant
d, e -> dog (d and e can't both map to dog, so false)
c -> cat
input: "monkey dog eel eel", "e f c c"
output: true
This is true because every variable maps to exactly one word and vice versa.
e -> monkey
f -> dog
c -> eel
Initially, I thought of doing something as follows...
function matchPattern(pattern, stringToMatch) {
  var patternBits = pattern.split(" ");
  var stringBits = stringToMatch.split(" ");
  var dict = {};
  if (patternBits.length < 0
      || patternBits.length !== stringBits.length) {
    return false;
  }
  for (var i = 0; i < patternBits.length; i++) {
    if (dict.hasOwnProperty(patternBits[i])) {
      if (dict[patternBits[i]] !== stringBits[i]) {
        return false;
      }
    } else {
      dict[patternBits[i]] = stringBits[i];
    }
  }
  return true;
}
var ifMatches = matchPattern("a e c d", "ant dog cat dog");
console.log("Pattern: " + (ifMatches ? "matches!" : "does not match!"));
However, I realized that this won't work and fails example #2, as it erroneously returns true. One way to deal with this issue is to use a bi-directional dictionary or two dictionaries, i.e. store both {"a": "ant"} and
{"ant": "a"} and check both scenarios in the if check. However, that seemed like wasted space. Is there a better way to tackle this problem without using regular expressions?
I think a simple choice that is quadratic in the length of the list of words is to verify that every pairing of list indices has the same equality characteristics in the two lists. I'll assume that you get the "words" and "pattern" as lists already and don't need to parse out spaces and whatever -- that ought to be a separate function's responsibility anyway.
function matchesPatternReference(words, pattern) {
  if(words.length !== pattern.length) return false;
  for(var i = 0; i < words.length; i++)
    for(var j = i+1; j < words.length; j++)
      if((words[i] === words[j]) !== (pattern[i] === pattern[j]))
        return false;
  return true;
}
A slightly better approach would be to normalize both lists, then compare the normalized lists for equality. To normalize a list, replace each list element by the number of unique list elements that appear before its first occurrence in the list. This will be linear in the length of the longer list, assuming you believe hash lookups and list appends take constant time. I don't know enough Javascript to know if these are warranted; certainly at worst the idea behind this algorithm can be implemented with suitable data structures in n*log(n) time even without believing that hash lookups are constant time (a somewhat questionable assumption no matter the language).
function normalize(words) {
  var next_id = 0;
  var ids = {};
  var result = [];
  for(var i = 0; i < words.length; i++) {
    if(!ids.hasOwnProperty(words[i])) {
      ids[words[i]] = next_id;
      next_id += 1;
    }
    result.push(ids[words[i]]);
  }
  return result;
}

function matchesPatternFast(words, pattern) {
  return normalize(words) === normalize(pattern);
}
Note: As pointed out in the comments, one should check deep equality of the normalized arrays manually, since === on arrays does an identity comparison in Javascript and does not compare elementwise. See also How to check if two arrays are equal with Javascript?.
Addendum: Below I argue that matchesPatternFast and matchesPatternReference compute the same function -- but use the faulty assumption that === on arrays compares elements pointwise rather than being a pointer comparison.
We can define the following function:
function matchesPatternSlow(words, pattern) {
return matchesPatternReference(normalize(words), normalize(pattern));
}
I observe that normalize(x).length === x.length and normalize(x)[i] === normalize(x)[j] if and only if x[i] === x[j]; therefore matchesPatternSlow computes the same function as matchesPatternReference.
I will now argue that matchesPatternSlow(x,y) === matchesPatternFast(x,y). Certainly if normalize(x) === normalize(y) then we will have this property: matchesPatternFast will manifestly return true. On the other hand, matchesPatternSlow operates by making a number of queries on its two inputs and verifying that these queries always return the same results for both lists: outside the loop, the query is function(x) { return x.length }, and inside the loop, the query is function(x, i, j) { return x[i] === x[j]; }. Since equal objects respond identically to any query, all queries on the two normalized lists will align, so matchesPatternSlow will also return true.
What if normalize(x) !== normalize(y)? Then matchesPatternFast will manifestly return false. But if they are not equal, then either their lengths do not match -- in which case matchesPatternSlow will also return false from the first check in matchesPatternReference as we hoped -- or else the elements at some index are not equal. Suppose the smallest mismatching index is i. It is a property of normalize that the element at index i will either be equal to an element at index j<i or else it will be one larger than the maximal element from indices 0 through i-1. So we now have four cases to consider:
We have j1<i and j2<i for which normalize(x)[j1] === normalize(x)[i] and normalize(y)[j2] === normalize(y)[i]. But since normalize(x)[i] !== normalize(y)[i] we then know that normalize(x)[j1] !== normalize(y)[i]; and because i is the smallest mismatching index, normalize(y)[j1] === normalize(x)[j1], so normalize(y)[j1] !== normalize(y)[i]. So when matchesPatternReference chooses the indices j1 and i, we will find that normalize(x)[j1] === normalize(x)[i] is true and normalize(y)[j1] === normalize(y)[i] is false, and it immediately returns false as we are trying to show.
We have j<i for which normalize(x)[j] === normalize(x)[i] and normalize(y)[i] is not equal to any previous element of normalize(y). Then matchesPatternReference will return false when it chooses the indices j and i, since normalize(x) matches on these indices but normalize(y) doesn't.
We have j<i for which normalize(y)[j] === normalize(y)[i] and normalize(x)[i] is not equal to any previous element of normalize(x). Basically the same as in the previous case.
We have that normalize(x)[i] is one larger than the largest earlier element in normalize(x) and normalize(y)[i] is one larger than the largest earlier element in normalize(y). But since normalize(x) and normalize(y) agree on all previous elements, this means normalize(x)[i] === normalize(y)[i], a contradiction to our assumption that the normalized lists differ at this index.
So in all cases, matchesPatternFast and matchesPatternSlow agree -- hence matchesPatternFast and matchesPatternReference compute the same function.
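Since the whole argument hinges on elementwise comparison of the normalized lists, here is the same normalization idea transliterated into Python, where == on lists is already an elementwise comparison; this is only an illustration of the approach above, not part of the original answer:
def normalize(words):
    # Replace each word by the number of distinct words seen before
    # its first occurrence, e.g. ["ant", "dog", "cat", "dog"] -> [0, 1, 2, 1].
    ids = {}
    result = []
    for w in words:
        if w not in ids:
            ids[w] = len(ids)
        result.append(ids[w])
    return result

def matches_pattern(words, pattern):
    # Lists compare elementwise in Python, so no deep-equality helper is needed.
    return normalize(words) == normalize(pattern)

# matches_pattern("ant dog cat dog".split(), "a d c d".split())  -> True
# matches_pattern("ant dog cat dog".split(), "a d c e".split())  -> False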
For this special case, I assume the pattern refers to matching first character. If so, you can simply zip and compare.
# python2.7
words = "ant dog cat dog"
letters = "a d c d"
letters2 = "a d c e"
def match(ws, ls):
    ws = ws.split()
    ls = ls.split()
    return all(w[0] == l for w, l in zip(ws + [[0]], ls + [0]))
print match(words, letters)
print match(words, letters2)
The funny [[0]] and [0] at the end are there to ensure that the pattern and the words have the same length.
Suppose we are given a string S, and a list of some other strings L.
How can we know if S is one of the possible concatenations of L?
For example:
S = "abcdabce"
L = ["abcd", "a", "bc", "e"]
S is "abcd" + "a" + "bc" + "e", then S is a concatenation of L, whereas "ababcecd" is not.
In order to solve this question, I tried to use DFS/backtracking. The pseudo code is as follows:
boolean isConcatenation(S, L) {
  if (L.length == 1 && S == L[0]) return true;
  for (String s : L) {
    if (S.startsWith(s)) {
      markAsVisited(s);
      if (isConcatenation(S.exclude(s), L.exclude(s)))
        return true;
      markAsUnvisited(s);
    }
  }
  return false;
}
However, DFS/backtracking is not an efficient solution. I am curious what the fastest algorithm for this problem is, or whether there is another way to solve it faster. I hope there is an algorithm like KMP that can solve it in O(n) time.
In python:
>>> import itertools
>>> yes = 'abcdabce'
>>> no = 'ababcecd'
>>> L = ['abcd','a','bc','e']
>>> yes in [''.join(p) for p in itertools.permutations(L)]
True
>>> no in [''.join(p) for p in itertools.permutations(L)]
False
edit: as pointed out, this is O(n!), so it is inappropriate for large L. But hey, development time under 10 seconds.
You can instead build your own permutation generator, starting with the basic permutator:
def all_perms(elements):
    if len(elements) <= 1:
        yield elements
    else:
        for perm in all_perms(elements[1:]):
            for i in range(len(elements)):
                yield perm[:i] + elements[0:1] + perm[i:]
And then discard branches that you don't care about by tracking what the concatenation of the elements would be and only iterating if it adds up to your target string.
def all_perms(elements, conc=''):
    ...
    for perm in all_perms(elements[1:], conc + elements[0]):
        ...
        if target.startswith(''.join(conc)):
            ...
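One way to flesh out that pruning idea is sketched below; the name prefix_perms, passing target explicitly, and the exact place where the prefix test happens are my own choices, so treat it as one possible reading of the hint above rather than the definitive version:
def prefix_perms(elements, target, prefix=''):
    # Yield orderings of `elements` whose concatenation equals `target`,
    # pruning any branch whose partial concatenation is not a prefix of it.
    if not target.startswith(prefix):
        return
    if not elements:
        if prefix == target:
            yield []
        return
    for i, e in enumerate(elements):
        for rest in prefix_perms(elements[:i] + elements[i+1:], target, prefix + e):
            yield [e] + rest

# list(prefix_perms(['abcd', 'a', 'bc', 'e'], 'abcdabce'))  -> [['abcd', 'a', 'bc', 'e']]
# list(prefix_perms(['abcd', 'a', 'bc', 'e'], 'ababcecd'))  -> []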
A dynamic programming approach would be to work left to right, building up an array A[x] where A[x] is true if the first x characters of the string form one of the possible concatenations of L. You can work out A[n] from the earlier entries by checking each possible string in the list: if the characters of S up to the nth character end with a candidate string of length k, and A[n-k] is true, then you can set A[n] true.
I note that you can use https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm to find the matches you need as input to the dynamic program - the matching costs will be linear in the size of the input string, the total size of all candidate strings, and the number of matches between the input string and candidate strings.
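A rough Python sketch of that dynamic program, using plain substring comparison rather than Aho-Corasick; note that as written it lets a candidate string be reused any number of times, so requiring each element of L to be used exactly once would need extra bookkeeping:
def can_build(S, L):
    # A[x] is True when the first x characters of S can be formed by
    # concatenating strings from L (candidates may repeat in this sketch).
    A = [False] * (len(S) + 1)
    A[0] = True
    for n in range(1, len(S) + 1):
        for cand in L:
            k = len(cand)
            if k <= n and A[n - k] and S[n - k:n] == cand:
                A[n] = True
                break
    return A[len(S)]

# can_build("abcdabce", ["abcd", "a", "bc", "e"])  -> True
# can_build("ababcecd", ["abcd", "a", "bc", "e"])  -> False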
I would try the following:
find all positions of the patterns L_i in S
let n = length(S) + 1
create a graph with n nodes
for every match of a pattern L_i at position p, add a directed edge node_p --> node_{p+length(L_i)}
to enforce the permutation constraint you have to add some more nodes/edges to exclude multiple usage of the same pattern
Now I can ask a new question: does there exist a directed path from node 0 to node length(S)?
Notes:
if there exists a node (0 < i < n) with degree < 2 then no match is possible
all nodes with in-degree 1 and out-degree 1 (d- = 1, d+ = 1) are part of the permutation
use breadth-first search or Dijkstra to look for the solution
You can use the Trie data structure. First, construct a trie from strings in L.
Then, for the input string S, search for S in the trie.
During the search, for every visited node which is the end of one of the words in L, start a new search on the trie (from the root) with the remaining (yet unmatched) suffix of S. So we are using recursion. If you consume all characters of S in that process, then you know that S is a concatenation of some strings from L.
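A rough Python sketch of that search, using nested dictionaries as the trie; as with the dynamic-programming answer above, this version allows strings from L to be reused, so extra bookkeeping would be needed if each one may only be used once:
def build_trie(words):
    # Nested dicts; the key '$' marks the end of a word from L.
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def trie_concat(S, words):
    trie = build_trie(words)
    def search(i):
        # Walk the trie along S starting at position i; whenever a word
        # from L ends, recurse from the root on the remaining suffix.
        if i == len(S):
            return True
        node = trie
        for j in range(i, len(S)):
            if S[j] not in node:
                return False
            node = node[S[j]]
            if '$' in node and search(j + 1):
                return True
        return False
    return search(0)

# trie_concat("abcdabce", ["abcd", "a", "bc", "e"])  -> True
# trie_concat("ababcecd", ["abcd", "a", "bc", "e"])  -> False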
I would suggest this solution:
Take an array of size 256 which will store the occurrence count of each character over all strings of L. Now try to match that with the count of each character of S. If they are unequal then we can confidently say that the strings of L cannot form S.
If the counts are the same, do the following: using the KMP algorithm, try to find each string of L in S simultaneously. If at any time there is a match, we remove that string from L and continue the search for the other strings in L. If at any time we don't find a match, we just report that S cannot be represented this way. If at the end L is empty, we conclude that S indeed is a concatenation of the strings in L.
Assuming that L is a set of unique strings.
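A sketch of just the character-count pre-check, using Python's collections.Counter in place of the 256-entry array; it is a necessary condition only, as the second example below shows:
from collections import Counter

def counts_match(S, L):
    # S and the strings of L must use exactly the same multiset of characters.
    return Counter(S) == Counter(''.join(L))

# counts_match("abcdabce", ["abcd", "a", "bc", "e"])  -> True
# counts_match("ababcecd", ["abcd", "a", "bc", "e"])  -> True, yet "ababcecd" is not a concatenation of L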
Two Haskell propositions:
There may be some counterexamples to this... just for fun... sort L by a custom comparison:
import Data.List (sortBy,isInfixOf)
h s l = (concat . sortBy weird $ l) == s where
  weird a b | isInfixOf (a ++ b) s = LT
            | isInfixOf (b ++ a) s = GT
            | otherwise = EQ
More boring...attempt to build S from L:
import Data.List (delete, isPrefixOf)

f s l = g s l [] where
  g str subs result
    | concat result == s = [result]
    | otherwise =
        if null str || null subs'
        then []
        else do sub <- subs'
                g (drop (length sub) str) (delete sub subs) (result ++ [sub])
    where subs' = filter (flip isPrefixOf str) subs
Output:
*Main> f "abcdabce" ["abcd", "a", "bc", "e", "abc"]
[["abcd","a","bc","e"],["abcd","abc","e"]]
*Main> h "abcdabce" ["abcd", "a", "bc", "e", "abc"]
False
*Main> h "abcdabce" ["abcd", "a", "bc", "e"]
True
Your algorithm has complexity N^2 (N is the length of the list). Let's see it in actual C++:
#include <string>
#include <vector>
#include <algorithm>
#include <iostream>
using namespace std;

typedef pair<string::const_iterator, string::const_iterator> stringp;
typedef vector<string> strings;

bool isConcatenation(stringp S, const strings& L) {
    for (strings::const_iterator p = L.begin(); p != L.end(); ++p) {
        // a candidate longer than the rest of S cannot match
        if (p->size() > (size_t)(S.second - S.first))
            continue;
        auto M = mismatch(p->begin(), p->end(), S.first);
        if (M.first == p->end()) {
            if (L.size() == 1)
                return M.second == S.second;  // the last piece must consume all of S
            strings T;
            T.insert(T.end(), L.begin(), p);
            strings::const_iterator v = p;
            T.insert(T.end(), ++v, L.end());
            if (isConcatenation(make_pair(M.second, S.second), T))
                return true;
        }
    }
    return false;
}
Instead of looping over the entire vector, we could sort it, then reduce the search to O(log N) steps in the optimum case, where all strings start with different characters. The worst case will remain O(N^2).
Given two finite sequences of strings, A and B, of length n each,
for example:
A1: "kk", A2: "ka", A3: "kkk", A4: "a"
B1: "ka", B2: "kakk", B3: "ak", B4: "k"
Give a finite sequence of indices so that the concatenation for A
and B gives the same string. Repetitions are allowed.
In this example I can't find a solution, but, for example, if the list (1,2,2,4) were a solution, then A1 + A2 + A2 + A4 = B1 + B2 + B2 + B4. In this example there are only two characters, but it's already very difficult. Actually it's not even trivial to find the shortest solution with one character!
I tried to think of some constraints: for example, the total sum of the lengths of the strings must be equal, and for the first and last strings we need corresponding characters. But nothing else. I suppose for some sets of strings it's simply impossible. Can anyone think of a good algorithm?
EDIT: Apparently, this is the Post Correspondence Problem
There is no algorithm that can decide whether such an instance has a solution or not. If there were, the halting problem could be solved. Dirty trick...
Very tough question, but I'll give it a shot. This is more of a stream of consciousness than an answer, apologies in advance.
If I understand this correctly, you're given 2 equal sized sequences of strings, A and B, indexed from 1..n, say. You then have to find a sequence of indices such that the concatenation of strings A(1)..A(m) equals the concatenation of strings B(1)..B(m) where m is the length of the sequence of indices.
The first thing I would observe is that there could be an infinite number of solutions. For example, given:
A { "x", "xx" }
B { "xx", "x" }
Possible solutions are:
{ 1, 2 }
{ 2, 1 }
{ 1, 2, 1, 2 }
{ 1, 2, 2, 1 }
{ 2, 1, 1, 2 }
{ 2, 1, 2, 1 }
{ 1, 2, 1, 2, 1, 2}
...
So how would you know when to stop? As soon as you had one solution? As soon as one of the solutions is a superset of another solution?
One place you could start would be by taking all the strings of minimum common length from both sets (in my example above, you would take the "x" from both) and searching for two equal strings that share a common index. You can then repeat this for strings of the next size up. For example, if the first set has 3 strings of length 1, 2 and 3 respectively, and the second set has strings of length 1, 3 and 3 respectively, you would take the strings of length 3. You would do this until you have no more strings. If you find any, then you have a solution to the problem.
It then gets harder when you have to start combining several strings as in my example above. The naive, brute force approach would be to start permuting all strings from both sets that, when concatenated, result in strings of the same length, then compare them. So in the below example:
A { "ga", "bag", "ac", "a" }
B { "ba", "g", "ag", "gac" }
You would start with sequences of length 2:
A { "ga", "ac" }, B { "ba", "ag" } (indices 1, 3)
A { "bag", "a" }, B { "g", "gac" } (indices 2, 4)
Comparing these gives "gaac" vs "baag" and "baga" vs "ggac", neither of which are equal, so there are no solutions there. Next, we would go for sequences of length 3:
A { "ga", "bag", "a" }, B { "ba", "g", "gac" } (indices 1, 2, 4)
A { "bag", "ac", "a" }, B { "g", "ag", "gac" } (indices 2, 3, 4)
Again, no solutions, so then we end up with sequences of size 4, of which we have no solutions.
Now it gets even trickier, as we have to start thinking about perhaps repeating some indices, and now my brain is melting.
I'm thinking looking for common subsequences in the strings might be helpful, and then using the remaining parts in the strings that were not matched. But I don't quite know how.
A very simple way is to just use something like a breadth-first search. This also has the advantage that the first solution found will have minimal size.
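A sketch of that idea in Python, where the search state is the current "overhang" of one concatenation over the other; since the problem is undecidable (as noted above), a bound on the sequence length is assumed so the search always terminates:
from collections import deque

def pcp_bfs(A, B, max_len=10):
    # Breadth-first search over index sequences; the first solution
    # found is therefore one of minimal length.
    queue = deque([([], '', '')])   # (sequence so far, surplus of A, surplus of B)
    seen = set()
    while queue:
        seq, sa, sb = queue.popleft()
        if len(seq) >= max_len:
            continue
        for i in range(len(A)):
            a, b = sa + A[i], sb + B[i]
            # One concatenation must remain a prefix of the other.
            if not (a.startswith(b) or b.startswith(a)):
                continue
            na, nb = a[len(b):], b[len(a):]
            nseq = seq + [i + 1]        # 1-based indices, as in the question
            if na == '' and nb == '':
                return nseq             # both concatenations are now equal
            if (na, nb) not in seen:
                seen.add((na, nb))
                queue.append((nseq, na, nb))
    return None                         # nothing found within the bound

# pcp_bfs(["x", "xx"], ["xx", "x"])  -> [1, 2]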
It is not clear what the 'solution' you are looking for is: the longest? the shortest? all solutions?
Since you allow repetition there will be an infinite number of solutions for some inputs, so I will work on:
Find all sequences under a fixed length.
Written as pseudocode, but in a manner very similar to F# sequence expressions:
// assumed true/false functions
let Eq aList bList =
    // eg Eq "ab"::"c" "a"::"bc" -> true
    // Eq {} {} is _false_
let EitherStartsWith aList bList =
    // eg "ab"::"c" "a"::"b" -> true
    // eg "a" "ab" -> true
    // {} {} is _true_

let rec FindMatches A B aList bList level = seq {
    if level > 0
        if Eq aList bList
            yield aList
        else if EitherStartsWith aList bList
            Seq.zip3 A B (seq {1..})
            |> Seq.iter (fun (a, b, i) ->
                yield! FindMatches A B (aList::(a,i)) (bList::(b,i)) (level - 1)) }

let solution (A: seq<string>) (B: seq<string>) length =
    FindMatches A B {} {} length
Some trivial constraints to reduce the problem:
The first selection pair must have a common start section.
The final selection pair must have a common end section.
Based on this we can quickly eliminate many inputs with no solution:
let solution (A: seq<string>) (B: seq<string>) length =
    let starts = {}
    let ends = {}
    Seq.zip3 A B (seq {1..})
    |> Seq.iter (fun (a, b, i) ->
        if (a.StartsWith(b) or b.StartsWith(a))
            starts = starts :: (a, b, i)
        if (a.EndsWith(b) or b.EndsWith(a))
            ends = ends :: (a, b, i))
    if List.is_empty starts || List.is_empty ends
        Seq.empty // no solution
    else
        Seq.map (fun (a, b, i) ->
            FindMatches A B ({} :: (a, i)) ({} :: (b, i)) (length - 1))
            starts
        |> Seq.concat
Here's a suggestion for a brute force search. First generate number sequences bounded to the length of your list:
[0,0,..]
[1,0,..]
[2,0,..]
[3,0,..]
[0,1,..]
...
The number sequence length determines how many strings are going to be in any solution found.
Then generate A and B strings by using the numbers as indexes into your string lists:
public class FitSequence
{
    private readonly string[] a;
    private readonly string[] b;

    public FitSequence(string[] a, string[] b)
    {
        this.a = a;
        this.b = b;
    }

    private static string BuildString(string[] source, int[] indexes)
    {
        var s = new StringBuilder();
        for (int i = 0; i < indexes.Length; ++i)
        {
            s.Append(source[indexes[i]]);
        }
        return s.ToString();
    }

    public IEnumerable<int[]> GetSequences(int length)
    {
        foreach (var numberSequence in new NumberSequence(length).GetNumbers(a.Length - 1))
        {
            string a1 = BuildString(a, numberSequence);
            string b1 = BuildString(b, numberSequence);
            if (a1 == b1)
                yield return numberSequence;
        }
    }
}
This algorithm assumes equal lengths for A and B.
I tested your example with
static void Main(string[] args)
{
    var a = new[] {"kk", "ka", "kkk", "a"};
    var b = new[] {"ka", "kakk", "ak", "k"};
    for (int i = 0; i < 100; ++i)
        foreach (var sequence in new FitSequence(a, b).GetSequences(i))
        {
            foreach (int x in sequence)
                Console.Write("{0} ", x);
            Console.WriteLine();
        }
}
but could not find any solutions, though it seemed to work for simple tests.
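For a quick cross-check of the same brute-force idea, here is a compact sketch in Python; it enumerates the same bounded index sequences using itertools.product, so it shares the exponential cost of the C# version above:
from itertools import product

def find_matches(A, B, max_length):
    # Try every index sequence up to max_length and yield the ones
    # where the A- and B-concatenations coincide.
    for length in range(1, max_length + 1):
        for seq in product(range(len(A)), repeat=length):
            if ''.join(A[i] for i in seq) == ''.join(B[i] for i in seq):
                yield seq

# list(find_matches(["x", "xx"], ["xx", "x"], 3))  -> [(0, 1), (1, 0)]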