Java 8: sum up to a limit with streams - LinkedHashMap

I have this LinkedHashMap:
myMap = {
0 => 10,
1 => 6,
2 => 28,
...
}
int limit = 15;
What I'd like to do with streams is to sum the map values in order, stop as soon as the limit is reached, and return the corresponding key in the map (in this case 0).
Is there an elegant way with streams?

You could sum up to the limit like this:
myMap.values().stream().reduce(0, (a, b) -> a + b > limit ? a : a + b);

It's possible via my free StreamEx library, which extends the standard Stream API, though even with the library it's not very elegant:
EntryStream.of(myMap) // like myMap.entrySet().stream()
    .prefix((e1, e2) -> new AbstractMap.SimpleEntry<>(e2.getKey(), e1.getValue() + e2.getValue()))
    .takeWhile(e -> e.getValue() < limit)
    .reduce((a, b) -> b)
    .ifPresent(System.out::println);
Here we use two special StreamEx operations. One is prefix, which lazily computes a running prefix (like scanl in Haskell): for each pair of entries it takes the key of the latter one and the sum of their values, creating a new entry. Next we use takeWhile (which will also appear in the Java 9 standard Stream API) to stop as soon as the running sum reaches the limit. Finally we reduce to the last remaining element and print it if it's present, so not only the index but also the accumulated sum is printed (if you need only the index, add a .map(Entry::getKey) step).
While this solution is more or less FP-ish, I would not recommend it in general: it's not very efficient and produces garbage (intermediate entries and boxed Integers), not to mention the external library dependency. Use a plain old for loop instead. It is efficient and easy to understand:
int sum = 0;
Integer lastKey = null;
for (Map.Entry<Integer, Integer> e : myMap.entrySet()) {
    sum += e.getValue();
    if (sum >= limit) break;
    lastKey = e.getKey();
}
if (lastKey != null) {
    System.out.println(lastKey);
}

Related

Generate string permutations recursively; each character appears n times

I'm trying to write an algorithm that will generate all strings of length nm containing exactly n of each of the numbers 1, 2, ..., m.
For instance, all strings of length 6 with exactly two 1's, two 2's and two 3's, e.g. 112233, 121233, and so on.
I managed to do this with just 1's and 2's using a recursive method, but can't seem to get something that works when I introduce 3's.
When m = 2, the algorithm I have is:
generateAllStrings(int len, int K, String str)
{
    if(len == 0)
    {
        output(str);
    }
    if(K > 0)
    {
        generateAllStrings(len - 1, K - 1, str + '2');
    }
    if(len > K)
    {
        generateAllStrings(len - 1, K, str + '1');
    }
}
I've tried inserting similar conditions for the third number but the algorithm doesn't give a correct output. After that I wouldn't even know how to generalise for 4 numbers and above.
Is recursion the right thing to do? Any help would be appreciated.
One option would be to list off all distinct permutations of the string 111...1222...2...mmm...m (each digit repeated n times). There are nice algorithms for enumerating all distinct permutations of a string in time proportional to the length of the string per permutation, and they'd probably be a good way to go about solving this problem.
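For concreteness, here is a minimal Java sketch of one such algorithm, the classic lexicographic next-permutation step (class and method names are mine, not from the answer): starting from the sorted string and repeatedly advancing to the next permutation visits every distinct arrangement exactly once.
import java.util.Arrays;

public class DistinctPermutations {
    // Advance a to its next lexicographic permutation in place; return false if a was the last one.
    static boolean nextPermutation(char[] a) {
        int i = a.length - 2;
        while (i >= 0 && a[i] >= a[i + 1]) i--;      // rightmost position followed by an ascent
        if (i < 0) return false;                     // already the last (fully descending) permutation
        int j = a.length - 1;
        while (a[j] <= a[i]) j--;                    // rightmost element greater than a[i]
        char t = a[i]; a[i] = a[j]; a[j] = t;        // swap them
        for (int l = i + 1, r = a.length - 1; l < r; l++, r--) {  // reverse the descending suffix
            t = a[l]; a[l] = a[r]; a[r] = t;
        }
        return true;
    }

    public static void main(String[] args) {
        int n = 2, m = 3;                            // two of each of the digits 1..3
        char[] s = new char[n * m];
        for (int d = 0; d < m; d++)                  // start from the sorted string "112233"
            Arrays.fill(s, d * n, (d + 1) * n, (char) ('1' + d));
        do {
            System.out.println(new String(s));
        } while (nextPermutation(s));
    }
}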
To use a simple recursive algorithm, give each recursion the permutation so far (variable perm) and the number of occurrences of each digit that is still available (array count).
Run the code snippet to generate all unique permutations for n=2 and m=4 (set: 11223344).
function permutations(n, m) {
    var perm = "", count = [];                   // start with empty permutation
    for (var i = 0; i < m; i++) count[i] = n;    // set available number for each digit = n
    permute(perm, count);                        // start recursion with "" and [n,n,n...]

    function permute(perm, count) {
        var done = true;
        for (var i = 0; i < count.length; i++) { // iterate over all digits
            if (count[i] > 0) {                  // more instances of digit i available
                var c = count.slice();           // create hard copy of count array
                --c[i];                          // decrement count of digit i
                permute(perm + (i + 1), c);      // add digit to permutation and recurse
                done = false;                    // digits left over: not the last step
            }
        }
        if (done) document.write(perm + "<BR>"); // no digits left: complete permutation
    }
}
permutations(2, 4);
You can easily do this using DFS (or alternatively BFS). Define a graph in which each node holds one string, and a node is connected to every node whose string differs from it by swapping a single pair of characters. This graph is connected, so we can easily generate the set of all nodes, which contains exactly the strings we are looking for:
set generated_strings
list nodes
nodes.add(generateInitialString(N, M))
generated_strings.add(generateInitialString(N, M))
while (!nodes.empty())
    string tmp = nodes.remove(0)
    for (int i in [0, N * M))
        for (int j in distinct([0, N * M), i))
            string new = swap(tmp, i, j)
            if (!generated_strings.contains(new))
                nodes.add(new)
                generated_strings.add(new)
// generated_strings now contains all strings that can possibly be generated.
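A rough, runnable Java version of this idea might look like the following (a sketch only; the names are mine, digits are written as characters so it assumes m <= 9, and the neighbour expansion is quadratic per node):
import java.util.*;

public class SwapBfs {
    static Set<String> allStrings(int n, int m) {
        StringBuilder sb = new StringBuilder();
        for (int d = 1; d <= m; d++)                       // initial string "11..22..mm.."
            for (int k = 0; k < n; k++) sb.append((char) ('0' + d));
        String start = sb.toString();

        Set<String> generated = new HashSet<>();
        Deque<String> nodes = new ArrayDeque<>();
        generated.add(start);
        nodes.add(start);
        while (!nodes.isEmpty()) {
            char[] cur = nodes.poll().toCharArray();
            for (int i = 0; i < cur.length; i++) {
                for (int j = i + 1; j < cur.length; j++) {
                    if (cur[i] == cur[j]) continue;        // swapping equal digits changes nothing
                    char t = cur[i]; cur[i] = cur[j]; cur[j] = t;
                    String next = new String(cur);
                    cur[j] = cur[i]; cur[i] = t;           // swap back
                    if (generated.add(next)) nodes.add(next);
                }
            }
        }
        return generated;   // every arrangement is reachable from start by transpositions
    }

    public static void main(String[] args) {
        System.out.println(allStrings(2, 3).size());       // 6! / (2!·2!·2!) = 90
    }
}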

Reduce a string using grammar-like rules

I'm trying to find a suitable DP algorithm for simplifying a string. For example, I have the string a b a b and a list of rules:
a b -> b
a b -> c
b a -> a
c c -> b
The goal is to get all single chars that can be obtained from the given string using these rules. For this example it will be b, c. The length of the given string can be up to 200 symbols. Could you suggest an efficient algorithm?
Rules are always of the form 2 -> 1. My idea is to build a tree whose root is the given string and whose children are the strings obtained after one transformation each, but I'm not sure it's the best way.
If you read those rules from right to left, they look exactly like the rules of a context-free grammar, and have basically the same meaning. You could apply a general context-free parsing algorithm such as the Earley algorithm to your data, along with a suitable starting rule; something like
start <- start a
| start b
| start c
and then just examine the parse forest for the shortest chain of starts. The worst case remains O(n^3) of course, but Earley parsers are fairly efficient in practice.
You can also produce parse forests when parsing with derivatives. You might be able to efficiently check them for short chains of starts.
For a DP problem, you always need to understand how to construct the answer for a big problem from the answers to smaller sub-problems. Assume your function simplify is called with an input of length n. There are n-1 ways to split the input into a first and a second part. For each of these splits, recursively call simplify on both parts. The answer for the input of length n is then the set of all characters that, according to the rules, can be produced from some combination of an answer for the first part and an answer for the second part.
In Python, this can be implemented like so:
from functools import lru_cache

rules = {'ab': set('bc'), 'ba': set('a'), 'cc': set('b')}
all_chars = set(c for cc in rules.values() for c in cc)

@lru_cache(maxsize=None)  # memoize
def simplify(s):
    if len(s) == 1:  # base case to end recursion
        return set(s)
    possible_chars = set()
    # iterate over all the possible splits of s
    for i in range(1, len(s)):
        head = s[:i]
        tail = s[i:]
        # check all possible combinations of answers of sub-problems
        for c1 in simplify(head):
            for c2 in simplify(tail):
                possible_chars.update(rules.get(c1 + c2, set()))
        # speed hack
        if possible_chars == all_chars:  # won't get any bigger
            return all_chars
    return possible_chars
Quick check:
In [53]: simplify('abab')
Out[53]: {'b', 'c'}
To make this fast enough for large strings (and to avoid exponential behaviour), you have to memoize the function, as done with functools.lru_cache above. This is a critical step in solving DP problems; otherwise you are just doing a brute-force calculation. A further tiny speedup can be obtained by returning from the function as soon as possible_chars == all_chars, since at that point you are already sure that you can generate all possible outcomes.
Analysis of running time: for an input of length n, there are 2 substrings of length n-1, 3 substrings of length n-2, ..., n substrings of length 1, for a total of O(n^2) subproblems. Due to the memoization, the function body is executed at most once per subproblem. The maximum running time for a single subproblem is O(n) due to the for i in range(1, len(s)) loop (the inner loops run over constant-size character sets), so the overall running time is at most O(n^3).
Let N be the length of the given string and R the number of rules.
Expanding a tree in a top-down manner yields computational complexity O(NR^N) in the worst case (an input string of the form aaa... and the rule aa -> a).
Proof:
The root of the tree has up to (N-1)R children, the next level has up to (N-1)R^2 nodes, ..., and the last level has up to (N-1)R^N leaves. So the total size is O((N-1)R + (N-1)R^2 + ... + (N-1)R^N) = O(N(R + R^2 + ... + R^N)), which by the geometric series sum (for R >= 2) is O(NR^N).
Recursive Java implementation of this naive approach:
public static void main(String[] args) {
    Map<String, Character[]> rules = new HashMap<String, Character[]>() {{
        put("ab", new Character[]{'b', 'c'});
        put("ba", new Character[]{'a'});
        put("cc", new Character[]{'b'});
    }};
    System.out.println(simplify("abab", rules));
}

public static Set<String> simplify(String in, Map<String, Character[]> rules) {
    Set<String> result = new HashSet<String>();
    simplify(in, rules, result);
    return result;
}

private static void simplify(String in, Map<String, Character[]> rules, Set<String> result) {
    if (in.length() == 1) {
        result.add(in);
    }
    for (int i = 0; i < in.length() - 1; i++) {
        String two = in.substring(i, i + 2);
        Character[] rep = rules.get(two);
        if (rep != null) {
            for (Character c : rep) {
                simplify(in.substring(0, i) + c + in.substring(i + 2, in.length()), rules, result);
            }
        }
    }
}
Bas Swinckels's O(RN^3) Java implementation (with HashMap as a memoization cache):
public static Set<String> simplify2(final String in, Map<String, Character[]> rules) {
    Map<String, Set<String>> cache = new HashMap<String, Set<String>>();
    return simplify2(in, rules, cache);
}

private static Set<String> simplify2(final String in, Map<String, Character[]> rules, Map<String, Set<String>> cache) {
    final Set<String> cached = cache.get(in);
    if (cached != null) {
        return cached;
    }
    Set<String> ret = new HashSet<String>();
    if (in.length() == 1) {
        ret.add(in);
        return ret;
    }
    for (int i = 1; i < in.length(); i++) {
        String head = in.substring(0, i);
        String tail = in.substring(i, in.length());
        for (String c1 : simplify2(head, rules, cache)) {
            for (String c2 : simplify2(tail, rules, cache)) {
                Character[] rep = rules.get(c1 + c2);
                if (rep != null) {
                    for (Character c : rep) {
                        ret.add(c.toString());
                    }
                }
            }
        }
    }
    cache.put(in, ret);
    return ret;
}
Output in both approaches:
[b, c]

Performance difference in toString.map and toString.toArray.map

While coding Euler problems, I ran across something I find bizarre:
The method toString.map is slower than toString.toArray.map.
Here's an example:
def main(args: Array[String]) {
  def toDigit(num: Int) = num.toString.map(_ - 48)             // 2137 ms
  def toDigitFast(num: Int) = num.toString.toArray.map(_ - 48) // 592 ms

  val startTime = System.currentTimeMillis
  (1 to 1200000).map(toDigit)
  println(System.currentTimeMillis - startTime)
}
Shouldn't the map method on String fall back to a map over the array? Why is there such a noticeable difference? (Note that increasing the number even causes a stack overflow in the non-array case.)
Original
Could be because toString.map uses the WrappedString implicit, while toString.toArray.map uses the WrappedArray implicit to resolve map.
Let's see map, as defined in TraversableLike:
def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
  val b = bf(repr)
  b.sizeHint(this)
  for (x <- this) b += f(x)
  b.result
}
WrappedString uses a StringBuilder as builder:
def +=(x: Char): this.type = { append(x); this }

def append(x: Any): StringBuilder = {
  underlying append String.valueOf(x)
  this
}
The String.valueOf call for Any uses Java's Object.toString on the Char instances, which may get boxed first. These extra operations might be the cause of the speed difference, compared with the presumably shorter code paths of the Array builder.
This is a guess though, would have to measure.
Edit
After revising, the general point still stands, but I referred to the wrong implicits, since the toDigit methods return an Int sequence (or the like), not a transformed string as I had misread.
toDigit uses LowPriorityImplicits.fallbackStringCanBuildFrom[T]: CanBuildFrom[String, T, immutable.IndexedSeq[T]], with T = Int, which just defers to a general IndexedSeq builder.
toDigitFast uses a direct Array implicit of type CanBuildFrom[Array[_], T, Array[T]], which is unarguably faster.
Passing the following CBF for toDigit explicitly makes the two methods on par:
object FastStringToArrayBuild {
  def canBuildFrom[T : ClassManifest] = new CanBuildFrom[String, T, Array[T]] {
    private def newBuilder = scala.collection.mutable.ArrayBuilder.make()
    def apply(from: String) = newBuilder
    def apply() = newBuilder
  }
}
You're being fooled by running out of memory. The toDigit version does create more intermediate objects, but if you have plenty of memory then the GC won't be heavily impacted (and it'll all run faster). For example, if instead of creating 1.2 million numbers, I create 12k 100x in a row, I get approximately equal times for the two methods. If I create 1.2k 5-digit numbers 1000x in a row, I find that toDigit is about 5% faster.
Given that the toDigit method produces an immutable collection, which is better when all else is equal since it is easier to reason about, and given that all else is equal for all but highly demanding tasks, I think the library is as it should be.
When trying to improve performance, of course one needs to keep all sorts of tricks in mind; one of these is that arrays have better memory characteristics for collections of known length than do the fancy collections in the Scala library. Also, one needs to know that map isn't the fastest way to get things done; if you really wanted this to be fast you should
final def toDigitReallyFast(num: Int, accum: Long = 0L, iter: Int = 0): Array[Byte] = {
  if (num == 0) {
    val ans = new Array[Byte](math.max(1, iter))
    var i = 0
    var ac = accum
    while (i < ans.length) {
      ans(ans.length - i - 1) = (ac & 0xF).toByte
      ac >>= 4
      i += 1
    }
    ans
  }
  else {
    val next = num / 10
    toDigitReallyFast(next, (accum << 4) | (num - 10 * next), iter + 1)
  }
}
which on my machine is at 4x faster than either of the others. And you can get almost 3x faster yet again if you leave everything in a Long and pack the results in an array instead of using 1 to N:
final def toDigitExtremelyFast(num: Int, accum: Long = 0L, iter: Int = 0): Long = {
  if (num == 0) accum | (iter.toLong << 48)
  else {
    val next = num / 10
    toDigitExtremelyFast(next, accum | ((num - 10 * next).toLong << (4 * iter)), iter + 1)
  }
}
// loop, instead of 1 to N map, for the 1.2k number case
{
  var i = 10000
  val a = new Array[Long](1201)
  while (i <= 11200) {
    a(i - 10000) = toDigitExtremelyFast(i)
    i += 1
  }
  a
}
As with many things, performance tuning is highly dependent on exactly what you want to do. In contrast, library design has to balance many different concerns. I do think it's worth noticing where the library is sub-optimal with respect to performance, but this isn't really one of those cases IMO; the flexibility is worth it for the common use cases.

faster min and max of different array components with CouchDb map/reduce?

I have a CouchDB database with a view whose values are paired numbers of the form [x,y]. For documents with the same key, I need (simultaneously) to compute the minimum of x and the maximum of y. The database I am working with contains about 50000 documents. Building the view takes several hours, which seems somewhat excessive. (The keys are themselves length-three arrays.) I show the map and reduce functions below, but the basic question is: how can I speed up this process?
Note that the builtin functions won't work because the values have to be numbers, not length-two arrays. It is possible that I could make two different views (one for min(x) and one for max(y)), but it is unclear to me how to combine them to get both results simultaneously.
My current map function looks basically like
function(doc) {
    emit([doc.a, doc.b, doc.c], [doc.x, doc.y]);
}
and my reduce function looks like
function(keys, values) {
    var x = null;
    var y = null;
    for (i = 0; i < values.length; i++) {
        if (values[i][0] == null) break;
        if (values[i][1] == null) break;
        if (x == null) x = values[i][0];
        if (y == null) y = values[i][1];
        if (values[i][0] < x) x = values[i][0];
        if (values[i][1] > y) y = values[i][1];
    }
    emit([x, y]);
}
Just two more notes. Using Math.max() and Math.min() should be a little faster.
function(keys, values) {
    var x = Infinity,
        y = -Infinity;
    for (var i = 0, v; v = values[i]; i++) {
        x = Math.min(x, v[0]);
        y = Math.max(y, v[1]);
    }
    return [x, y];
}
And if CouchDB is treating the values as strings, it is because you are storing them as strings in the document.
Hope it helps.
This turned out to be a combination of two factors. One is obvious in the code posted above, which uses "emit" where it should use "return".
The other factor is less obvious and was only found by making a smaller version of the database and logging the steps in the reduce function. Although the entries in "values" were meant to be integers, they were being treated by CouchDB as character strings. Using the parseInt function corrected that problem.
After those two fixes, the entire build of the reduced view took about five minutes, so the speed problem evaporated.
Please check http://www.geeksforgeeks.org/archives/4583 . This may be extended to your application.

Is there a circular hash function?

Thinking about this question on testing string rotation, I wondered: is there such a thing as a circular/cyclic hash function? E.g.
h(abcdef) = h(bcdefa) = h(cdefab) etc
Uses for this include scalable algorithms which can check n strings against each other to see whether some are rotations of others.
I suppose the essence of the hash is to extract information which is order-specific but not position-specific. Maybe something that finds a deterministic 'first position', rotates to it and hashes the result?
It all seems plausible, but slightly beyond my grasp at the moment; it must be out there already...
I'd go along with your deterministic "first position" - find the "least" character; if it appears twice, use the next character as the tie breaker (etc). You can then rotate to a "canonical" position, and hash that in a normal way. If the tie breakers run for the entire course of the string, then you've got a string which is a rotation of itself (if you see what I mean) and it doesn't matter which you pick to be "first".
So:
"abcdef" => hash("abcdef")
"defabc" => hash("abcdef")
"abaac" => hash("aacab") (tie-break between aa, ac and ab)
"cabcab" => hash("abcabc") (it doesn't matter which "a" comes first!)
Update: As Jon pointed out, the first approach doesn't handle strings with repetition very well. Problems arise when duplicate pairs of letters are encountered and the resulting XOR is 0. Here is a modification that I believe fixes the original algorithm. It uses Euclid-Fermat sequences to generate pairwise coprime integers for each additional occurrence of a character in the string. The result is that the XOR for duplicate pairs is non-zero.
I've also cleaned up the algorithm slightly. Note that the array containing the EF sequences only supports characters in the range 0x00 to 0xFF. This was just a cheap way to demonstrate the algorithm. Also, the algorithm still has runtime O(n) where n is the length of the string.
static int Hash(string s)
{
    int H = 0;
    if (s.Length > 0)
    {
        // any arbitrary coprime numbers
        int a = s.Length, b = s.Length + 1;
        // an array of Euclid-Fermat sequences to generate additional coprimes for each duplicate character occurrence
        int[] c = new int[0x100];
        for (int i = 1; i < c.Length; i++)
        {
            c[i] = i + 1;
        }
        Func<char, int> NextCoprime = (x) => c[x] = (c[x] - x) * c[x] + x;
        Func<char, char, int> NextPair = (x, y) => a * NextCoprime(x) * x.GetHashCode() + b * y.GetHashCode();
        // for i=0 we need to wrap around to the last character
        H = NextPair(s[s.Length - 1], s[0]);
        // for i=1...n we use the previous character
        for (int i = 1; i < s.Length; i++)
        {
            H ^= NextPair(s[i - 1], s[i]);
        }
    }
    return H;
}
static void Main(string[] args)
{
    Console.WriteLine("{0:X8}", Hash("abcdef"));
    Console.WriteLine("{0:X8}", Hash("bcdefa"));
    Console.WriteLine("{0:X8}", Hash("cdefab"));
    Console.WriteLine("{0:X8}", Hash("cdfeab"));
    Console.WriteLine("{0:X8}", Hash("a0a0"));
    Console.WriteLine("{0:X8}", Hash("1010"));
    Console.WriteLine("{0:X8}", Hash("0abc0def0ghi"));
    Console.WriteLine("{0:X8}", Hash("0def0abc0ghi"));
}
The output is now:
7F7D7F7F
7F7D7F7F
7F7D7F7F
7F417F4F
C796C7F0
E090E0F0
A909BB71
A959BB71
First Version (which isn't complete): Use XOR which is commutative (order doesn't matter) and another little trick involving coprimes to combine ordered hashes of pairs of letters in the string. Here is an example in C#:
static int Hash(char[] s)
{
    // any arbitrary coprime numbers
    const int a = 7, b = 13;
    int H = 0;
    if (s.Length > 0)
    {
        // for i=0 we need to wrap around to the last character
        H ^= (a * s[s.Length - 1].GetHashCode()) + (b * s[0].GetHashCode());
        // for i=1...n we use the previous character
        for (int i = 1; i < s.Length; i++)
        {
            H ^= (a * s[i - 1].GetHashCode()) + (b * s[i].GetHashCode());
        }
    }
    return H;
}
static void Main(string[] args)
{
    Console.WriteLine(Hash("abcdef".ToCharArray()));
    Console.WriteLine(Hash("bcdefa".ToCharArray()));
    Console.WriteLine(Hash("cdefab".ToCharArray()));
    Console.WriteLine(Hash("cdfeab".ToCharArray()));
}
The output is:
4587590
4587590
4587590
7077996
You could find a deterministic first position by always starting at the position with the "lowest" (in terms of alphabetical ordering) substring. So in your case, you'd always start at "a". If there were multiple "a"s, you'd have to take two characters into account etc.
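For illustration, here is a minimal Java sketch of this canonicalize-then-hash idea (the names are mine): pick the lexicographically smallest rotation, which handles the tie-breaking between repeated characters automatically, and hash it normally. A quadratic scan is enough for a sketch, though Booth's algorithm can find the smallest rotation in linear time.
public class CyclicHash {
    // Pick the lexicographically smallest rotation as the canonical form.
    static String canonicalRotation(String s) {
        String doubled = s + s;
        String best = s;
        for (int i = 1; i < s.length(); i++) {
            String candidate = doubled.substring(i, i + s.length());
            if (candidate.compareTo(best) < 0) best = candidate;
        }
        return best;
    }

    // Hash the canonical form, so all rotations of a string hash alike.
    static int cyclicHash(String s) {
        return canonicalRotation(s).hashCode();
    }

    public static void main(String[] args) {
        System.out.println(canonicalRotation("abaac"));                   // aacab
        System.out.println(cyclicHash("abcdef") == cyclicHash("defabc")); // true
    }
}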
I am sure you could find a function that generates the same hash regardless of character position in the input; however, how will you ensure that h(abc) != h(efg) for every conceivable input? (Collisions will occur for any hash algorithm, so what matters is how you minimize that risk.)
You'd need some additional checks even after generating the hash to ensure that the strings contain the same characters.
Here's an implementation using Linq
public string ToCanonicalOrder(string input)
{
    char first = input.OrderBy(x => x).First();
    string doubledForRotation = input + input;
    string canonicalOrder
        = (-1)
        .GenerateFrom(x => doubledForRotation.IndexOf(first, x + 1))
        .Skip(1) // the -1
        .TakeWhile(x => x < input.Length)
        .Select(x => doubledForRotation.Substring(x, input.Length))
        .OrderBy(x => x)
        .First();
    return canonicalOrder;
}
assuming generic generator extension method:
public static class TExtensions
{
    public static IEnumerable<T> GenerateFrom<T>(this T initial, Func<T, T> next)
    {
        var current = initial;
        while (true)
        {
            yield return current;
            current = next(current);
        }
    }
}
sample usage:
var sequences = new[]
{
    "abcdef", "bcdefa", "cdefab",
    "defabc", "efabcd", "fabcde",
    "abaac", "cabcab"
};
foreach (string sequence in sequences)
{
    Console.WriteLine(ToCanonicalOrder(sequence));
}
output:
abcdef
abcdef
abcdef
abcdef
abcdef
abcdef
aacab
abcabc
then call .GetHashCode() on the result if necessary.
sample usage if ToCanonicalOrder() is converted to an extension method:
sequence.ToCanonicalOrder().GetHashCode();
One possibility is to combine the hash functions of all circular shifts of your input into one meta-hash which does not depend on the order of the inputs.
More formally, consider
for (int i = 0; i < string.length; i++) {
    result ^= string.rotatedBy(i).hashCode();
}
Where you could replace the ^= with any other commutative operation.
As a concrete example, consider the input
"abcd"
to get the hash we take
hash("abcd") ^ hash("dabc") ^ hash("cdab") ^ hash("bcda").
As we can see, taking the hash of any of these rotations only changes the order in which the XOR is evaluated, which doesn't change its value.
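A small Java sketch of this meta-hash (the method name is mine; note it is quadratic because it hashes every rotation from scratch):
public class RotationInvariantHash {
    // XOR (commutative) of the hashes of all rotations: identical for every rotation of s.
    static int rotationInvariantHash(String s) {
        String doubled = s + s;
        int result = 0;
        for (int i = 0; i < s.length(); i++) {
            result ^= doubled.substring(i, i + s.length()).hashCode();
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(rotationInvariantHash("abcd") == rotationInvariantHash("cdab"));  // true
    }
}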
I did something like this for a project in college. There were 2 approaches I used to try to optimize a Travelling-Salesman problem. I think if the elements are NOT guaranteed to be unique, the second solution would take a bit more checking, but the first one should work.
If you can represent the string as a matrix of associations, abcdef would look like
  a b c d e f
a   x
b     x
c       x
d         x
e           x
f x
Any rotation of the string produces the same matrix, so it would be trivial to compare those matrices.
Another, quicker trick would be to rotate the string so that the "first" (smallest) letter comes first. Then, given the same starting point, strings that are rotations of each other become identical.
Here is some Ruby code:
def normalize_string(string)
  myarray = string.split(//)           # split into an array
  index = myarray.index(myarray.min)   # find the index of the minimum element
  index.times do
    myarray.push(myarray.shift)        # move stuff from the front to the back
  end
  return myarray.join
end

p normalize_string('abcdef').eql?(normalize_string('defabc'))  # should return true
Maybe use a rolling hash for each offset (Rabin-Karp style) and return the minimum hash value? There could be collisions, though.
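A rough Java sketch of that idea, using a polynomial rolling hash so each rotation's hash is derived in O(1) from the previous one (the base and modulus are arbitrary choices of mine, and collisions remain possible):
public class MinRotationHash {
    // Polynomial rolling hash over all rotations of s; the minimum value is rotation-invariant.
    static long minRotationHash(String s) {
        final long B = 257, M = 1_000_000_007L;
        int n = s.length();
        long h = 0, top = 1;                      // top becomes B^(n-1) mod M
        for (int i = 0; i < n; i++) {
            h = (h * B + s.charAt(i)) % M;
            if (i > 0) top = (top * B) % M;
        }
        long min = h;
        for (int i = 0; i < n; i++) {             // drop s[i] from the front, append it at the back
            long c = s.charAt(i);
            h = ((h - c * top % M + M) % M * B + c) % M;
            min = Math.min(min, h);
        }
        return min;
    }

    public static void main(String[] args) {
        System.out.println(minRotationHash("abcdef") == minRotationHash("cdefab"));  // true
    }
}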
