Consider a table with structure:
CREATE TABLE statistics (name text, when timestamp, value int,
PRIMARY KEY ((name, when)));
What is the best way to calculate, for example, the 50th percentile of value by name?
I thought about:
a) writing custom aggregate function + query like:
SELECT PERCENTILE(value, 0.5) FROM statistics WHERE name = '...'
b) count elements by name first
SELECT COUNT(value) FROM statistics WHERE name = '...'
then find the (0.5 * count)th row value with paging when it is sorted by value ascending. Say, if count is 100 it will be the 50th row.
c) your ideas
I'm not sure if case A can handle the task. Case B might be tricky when there is an odd number of rows.
This works as long as you always provide name; without restricting the query to a single partition that holds all the rows, the request gets very expensive. I am assuming you mean ((name), when) and not ((name, when)) in your table; otherwise what you're asking is impossible without full table scans (using Hadoop or Spark).
The UDA would work, but it can be expensive unless you're willing to accept an approximation. To have it perfectly accurate you need two passes (i.e. a count, then a second pass to go X rows into the set; but since there is no isolation, this isn't going to be perfect either). So if you need it perfectly accurate, your best bet is probably to pull the entire statistics[name] partition locally, or to have the UDA build up the entire set (or the majority of it) in a map before calculating (not recommended if partitions get large at all). For example:
CREATE OR REPLACE FUNCTION all(state tuple<double, map<int, int>>, val int, percentile double)
CALLED ON NULL INPUT RETURNS tuple<double, map<int, int>> LANGUAGE java AS '
java.util.Map<Integer, Integer> m = state.getMap(1, Integer.class, Integer.class);
m.put(m.size(), val);
state.setMap(1, m);
state.setDouble(0, percentile);
return state;';
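CREATE OR REPLACE FUNCTION calcAllPercentile (state tuple<double, map<int, int>>)
CALLED ON NULL INPUT RETURNS int LANGUAGE java AS
'java.util.Map<Integer, Integer> m = state.getMap(1, Integer.class, Integer.class);
if (m.isEmpty()) return null;
// rows arrive in clustering order, not value order, so sort before indexing
java.util.List<Integer> sorted = new java.util.ArrayList<Integer>(m.values());
java.util.Collections.sort(sorted);
int offset = (int) java.lang.Math.min(sorted.size() * state.getDouble(0), sorted.size() - 1);
return sorted.get(offset);';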
CREATE AGGREGATE IF NOT EXISTS percentile (int , double)
SFUNC all STYPE tuple<double, map<int, int>>
FINALFUNC calcAllPercentile
INITCOND (0.0, {});
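For the "pull the partition locally" route, the calculation itself is short once the values are in memory. Below is a minimal C++ sketch, assuming the nearest-rank definition of a percentile (one of several conventions, and not necessarily the one the UDA above uses); the helper name percentile_nearest_rank is hypothetical. With this definition an odd row count needs no special handling.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: nearest-rank percentile of `values`, with p in (0, 1].
// The rank is ceil(p * n), counted from 1 in ascending order.
int percentile_nearest_rank(std::vector<int> values, double p) {
    if (values.empty() || p <= 0.0 || p > 1.0) {
        throw std::invalid_argument("need non-empty input and 0 < p <= 1");
    }
    const std::size_t n = values.size();
    std::size_t rank = static_cast<std::size_t>(std::ceil(p * static_cast<double>(n)));
    if (rank < 1) rank = 1;
    if (rank > n) rank = n;
    // nth_element puts the rank-th smallest value at index rank - 1
    // without sorting the whole vector.
    auto nth = values.begin() + static_cast<std::ptrdiff_t>(rank - 1);
    std::nth_element(values.begin(), nth, values.end());
    return *nth;
}
```

The vector is taken by value so the caller's data is left untouched; for large partitions passing it by reference and reordering in place would avoid the copy.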
If you are willing to accept an approximation, you can use a sampling reservoir: you store, say, 1024 elements, and as your UDA receives elements you replace entries in the reservoir with steadily decreasing probability (Vitter's Algorithm R). This is pretty easy to implement, and IF your data set is expected to have a normal distribution it will give you a decent approximation. If your data set is not normally distributed this can be pretty far off. With a normal distribution there are actually a lot of other options as well, but R is, I think, the easiest to implement in a UDA. For example:
CREATE OR REPLACE FUNCTION reservoir (state tuple<int, double, map<int, int>>, val int, percentile double)
CALLED ON NULL INPUT RETURNS tuple<int, double, map<int, int>> LANGUAGE java AS '
java.util.Map<Integer, Integer> m = state.getMap(2, Integer.class, Integer.class);
int seen = state.getInt(0); // values processed before this one
if (seen < 1024) {
// fill the reservoir, slots 0..1023
m.put(seen, val);
} else {
// keep the new value with probability 1024 / (seen + 1)
int replace = (int) (java.lang.Math.random() * (seen + 1));
if (replace < 1024) {
m.put(replace, val);
}
}
state.setMap(2, m);
state.setDouble(1, percentile);
state.setInt(0, seen + 1);
return state;';
CREATE OR REPLACE FUNCTION calcApproxPercentile (state tuple<int, double, map<int, int>>)
CALLED ON NULL INPUT RETURNS int LANGUAGE java AS
'java.util.Map<Integer, Integer> m = state.getMap(2, Integer.class, Integer.class);
if (m.isEmpty()) return 0;
// the reservoir is unordered, so sort the sample before indexing
java.util.List<Integer> sorted = new java.util.ArrayList<Integer>(m.values());
java.util.Collections.sort(sorted);
int offset = (int) java.lang.Math.min(sorted.size() * state.getDouble(1), sorted.size() - 1);
return sorted.get(offset);';
CREATE AGGREGATE IF NOT EXISTS percentile_approx (int , double)
SFUNC reservoir STYPE tuple<int, double, map<int, int>>
FINALFUNC calcApproxPercentile
INITCOND (0, 0.0, {});
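Outside of CQL, the reservoir idea can be sketched in a few lines. The following is an illustrative C++ version of Algorithm R, not a translation of the UDA; the class name Reservoir and its interface are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Reservoir sampling (Algorithm R): keeps a uniform random sample of at most
// `capacity` values from a stream whose length is not known in advance.
class Reservoir {
public:
    Reservoir(std::size_t capacity, unsigned seed) : capacity_(capacity), rng_(seed) {}

    void add(int value) {
        if (sample_.size() < capacity_) {
            sample_.push_back(value);  // fill phase
        } else {
            // Pick a slot in [0, seen_]. The new value is kept only if the slot
            // lies inside the reservoir, i.e. with probability capacity / (seen_ + 1).
            std::uniform_int_distribution<std::size_t> pick(0, seen_);
            const std::size_t slot = pick(rng_);
            if (slot < capacity_) sample_[slot] = value;
        }
        ++seen_;
    }

    std::size_t seen() const { return seen_; }
    const std::vector<int>& sample() const { return sample_; }

private:
    std::size_t capacity_;
    std::size_t seen_ = 0;  // number of values offered so far
    std::mt19937 rng_;
    std::vector<int> sample_;
};
```

To read an approximate percentile from it, sort a copy of sample() and index into the sorted copy; the sample itself is kept in no particular order.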
Of the two UDAs above, the percentile function will get slower sooner. Playing with the size of the sampler can give you more or less accuracy, but make it too large and you start to impact performance. Generally a UDA over more than 10k values (even simple functions like count) starts to fail. It is important to recognize in these scenarios too that while the single query returns a single value, there's a ton of work to get it, so a lot of these queries or much concurrency will put a lot of pressure on your coordinators. This does require >3.8 (I would recommend 3.11.latest+) for CASSANDRA-10783.
Note: I make no promises that I haven't missed an off-by-one error in the example UDAs. I did not test them fully, but they should be close enough that you can make them work from there.
In the documentation of compareTo function, I read:
Returns zero if this object is equal to the specified other object, a
negative number if it's less than other, or a positive number if it's
greater than other.
What does this less than or greater than mean in the context of strings? Is -for example- Hello World less than a single character a?
val epicString = "Hello World"
println(epicString.compareTo("a")) //-25
Why -25 and not -10 or -1 (for example)?
Other examples:
val epicString = "Hello World"
println(epicString.compareTo("HelloWorld")) //-55
Is Hello World less than HelloWorld? Why?
Why does it return -55 and not -1, -2, -3, etc.?
val epicString = "Hello World"
println(epicString.compareTo("Hello World")) //55
Is Hello World greater than Hello World? Why?
Why does it return 55 and not 1, 2, 3, etc.?
I believe you're asking about the implementation of the compareTo method for java.lang.String. Here is the source code from Java 11:
public int compareTo(String anotherString) {
byte v1[] = value;
byte v2[] = anotherString.value;
if (coder() == anotherString.coder()) {
return isLatin1() ? StringLatin1.compareTo(v1, v2)
: StringUTF16.compareTo(v1, v2);
}
return isLatin1() ? StringLatin1.compareToUTF16(v1, v2)
: StringUTF16.compareToLatin1(v1, v2);
}
So we have a delegation to either StringLatin1 or StringUTF16 here, so we should look further:
Fortunately StringLatin1 and StringUTF16 have similar implementation when it comes to compare functionality:
Here is an implementation for StringLatin1 for example:
public static int compareTo(byte[] value, byte[] other) {
int len1 = value.length;
int len2 = other.length;
return compareTo(value, other, len1, len2);
}
public static int compareTo(byte[] value, byte[] other, int len1, int len2) {
int lim = Math.min(len1, len2);
for (int k = 0; k < lim; k++) {
if (value[k] != other[k]) {
return getChar(value, k) - getChar(other, k);
}
}
return len1 - len2;
}
As you can see, it iterates over the characters up to the length of the shorter string, and if the characters at the same index of the two strings differ, it returns the difference between them. If it doesn't find any difference during the iteration (one string is a prefix of the other), it falls back to comparing the lengths of the two strings.
In your case, there is a difference in the first iteration already...
So it's the same as "H".compareTo("a"), which gives -25.
The code of "H" is 72
The code of "a" is 97
So, 72 - 97 = -25
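The same rule can be reproduced outside the JDK to check the numbers. This is a small C++ sketch that mirrors the Latin-1 loop shown above; the function name compare_like_java is made up for the example, and it assumes single-byte characters only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Mirrors the Latin-1 loop above: the difference of the first mismatching
// characters, otherwise the difference in lengths.
// Assumption: both strings contain single-byte characters only.
int compare_like_java(const std::string& a, const std::string& b) {
    const std::size_t lim = std::min(a.size(), b.size());
    for (std::size_t k = 0; k < lim; ++k) {
        if (a[k] != b[k]) {
            return static_cast<unsigned char>(a[k]) - static_cast<unsigned char>(b[k]);
        }
    }
    return static_cast<int>(a.size()) - static_cast<int>(b.size());
}
```

With this, compare_like_java("Hello World", "HelloWorld") is -55 because the first mismatch is a space (32) against 'W' (87), and swapping the two arguments gives 55.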
Short answer: The exact value doesn't have any meaning; only its sign does.
As the specification for compareTo() says, it returns a -ve number if the receiver is smaller than the other object, a +ve number if the receiver is larger, or 0 if the two are considered equal (for the purposes of this ordering).
The specification doesn't distinguish between different -ve numbers, nor between different +ve numbers — and so neither should you. Some classes always return -1, 0, and 1, while others return different numbers, but that's just an implementation detail — and implementations vary.
Let's look at a very simple hypothetical example:
class Length(val metres: Int) : Comparable<Length> {
override fun compareTo(other: Length)
= metres - other.metres
}
This class has a single numerical property, so we can use that property to compare them. One common way to do the comparison is simply to subtract the two lengths: that gives a number which is positive if the receiver is larger, negative if it's smaller, and zero if they're the same length — which is just what we need.
In this case, the value of compareTo() would happen to be the signed difference between the two lengths.
However, that method has a subtle bug: the subtraction could overflow, and give the wrong results if the difference is bigger than Int.MAX_VALUE. (Obviously, to hit that you'd need to be working with astronomical distances, both positive and negative — but that's not implausible. Rocket scientists write programs too!)
To fix it, you might change it to something like:
class Length(val metres: Int) : Comparable<Length> {
override fun compareTo(other: Length) = when {
metres > other.metres -> 1
metres < other.metres -> -1
else -> 0
}
}
That fixes the bug; it works for all possible lengths.
But notice that the actual return value has changed in most cases: now it only ever returns -1, 0, or 1, and no longer gives an indication of the actual difference in lengths.
If this was your class, then it would be safe to make this change because it still matches the specification. Anyone who just looked at the sign of the result would see no change (apart from the bug fix). Anyone using the exact value would find that their programs were now broken — but that's their own fault, because they shouldn't have been relying on that, because it was undocumented behaviour.
Exactly the same applies to the String class and its implementation. While it might be interesting to poke around inside it and look at how it's written, the code you write should never rely on that sort of detail. (It could change in a future version. Or someone could apply your code to another object which didn't behave the same way. Or you might want to expand your project to be cross-platform, and discover the hard way that the JavaScript implementation didn't behave exactly the same as the Java one.)
In the long run, life is much simpler if you don't assume anything more than the specification promises!
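The overflow point is not specific to Kotlin. As a rough illustration in C++ (where signed overflow is undefined behaviour rather than wrap-around), the sketch below shows a sign-only comparison next to a check for when a subtraction-based one would go wrong; both function names are invented for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Sign-only three-way comparison: cannot overflow, returns -1, 0 or 1.
int compare_safe(std::int32_t a, std::int32_t b) {
    return (a > b) - (a < b);
}

// True when a - b does not fit in 32 bits, which is exactly when a
// subtraction-based compareTo would report the wrong sign on the JVM.
bool difference_overflows(std::int32_t a, std::int32_t b) {
    const std::int64_t diff = static_cast<std::int64_t>(a) - static_cast<std::int64_t>(b);
    return diff > std::numeric_limits<std::int32_t>::max() ||
           diff < std::numeric_limits<std::int32_t>::min();
}
```

For example, 2000000000 and -2000000000 are both valid 32-bit values, but their difference is not, so only the sign-based version orders them correctly.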
I think I need some help with the OPL language :/
My code is the following:
using CP;
int NbMchs = ...;
range Mchs = 0..NbMchs-1;
tuple Mode {
int opId;
int mch;
int pt;
};
{Mode} Modes = ...;
// Not Working...
int test[m in Mchs] = all(md in Modes: md.mch == m) md.opId;
What I want to do is to extract m 1D arrays from the Modes structure, each containing just the opId field of the tuple. Each test[m] array has to contain its corresponding elements: that is, the opId field of every tuple md where md.mch == m.
The error that I get from the above code is "Cannot use type int[] for int". It seems like the right hand side of the above function is returning a single integer, but I was thinking that the all() operator is the one that I can use to do the job.
Thanks in advance
In the general case, the number of opId depends on the machine m so you cannot really have a 2-D array here. I would use an array of sets:
{int} test[m in Mchs] = { md.opId | md in Modes: md.mch == m };
Note that it assumes that you only have one mode per opId,mch.
I want to use a calculated value in the WHERE clause and in an ORDER BY expression. In plain sql it would look like
SELECT some, columns, (some arbitrary math) AS calc_value FROM table WHERE calc_value <= ? ORDER BY calc_value
If I try in JPQL
entitymanager.createQuery("SELECT e, (some arbitrary math) AS calc_value FROM Entity e WHERE calc_value <= :param ORDER BY calc_value", Entity.class);
it fails. Obviously, because the return of the query is the tuple of Entity and calc_value (i.e. Double).
Is there a way of getting this into one query, using strongly typed return values (i.e. Entity.class, as the calculated value doesn't matter)?
I've had a similar problem and didn't manage to fetch the result into the correct object directly:
Tried all constructor combinations for the object - no luck.
Tried Tuple.class - no luck.
Finally I used this approach and then cast o[0] to my real Java object:
TypedQuery<Object[]> q = em.createQuery("select v, (2 * 4) as temp from MyObject v order by temp", Object[].class);
List<Object[]> results = q.getResultList();
for (Object[] o : results) {
// o[0] is the entity, o[1] is the calculated value
MyObject mo = (MyObject) o[0];
}
I have a set of strings. 90% of them are URLs that start with "http://www.". I want to sort them alphabetically.
Currently I use C++ std::sort(), but std::sort is a comparison-based variant of quicksort, and comparing two strings with a long common prefix is not efficient. However, (I think) a radix sort won't work well either, since most strings land in the same bucket because of the long common prefix.
Is there any better algorithm than normal quick-sort/radix-sort for this problem?
I would suspect that the processing time you spend trying to exploit common prefixes on the order of 10 characters per URL doesn't even pay for itself when you consider the average length of URLs.
Just try a completely standard sort. If that's not fast enough, look at parallelizing or distributing a completely standard sort. It's a straightforward approach that will work.
Common Prefixes seem to naturally imply that a trie data structure could be useful. So the idea is to build a trie of all the words and then sort each node. The ordering should be that the children of a particular node reside in a list and are sorted. This can be done easily since at a particular node we need only sort the children, so naturally a recursive solution reveals itself. See this for more inspiration: http://goanna.cs.rmit.edu.au/~jz/fulltext/acsc03sz.pdf
Create two groups: the ones with the prefix and the ones without. For the first set remove the prefix, sort and add the prefix back. For the second set just sort. After that divide the second set into before prefix and after prefix. Now concatenate the three lists (list_2_before, list_1, list_2_after).
Instead of removing and adding prefix for the first list, you can write your own custom code that would start comparing after the prefix (i.e. ignore the prefix part while comparing).
Addendum: If you have multiple common prefixes you can use them further to speed up. It is better to create a shallow tree with very common prefixes and join them.
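The split-and-skip idea above might look roughly like this in C++. It is only a sketch under the assumption of a single, known, non-empty prefix; the function name sort_with_common_prefix is hypothetical, and whether it beats a plain std::sort would have to be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sorts `urls` ascending, skipping `prefix` whenever both strings being
// compared are known to start with it.
void sort_with_common_prefix(std::vector<std::string>& urls, const std::string& prefix) {
    const std::size_t plen = prefix.size();
    // Group 1 (moved to the front): strings starting with the prefix. Group 2: the rest.
    auto mid = std::partition(urls.begin(), urls.end(), [&](const std::string& s) {
        return s.compare(0, plen, prefix) == 0;
    });
    // Group 1: compare only the part after the shared prefix.
    std::sort(urls.begin(), mid, [plen](const std::string& a, const std::string& b) {
        return a.compare(plen, std::string::npos, b, plen, std::string::npos) < 0;
    });
    // Group 2: ordinary sort.
    std::sort(mid, urls.end());
    // Merging the two sorted runs gives the same result as concatenating
    // list_2_before, list_1 and list_2_after.
    std::inplace_merge(urls.begin(), mid, urls.end());
}
```

The merge step does compare full strings, but only a linear number of times, so the prefix is skipped in the O(n log n) part of the work.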
If you figure out the minimum and maximum values in the vector before you start quicksorting, then you always know the range of values for each call to partition(), since the partition value is either the minimum or the maximum (or at least close to the minimum/maximum) of each subrange and the containing partition's minimum and maximum are the other end of each subrange.
If the minimum and the maximum of a subrange share a common prefix, then you can do all of the partition comparisons starting from the character position following the common prefix. As the quicksort progresses, the ranges get smaller and smaller so their common prefixes should get longer and longer, and ignoring them for the comparisons will save more and more time. How much, I don't know; you'd have to benchmark this to see if it actually helps.
In any event, the additional overhead is fairly small: one pass through the vector to find the minimum and maximum strings, costing 1.5 comparisons per string (*), and then one check for each partition to find the maximum shared prefix for the partition. The check is equivalent to a comparison, and it can start from the maximum shared prefix of the containing partition, so it's not even a full string comparison.
(*) The min/max algorithm: Scan the vector two elements at a time. For each pair, first compare them with each other, then compare the smaller one with the running minimum and the larger one with the running maximum. Result: three comparisons for two elements, or 1.5 comparisons per element.
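The pairwise scan can be written down directly. Here is a minimal C++ sketch; the function name min_max_pairwise is an assumption for the example, and it returns indices rather than references to keep ownership simple.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Returns {index of minimum, index of maximum} using about 1.5 comparisons
// per element. Precondition: `v` is not empty.
std::pair<std::size_t, std::size_t> min_max_pairwise(const std::vector<std::string>& v) {
    std::size_t lo = 0, hi = 0;  // element 0 starts as both minimum and maximum
    std::size_t i = 1;
    for (; i + 1 < v.size(); i += 2) {
        std::size_t smaller = i, larger = i + 1;
        if (v[larger] < v[smaller]) std::swap(smaller, larger);  // comparison 1
        if (v[smaller] < v[lo]) lo = smaller;                    // comparison 2
        if (v[hi] < v[larger]) hi = larger;                      // comparison 3
    }
    if (i < v.size()) {  // one element left over without a partner
        if (v[i] < v[lo]) lo = i;
        else if (v[hi] < v[i]) hi = i;
    }
    return {lo, hi};
}
```

In practice std::minmax_element from <algorithm> gives the same comparison bound, so the hand-written loop is mainly useful for seeing where the 1.5 comes from.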
At last I found a Ternary Quick Sort works well. I found the algorithm at www.larsson.dogma.net/qsufsort.c.
Here is my modified implementation, with similar interface to std::sort. It's about 40% faster than std::sort on my machine and dataset.
#include <algorithm> /* std::iter_swap */
#include <cstddef> /* size_t */
#include <iterator>
template<class RandIt> static inline void multiway_qsort(RandIt beg, RandIt end, size_t depth = 0, size_t step_len = 6) {
if(beg + 1 >= end) {
return;
}
struct { /* implement bounded comparing */
inline int operator() (
const typename std::iterator_traits<RandIt>::value_type& a,
const typename std::iterator_traits<RandIt>::value_type& b, size_t depth, size_t step_len) {
for(size_t i = 0; i < step_len; i++) {
if(a[depth + i] == b[depth + i] && a[depth + i] == 0) return 0;
if(a[depth + i] < b[depth + i]) return +1;
if(a[depth + i] > b[depth + i]) return -1;
}
return 0;
}
} bounded_cmp;
RandIt i = beg;
RandIt j = beg + std::distance(beg, end) / 2;
RandIt k = end - 1;
typename std::iterator_traits<RandIt>::value_type key = ( /* median of l,m,r */
bounded_cmp(*i, *j, depth, step_len) > 0 ?
(bounded_cmp(*i, *k, depth, step_len) > 0 ? (bounded_cmp(*j, *k, depth, step_len) > 0 ? *j : *k) : *i) :
(bounded_cmp(*i, *k, depth, step_len) < 0 ? (bounded_cmp(*j, *k, depth, step_len) < 0 ? *j : *k) : *i));
/* 3-way partition */
for(j = i; j <= k; ++j) {
switch(bounded_cmp(*j, key, depth, step_len)) {
case +1: std::iter_swap(i, j); ++i; break;
case -1: std::iter_swap(k, j); --k; --j; break;
}
}
++k;
if(beg + 1 < i) multiway_qsort(beg, i, depth, step_len); /* recursively sort [x < pivot] subset */
if(k + 1 < end) multiway_qsort(k, end, depth, step_len); /* recursively sort [x > pivot] subset */
/* recursively sort [x == pivot] subset with higher depth, unless the strings already ended within this step */
if(i < k) {
size_t t = 0;
while(t < step_len && (*i)[depth + t] != 0) ++t;
if(t == step_len) multiway_qsort(i, k, depth + step_len, step_len);
}
return;
}
The paper Engineering Parallel String Sorting has, surprisingly, benchmarks of a large number of single-threaded algorithms on a URL dataset (see page 29). Rantala's variant of multikey quicksort with caching comes out ahead; you can test multikey_cache8 in the repository cited in the paper.
I've tested the dataset in that paper, and if it's any indication you're seeing barely one bit of entropy in the first ten characters, and distinguishing prefixes in the 100-character range. Doing 100 passes of radix sort will thrash the cache with little benefit, e.g. sorting a million URLs means you're looking for ~20 distinguishing bits in each key at a cost of ~100 cache misses.
While in general radix sort will not perform well on long strings, the optimizations that Kärkkäinen and Rantala describe in Engineering Radix Sort for Strings are sufficient for the URL dataset. In particular they read ahead 8 characters and cache them with the string pointers; sorting on these cached values yields enough entropy to get past the cache-miss problem.
For longer strings try some of the LCP-based algorithms in that repository; in my experience the URL dataset is about at the break-even point between highly-optimized radix-type sorts and LCP-based algorithms which do asymptotically better on longer strings.
I have a memory address pool with 1024 addresses. There are 16 threads running inside a program which access these memory locations, doing either read or write operations. The output of this program is a series of quadruples defined like this:
Quadruple q1 : (Thread no, Memory address, read/write, time)
e.g. q1 = (12,578,r,2t), q2 = (16,578,w,6t)
I want to design a program which takes the stream of quadruples as input and reports all the conflicts which occur if more than 2 threads try to access the same memory resource inside an interval of 5t secs with at least one write operation.
I have several solutions in mind but I am not sure if they are the best ones to address this problem. I am looking for a solution from a design and data structure perspective.
So the basic problem here is collision detection. I would generally look for a solution where elements are added to some kind of associative collection. As a new element is about to be added, you need to be able to tell whether the collection already contains a similar element, indicating a collision. Here you would seem to need a collection type that allows for duplicate elements, such as the STL multimap.
The quadruple would obviously be the value type in the associative collection, and the key type would contain the data necessary to determine whether two elements represent a collision, i.e. memory address and time. In order to use a standard associative collection like STL multimap, you need to define some ordering on the keys by defining operator< for the key type (I'm assuming C++ here; you didn't specify).
You define a collision as two elements where the memory location is identical and the time values differ by less than some threshold amount. The ordering of the key type has to be such that two keys that represent a collision come out as equivalent under the ordering. Equivalence under the < operator means that a < b is false and b < a is false as well, so the ordering might be defined by this operator:
bool operator<( Key const& a, Key const& b ) {
if ( a.address == b.address ) {
if ( abs(a.time - b.time) < threshold ) {
return false;
}
return a.time < b.time;
}
return a.address < b.address;
}
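There is a problem with this design, due to the fact that two keys may be equivalent under < without being equal. This means that two different but similar quadruples, i.e. two values that collide with one another, would be stored under the same key in the collection. The equivalence is also not transitive (with a threshold of 5, times 0 and 4 are equivalent, as are 4 and 8, but 0 and 8 are not), so this operator is not the strict weak ordering that the standard associative containers require. You could use a simpler definition of the ordering: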
bool operator<( Key const& a, Key const& b ) {
if ( a.address == b.address ) {
return a.time < b.time;
}
return a.address < b.address;
}
Under this ordering definition, colliding elements end up adjacent in an ordered associative container (but under different keys), so you'd be able to find them easily in a post-processing step after they have all been added to the collection.
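The post-processing step could be sketched as follows. This is one possible reading of the requirements, not a definitive design: it assumes a conflict is any pair of accesses to the same address by different threads, less than `window` time units apart, with at least one write. The Access struct and find_conflicts function are names invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>
#include <vector>

struct Access {
    int thread;
    int address;
    bool is_write;
    long time;  // in units of t
};

// Sorts by (address, time) and then scans forward from each access while the
// following ones are on the same address and inside the time window.
std::vector<std::pair<Access, Access>> find_conflicts(std::vector<Access> log, long window) {
    std::sort(log.begin(), log.end(), [](const Access& a, const Access& b) {
        return std::tie(a.address, a.time) < std::tie(b.address, b.time);
    });
    std::vector<std::pair<Access, Access>> conflicts;
    for (std::size_t i = 0; i < log.size(); ++i) {
        for (std::size_t j = i + 1;
             j < log.size() && log[j].address == log[i].address &&
             log[j].time - log[i].time < window;
             ++j) {
            const bool different_threads = log[i].thread != log[j].thread;
            const bool has_write = log[i].is_write || log[j].is_write;
            if (different_threads && has_write) conflicts.emplace_back(log[i], log[j]);
        }
    }
    return conflicts;
}
```

With the two quadruples from the question and a window of 5, the read at 2t and the write at 6t on address 578 are reported as one conflicting pair. For a true stream, the same scan could run over a per-address queue that drops entries older than the window instead of sorting everything up front.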