Offsets in downcast_iter and series slicing in Polars

Offsets in downcast_iter and series slicing in Polars - rust

In Polars, I'm seeing a return result different than what I would expect when using slicing with series and trying to get the offsets.
I'm creating a Series, then slicing it:
// Make a vec of 3 items, called foo, bar baz
let string_values: Vec<&str> = vec!["foo", "bar", "baz"];
// Add it to a series, this is without dataframes
let series = Series::new("string_values", string_values);
//shape: (3,)
// Series: 'string_values' [str]
// [
// "foo"
// "bar"
// "baz"
// ]
println!("{:?}", series);
This returns a new series.
I can then using downcast_iter() to get the offsets:
// Now we should be able to downcast iter to get the offsets.
// returns [0, 3, 6, 9]
// 0-3 = foo
// 3-6 = bar
// 6-9 = baz
series.utf8().unwrap().downcast_iter().for_each(|array| {
println!("{:?}", array.offsets());
});
Great so far.
I then slice it:
//shape: (2,)
// Series: 'string_values' [str]
// [
// "bar"
// "baz"
// ]
let series_slice = series.slice(1, 2);
println!("{:?}", series_slice);
This returns the correct values.
I then try and use downcast_iter() again:
// Now we should be able to downcast iter to get the offsets for the slice.
// This returns [3, 6, 9]
// Is "foo" still referenced?
series_slice.utf8().unwrap().downcast_iter().for_each(|array| {
println!("{:?}", array.offsets());
});
It returns 3, 6, 9. Why is 9 returned? The length of the series is 6.

Buffers in arrow can be shared. Besides the data they also have an offset and a length.
You original arrow string array contains of the following data:
data: foobarbaz
offsets: 0, 3, 6, 9
offset: 0
length: 3
Retrieving element i uses the following algorithm in pseudocode:
let offset = array.offset
let start_index = offsets[offset + i]
let end_index = offsets[offset + i + 1]
let string_value = data[start_index..end_index]
When you slice an array, we don't copy any data. We only update the offset and the length such that we have all information to represent the sliced array:
data: foobarbaz
offsets: 0, 3, 6, 9
offset: 1
length: 2

Related

Is there an efficient function in Rust that finds the index of the first occurrence of a value in a sorted vector?

In [3, 2, 1, 1, 1, 0], if the value we are searching for is 1, then the function should return 2.
I found binary search, but it seems to return the last occurrence.
I do not want a function that iterates over the entire vector and matches one by one.

binary_search assumes that the elements are sorted in less-to-greater order. Yours is reversed, so you can use binary_search_by:
let x = 1; //value to look for
let data = [3,2,1,1,1,0];
let idx = data.binary_search_by(|probe| probe.cmp(x).reverse());
Now, as you say, you do not get the first one. That is expected, for the binary search algorithm will select an arbitrary value equal to the one searched. From the docs:
If there are multiple matches, then any one of the matches could be returned.
That is easily solvable with a loop:
let mut idx = data.binary_search_by(|probe| probe.cmp(&x).reverse());
if let Ok(ref mut i) = idx {
while x > 0 {
if data[*i - 1] != x {
break;
}
*i -= 1;
}
}
But if you expect many duplicates that may negate the advantages of the binary search.
If that is a problem for you, you can try to be smarter. For example, you can take advantage of this comment in the docs of binary_search:
If the value is not found then Result::Err is returned, containing the index where a matching element could be inserted while maintaining sorted order.
So to get the index of the first value with a 1 you look for an imaginary value just between 2 and 1 (remember that your array is reversed), something like 1.5. That can be done hacking a bit the comparison function:
let mut idx = data.binary_search_by(|probe| {
//the 1s in the slice are greater than the 1 in x
probe.cmp(&x).reverse().then(std::cmp::Greater)
});
There is a handy function Ordering::then() that does exactly what we need (the Rust stdlib is amazingly complete).
Or you can use a simpler direct comparison:
let idx = data.binary_search_by(|probe| {
use std::cmp::Ordering::*;
if *probe > x { Less } else { Greater }
});
The only detail left is that this function will always return Err(i), being i either the position of the first 1 or the position where the 1 would be if there are none. An extra comparison is necessary so solve this ambiguity:
if let Err(i) = idx {
//beware! i may be 1 past the end of the slice
if data.get(i) == Some(&x) {
idx = Ok(i);
}
}

Since 1.52.0, [T] has the method partition_point to find the partition point with a predicate in O(log N) time.
In your case, it should be:
let xs = vec![3, 2, 1, 1, 1, 0];
let idx = xs.partition_point(|&a| a > 1);
if idx < xs.len() && xs[idx] == 1 {
println!("Found first 1 idx: {}", idx);
}

Create comma delimited datapair string from numeric arrays in MATLAB

I have two arrays of values:
t = [0; 1; 2];
q = [0; 100; 200];
I need those to be one string that's like:
str = '0, 0, 1, 100, 2, 200';
I can't see a nice way to do it in MATLAB (R2017a) without using a loop. I'd like to avoid that if possible as there's a pretty large array of values and a lot of files and it'll take forever.
Any ideas?

Combine compose with strjoin:
t = [0; 1; 2];
q = [0; 100; 200];
str = strjoin(compose('%d', [t(:)'; q(:)']), ', ');
Output:
str =
'0, 0, 1, 100, 2, 200'
For non integer numbers, use: %f instead of %d

Here's a possible approach. This works for integer numbers, or if you want a fixed number of decimals in the string representation:
t = [0; 1; 2];
q = [0; 100; 200];
tq = reshape([t(:).'; q(:).'], 1, []);
s = sprintf('%i, ',tq); % or change '%i' to something like '%.5f'
s = s(1:end-2)
Result:
s =
'0, 0, 1, 100, 2, 200'
If you have non-integer numbers and want the number of decimals in the representation to be chosen automatically, you can use mat2str instead of sprintf, but then you need to deal with the spaces using regexpre or a similar function:
t = [0; 1; 2];
q = [0; 100; 200];
tq = reshape([t(:).'; q(:).'], 1, [])
s = regexprep(num2str(tq), '\s+', ', ');

Get all possible sums from a list of numbers

Let's say I have a list of numbers: 2, 2, 5, 7
Now the result of the algorithm should contain all possible sums.
In this case: 2+2, 2+5, 5+7, 2+2+5, 2+2+5+7, 2+5+7, 5+7
I'd like to achieve this by using Dynamic Programming. I tried using a matrix but so far I have not found a way to get all the possibilities.

Based on the question, I think that the answer posted by AT-2016 is correct, and there is no solution that can exploit the concept of dynamic programming to reduce the complexity.
Here is how you can exploit dynamic programming to solve a similar question that asks to return the sum of all possible subsequence sums.
Consider the array {2, 2, 5, 7}: The different possible subsequences are:
{2},{2},{5},{7},{2,5},{2,5},{5,7},{2,5,7},{2,5,7},{2,2,5,7},{2,2},{2,7},{2,7},{2,2,7},{2,2,5}
So, the question is to find the sum of all these elements from all these subsequences. Dynamic Programming comes to the rescue!!
Arrange the subsequences based on the ending element of each subsequence:
subsequences ending with the first element: {2}
subsequences ending with the second element: {2}, {2,2}
subsequences ending with the third element: {5},{2,5},{2,5},{2,2,5}
subsequences ending with the fourth element: {7},{5,7},{2,7},{2,7},{2,2,7},{2,5,7},{2,5,7},{2,2,5,7}.
Here is the code snippet:
The array 's[]' calculates the sums for 1,2,3,4 individually, that is, s[2] calculates the sum of all subsequences ending with third element. The array 'dp[]' calculates the overall sum till now.
s[0]=array[0];
dp[0]=s[0];
k = 2;
for(int i = 1; i < n; i ++)
{
s[i] = s[i-1] + k*array[i];
dp[i] = dp[i-1] + s[i];
k = k * 2;
}
return dp[n-1];

This is done in C# and in an array to find the possible sums that I used earlier:
static void Main(string[] args)
{
//Set up array of integers
int[] items = { 2, 2, 5, 7 };
//Figure out how many bitmasks is needed
//4 bits have a maximum value of 15, so we need 15 masks.
//Calculated as: (2 ^ ItemCount) - 1
int len = items.Length;
int calcs = (int)Math.Pow(2, len) - 1;
//Create array of bitmasks. Each item in the array represents a unique combination from our items array
string[] masks = Enumerable.Range(1, calcs).Select(i => Convert.ToString(i, 2).PadLeft(len, '0')).ToArray();
//Spit out the corresponding calculation for each bitmask
foreach (string m in masks)
{
//Get the items from array that correspond to the on bits in the mask
int[] incl = items.Where((c, i) => m[i] == '1').ToArray();
//Write out the mask, calculation and resulting sum
Console.WriteLine(
"[{0}] {1} = {2}",
m,
String.Join("+", incl.Select(c => c.ToString()).ToArray()),
incl.Sum()
);
}
Console.ReadKey();
}
Possible outputs:
[0001] 7 = 7
[0010] 5 = 5
[0011] 5 + 7 = 12
[0100] 2 = 2

This is not an answer to the question because it does not demonstrate the application of dynamic programming. Rather it notes that this problem involves multisets, for which facilities are available in Sympy.
>>> from sympy.utilities.iterables import multiset_combinations
>>> numbers = [2,2,5,7]
>>> sums = [ ]
>>> for n in range(2,1+len(numbers)):
... for item in multiset_combinations([2,2,5,7],n):
... item
... added = sum(item)
... if not added in sums:
... sums.append(added)
...
[2, 2]
[2, 5]
[2, 7]
[5, 7]
[2, 2, 5]
[2, 2, 7]
[2, 5, 7]
[2, 2, 5, 7]
>>> sums.sort()
>>> sums
[4, 7, 9, 11, 12, 14, 16]

I have a solution that can print a list of all possible subset sums.
Its not dynamic programming(DP) but this solution is faster than the DP approach.
void solve(){
ll i, j, n;
cin>>n;
vector<int> arr(n);
const int maxPossibleSum=1000000;
for(i=0;i<n;i++){
cin>>arr[i];
}
bitset<maxPossibleSum> b;
b[0]=1;
for(i=0;i<n;i++){
b|=b<<arr[i];
}
for(i=0;i<maxPossibleSum;i++){
if(b[i])
cout<<i<<endl;
}
}
Input:
First line has the number of elements N in the array.
The next line contains N space-separated array elements.
4
2 2 5 7
----------
Output:
0
2
4
5
7
9
11
12
14
16
The time complexity of this solution is O(N * maxPossibleSum/32)
The space complexity of this solution is O(maxPossibleSum/8)

Reorder string characters in Swift

So, let's say I have a String that is: "abc" and I want to change each character position so that I can have "cab" and later "bca". I want the character at index 0 to move to 1, the one on index 1 to move to 2 and the one in index 2 to 0.
What do I have in Swift to do this? Also, let's say instead of letters I had numbers. Is there any easier way to do it with integers?

Swift 2:
extension RangeReplaceableCollectionType where Index : BidirectionalIndexType {
mutating func cycleAround() {
insert(removeLast(&self), atIndex: startIndex)
}
}
var ar = [1, 2, 3, 4]
ar.cycleAround() // [4, 1, 2, 3]
var letts = "abc".characters
letts.cycleAround()
String(letts) // "cab"
Swift 1:
func cycleAround<C : RangeReplaceableCollectionType where C.Index : BidirectionalIndexType>(inout col: C) {
col.insert(removeLast(&col), atIndex: col.startIndex)
}
var word = "abc"
cycleAround(&word) // "cab"

In the Swift Algorithms package there is a rotate command
import Algorithms
let string = "abcde"
var stringArray = Array(string)
for _ in 0..<stringArray.count {
stringArray.rotate(toStartAt: 1)
print(String(stringArray))
}
Result:
bcdea
cdeab
deabc
eabcd
abcde

Longest Substring Pair Sequence is it Longest Common Subsequence or what?

I have a pair of strings, for example: abcabcabc and abcxxxabc and a List of Common Substring Pairs (LCSP), in this case LCSP is 6 pairs, because three abc in the first string map to two abc in the second string. Now I need to find the longest valid (incrementing) sequence of pairs, in this case there are three equally long solutions: 0:0,3:6; 0:0,6:6; 3:0,6:6 (those numbers are starting positions of each pair in the original strings, the length of substrings is 3 as length of "abc"). I would call it the Longest Substring Pair Sequence or LSPQ. (Q is not to confuse String and Sequence)
Here is the LCSP for this example:
LCSP('abcabcabc', 'abcxxxabc') =
[ [ 6, 6, 3 ],
[ 6, 0, 3 ],
[ 3, 6, 3 ],
[ 0, 6, 3 ],
[ 3, 0, 3 ],
[ 0, 0, 3 ] ]
LSPQ(LCSP('abcabcabc', 'abcxxxabc'), 0, 0, 0) =
[ { a: 0, b: 0, size: 3 }, { a: 3, b: 6, size: 3 } ]
Now I find it with brute force recursively trying all combinations. So I am limited to about 25 pairs, otherwise it is unpractical. Size=[10,15,20,25,26,30], Time ms = [0,15,300,1000,2000,19000]
Is there a way to do that in linear time or at least not quadratic complexity so that longer input LCSP (List of Common Substring Pairs) could be used.
This problem is similar to the "Longest Common Subsequence", but not exactly it, because the input is not two strings but a list of common substrings sorted by their length. So I do not know where to look for an existing solutions or even if they exist.
Here is my particular code (JavaScript):
function getChainSize(T) {
var R = 0
for (var i = 0; i < T.length; i++) R += T[i].size
return R
}
function LSPQ(T, X, Y, id) {
// X,Y are first unused character is str1,str2
//id is current pair
function findNextPossible() {
var x = id
while (x < T.length) {
if (T[x][0] >= X && T[x][1] >= Y) return x
x++
}
return -1
}
var id = findNextPossible()
if (id < 0) return []
var C = [{a:T[id][0], b:T[id][1], size:T[id][2] }]
// with current
var o = T[id]
var A = C.concat(LSPQ(T, o[0]+o[2], o[1]+o[2], id+1))
// without current
var B = LSPQ(T, X, Y, id+1)
if (getChainSize(A) < getChainSize(B)) return B
return A
}

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Offsets in downcast_iter and series slicing in Polars - rust

Related

Is there an efficient function in Rust that finds the index of the first occurrence of a value in a sorted vector?

Create comma delimited datapair string from numeric arrays in MATLAB

Get all possible sums from a list of numbers

Reorder string characters in Swift

Longest Substring Pair Sequence is it Longest Common Subsequence or what?

Categories

Resources