Selecting arbitrary rows from a Neo matrix in Nim? - nim-lang

I am using the Neo library for linear algebra in Nim, and I would like to extract arbitrary rows from a matrix.
I can explicitly select a continuous sequence of rows as per the examples in the README, but can't select a disjoint subset of rows.
import neo
let x = randomMatrix(10, 4)
let some_rows = @[1,3,5]
echo x[2..4, All] # works fine
echo x[some_rows, All] ## error

The first echo works because you are creating a Slice object, for which neo defines an indexing proc. The second echo uses a sequence of integers, and that kind of access is not defined in the neo library. Unfortunately, a Slice describes a contiguous closed range; you can't even specify a step to iterate in increments larger than one, so slices cannot accomplish what you want.
Looking at the structure of a Matrix, it seems to be highly optimised to avoid copying data. Matrix transformation operations seem to reuse the data of the previous matrix and only change the access pattern and dimensions. As such, a transformation that selects arbitrary rows would not be possible without copying: the indexes in your example access non-contiguous data, and this would need to be encoded somehow in the new structure. Plus, if you wrote @[1,5,3], that would defeat any kind of normal iterative looping.
An alternative, of course, is to write a proc that accepts a sequence instead of a slice and builds a new matrix by copying data from the old one. This implies a performance penalty, but if you think it would be a good addition to the library, please request it in the project's issue tracker. If it is not accepted, you will need to write such a proc yourself for use in your own programs.
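The copy-based approach can be sketched as follows. This is an illustration in Python rather than Nim, and it does not use the neo API: the matrix is assumed to be a plain list of rows, and select_rows is a hypothetical helper name.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def select_rows(matrix: Matrix, rows: Sequence[int]) -> Matrix:
    """Return a new matrix holding copies of the requested rows, in the order given."""
    n = len(matrix)
    for r in rows:
        if not 0 <= r < n:
            raise IndexError(f"row {r} out of range for a matrix with {n} rows")
    # Each row is copied, so the result shares no storage with the input.
    return [list(matrix[r]) for r in rows]


# A 10 x 4 matrix whose entries encode their own position (row * 10 + column).
x = [[float(10 * i + j) for j in range(4)] for i in range(10)]
some_rows = [1, 3, 5]
picked = select_rows(x, some_rows)
```

Because the rows are copied, the order and repetition of indexes is unrestricted (for example [1, 5, 3]), at the cost of the extra allocation mentioned above.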

Related

How to implement efficient string interning in f#?

What is the best way to implement a custom string type in F# for interning strings? I have to read large CSV files into memory. Since most of the columns are categorical, values repeat, so it makes sense to create a new string the first time a value is encountered and only refer to it on subsequent occurrences to save memory.
In C# I do this by creating a global intern pool (a concurrent dictionary) and, before setting a value, looking up whether it already exists in the dictionary. If it exists, I just point to the string already in the dictionary. If not, I add it to the dictionary and set the value to the string just added.
I am new to F# and wondering what the best way to do this in F# is. I will be using the new string type in records, named tuples, etc., and it will have to work with concurrent processes.
Edit:
String.Intern uses the Intern Pool. My understanding is that it is not very efficient for large pools and is not garbage collected, i.e. any/all interned strings will remain in the intern pool for the lifetime of the app. Imagine an application where you read a file, perform some operations and write data. Using the Intern Pool solution will probably work. Now imagine you have to do the same 100 times and the strings in each file have little in common. If the memory is allocated on the heap, then after processing each file we can force the garbage collector to clear unnecessary strings.
I should have mentioned I could not really figure out how to do the C# approach in F# (other than implementing a C# type and using it in F#).
The memoization pattern is slightly different from what I am looking for. We are not caching calculated results; we are ensuring each string object is created no more than once and that all subsequent creations of the same string are just references to the original. Using a dictionary is one way to do this and using String.Intern is another.
Sorry if I am missing something obvious here.
I have a few things to say, so I'll post them as an answer.
First, I guess String.Intern works just as well in F# as in C#.
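open System
open System.Text
// A string literal is interned by the runtime; a string built at run time is not.
let x = "abc"
let y = StringBuilder("a").Append("bc").ToString()
printfn "1 : %A" (LanguagePrimitives.PhysicalEquality x y) // false
// String.Intern returns the pooled instance, i.e. the same object as the literal.
let y2 = String.Intern y
printfn "2 : %A" (LanguagePrimitives.PhysicalEquality x y2) // true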
Second, are you using a dictionary in combination with String.Intern in your C# solution? If so, why not just do s = String.Intern(s); after the string is ready following input from file?
To create a type for use in your business domain to handle string deduplication in general is a very bad idea. You don't want your business domain polluted by that kind of low level stuff.
As for rolling your own: I did that some years ago, probably to avoid the problem you mentioned with the strings not being garbage collected, but I never tested whether that actually was a problem.
It might be a good idea to use a dictionary (or something) for each column (or type of column) where the same values are likely to repeat in great numbers. (This is pretty much what you said already.)
It makes sense to only keep these dictionaries live while you read the information from file, and stuff it into internal data structures. You might be thinking that you need the dictionaries for subsequent reads, but I am not so sure about that.
The important thing is to deduplicate the great majority of strings, and not necessarily every single duplicate. Because of this you can greatly simplify the solution as indicated. You most probably have nothing to gain by overcomplicating your solution to squeeze out the last fraction of memory savings.
Releasing the dictionaries after the file is read and the structures are filled has the advantage of not holding on to strings when they are no longer really needed. And of course you save memory by not holding on to the dictionaries.
I see no need to handle concurrency issues in the implementation here. String.Intern must necessarily be immune to concurrency issues. If you roll your own with the design suggested, you would not use it concurrently. Each file being read would have its own set of dictionaries for its columns.
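The per-column dictionary design suggested above could look roughly like this. It is a sketch in Python rather than F#, assuming every row has exactly n_cols fields; ColumnInterner and dedupe_rows are hypothetical names, not part of any library.

```python
from typing import Dict, List


class ColumnInterner:
    """Pool for one column: maps each distinct value to the first instance seen."""

    def __init__(self) -> None:
        self._pool: Dict[str, str] = {}

    def intern(self, value: str) -> str:
        # setdefault stores value on first sight and returns the stored
        # instance on every later lookup of an equal string.
        return self._pool.setdefault(value, value)


def dedupe_rows(rows: List[List[str]], n_cols: int) -> List[List[str]]:
    """Replace repeated values in each column by references to one instance."""
    pools = [ColumnInterner() for _ in range(n_cols)]
    result = [[pools[i].intern(v) for i, v in enumerate(row)] for row in rows]
    # The pools are local, so they are released as soon as the file is processed.
    return result


# Build equal strings at run time so they start out as separate objects.
raw = [["".join(["re", "d"]), "x"], ["".join(["r", "ed"]), "y"]]
deduped = dedupe_rows(raw, 2)
```

Since each file gets its own set of pools inside one call, nothing is shared between concurrent readers and no locking is involved.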

Clustering string data with ELKI

I need to cluster a large number of strings using ELKI based on the Edit Distance / Levenshtein Distance. Since the data set is too large, I'd like to avoid file based precomputed distance matrices. How can I
(a) load string data in ELKI from a file (only "Labels")?
(b) implement a distance function accessing the labels (extend AbstractDBIDDistanceFunction, but how to get the labels?)
Some code snippets or example input files would be helpful.
It's actually pretty straightforward:
A) Write a parser that is adequate for your input file format (why try to reuse a parser written for numerical vectors with labels?), probably subclassing AbstractStreamingParser, producing a relation of the desired data type. You can probably just use String; if you want to be a bit more general, TokenSequence may be a more appropriate concept for these distances. Strings are just the simplest case.
B) Implement a DistanceFunction based on this data type instead of DBIDs, i.e. a PrimitiveDistanceFunction<String>. Again, subclassing AbstractPrimitiveDistanceFunction may be the easiest thing to do.
For performance reasons, you may also want to look into indexing algorithms to retrieve e.g. the k most similar strings efficiently. I'm not sure which index structures exist for string edit distance and Levenshtein distance.
A colleague has a student that apparently has some working token edit distances, but I have not seen or reviewed the code yet. As he is processing log files, he will probably be using a token based approach instead of characters.
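For reference, the distance itself is small. The following is a sketch of the standard dynamic-programming Levenshtein distance in Python, not ELKI code; inside ELKI the same logic would go into the distance method of the primitive distance function described in B).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs for insertion, deletion and substitution."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")
```

Only two rows of the table are kept, so memory is proportional to the shorter string; the running time is still the product of the two lengths, which is why an index matters for large data sets.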

Best approach to read and write large files with collective MPI-IO

I would like to read and write large data sets in Fortran using MPI-IO. My preferred approach would be to use an MPI type defined with MPI_Type_create_subarray with a single dimension to describe the view of each process on the file. My Fortran code thus looks like this:
! A contiguous type to describe the vector per element.
! MPI_TYPE_CONTIGUOUS(COUNT, OLDTYPE, NEWTYPE, IERROR)
call MPI_Type_contiguous(nComponents, rk_mpi, &
& me%vectype, iError)
call MPI_Type_commit( me%vectype, iError )
! A subarray to describe the view of this process on the file.
! MPI_TYPE_CREATE_SUBARRAY(ndims, array_of_sizes, array_of_subsizes,
! array_of_starts, order, oldtype, newtype, ierror)
call MPI_Type_create_subarray( 1, [ globElems ], [ locElems ], &
& [ elemOff ], MPI_ORDER_FORTRAN, &
& me%vectype, me%ftype, iError)
However, array_of_sizes and array_of_starts, which describe global quantities, are just "normal" integers in the MPI interface. Thus there is a limit of about 2 billion elements with this approach.
Is there another interface, which uses MPI_OFFSET_KIND for these global values?
The only workaround I see so far is to use the displacement option in MPI_File_set_view instead of defining the view with the help of the subarray MPI type. However, this "feels" wrong. Would you expect a performance impact in either approach for collective IO? Does anybody know if this interface will change in MPI-3?
Maybe I should use some other MPI type?
What is the recommended solution here to write large data files with collective IO efficiently in parallel to disk?
Help is coming.
In MPI-3, there will be datatype manipulation routines that use MPI_Count instead of an int. For backwards compatibility (groan) the existing routines won't change, but you should be able to make your type.
But for now..
For subarray in particular, though, this isn't usually thought of as a huge issue at the moment: even for a 2d array, indices of 2 billion give you an array size of 4×10^18, which is admittedly pretty large (but exactly the sort of numbers targeted for exascale-type computing). In higher dimensions, it's even larger.
In 1d, though, a list of numbers 2 billion long is only ~8GB, which isn't by any stretch big data, and I think that's the situation you find yourself in. My suggestion would be to leave it in the form you have it now for as long as you can. Is there a common factor in the local elements? You can work around this by bundling up the types in units of (say) 10 vectypes if that works; for your code it shouldn't matter, but it would reduce the numbers in locElems and globElems by that same factor. Otherwise, yes, you could always use the displacement field in the file set view.
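The arithmetic behind the bundling idea can be checked with a few lines. This is a Python sketch, not MPI code: scale_by_bundle is a hypothetical helper, and it assumes the element count and offset of every process are multiples of the chosen bundle size, with the contiguous type then created with nComponents times the bundle size as its count.

```python
INT32_MAX = 2**31 - 1  # largest value a default 4-byte integer argument can hold


def scale_by_bundle(glob_elems: int, loc_elems: int, elem_off: int, bundle: int):
    """Express the subarray sizes and start in units of `bundle` elements."""
    for name, value in (("glob_elems", glob_elems),
                        ("loc_elems", loc_elems),
                        ("elem_off", elem_off)):
        if value % bundle != 0:
            raise ValueError(f"{name}={value} is not a multiple of {bundle}")
    scaled = (glob_elems // bundle, loc_elems // bundle, elem_off // bundle)
    if max(scaled) > INT32_MAX:
        raise OverflowError("bundle too small: scaled values still exceed int32")
    return scaled


# 6 billion elements overall do not fit into a 4-byte integer ...
glob_elems, loc_elems, elem_off = 6_000_000_000, 1_500_000_000, 3_000_000_000
# ... but in units of 10 elements all three arguments do.
scaled = scale_by_bundle(glob_elems, loc_elems, elem_off, bundle=10)
```

If the counts of some process are not divisible by any useful factor, this route is closed and the displacement argument of the file view remains the fallback.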

Update the quantile for a dataset when a new datapoint is added

Suppose I have a list of numbers and I've computed the q-quantile (using Quantile).
Now a new datapoint comes along and I want to update my q-quantile, without having stored the whole list of previous datapoints.
What would you recommend?
Perhaps it can't be done exactly without, in the worst case, storing all previous datapoints.
In that case, can you think of something that would work well enough?
One idea I had, if you can assume normality, is to use the inverse CDF instead of the q-quantile.
Keep track of the sample mean and sample variance as you go and then you can compute InverseCDF[NormalDistribution[sampleMean, Sqrt[sampleVariance]], q] (NormalDistribution takes the standard deviation, not the variance, as its second argument), which should be the value such that a fraction q of the values are smaller, which is what the q-quantile is.
(I see belisarius was thinking along the same lines.
Here's the link he pointed to: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm )
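A sketch of this idea in Python rather than Mathematica, assuming the data really are close to normal: the mean and variance are updated with Welford's online algorithm (the one described on the linked page), and the quantile comes from the inverse CDF in the standard library. RunningNormalQuantile is a hypothetical class name.

```python
import math
from statistics import NormalDist


class RunningNormalQuantile:
    """Tracks mean and variance online; estimates quantiles assuming normality."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x: float) -> None:
        # Welford's update: no past datapoints need to be stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        if self.n < 2:
            raise ValueError("need at least two datapoints")
        return self._m2 / (self.n - 1)  # unbiased sample variance

    def quantile(self, q: float) -> float:
        # NormalDist expects the standard deviation, hence the square root.
        return NormalDist(self.mean, math.sqrt(self.variance)).inv_cdf(q)


estimator = RunningNormalQuantile()
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
for value in data:
    estimator.add(value)
median_estimate = estimator.quantile(0.5)
```

Only three numbers are stored regardless of how many datapoints arrive; the price is that the answer is only as good as the normality assumption.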
Unless you know that your underlying data comes from some distribution, it is not possible to update arbitrary quantiles without retaining the original data. You can, as others suggested, assume that the data has some sort of distribution and store the quantiles this way, but this is a rather restrictive approach.
Alternatively, have you thought of programming this somewhere besides Mathematica? For example, you could create a class for your datapoints that contains (1) the Double value and (2) some timestamp for when the data came in. In a SortedList of these datapoint objects (which compares based on value), you could get the quantile very fast by simply referencing the index of the datapoint. Want to get a historical quantile? Simply filter on the timestamps in your sorted list.
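The sorted-container idea can be sketched like this in Python. It keeps all values, omits the timestamps for brevity, and uses the nearest-rank definition of the quantile, which may differ from the interpolation a particular Quantile function applies; SortedQuantile is a hypothetical name.

```python
import bisect
import math


class SortedQuantile:
    """Keeps every value in sorted order so a quantile is a single index lookup."""

    def __init__(self) -> None:
        self._values = []

    def add(self, x: float) -> None:
        # insort finds the position by binary search and inserts in place.
        bisect.insort(self._values, x)

    def quantile(self, q: float) -> float:
        if not self._values:
            raise ValueError("no datapoints yet")
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must lie in [0, 1]")
        # Nearest-rank rule: the smallest value with at least a fraction q at or below it.
        index = max(0, math.ceil(q * len(self._values)) - 1)
        return self._values[index]


tracker = SortedQuantile()
for value in [7, 1, 10, 3, 5, 9, 2, 8, 4, 6]:
    tracker.add(value)
median = tracker.quantile(0.5)
```

The lookup is constant time, but each insertion into a Python list is linear in the number of stored values, and memory grows with the data.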

Identifying frequent formulas in a codebase

My company maintains a domain-specific language that syntactically resembles the Excel formula language. We're considering adding new builtins to the language. One way to do this is to identify verbose commands that are repeatedly used in our codebase. For example, if we see people always write the same 100-character command to trim whitespace from the beginning and end of a string, that suggests we should add a trim function.
Seeing a list of frequent substrings in the codebase would be a good start (though sometimes the frequently used commands differ by a few characters because of different variable names used).
I know there are well-established algorithms for doing this, but first I want to see if I can avoid reinventing the wheel. For example, I know this concept is the basis of many compression algorithms, so is there a compression module that lets me retrieve the dictionary of frequent substrings? Any other ideas would be appreciated.
The string matching is just the low hanging fruit, the obvious cases. The harder cases are where you're doing similar things but in different order. For example suppose you have:
X+Y
Y+X
Your string matching approach won't realize that those are effectively the same. If you want to go a bit deeper, I think you need to parse the formulas into an AST and actually compare the ASTs. If you did that, you could see that the trees are actually the same, since the binary operator '+' is commutative.
You could also apply reduction rules so you could evaluate complex functions into simpler ones, for example:
(X * A) + ( X * B)
X * ( A + B )
Those are also the same! String matching won't help you there.
Parse into AST
Reduce and Optimize the functions
Compare the resulting AST to other ASTs
If you find a match then replace them with a call to a shared function.
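The parse-and-compare steps can be sketched in Python. The sketch assumes the formulas are simple arithmetic expressions that Python's own ast module can parse; a real implementation would use the parser of your formula language. It only normalizes the operand order of commutative operators and does not apply reduction rules such as the distributive law. canonical and canonical_form are hypothetical names.

```python
import ast

COMMUTATIVE = (ast.Add, ast.Mult)


def canonical(node: ast.AST) -> str:
    """Render an expression tree as a string with commutative operands sorted."""
    if isinstance(node, ast.BinOp):
        left, right = canonical(node.left), canonical(node.right)
        if isinstance(node.op, COMMUTATIVE):
            # Operand order does not matter, so pick one fixed order.
            left, right = sorted([left, right])
        return f"({type(node.op).__name__} {left} {right})"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise ValueError(f"unsupported syntax: {type(node).__name__}")


def canonical_form(formula: str) -> str:
    return canonical(ast.parse(formula, mode="eval").body)


same = canonical_form("X + Y") == canonical_form("Y + X")
```

Counting how often each canonical form occurs across the codebase then gives the frequency list. Chains such as X + Y + Z would additionally need flattening before sorting, which this sketch leaves out.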
I would think you could use an existing full-text indexer like Lucene, and implement your own Analyzer and Tokenizer that is specific to your formula language.
You then would be able to run queries, and be able to see the most used formulas, which ones appear next to each other, etc.
Here's a quick article to get you started:
Lucene Analyzer, Tokenizer and TokenFilter
You might want to look into tag-cloud generators. I couldn't find any source in the minute that I spent looking, but here's an online one:
http://tagcloud.oclc.org/tagcloud/TagCloudDemo which probably won't work since it uses spaces as delimiters.

Resources