Is it possible to store boxed arrays as memory-mapped files in J?

I have data that works well in J files and key files, but for possible performance reasons memory-mapped files may make more sense. At one time J did have memory-mapped files for boxed arrays, but I think that feature was discontinued. I think JD has this feature now.
What are the best options for memory mapping boxed arrays in J?
Arrays with a minimum-to-maximum column count ratio of about 1/100 seem best suited to boxed arrays. I recall reading in the JD docs that the recommended point at which to consider switching to boxed arrays (using JD) was 1/50.

Related

python 3.x: which is more efficient: list of lists vs. dict?

Imagine you are doing a BFS over a grid (e.g. finding the shortest distance between two cells). Two data structures can be used to hold the visited info:
1) List of lists, i.e. data = [[False for _ in range(cols)] for _ in range(rows)]. Later we can access the data in a certain cell by data[r][c].
2) Dict, i.e. data = dict(). Later we can access the data in a certain cell by data[(r, c)].
My question is: which is computationally more efficient in such BFS scenario?
Coding-wise, the dict approach saves a few characters/lines. Memory-wise, the dict approach can potentially save space for untouched cells, but can also waste space on the hash table's overhead.
EDIT
@Peteris mentioned numpy arrays. The advantage over a list of lists is obvious: numpy arrays operate on contiguous blocks of memory, which allows faster addressing and more cache hits. However, I'm not sure how they compare to hash tables (i.e. dict). If the algorithm touches a relatively small number of elements, a hash table might provide more cache hits given its potentially smaller memory footprint.
Also, the truth is that numpy arrays are unavailable to me, so I really need to compare a list of lists against a dict.
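For a rough sense of the difference, a quick micro-benchmark along these lines can help (a minimal sketch; the grid size, the write-every-cell workload, and the use of timeit are my own assumptions, not part of the question):

import timeit

rows, cols = 500, 500

def use_list_of_lists():
    # mark every cell once via row/column indexing
    data = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            data[r][c] = True
    return data

def use_dict():
    # same workload with a dict keyed by (r, c) tuples
    data = {}
    for r in range(rows):
        for c in range(cols):
            data[(r, c)] = True
    return data

print("list of lists:", timeit.timeit(use_list_of_lists, number=10))
print("dict:", timeit.timeit(use_dict, number=10))

A BFS that only touches a small fraction of the cells may shift the balance toward the dict, so it is worth timing with an access pattern close to the real one.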
A 2D array
The efficient way to store 2D data is a 2D array/matrix allocated in a contiguous area of memory (not a list of lists). This avoids the multiple memory lookups a list of lists requires, and the hash computation a dict needs on every lookup.
The standard way to do this in Python is with the numpy library; here's a simple example:
import numpy as np
data = np.zeros( (100, 200) ) # create a 100x200 array and initialize it with zeroes
data[10,20] = 1 # set element at coordinates 10,20 to 1

Longest Common Subsequence between very large strings

I am trying to solve the Longest Common Subsequence problem: finding the longest subsequence common to all sequences in a set of sequences (often just two sequences).
I am trying to do this to calculate the overlap between 2 strings.
This is a well-known dynamic programming problem. However, in my case the strings are huge, and when I tried to use the 2D memoization matrix I ran out of memory.
One solution could be to use a sparse matrix instead, but I am a little concerned about the performance overhead.
Also, I want to run this algorithm across multiple strings, and an approximate answer is acceptable since I am only trying to measure the overlap between two strings.
EDIT: After some investigation I found the following alternatives
Hirschberg Algorithm https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm
Original paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.348.4360&rep=rep1&type=pdf
Approximate algorithm : http://cs.haifa.ac.il/~ilan/online-papers/cpm09.pdf
Deposition and Extension Approach to Find Longest Common Subsequence for Multiple Sequences https://arxiv.org/pdf/0903.2015.pdf
LCS on DNA sequence http://www.sersc.org/journals/IJAST/vol47/2.pdf
Efficient Algorithm http://www.sciencedirect.com/science/article/pii/S0885064X12000635
To reduce the memory complexity, you don't need to store the entire 2D table. You only need to keep the row above and the current row (tracking the best value so far in a separate variable if you need it). This brings memory usage down to O(N), while time complexity remains O(N^2).
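A minimal sketch of the two-row idea in Python (this computes only the length of the LCS, not the subsequence itself; the function name is illustrative):

def lcs_length(a, b):
    # keep only two rows of the DP table: the previous row and the current row
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[len(b)]

print(lcs_length("AGGTAB", "GXTXAYB"))  # 4, corresponding to "GTAB"

If you also need to recover the subsequence itself rather than just its length, Hirschberg's algorithm (linked above) achieves that in linear space.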

approximate histogram for streaming string values (card catalog algorithm?)

I have a large list (or stream) of UTF-8 strings sorted lexicographically. I would like to create a histogram with approximately equal values for the counts, varying the bin width as necessary to keep the counts even. In the literature, these are sometimes called equi-height, or equi-depth histograms.
I'm not looking to do the usual word-count bar chart; I'm looking for something more like an old-fashioned library card catalog where you have a set of drawers (bins), and one might hold SAM - SOLD, the next SOLE - STE, while all of Y - ZZZ fits in a single bin. I want to calculate where to put the cutoffs for each bin.
Is there (A) a known algorithm for this, similar to the approximate histograms used for numeric values, or (B) a suggestion for encoding the strings so that a standard numeric histogram algorithm would work? The algorithm should not require prior knowledge of the string population.
The best way I can think of so far is to simply wait until I have a reasonable amount of data, then form logical bins by:
number_of_strings / bin_count = number_of_strings_in_each_bin
Then, starting at 0, step forward by number_of_strings_in_each_bin to get the bin endpoints.
This has two weaknesses for my use-case. First, it requires two iterations over a potentially very large number of strings, one for the count, one to find the endpoints. More importantly, a good histogram implementation can give an estimate of where in a bin a value falls, and this would be really useful.
Thanks.
If we can't make any assumptions about the data, you are going to have to make a pass to determine bin size.
This means that you have to either start with a bin size rather than bin number or live with a two-pass model. I'd just use linear interpolation to estimate positions between bins, then do a binary search from there.
Of course, if you can make some assumptions about the data, here are some that might help:
For example, you might not know the exact size, but you might know that it will fall in some interval [a, b]. If you want at most n bins, make the bin size b/n.
Alternatively, if you're not particular about exactly equal-sized bins, you could do it in one pass by sampling every m-th element and dumping it into an array, where m is something reasonable for your context.
Then, to find the bin endpoints, you'd take every (size / n / m)-th element of that array.
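A rough sketch of that one-pass sampling idea in Python (the values of m and the bin count are placeholders, and the stream is assumed to be an iterable of strings):

def approx_bin_endpoints(stream, m=1000, n_bins=10):
    # keep every m-th element as a sample while streaming
    sample = []
    for i, s in enumerate(stream):
        if i % m == 0:
            sample.append(s)
    sample.sort()  # cheap if the stream was already sorted, as in the question
    # choose n_bins - 1 cut points that split the sample evenly
    step = max(1, len(sample) // n_bins)
    return [sample[k] for k in range(step, len(sample), step)][:n_bins - 1]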
The solution I came up with addresses the lack of up-front information about the population by using reservoir sampling. Reservoir sampling lets you efficiently take a random sample of a given size from a population of unknown size; see Wikipedia for details. It provides a random sample regardless of whether the stream is ordered or not.
We make one pass through the data, gathering a sample. For the sample we have explicit information about the number of elements as well as their distribution.
For the histogram, I used a Guava RangeMap. I picked the endpoints of the ranges to provide an even number of results in each range (sample_size / number_of_bins). The Integer in the map merely stores the order of the ranges, from 1 to n. This allows me to estimate the proportion of records that fall within two values: If there are 100 equal sized bins, and the values fall in bin 25 and bin 75, then I can estimate that approximately 50% of the population falls between those values.
This approach has the advantage of working for any Comparable data type.
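The implementation described above uses Java and Guava, but the idea is easy to sketch in Python; here is a minimal illustration of reservoir sampling followed by equi-depth cut points (the sample size and bin count are placeholder values):

import random

def reservoir_sample(stream, k):
    # uniform random sample of k items from a stream of unknown length (Algorithm R)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

def equi_depth_endpoints(stream, k=10000, n_bins=100):
    # assumes the sample ends up larger than the number of bins
    sample = sorted(reservoir_sample(stream, k))
    per_bin = len(sample) // n_bins
    # upper boundary of each bin except the last
    return [sample[(b + 1) * per_bin - 1] for b in range(n_bins - 1)]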

bin packing with overlapping objects

I have some bins with different capacities and some objects with specified sizes. The goal is to pack these objects into the bins. So far this is just the bin-packing problem. The twist is that each object may partially overlap with another, so while objects 1 and 2 have sizes s1 and s2, when I put them in the same bin the filled space is less than s1 + s2. Supposing that I know this overlap value for each pair of objects, is there an approximation algorithm for this problem too, like the ones for the original bin-packing problem?
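For concreteness, here is a naive first-fit-decreasing-style greedy in Python that accounts for pairwise overlaps when computing how much space an object would add to a bin. This is only an illustration of the problem setup, under the assumption that pairwise overlaps simply add up; it is not the algorithm from the paper cited in the answer below:

def greedy_pack(sizes, overlap, capacities):
    # sizes: {object: size}; overlap: {(a, b): shared size}; capacities: list of bin capacities
    bins = [[] for _ in capacities]
    used = [0.0] * len(capacities)
    # place the largest objects first, as in first-fit decreasing
    for obj in sorted(sizes, key=sizes.get, reverse=True):
        for i, contents in enumerate(bins):
            shared = sum(overlap.get((obj, o), overlap.get((o, obj), 0.0)) for o in contents)
            added = max(sizes[obj] - shared, 0.0)
            if used[i] + added <= capacities[i]:
                contents.append(obj)
                used[i] += added
                break
        else:
            raise ValueError("object %r does not fit in any bin" % obj)
    return bins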
The answer is to use a kind of tree that captures the similarity of the objects, assuming that objects can be broken up. Then run a greedy algorithm that fills the bins according to the tree. This algorithm has a 3x approximation bound. However, there may well be better answers.
This method is presented in Michael Sindelar, Ramesh K. Sitaraman, Prashant J. Shenoy: Sharing-aware algorithms for virtual machine colocation. SPAA 2011: 367-378.
I got this answer from this thread but just wanted to close this question by giving the answer.
The only approach I think will work is to prune the items that don't fit into the current bins and place them in another bin. I don't mean a first-fit algorithm, but rather waiting for a period of time and then using new bins for those items. In practice, can't you just use another bin? That's the pragmatic approach. I mean you can grow the bin to the left or to the right, as in this example: http://codeincomplete.com/posts/2011/5/7/bin_packing/.

How to pick a vector graphics element quickly in lots of elements

How can I quickly "pick" a 2D element from a large number of vector graphics elements, such as polylines, polygons, curves, etc.?
In Qt, the QGraphics classes can do this easily, but in my program I don't need them; I just need QPainter and QWidget. I want to manage and render the element data myself.
So..
What related graphics topics should I search for on Google? BSP-trees? R-trees?
Any advice is appreciated. Thanks!
It seems that an R-tree is better suited to picking than a BSP-tree. According to the Wikipedia article on spatial indexing, an R-tree is:
Typically the preferred method for indexing spatial data. Objects (shapes, lines and points) are grouped using the minimum bounding rectangle (MBR). Objects are added to an MBR within the index that will lead to the smallest increase in its size.
But are you sure it's worth your while to implement the creation, maintenance, and use of the R-tree rather than using QGraphics?
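If a full R-tree feels like overkill, even a simple broad-phase structure over bounding boxes can cut the number of exact hit tests dramatically. Below is a minimal Python sketch of a uniform grid index; it is my own illustration of the general spatial-indexing idea, not an R-tree and not Qt code, and elements are reduced to their bounding rectangles:

from collections import defaultdict

class GridIndex:
    # bucket elements by the grid cells their bounding boxes cover
    def __init__(self, cell_size=64.0):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # (cx, cy) -> list of (elem_id, bbox)

    def _cells_for(self, bbox):
        x0, y0, x1, y1 = bbox
        s = self.cell_size
        for cx in range(int(x0 // s), int(x1 // s) + 1):
            for cy in range(int(y0 // s), int(y1 // s) + 1):
                yield (cx, cy)

    def insert(self, elem_id, bbox):
        for cell in self._cells_for(bbox):
            self.cells[cell].append((elem_id, bbox))

    def pick(self, x, y):
        # broad phase: return only elements whose bounding box contains the point
        cell = (int(x // self.cell_size), int(y // self.cell_size))
        return [eid for eid, (x0, y0, x1, y1) in self.cells[cell]
                if x0 <= x <= x1 and y0 <= y <= y1]

idx = GridIndex()
idx.insert("polyline-1", (10, 10, 120, 40))
print(idx.pick(50, 20))  # ['polyline-1']: candidates for an exact hit test

The candidates returned by pick would still go through an exact hit test against the real geometry (point-in-polygon, distance to a polyline, and so on).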
