search data structure with custom distance on histograms

I have a list M of k-dimensional features. I want to find the item in this list nearest to a query feature A. Feature comparison is not based directly on a common metric (such as Euclidean or symmetric Chi2). Rather, the comparison between feature A and feature B works as follows: compute the distance (any common metric) between A and B', where B' is obtained by circularly shifting B. Since the features are k-dimensional, this yields k-1 distances between A and B, and the comparison function returns the lowest.
Considering the comparison function above, is it possible to optimize the NN search with an appropriate algorithm or data structure?
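For reference, here is a minimal sketch of the comparison function described above, assuming Euclidean as the underlying metric (numpy's roll performs the circular shift):

import numpy as np

def shift_invariant_distance(a, b):
    # Compare a against the circular shifts of b and keep the smallest
    # base-metric distance (Euclidean here; any common metric would do).
    return min(np.linalg.norm(a - np.roll(b, s)) for s in range(len(b)))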

Related

Mixed data type TensorFlow-based random forest regression

As the topic suggests, I would like to create a TensorFlow-based random forest regression in Python for our data set, which contains the following columns:
HotelName (text, categorical), Country (text, categorical), Review (text?), date (continuous or categorical, not sure), and some continuous-valued columns.
My questions are:
What exactly should the data types of the columns mentioned above be, and is any mapping/discretization of the features necessary (for example, if there are 10 countries, do we map them to integers 1-10)?
How do we implement the random forest TensorFlow model? I searched on the internet but only found the iris data set random forest example (which has only continuous data). In the Estimator API one can specify the value type of each column, but that doesn't work with tensor_forest, right? How should I do the implementation?
Thanks, and wishing everyone a happy new year!
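On the mapping/discretization part, a minimal sketch of turning categorical text columns into integer codes with pandas (the sample values are made up; only the column names come from the question):

import pandas as pd

# Toy frame with two of the columns mentioned in the question.
df = pd.DataFrame({
    "HotelName": ["Alpha", "Beta", "Alpha"],
    "Country": ["France", "Spain", "France"],
})

# factorize assigns each distinct string an integer code 0..n_categories-1.
for col in ["HotelName", "Country"]:
    df[col + "_code"], uniques = pd.factorize(df[col])
    # Keep 'uniques' to apply the same mapping to new data later.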

Does it make sense to do VectorIndexer after StringIndexer on categorical features?

Say I have a bunch of categorical string columns in my dataframe. Then I do the transforms below:
StringIndexer the columns
then use VectorAssembler to assemble all the transformed columns into one vector feature column
do VectorIndexer on the new vector feature column.
Question: for step 3, does it make sense, or is it duplicated effort? I think step 1 already did the indexing.
Yes, it makes sense if you're going to use a Spark tree-based algorithm (RandomForestClassifier or GBTClassifier) and you have high-cardinality features.
E.g. for the Criteo dataset, StringIndexer would convert the values in a categorical column to integers in the range 1 to 65000. It saves this in the metadata as a NominalAttribute, and RandomForestClassifier then extracts it from the metadata as categorical features.
For tree-based algorithms you have to specify the maxBins parameter, which
must be >= 2 and >= number of categories in any categorical feature.
A maxBins that is too high leads to slow performance. To avoid this, use VectorIndexer with, for example, .setMaxCategories(64); this treats as categorical only those features that have no more than 64 unique values.
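A minimal PySpark sketch of that pipeline; the column names and the 64 threshold are assumptions for illustration, and handleInvalid='keep' needs Spark 2.2+:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer
from pyspark.ml.classification import RandomForestClassifier

cat_cols = ["country", "device"]  # hypothetical categorical string columns

indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
            for c in cat_cols]
assembler = VectorAssembler(inputCols=[c + "_idx" for c in cat_cols],
                            outputCol="features_raw")
vec_indexer = VectorIndexer(inputCol="features_raw", outputCol="features",
                            maxCategories=64)
rf = RandomForestClassifier(labelCol="label", featuresCol="features", maxBins=64)

pipeline = Pipeline(stages=indexers + [assembler, vec_indexer, rf])
# model = pipeline.fit(train_df)  # train_df is your labeled training frame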

Accord.net decision tree breaks when the Decide function is used with data not found in the training set

I used separate training and testing data sets to test the decision tree induced with Accord.NET. In the testing data set, however, there is a record with a field value that does not appear anywhere in the training data set. After creating the tree from the training data, I called the tree's "Decide" method at runtime to get the output for the record with this new value, but the tree breaks with the following message:
"The tree is degenerated. This is often a sign that the tree is expecting discrete inputs, but it was given only real values".
Furthermore, I saw that during codification integers are assigned to the distinct values in the input data. But, as described above, the testing data has a distinct value that comes in between the other values for the relevant field. So until that value is encountered, the same integers are assigned to the data in both the training and testing data sets when they are codified separately. Once the newly found value has been assigned an integer, however, the same data elements in the testing and training data get different integers from then on. Can someone tell me how to solve these two issues?
For clarity on the second issue, I have given some sample data below. The testing data for the same column (in this case Qualification) contains "Diploma" as a new value not found in the training data set.
Training data for column "Qualification": High-School, Bachelor-Degree, Masters, Doctorate
Testing data for column "Qualification": High-School, Bachelor-Degree, Diploma, Masters, Doctorate
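A language-agnostic sketch of the usual fix, written in Python rather than C# (the function and variable names are illustrative, not Accord.NET API): build one codebook from the training values only, reuse it for the test set, and reserve a code for values never seen during training.

def build_codebook(values):
    # Assign integers in order of first appearance in the training data.
    codebook = {}
    for v in values:
        codebook.setdefault(v, len(codebook))
    return codebook

train_quals = ["High-School", "Bachelor-Degree", "Masters", "Doctorate"]
test_quals = ["High-School", "Bachelor-Degree", "Diploma", "Masters", "Doctorate"]

codebook = build_codebook(train_quals)
unknown = len(codebook)  # reserved integer for values unseen in training
encoded_test = [codebook.get(v, unknown) for v in test_quals]
print(encoded_test)  # "Diploma" falls back to the reserved code: [0, 1, 4, 2, 3]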

Use Spark RDD to Find Path Cost

I am using Spark to design a TSP solver. Essentially, each element in the RDD is a 3-tuple (id, x, y), where id is the index of a point and (x, y) is the coordinate of that point. Given an RDD storing a sequence of such 3-tuples, how can I evaluate the path cost of the sequence? For example, the sequence (1, 0, 0), (2, 0, 1), (3, 1, 1) gives the cost 1 + 1 = 2 (from the first point to the second point and then to the third point). It seems that in order to do this I have to know exactly how Spark partitions the sequence (RDD). Also, how can I evaluate the cost between the boundary points of two partitions? Or is there a simple operation that does this?
With any parallel processing, you want to put serious thought into what a single data element is, so that only the data that needs to be together is together.
So instead of having every row be a point, it's likely that every row should be the array of points that define a path, at which point calculating the total path length with Spark becomes easy. You'd just use whatever you would normally use to calculate the total length of an array of line segments given the defining points.
But even then it's not clear that we need the full generality of points. For the TSP, a candidate solution is a path that includes all locations, which means that we don't need to store the locations of the cities for every solution, or calculate the distances every time. We just need to calculate one matrix of distances, which we can then broadcast so every Spark worker has access to it, and then look up the distances instead of calculating them.
(It's actually a permutation of location ids, rather than just a list of them, which can simplify things even more.)
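A minimal PySpark sketch of that approach; the coordinates and candidate tours are toy values, and every name here is an assumption rather than code from the answer:

import numpy as np
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# City coordinates, indexed by city id 0..n-1 (toy data).
coords = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Compute the full pairwise distance matrix once and broadcast it.
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
dist_bc = sc.broadcast(dist)

def tour_cost(tour):
    # Sum of distances between consecutive cities in the permutation.
    d = dist_bc.value
    return sum(d[tour[i], tour[i + 1]] for i in range(len(tour) - 1))

# Each RDD element is one whole candidate tour (a permutation of city ids).
tours = sc.parallelize([[0, 1, 2], [0, 2, 1]])
print(tours.map(tour_cost).collect())  # [2.0, 1.0 + sqrt(2)]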

combining rows/columns from spark data frames by mathematical operation

I have two Spark data frames (A and B) with respective sizes a x m and b x m, containing floating-point values.
Additionally, each data frame has a column 'ID' that is a string identifier. A and B have exactly the same set of 'ID's (i.e. they contain information about the same group of customers).
I'd like to combine a column of A with a column of B by some function.
More specifically, I'd like to build the scalar product of a column of A with a column of B, with the entries ordered according to the ID.
Even more specifically, I'd like to calculate the correlation between columns of A and B.
Performing this operation on all pairs of columns would be the same as a matrix multiplication: A_transposed x B.
However, for now I'm only interested in correlations of a small subset of pairs.
I have two approaches in mind, but I struggle to implement them. (And I don't know whether either is possible or advisable at all.)
(1) Take the column of interest from each data frame and combine each entry into a key-value pair, where the key is the ID. Then do something like reduceByKey() on the two columns of key-value pairs and a subsequent summation.
(2) Take the column of interest from each data frame, sort it by its ID, cast it to an RDD (I haven't figured out how to do this), and simply apply
Statistics.corr(rdd1, rdd2) from pyspark.mllib.stat.
Also I wonder: is it generally computationally preferable to operate on columns rather than rows (since Spark data frames are column-oriented), or does that make no difference?
Starting from Spark 1.4, if all you need is Pearson correlation, then you could go like this:
cor = dfA.join(dfB, dfA.id == dfB.id, how='inner').select(dfA.value.alias('aval'), dfB.value.alias('bval')).corr('aval', 'bval')
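For the scalar-product part of the question, a sketch using the same join; 'colA' and 'colB' are placeholders for the actual value columns of dfA and dfB:

from pyspark.sql import functions as F

# Align the two value columns by ID via an inner join, then sum the products.
joined = dfA.join(dfB, on='ID', how='inner')
dot = joined.select(F.sum(F.col('colA') * F.col('colB')).alias('dot')).first()['dot']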
