What would be the best ML method for this use case? - azure

Within AzureML, I have a CSV file which contains 2 columns of data with thousands of rows. I'm looking to run this entire file as training, and find a pattern between these 2 sets of numbers, for example:
x -> y
... 10k x
And after all that training, I'd want to give this one line as the score model, so It'd look like:
x -> ? (Predict answer from training)
-- Note, the question mark here wouldn't need to be an exact match, as long as it is somewhat around what that actual number would turn out to be like.
Is their a ML method (Inside Azure ML) that does such thing? Any points would be great.
tl;dr: Finding any type of pattern between 2 numbers (w/ intense training).

Read about linear regression. This is answer for your question. And here is the link to Azure ML tutorial link

Related

Finding powerlines in LIDAR point clouds with RANSAC

I'm trying to find powerlines in LIDAR points clouds with skimage.measures ransac() function. This is my very first time meddling with these modules in python so bear with me.
So far all I knew how to do reliably was filtering low or 'ground' points from the cloud to reduce the number of points to deal with.
def filter_Z(las, threshold):
filtered = laspy.create(point_format = las.header.point_format, file_version = las.header.version)
filtered.points = las.points[las.Z > las.Z.min() + threshold]
print(f'original size: {len(las.points)}')
print(f'filtered size: {len(filtered.points)}')
filtered.write('filtered_points2.las')
return filtered
The threshold is something I put in by hand since in the las files I worked with are some nasty outliers that prevent me from dynamically calculating it.
The filtered point cloud, or one of them atleast looks like this:
Note the evil red outliers on top, maybe they're birds or something. Along with them are trees and roofs of buildings. If anyone wants to take a look at the .las files, let me know. I can't put a wetransfer link in the body of the question.
A top down view:
I've looked into it as much as I could, and found the skimage.measure module and the ransac function that comes with it. I played around a bit to get a feel for it and currently I'm stumped on how to continue.
def ransac_linefit_sklearn(points):
model_robust, inliers = ransac(points, LineModelND, min_samples=2, residual_threshold=1000, max_trials=1000)
return model_robust, inliers
The result is quite predictable (I ran ransac on a 2D view of the cloud just to make it a bit easier on the pc)
Using this doesn't really yield any good results in examples like the one I posted. The vegetation clusters have too many points and the line is fitted through it because it has the highest point density.
I tried DBSCAN() to cluster up the points but it didn't work. I also attempted OPTICS() but as I write it still hasn't finished running.
From what I've read on various articles, the best course of action would be to cluster up the points and perform RANSAC on each individual cluster to find lines, but I'm not really sure on how to do that or what clustering method to use in situations like these.
One thing I'm also curious about doing is just filtering out the big blobs of trees that mess with model fititng.
Inadequacy of RANSAC
RANSAC works best whenever your data fits a mono-modal distribution around your model. In the case of this point cloud, it works best whenever there is only one line with outliers, but there are at least 5 lines when viewed birds-eye. Check out this older SO post that discusses your problem. Francesco's response suggests an iterative RANSAC based approach.
Octrees and SVD
Colleagues worked on a similar problem in my previous job. I am not fluent in the approach, but I know enough to provide some hints.
Their approach resembled Francesco's suggestion. They partitioned the point-cloud into octrees and calculated the singular value decomposition (SVD) within each partition. The three resulting singular values will correspond to the geometric distribution of the data.
If the first singular value is significantly greater than the other two, then the points are line-like.
If the first and second singular values are significantly greater than the other, then the points are plane-like
If all three values are of similar magnitude, then the data is just a "glob" of points.
They used these rules iteratively to rule out which points were most likely NOT part of the lines.
Literature
If you want to look into published methods, maybe this paper is a good starting point. Power lines are modeled as hyperbolic functions.

How to get three dimensional vector embedding for a list of words

I have been asked to create three dimensional vector embeddings for a series of words. Although I understand what an embedding is and that word2vec will be able to create the vector embeddings, I cannot find a resource that shows me how to create a three dimensional vector (all the resources show many more dimensions than this).
The format I have to create the file in is:
house 34444 0.3232 0.123213 1.231231
dog 14444 0.76762 0.76767 1.45454
which is in the format <token>\t<word_count>\t<vector_embedding_separated_by_spaces>
Can anyone point me towards a resource that will show me how to create the desired file format given some training text?
Once you've decided on a programming language, and word2vec library, its documentation will likely highlight a configurable parameter that lets you specify the dimensionality of the vectors it trains. So, you just need to change that parameter from its typical values , like 100 or 300, to 3.
(Note, though, that 3-dimensional word-vectors are unlikely to show the interesting & useful property of higher-dimensional vectors.)
Once you've used such a library to create the vectors-in-memory, writing them out in your specified format becomes just a file-IO problem, unrelated to word2vec itself. In typical languages, you'd open a new file for writing, loop over your data printing each line properly, then close the file.
(To get a more detailed answer from StackOverflow, you'd want to pick a specific language/library, show what you've already tried with actual code, and show how the results/errors achieved fall short of your goal.)

spark - MLlib: transform and manage categorical features

For big datasets with 2bil+ samples and approximately 100+ features per sample. Among these, 10% features you have are numerical/continuous variables and the rest of it are categorical variables (position, languages, url etc...).
Let's use some examples:
e.g: dummy categorical feature
feature: Position
real values: SUD | CENTRE | NORTH
encoded values: 1 | 2 | 3
...would have sense use reduction like SVD because distance beetween sud:north > sud:centre and, moreover, it's possible to encode (e.g OneHotEncoder, StringIndexer) this variable because of the small cardinality of it values-set.
e.g: real categorical feature
feature: url
real values: very high cardinality
encoded values: ?????
1) In MLlibthe 90% of the model works just with numerical values (a part of Frequent Itemset and DecisionTree techniques)
2) Features transformers/reductor/extractor as PCA or SVD are not good for these kind of data, and there is no implementation of (e.g) MCA
a) Which could be your approach to engage with this kind of data in spark, or using Mllib?
b) Do you have any suggestions to cope with this much categorical values?
c) After reading a lot in literature, and counting the implemented model in spark, my idea, about make inference on one of that features using the others (categorical), the models at point 1 could be the best coiche. What do you think about it?
(to standardize a classical use case you can imagine the problem of infer the gender of a person using visited url and other categorical features).
Given that I am a newbie in regards to MLlib, may I ask you to provide a concrete example?
Thanks in advance
Well, first I would say stackoverflow works in a different way, you should be the one providing a working example with the problem you are facing and we help you out using this example.
Anyways I got intrigued with the use of the categorical values like the one you show as position. If this is a categorical value as you mention with 3 levels SUD,CENTRE, NORTH, there is no distance between them if they are truly categorical. In this sense I would create dummy variables like:
SUD_Cat CENTRE_Cat NORTH_Cat
SUD 1 0 0
CENTRE 0 1 0
NORTH 0 0 1
This is a truly dummy representation of a categorical variable.
On the other hand if you want to take that distance into account then you have to create another feature which takes this distance into account explicitly, but that is not a dummy representation.
If the problem you are facing is that after you wrote your categorical features as dummy variables (note that now all of them are numerical) you have very many features and you want to reduce your feature's space, then is a different problem.
As a rule of thumbs I try to utilize the entire feature space first, now a plus since in spark computing power allows you to run modelling tasks with big datasets, if it is too big then I would go for dimensionality reduction techniques, PCA etc...

How to find a regression line for a closed set of data with 4 parameters in matlab or excel?

I have a set of data I have acquired from simulations. There are 3 parameters that go into my simulations and I get one result out.
I can graph the data from the small subset i have and see the trends for each input, but I need to be able to extrapolate this and get some form of a regression equation seeing as the simulation takes a long time.
In matlab or excel, is it possible to list the inputs and outputs to obtain a 4 parameter regression line for a given set of information?
Before this gets flagged as a duplicate, i understand polyfit will give me an equation of best fit and will be as accurate as i want it, but i need the equation to correspond to the inputs, not just a regression line.
In other words if i 20 simulations of inputs a, b, c and output y, is there a way to obtain a "best fit":
y=B0+B1*a+B2*b+B3*c
using the data?
My usual recommendation for higher-dimensional curve fitting is to pose the problem as a minimization problem (that may be unneeded here with the nice linear model you've proposed, but I'm a hammer-nail guy sometimes).
It starts by creating a correlation function (the functional form you think maps your inputs to the output) given a vector of fit parameters p and input data xData:
correl = #(p,xData) p(1) + p(2)*xData(:,1) + p(3)*xData(:2) + p(4)*xData(:,3)
Then you need to define a function to minimize given the parameter vector, which I call the objective; this is typically your correlation minus you output data.
The details of this function are determined from the solver you'll use (see below).
All of the method need a starting vector pGuess, which is dependent on the trends you see.
For nonlinear correlation function, finding a good pGuess can be a trial but necessary for a good solution.
fminsearch
To use fminsearch, the data must be collapsed to a scalar value using some norm (2 here):
x = [a,b,c]; % your input data as columns of x
objective = #(p) norm(correl(p,x) - y,2);
p = fminsearch(objective,pGuess); % you need to define a good pGuess
lsqnonlin
To use lsqnonlin (which solves the same problem as above in different ways), the norm-ing of the objective is not needed:
objective = #(p) correl(p,x) - y ;
p = lsqnonlin(objective,pGuess); % you need to define a good pGuess
(You can also specify lower and upper bounds on the parameter solution, which is nice.)
lsqcurvefit
To use lsqcurvefit (which is simply a wrapper for lsqnonlin), only the correlation function is needed along with the data:
p = lsqcurvefit(correl,pGuess,x,y); % you need to define a good pGuess

SVD and singular / non-singular matrices

I need to use the SVD form of a matrix to extract concepts from a series of documents. My matrix is of the form A = [d1, d2, d3 ... dN] where di is a binary vector of M components. Then the svd decomposition gives me svd(A) = U x S x V' with S containing the singular values.
I use SVDLIBC to do the processing in nodejs (using a small module I wrote to use it). It seemed to work all well, but I noticed something quite weird in the running time behavior depending on the state of my matrix (where N, M are growing, but already above 1000 for each).
First, I didn't consider extracting the same document vectors, but now after some tests, it looks like adding a document twice sometimes speeds the processing extraordinarily.
Do I have to make sure that each of the columns of A are pairwise-independent? Are they required to be all linearly independent? (I thought nope, since SVD just seems to be performing its job well even with some columns being exactly the same, it will simply show in the resulting decomposition which columns / rows are useless by having 0 components in U or V)
Now that it sometimes takes way too much time to compute the SVD of my big matrix, I was trying to reduce its size by removing the same columns, but I found out that actually adding dummy same vectors can make it way faster. Is that normal? What's happening?
Logically, I'd say that I want my matrix to contain as much information as possible, and thus
[A] Remove all same columns, and in the best case, maybe
[B] Remove linearly dependent columns.
Doing [A] seems pretty simple and not computationally too expensive, I could hash my vectors at construction to check what are the possibly same vectors, and then spend time to check these, but are there good computation techniques for [A] and [B]?
(I'd appreciate for [A] to not have to check equality of a new vector with the whole past vectors the brute-force way, and as for [B], I don't know any good way to check it / do it).
Added related question: about my second question, why would SVD's running time behavior change so massively by just adding one similar column? Is that a normal possible behavior, or does it mean I should look for a bug in SVDLIBC?
It is difficult to say where the problem is without samples of fast and slow input matrices. But, since one of the primary uses of the SVD is to provide a rotation that eliminates covariance, redundant (or the same) columns should not cause problems.
To answer your question about if the slow behavior being a bug in the library you're using, I'd suggest trying to retrieve the SVD of the same matrix using another tool. For example, in Octave, retrieve an SVD of your matrix to compare runtimes:
[U, S, V] = svd(A)

Resources