Dataset sorting and adding data in machine learning - Azure

I have a dataset with an auto-incrementing ID and a sequence of random numbers.
ID,Rnumber
1,500
2,799
3,683
4,237
5,974
6,654
7,778
8,423
9,389
I'm trying to rank the rows from the highest to the lowest value and then assign each row to a group based on its rank.
Example: ranks 1-150 are placed in group 1,
ranks 151-300 are placed in group 2, and so on.
What is the easiest way to do this in Azure Machine Learning?
I realize this may be easy, but my knowledge of the subject is limited to a general understanding of its functions and usage, so there are still a lot of possibilities and functions to explore.
Since I have this specific question, I'm asking it here to get a head start!
Any help is appreciated!
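A minimal sketch of the ranking and grouping itself, assuming the data can be loaded into a pandas DataFrame (for example inside a Python script step). The function name add_rank_groups and the Rank and Group column names are made up for illustration, and how the script plugs into Azure Machine Learning is not shown.

```python
import pandas as pd

def add_rank_groups(df, value_col="Rnumber", group_size=150):
    """Rank rows from highest to lowest value and bucket the ranks into groups."""
    out = df.copy()
    # method="first" breaks ties by row order, so every rank is unique
    out["Rank"] = out[value_col].rank(method="first", ascending=False).astype(int)
    # Ranks 1..group_size -> group 1, the next group_size ranks -> group 2, ...
    out["Group"] = (out["Rank"] - 1) // group_size + 1
    return out.sort_values("Rank")

# The nine sample rows from the question, with a small group size for illustration
df = pd.DataFrame({
    "ID": range(1, 10),
    "Rnumber": [500, 799, 683, 237, 974, 654, 778, 423, 389],
})
ranked = add_rank_groups(df, group_size=3)
print(ranked)
```

With the real data, group_size=150 would reproduce the grouping from the example (ranks 1-150 in group 1, ranks 151-300 in group 2).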

Related

Assessing features to labelencode or get_dummies() on dataset in Python

I'm working on the heart attack analysis on Kaggle in Python.
I am a beginner and I'm trying to figure out whether it's still necessary to one-hot-encode or label-encode these features. I see so many people encoding the values for this project, but I'm confused because everything already looks scaled (apart from age, thalach, oldpeak and slope).
age: age in years
sex: (1 = male; 0 = female)
cp: ordinal values 1-4
thalach: maximum heart rate achieved
exang: (1 = yes; 0 = no)
oldpeak: depression induced by exercise
slope: the slope of the peak exercise
ca: values (0-3)
thal: ordinal values 0-3
target: 0= less chance, 1= more chance
Would you say it's still necessary to one-hot-encode, or should I just use a StandardScaler straight away?
I've seen many people encode the whole dataset for this project, but it makes no sense to me to do so. Please confirm if only using StandardScaler would be enough?
When you apply StandardScaler, the columns end up on a comparable scale (zero mean and unit variance). That helps keep the model weights bounded and stops gradient descent from overshooting, so the model converges faster.
Independently, to decide between ordinal values and one-hot encoding, consider whether the distance between a column's values is meaningful, that is, whether values that are numerically closer are also more similar. If so, choose ordinal values. If you know the hierarchy of the categories, you can assign the ordinal values manually. Otherwise, you should use LabelEncoder. It seems that the heart attack data already comes with manually assigned ordinal values. For example, higher chest pain = 4.
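As a rough sketch of how the two choices can be combined in scikit-learn: the continuous columns are standardized, one column is one-hot encoded in case its codes are treated as unordered, and everything else is passed through unchanged. The column names come from the question; the four rows of values are invented for illustration, and treating cp as nominal is only an assumption to show the mechanics.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up sample rows; only the column names come from the question
df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "thalach": [150, 187, 172, 178],
    "oldpeak": [2.3, 3.5, 1.4, 0.8],
    "cp": [3, 2, 1, 1],
    "sex": [1, 1, 0, 1],
})

continuous = ["age", "thalach", "oldpeak"]  # scaled to zero mean, unit variance
nominal = ["cp"]                            # assumption: treated as unordered here

preprocess = ColumnTransformer(
    [
        ("scale", StandardScaler(), continuous),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), nominal),
    ],
    remainder="passthrough",  # binary/ordinal columns such as sex are left as-is
    sparse_threshold=0,       # always return a dense array
)
X = preprocess.fit_transform(df)
print(X.shape)
```

If cp is kept as an ordinal value instead, it can simply be dropped from the nominal list so that it is passed through like sex.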
It also helps to study notebooks that perform well. Take a look at the one below for reference.
95% Accuracy - https://www.kaggle.com/code/abhinavgargacb/heart-attack-eda-predictor-95-accuracy-score

Finding powerlines in LIDAR point clouds with RANSAC

I'm trying to find powerlines in LIDAR point clouds with the ransac() function from skimage.measure. This is my very first time working with these modules in Python, so bear with me.
So far, the only thing I have managed to do reliably is filter low or 'ground' points out of the cloud to reduce the number of points to deal with.
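import laspy

def filter_Z(las, threshold):
    # New LAS container with the same point format and version as the input
    filtered = laspy.create(point_format=las.header.point_format, file_version=las.header.version)
    # Keep only points more than `threshold` raw (unscaled) Z units above the lowest point
    filtered.points = las.points[las.Z > las.Z.min() + threshold]
    print(f'original size: {len(las.points)}')
    print(f'filtered size: {len(filtered.points)}')
    filtered.write('filtered_points2.las')
    return filtered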
The threshold is something I set by hand, because the las files I worked with contain some nasty outliers that prevent me from calculating it dynamically.
The filtered point cloud, or at least one of them, looks like this:
Note the evil red outliers on top, maybe they're birds or something. Along with them are trees and roofs of buildings. If anyone wants to take a look at the .las files, let me know. I can't put a wetransfer link in the body of the question.
A top down view:
I've looked into it as much as I could, and found the skimage.measure module and the ransac function that comes with it. I played around a bit to get a feel for it and currently I'm stumped on how to continue.
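from skimage.measure import LineModelND, ransac

def ransac_linefit_sklearn(points):
    # Fit a single line; returns the fitted model and a boolean inlier mask
    model_robust, inliers = ransac(points, LineModelND, min_samples=2, residual_threshold=1000, max_trials=1000)
    return model_robust, inliers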
The result is quite predictable (I ran ransac on a 2D view of the cloud just to make it a bit easier on the pc)
Using this doesn't really yield any good results in examples like the one I posted. The vegetation clusters have too many points, and the line is fitted through them because they have the highest point density.
I tried DBSCAN() to cluster up the points but it didn't work. I also attempted OPTICS() but as I write it still hasn't finished running.
From what I've read in various articles, the best course of action would be to cluster the points and run RANSAC on each individual cluster to find lines, but I'm not really sure how to do that or which clustering method to use in situations like these.
One other thing I'm curious about is simply filtering out the big blobs of trees that mess with the model fitting.
Inadequacy of RANSAC
RANSAC works best when your data follows a single-mode distribution around your model. For this point cloud, that would mean a single line plus outliers, but there are at least 5 lines in the bird's-eye view. Check out this older SO post that discusses your problem. Francesco's response suggests an iterative RANSAC based approach.
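One common reading of "iterative RANSAC" is to fit a line, remove its inliers, and repeat on what is left. The sketch below follows that reading with skimage; it is not necessarily what the linked answer proposes. The function name sequential_ransac_lines, its parameters and the synthetic test data are all made up for illustration, and the thresholds would need tuning for real LIDAR data.

```python
import numpy as np
from skimage.measure import LineModelND, ransac

def sequential_ransac_lines(points, max_lines=5, residual_threshold=0.5,
                            min_inliers=50, max_trials=1000):
    """Fit up to max_lines lines by repeatedly running RANSAC and removing inliers."""
    remaining = np.asarray(points, dtype=float)
    lines = []
    for _ in range(max_lines):
        # Stop when too few points are left to support another line
        if len(remaining) < max(min_inliers, 2):
            break
        model, inliers = ransac(remaining, LineModelND, min_samples=2,
                                residual_threshold=residual_threshold,
                                max_trials=max_trials)
        if model is None or inliers is None or inliers.sum() < min_inliers:
            break
        lines.append((model, remaining[inliers]))
        # Drop the points explained by this line before looking for the next one
        remaining = remaining[~inliers]
    return lines, remaining

# Synthetic 2D example: two parallel "wires" plus a little random clutter
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 200)
line_a = np.column_stack([x, 0.0 + rng.normal(0, 0.05, 200)])
line_b = np.column_stack([x, 10.0 + rng.normal(0, 0.05, 200)])
clutter = rng.uniform([0, -20], [100, 30], size=(20, 2))
lines, leftover = sequential_ransac_lines(np.vstack([line_a, line_b, clutter]))
print(len(lines), len(leftover))
```

Dense vegetation would still attract the first fits, so this is only likely to help after the big blobs have been filtered out or when it is run per cluster.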
Octrees and SVD
Colleagues worked on a similar problem in my previous job. I am not fluent in the approach, but I know enough to provide some hints.
Their approach resembled Francesco's suggestion. They partitioned the point cloud with an octree and calculated the singular value decomposition (SVD) within each partition. The three resulting singular values describe how the points in that partition are distributed geometrically.
If the first singular value is significantly greater than the other two, then the points are line-like.
If the first and second singular values are significantly greater than the third, then the points are plane-like.
If all three values are of similar magnitude, then the data is just a "glob" of points.
They used these rules iteratively to rule out which points were most likely NOT part of the lines.
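The three rules above can be sketched with NumPy for a single partition. This is only an illustration under assumptions: the points are centered before the SVD, "significantly greater" is turned into a fixed ratio of 10, and the name classify_shape is made up. The octree partitioning and the iterative filtering are not shown.

```python
import numpy as np

def classify_shape(points, ratio=10.0):
    """Label an (N, 3) block of points as 'line', 'plane' or 'blob' (N >= 3)."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)  # assumption: SVD is taken about the centroid
    # Singular values come back sorted in descending order: s1 >= s2 >= s3
    s1, s2, s3 = np.linalg.svd(centered, compute_uv=False)
    eps = np.finfo(float).eps
    if s1 > ratio * max(s2, eps):
        return "line"   # one dominant direction
    if s2 > ratio * max(s3, eps):
        return "plane"  # two dominant directions
    return "blob"       # no dominant direction

rng = np.random.default_rng(0)
t = rng.uniform(0, 50, size=(300, 1))
wire = t * np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.01, size=(300, 3))
roof = np.column_stack([rng.uniform(0, 10, 300), rng.uniform(0, 10, 300),
                        rng.normal(0, 0.01, 300)])
tree = rng.normal(0, 1.0, size=(300, 3))
print(classify_shape(wire), classify_shape(roof), classify_shape(tree))
```

The ratio is the main tuning knob: a lower value accepts sagging or noisy wires as lines but also lets elongated vegetation through.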
Literature
If you want to look into published methods, maybe this paper is a good starting point. Power lines are modeled as hyperbolic functions.

Handling optional data in Logistic regression

I am working with data containing students' marks and other features, and I am trying to predict whether they will get a high salary using scikit-learn in Python. I ran into a problem: since a student does not take every subject, the score for a subject is -1 if the student has not taken it (a student can take multiple subjects).
Below is a snapshot taken from the data file:
Snapshot
I am trying to find a way to interpret the -1 in a way that doesn't alter the data much.
My Approach:
Take the percentile marks for each student and then average all the percentiles, giving a single number per student, which is a lot easier to work with. However, this method may lose some information about the distribution of marks.
Fill the -1 values with the average marks of all the students in that subject, but this will not work if the data is biased towards one subject.
Is there a better way to deal with this kind of data?
Your -1 values amount to missing data, so you are asking how to approach a classification task with missing data. See here and here and here, among many others, for discussions on this topic.
A couple of important considerations that come to mind:
One option is to "impute" the missing values, which is what you're describing with "average marks." This approach often requires the assumption that the data is "missing at random", which is unlikely to be true in your case: for example, a bad student is more likely to skip a difficult subject, so the missing values themselves tell you something.
Regression models (like logistic regression) will generally require some type of imputation. But there are other models, like decision trees or random forests, that can handle missing data without imputation, depending on the implementation.
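A minimal sketch of mean imputation for logistic regression in scikit-learn, assuming -1 marks a subject that was not taken. Setting add_indicator=True appends one 0/1 column per subject recording where a mark was missing, so the model can still use the "did not take this subject" signal mentioned above. The marks and labels are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up marks for three subjects; -1 means the subject was not taken
X = np.array([
    [78, -1, 64],
    [55, 71, -1],
    [-1, 88, 90],
    [62, -1, 45],
    [91, 83, -1],
    [40, 52, 58],
], dtype=float)
y = np.array([1, 0, 1, 0, 1, 0])  # made-up "high salary" labels

model = make_pipeline(
    # Replace -1 with the subject mean and append "was missing" indicator columns
    SimpleImputer(missing_values=-1, strategy="mean", add_indicator=True),
    StandardScaler(),
    LogisticRegression(),
)
model.fit(X, y)
print(model.predict(X))
```

Putting the imputer inside the pipeline means the subject means are learned from the training folds only when the model is cross-validated.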

Number of training samples for text classification task

Suppose you have a set of transcribed customer service calls between customers and human agents, where on average each call's length is 7 minutes. Customers will mostly call because of issues they have with the product. Let's assume that a human can assign one label per axis per call:
Axis 1: What was the problem from the customer's perspective?
Axis 2: What was the problem from the agent's perspective?
Axis 3: Could the agent resolve the customer's issue?
Based on the manually labeled texts you want to train a text classifier that shall predict a label for each call for each of the three axes. But the labeling of recordings takes time and costs money. On the other hand you need a certain amount of training data to get good prediction results.
Given the above assumptions, how many manually labeled training texts would you start with? And how do you know that you need more labeled training texts?
Maybe you've worked on a similar task before and can give some advice.
UPDATE (2018-01-19): There's no right or wrong answer to my question. Ok, ideally, somebody worked on exactly the same task, but that's very unlikely. I'll leave the question open for one more week and then accept the best answer.
This is tricky to answer, but I will try my best based on my experience.
In the past, I have performed text classification on 3 datasets (the number in brackets indicates how big each dataset was): restaurant reviews (50k sentences), Reddit comments (250k sentences) and developer comments from issue tracking systems (10k sentences). Each of them had multiple labels as well.
In each of the three cases, including the one with 10k sentences, I achieved an F1 score of more than 80%. I am stressing the 10k dataset specifically because some people told me it was too small.
So, in your case, assuming you have at least 1000 instances (calls that include a conversation between customer and agent) averaging 7 minutes each, this should be a decent start. If the results are not satisfying, you have the following options:
1) Use different models (MNB, Random Forest, Decision Tree, and so on, in addition to whatever you are using).
2) If point 1 gives more or less similar results, check the ratio of instances across the classes you have (for each of the 3 axes you mention). If the classes are not well balanced, get more data or, if you cannot, try different balancing techniques.
3) Another way would be to classify at the sentence level rather than the message or conversation level, which generates more data and gives individual labels to sentences rather than to the message or conversation as a whole.
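Points 1 and 2 can be sketched with scikit-learn as follows: compute the class ratios for one axis, then compare two of the suggested models with cross-validated macro F1. The twelve short texts and the labels "billing" and "technical" are invented stand-ins for real call transcripts, and TF-IDF features are an assumption, since the answer does not say which features were used.

```python
from collections import Counter

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up stand-ins for transcribed calls, labeled on one axis
texts = [
    "I was charged twice on my invoice",
    "my bill shows a wrong amount",
    "refund for the duplicate payment please",
    "the invoice charged my card twice",
    "wrong amount on my payment bill",
    "please refund the charged amount",
    "the app crashes when I log in",
    "my device will not turn on",
    "the app shows an error on login",
    "device crashes after the update",
    "error when I turn on the device",
    "login fails after the app update",
]
labels = ["billing"] * 6 + ["technical"] * 6

# Point 2: share of each class
counts = Counter(labels)
ratios = {label: n / len(labels) for label, n in counts.items()}

# Point 1: compare models on the same features with cross-validated macro F1
candidates = {
    "MNB": MultinomialNB(),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}
scores = {}
for name, clf in candidates.items():
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    scores[name] = cross_val_score(pipeline, texts, labels, cv=3,
                                   scoring="f1_macro").mean()

print(ratios)
print(scores)
```

The same loop would be run once per axis, since each axis has its own label set and its own class balance.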

Distance dependent Chinese Restaurant Process maybe

I'm new to machine learning and want to implement the distance dependent Chinese Restaurant Process in MATLAB for clustering audio tracks.
I'm looking to use the dd-CRP on 26 features. I'm guessing the process might go like this:
1) Read in the 1st feature vector and assign it a "table".
2) Read in the 2nd feature vector and compare it to the 1st "table", maybe using the cosine of the angle between the two vectors (due to the high dimension); if it agrees within some defined theta, join that table, else start a new one.
3) Read in the next feature vector and repeat step 2 for it against each existing table.
While this is occurring, I will keep track of how many tables there are.
I will run the algorithm over, say, 16 audio tracks. The audio will be fed into the algorithm as follows: the first feature vector will be from the first frame of audio track 1, the second feature vector from the first frame of track 2, etc., as I'm trying to find out which audio tracks like to cluster together most, but I don't want to define how many centroids there are. Obviously I'll have to keep track of which audio track is at which "table".
Does this make sense?
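For reference, the procedure described in this question can be sketched in Python roughly as below. Two details are assumptions, because the question leaves them open: each table is represented by the mean direction of its members, and a vector joins the closest table that lies within theta. The name threshold_cluster is made up, and all input vectors are assumed to be non-zero.

```python
import numpy as np

def threshold_cluster(vectors, theta_deg=20.0):
    """Greedy one-pass clustering: join the closest table within theta, else open a new one."""
    cos_min = np.cos(np.radians(theta_deg))
    tables = []  # tables[k] holds the row indices seated at table k
    sums = []    # sums[k] is the sum of unit vectors at table k (its mean direction)
    for i, v in enumerate(np.asarray(vectors, dtype=float)):
        unit = v / np.linalg.norm(v)
        best, best_cos = None, cos_min
        for k, s in enumerate(sums):
            cos = float(unit @ (s / np.linalg.norm(s)))
            if cos >= best_cos:
                best, best_cos = k, cos
        if best is None:
            tables.append([i])
            sums.append(unit.copy())
        else:
            tables[best].append(i)
            sums[best] += unit
    return tables

frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
tables = threshold_cluster(frames)
print(tables)
```

The result depends on the order in which the vectors arrive, which is one way this procedure differs from a probabilistic model.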
This is not a Chinese Restaurant Process. This is a heuristic algorithm which has some similarity to a Chinese Restaurant Process. In a CRP everything is phrased in terms of priors over the assignments of items to clusters (the tables analogy), and these are combined with a likelihood function for each cluster (which formalises the similarity function you described). Inference is then done by Gibbs Sampling, which means non-deterministically sampling which cluster each track is assigned to in turn given all the other assignments. Variational methods for non-parametrics are still in a very preliminary state.
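To make the "prior over assignments" concrete, here is a small sketch that samples seating arrangements from the plain CRP prior: customer i joins an existing table with probability proportional to the number of customers already there, or opens a new table with probability proportional to a concentration parameter alpha. This is only the standard prior. It is not the distance dependent variant, and it has no likelihood term and no Gibbs sampling. The name sample_crp is made up.

```python
import numpy as np

def sample_crp(n_items, alpha=1.0, seed=0):
    """Draw one seating arrangement from the CRP prior (n_items >= 1)."""
    rng = np.random.default_rng(seed)
    assignments = [0]  # the first customer always opens table 0
    counts = [1]       # counts[k] = number of customers at table k
    for _ in range(1, n_items):
        # Existing tables are weighted by occupancy, a new table by alpha
        weights = np.array(counts + [alpha], dtype=float)
        table = int(rng.choice(len(weights), p=weights / weights.sum()))
        if table == len(counts):
            counts.append(1)
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts

assignments, counts = sample_crp(16, alpha=1.0)
print(assignments, counts)
```

In a full model, these prior weights would be multiplied by a per-cluster likelihood of the feature vector before sampling each assignment.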
Why do you want to use a CRP? Do you think you'll get something out of it beyond more conventional clustering methods? The bar to entry for the implementation and proper understanding of non-parametrics is pretty high, and they're often of little practical use at the moment because of the constraints on inference I mentioned.
You can use the X-means algorithm, which automatically determines the optimal number of centroids (and hence the number of clusters) based on the Bayesian Information Criterion (BIC). In short, the algorithm looks at how dense each cluster is and how far the clusters are from each other.

Resources