Accord.net decision tree breaks when the Decide function is used with data not found in the training set - decision-tree

I used separate training and testing data sets to evaluate a decision tree induced with Accord.NET. I then found that the testing data set contains a record with one field value that does not occur in the training data set. After creating the tree from the training data, I called the tree's "Decide" method at runtime on that record with the new value. The tree breaks with the following message:
"The tree is degenerated. This is often a sign that the tree is expecting discrete inputs, but it was given only real values".
Furthermore, I saw that codification assigns integers to the distinct values in the input data. As described above, the testing data has a distinct value that falls between other values of the relevant field. When the two data sets are codified separately, the same integers are assigned in both until that value is met. Once the new value gets an integer, the values shared by the testing and training data receive different integers from then on. Can someone tell me how to solve these two issues?
To clarify the second issue, I have given some sample data below. The testing data for the same column (in this case, Qualification) contains "Diploma", a value not found in the training data set.
Training data for column "Qualification": High-School, Bachelor-Degree, Masters, Doctorate
Testing data for column "Qualification": High-School, Bachelor-Degree, Diploma, Masters, Doctorate
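A minimal sketch of the second issue, written in Python rather than with Accord.NET, and using hypothetical names (`build_codebook`, `encode`, `UNKNOWN`) that do not come from any library. It assumes the value-to-integer mapping is built from the training column only and then reused for the testing column, so shared values keep their codes and an unseen value such as "Diploma" goes to a reserved code. Whether a particular tree implementation accepts such a reserved code is a separate question and is not shown here.

```python
UNKNOWN = -1  # hypothetical reserved code for values absent from training


def build_codebook(values):
    """Assign 0, 1, 2, ... to distinct values in order of first appearance."""
    codebook = {}
    for value in values:
        if value not in codebook:
            codebook[value] = len(codebook)
    return codebook


def encode(values, codebook):
    """Translate values with an existing codebook; unseen values get UNKNOWN."""
    return [codebook.get(value, UNKNOWN) for value in values]


train = ["High-School", "Bachelor-Degree", "Masters", "Doctorate"]
test = ["High-School", "Bachelor-Degree", "Diploma", "Masters", "Doctorate"]

# Codifying the testing set on its own: "Diploma" takes code 2 and shifts the rest.
separate_test_codes = encode(test, build_codebook(test))

# Codifying with the training codebook: shared values keep their training codes.
codebook = build_codebook(train)
train_codes = encode(train, codebook)
shared_test_codes = encode(test, codebook)

print(separate_test_codes)  # [0, 1, 2, 3, 4]
print(train_codes)          # [0, 1, 2, 3]
print(shared_test_codes)    # [0, 1, -1, 2, 3]
```

With the shared codebook, "Masters" is 2 in both sets. Codified separately, it is 2 in training but 3 in testing.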

Related

Dependent data validation lists with only two raw data columns

I have two columns whose data I want to feed into a data validation list, depending on the selection made in another validation list. For this, it is not suitable to have any auxiliary columns or tables other than the two mentioned. How would you approach it (see video illustration here)?
Thanks in advance. 😀

Compare the data of excel sheet with related data in neo4j

We migrated data from an RDBMS to Neo4j in Excel format.
We have around 100,000 records in an Excel sheet, and we used that sheet to create the data in Neo4j. Now we want to compare the data in the Excel sheet with the node data in Neo4j.
Some field values in the Excel sheet are duplicated, meaning that a particular value is used multiple times in the sheet but appears only once in Neo4j (because we used MERGE).
Is there any way to compare and verify such a large amount of data,
so that we can be sure we haven't lost any data?
I don't think there is a concrete general answer (like a tool that will do that for you), since it will have to depend on knowledge of the spreadsheet data structure, the neo4j data model, and how you want the two to correspond.
But perhaps you can do a sanity check by extracting from the spreadsheet information about:
how many nodes (of each label) you expect
how many relationships (of each type) you expect
and comparing that with what the neo4j DB actually contains. If the numbers match exactly, then you can have some confidence that your data is intact. You can also spot-check a few nodes to see if they have the expected relationships, and check if those nodes and relationships have the right data.
To get the number of nodes of each label, and relationship of each type, you can use the APOC procedure apoc.meta.stats:
CALL apoc.meta.stats() YIELD labels, relTypesCount
RETURN *
Here is a sample result:
╒══════════════════════════════════════╤══════════════════════════════════════╕
│"labels"                              │"relTypesCount"                       │
╞══════════════════════════════════════╪══════════════════════════════════════╡
│{"Movie":76,"Class":2,"Partner":1,"Con│{"ACTED_IN":344,"REVIEWED":18,"WROTE":│
│tract":2,"Person":275,"Claim":2}      │20,"PRODUCED":30,"CLAIMANT":2,"FOLLOWS│
│                                      │":6,"DIRECTED":88,"POLICY_HOLER":2}   │
└──────────────────────────────────────┴──────────────────────────────────────┘
I only chose to YIELD 2 of the results from that procedure; you may want to look at the others to see what additional checks you may want to do.
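The sanity check described above might be sketched as follows in Python. Everything here is illustrative: the rows, the label-to-column mapping, and the `actual_labels` dictionary (standing in for the `labels` map returned by `apoc.meta.stats`) are made-up assumptions. The expected node count per label is taken to be the number of distinct values in one key column, on the assumption that MERGE collapsed the duplicates.

```python
def expected_label_counts(rows, key_column_for_label):
    """Count distinct key values per label, mirroring MERGE de-duplication."""
    return {
        label: len({row[column] for row in rows})
        for label, column in key_column_for_label.items()
    }


def count_mismatches(expected, actual):
    """Return {label: (expected, actual)} for every label whose counts differ."""
    labels = set(expected) | set(actual)
    return {
        label: (expected.get(label, 0), actual.get(label, 0))
        for label in labels
        if expected.get(label, 0) != actual.get(label, 0)
    }


# Hypothetical spreadsheet rows; a real sheet would be loaded from the file.
rows = [
    {"person": "Alice", "movie": "Heat"},
    {"person": "Bob", "movie": "Heat"},
    {"person": "Alice", "movie": "Ronin"},
]
key_column_for_label = {"Person": "person", "Movie": "movie"}

# Hypothetical stand-in for the "labels" map returned by apoc.meta.stats.
actual_labels = {"Person": 2, "Movie": 1}

expected = expected_label_counts(rows, key_column_for_label)
mismatches = count_mismatches(expected, actual_labels)
print(expected)    # {'Person': 2, 'Movie': 2}
print(mismatches)  # {'Movie': (2, 1)}
```

An empty `mismatches` dictionary would mean the counts agree. The same comparison could be repeated for relationship types against `relTypesCount`.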

How do I handle categorical data where there are different numbers of categories for a data point in training and testing?

I am working on the following Kaggle project: https://www.kaggle.com/c/house-prices-advanced-regression-techniques.
My question is: what if a categorical feature has a value that appears in the test data but not in the training data, or vice versa? For example, data point A might have options [a,b] in the training data but options [a,b,c] in the testing data, or vice versa. Thanks for your help!
I just want to be able to train and run my neural network properly.
Are you OneHotEncoding (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) your categorical data? You could avoid this problem by OneHotEncoding the categorical columns before splitting. This means the model will be trained with a column for every value, including the one missing from the training set. A drawback of this approach is that training would only ever see '0' values in the column that represents the value not seen in the training set. Maybe not the best option, but it could solve the issue you are seeing.
Is the issue caused by the fact that you have a very small dataset, or that you have a column that has lots of unique values?
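A variation on the one-hot suggestion above, as a hand-rolled sketch in plain Python rather than scikit-learn. The helper names `fit_categories` and `one_hot` are hypothetical. It assumes the categories are collected from the training column only, and that a value seen only in the test data is encoded as all zeros. scikit-learn's OneHotEncoder offers comparable behaviour through its `handle_unknown="ignore"` option.

```python
def fit_categories(values):
    """Collect the sorted distinct categories seen in the training column."""
    return sorted(set(values))


def one_hot(value, categories):
    """Return one 0/1 slot per known category; unknown values yield all zeros."""
    return [1 if value == category else 0 for category in categories]


train_column = ["a", "b", "a"]
test_column = ["a", "b", "c"]

categories = fit_categories(train_column)
encoded_test = [one_hot(value, categories) for value in test_column]

print(categories)    # ['a', 'b']
print(encoded_test)  # [[1, 0], [0, 1], [0, 0]]
```

The width of each encoded row depends only on the training categories, so the network's input size stays the same whichever set a row comes from.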

Mixed data type TensorFlow-based random forest regression

As the title suggests, I would like to create a TensorFlow-based random forest regression in Python for our data set, which contains the following columns:
HotelName (text, categorical), Country (text, categorical), Review (text?), date (continuous or categorical, not sure) and some continuous-valued columns.
My questions are:
What should be the exact categories of the data types mentioned above, and is any mapping/discretization of the features necessary (for example, if there are 10 countries, do we map them to the integers 1-10)?
How do we implement the random forest TensorFlow model? I searched the internet but only found the iris data set random forest example (which has only continuous data). In the Estimator API, one can specify the value type of each column, but that doesn't work with tensor_forest, right? How should I do the implementation?
Thanks and wishing everyone a happy new year!
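On whether the date column is continuous or categorical: one possible reading, offered only as an assumption, is that a date can supply both kinds of feature. The sketch below uses only the Python standard library and a hypothetical helper, `date_features`. It does not touch TensorFlow or tensor_forest.

```python
from datetime import date


def date_features(day):
    """Split a date into one continuous and two categorical-style features."""
    return {
        "ordinal": day.toordinal(),  # day count, usable as a continuous value
        "month": day.month,          # 1..12, usable as a category
        "weekday": day.weekday(),    # 0 = Monday .. 6 = Sunday, usable as a category
    }


features = date_features(date(2019, 1, 1))
print(features)
```

Which of these features helps, if any, depends on the data. The split only shows that the two interpretations are not mutually exclusive.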

Azure machine learning predict order for customer

I have created a new experiment in Azure Machine Learning and added two datasets by manually uploading csv's.
One is from a customer for whom I'd like to predict which products he will order next.
The second dataset has the same type of data, but from all other customers, as a reference for learning.
I have productid, amount, and orderdate, plus orderid for grouping the rows and placing them on a timeline.
The customer (dataset one) is always several months behind in ordering the latest products. Therefore I added dataset two, with all other customers, as a reference.
The reference can also tell which products are more popular (ordered more often and by several customers), so perhaps I should add a customerid column to the dataset.
I know how to start and get the data in, and I do know that it is common to split the data for training, feed it to the train model with a Ilearnerdotnet type and give the output to the score model and evaluate the model.
I do not know how to choose a classification type and how this can give an output for the next three months of order. I have read some tutorials, but I just need someone who can give me some pointers.
edit I have added the customerid to the dataset so that I have just one set now which I should split to focus on a specific customer.
edit2 found these templates. will look into it https://stackoverflow.com/a/36552849/169714
Go over this http://download.microsoft.com/download/0/5/A/05AE6B94-E688-403E-90A5-6035DBE9EEC5/machine-learning-basics-infographic-with-algorithm-examples.pdf
If the above infographic doesn't help, you can try all of the learners using this experiment and pick the one with the best results - https://gallery.cortanaintelligence.com/Experiment/Algo-Evaluater-Compare-Performance-of-Multiple-Algos-against-Your-Data-1
