Understanding Spark data format for MLlib [duplicate] - apache-spark

This question already has answers here:
How to understand the format type of libsvm of Spark MLlib?
(1 answer)
How can I read LIBSVM models (saved using LIBSVM) into PySpark?
(1 answer)
Closed 4 years ago.
I am reading about binary classification in Spark ML. I have read the Java code in the Spark repository, and I understand binary classification itself, but I cannot figure out how the sample data files are generated. For example: https://github.com/apache/spark/blob/master/data/mllib/sample_binary_classification_data.txt
This link is a sample for binary classification. If I want to generate data in this format, how do I do that?

Usually, the first column is the class label (in this case 0 / 1) and the remaining entries are the feature values. In the LIBSVM format used by that file, each feature is written as an index:value pair, indices are 1-based, and zero-valued features may be omitted (which is why the vectors are sparse).
To generate the data yourself, you can use a random generator, for instance, but it depends on the problem you are working on.
If you need to download datasets to apply classification algorithms, you can use repositories such as the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
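
For illustration, here is a minimal sketch of such a generator in Scala (the file name, row count, and feature count are arbitrary choices):

    import java.io.PrintWriter
    import scala.util.Random

    // Toy generator for LIBSVM-formatted data: one line per example,
    // "label index1:value1 index2:value2 ...", with 1-based indices.
    object GenerateLibsvmData {
      def main(args: Array[String]): Unit = {
        val numRows = 100          // arbitrary sizes for illustration
        val numFeatures = 10
        val rng = new Random(42)
        val out = new PrintWriter("my_binary_classification_data.txt")
        for (_ <- 1 to numRows) {
          val label = rng.nextInt(2)   // binary class label: 0 or 1
          val features = (1 to numFeatures)
            .map(i => f"$i:${rng.nextDouble()}%.6f")
            .mkString(" ")
          out.println(s"$label $features")
        }
        out.close()
      }
    }

Spark can then load the file with spark.read.format("libsvm").load("my_binary_classification_data.txt").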


DataFrame and Dataset [duplicate]

This question already has answers here:
Difference between DataFrame, Dataset, and RDD in Spark
(14 answers)
Closed 4 years ago.
In Spark, there is often an operation like this:
hiveContext.sql("select * from demoTable").show()
When I look up the show() method in the official Spark API, I cannot find it on DataFrame, and when I change the keyword to 'Dataset', I find that the method used on a DataFrame actually belongs to Dataset. How does this happen? Is there any implication?
According to the documentation:
A Dataset is a distributed collection of data.
And:
A DataFrame is a Dataset organized into named columns.
So, technically:
DataFrame is equivalent to Dataset<Row>
And one last quote:
In the Scala API, DataFrame is simply a type alias of Dataset[Row]. While, in Java API, users need to use Dataset to represent a DataFrame.
In short, the concrete type is Dataset.
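
A minimal Scala sketch of what the alias means in practice (demoTable is assumed to exist, as in the question):

    import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

    val spark = SparkSession.builder()
      .appName("alias-demo")
      .enableHiveSupport()
      .getOrCreate()

    // In the Spark source, `type DataFrame = Dataset[Row]`,
    // so the two are interchangeable in Scala:
    val df: DataFrame = spark.sql("select * from demoTable")
    val ds: Dataset[Row] = df   // compiles without any conversion
    ds.show()                   // show() is defined once, on Dataset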

Feature importance of each variable in GBT Classifier Pyspark [duplicate]

This question already has answers here:
PySpark & MLLib: Random Forest Feature Importances
(5 answers)
Closed 5 years ago.
How do I get the corresponding feature importance of every variable in a GBTClassifier model in PySpark?
From Spark 2.0+ you have the attribute:
model.featureImportances
This will give a sparse vector with the feature importance of each column/attribute.
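
A minimal sketch in Scala (the same attribute exists in PySpark), training on the sample file from the first question; the path and setMaxIter value are arbitrary:

    import org.apache.spark.ml.classification.GBTClassifier

    // Assumes `spark` is an active SparkSession.
    val training = spark.read.format("libsvm")
      .load("data/mllib/sample_binary_classification_data.txt")

    val model = new GBTClassifier().setMaxIter(10).fit(training)

    // Sparse vector: one importance value per feature index.
    println(model.featureImportances)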

Convert Text Data into SVMFile format for Spam Classification?

How can I convert text data into the LIBSVM file format for training a spam-classification model?
Are SVM files already labeled?
The LIBSVM format is neither required nor that useful. It is used in the Apache Spark ML examples only because it can be mapped directly to the required format.
Are SVM files already labeled?
Not necessarily, but Spark can read only the labeled variant.
In practice you should use the org.apache.spark.ml.feature tools to extract relevant features from your data.
You can follow the documentation as well as a number of questions on SO.
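
As a minimal sketch, assuming a DataFrame of (label, text) pairs with made-up column names and output path, you could tokenize and hash the text and then optionally persist the result in LIBSVM format:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import spark.implicits._   // assumes a SparkSession named `spark`

    // Assumed input: (label, text) pairs, e.g. 1.0 = spam, 0.0 = ham.
    val messages = Seq(
      (1.0, "win a free prize now"),
      (0.0, "meeting rescheduled to monday")
    ).toDF("label", "text")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF))

    val prepared = pipeline.fit(messages)
      .transform(messages)
      .select("label", "features")

    // Optionally persist in LIBSVM format (requires a numeric label
    // column and a Vector features column):
    prepared.write.format("libsvm").save("spam_libsvm")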

How can I operationalize a SparkML model as a real-time webservice? [duplicate]

This question already has answers here:
How to serve a Spark MLlib model?
(4 answers)
Closed 5 years ago.
Once a SparkML model has been trained on a Spark cluster, how can I take the trained model and make it available for scoring through a RESTful API?
The problem is that loading the model requires a SparkContext, but is there a way to 'fake it', since it does not seem to be really necessary? Or what is the minimum required to create a SparkContext?
In some cases, yes, it can be done.
Many models in Spark can be exported to PMML, a standardized format for ML models, using a Java library such as https://github.com/jpmml/jpmml-sparkml
How to export is covered in this question: Spark ml and PMML export.
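A minimal sketch of the export step, based on the jpmml-sparkml README (training and pipelineModel are assumed to be the fitted pipeline's input DataFrame and PipelineModel; verify the API against the version you use):

    import java.io.File
    import org.jpmml.sparkml.PMMLBuilder

    // `training` is the DataFrame the pipeline was fitted on;
    // `pipelineModel` is a fitted org.apache.spark.ml.PipelineModel.
    new PMMLBuilder(training.schema, pipelineModel)
      .buildFile(new File("model.pmml"))

The resulting .pmml file can then be scored in a plain JVM service (e.g. with JPMML-Evaluator) without any SparkContext.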
You can also use Spark Streaming to calculate values; however, it will have higher latency until Continuous Processing Mode becomes available.
For very time-consuming calculations, such as recommendation algorithms, I think it is quite normal to pre-calculate values and save them in a database like Cassandra.

Not getting proper result in Model Training while using Azure Machine Learning Studio with Two Class Bayes Point Machine Algorithm

We are using Azure Machine Learning Studio to build a trained model, and for that we have used the Two-Class Bayes Point Machine algorithm.
For sample data, we have imported a .CSV file that contains columns such as Tweets and Label.
After deploying the web service, we got improper output.
We want our algorithm to predict the Label as 0 or 1 on the basis of different types of tweets that are already stored in the dataset.
When testing it with tweets that are in the dataset, it gives the proper result, but the problem occurs when testing it with other tweets (that are not in the dataset).
You can view our experiment over here:
Experiment
Are you planning to do binary classification based on the textual data of the tweets? If so, you should try feature hashing before doing the classification.
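Azure ML Studio provides feature hashing as a built-in module, so no code is needed there; as an analogous illustration in Spark ML (a tweetsDF DataFrame with a Tweets column is assumed):

    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Analogous to Studio's Feature Hashing module: map each tweet to a
    // fixed-size numeric vector, so words unseen at training time still
    // fall into some hash bucket instead of breaking the model.
    val tokenizer = new Tokenizer().setInputCol("Tweets").setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol("words").setOutputCol("features")
      .setNumFeatures(1 << 12)   // 4096 buckets, an arbitrary choice

    val featurized = hashingTF.transform(tokenizer.transform(tweetsDF))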
