How to save IDFModel with PySpark

I have produced an IDFModel with PySpark and ipython notebook as follows:
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF
hashingTF = HashingTF() #this will be used with hashing later
txtdata_train = sc.wholeTextFiles("/home/ubuntu/folder").sortByKey() #this returns RDD of (filename, string) pairs for each file from the directory
split_data_train = txtdata_train.map(parse) #my parse function puts RDD in form I want
tf_train = hashingTF.transform(split_data_train) #creates term frequency sparse vectors for the training set
tf_train.cache()
idf_train = IDF().fit(tf_train) #makes IDFmodel, THIS IS WHAT I WANT TO SAVE!!!
tfidf_train = idf_train.transform(tf_train)
This is based on this guide: https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. I would like to save this model so I can load it again at a later time within a different notebook. However, there is no information on how to do this; the closest I could find is:
Save Apache Spark mllib model in python
But when I tried the suggestion in the answer
idf_train.save(sc, "/home/ubuntu/newfolder")
I get the error code
AttributeError: 'IDFModel' object has no attribute 'save'
Is there something I am missing, or is it not possible to save IDFModel objects? Thanks!

I did something like that in Scala/Java. It seems to work, but it might not be very efficient. The idea is to write the model out as a serialized object and read it back later. Good luck! :)
import java.io.{FileOutputStream, ObjectOutputStream, FileNotFoundException, IOException}

try {
  val fileOut: FileOutputStream = new FileOutputStream(savePath + "/idf.jserialized")
  val out: ObjectOutputStream = new ObjectOutputStream(fileOut)
  out.writeObject(idf)
  out.close()
  fileOut.close()
  System.out.println("\nSerialization Successful... Checkout your specified output file..\n")
} catch {
  case foe: FileNotFoundException => foe.printStackTrace()
  case ioe: IOException => ioe.printStackTrace()
}
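To load it back in another session, the counterpart is an ObjectInputStream. Here is a rough sketch of the read side, assuming the same savePath and that idf was an org.apache.spark.mllib.feature.IDFModel (the model carries its IDF vector with it, so plain Java serialization should be enough here):
import java.io.{FileInputStream, ObjectInputStream, FileNotFoundException, IOException}
import org.apache.spark.mllib.feature.IDFModel

try {
  // Open the file written by the snippet above and deserialize it back into an IDFModel
  val fileIn: FileInputStream = new FileInputStream(savePath + "/idf.jserialized")
  val in: ObjectInputStream = new ObjectInputStream(fileIn)
  val restoredIdf: IDFModel = in.readObject().asInstanceOf[IDFModel]
  in.close()
  fileIn.close()
  // restoredIdf.transform(tf) can now be used just like the original model
} catch {
  case foe: FileNotFoundException => foe.printStackTrace()
  case ioe: IOException => ioe.printStackTrace()
}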

Related

Run LDA algorithm on Spark 2.0

I use Spark 2.0.0 and I'd like to train an LDA model on a Tweets dataset. When I try to execute
val ldaModel = new LDA().setK(3).run(corpus)
I get this error
error: reference to LDA is ambiguous;
it is imported twice in the same scope by import org.apache.spark.ml.clustering.LDA and import org.apache.spark.mllib.clustering.LDA
Could someone please help me? Thanks!
It looks like you have both of the following import statements:
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.mllib.clustering.LDA
You would need to remove one of them.
If you are using Spark ML (the DataFrame-based API), the proper syntax would be:
import org.apache.spark.ml.clustering.LDA
/*feature extraction step*/
val lda = new LDA().setK(3)
val model = lda.fit(corpus)
If you are using the RDD-based API (MLlib), you would have to write:
import org.apache.spark.mllib.clustering.LDA
/*feature extraction step*/
val lda = new LDA().setK(3)
val model = lda.run(corpus)
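As a side note, if you really need both LDA implementations in the same scope, you can rename one of them at import time instead of removing it. A small sketch (the alias MLlibLDA is just an illustrative name):
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.mllib.clustering.{LDA => MLlibLDA}

// LDA now unambiguously refers to the DataFrame-based class,
// while the RDD-based one is reachable as MLlibLDA
val dfLda = new LDA().setK(3)
val rddLda = new MLlibLDA().setK(3)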

How to use large volume of data in Spark

I'm working with Spark in Python, trying to map PDF files through some custom parsing. Currently I'm loading the PDFs with pdfs = sparkContext.binaryFiles("some_path/*.pdf").
I set the RDD to be cached with spill to disk via pdfs.persist(pyspark.StorageLevel.MEMORY_AND_DISK).
I then try to map the parsing operation and then save the result as a pickle file, but it fails with an out-of-memory error in the heap. Could you help me please?
Here is the simplified code of what I do:
from pyspark import SparkConf, SparkContext
import pyspark
#There is some code here that sets up an `args` object with argparse,
#but it's not very interesting and a bit long, so I skip it.
def extractArticles( tupleData ):
    url, bytesData = tupleData
    #Convert the bytesData into `content`, a list of dicts
    return content
sc = SparkContext("local[*]","Legilux PDF Analyser")
inMemoryPDFs = sc.binaryFiles( args.filePattern )
inMemoryPDFs.persist( pyspark.StorageLevel.MEMORY_AND_DISK )
pdfData = inMemoryPDFs.flatMap( extractArticles )
pdfData.persist( pyspark.StorageLevel.MEMORY_AND_DISK )
pdfData.saveAsPickleFile( args.output )

How to use RandomForest in Spark Pipeline

I want to tune my model with grid search and cross-validation in Spark. In Spark, the base model must be put into a pipeline; the official pipeline demo uses LogisticRegression as the base model, which can be instantiated with new. However, the RandomForest model cannot be instantiated by client code, so it seems impossible to use RandomForest in the pipeline API. I don't want to reinvent the wheel, so can anybody give some advice?
Thanks
However, the RandomForest model cannot be instantiated by client code, so it seems impossible to use RandomForest in the pipeline API.
Well, that is true, but you are simply trying to use the wrong class. Instead of mllib.tree.RandomForest you should use ml.classification.RandomForestClassifier. Here is an example based on the one from the MLlib docs.
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._
case class Record(category: String, features: Vector)
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainData, testData) = (splits(0), splits(1))
val trainDF = trainData.map(lp => Record(lp.label.toString, lp.features)).toDF
val testDF = testData.map(lp => Record(lp.label.toString, lp.features)).toDF
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")
val rf = new RandomForestClassifier()
  .setNumTrees(3)
  .setFeatureSubsetStrategy("auto")
  .setImpurity("gini")
  .setMaxDepth(4)
  .setMaxBins(32)
val pipeline = new Pipeline()
  .setStages(Array(indexer, rf))
val model = pipeline.fit(trainDF)
model.transform(testDF)
There is one thing I couldn't figure out here. As far as I can tell, it should be possible to use labels extracted from LabeledPoints directly, but for some reason it doesn't work and pipeline.fit raises an IllegalArgumentException:
RandomForestClassifier was given input with invalid label column label, without the number of classes specified.
Hence the ugly trick with StringIndexer. After applying it we get the required attributes ({"vals":["1.0","0.0"],"type":"nominal","name":"label"}), but some classes in ml seem to work just fine without it.
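If you also want the predictions mapped back to the original category strings, one option is to append an IndexToString stage that reuses the labels learned by the indexer. A rough sketch building on the snippet above, assuming a Spark version where ml.feature.IndexToString is available (1.5+); the column name predictedCategory is just an example:
import org.apache.spark.ml.feature.IndexToString

// Fit the indexer separately so its labels array can drive the reverse mapping
val indexerModel = indexer.fit(trainDF)
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedCategory")
  .setLabels(indexerModel.labels)

val pipelineWithConverter = new Pipeline()
  .setStages(Array(indexerModel, rf, labelConverter))
pipelineWithConverter.fit(trainDF)
  .transform(testDF)
  .select("category", "predictedCategory")
  .show()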

How do I preserve the key or index of input to the Spark HashingTF() function?

Based on the Spark documentation for 1.4 (https://spark.apache.org/docs/1.4.0/mllib-feature-extraction.html), I'm writing a TF-IDF example for converting text documents to vectors of values. The example given shows how this can be done, but the input is an RDD of tokens with no keys. This means that my output RDD no longer contains an index or key to refer back to the original document. The example is this:
documents = sc.textFile("...").map(lambda line: line.split(" "))
hashingTF = HashingTF()
tf = hashingTF.transform(documents)
I would like to do something like this:
documents = sc.textFile("...").map(lambda line: (UNIQUE_LINE_KEY, line.split(" ")))
hashingTF = HashingTF()
tf = hashingTF.transform(documents)
and have the resulting tf variable contain the UNIQUE_LINE_KEY value somewhere. Am I just missing something obvious? From the examples it appears there is no good way to link the document RDD with the tf RDD.
I also encountered the same issue. In the example from the docs they encourage you to apply the transformations directly on the RDD.
However, you can apply the transformations on the vectors themselves and this way you can keep the keys whichever way you choose.
val input = sc.textFile("...")
val documents = input.map(doc => doc -> doc.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf = documents.mapValues(hashingTF.transform(_))
tf.cache()
val idf = new IDF().fit(tf.values)
val tfidf = tf.mapValues(idf.transform(_))
Note that this code will yield RDD[(String, Vector)] instead of RDD[Vector].
If you use a version of Spark from after commit 85b96372cf0fd055f89fc639f45c1f2cb02a378f (this includes 1.4) and use the ml API HashingTF (which requires DataFrame input instead of plain RDDs), the original columns are preserved in its output. Hope that helps!
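For completeness, here is a rough sketch of that DataFrame-based route in Scala. This is only illustrative: it assumes a sqlContext is in scope for toDF, and the id column produced by zipWithIndex stands in for whatever UNIQUE_LINE_KEY you actually want. Because ml's HashingTF and IDF just append columns, the id travels along with the vectors:
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import sqlContext.implicits._

// Keep an explicit id per document so the TF-IDF vectors stay linked to their source lines
val docsDF = sc.textFile("...")
  .zipWithIndex()
  .map { case (text, id) => (id, text) }
  .toDF("id", "text")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsDF = tokenizer.transform(docsDF)

val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val featurizedDF = hashingTF.transform(wordsDF)   // still has the "id" column

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val tfidfDF = idf.fit(featurizedDF).transform(featurizedDF)
tfidfDF.select("id", "features").show()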

Need some inputs in feature extraction in Apache Spark

I am new to Apache Spark and we are trying to use the MLlib utility to do some analysis. I collated some code to convert my data into features and then apply a linear regression algorithm to it. I am facing some issues. Please help, and excuse me if it's a silly question.
My person data looks like:
1,1000.00,36
2,2000.00,35
3,2345.50,37
4,3323.00,45
Just a simple example to get the code working
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
case class Person(rating: String, income: Double, age: Int)
val persondata = sc.textFile("D:/spark/mydata/persondata.txt").map(_.split(",")).map(p => Person(p(0), p(1).toDouble, p(2).toInt))
def prepareFeatures(people: Seq[Person]): Seq[org.apache.spark.mllib.linalg.Vector] = {
  val maxIncome = people.map(_.income).max
  val maxAge = people.map(_.age).max
  people.map(p =>
    Vectors.dense(
      if (p.rating == "A") 0.7 else if (p.rating == "B") 0.5 else 0.3,
      p.income / maxIncome,
      p.age.toDouble / maxAge))
}
def prepareFeaturesWithLabels(features: Seq[org.apache.spark.mllib.linalg.Vector]): Seq[LabeledPoint] =
  (0d to 1 by (1d / features.length)) zip(features) map(l => LabeledPoint(l._1, l._2))
--- It works up to here.
--- It breaks in the code below:
scala> val data = sc.parallelize(prepareFeaturesWithLabels(prepareFeatures(people)))
<console>:36: error: not found: value people
Error occurred in an application involving default arguments.
val data = sc.parallelize(prepareFeaturesWithLabels(prepareFeatures(people)))
^
Please advise
You seem to be going in roughly the right direction, but there are a few minor problems. First off, you are trying to reference a value (people) that you haven't defined; you only defined persondata. More generally, you seem to be writing your code to work with local sequences, and instead you should modify your code to work with RDDs (or DataFrames). Also, you are using parallelize to try to parallelize your operation, but parallelize is a helper method for taking a local collection and making it available as a distributed RDD. I'd recommend looking at the programming guides or some additional documentation to get a better understanding of the Spark APIs. Best of luck with your adventures with Spark.
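Concretely, here is a rough sketch of what that could look like with RDD operations, reusing the asker's Person case class and persondata RDD. This is only a sketch: the 0-to-1 label spacing is approximated with zipWithIndex, and the scaling logic is kept exactly as in the question.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Compute the maxima with RDD actions instead of local Seq operations
val maxIncome = persondata.map(_.income).max()
val maxAge = persondata.map(_.age).max()

// Build the feature vectors directly on the RDD; no parallelize needed
val features = persondata.map(p =>
  Vectors.dense(
    if (p.rating == "A") 0.7 else if (p.rating == "B") 0.5 else 0.3,
    p.income / maxIncome,
    p.age.toDouble / maxAge))

// Attach an evenly spaced label in [0, 1], mirroring the original Seq-based version
val count = features.count()
val data = features.zipWithIndex().map { case (v, i) =>
  LabeledPoint(i.toDouble / count, v)
}
data.cache()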
