I have trained a TensorFlow model and want to apply it to a large dataset in HDFS, about a billion samples. The main requirement is that I need to write the model's predictions to an HDFS file. However, I can't find a TensorFlow API for writing data to HDFS; I can only find the API for reading HDFS files.
So far, I have saved the trained model to a local .pb file and then loaded that file with the TensorFlow Java API in Spark or MapReduce code. The problem with both Spark and MapReduce is that the job runs very slowly and fails with a memory-exceeded error.
Here is my demo:
public class TF_model implements Serializable {
    public Session session;
    public TF_model(String model_path) {
        try {
            Graph graph = new Graph();
            InputStream stream = this.getClass().getClassLoader().getResourceAsStream(model_path);
            byte[] graphBytes = IOUtils.toByteArray(stream);
            graph.importGraphDef(graphBytes);
            this.session = new Session(graph);
        } catch (Exception e) {
            System.out.println("failed to load tensorflow model");
        }
    }
    // this is the function to predict a sample in hdfs
    // (return type changed to double[][] to match the array that is actually returned)
    public double[][] predict(int[] token_id_array) {
        Tensor z = session.runner()
            .feed("words_ids_placeholder", Tensor.create(new int[][]{token_id_array}))
            .fetch("softmax_prediction").run().get(0);
        double[][][] softmax_prediction = new double[1][token_id_array.length][2];
        z.copyTo(softmax_prediction);
        return softmax_prediction[0];
    }
}
Below is my Spark code:
val rdd = spark.sparkContext.textFile(file_path)
val predct_result = rdd.mapPartitions(pa => {
  val tf_model = new TF_model("model.pb")
  pa.map(line => {
    val transformed = transform(line) // omitted the transform code
    val rs = tf_model.predict(transformed)
    rs
  })
})
I also tried TensorFlow deployed on Hadoop, but couldn't find a way to write a large dataset to HDFS.
You can read the model file from HDFS once, then use sc.broadcast to ship the byte array of your graph to the partitions. Finally, load the graph and predict in each partition. This avoids reading the file from HDFS multiple times.
Related
How to do parallel model training per partition in spark using scala?
The solution given here is in PySpark. I'm looking for a solution in Scala.
How can you efficiently build one ML model per partition in Spark with foreachPartition?
Get the distinct partition values using the partition column
Create a thread pool of, say, 100 threads
Create a Future for each partition value and run it
Sample code might look as follows:
// Get an ExecutorService
val threadPoolExecutorService = getExecutionContext("name", 100)
// check https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/HasParallelism.scala#L50
val uniquePartitionValues: List[String] = ... // getDistinctPartitionsUsingPartitionCol
// Asynchronous invocation to training. The result will be collected from the futures.
val uniquePartitionValuesFutures = uniquePartitionValues.map(partitionValue => {
  Future[Double] {
    try {
      // get dataframe where partitionCol=partitionValue
      val partitionDF = mainDF.where(s"partitionCol=$partitionValue")
      // do preprocessing and training using any algo with an input partitionDF and return accuracy
    } catch {
      ....
    }
  }(threadPoolExecutorService)
})
// Wait for metrics to be calculated
val foldMetrics = uniquePartitionValuesFutures.map(Await.result(_, Duration.Inf))
println(s"output::${foldMetrics.mkString(" ### ")}")
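For readers who want to see the submit-then-collect pattern in isolation, here is a minimal sketch in plain Java, without Spark. It is an illustration under assumptions: trainOnPartition is a hypothetical stand-in for "filter the DataFrame, train, return a metric" and only returns a dummy number, and the class name, pool size, and partition values are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PerPartitionTraining {

    // Hypothetical stand-in for the per-partition training step.
    // It returns a dummy metric so the sketch runs without Spark.
    static double trainOnPartition(String partitionValue) {
        return partitionValue.length();
    }

    // Submits one task per partition value, then collects the metrics in input order.
    static List<Double> trainAll(List<String> partitionValues, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Double>> futures = new ArrayList<>();
            for (String value : partitionValues) {
                futures.add(CompletableFuture.supplyAsync(() -> trainOnPartition(value), pool));
            }
            List<Double> metrics = new ArrayList<>();
            for (CompletableFuture<Double> future : futures) {
                metrics.add(future.join()); // blocks until this task has finished
            }
            return metrics;
        } finally {
            pool.shutdown(); // always release the worker threads
        }
    }

    public static void main(String[] args) {
        List<Double> metrics = trainAll(Arrays.asList("us", "emea", "apac"), 4);
        System.out.println("metrics: " + metrics);
    }
}
```

All tasks are submitted before any result is awaited, so the tasks can overlap up to the pool size. In a real Spark job each task would trigger its own Spark actions, and the cluster's scheduler would decide how much of that work actually runs in parallel.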
I wrote this code in Spark ML
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.Pipeline
val lr = new LogisticRegression()
val pipeline = new Pipeline()
.setStages(Array(fooIndexer, fooHotEncoder, assembler, lr))
val model = pipeline.fit(training)
This code takes a long time to run. Is it possible to save the model to HDFS after running pipeline.fit so that I don't have to run it again and again?
Edit: Also, how do I load it back from HDFS when I need to call transform on the model to make predictions?
Straight from the official documentation - saving:
// Now we can optionally save the fitted pipeline to disk
model.write.overwrite().save("/tmp/spark-logistic-regression-model")
and loading:
// And load it back in during production
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
Related:
Save ML model for future usage
Can I create a model in spark batch and use it on Spark streaming for real-time processing?
I have seen the various examples on Apache Spark site where both training and prediction are built on the same type of processing (linear regression).
Of course, yes. In the Spark community this is called offline training with online prediction. Many training algorithms in Spark let you save the model to a file system such as HDFS or S3. The same model can then be loaded by a streaming application, and you simply call the model's predict method to make predictions.
See the section Streaming + MLLib in this link.
For example, if you want to train a DecisionTree offline and do predictions online...
In batch application -
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,impurity, maxDepth, maxBins)
model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
In streaming application -
val sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")
sameModel.predict(newData)
Here is one more solution, which I just implemented.
I created a model in Spark batch.
Suppose the final model object is named regmodel:
final LinearRegressionModel regmodel =algorithm.run(JavaRDD.toRDD(parsedData));
and the Spark context is named sc:
JavaSparkContext sc = new JavaSparkContext(sparkConf);
Now, in the same code, I create a Spark Streaming context using the same sc:
final JavaStreamingContext jssc = new JavaStreamingContext(sc,new Duration(Integer.parseInt(conf.getWindow().trim())));
and doing prediction like this:
JavaPairDStream<Double, Double> predictvalue = dist1.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
    private static final long serialVersionUID = 1L;
    @Override
    public Tuple2<Double, Double> call(LabeledPoint v1) throws Exception {
        Double p = v1.label();
        Double q = regmodel.predict(v1.features());
        return new Tuple2<Double, Double>(p, q);
    }
});
With Spark MLLib, I'd build a model (like RandomForest), and then it was possible to eval it outside of Spark by loading the model and using predict on it passing a vector of features.
It seems like with Spark ML, predict is now called transform and only acts on a DataFrame.
Is there any way to build a DataFrame outside of Spark since it seems like one needs a SparkContext to build a DataFrame?
Am I missing something?
Re: Is there any way to build a DataFrame outside of Spark?
It is not possible. DataFrames live inside SQLContext, which in turn lives in SparkContext. Perhaps you could work around it somehow, but the connection between DataFrames and SparkContext is by design.
Here is my solution to use spark models outside of spark context (using PMML):
You create model with a pipeline like this:
SparkConf sparkConf = new SparkConf();
SparkSession session = SparkSession.builder().enableHiveSupport().config(sparkConf).getOrCreate();
String tableName = "schema.table";
Properties dbProperties = new Properties();
dbProperties.setProperty("user",vKey);
dbProperties.setProperty("password",password);
dbProperties.setProperty("AuthMech","3");
dbProperties.setProperty("source","jdbc");
dbProperties.setProperty("driver","com.cloudera.impala.jdbc41.Driver");
String simpleUrl = "jdbc:impala://host:21050/schema";
Dataset<Row> data = session.read().jdbc(simpleUrl ,tableName,dbProperties);
String[] inputCols = {"column1"};
StringIndexer indexer = new StringIndexer().setInputCol("column1").setOutputCol("indexed_column1");
StringIndexerModel alphabet = indexer.fit(data);
data = alphabet.transform(data);
VectorAssembler assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");
Predictor p = new GBTRegressor();
p.set("maxIter",20);
p.set("maxDepth",2);
p.set("maxBins",204);
p.setLabelCol("faktor");
PipelineStage[] stages = {indexer,assembler, p};
Pipeline pipeline = new Pipeline();
pipeline.setStages(stages);
PipelineModel pmodel = pipeline.fit(data);
PMML pmml = ConverterUtil.toPMML(data.schema(),pmodel);
FileOutputStream fos = new FileOutputStream("model.pmml");
JAXBUtil.marshalPMML(pmml,new StreamResult(fos));
Using PMML for predictions (locally, without a Spark context; it is applied to a Map of arguments rather than to a DataFrame):
PMML pmml = org.jpmml.model.PMMLUtil.unmarshal(new FileInputStream(pmmlFile));
ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
MiningModelEvaluator evaluator = (MiningModelEvaluator) modelEvaluatorFactory.newModelEvaluator(pmml);
inputFieldMap = new HashMap<String, Field>();
Map<FieldName,String> args = new HashMap<FieldName, String>();
Field curField = evaluator.getInputFields().get(0);
args.put(curField.getName(), "1.0");
Map<FieldName, ?> result = evaluator.evaluate(args);
Spent days on this problem too. It's not straightforward. My third suggestion involves code I have written specifically for this purpose.
Option 1
As other commenters have said, predict(Vector) is now available. However, you need to know how to construct a vector. If you don't, see Option 3.
Option 2
If the goal is to avoid setting up a Spark server (standalone or cluster mode), then it's possible to start Spark in local mode. The whole thing will run inside a single JVM.
val spark = SparkSession.builder().config("spark.master", "local[*]").getOrCreate()
// create dataframe from file, or make it up from some data in memory
// use model.transform() to get predictions
But this brings unnecessary dependencies to your prediction module, and it consumes resources in your JVM at runtime. Also, if prediction latency is critical, for example making a prediction within a millisecond as soon as a request comes in, then this option is too slow.
Option 3
MLlib FeatureHasher's output can be used as an input to your learner. The class is good for one-hot encoding and also for fixing the size of your feature dimension. You can use it even when all your features are numerical. If you use it in training, then all you need at prediction time is the hashing logic. It's implemented as a Spark transformer, so it's not easy to reuse outside of a Spark environment. So I have done the work of pulling out the hashing function into a library. You apply FeatureHasher and your learner during training as normal. Then here's how you use the slimmed-down hasher at prediction time:
// Schema and hash size must stay consistent across training and prediction
val hasher = new FeatureHasherLite(mySchema, myHashSize)
// create sample data-point and hash it
val feature = Map("feature1" -> "value1", "feature2" -> 2.0, "feature3" -> 3, "feature4" -> false)
val featureVector = hasher.hash(feature)
// Make prediction
val prediction = model.predict(featureVector)
You can see details on my GitHub at tilayealemu/sparkmllite. If you'd rather copy my code, take a look at FeatureHasherLite.scala. There are code samples and unit tests too. Feel free to create an issue if you need help.
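To make the idea behind the hashing step concrete, here is a minimal sketch of the hashing trick in plain Java. It is an illustration only and not the code from the library above: it uses String.hashCode rather than the hash function Spark's FeatureHasher uses, so its indices will not match a model trained with Spark. The class name HashingSketch and its rules for numeric versus non-numeric values are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HashingSketch {

    // Maps a feature map to a fixed-size dense vector.
    // Numeric values: index = hash(name), and the value itself is added.
    // Other values (strings, booleans): index = hash(name + "=" + value), and 1.0 is added.
    static double[] hash(Map<String, Object> features, int numFeatures) {
        double[] vector = new double[numFeatures];
        for (Map.Entry<String, Object> entry : features.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof Number) {
                int index = Math.floorMod(entry.getKey().hashCode(), numFeatures);
                vector[index] += ((Number) value).doubleValue();
            } else {
                String key = entry.getKey() + "=" + value;
                int index = Math.floorMod(key.hashCode(), numFeatures);
                vector[index] += 1.0;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Object> feature = new LinkedHashMap<>();
        feature.put("feature1", "value1");
        feature.put("feature2", 2.0);
        feature.put("feature4", false);
        double[] vector = hash(feature, 16);
        System.out.println("vector length: " + vector.length);
    }
}
```

Because the index depends only on the feature name, its value, and the vector size, the same input always lands in the same slot. This is why the hash size has to stay the same between training and prediction.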
I have produced an IDFModel with PySpark and ipython notebook as follows:
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF
hashingTF = HashingTF() #this will be used with hashing later
txtdata_train = sc.wholeTextFiles("/home/ubuntu/folder").sortByKey() #this returns RDD of (filename, string) pairs for each file from the directory
split_data_train = txtdata_train.map(parse) #my parse function puts RDD in form I want
tf_train = hashingTF.transform(split_data_train) #creates term frequency sparse vectors for the training set
tf_train.cache()
idf_train = IDF().fit(tf_train) #makes IDFmodel, THIS IS WHAT I WANT TO SAVE!!!
tfidf_train = idf_train.transform(tf_train)
This is based on this guide: https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. I would like to save this model so that I can load it again later in a different notebook. However, there is no information on how to do this; the closest I can find is:
Save Apache Spark mllib model in python
But when I tried the suggestion in the answer
idf_train.save(sc, "/home/ubuntu/newfolder")
I get the error code
AttributeError: 'IDFModel' object has no attribute 'save'
Is there something I am missing, or is it not possible to save IDFModel objects? Thanks!
I did something like that in Scala/Java. It seems to work, but it might not be very efficient. The idea is to write the model to a file as a serialized object and read it back later. Good luck! :)
try {
  val fileOut: FileOutputStream = new FileOutputStream(savePath + "/idf.jserialized")
  val out: ObjectOutputStream = new ObjectOutputStream(fileOut)
  out.writeObject(idf)
  out.close()
  fileOut.close()
  System.out.println("\nSerialization Successful... Checkout your specified output file..\n")
} catch {
  case foe: FileNotFoundException => foe.printStackTrace()
  case ioe: IOException => ioe.printStackTrace()
}
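The snippet above covers only the write side. As a rough sketch of the full round trip, the plain Java example below writes a serializable object to a file and reads it back. IdfWeights is a hypothetical stand-in for the model object, and the class and method names are assumptions for illustration. The approach only works when the object's class implements java.io.Serializable and a compatible version of that class is on the classpath when loading.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

public class SerializationRoundTrip {

    // Hypothetical stand-in for the model; any Serializable object is handled the same way.
    static class IdfWeights implements Serializable {
        private static final long serialVersionUID = 1L;
        final double[] weights;
        IdfWeights(double[] weights) { this.weights = weights; }
    }

    // Writes the object to the given file; try-with-resources closes the stream.
    static void save(Serializable obj, Path path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(path))) {
            out.writeObject(obj);
        }
    }

    // Reads one object back from the given file.
    static Object load(Path path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(path))) {
            return in.readObject();
        }
    }

    // Saves the weights to a temporary file, loads them back, and removes the file.
    static double[] roundTrip(double[] weights) {
        try {
            Path path = Files.createTempFile("idf", ".jserialized");
            try {
                save(new IdfWeights(weights), path);
                return ((IdfWeights) load(path)).weights;
            } finally {
                Files.deleteIfExists(path);
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        double[] restored = roundTrip(new double[] {0.5, 1.25});
        System.out.println("restored " + restored.length + " weights");
    }
}
```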