How to convert ML model to MLlib model? - apache-spark

I had trained Machine Learning models using Spark RDD based API (mllib package) 1.5.2 say "Mymodel123",
org.apache.spark.mllib.tree.model.RandomForestModel Mymodel123 = ....;
Mymodel123.save("sparkcontext","path");
Now I'm using Spark Dataset based API (ml package) 2.2.0 . Is there any way to load the models (Mymodel123) using Dataset based API?
org.apache.spark.ml.classification.RandomForestClassificationModel newModel = org.apache.spark.ml.classification.RandomForestClassificationModel.load("sparkcontext","path");

There is no public API that can do that, however you RandomForestModels wrap old mllib API and provide private methods which can be used to convert mllib models to ml models:
/** Convert a model from the old API */
private[ml] def fromOld(
oldModel: OldRandomForestModel,
parent: RandomForestClassifier,
categoricalFeatures: Map[Int, Int],
numClasses: Int,
numFeatures: Int = -1): RandomForestClassificationModel = {
require(oldModel.algo == OldAlgo.Classification, "Cannot convert RandomForestModel" +
s" with algo=${oldModel.algo} (old API) to RandomForestClassificationModel (new API).")
val newTrees = oldModel.trees.map { tree =>
// parent for each tree is null since there is no good way to set this.
DecisionTreeClassificationModel.fromOld(tree, null, categoricalFeatures)
}
val uid = if (parent != null) parent.uid else Identifiable.randomUID("rfc")
new RandomForestClassificationModel(uid, newTrees, numFeatures, numClasses)
}
so it is not impossible. In Java you can use it directly (Java doesn't respect package private modifiers), in Scala you'll have to put adapter code in org.apache.spark.ml package.

Related

How to use Groovy and Spock to test Apach Flink Job?

I have a Flink job that reads data from Kafka into a table which is emitted into a DataStream on which I apply a filter function and then convert the data stream back to a table which writes data back to Kafka.
I want to test the functionality of the filter function. I am writing unit tests in Groovy using Spock (framework). In my unit test I am calling the Flink job with the Table SQL string with details about the Kafka topic, however, I am confused on how to load the right StreamExecution and TableEnvironment because when I create a new object of my Flink class, those values are null and I don't have getters/setters to set everything up because that would make the code really messy.
The following is my logic. My question is can I write Apache Flink APIs as seamlessly in Groovy or there are many layers/pitfalls and how can I better approach these tests:
class DataStreamTests extends Specification {
#Autowired
ApplicationConfiguration configuration;
FlinkStreaming streaming = new FlinkStreaming();
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
EnvironmentSettings settings = EnvironmentSettings
.newInstance()
.inStreamingMode()
.build();
final StreamTableEnvironment tableEnvironment = TableEnvironment.create(settings);
def resourcePath = "cars/porche.txt"
StreamSpec streamSpec = ConfigurationParser.createStreamSpec(getClass().getResource(resourcePath).text)
def "create a new input stream from input table sql"() {
given:
DataStream<String> streamRecords = env.readTextFile("/streaming_signals/pedals.txt")
streaming.setConfiguration(configuration)
streaming.setTableEnvironment(tableEnvironment);
when:
String tableSpec = streaming.createTableSpec(streamSpec);
DataStream<Row> rawStream = streaming.getFilteredStream(streamSpec, tableSpec)
DataStream<String> comapareStreams = rawStream.map(new MapFunction<Row, String>() {
#Override
public String map(Row record) {
// Logic to compare stream received with test stream
}
});
then:
// Comparison Logic
}
}
org.apache.flink.table.api.ValidationException: Could not find any factories that implement 'org.apache.flink.table.delegation.ExecutorFactory' in the classpath.
at org.apache.flink.table.factories.FactoryUtil.discoverFactory(FactoryUtil.java:385)
at org.apache.flink.table.api.internal.TableEnvironmentImpl.create(TableEnvironmentImpl.java:295)
at org.apache.flink.table.api.internal.TableEnvironmentImpl.create(TableEnvironmentImpl.java:266)
at org.apache.flink.table.api.TableEnvironment.create(TableEnvironment.java:95)
at com.streaming.DataStreamTests.$spock_initializeFields(DataStreamTests.groovy:38

Cassandra List<UDT> not getting deserialized using datastax java driver 4.0.0

I am currently working on a new project and chose Cassandra as our data store.
I have a use case where I store prices for material and to accomplish it I created list of User-Defined Types (UDTs). But unfortunately, while deserialization using datastax driver. After queries for the required data, I found that the list object is null while in the database there is value for it. Is it a current limitation of Cassandra java driver or am I missing something?
This is how my simplified Entity (table) looks like:
#PrimaryKeyColumn(name = "tenant_id", ordinal = 0, type = PrimaryKeyType.PARTITIONED)
private long tenantId;
#PrimaryKeyColumn(name = "item_id", ordinal = 1, type = PrimaryKeyType.CLUSTERED)
private String itemId;
#CassandraType(type = DataType.Name.LIST, userTypeName = "volume_scale_1")
private List<VolumeScale> volumeScale1;
}
So I am getting volumeScale1 as null after database select query.
And this is how my UDT looks like:
In Cassandra database:
CREATE TYPE pricingservice.volume_scale (
from_scale int,
to_scale int,
value frozen<price_value>
);
As UDT in java :
#UserDefinedType("volume_scale")
public class VolumeScale
{
#CassandraType(type = DataType.Name.TEXT, userTypeName = "from_scale")
#Column("from_scale")
private String fromScale;
#CassandraType(type = DataType.Name.TEXT, userTypeName = "to_scale")
#Column("to_scale")
private String toScale;
#CassandraType(type = DataType.Name.UDT, userTypeName = "value")
private PriceValue value;
// getter and setter
}
I also tried using Object Mapper from java driver itself as per #Alex suggestion but got stuck at one point where creating an object using ItemPriceByMaterialMapperBuilder is throwing compilation error. Is anything additional required towards annotation processing or am I missing something? do you have any idea how to use Mapper annotation? I used google AutoService also to achieve annotation processing externally but didn't work.
#Mapper
//#AutoService(Processor.class)
public interface ItemPriceByMaterialMapper
// extends Processor
{
static MapperBuilder<ItemPriceByMaterialMapper> builder(CqlSession session) {
return new ItemPriceByMaterialMapperBuilder(session);
}
#DaoFactory
ItemPriceByMaterialDao itemPriceByMaterialDao ();
// #DaoFactory
// ItemPriceByMaterialDao itemPriceByMaterialDao(#DaoKeyspace CqlIdentifier
// keyspace);
}
Version used:
Java Version: 1.8
DataStax OSS java-driver-mapper-processor: 4.5.1
DataStax OSS java-driver-mapper-runtime: 4.5.1
Cassandra: 3.11.4
Spring Boot Framework: 2.2.4.RELEASE
From what I understand, you have multiple problems: if you're using Spring Data Cassandra, then you'll get older driver (3.7.2 for Spring 2.2.6-RELEASE), and it may clash with driver 4.0.0 (it's too old, don't use it) that you're trying to use. Driver 4.x isn't binary compatible with previous drivers, and its support in Spring Data Cassandra could be only in the next major release of Spring.
Instead of Spring Data you can use Object Mapper from java driver itself - it could be more optimized than Spring version.
I decided not to use object mapper and work with Spring Data Cassandra with Spring 2.2.6-RELEASE. Thanks

How can I write a Dataset<Row> into kafka output topic on Spark Structured Streaming - Java8

I am trying to use ForeachWriter interface in Spark 2.1 it's interface, but I cannot use it.
It will be supported in Spark 2.2.0. To learn how to use it, I suggest you read this blog post: https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html
You can try Spark 2.2.0 RC2 [1] or just wait for the final release.
Another option is taking a look at this blog if you cannot use Spark 2.2.0+:
https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
It has a very simple Kafka sink and maybe that's enough for you.
[1] http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Apache-Spark-2-2-0-RC2-td21497.html
First thing to know is that, if you working with spark structured Stream and processing streaming data, you'll be having a streaming Dataset.
Being said, the way to write this streaming Dataset is by calling the ForeachWriter, you got it right..
import org.apache.spark.sql.ForeachWriter
val writer = new ForeachWriter[Commons.UserEvent] {
override def open(partitionId: Long, version: Long) = true
override def process(value: Commons.UserEvent) = {
processRow(value)
}
override def close(errorOrNull: Throwable) = {}
}
val query =
ds.writeStream.queryName("aggregateStructuredStream").outputMode("complete").foreach(writer).start
And the function that writes into topic will be like:
private def processRow(value: Commons.UserEvent) = {
/*
* Producer.send(topic, data)
*/
}

Read & write data into cassandra using apache flink Java API

I intend to use apache flink for read/write data into cassandra using flink. I was hoping to use flink-connector-cassandra, I don't find good documentation/examples for the connector.
Can you please point me to the right way for read and write data from cassandra using Apache Flink. I see only sink example which are purely for write ? Is apache flink meant for reading data too from cassandra similar to apache spark ?
I had the same question, and this is what I was looking for. I don't know if it is over simplified for what you need, but figured I should show it none the less.
ClusterBuilder cb = new ClusterBuilder() {
#Override
public Cluster buildCluster(Cluster.Builder builder) {
return builder.addContactPoint("urlToUse.com").withPort(9042).build();
}
};
CassandraInputFormat<Tuple2<String, String>> cassandraInputFormat = new CassandraInputFormat<>("SELECT * FROM example.cassandraconnectorexample", cb);
cassandraInputFormat.configure(null);
cassandraInputFormat.open(null);
Tuple2<String, String> testOutputTuple = new Tuple2<>();
cassandraInputFormat.nextRecord(testOutputTuple);
System.out.println("column1: " + testOutputTuple.f0);
System.out.println("column2: " + testOutputTuple.f1);
The way I figured this out was thanks to finding the code for the "CassandraInputFormat" class and seeing how it worked (http://www.javatips.net/api/flink-master/flink-connectors/flink-connector-cassandra/src/main/java/org/apache/flink/batch/connectors/cassandra/CassandraInputFormat.java). I honestly expected it to just be a format and not the full class of reading from Cassandra based on the name, and I have a feeling others might be thinking the same thing.
ClusterBuilder cb = new ClusterBuilder() {
#Override
public Cluster buildCluster(Cluster.Builder builder) {
return builder.addContactPoint("localhost").withPort(9042).build();
}
};
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);
InputFormat inputFormat = new CassandraInputFormat<Tuple3<Integer, Integer, Integer>>("SELECT * FROM test.example;", cb);//, TypeInformation.of(Tuple3.class));
DataStreamSource t = env.createInput(inputFormat, TupleTypeInfo.of(new TypeHint<Tuple3<Integer, Integer,Integer>>() {}));
tableEnv.registerDataStream("t1",t);
Table t2 = tableEnv.sql("select * from t1");
t2.printSchema();
You can use RichFlatMapFunction to extend a class
class MongoMapper extends RichFlatMapFunction[JsonNode,JsonNode]{
var userCollection: MongoCollection[Document] = _
override def open(parameters: Configuration): Unit = {
// do something here like opening connection
val client: MongoClient = MongoClient("mongodb://localhost:10000")
userCollection = client.getDatabase("gp_stage").getCollection("users").withReadPreference(ReadPreference.secondaryPreferred())
super.open(parameters)
}
override def flatMap(event: JsonNode, out: Collector[JsonNode]): Unit = {
// Do something here per record and this function can make use of objects initialized via open
userCollection.find(Filters.eq("_id", somevalue)).limit(1).first().subscribe(
(result: Document) => {
// println(result)
},
(t: Throwable) =>{
println(t)
},
()=>{
out.collect(event)
}
)
}
}
}
Basically open function executes once per worker and flatmap executes it per record. The example is for mongo but can be similarly used for cassandra
In your case as I understand the first step of your pipeline is reading data from Cassandra rather than writing a RichFlatMapFunction you should write your own RichSourceFunction
As a reference you can have a look at simple implementation of WikipediaEditsSource.

Spark Streaming - Travis CI and GitHub custom receiver - continuous data but empty RDD?

Lately, as a part of a scientific research, I've been developing an application that streams (or at least should) data from Travis CI and GitHub, using their REST API's. The purpose of this is to get insight into the commit-build relationship, in order to further perform numerous analysis.
For this, I've implemented the following Travis custom receiver:
object TravisUtils {
def createStream(ctx : StreamingContext, storageLevel: StorageLevel) : ReceiverInputDStream[Build] = new TravisInputDStream(ctx, storageLevel)
}
private[streaming]
class TravisInputDStream(ctx : StreamingContext, storageLevel : StorageLevel) extends ReceiverInputDStream[Build](ctx) {
def getReceiver() : Receiver[Build] = new TravisReceiver(storageLevel)
}
private[streaming]
class TravisReceiver(storageLevel: StorageLevel) extends Receiver[Build](storageLevel) with Logging {
def onStart() : Unit = {
new BuildStream().addListener(new BuildListener {
override def onBuildsReceived(numberOfBuilds: Int): Unit = {
}
override def onBuildRepositoryReceived(build: Build): Unit = {
store(build)
}
override def onException(e: Exception): Unit = {
reportError("Exception while streaming travis", e)
}
})
}
def onStop() : Unit = {
}
}
Whereas the receiver uses my custom made TRAVIS API library (developed in Java using Apache Async Client). However, the problem is the following: the data that I should be receiving is continuous and changes i.e. is being pushed to Travis and GitHub constantly. As an example, consider the fact that GitHub records per second approx. 350 events - including push events, commit comment and similar.
But, when streaming either GitHub or Travis, I do get the data from the first two batches, but then afterwards, the RDD's apart of the DStream are empty - although there is data to be streamed!
I've checked so far couple of things, including the HttpClient used for omitting requests to the API, but none of them did actually solve this problem.
Therefore, my question is - what could be going on? Why isn't Spark streaming the data after period x passes. Below, you may find the set context and configuration:
val configuration = new SparkConf().setAppName("StreamingSoftwareAnalytics").setMaster("local[2]")
val ctx = new StreamingContext(configuration, Seconds(3))
val stream = GitHubUtils.createStream(ctx, StorageLevel.MEMORY_AND_DISK_SER)
// RDD IS EMPTY - that is what is happenning!
stream.window(Seconds(9)).foreachRDD(rdd => {
if (rdd.isEmpty()) {println("RDD IS EMPTY")} else {rdd.collect().foreach(event => println(event.getRepo.getName + " " + event.getId))}
})
ctx.start()
ctx.awaitTermination()
Thanks in advance!

Resources