Spark Streaming - Travis CI and GitHub custom receiver - continuous data but empty RDD? - apache-spark

Lately, as part of a scientific research project, I've been developing an application that streams (or at least should stream) data from Travis CI and GitHub, using their REST APIs. The purpose is to gain insight into the commit-build relationship, so that I can run further analyses on it.
For this, I've implemented the following Travis custom receiver:
object TravisUtils {
  def createStream(ctx: StreamingContext, storageLevel: StorageLevel): ReceiverInputDStream[Build] =
    new TravisInputDStream(ctx, storageLevel)
}
private[streaming]
class TravisInputDStream(ctx: StreamingContext, storageLevel: StorageLevel)
    extends ReceiverInputDStream[Build](ctx) {
  def getReceiver(): Receiver[Build] = new TravisReceiver(storageLevel)
}
private[streaming]
class TravisReceiver(storageLevel: StorageLevel) extends Receiver[Build](storageLevel) with Logging {
  def onStart(): Unit = {
    new BuildStream().addListener(new BuildListener {
      override def onBuildsReceived(numberOfBuilds: Int): Unit = {
      }
      override def onBuildRepositoryReceived(build: Build): Unit = {
        store(build)
      }
      override def onException(e: Exception): Unit = {
        reportError("Exception while streaming travis", e)
      }
    })
  }
  def onStop(): Unit = {
  }
}
The receiver uses my custom-made Travis API library (developed in Java using Apache Async Client). However, the problem is the following: the data that I should be receiving is continuous and keeps changing, i.e. it is being pushed to Travis and GitHub constantly. As an example, consider that GitHub records approximately 350 events per second, including push events, commit comments and similar.
But when streaming either GitHub or Travis, I do get the data for the first two batches, and after that the RDDs that are part of the DStream are empty, although there is data to be streamed!
So far I've checked a couple of things, including the HttpClient used for issuing requests to the API, but none of them actually solved the problem.
Therefore, my question is: what could be going on? Why isn't Spark streaming the data after period x passes? Below you can find the context and configuration I set up:
val configuration = new SparkConf().setAppName("StreamingSoftwareAnalytics").setMaster("local[2]")
val ctx = new StreamingContext(configuration, Seconds(3))
val stream = GitHubUtils.createStream(ctx, StorageLevel.MEMORY_AND_DISK_SER)
// RDD IS EMPTY - that is what is happening!
stream.window(Seconds(9)).foreachRDD(rdd => {
  if (rdd.isEmpty()) {
    println("RDD IS EMPTY")
  } else {
    rdd.collect().foreach(event => println(event.getRepo.getName + " " + event.getId))
  }
})
ctx.start()
ctx.awaitTermination()
Thanks in advance!

Related

Garbage Collection on Flink Applications

I have a very simple Flink application in Scala with two simple streams. I am broadcasting one of the streams to the other. The broadcast stream contains rules, and the application just checks whether the other stream's tuples are contained in the rules or not. Everything is working fine, and my code is shown below.
This is an application that runs indefinitely. I wonder whether there is any possibility that the JVM collects my rules object as garbage.
Does anyone have any idea? Many thanks in advance.
object StreamBroadcasting extends App {
  val env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI()
  val stream = env
    .socketTextStream("localhost", 9998)
    .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
    .keyBy(l => l)
  val ruleStream = env
    .socketTextStream("localhost", 9999)
    .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
  val broadcastStream: DataStream[String] = ruleStream.broadcast
  stream.connect(broadcastStream)
    .flatMap(new SimpleConnect)
    .print
  class SimpleConnect extends RichCoFlatMapFunction[String, String, (String, Boolean)] {
    private var rules: Set[String] = Set.empty[String] // Can JVM collect this object after a long time?
    override def open(parameters: Configuration): Unit = {}
    override def flatMap1(value: String, out: Collector[(String, Boolean)]): Unit = {
      // Emit an explicit tuple instead of relying on argument auto-tupling
      out.collect((value, rules.contains(value)))
    }
    override def flatMap2(value: String, out: Collector[(String, Boolean)]): Unit = {
      rules = rules + value
    }
  }
  env.execute("flink-broadcast-streams")
}
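No, the Set of rules will not be garbage collected. The SimpleConnect instance stays alive for as long as the job runs and holds a strong reference to the set through its rules field, so it will stick around forever. (Of course, since you're not using Flink's broadcast state, the rules won't survive an application restart.)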
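The reachability argument can be checked outside Flink with plain Java. The sketch below is not Flink code: RuleHolder is a hypothetical stand-in for the SimpleConnect instance, and the weak reference exists only to observe whether the set has been reclaimed. As long as the holder itself is reachable, the set referenced by its field cannot be collected.

```java
import java.lang.ref.WeakReference;
import java.util.HashSet;
import java.util.Set;

class GcDemo {

    // Hypothetical stand-in for the operator instance; not a Flink class.
    static final class RuleHolder {
        private final Set<String> rules = new HashSet<>();

        void addRule(String rule) { rules.add(rule); }

        boolean matches(String value) { return rules.contains(value); }

        // A weak reference lets us observe collection without keeping the set alive.
        WeakReference<Set<String>> observe() { return new WeakReference<>(rules); }
    }

    public static void main(String[] args) {
        RuleHolder holder = new RuleHolder();
        holder.addRule("flink");
        WeakReference<Set<String>> ref = holder.observe();

        // System.gc() is only a hint, but a strongly reachable object is never reclaimed.
        System.gc();
        System.out.println("set still present: " + (ref.get() != null));

        // The holder is still in use here, so the set was reachable during the GC above.
        System.out.println("matches flink: " + holder.matches("flink"));
    }
}
```

The same reasoning applies to any field of a long-lived operator: it is the reference from the live instance, not the age of the object, that decides whether it can be collected.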

How to use Groovy and Spock to test Apache Flink Job?

I have a Flink job that reads data from Kafka into a table. The table is emitted into a DataStream, to which I apply a filter function; the data stream is then converted back to a table, which writes the data back to Kafka.
I want to test the functionality of the filter function. I am writing unit tests in Groovy using the Spock framework. In my unit test I call the Flink job with the Table SQL string containing details about the Kafka topic. However, I am confused about how to load the right StreamExecutionEnvironment and TableEnvironment: when I create a new object of my Flink class those values are null, and I don't have getters/setters to set everything up because that would make the code really messy.
The following is my logic. My question is: can I use the Apache Flink APIs seamlessly from Groovy, or are there many layers/pitfalls, and how can I better approach these tests?
class DataStreamTests extends Specification {
    @Autowired
    ApplicationConfiguration configuration;
    FlinkStreaming streaming = new FlinkStreaming();
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    EnvironmentSettings settings = EnvironmentSettings
        .newInstance()
        .inStreamingMode()
        .build();
    final StreamTableEnvironment tableEnvironment = TableEnvironment.create(settings);
    def resourcePath = "cars/porche.txt"
    StreamSpec streamSpec = ConfigurationParser.createStreamSpec(getClass().getResource(resourcePath).text)
    def "create a new input stream from input table sql"() {
        given:
        DataStream<String> streamRecords = env.readTextFile("/streaming_signals/pedals.txt")
        streaming.setConfiguration(configuration)
        streaming.setTableEnvironment(tableEnvironment);
        when:
        String tableSpec = streaming.createTableSpec(streamSpec);
        DataStream<Row> rawStream = streaming.getFilteredStream(streamSpec, tableSpec)
        DataStream<String> comapareStreams = rawStream.map(new MapFunction<Row, String>() {
            @Override
            public String map(Row record) {
                // Logic to compare stream received with test stream
            }
        });
        then:
        // Comparison Logic
    }
}
org.apache.flink.table.api.ValidationException: Could not find any factories that implement 'org.apache.flink.table.delegation.ExecutorFactory' in the classpath.
at org.apache.flink.table.factories.FactoryUtil.discoverFactory(FactoryUtil.java:385)
at org.apache.flink.table.api.internal.TableEnvironmentImpl.create(TableEnvironmentImpl.java:295)
at org.apache.flink.table.api.internal.TableEnvironmentImpl.create(TableEnvironmentImpl.java:266)
at org.apache.flink.table.api.TableEnvironment.create(TableEnvironment.java:95)
at com.streaming.DataStreamTests.$spock_initializeFields(DataStreamTests.groovy:38

Azure DataBricks Stream foreach fails with NotSerializableException

I want to continuously process the rows of a streaming dataset (originally sourced from Kafka): based on a condition, I want to update a Redis hash. This is my code snippet (lastContacts is the result of a previous command and is a stream of this type: org.apache.spark.sql.DataFrame = [serialNumber: string, lastModified: long], which expands to org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]):
class MyStreamProcessor extends ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = {
    true
  }
  override def process(record: Row) = {
    val stringHashRDD = sc.parallelize(Seq(("lastContact", record(1).toString)))
    sc.toRedisHASH(stringHashRDD, record(0).toString)(redisConfig)
  }
  override def close(errorOrNull: Throwable): Unit = {}
}
val query = lastContacts
  .writeStream
  .foreach(new MyStreamProcessor())
  .start()
query.awaitTermination()
I receive a huge stack trace, the relevant part of which (I think) is this: java.io.NotSerializableException: org.apache.spark.sql.streaming.DataStreamWriter
Could anyone explain why this exception occurs and how to avoid it? Thank you!
This question is related to the following two:
DataFrame to RDD[(String, String)] conversion
Call a function with each element a stream in Databricks
SparkContext is not serializable.
Any implementation of ForeachWriter must be serializable because each task will get a fresh serialized-deserialized copy of the provided object. Hence, it is strongly recommended that any initialization for writing data (e.g. opening a connection or starting a transaction) is done after the open(...) method has been called, which signifies that the task is ready to generate data.
In your code, you are trying to use the Spark context inside the process method:
override def process(record: Row) = {
  val stringHashRDD = sc.parallelize(Seq(("lastContact", record(1).toString)))
  sc.toRedisHASH(stringHashRDD, record(0).toString)(redisConfig)
}
To send data to Redis, you need to create your own connection, open it in the open method, and then use it in the process method.
Take a look at how to create a Redis connection pool: https://github.com/RedisLabs/spark-redis/blob/master/src/main/scala/com/redislabs/provider/redis/ConnectionPool.scala
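The failure mode and the suggested fix can be reproduced with plain Java serialization, without Spark. The sketch below is not Spark code: Connection, BrokenWriter and FixedWriter are hypothetical names, and Connection stands in for any non-serializable resource such as a SparkContext or a Redis client. It only illustrates why a resource held in a regular field breaks serialization, and why creating it in open() avoids the problem.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class WriterSerializationDemo {

    // Hypothetical non-serializable resource (think SparkContext or a Redis client).
    static final class Connection {
        String send(String key, String value) { return key + "=" + value; }
    }

    // Holds the resource in a regular field, so serializing the writer fails.
    static final class BrokenWriter implements Serializable {
        private final Connection connection = new Connection();
        String process(String key, String value) { return connection.send(key, value); }
    }

    // Keeps the resource transient and creates it in open(), after deserialization.
    static final class FixedWriter implements Serializable {
        private transient Connection connection;
        void open() { connection = new Connection(); }
        String process(String key, String value) { return connection.send(key, value); }
    }

    // Serializes and deserializes an object, similar to shipping it to another JVM.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T value) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            roundTrip(new BrokenWriter());
        } catch (NotSerializableException e) {
            System.out.println("broken writer failed: " + e);
        }
        FixedWriter writer = roundTrip(new FixedWriter());
        writer.open(); // the connection is created on the receiving side
        System.out.println(writer.process("lastContact", "42"));
    }
}
```

In a real ForeachWriter the same idea would mean keeping only serializable configuration in the constructor and building the Redis client inside open(...); the exact client API to use there is outside the scope of this sketch.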

Read & write data into cassandra using apache flink Java API

I intend to use Apache Flink to read data from and write data to Cassandra. I was hoping to use flink-connector-cassandra, but I can't find good documentation or examples for the connector.
Can you please point me to the right way to read and write data from Cassandra using Apache Flink? I only see sink examples, which are purely for writing. Is Apache Flink also meant for reading data from Cassandra, similar to Apache Spark?
I had the same question, and this is what I was looking for. I don't know if it is oversimplified for what you need, but I figured I should show it nonetheless.
ClusterBuilder cb = new ClusterBuilder() {
    @Override
    public Cluster buildCluster(Cluster.Builder builder) {
        return builder.addContactPoint("urlToUse.com").withPort(9042).build();
    }
};
CassandraInputFormat<Tuple2<String, String>> cassandraInputFormat = new CassandraInputFormat<>("SELECT * FROM example.cassandraconnectorexample", cb);
cassandraInputFormat.configure(null);
cassandraInputFormat.open(null);
Tuple2<String, String> testOutputTuple = new Tuple2<>();
cassandraInputFormat.nextRecord(testOutputTuple);
System.out.println("column1: " + testOutputTuple.f0);
System.out.println("column2: " + testOutputTuple.f1);
The way I figured this out was thanks to finding the code for the "CassandraInputFormat" class and seeing how it worked (http://www.javatips.net/api/flink-master/flink-connectors/flink-connector-cassandra/src/main/java/org/apache/flink/batch/connectors/cassandra/CassandraInputFormat.java). I honestly expected it to just be a format and not the full class of reading from Cassandra based on the name, and I have a feeling others might be thinking the same thing.
ClusterBuilder cb = new ClusterBuilder() {
    @Override
    public Cluster buildCluster(Cluster.Builder builder) {
        return builder.addContactPoint("localhost").withPort(9042).build();
    }
};
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);
InputFormat inputFormat = new CassandraInputFormat<Tuple3<Integer, Integer, Integer>>("SELECT * FROM test.example;", cb);//, TypeInformation.of(Tuple3.class));
DataStreamSource t = env.createInput(inputFormat, TupleTypeInfo.of(new TypeHint<Tuple3<Integer, Integer,Integer>>() {}));
tableEnv.registerDataStream("t1",t);
Table t2 = tableEnv.sql("select * from t1");
t2.printSchema();
You can write a class that extends RichFlatMapFunction:
class MongoMapper extends RichFlatMapFunction[JsonNode, JsonNode] {
  var userCollection: MongoCollection[Document] = _
  override def open(parameters: Configuration): Unit = {
    // do something here like opening connection
    val client: MongoClient = MongoClient("mongodb://localhost:10000")
    userCollection = client.getDatabase("gp_stage").getCollection("users").withReadPreference(ReadPreference.secondaryPreferred())
    super.open(parameters)
  }
  override def flatMap(event: JsonNode, out: Collector[JsonNode]): Unit = {
    // Do something here per record and this function can make use of objects initialized via open
    userCollection.find(Filters.eq("_id", somevalue)).limit(1).first().subscribe(
      (result: Document) => {
        // println(result)
      },
      (t: Throwable) => {
        println(t)
      },
      () => {
        out.collect(event)
      }
    )
  }
}
Basically, the open function executes once per worker and flatMap executes once per record. The example is for Mongo, but the same approach can be used for Cassandra.
In your case, as I understand it, the first step of your pipeline is reading data from Cassandra, so rather than writing a RichFlatMapFunction you should write your own RichSourceFunction.
As a reference, you can have a look at the simple implementation of WikipediaEditsSource.
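For illustration, the run()/cancel() shape that such a source typically follows can be sketched in plain Java. This is not the Flink API: IteratorSource is a hypothetical class, the Consumer stands in for Flink's source context, and the iterator stands in for a Cassandra result set. The point is only the loop structure, where run() emits records until the data ends or cancel() flips a volatile flag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

class SourceLoopDemo {

    // Hypothetical source shaped after a run()/cancel() contract; not a Flink class.
    static final class IteratorSource<T> {
        private final Iterator<T> rows;          // stands in for a Cassandra result set
        private volatile boolean running = true; // volatile: cancel() may come from another thread

        IteratorSource(Iterator<T> rows) { this.rows = rows; }

        // Emits rows until the data is exhausted or the source is cancelled.
        void run(Consumer<T> collector) {
            while (running && rows.hasNext()) {
                collector.accept(rows.next());
            }
        }

        void cancel() { running = false; }
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        IteratorSource<String> source =
            new IteratorSource<>(Arrays.asList("row1", "row2", "row3").iterator());
        source.run(emitted::add);
        System.out.println(emitted);
    }
}
```

In an actual RichSourceFunction, the connection to Cassandra would be created in open() and released in close(), following the same lifecycle as the Mongo example above.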

task not serializable error while performing an RDD map function scala

Trust me on this one: I have tried really hard, toiling night and day, but just could not get hold of this Task not serializable error, which is now totally eating me up!
And I understand there are many similar questions floating around on SO, but either I am really too dumb to get those (I am not expecting any spoon-feeding here) or mine is a different bug story altogether.
Totally requesting you guys to have a look at this one:
class RootServer(val config: Config) extends Configurable with Server with Serializable {
  var PKsAffectedDF: DataFrame = dataSource.getAffectedPKs(prevBookMark, currBookMark)
  if (PKsAffectedDF.rdd.isEmpty()) {
    Holder.log.info("[SQL] no records to upsert in the internal Status Table == for bookmarks : "+prevBookMark+" ==and== "+currBookMark+" for table == "+dataSource.db+"_"+dataSource.table)
  }
  val PKsAffectedDF_json = PKsAffectedDF.toJSON
  PKsAffectedDF_json.foreachPartition { partitionOfRecords => {
      var props = new Properties()
      // Property keys and values were left blank in the original post
      props.put("", "")
      props.put("", "")
      props.put("", "")
      props.put("", "")
      val producer = new KafkaProducer[String, String](props)
      partitionOfRecords.foreach {
        case x: String => {
          println(x)
          val message = new ProducerRecord[String, String]("[TOPIC] "+dbname+"_"+dbtable, dbname, x)
          producer.send(message)
        }
      }
    }
  }
}
Now, Config is the typesafe.config class, which I believe is serializable, and Server is a basic abstract class with just two methods. I am totally stuck on what could be causing the following stack trace:
http://pastebin.com/5yPA1s7e Sorry for pastebinning it but the entire stacktrace is there.
Thanks in Advance peeps.
