I have the following sample code that builds a pipeline:
private Pipeline buildPipeline() {
    logger.debug("AbstractAuditLogProcessor.buildPipeline method start");
    Pipeline p = Pipeline.create();
    p.drawFrom(Sources.<String, CacheEntry<AuditLogRecord>>remoteMapJournal("cache_AuditLog", getPlatformClientConfig(), START_FROM_OLDEST))
        .addTimestamps((v) -> getTimeStamp(v), 3000)
        .peek()
        .groupingKey((v) -> Tuple2.tuple2(getUserID(v), getTranType(v)))
        .window(WindowDefinition.sliding(getSlidingWindowLengthInMillies(), getSlidingStepInMillies()))
        .aggregate(counting())
        .map((v) -> getMapKey(v))
        //.<Map.Entry<String, Long>>customTransform("test2", () -> this)
        //.<Offer>customTransform("Offer_Recommendations", () -> this)
        .<Map.Entry<String, Offer>>customTransform("Offer_Recommendations", () -> this)
        //.drainTo(Sinks.remoteList("cache_OfferRecommendations", getPlatformClientConfig()));
        .drainTo(Sinks.remoteMap("cache_OfferRecommendations", getPlatformClientConfig()));
    logger.debug("AbstractAuditLogProcessor.buildPipeline method end");
    return p;
}
This code prints the following DAG information:
dag
.vertex("remoteMapJournalSource(cache_AuditLog)").localParallelism(1)
.vertex("sliding-window-step1").localParallelism(4)
.vertex("sliding-window-step2").localParallelism(4)
.vertex("map").localParallelism(4)
.vertex("Offer_Recommendations").localParallelism(4)
.vertex("remoteMapSink(cache_OfferRecommendations)").localParallelism(1)
.edge(between("remoteMapJournalSource(cache_AuditLog)", "sliding-window-step1").partitioned(?))
.edge(between("sliding-window-step1", "sliding-window-step2").partitioned(?).distributed())
.edge(between("sliding-window-step2", "map"))
.edge(between("map", "Offer_Recommendations"))
.edge(between("Offer_Recommendations", "remoteMapSink(cache_OfferRecommendations)"))
The DAG information includes additional details / method calls such as partitioned() and distributed().
Does this distribute the records based on the key?
Also, how does Hazelcast Jet ensure that the records are not moved to different partitions?
I have a Flink job that reads data from Kafka into a table, which is emitted into a DataStream, on which I apply a filter function, and then convert the data stream back to a table that writes data back to Kafka.
I want to test the functionality of the filter function. I am writing unit tests in Groovy using the Spock framework. In my unit test I am calling the Flink job with the Table SQL string and details about the Kafka topic; however, I am confused about how to load the right StreamExecutionEnvironment and TableEnvironment, because when I create a new object of my Flink class those values are null, and I don't have getters/setters to set everything up because that would make the code really messy.
The following is my logic. My question is: can I use the Apache Flink APIs as seamlessly from Groovy, or are there many layers/pitfalls, and how can I better approach these tests?
class DataStreamTests extends Specification {

    @Autowired
    ApplicationConfiguration configuration;

    FlinkStreaming streaming = new FlinkStreaming();

    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    EnvironmentSettings settings = EnvironmentSettings
        .newInstance()
        .inStreamingMode()
        .build();

    final StreamTableEnvironment tableEnvironment = TableEnvironment.create(settings);

    def resourcePath = "cars/porche.txt"

    StreamSpec streamSpec = ConfigurationParser.createStreamSpec(getClass().getResource(resourcePath).text)

    def "create a new input stream from input table sql"() {
        given:
        DataStream<String> streamRecords = env.readTextFile("/streaming_signals/pedals.txt")
        streaming.setConfiguration(configuration)
        streaming.setTableEnvironment(tableEnvironment);

        when:
        String tableSpec = streaming.createTableSpec(streamSpec);
        DataStream<Row> rawStream = streaming.getFilteredStream(streamSpec, tableSpec)
        DataStream<String> comapareStreams = rawStream.map(new MapFunction<Row, String>() {
            @Override
            public String map(Row record) {
                // Logic to compare stream received with test stream
            }
        });

        then:
        // Comparison Logic
    }
}
org.apache.flink.table.api.ValidationException: Could not find any factories that implement 'org.apache.flink.table.delegation.ExecutorFactory' in the classpath.
at org.apache.flink.table.factories.FactoryUtil.discoverFactory(FactoryUtil.java:385)
at org.apache.flink.table.api.internal.TableEnvironmentImpl.create(TableEnvironmentImpl.java:295)
at org.apache.flink.table.api.internal.TableEnvironmentImpl.create(TableEnvironmentImpl.java:266)
at org.apache.flink.table.api.TableEnvironment.create(TableEnvironment.java:95)
at com.streaming.DataStreamTests.$spock_initializeFields(DataStreamTests.groovy:38
I'm working in a concurrent environment where an index being built by a Spark job may receive updates for the same document id both from the job itself and from other sources. It is assumed that updates from other sources are fresher, so the Spark job needs to silently ignore documents that already exist, while creating all other documents. This is very close to indexing with op_type: create, but the latter throws an exception that is not passed to my error handler. The following block of code:
.rdd
.repartition(getTasks(configurationManager))
.saveJsonToEs(
s"$indexName/_doc",
Map(
"es.mapping.id" -> MenuItemDocument.ID_FIELD,
"es.write.operation" -> "create",
"es.write.rest.error.handler.bulkErrorHandler" ->
"<some package>.IgnoreExistsBulkWriteErrorHandler",
"es.write.rest.error.handlers" -> "bulkErrorHandler"
)
)
where the error handler has survived several variations, but currently is:
class IgnoreExistsBulkWriteErrorHandler extends BulkWriteErrorHandler with LazyLogging {
  override def onError(entry: BulkWriteFailure, collector: DelayableErrorCollector[Array[Byte]]): HandlerResult = {
    logger.info("Encountered exception:", entry.getException)
    if (entry.getException.getMessage.contains("version_conflict_engine_exception")) {
      logger.info("Encountered document already present in index, skipping")
      HandlerResult.HANDLED
    } else {
      HandlerResult.ABORT
    }
  }
}
(I was, of course, checking for org.elasticsearch.index.engine.VersionConflictEngineException in getException().getCause() first, but it didn't work)
emits this in the log:
org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [186/1000]. Error sample (first [5] error messages):
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: version_conflict_engine_exception: [_doc][12]: version conflict, document already exists (current version [1])
(I assume that my error handler is not called at all)
and terminates my whole Spark job. What is the correct way to achieve my desired result?
I want to continuously process rows of a streaming Dataset (originally initiated by Kafka): based on a condition I want to update a Redis hash. This is my code snippet (lastContacts is the result of a previous command, which is a stream of this type: org.apache.spark.sql.DataFrame = [serialNumber: string, lastModified: long]. This expands to org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]):
class MyStreamProcessor extends ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = {
    true
  }

  override def process(record: Row) = {
    val stringHashRDD = sc.parallelize(Seq(("lastContact", record(1).toString)))
    sc.toRedisHASH(stringHashRDD, record(0).toString)(redisConfig)
  }

  override def close(errorOrNull: Throwable): Unit = {}
}

val query = lastContacts
  .writeStream
  .foreach(new MyStreamProcessor())
  .start()

query.awaitTermination()
I receive a huge stack trace, of which the relevant part (I think) is this: java.io.NotSerializableException: org.apache.spark.sql.streaming.DataStreamWriter
Could anyone explain why this exception occurs and how to avoid it? Thank you!
This question is related to the following two:
DataFrame to RDD[(String, String)] conversion
Call a function with each element a stream in Databricks
Spark Context is not serializable.
Any implementation of ForeachWriter must be serializable because each task will get a fresh serialized-deserialized copy of the provided object. Hence, it is strongly recommended that any initialization for writing data (e.g. opening a connection or starting a transaction) is done after the open(...) method has been called, which signifies that the task is ready to generate data.
In your code, you are trying to use the Spark context within the process method:
override def process(record: Row) = {
  val stringHashRDD = sc.parallelize(Seq(("lastContact", record(1).toString)))
  sc.toRedisHASH(stringHashRDD, record(0).toString)(redisConfig)  // <-- uses the SparkContext on the executor
}
To send data to Redis, you need to create your own connection, open it in the open method, and then use it in the process method.
Take a look at how to create a Redis connection pool: https://github.com/RedisLabs/spark-redis/blob/master/src/main/scala/com/redislabs/provider/redis/ConnectionPool.scala
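For illustration, here is a minimal sketch of such a writer, assuming the plain Jedis client; the RedisHashWriter class name, the host/port parameters, and the column positions are assumptions made for this example, not taken from your code:

import org.apache.spark.sql.{ForeachWriter, Row}
import redis.clients.jedis.Jedis

class RedisHashWriter(host: String, port: Int) extends ForeachWriter[Row] {
  // The connection is created in open(), i.e. on the executor after the writer
  // has been deserialized, so no non-serializable object is captured by the closure.
  @transient private var jedis: Jedis = _

  override def open(partitionId: Long, version: Long): Boolean = {
    jedis = new Jedis(host, port)
    true
  }

  override def process(record: Row): Unit = {
    // record(0) = serialNumber (used as the hash key), record(1) = lastModified
    jedis.hset(record(0).toString, "lastContact", record(1).toString)
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (jedis != null) jedis.close()
  }
}

This way the writer only carries the host and port (both serializable) to the executors, and the actual connection is created and closed per task.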
I created a dummy custom QueryExecutionListener (given below) according to the information here https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-ExecutionListenerManager.html and here https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/exercises/spark-exercise-custom-scheduler-listener.html.
The custom listener just has some print statements. The listener was added via the configuration property spark.sql.queryExecutionListeners. However, I do not see any of my logging statements in the console output of the spark-submit command, and there are no errors from spark-submit either.
I can see the properties that were set by using "spark.sqlContext.getAllConfs".
It looks like the onSuccess and onFailure methods are not getting called at all.
Has anyone ever successfully created a custom query execution listener and "registered" it using the conf properties?
// Code for the custom listener is given below:
class LineageListener extends QueryExecutionListener with Logging {

  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    log.info(s"The function ${funcName} succeeded")
    val sparkContext = qe.sparkSession.sparkContext
    log.info(s"App name: ${sparkContext.appName} and id is ${sparkContext.applicationId}")
  }

  override def onFailure(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    log.info(s"The function ${funcName} failed")
    val sparkContext = qe.sparkSession.sparkContext
    log.info(s"App name: ${sparkContext.appName} and id is ${sparkContext.applicationId}")
  }
}
Note: My Spark version is 2.2.1.
It depends on how you invoked spark-submit. If everything runs locally, you should see the log output, since the driver shares the same console.
If you submit via YARN (where the workers are other machines), you can see the logs via the Spark UI or other log-viewing tools.
One common mistake that I have found (it happened to me) is that if you close the session before onSuccess() or onFailure() finishes, the methods will not be called.
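For illustration, here is a minimal sketch of that timing concern, assuming the listener is registered programmatically via listenerManager rather than through the conf property; the object name, the triggering action, and the sleep duration are placeholders for the example:

import org.apache.spark.sql.SparkSession

object ListenerSmokeTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("listener-smoke-test").getOrCreate()

    // Register the same listener in code instead of via configuration
    spark.listenerManager.register(new LineageListener())

    // A Dataset action is what triggers onSuccess (or onFailure)
    spark.range(100).count()

    // As noted above, stopping the session right away can prevent the
    // callbacks from completing, so leave it up for a moment before stopping
    Thread.sleep(2000)
    spark.stop()
  }
}

If the log lines appear with this setup but not with the conf property, the issue is likely in how the property is being picked up rather than in the listener itself.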
Lately, as part of a scientific research project, I've been developing an application that streams (or at least should stream) data from Travis CI and GitHub, using their REST APIs. The purpose of this is to get insight into the commit-build relationship, in order to further perform various analyses.
For this, I've implemented the following Travis custom receiver:
object TravisUtils {
  def createStream(ctx: StreamingContext, storageLevel: StorageLevel): ReceiverInputDStream[Build] =
    new TravisInputDStream(ctx, storageLevel)
}

private[streaming]
class TravisInputDStream(ctx: StreamingContext, storageLevel: StorageLevel) extends ReceiverInputDStream[Build](ctx) {
  def getReceiver(): Receiver[Build] = new TravisReceiver(storageLevel)
}

private[streaming]
class TravisReceiver(storageLevel: StorageLevel) extends Receiver[Build](storageLevel) with Logging {

  def onStart(): Unit = {
    new BuildStream().addListener(new BuildListener {
      override def onBuildsReceived(numberOfBuilds: Int): Unit = {
      }

      override def onBuildRepositoryReceived(build: Build): Unit = {
        store(build)
      }

      override def onException(e: Exception): Unit = {
        reportError("Exception while streaming travis", e)
      }
    })
  }

  def onStop(): Unit = {
  }
}
The receiver uses my custom-made Travis API library (developed in Java using the Apache Async Client). However, the problem is the following: the data that I should be receiving is continuous and constantly changing, i.e. it is being pushed to Travis and GitHub all the time. As an example, GitHub records approximately 350 events per second, including push events, commit comments, and similar.
But when streaming either GitHub or Travis, I do get data in the first two batches, but afterwards the RDDs that are part of the DStream are empty, although there is data to be streamed!
I've checked a couple of things so far, including the HttpClient used for issuing requests to the API, but none of them actually solved the problem.
Therefore, my question is: what could be going on? Why isn't Spark streaming the data after period x passes? Below, you may find the context and configuration that are set:
val configuration = new SparkConf().setAppName("StreamingSoftwareAnalytics").setMaster("local[2]")
val ctx = new StreamingContext(configuration, Seconds(3))
val stream = GitHubUtils.createStream(ctx, StorageLevel.MEMORY_AND_DISK_SER)
// RDD IS EMPTY - that is what is happening!
stream.window(Seconds(9)).foreachRDD(rdd => {
  if (rdd.isEmpty()) {
    println("RDD IS EMPTY")
  } else {
    rdd.collect().foreach(event => println(event.getRepo.getName + " " + event.getId))
  }
})
ctx.start()
ctx.awaitTermination()
Thanks in advance!