Twitter data harvesting - apache-spark

For my project, I need to harvest data from Twitter.
I am currently facing two design choices:
What is the best software architecture? I read that Spark has Twitter support, but I am not familiar with Scala. On the other hand, Apache Spark seems a good option, but then I'm not sure how to save the data to a common sink.
I have some budget constraints. I certainly need one server for the sink and the processing. For the data harvesting, however, I don't know whether several VMs/containers offer a better performance/cost ratio than a bunch of Raspberry Pis running Kafka producers.

Take a look at the Confluent Platform, and especially Kafka Connect.
There is a Twitter connector out of the box. All the Twitter data will be streamed to Kafka.

Agree with @leshkin that Kafka Connect is the most natural fit. However, the Twitter connector (available on GitHub here) does not require Confluent Platform, only Kafka Connect, which is a standard part of the Apache Kafka distribution.
If you choose, you can run Kafka Connect workers in distributed mode to divide the load across several VMs/containers/boxes. These don't have to be the same boxes that run your Kafka brokers (they only need some relevant libs from Kafka, the libs for the connector, and Java of course).
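As a rough illustration of the distributed-mode setup described above, the following Python sketch renders a worker properties file. The property names are standard Kafka Connect worker settings; the host names, the group name, the internal topic names, and the plugin path are placeholder assumptions, not values from this thread.

```python
def worker_properties(bootstrap_servers, group_id, plugin_path):
    """Render a properties file for one Kafka Connect distributed worker."""
    settings = {
        "bootstrap.servers": bootstrap_servers,
        # Workers that share the same group.id form one Connect cluster.
        "group.id": group_id,
        "key.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        # Internal topics where Connect keeps its own state (names are assumptions).
        "config.storage.topic": "connect-configs",
        "offset.storage.topic": "connect-offsets",
        "status.storage.topic": "connect-status",
        # Directory holding the connector JARs (hypothetical path).
        "plugin.path": plugin_path,
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


props = worker_properties(
    "broker1:9092,broker2:9092",   # hypothetical broker addresses
    "twitter-connect-cluster",     # hypothetical cluster name
    "/opt/connect-plugins",
)
print(props)
```

Each VM or container would start a worker with the same file (for example with bin/connect-distributed.sh from the Apache Kafka distribution), and the workers then share the connector tasks among themselves.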


How to build Aggregations on Apache Solr with Spark

I have a requirement to build aggregations on the data that we receive in our Apache Kafka...
I am a little bit lost as to which technological path to follow...
It seems people see the standard way as a constellation of Apache Kafka <-> Apache Spark <-> Solr
Bitnami Data Platform
I can't find concrete examples of how this actually works, and I am also asking myself whether a solution of the form
Apache Kafka <-> Kafka Connect Solr <-> Solr
would not also do the trick, because Solr supports aggregations as well...
Solr Aggregation
but I saw some code snippets that aggregate the data in Spark and write it to Solr under a special index...
Also, aggregation with Kafka <-> Kafka Connect Solr <-> Solr will probably only work for a single Kafka topic, so if I have to combine and aggregate the data from two or more different topics, then Kafka, Spark, Solr is the way to go... (or is this viable at all?)
So as you can see, I am a little bit confused, so I would like to ask here how you are approaching this problem in your real-life solutions...
Thanks for your answers...
Spark can of course join multiple topics. So can Flink, or Kafka Streams/ksqlDB. Spark and Flink just happen to also be able to write their data to external systems, such as Solr, rather than exclusively back into a new Kafka topic. The "downside" is that you need to maintain a scheduler exclusively for those, as compared to running a cluster of standalone Kafka Connect or Kafka Streams JAR applications. If you're using Kubernetes, then that could be used for all of the above (maybe not Flink... haven't tried).
Kafka Connect can consume multiple topics and, depending on the connector configuration, might write to one or many Solr collections.
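To make the multi-topic point concrete, here is a minimal Python sketch that builds the JSON body one would POST to the Kafka Connect REST API (/connectors) for a sink connector. The keys connector.class, tasks.max, topics and topics.regex are standard sink settings; the connector class name, connector name, and topic names are hypothetical placeholders, and any Solr-specific settings are left out because they depend on the connector you pick.

```python
import json


def sink_connector_payload(name, connector_class, topics=None,
                           topics_regex=None, tasks_max=1):
    """Build a REST payload for a sink connector reading one or many topics."""
    # A sink connector is given either an explicit topic list or a regex, not both.
    if (topics is None) == (topics_regex is None):
        raise ValueError("set exactly one of topics or topics_regex")
    config = {
        "connector.class": connector_class,
        "tasks.max": str(tasks_max),
    }
    if topics is not None:
        config["topics"] = ",".join(topics)   # comma-separated topic list
    else:
        config["topics.regex"] = topics_regex
    return {"name": name, "config": config}


payload = sink_connector_payload(
    "solr-sink",                        # hypothetical connector name
    "com.example.SolrSinkConnector",    # hypothetical class, not a real artifact
    topics=["orders", "customers"],     # hypothetical topic names
    tasks_max=2,
)
print(json.dumps(payload, indent=2))
```

Note that this only routes several topics into Solr; it does not join them. Combining records across topics still needs Spark, Flink, Kafka Streams or ksqlDB, as described above.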

When is a Kafka connector preferred over a Spark streaming solution?

With Spark streaming, I can read Kafka messages and write data to different kinds of tables, for example HBase, Hive and Kudu. But this can also be done by using Kafka connectors for these tables. My question is, in which situations I should prefer connectors over the Spark streaming solution.
Also how tolerant is the Kafka connector solution? We know that with Spark streaming, we can use checkpoints and executors running on multiple nodes for fault tolerant execution, but how is fault tolerance (if possible) achieved with Kafka connectors? By running the connector on multiple nodes?
So, generally, there should be no big difference in functionality when it comes to simply reading records from Kafka and sending them into other services.
Kafka Connect is probably easier when it comes to standard tasks, since it offers various connectors out of the box, so it will quite probably reduce the need to write any code. So, if you just want to copy a bunch of records from Kafka to HDFS or Hive, it will probably be easier and faster to do with Kafka Connect.
With this in mind, Spark Streaming clearly wins when you need to do things that are not standard, e.g. if you want to perform some aggregations or calculations over records and write them to Hive, then you should probably go for Spark Streaming from the beginning.
Generally, I found doing some non-standard things with Kafka Connect, for example splitting one message into multiple ones (assuming it was, say, a JSON array), to be quite troublesome and often to require much more work than it would in Spark.
As for Kafka Connect fault tolerance, as described in the docs, this is achieved by running multiple distributed workers with the same group.id; the workers redistribute tasks and connectors if one of them fails.
in which situations I should prefer connectors over the Spark streaming solution.
"It Depends" :-)
Kafka Connect is part of Apache Kafka, and so has tighter integration with Apache Kafka in terms of security, delivery semantics, etc.
If you don't want to write any code, Kafka Connect is easier because it's just JSON to configure and run
If you're not using Spark already, Kafka Connect is arguably more straightforward to deploy (run the JVM, pass in the configuration)
As a framework, Kafka Connect is more transferable since the concepts are the same; you just plug in the appropriate connector for the technology that you want to integrate with each time
Kafka Connect handles all the tricky stuff for you, like schemas, offsets, restarts, scale-out, etc.
Kafka Connect supports Single Message Transforms (SMTs) for making changes to data as it passes through the pipeline (masking fields, dropping fields, changing data types, etc.). For more advanced processing you would use something like Kafka Streams or ksqlDB.
If you are using Spark, and it's working just fine, then it's not necessarily prudent to rip it up to use Kafka Connect instead :)
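As an illustration of the "just JSON to configure" and Single Message Transform points in the list above, this Python sketch adds a transform chain to a connector configuration. MaskField and Cast are transforms shipped with Apache Kafka; the connector class, topic, and field names are hypothetical, so treat this as a sketch rather than a working pipeline.

```python
import json


def with_transforms(config, transforms):
    """Return a copy of a connector config with an SMT chain added.

    `transforms` is a list of (alias, transform_class, params) tuples,
    applied by Kafka Connect in the order listed.
    """
    result = dict(config)
    result["transforms"] = ",".join(alias for alias, _, _ in transforms)
    for alias, transform_class, params in transforms:
        result[f"transforms.{alias}.type"] = transform_class
        for key, value in params.items():
            result[f"transforms.{alias}.{key}"] = value
    return result


base = {
    "connector.class": "com.example.SomeSinkConnector",  # hypothetical class
    "topics": "orders",                                   # hypothetical topic
}
config = with_transforms(base, [
    # Mask a sensitive field (the field name "email" is an assumption).
    ("maskEmail", "org.apache.kafka.connect.transforms.MaskField$Value",
     {"fields": "email"}),
    # Change a data type (the field name "amount" is an assumption).
    ("castAmount", "org.apache.kafka.connect.transforms.Cast$Value",
     {"spec": "amount:float64"}),
])
print(json.dumps(config, indent=2))
```

Anything beyond such per-record changes, for example joins or aggregations, is outside what transforms are meant for.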
Also how tolerant is the Kafka connector solution? … how is fault tolerance (if possible) achieved with Kafka connectors?
Kafka Connect can be run in distributed mode, in which you have one or more worker processes across nodes. If a worker fails, Kafka Connect rebalances the tasks across the remaining ones. If you add a worker, Kafka Connect will rebalance to ensure the workload is distributed. This was drastically improved in Apache Kafka 2.3 (KIP-415).
Kafka Connect uses the Kafka consumer API and tracks the offsets of records delivered to a target system in Kafka itself. If the task or worker fails, you can be sure that it will restart from the correct point. Many connectors support exactly-once delivery too (e.g. HDFS, Elasticsearch, etc.).
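The rebalancing idea can be sketched with a toy Python model. This is not Kafka Connect's actual assignment algorithm; it simply spreads tasks round-robin and recomputes the assignment when a worker disappears, which is closer in spirit to a full reassignment than to the incremental approach of KIP-415. Worker and task names are made up.

```python
def assign(tasks, workers):
    """Spread tasks over workers round-robin (a toy stand-in for a rebalance)."""
    assignment = {worker: [] for worker in workers}
    for index, task in enumerate(tasks):
        assignment[workers[index % len(workers)]].append(task)
    return assignment


tasks = [f"jdbc-sink-{i}" for i in range(6)]      # hypothetical task ids
workers = ["worker-a", "worker-b", "worker-c"]    # hypothetical worker ids

before = assign(tasks, workers)

# worker-b fails: the surviving workers take over all six tasks.
survivors = [worker for worker in workers if worker != "worker-b"]
after = assign(tasks, survivors)

for worker, owned in after.items():
    print(worker, owned)
```

Because delivered offsets are stored in Kafka rather than on the failed worker, a task that moves to another worker can pick up from the last committed position.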
If you want to learn more about Kafka Connect see the docs here and my talk here. See a list of connectors here, and tutorial videos here.
Disclaimer: I work for Confluent and am a big fan of Kafka Connect :-)

Is the natural replacement for Spark (Direct) Streaming either Spark Structured Streaming or Kafka Streams?

Over the past few years we have developed quite a few Spark Streaming (Direct API) applications that read from or write to Kafka, IBM MQ, Hive, HBase, HDFS, and others on our Cloudera platform. Now that the Direct API of Spark Streaming (we currently have version 2.3.2) is deprecated and we have recently added the Confluent Platform (which comes with Kafka 2.2.0) to our project, we plan to migrate these applications.
What is the natural replacement of our Spark Streaming applications? Should we migrate to Spark Structured Streaming or rather to Kafka Streams?
I personally do not have any experience with either framework, but in my view Spark Structured Streaming seems to be the natural choice. Our code base is mainly written in Scala, which could also be used for the Structured API. Kafka Streams has a few limitations with Scala. Although we might lose some flexibility by leaving the low-level API of RDDs and moving to the higher level of DataFrames, we could build on our knowledge of Spark.
On the other side there is Kafka Streams, which is probably the best choice when it comes to processing data between Kafka topics, which is our main use case. And looking at all the Kafka connectors that come with Confluent, the other use cases can be served as well.
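You currently have some Spark scheduler, therefore you can use Structured Streaming, which runs on the same Spark installation as the old Streaming API, although the jobs themselves have to be ported from DStreams to the DataFrame/Dataset API.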
If you're using Mesos or k8s, then putting Kafka Streams apps in Docker and running those is, IMO, easier to scale, monitor and configure than Spark, since each app behaves like any other Docker container in those systems, so you can build one pattern around everything.
Kafka Streams... is probably the best choice when it comes to processing data between Kafka topics
Kafka Streams has a few limitations with Scala.
I think you might want to keep reading that section
The Kafka Streams DSL for Scala library is a wrapper over the existing Java APIs for Kafka Streams DSL that addresses the concerns raised
Of course, you could always use Kotlin to interoperate better with the Java API.

Which framework should be used to aggregate and join the data of Kafka topics and store it in MySQL

I have data in two Kafka topics, loaded from MySQL using debezium-connector-mysql-plugin.
Now I want to aggregate this data at a daily level and store it in another MySQL table.
Please suggest.
You've not really laid out your requirements, other than commenting that you don't want to use Confluent Platform (but not said why).
In general, with data in Kafka (regardless of where it comes from) you have different options for processing it:
Bespoke consumer (probably a bad idea, given the availability of stream processing frameworks)
KSQL (use SQL to do your joins etc) - part of Confluent Platform
Kafka Streams - a Java library for doing stream processing. Part of Apache Kafka.
Flink, Spark Streaming, Samza, Heron, etc etc etc
It's up to you which you use, and it's going to come down to factors such as
Existing technology in use (no point deploying a Spark cluster if you don't need to; conversely, if you already use Spark and have lots of developers trained on it then it could make sense to use it)
Language familiarity of developers - does it have to be a Java API, or is SQL more accessible
Capabilities of the framework/tool - do you need tight security integration, exactly-once processing, CEP, etc etc. Some of these will rule in or out the tool that you use.
Once you've joined and aggregated your data, a good pattern to follow is to write it back to Kafka (thus decoupling your design and enabling separation of responsibilities between the components) and from there write it to MySQL using Kafka Connect and the JDBC Sink. Kafka Connect is part of Apache Kafka.
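Whichever framework does the work, the logic being asked for is a join followed by a daily roll-up. The following plain-Python sketch shows that logic on in-memory lists; the record layout (customer_id, amount, ts_ms, id, name) is an assumption for illustration and not the Debezium event format, and a real job would read from the two topics and emit the result rows to an output topic for the JDBC Sink.

```python
from collections import defaultdict
from datetime import datetime, timezone


def ms(year, month, day, hour):
    """Epoch milliseconds for a UTC timestamp (helper for the sample data)."""
    return int(datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp() * 1000)


def daily_totals(orders, customers):
    """Join orders to customers by id and sum the amounts per (day, customer)."""
    names = {customer["id"]: customer["name"] for customer in customers}
    totals = defaultdict(float)
    for order in orders:
        moment = datetime.fromtimestamp(order["ts_ms"] / 1000, tz=timezone.utc)
        day = moment.date().isoformat()
        totals[(day, names[order["customer_id"]])] += order["amount"]
    return dict(totals)


customers = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
orders = [
    {"customer_id": 1, "amount": 10.0, "ts_ms": ms(2024, 1, 1, 9)},
    {"customer_id": 1, "amount": 5.5, "ts_ms": ms(2024, 1, 1, 18)},
    {"customer_id": 2, "amount": 7.0, "ts_ms": ms(2024, 1, 1, 12)},
    {"customer_id": 1, "amount": 3.0, "ts_ms": ms(2024, 1, 2, 1)},
]

totals = daily_totals(orders, customers)
for (day, name), amount in sorted(totals.items()):
    print(day, name, amount)
```

Keying each output row by (day, customer) is what would let a sink overwrite the same MySQL row as the daily total changes.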
One final consideration: if you're taking data from MySQL to process it and then write it back into MySQL… do you even need Kafka? Is there an appropriate reason to be using it and not just doing this processing in MySQL itself?
Disclaimer: I work for Confluent.

KStreams + Spark Streaming + Machine Learning

I'm doing a POC for running a machine learning algorithm on a stream of data.
My initial idea was to take data, use
Spark Streaming --> Aggregate Data from several tables --> run MLLib on Stream of Data --> Produce Output.
But I came across KStreams. Now I'm confused!
Questions:
1. What is the difference between Spark Streaming and Kafka Streaming?
2. How can I marry KStreams + Spark Streaming + Machine Learning?
3. My idea is to train the model on the data continuously rather than have batch training.
First of all, the term "Confluent's Kafka Streaming" is technically not correct.
it's called Kafka's Streams API (aka Kafka Streams)
it's part of Apache Kafka and thus "owned" by the Apache Software Foundation (and not by Confluent)
there is Confluent Open Source and Confluent Enterprise -- two offers from Confluent that both leverage Apache Kafka (and thus, Kafka Streams)
However, Confluent contributes a lot of code to Apache Kafka, including Kafka Streams.
About the differences (I only highlight some of the main ones and refer to the Internet and the documentation for further details):
Spark Streaming:
micro-batching (no real record-by-record stream processing)
no sub-second latency
limited window operations
no event-time processing
processing framework (difficult to operate and to deploy)
part of Apache Spark -- a data processing framework
exactly-once processing
Kafka Streams:
record-by-record stream processing
ms latency
rich window operations
stream/table duality
event time, ingestion time, and processing time semantics
Java library (easy to run and deploy -- it's just a Java application as any other)
part of Apache Kafka -- a Stream Processing Platform (ie, it offers storage and processing at once)
at-least-once processing (exactly-once processing is WIP; cf KIP-98 and KIP-129)
elastic, ie, dynamically scalable
Thus there is no reason to "marry" both; it's a question of choice which one you want to use.
My personal take is that Spark is not a good solution for stream processing. Whether you want to use a library like Kafka Streams or a framework like Apache Flink, Apache Storm, or Apache Apex (which are all good options for stream processing) depends on your use case (and maybe personal taste) and cannot be answered on SO.
A main differentiator of Kafka Streams is that it is a library and does not require a processing cluster. Because it is part of Apache Kafka, if you already have Apache Kafka in place, this might simplify your overall deployment, as you do not need to run an extra processing cluster.
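To give a feel for the "record-by-record" and "event time" items in the comparison above, here is a toy Python sketch of a tumbling-window count keyed by each record's own timestamp. It is not the Kafka Streams API; the event layout (event_time_ms, key) and the window size are assumptions for illustration.

```python
from collections import defaultdict


def tumbling_counts(events, window_ms):
    """Count events per (window start, key), using each event's own timestamp."""
    counts = defaultdict(int)
    for event_time_ms, key in events:          # handled one record at a time
        window_start = event_time_ms - (event_time_ms % window_ms)
        counts[(window_start, key)] += 1       # state is updated per record
    return dict(counts)


# The last event arrives late: it is processed after an event from the next
# window, but its timestamp still places it in the first window.
events = [(1_000, "a"), (1_500, "a"), (61_000, "a"), (59_000, "a")]
counts = tumbling_counts(events, window_ms=60_000)
print(counts)
```

Grouping by arrival order instead, as a naive micro-batch would, would count the late record together with whatever batch it happened to arrive in.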
I have recently presented at a conference about this topic.
Apache Kafka Streams or Spark Streaming are typically used to apply a machine learning model in real time to new events via stream processing (processing data while it is in motion). Matthias's answer already discusses their differences.
On the other side, you first use things like Apache Spark MLlib (or XYZ) to build the analytic models using historical data sets.
Kafka Streams can be used for online training of models, too. Though, I think online training has various caveats.
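The difference between applying a model to each event and training it online can be sketched in a few lines of Python. This is a toy one-feature linear model updated by stochastic gradient descent, chosen only for illustration; it is not MLlib or Kafka Streams code, and the learning rate and synthetic stream are assumptions.

```python
class OnlineLinearModel:
    """y = weight * x + bias, updated one event at a time."""

    def __init__(self, learning_rate=0.05):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        # Scoring: what a stream processor does for every incoming event.
        return self.weight * x + self.bias

    def update(self, x, y):
        # Online training: one gradient step on the squared error of this event.
        error = self.predict(x) - y
        self.weight -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


model = OnlineLinearModel()

# Synthetic stream following y = 2x + 1, replayed event by event.
for step in range(2000):
    x = float(step % 4)
    model.update(x, 2.0 * x + 1.0)

print(round(model.weight, 3), round(model.bias, 3))
```

With clean data like this the parameters settle near the true values; on real streams, noise, drift, and the lack of a held-out evaluation are among the caveats that make online training harder than scoring with a model trained offline.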
All of this is discussed in more detail in my slide deck "Apache Kafka Streams and Machine Learning / Deep Learning for Real Time Stream Processing".
Apache Kafka Streams is a library that provides an embeddable stream processing engine. It is easy to use in Java applications for stream processing, and it is not a framework.
I found some use cases about when to use Kafka Streams and also a good comparison with Apache Flink from the Kafka author.
Spark Streaming and KStreams in one picture, from a stream processing point of view.
I have highlighted the significant advantages of Spark Streaming and KStreams here to keep the answer short.
Spark Streaming Advantages over KStreams:
Easy to integrate Spark ML models and graph computing in the same application without writing data outside of the application, which means you will process the data much more quickly than by writing it to Kafka again and then processing it.
Join non-streaming sources, like file systems and other non-Kafka sources, with other stream sources in the same application.
Messages with a schema can easily be processed with everyone's favorite, SQL (Structured Streaming).
Possible to do graph analysis over streaming data with the built-in GraphX library.
Spark apps can be deployed on an existing YARN or Mesos cluster, if you have one.
KStreams Advantages:
Compact library for ETL processing and ML model serving/training on messages, with rich features. So far, both source and target have to be Kafka topics.
Easy to achieve exactly-once semantics.
No separate processing cluster required.
Easy to deploy on Docker, since it's a plain Java application to run.
