Spark, Akka, Storm or RxJava [closed] - apache-spark

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
I have a use case where I receive 10K to 15K messages/sec (sometimes under 5K) and push them into RabbitMQ. I then need to parse those messages, run some regular expressions on them, do some aggregation, and compute statistics. Due to business constraints, my product (the data pipeline) will be deployed on a single machine. I have explored Spark, Akka, Storm, and RxJava; could you suggest which to use? I don't want to do this in plain Java, since I would then have to handle all the threading myself.

Based on my experience I would go for Akka. You can create different pools of actors to perform tasks concurrently on different messages. You can also leverage akka-camel to build a RabbitMQ consumer using the RabbitMQ Camel component.
You might be able to do the same with Storm, but I don't have enough experience with it to recommend it personally.
I wouldn't go for Apache Spark, since you would need to use Spark Streaming and learn how to configure its batching window correctly for your use case.
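Whichever framework you pick, the per-message work you describe (parse, run a regex, aggregate, compute statistics) is the same; a minimal plain-Java sketch of that step, as an actor or bolt would run it on each message pulled from RabbitMQ (the class, regex, and message format are hypothetical):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical per-message work: parse "key=value" pairs with a regex,
// then keep a running count and sum per key. This is the logic a single
// actor (or Storm bolt) instance would execute per message.
class MessageAggregator {
    private static final Pattern KV = Pattern.compile("(\\w+)=(\\d+)");
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> sums = new ConcurrentHashMap<>();

    void onMessage(String body) {
        Matcher m = KV.matcher(body);
        while (m.find()) {
            String key = m.group(1);
            long value = Long.parseLong(m.group(2));
            counts.computeIfAbsent(key, k -> new LongAdder()).increment();
            sums.computeIfAbsent(key, k -> new LongAdder()).add(value);
        }
    }

    long count(String key) {
        LongAdder a = counts.get(key);
        return a == null ? 0 : a.sum();
    }

    long sum(String key) {
        LongAdder a = sums.get(key);
        return a == null ? 0 : a.sum();
    }
}
```

In Akka you would simply have a router pool deliver each RabbitMQ message to one such worker; the framework handles the threading the question wants to avoid.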

Related

Should we store data in an intermediary database in stateless stream processing? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I'm working on a case study that consists of proposing a technical architecture for a real-time stream processing problem: a transportation company wants to track, in near real time, the speed of and number of passengers in its buses.
The initial architecture I proposed is as follows:
Buses send data to an MQTT broker in real time
Apache Kafka ingests data from this broker through an MQTT connector
"Speed" and "number of passengers" are calculated using the Kafka Streams API or Spark Streaming
"Speed" and "number of passengers" are visualized
My questions are the following:
Is the architecture correct?
Is the stream processing problem in this case stateless?
Finally, would I have to store the data in an intermediary database like Cassandra before doing the visualization?
If not, is there an open-source visualization tool that can work directly with streams in motion?
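On the statelessness question: computing a per-bus speed or passenger count requires keeping state per bus and per time window, so this processing is stateful. A minimal plain-Java sketch of the idea, independent of Kafka Streams or Spark (class and method names are hypothetical; a real framework would manage this state and the windowing for you):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stateful aggregation: average speed per bus over one
// tumbling window. The per-bus accumulator is exactly the state that
// makes this problem non-stateless.
class BusSpeedWindow {
    private final Map<String, double[]> perBus = new HashMap<>(); // busId -> {sum, count}

    void record(String busId, double speedKmh) {
        double[] acc = perBus.computeIfAbsent(busId, id -> new double[2]);
        acc[0] += speedKmh;
        acc[1] += 1;
    }

    double averageSpeed(String busId) {
        double[] acc = perBus.get(busId);
        return acc == null || acc[1] == 0 ? 0.0 : acc[0] / acc[1];
    }

    // At the end of each window, emit the results and reset the state.
    Map<String, Double> closeWindow() {
        Map<String, Double> out = new HashMap<>();
        perBus.forEach((id, acc) -> out.put(id, acc[0] / acc[1]));
        perBus.clear();
        return out;
    }
}
```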

Is it possible to use Hazelcast with Netflix OSS stack? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
We are currently building a platform using the Netflix OSS stack (microservices), and we want to use Hazelcast as a caching solution. Can anyone tell me how to integrate Hazelcast into the Netflix OSS stack, and whether it is recommended?
It depends on which parts of the stack you want to integrate Hazelcast with. We have a Eureka discovery plugin which makes it possible to discover other Hazelcast nodes. You can put Hystrix in front of Hazelcast calls, but remember that those are fault-tolerant and so might be re-run. I have never tried Governator or Zuul, but I believe a user successfully integrated the latter with Hazelcast.
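The re-run caveat matters when the wrapped call has side effects: a retried command should be idempotent. A plain-Java sketch of the difference (no Hystrix or Hazelcast dependency here; the retry loop just stands in for a command being re-executed, and all names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class RetryDemo {
    // Stand-in for a fault-tolerant command that may execute more than once.
    static <T> T runWithRetry(Supplier<T> command, int attempts) {
        T result = null;
        for (int i = 0; i < attempts; i++) {
            result = command.get(); // each retry re-executes the side effects
        }
        return result;
    }

    static final Map<String, Integer> cache = new HashMap<>();
}
```

A `put` of the same value is safe to re-run, whereas an increment applied on every retry over-counts; so prefer idempotent writes behind Hystrix.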

Google Dataflow vs Apache Spark Streaming (either on Google Cloud or with Google Dataproc) [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
I am new to cloud and big data but very interested in both, and I have significant experience in Java programming. I am currently working on a university project comparing the performance of Apache Spark Streaming with Google Cloud Dataflow. I have read a number of articles, including the comparison done here.
I understand that the programming models of Spark and Dataflow are different; however, given my limited knowledge in this area, I am trying to understand whether a performance comparison can still be done.
What type of use case would be appropriate for this, and what performance parameters should be considered for a streaming application?
While reading about Dataflow and Spark I also came across Dataproc, and I am wondering whether it is better to compare Dataflow vs. Spark on Dataproc, or Dataflow vs. Spark on Google Cloud.
Any advice would be appreciated, as I am not getting a clear direction on this.
The best way to compare performance is with real end-to-end data processing pipelines, so you first need to answer your own question, "what type of use case would be correct for this?", as there is a nearly unlimited variety.
You might find some inspiration in the included examples.
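As for performance parameters: for streaming, the usual ones are throughput (records/sec) and per-record latency, the latter typically reported as a percentile rather than an average. A minimal plain-Java sketch of how a harness measures both, with the pipeline step under test stubbed out (all names are hypothetical):

```java
import java.util.Arrays;

class StreamBenchmark {
    // Stub for the processing step under comparison (Spark, Dataflow, ...).
    static void process(long record) {
        Math.sqrt(record); // pretend to do some work
    }

    // Returns {throughput in records/sec, p99 latency in nanoseconds}.
    static double[] run(int records) {
        long[] latencies = new long[records];
        long start = System.nanoTime();
        for (int i = 0; i < records; i++) {
            long t0 = System.nanoTime();
            process(i);
            latencies[i] = System.nanoTime() - t0;
        }
        long elapsed = System.nanoTime() - start;
        Arrays.sort(latencies);
        double throughput = records / (elapsed / 1e9);
        double p99 = latencies[(int) (records * 0.99)];
        return new double[] { throughput, p99 };
    }
}
```

In a real comparison you would measure at the pipeline boundaries (ingest timestamp vs. output timestamp) rather than around an in-process call, but the reported metrics are the same.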

What is the difference between JDBCRealm and DataSourceRealm? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I read this comment: "don't use JDBCRealm at all: it does not scale at all since there is a single JDBC Connection object used for all database communication. You are better off using DataSourceRealm"
What does this mean in greater detail?
In case you don't know what realms are and why they exist: for Java web applications, authentication and authorization can be handled either by the application itself or by the container (Tomcat, etc.). If you choose the container, you need to specify a user store (a place where usernames, hopefully encrypted passwords, roles, etc. are stored). In Tomcat this could be your tomcat-users.xml file, a database (MySQL, etc.), or a directory (Active Directory, etc.). For a database, Tomcat can connect either directly via JDBC (JDBCRealm) or through a container-managed, JNDI-named DataSource (DataSourceRealm); directories are handled by the separate JNDIRealm.
Coming to your question: JDBCRealm opens a single JDBC connection and synchronizes all authentication requests on it, so under high load authentications queue up behind one another and may time out if that one connection becomes unavailable. DataSourceRealm borrows connections from a pooled DataSource looked up via JNDI, so multiple authentications can proceed concurrently, which scales much better.
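Concretely, the two realms are declared almost identically in Tomcat's server.xml; the difference is that DataSourceRealm points at a pooled, JNDI-named DataSource instead of opening its own single connection. A sketch with placeholder connection details and table/column names:

```xml
<!-- JDBCRealm: opens ONE connection itself and synchronizes all auth on it -->
<Realm className="org.apache.catalina.realm.JDBCRealm"
       driverName="com.mysql.jdbc.Driver"
       connectionURL="jdbc:mysql://localhost/authdb?user=dbuser&amp;password=dbpass"
       userTable="users" userNameCol="user_name" userCredCol="user_pass"
       userRoleTable="user_roles" roleNameCol="role_name"/>

<!-- DataSourceRealm: borrows pooled connections from a JNDI DataSource -->
<Realm className="org.apache.catalina.realm.DataSourceRealm"
       dataSourceName="jdbc/authdb"
       userTable="users" userNameCol="user_name" userCredCol="user_pass"
       userRoleTable="user_roles" roleNameCol="role_name"/>
```

The DataSourceRealm variant additionally requires a matching `<Resource name="jdbc/authdb" ...>` DataSource definition, where the pool size is configured.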

Dedicated servers for socket.io? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
One of the main features in my website is a simple One-to-One chat.
I'm debating whether I should dedicate a server (or a cluster) for the sole purpose of this chat feature. The simpler option would be to make this feature part of the web servers and just scale out when necessary.
It is worth mentioning that in the future I'd like to enable image transfer within the chat.
So which is the better option, and why?
Whether to use another dedicated server depends on how much traffic your site will have to handle. If you're dealing with images, it is a good idea to store them on a separate server and keep the main server lean.
