What are the advantages/disadvantages of using Akka Streams vs. Spark Streaming for stream processing? For example: built-in back pressure, performance, fault tolerance, built-in transformations, flexibility, etc. I'm NOT asking about Akka vs. Spark pros/cons in general, only the streaming components. I'm also NOT asking about the under-the-hood architectural differences between the frameworks.
Akka Streams and Spark Streaming come from two different worlds. Do not let the word "streams" confuse you.
Akka Streams implements the Reactive Streams specification (in the spirit of the Reactive Manifesto), which makes it great for achieving really low latency, and it provides a lot of operators for writing transformations over streams declaratively and easily. More about this at https://doc.akka.io/docs/akka/2.5.4/scala/stream/stream-introduction.html#motivation.
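For a flavor of that declarative style, here is a minimal sketch (assuming the Akka 2.5.x API, which still requires an ActorMaterializer; names are illustrative):

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Source

object AkkaStreamsSketch extends App {
  implicit val system: ActorSystem = ActorSystem("demo")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  // Backpressure is built in: a slow downstream stage throttles the fast source automatically.
  Source(1 to 1000)
    .map(_ * 2)
    .filter(_ % 3 == 0)
    .runForeach(println)
    .onComplete(_ => system.terminate())(system.dispatcher)
}
```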
Spark Streaming (Structured Streaming as of 2.2) is still a micro-batch approach for processing huge amounts of data (Big Data). Events are collected and then processed periodically, in small batches, every few seconds.
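For contrast, a minimal Structured Streaming sketch of the micro-batch model; the local socket source and the 5-second trigger are purely illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object StructuredStreamingSketch extends App {
  val spark = SparkSession.builder.appName("micro-batch").master("local[*]").getOrCreate()
  import spark.implicits._

  // Illustrative source: lines of text arriving on a local socket.
  val lines = spark.readStream.format("socket")
    .option("host", "localhost").option("port", 9999).load()

  val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

  // Each trigger collects the events that arrived and processes them as one small batch.
  counts.writeStream.outputMode("complete").format("console")
    .trigger(Trigger.ProcessingTime("5 seconds"))
    .start()
    .awaitTermination()
}
```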
Akka Streams is basically not distributed and does not scale out across clusters, unlike Spark. Akka Streams uses Akka's actor model to achieve concurrency.
Akka Streams is a toolkit; Spark is a framework.
PS: I had the same question a couple of months back. It took a while to get my answers. Hope it's helpful.
I need to start a new project, and I do not know whether Spark or Flink would be better. Currently the project needs micro-batching, but later it could require event-at-a-time stream handling as well.
Supposing Spark would be best, is there any disadvantage to using Beam instead and selecting Spark/Flink as the runner/engine?
Will Beam add any overhead or lack certain API/functions available in Spark/Flink?
To answer a part of your question:
First of all, Beam defines an API for programming data processing. To adopt it, you first have to understand its programming model and make sure that model fits your needs.
Assuming you have a fair understanding of how Beam could help you, and you are planning to select Spark as the execution runner, you can check the runner capability matrix [1] for Beam API support on Spark.
Regarding the overhead of running Beam over Spark: you might need to ask on user@beam.apache.org or dev@beam.apache.org. The runner developers could have better answers on that.
[1] https://beam.apache.org/documentation/runners/capability-matrix/
I have worked on Storm and Spark but Samza is quite new.
I do not understand why Samza was introduced when Storm is already there for real-time processing. Spark provides in-memory near-real-time processing and has other very useful components such as GraphX and MLlib.
What improvements does Samza bring, and what further improvements are possible?
This is a good summary of the differences and pros and cons.
I would just add that Samza, which actually isn't that new, brings a certain simplicity since it is opinionated about using Kafka as its backend, while others try to be more generic at the cost of simplicity. Samza was pioneered by the same people who created Kafka, who are also the people behind the Kappa Architecture (primarily Jay Kreps, formerly of LinkedIn). That's pretty cool.
Also, the programming models are totally different: real-time streams with Samza, micro-batches in Spark Streaming (which isn't exactly the same as Spark), and spouts and bolts with tuples in Storm.
None of these are "better." It all depends on your use cases, the strengths of your team, how the APIs match up with your mental models, quality of support, etc.
You also forgot Apache Flink and Twitter's Heron, which they made because Storm started to fail them. Then again, very few need to operate at the scale of Twitter.
I don't know how to implement multithreading in the Scala language. Can anyone suggest how to implement it and provide some samples for multithreading? Thank you.
You have several options.
Scala Akka actor system
Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM.
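A minimal sketch using the classic (untyped) actor API; the Greeter actor and its messages are just made-up examples:

```scala
import akka.actor.{Actor, ActorSystem, Props}

class Greeter extends Actor {
  def receive: Receive = {
    case name: String => println(s"Hello, $name")
  }
}

object ActorSketch extends App {
  val system = ActorSystem("example")
  val greeter = system.actorOf(Props[Greeter], "greeter")

  // Messages are queued in the actor's mailbox and processed one at a time,
  // so there is no shared mutable state to synchronize.
  greeter ! "Scala"
  greeter ! "Akka"

  Thread.sleep(500)   // crude wait for the demo only
  system.terminate()
}
```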
Futures and Promises
Futures provide a way to reason about performing many operations in parallel, in an efficient and non-blocking way. A Future is a placeholder object for a value that may not yet exist. Generally, the value of the Future is supplied concurrently and can subsequently be used. Composing concurrent tasks in this way tends to result in faster, asynchronous, non-blocking parallel code.
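A minimal sketch using the global execution context; the blocking Await at the end is only for the demo:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureSketch extends App {
  // Two independent computations run concurrently on the global execution context.
  val a = Future { 21 + 21 }
  val b = Future { "answer".length }

  // Combine them without blocking.
  val combined = for (x <- a; y <- b) yield x + y

  // In real code prefer map/onComplete; Await is used here only to print the result.
  println(Await.result(combined, 2.seconds))
}
```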
Java Concurrency Model
Scala concurrency is built on top of the Java concurrency model. On Sun JVMs, with an IO-heavy workload, we can run tens of thousands of threads on a single machine. A Thread takes a Runnable. You have to call start on a Thread in order for it to run the Runnable.
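A bare-bones sketch of the Thread/Runnable approach:

```scala
object ThreadSketch extends App {
  // A Runnable holds the work; the Thread runs it once start() is called.
  val t = new Thread(new Runnable {
    def run(): Unit = println(s"running on ${Thread.currentThread.getName}")
  })

  t.start()   // run() executes on the new thread
  t.join()    // wait for it to finish before the program exits
}
```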
How do data processing engines like Spark and Apache Flink integrate structured, semi-structured, and unstructured data together, and how does that affect computation?
General-purpose data processing engines like Flink or Spark let you define your own data types and functions.
If you have unstructured or semi-structured data, your data types can reflect these properties, e.g., by making some information optional or by modeling it with flexible data structures (nested types, lists, maps, etc.). Your user-defined functions should be aware that some information might not always be present and know how to handle such cases.
So handling semi-structured or unstructured data does not come for free; it must be specified explicitly. In fact, both systems put a focus on user-defined data and functions, but have recently added APIs to ease the processing of structured data (Flink: Table API, Spark: DataFrames).
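As an illustration, here is a sketch of how optional information can be modeled in Spark with a case class and handled explicitly in a user-defined function; the Person type and field names are made-up assumptions:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type: "age" may be missing in the semi-structured input.
case class Person(name: String, age: Option[Int])

object SemiStructuredSketch extends App {
  val spark = SparkSession.builder.appName("semi-structured").master("local[*]").getOrCreate()
  import spark.implicits._

  val people = Seq(Person("Ada", Some(36)), Person("Bob", None)).toDS()

  // The user-defined function must handle the missing value explicitly.
  val labels = people.map(p =>
    p.age.map(a => s"${p.name}: $a").getOrElse(s"${p.name}: age unknown"))

  labels.show()
}
```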
Assume we have a 100 GB file and my system has 60 GB of memory. How will Apache Spark handle this data?
We all know Spark performs partitioning on its own based on the cluster. But when there is less memory than data, I want to know how Spark handles it.
In short: Spark does not require the full dataset to fit in memory at once. However, some operations may demand that an entire partition of the dataset fit in memory. Note that Spark allows you to control the number of partitions (and, consequently, their size).
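As a rough sketch (the file path and partition count below are illustrative assumptions), this is how you might inspect and increase the number of partitions so that each one stays small enough to process:

```scala
import org.apache.spark.sql.SparkSession

object PartitionSketch extends App {
  val spark = SparkSession.builder.appName("partitions").master("local[*]").getOrCreate()

  // Hypothetical path to a file larger than the available memory.
  val lines = spark.read.textFile("/data/big-100gb-file.txt")
  println(s"initial partitions: ${lines.rdd.getNumPartitions}")

  // More, smaller partitions reduce how much data must be held in memory at once.
  val repartitioned = lines.repartition(1000)

  // The count streams through partitions; the whole file never has to be in memory together.
  println(repartitioned.filter(_.contains("ERROR")).count())
}
```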
See this topic for the details.
It is also worth noting that Java objects usually take more space than the raw data, so you may want to look at this.
I would also recommend looking at Apache Spark: Memory Management and Graceful Degradation.