I have worked with Storm and Spark, but Samza is quite new to me.
I do not understand why Samza was introduced when Storm already exists for real-time processing. Spark provides in-memory, near-real-time processing and has other very useful components such as GraphX and MLlib.
What improvements does Samza bring, and what further improvements are possible?
This is a good summary of the differences and pros and cons.
I would just add that Samza, which actually isn't that new, brings a certain simplicity, since it is opinionated about using Kafka as its backend, while others try to be more generic at the cost of simplicity. Samza was pioneered by the same people who created Kafka, who are also the people behind the Kappa Architecture, primarily Jay Kreps, formerly of LinkedIn. That's pretty cool.
Also, the programming models are totally different: real-time streams in Samza, micro-batches in Spark Streaming (which isn't exactly the same thing as Spark), and spouts and bolts with tuples in Storm.
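For illustration only, here is a minimal sketch in Java of the Spark Streaming micro-batch model (host/port are placeholders): the engine buffers events and processes them as a small batch on a fixed interval, rather than one tuple at a time as in Storm's spout/bolt model.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class MicroBatchWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("MicroBatchWordCount");
        // Each batch covers 5 seconds of input; this interval is what makes it "micro-batch".
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999); // placeholder source
        JavaDStream<String> words = lines.flatMap(l -> Arrays.asList(l.split(" ")).iterator());
        words.countByValue().print(); // prints per-batch counts, not a running total

        ssc.start();
        ssc.awaitTermination();
    }
}
```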
None of these are "better." It all depends on your use cases, the strengths of your team, how the APIs match up with your mental models, quality of support, etc.
You also forgot Apache Flink and Twitter's Heron, which Twitter built because Storm started to fail them. Then again, very few teams need to operate at Twitter's scale.
I need to start a new project, and I do not know whether Spark or Flink would be better. Currently the project needs micro-batching, but later it could require event-stream handling as well.
Supposing Spark would be best: is there any disadvantage to using Beam instead and selecting Spark/Flink as the runner/engine?
Will Beam add any overhead or lack certain APIs/functions available in Spark/Flink?
To answer a part of your question:
First of all, Beam defines an API for programming data processing. To adopt it, you first have to understand its programming model and make sure that model fits your needs.
Assuming you have a fair understanding of what Beam can help you with, and you are planning to select Spark as the execution runner, you can check the runner capability matrix [1] for Beam API support on Spark.
Regarding the overhead of running Beam over Spark: you might want to ask on user@beam.apache.org or dev@beam.apache.org. The runner developers may have better answers.
[1] https://beam.apache.org/documentation/runners/capability-matrix/
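To make the runner choice concrete, here is a minimal, hedged sketch (class name and file paths are placeholders) of a Beam word-count pipeline pinned to the Spark runner; switching to Flink would mean swapping SparkRunner for FlinkRunner and leaving the pipeline code untouched:

```java
import java.util.Arrays;

import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCountOnSpark {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        options.setRunner(SparkRunner.class); // the only Spark-specific line

        Pipeline p = Pipeline.create(options);
        p.apply(TextIO.read().from("input.txt"))                        // placeholder input path
         .apply(FlatMapElements.into(TypeDescriptors.strings())
                               .via((String line) -> Arrays.asList(line.split("\\s+"))))
         .apply(Count.perElement())
         .apply(MapElements.into(TypeDescriptors.strings())
                           .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
         .apply(TextIO.write().to("counts"));                           // placeholder output prefix
        p.run().waitUntilFinish();
    }
}
```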
What are the advantages/disadvantages of using Akka Streams vs. Spark Streaming for stream processing, in terms of built-in back pressure, performance, fault tolerance, built-in transformations, flexibility, etc.? I'm NOT asking about Akka vs. Spark pros/cons, strictly the streaming components. I'm also NOT asking about under-the-hood framework architecture differences.
Akka Streams and Spark Streaming come from two different lands. Do not let the word "streams" confuse you.
Akka Streams implements the Reactive Streams specification (in the spirit of the Reactive Manifesto), which is great for achieving really low latency, and it provides a lot of operators for writing transformations over streams declaratively and easily. More about this at https://doc.akka.io/docs/akka/2.5.4/scala/stream/stream-introduction.html#motivation.
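As a taste of that declarative style, here is a minimal Java sketch (assuming the Akka 2.5-era API that the linked docs describe); each stage only pulls data as fast as the downstream can consume it, which is the built-in backpressure:

```java
import java.util.concurrent.CompletionStage;

import akka.Done;
import akka.actor.ActorSystem;
import akka.stream.ActorMaterializer;
import akka.stream.Materializer;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;

public class AkkaStreamsSketch {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("demo");
        Materializer mat = ActorMaterializer.create(system); // Akka 2.5-era materializer

        // A declarative pipeline: source -> transformations -> sink.
        // Backpressure between stages is automatic; no batching interval is involved.
        CompletionStage<Done> done =
            Source.range(1, 1_000_000)
                  .map(i -> i * 2)
                  .filter(i -> i % 3 == 0)
                  .runWith(Sink.foreach(System.out::println), mat);

        done.thenRun(system::terminate);
    }
}
```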
Spark Streaming (Structured Streaming, as of 2.2) is still a micro-batch approach to processing huge amounts of data (Big Data). Events are collected and then processed periodically in small batches every few seconds.
Akka Streams is basically not distributed and does not scale out across clusters, unlike Spark. Akka Streams uses Akka's actor model to achieve concurrency.
Akka Streams is a toolkit; Spark is a framework.
PS: I had the same question a couple of months back. It took a while to get my answers. Hope it's helpful.
We are building a university website and looking for a search solution for it. The website has high traffic because the university has an open-university faculty, so there are a great many students (approximately 1.5 million). We already use caching to speed the website up. Which search engine do you suggest for our situation?
Note: We are considering Solr, Elasticsearch, or Sphinx for now, but it could also be one of the others.
Update: We need a full-text search engine that is fast and extensible, with features like query similarity ("more like this") and support for indicating priority (boosting).
Thanks.
It really depends on your use case, what features you want, and whether you have experience with any of the technologies. I could paraphrase the arguments, but there's a very good discussion here: "ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage?", which covers the pros and cons of each.
Edit (in response to the question's edit):
Of these technologies I have only used Solr (and SQL), but I've found it easy to use and would recommend it. It supports native sharding and replication, which should cover the extensibility requirement. It also supports things like joins and field weighting, which I think covers all your needs, if I read your requirements correctly.
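For instance, a hedged SolrJ sketch (core name, fields, and boost values are all placeholders) of a field-weighted query could look like this:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class WeightedSearch {
    public static void main(String[] args) throws Exception {
        // "university" is a placeholder core name.
        HttpSolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/university").build();

        SolrQuery query = new SolrQuery("open university enrollment");
        query.set("defType", "edismax");     // extended DisMax parser supports field weighting
        query.set("qf", "title^10 body^2");  // matches in title outrank matches in body
        query.setRows(10);

        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("title"));
        }
        solr.close();
    }
}
```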
After much reading and some attempts to implement DDD, I think I understand what people mean when they say the concept was developed for complex domains.
I usually develop web applications for small and medium businesses. Usually the interactions are just CRUD operations and HTML tables; the only thing beyond that is some validation before inserting the data into a database.
I was reading about CQRS on Martin Fowler's website, and a phrase caught my attention: "CQRS is suited to complex domains, the kind that also benefit from Domain-Driven Design."
So my question is: how do you assess the complexity of a piece of software?
When is applying DDD worth it?
Is DDD worth applying in software of small or medium complexity?
Thank you!
Often even applications that look simplest on the surface can turn into something complex. Right now I always try to apply some basics of DDD (at least the tactical patterns), and if I see that the project is getting out of hand, then I start to map contexts, etc.
The complexity of software can be assessed by analyzing your understanding of the business domain.
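As a concrete (hypothetical) example of a tactical pattern that pays off even in a CRUD application, a small value object can replace a raw string and centralize the validation that would otherwise sit in front of every insert:

```java
// Hypothetical sketch: an immutable value object in place of a raw String.
// The validation lives in one place, and the type carries a domain name.
public final class EmailAddress {
    private final String value;

    public EmailAddress(String value) {
        if (value == null || !value.matches("^[^@\\s]+@[^@\\s]+$")) {
            throw new IllegalArgumentException("Invalid email address: " + value);
        }
        this.value = value;
    }

    public String value() {
        return value;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof EmailAddress && value.equals(((EmailAddress) o).value);
    }

    @Override
    public int hashCode() {
        return value.hashCode();
    }

    @Override
    public String toString() {
        return value;
    }
}
```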
How does Zed Shaw's Lua web framework, Tir, compare to other Lua web frameworks such as Kepler, LuCI, etc.?
A comparison on things such as:
maturity of code base
features/functionality
performance
ease of use
UPDATE:
Since Tir is based on Lua's coroutines, doesn't this imply that Tir will never be able to scale well? The reason being, Lua's coroutines cannot take advantage of multi-core/multi-processor systems, given that coroutines are implemented in Lua as cooperative/collaborative threads (as opposed to pre-emptive ones)?
Tir is much newer than Kepler or LuCI, so the code isn't nearly as mature. I would rank Tir as experimental, right now. The same factor also means that it has significantly fewer features.
It does have a very pleasant continuation-passing style of development available, though, through its coroutine-based flow handling.
I would rate it, personally, as fun for experimentation, but probably not ready for heavy lifting until Zed stabilizes it more :-)
This video from PyCon 2011 basically says you scale on multi-core or multi-processor systems by running more workers; under high-load conditions, the memory advantage of coroutines gives better performance.
In the video it's said that at Meebo they have used this approach for the last few months under huge load.
The video is Python-specific, so it covers just the scaling-of-coroutines part of the question. It's about thirty minutes long.
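To make the "more workers" idea concrete, here is a hedged sketch (the worker command and ports are placeholders, not anything Tir or Meebo actually ships) of launching one single-threaded, cooperatively scheduled worker per core:

```java
import java.util.ArrayList;
import java.util.List;

public class WorkerLauncher {
    public static void main(String[] args) throws Exception {
        // One worker per core: each worker is single-threaded internally
        // (coroutines), so parallelism comes from the process count.
        int cores = Runtime.getRuntime().availableProcessors();
        List<Process> workers = new ArrayList<>();
        for (int i = 0; i < cores; i++) {
            // "lua worker.lua" is a placeholder for any coroutine-based worker.
            workers.add(new ProcessBuilder("lua", "worker.lua", "--port", String.valueOf(8000 + i))
                    .inheritIO()
                    .start());
        }
        for (Process w : workers) {
            w.waitFor();
        }
    }
}
```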