Assume we have a 100 GB file and my system has only 60 GB. How will Apache Spark handle this data?
We all know Spark performs partitioning on its own based on the cluster, but I want to know how Spark handles the data when the available memory is smaller than the dataset.
In short: Spark does not require the full dataset to fit in memory at once. However, some operations may demand an entire partition of the dataset to fit in memory. Note that Spark allows you to control the number of partitions (and, consequently, the size of them).
See this topic for the details.
It is also worth noting that Java objects usually take more space than the raw data, so you may want to look at this.
I would also recommend looking at Apache Spark: Memory Management and Graceful Degradation.
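To make the partition-control point concrete, here is a minimal PySpark sketch (the file path and partition counts are illustrative assumptions, not recommendations): increasing the number of partitions makes each partition smaller, so it is more likely to fit in the available executor memory, and Spark can spill intermediate data to disk instead of keeping everything in memory.

    from pyspark.sql import SparkSession

    # Hypothetical session; in practice this would run on a cluster.
    spark = SparkSession.builder.appName("large-file-example").getOrCreate()

    # Reading a (hypothetical) 100 GB text file. Spark reads it split by split;
    # it never tries to load the whole file into memory at once.
    df = spark.read.text("/data/big_file.txt")

    # Increase the number of partitions so each one is small enough to fit
    # in executor memory. 800 is an arbitrary illustrative value.
    df = df.repartition(800)

    # Shuffle-heavy operations produce new partitions; their count is tunable too.
    spark.conf.set("spark.sql.shuffle.partitions", "800")

    print(df.rdd.getNumPartitions())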
Can we process 1 TB of data using Spark with 2 executors that have 5 GB of memory each? If not, how many executors are required, assuming we don't have any time constraints?
This is a very difficult question to answer without looking at your data and code.
If you're ingesting raw files of 1 TB without any caching, then it MAY be possible with 5 GB of memory per executor, but it will take a very long time because parallelism is limited with only 2 executors, unless each executor has multiple cores. It also depends on whether that 1 TB is compressed data or raw text files.
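To make the trade-off concrete, here is a hedged PySpark sketch of the setup described in the question (the dataset path, core count and partition count are assumptions made for illustration): with only 2 executors, throughput mostly comes from cores per executor and from splitting the data into many small partitions so that no single partition needs anywhere near 5 GB.

    from pyspark.sql import SparkSession

    # Hypothetical configuration matching the question: 2 executors, 5 GB each.
    # These settings normally belong in spark-submit / cluster config; they are
    # shown inline only for illustration.
    spark = (
        SparkSession.builder
        .appName("one-tb-scan")
        .config("spark.executor.instances", "2")
        .config("spark.executor.memory", "5g")
        .config("spark.executor.cores", "4")             # more cores per executor -> more parallel tasks
        .config("spark.sql.shuffle.partitions", "2000")  # many small partitions, each well under 5 GB
        .getOrCreate()
    )

    # A scan-and-aggregate job like this can stream through 1 TB with modest
    # memory, as long as nothing forces a whole partition (or a cached dataset)
    # to sit in memory at once.
    df = spark.read.parquet("/data/one_tb_dataset")  # hypothetical path
    print(df.count())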
I hope this helps.
I need to start a new project, and I do not know if Spark or Flink would be better. Currently, the project needs micro-batching but later it could require stream-event-handling as well.
Supposing Spark would be the best fit, is there any disadvantage to using Beam instead and selecting Spark/Flink as the runner/engine?
Will Beam add any overhead or lack certain API/functions available in Spark/Flink?
To answer part of your question:
First of all, Beam defines an API for programming data processing. To adopt it, you first have to understand its programming model and make sure that model fits your needs.
Assuming you have a fair understanding of what Beam can do for you, and you are planning to select Spark as the execution runner, you can check the runner capability matrix [1] for Beam API support on Spark.
Regarding the overhead of running Beam over Spark, you might want to ask on user@beam.apache.org or dev@beam.apache.org; the runner developers can give better answers on that.
[1] https://beam.apache.org/documentation/runners/capability-matrix/
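As a small, hedged illustration of "Beam defines the API, the runner executes it": in Beam's Python SDK the runner is chosen purely through pipeline options, so the same pipeline code could be pointed at Spark or Flink (subject to the capability matrix above and to each runner's extra options, such as a job-server endpoint, which are omitted here).

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Switching engines is (mostly) a matter of changing the runner option,
    # e.g. --runner=SparkRunner or --runner=FlinkRunner instead of DirectRunner.
    options = PipelineOptions(["--runner=DirectRunner"])

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Create" >> beam.Create(["alpha", "beta", "gamma"])
            | "Upper" >> beam.Map(str.upper)
            | "Print" >> beam.Map(print)
        )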
Is it safe to run sstableverify against live sstables while Cassandra is running?
While running sstableverify in a lab, and afterwards, I could not find anything in any logs that would indicate a problem.
SSTables are immutable by nature, so you can work with them using the same user that runs Cassandra (see comment for explanation). If you want to prevent them from disappearing because of compaction, you can take a snapshot, which creates hard links to the files (but don't forget to remove the snapshot later).
Safe, maybe... but you may run into SSTables disappearing due to compaction (or maybe something else) while running sstableverify. Use nodetool verify instead for live data; use sstableverify to verify data on offline recovery systems.
I am exploring Cassandra and it seems very interesting. Could someone give me an overview of how a Bloom filter works?
What is its purpose in Cassandra?
Thanks
A Bloom filter (in general) is a probabilistic, index-like data structure: it answers "definitely not present" for objects that are not in it, and "possibly present" for objects that may be in it.
Cassandra uses it for faster reads: each SSTable has a Bloom filter that is kept in memory, so Cassandra can skip SSTables that definitely do not contain the requested partition. (Note: the Bloom filter is also persisted on disk so it can be reloaded when the node restarts.)
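To make the "definitely not / possibly yes" behaviour concrete, here is a toy Python sketch of a Bloom filter. This is not Cassandra's actual implementation, just the idea it relies on when deciding whether an SSTable can be skipped during a read.

    import hashlib

    class BloomFilter:
        """Toy Bloom filter: no false negatives, occasional false positives."""

        def __init__(self, size=1024, num_hashes=3):
            self.size = size
            self.num_hashes = num_hashes
            self.bits = [False] * size

        def _positions(self, key):
            # Derive several bit positions from hashes of the key.
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos] = True

        def might_contain(self, key):
            # False -> key is definitely not in the set.
            # True  -> key *may* be in the set (could be a false positive).
            return all(self.bits[pos] for pos in self._positions(key))

    bf = BloomFilter()
    bf.add("partition-key-42")
    print(bf.might_contain("partition-key-42"))  # True: it was added
    print(bf.might_contain("some-other-key"))    # usually False: skip that SSTable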
Check this link for more understanding.
http://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html
What's better for retrieving complex data from ArangoDB: one big query with all the collection joins and graph traversal, or multiple queries for each piece of data?
I think it depends on several aspects, e.g. the operation(s) you want to perform, the scenario in which the query or queries will be executed, and whether you favor performance over maintainability.
AQL lets you write a single non-trivial query that spans the entire dataset and performs complex operations. Splitting a big query into multiple smaller ones might improve maintainability and code readability, but on the other hand separate queries for each piece of data can hurt performance because of the network latency added by each request. One should also consider whether the scenario allows working with partial results returned from the database while the next batch of queries is being processed.
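To illustrate the latency trade-off, here is a hedged sketch using the python-arango driver with made-up collection names ("users", "orders") and credentials: the single AQL query performs the join server-side in one round trip, while the second approach issues one extra request per user.

    from arango import ArangoClient  # python-arango driver

    # Hypothetical database, credentials and collections, for illustration only.
    client = ArangoClient(hosts="http://localhost:8529")
    db = client.db("shop", username="root", password="secret")

    # Option 1: one bigger AQL query that joins the data server-side,
    # returning everything in a single round trip.
    single_query = """
    FOR u IN users
      FILTER u.active == true
      LET userOrders = (FOR o IN orders FILTER o.user == u._key RETURN o)
      RETURN { user: u, orders: userOrders }
    """
    combined = list(db.aql.execute(single_query))

    # Option 2: multiple smaller queries, at the cost of one extra round trip
    # per user.
    users = list(db.aql.execute("FOR u IN users FILTER u.active == true RETURN u"))
    orders_by_user = {
        u["_key"]: list(
            db.aql.execute(
                "FOR o IN orders FILTER o.user == @key RETURN o",
                bind_vars={"key": u["_key"]},
            )
        )
        for u in users
    }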