Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
Can we process 1 TB of data using Spark with 2 executors having 5 GB of memory each? If not, how many executors are required? Assume we don't have any time constraints.
This is a very difficult question to answer without looking at your data and code.
If you're ingesting raw files totalling 1 TB without any caching, then it MAY be possible with 5 GB of memory per executor, but it will take a very long time, since parallelism is limited with only 2 executors unless each has multiple cores. It also depends on whether the 1 TB refers to compressed data or raw text files.
I hope this helps.
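As a rough illustration, here is a minimal PySpark sketch of that kind of setup; the submit options, input path, and aggregation are assumptions for illustration only. The job reads lazily and writes a small result out, so Spark can spill what does not fit in memory to local disk:

```python
# Hypothetical submit command (executor counts and sizes are assumptions):
#   spark-submit --num-executors 2 --executor-memory 5g --executor-cores 2 process_1tb.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("process-1tb").getOrCreate()

# Read the raw files lazily; nothing is loaded into memory yet.
df = spark.read.text("hdfs:///data/raw_1tb/")   # hypothetical input path

# A simple aggregation: each task only needs its own partition in memory,
# and shuffle data that does not fit is spilled to local disk.
counts = (df.selectExpr("length(value) AS line_length")
            .groupBy("line_length")
            .count())

# Write the (much smaller) result out instead of collecting it to the driver.
counts.write.mode("overwrite").parquet("hdfs:///data/line_length_counts/")

spark.stop()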
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I would like to plot the CPU and memory usage of an application on Linux vs time. What is the best way to do this?
Would grepping these values out of top every 0.1s and writing them to a file work, or is there a better and easier way?
There is an easier way. All of the information displayed in top can be found in /proc/<pid>/, most of it in /proc/<pid>/stat. man proc describes the content of these files.
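A small sketch of that approach, assuming Python on Linux; the sample interval, output file name, and the choice to track only utime/stime and RSS are my assumptions, not part of the original answer:

```python
#!/usr/bin/env python3
"""Sample CPU and memory usage of one process from /proc/<pid>/stat into a CSV."""
import os
import sys
import time

PID = int(sys.argv[1])          # process to watch
INTERVAL = 0.1                  # seconds between samples (assumption)
CLK_TCK = os.sysconf("SC_CLK_TCK")
PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

def read_stat(pid):
    """Return (utime + stime) in clock ticks and RSS in bytes for the process."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name (field 2) is in parentheses and may contain spaces,
    # so split on the last ')' before parsing the numeric fields.
    fields = data.rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])   # stat fields 14 and 15
    rss_pages = int(fields[21])                        # stat field 24
    return utime + stime, rss_pages * PAGE_SIZE

with open("usage.csv", "w") as out:
    out.write("time_s,cpu_percent,rss_mb\n")
    start = time.time()
    prev_ticks, _ = read_stat(PID)
    while True:
        time.sleep(INTERVAL)
        ticks, rss = read_stat(PID)
        elapsed = time.time() - start
        cpu_pct = 100.0 * (ticks - prev_ticks) / CLK_TCK / INTERVAL
        out.write(f"{elapsed:.1f},{cpu_pct:.1f},{rss / 1e6:.1f}\n")
        prev_ticks = ticks
```
Stop it with Ctrl-C and plot usage.csv with whatever you like (gnuplot, matplotlib, a spreadsheet).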
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
what is "spark.executor.memoryOverhead" and "spark.memory.fraction"?
what is the default properties
The spark.memory.fraction parameter determines how much memory is set aside for storage and execution. If you are caching many objects in memory, you will need more storage memory (spark.memory.fraction can be 0.5/0.6). However, if you are using memory largely for execution, then you need that memory to be available for execution (spark.memory.fraction can be 0.2/0.3).
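For reference, in recent Spark releases spark.memory.fraction defaults to 0.6, and spark.executor.memoryOverhead defaults to 10% of the executor memory with a 384 MB minimum. A minimal sketch of setting both properties from PySpark; the values shown are purely illustrative, not recommendations:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-config-example")
    # Extra off-heap/JVM overhead reserved per executor, on top of executor memory.
    .config("spark.executor.memoryOverhead", "1g")
    # Fraction of (heap - reserved memory) shared by execution and storage.
    .config("spark.memory.fraction", "0.6")
    .getOrCreate()
)
```
The same properties can also be passed on the command line with --conf to spark-submit.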
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Assume we have a 100 GB file and my system has 60 GB. How will Apache Spark handle this data?
We all know Spark performs partitioning on its own based on the cluster. But when there is less memory than data, I want to know how Spark handles it.
In short: Spark does not require the full dataset to fit in memory at once. However, some operations may demand an entire partition of the dataset to fit in memory. Note that Spark allows you to control the number of partitions (and, consequently, the size of them).
See this topic for the details.
It is also worth noting that Java objects usually take more space than the raw data, so you may want to look at this.
I would also recommend looking at Apache Spark: Memory Management and Graceful Degradation.
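As a concrete illustration of controlling the number (and hence size) of partitions, here is a small PySpark sketch; the input path, 128 MB limit, and target partition count are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-control").getOrCreate()

# Cap how much data goes into each input partition when reading files.
# Smaller partitions mean each task holds less data in memory at a time.
spark.conf.set("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)

df = spark.read.parquet("/data/100gb_dataset/")   # hypothetical path
print("partitions after read:", df.rdd.getNumPartitions())

# Increase the partition count further before a heavy transformation,
# trading more (smaller) tasks for a lower per-task memory footprint.
df = df.repartition(800)
```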
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
Is it possible with Kettle to read different files simultaneously? How does it work?
Is there a notion of parallelism? Is it related to threads?
Thanks
This is naturally how Kettle works. Each step in a transform runs in its own thread. So if you have multiple input steps, each reading a different file, each file will be read in its own thread.
Note, this is true of transforms, not jobs. Parallel execution in jobs is trickier. For an example of sequencing parallel jobs, check out my answer here:
Waiting for Transformations in a Job
Closed. This question is off-topic. It is not currently accepting answers.
Closed 9 years ago.
This question has been flagged as irrelevant, so I guess it has no real worth to anyone. I tried removing the question, but the system won't let me, so I am now truncating the content of this post ;)
I think you need to run the actual numbers for both scenarios:
On the fly
- how long does one image take to generate, and do you want the client to wait that long?
- do you pay by CPU utilization, number of CPUs, etc., and what will this cost for X images thumbnailed Y times over one year?
Stored
- how much space will the thumbnails use, and what will that cost?
- how many files are there? Is that number bigger than the number of inodes in the destination file system, or is the total estimated size bigger than the file system?
It's mostly an economics question; there is no general yes/no answer. When in doubt, I'd probably go with storing them, since thumbnailing is a computation-intensive task and it's not very efficient to do it over and over again. You could also do a hybrid solution: generate a thumbnail on the fly when it is first requested, then cache it until it hasn't been used for a certain number of days.
TL;DR: number of inodes is probably your least concern.
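A rough sketch of that hybrid approach, assuming the Pillow library and made-up directory names; the cache layout and expiry policy are up to you:

```python
import os
import time
from PIL import Image   # Pillow

SOURCE_DIR = "/srv/images"        # hypothetical originals
CACHE_DIR = "/srv/thumb_cache"    # hypothetical thumbnail cache
THUMB_SIZE = (200, 200)
MAX_AGE_DAYS = 30                 # evict thumbnails unused for this long

def get_thumbnail(name):
    """Return the path to a cached thumbnail, generating it on first request."""
    src = os.path.join(SOURCE_DIR, name)
    thumb = os.path.join(CACHE_DIR, name)
    if not os.path.exists(thumb):
        os.makedirs(CACHE_DIR, exist_ok=True)
        with Image.open(src) as im:
            im.thumbnail(THUMB_SIZE)   # resize in place, preserving aspect ratio
            im.save(thumb)
    else:
        os.utime(thumb)               # record the access for the eviction pass
    return thumb

def evict_stale_thumbnails():
    """Delete cached thumbnails that have not been requested recently."""
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for name in os.listdir(CACHE_DIR):
        path = os.path.join(CACHE_DIR, name)
        if os.path.getmtime(path) < cutoff:
            os.remove(path)
```
Run evict_stale_thumbnails() from a periodic job (cron or similar) so the cache doesn't grow without bound.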