Determining Partitioning in spark - apache-spark

I was going through this blog and came across the paragraph below:
1. If the config is set to some value and the maximum value is above the config value, then the maximum value is chosen.
2. If the config is set and the maximum value is below the config value, but the maximum value is within a single order of magnitude of the highest number of partitions among all upstream RDDs, then the maximum value is chosen. For example, if the maximum value is 120, the highest number of partitions among upstream RDDs is 1000, and the config value for ‘spark.default.parallelism’ is 1050, then 120 is chosen.
3. If the config is set and the maximum value is below the config value, and the maximum value is not within a single order of magnitude of the highest number of partitions among all upstream RDDs, then the config value is the answer. For example, if the maximum value is 80, the highest number of partitions among upstream RDDs is 1000, and the config value for ‘spark.default.parallelism’ is 1050, then 1050 is chosen.
4. If the config is not set, then the maximum value is the answer.
Points 1 and 4 are clear to me, but I am confused about points 2 and 3. I am not able to understand the following:
What is the meaning of "the maximum value is within a single order of magnitude of the highest number of partitions among all upstream RDDs"?
Where does the number 1000 come from?
If I am joining two RDDs, A and B, where A has 30 partitions and B has 120 partitions, what does "maximum value" mean in points 2 and 3? Does it refer to 120 here?
Can someone please help me understand this with an example?
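For reference, here is a minimal PySpark sketch of the scenario I am describing. The config value, data, and the partitionBy call are just my illustration (they are not from the blog); it simply prints the partition counts so you can see which rule applied:

from pyspark import SparkConf, SparkContext

# Set spark.default.parallelism explicitly so the "config is set" rules (1-3) can apply.
conf = (SparkConf()
        .setMaster("local[4]")
        .setAppName("partitioning-demo")
        .set("spark.default.parallelism", "1050"))
sc = SparkContext(conf=conf)

# The scenario from the question: RDD A with 30 partitions, RDD B with 120 partitions.
a = sc.parallelize([(i, "a") for i in range(1000)], 30)
b = sc.parallelize([(i, "b") for i in range(1000)], 120).partitionBy(120)

joined = a.join(b)

print(a.getNumPartitions())        # 30
print(b.getNumPartitions())        # 120 -- the largest upstream partition count
print(joined.getNumPartitions())   # shows which of the rules above won for this combination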

Related

how to decide number of executors for 1 billion rows in spark

We have a table which has one billion three hundred and fifty-five million rows.
The table has 20 columns.
We want to join this table with another table which has more or less the same number of rows.
How do we decide the value for spark.conf.set("spark.sql.shuffle.partitions", ?)?
How do we decide the number of executors and their resource allocation details?
How do we find out how much memory those one billion three hundred and fifty-five million rows will take?
As #samkart says, you have to experiment to figure out the best parameters, since they depend on the size and nature of your data. The Spark tuning guide would be helpful.
Here are some things that you may want to tweak:
spark.executor.cores is 1 by default (on YARN), but you should look to increase it to improve parallelism. A common rule of thumb is to set it to 5.
spark.files.maxPartitionBytes (and its Spark SQL counterpart spark.sql.files.maxPartitionBytes) determines the amount of data packed into a single partition while reading files, and hence determines the initial number of partitions. You could tweak it depending on the data size. The default is 128 MB, matching the common HDFS block size.
spark.sql.shuffle.partitions is 200 by default but tweak it depending on the data size and number of cores. This blog would be helpful.
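As a rough illustration of where these settings go (the numbers below are placeholders to experiment with, not recommendations), in PySpark:

from pyspark.sql import SparkSession

# Placeholder values; tune them against your own data size and cluster.
spark = (
    SparkSession.builder
    .appName("billion-row-join")
    .config("spark.executor.cores", "5")                        # rule of thumb from above
    .config("spark.sql.files.maxPartitionBytes", "134217728")   # 128 MB per input partition for DataFrame reads
    .config("spark.sql.shuffle.partitions", "2000")             # example value only
    .getOrCreate()
)

# The shuffle partition count can also be changed per job without restarting the session.
spark.conf.set("spark.sql.shuffle.partitions", "2000")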

Randomly generate x whole numbers with sum of y with Excel

I was referring to this post on generating x random numbers using =RAND(); however, it caters for every number, including those with decimals. I want to generate only positive whole numbers (e.g. 1, 3, 50), not decimals.
To be clear, for example, I want to generate:
50 random positive whole numbers that have a sum of 1000
PS: If you find this question for Excel solution redundant, let me know and I'll close this.
I offer a solution which has better statistical properties than I had originally supposed:
Estimate the upper limit of each number as roughly twice the mean. In your case the mean is 20 (1000 / 50), so with a lower limit of 1 the upper limit is 39 (keeping the midpoint at 20).
Generate 50 floating point numbers using
=RAND() * 38 + 1
Sum the total that you get, call that s
Rescale each number by multiplying by 1000 / s, and round the result in the normal way. (Use ROUND.)
Sum that. Call it t.
If t is less than 1000, add 1000 - t to the smallest number. If it's greater than 1000, subtract t - 1000 from the largest number.
This should be approximately uniformly distributed and have a good mean. You can run the results through some statistical tests for randomness to gauge whether or not they fit your requirements. My instinct is that it will not be much worse than RAND() itself.
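For anyone who wants to sanity-check the recipe outside Excel, here is a small Python sketch of the same steps (the function name and the defaults are just for illustration):

import random

def random_whole_numbers(count=50, total=1000, seed=None):
    # Draw uniform numbers between 1 and roughly twice the mean,
    # rescale so they sum to the target, round, then patch the
    # rounding error onto the smallest or largest value.
    rng = random.Random(seed)
    mean = total / count              # 20 in the question's example
    upper = 2 * mean - 1              # 39, with a lower limit of 1

    raw = [rng.uniform(1, upper) for _ in range(count)]   # equivalent of =RAND() * 38 + 1
    s = sum(raw)
    scaled = [round(x * total / s) for x in raw]          # rescale and ROUND

    t = sum(scaled)
    if t < total:
        scaled[scaled.index(min(scaled))] += total - t    # add the shortfall to the smallest number
    elif t > total:
        scaled[scaled.index(max(scaled))] -= t - total    # subtract the excess from the largest number
    return scaled

nums = random_whole_numbers()
print(sum(nums), min(nums), max(nums))   # the sum is exactly 1000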

Spotfire- calculated column with row ratios based on condition

I’m having trouble understanding whether Spotfire allows conditional computations between arbitrary rows of numerical data repeated over data groups. I could not find anything to point me toward a solution.
Context (simplified): I have data from a sensor reporting state of a process and this data is grouped into bursts/groups representing a measurement taking several minutes each.
Within each burst the sensor measures a signal, and if a predefined feature (signal shape) is detected, the sensor outputs a calculated value V quantifying this feature and also reports the RunTime at which this happened.
So in essence I have three columns: Burst number, a set of RTs within this burst and Values associated with these RTs.
I need to add a calculated column that computes a ratio of the Values in rows where RT equals specific numbers, let’s say 1.89 and 2.76.
The high level logic would be:
If a Value exists at 1.89 Run Time and a Value exists at 2.76 Run Time then compute the ratio of these values. Repeat for every Burst.
I understand I can repeat the computation over groups using OVER operator but I’m struggling with logic within each group...
Any tips would be appreciated.
Many thanks!
The first thing you need to do here is apply an order to your dataset. I assume the sample data is complete and encompasses the cases in your real data; thus, we create a calculated column:
RowID() as [ROWID]
Once this is done, we can create a calculated column which will compute your ratio over its respective groups. Just a note, your B4 example is incorrect compared to the other groups. That is, you have your numerator and denominator reversed.
If(([RT]=1.89) or ([RT]=2.76),[Value] / Max([Value]) OVER (Intersect([Burst],Previous([ROWID]))))
Breaking this down...
If(([RT]=1.89) or ([RT]=2.76), limits the rows to those where the RT = 1.89 or 2.76.
Next comes what is evaluated when the above condition is TRUE:
[Value] / Max([Value]) OVER (Intersect([Burst],Previous([ROWID]))) This takes the value for the row and divides it by the Max([Value]) over the grouping of [Burst] and Previous([ROWID]). This is noted by the Intersect() function. So, the denominator will always be the previous value for the grouping. Note that Max() was used as a simple aggregate, but any aggregate should do for this case since we are only expecting a single value. All Over() functions require an aggregate to limit the result set to a single row.
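If it helps to see the intended logic outside Spotfire, here is a small pandas sketch of the same per-burst ratio with made-up sample data (this illustrates the logic only; it is not a Spotfire expression):

import pandas as pd

# Made-up data in the shape described: Burst, RT, Value.
df = pd.DataFrame({
    "Burst": ["B1", "B1", "B1", "B2", "B2", "B2"],
    "RT":    [1.89, 2.10, 2.76, 1.89, 2.50, 2.76],
    "Value": [10.0, 99.0, 5.0,  8.0,  99.0, 4.0],
})

# Keep only the two RTs of interest, then divide each burst's value at RT 2.76
# by its value at RT 1.89 (flip the ratio if you need the other direction).
pivot = (
    df[df["RT"].isin([1.89, 2.76])]
    .pivot(index="Burst", columns="RT", values="Value")
)
pivot["ratio"] = pivot[2.76] / pivot[1.89]
print(pivot)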

Count times a multiple of a threshold has been exceeded

I've got a single row of absolute values along the lines of this:
3001
3123
3342
3453
3561
Think of this as a growing graph with the individual values being connected. Now I want to count the number of times the value of a cell has exceeded a certain threshold from the previous entry. Specifically, every time a number has exceeded a multiple of 500, I want the counter to go up.
So in this example, nothing happens until the very last entry, where the number went from 3453 to 3561 and thus surpassed the 3500 threshold.
How would you do this?
If you only care about how many times the number has increased by 500 from the start:
=INT((MAX($A$1:$A$5)-MIN($A$1:$A$5))/500)
As per your comments you can use this formula
=SUMPRODUCT(--(INT((A2:A11-A1)/500)>INT((A1:A10-A1)/500)))
Column B is just to show where the numbers came from.
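The same count can be sanity-checked outside Excel with a short Python sketch using the values from the question. It counts crossings of absolute multiples of 500, which gives the same answer here as the baseline-relative SUMPRODUCT formula above:

# Count how many times a value crosses a new multiple of 500 compared to the previous entry.
values = [3001, 3123, 3342, 3453, 3561]
step = 500

crossings = sum(curr // step > prev // step for prev, curr in zip(values, values[1:]))
print(crossings)   # 1: only the step from 3453 to 3561 passes the 3500 mark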

Load factor in HashMap with linkedList

For load factor, I know it's the total number of elements divided by the space available. For the picture below, at index 2 for example, does it count as 1 spot or 6?
for load factor, I know it's the total number of elements divided by the space available
Yes, the load factor is the total number of entries divided by the number of bins. That's the average number of entries stored in each bin of the HashMap. This number should be kept small in order for the HashMap to have expected constant running time for the get(key) and put(key,value) methods.
at index 2 for example, does it count as 1 spot or 6
Each index represents 1 bin of the HashMap, regardless of how many entries are stored in it.
Therefore, in your example (the image you linked to), you have 10 entries and 5 bins, and the load factor is 2.
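As a toy illustration of that arithmetic (made-up keys, arranged to match the 10-entries / 5-bins example), in Python:

# A chained hash table modelled as a list of bins, each bin holding a list of entries.
# The load factor counts entries, not how many bins happen to be occupied.
bins = [
    [],                                                   # bin 0: empty
    [("k1", 1)],                                          # bin 1: one entry
    [("k2", 2), ("k3", 3), ("k4", 4),
     ("k5", 5), ("k6", 6), ("k7", 7)],                    # bin 2: a six-entry chain counts as 6 entries
    [("k8", 8)],                                          # bin 3
    [("k9", 9), ("k10", 10)],                             # bin 4
]

entries = sum(len(chain) for chain in bins)
load_factor = entries / len(bins)
print(entries, len(bins), load_factor)   # 10 entries, 5 bins -> load factor 2.0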
