How many ways to place bricks on a board - combinatorics

Given: a board of 5x5 cells and 5 bricks.
Assume that all bricks are the same.
How many ways are there to place the bricks on the board?
How many ways are there to place the bricks on the board with one cell left empty (the top left one)? The remaining cells can be either empty or not.
More than one brick is allowed in one cell.
All bricks have different colors.
Same questions.
Can anyone help me with it? With explanation if possible.

How many ways can I distinctly put 5 identical bricks on a 5x5 playing board with either 1 or 0 bricks per square?
25! / (5! * 20!)
The first brick can be put on any of 25 squares, the second on any of 24, the third on any of 23, the fourth on any of 22 and the fifth on any of 21. So there are 25 * 24 * 23 * 22 * 21 ways to place the bricks = 25! / 20!
As the bricks are identical, each arrangement is counted 5! times, once for every order in which the same 5 squares could have been filled (5 options for place 1, 4 for place 2, etc.). So there are 25! / 20! ways of placing distinct bricks and 25! / (5! * 20!) ways of placing identical bricks on a 5x5 board.
How many ways can I distinctly put 5 identical bricks on a 5x5 playing board with either 1 or 0 bricks per square for all squares except the top left one where there cannot be any bricks?
24! / (5! * 19!)
This is the same as above, but with only 24 available squares for the bricks.
How many ways can I put 5 distinct bricks on a 5x5 playing board with either 1 or 0 bricks per square?
25! / 20!
Reason: This is explained as the first part of the explanation of question 1.
How many ways can I put 5 distinct bricks on a 5x5 playing board with either 1 or 0 bricks per square but no bricks in the top left square?
24! / 19!
Reason: This is the same problem as the previous question, except that there are only 24 available squares instead of 25.
How many ways can I put 5 distinct bricks on a 5x5 playing board with 0 - 5 bricks allowed in any square?
25 ^ 5
Reason: There are 25 ways to place each of the bricks, so there are 25 * 25 * 25 * 25 * 25 solutions.
How many ways can I put 5 distinct bricks on a 5x5 playing board with 0 - 5 bricks allowed in any square except the top left square?
24 ^ 5
Reason: There are 24 ways to place each of the bricks, so there are 24 * 24 * 24 * 24 * 24 solutions.
How many ways can I put 5 identical bricks on a 5x5 playing board with 0-5 bricks allowed in any square?
(5 + 25 - 1 ) choose 5 = 29 choose 5 = 29! / (5! * 24!)
Read up on multiset coefficients (the "stars and bars" argument) or google "Number of ways to place n balls in m boxes" for a better explanation.
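To double-check the formulas above, here is a small Python sketch that evaluates each of them with the standard library. The function names and the dictionary labels are my own, not part of the original answer, and the brute-force helper is only meant as a sanity check on a tiny board.

```python
from itertools import combinations_with_replacement
from math import comb, perm


def brick_counts(cells=25, bricks=5):
    """Evaluate the formulas from the answer for a board with `cells` squares."""
    free = cells - 1  # squares available when the top-left one must stay empty
    return {
        "identical, at most 1 per square": comb(cells, bricks),
        "identical, at most 1 per square, top-left empty": comb(free, bricks),
        "distinct, at most 1 per square": perm(cells, bricks),
        "distinct, at most 1 per square, top-left empty": perm(free, bricks),
        "distinct, stacking allowed": cells ** bricks,
        "distinct, stacking allowed, top-left empty": free ** bricks,
        "identical, stacking allowed": comb(bricks + cells - 1, bricks),
    }


def brute_force_identical_stacking(cells, bricks):
    """Count multisets of squares directly (only feasible for small inputs)."""
    return sum(1 for _ in combinations_with_replacement(range(cells), bricks))


for label, count in brick_counts().items():
    print(f"{label}: {count}")
```

The brute-force count on a small board (for example 4 cells and 2 bricks) can be compared with `comb(bricks + cells - 1, bricks)` to convince yourself that the stars and bars formula is the right one.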


In HPCC ECL, when running a LOCAL, LOOKUP JOIN, does the RHS dataset get copied to all nodes, or is it kept distributed due to LOCAL?

Say I have a cluster of 400 machines, and 2 datasets. some_dataset_1 has 100M records, some_dataset_2 has 1M. I then run:
Then, I run the join:
Will the distribution of ds2 "mess up" the join, meaning parts of ds2 will be incorrectly scattered across the cluster leading to low match rate?
Or will the LOOKUP keyword take precedence, so that the distributed ds2 gets copied in full to each node, rendering the distribution irrelevant and allowing the join to find all the possible matches (as each node will have a full copy of ds2)?
I know I can test this myself and come to my own conclusion, but I am looking for a definitive answer based on the way the language is written to make sure I understand and can use these options correctly.
For reference (from the Language Reference document v 7.0.0):
LOOKUP: Specifies the rightrecset is a relatively small file of lookup records that can be fully copied to every node.
LOCAL: Specifies the operation is performed on each supercomputer node independently, without requiring interaction with all other nodes to acquire data; the operation maintains the distribution of any previous DISTRIBUTE
It seems that with the LOCAL, the join completes more quickly. There does not seem to be a loss of matches on initial trials. I am working with others to run a more thorough test and will post the results here.
First, your code:
Since you're intending these results to be used in a JOIN, it is imperative that both datasets are distributed on the "same" data, so that the matching values end up on the same nodes so that your JOIN can be done with the LOCAL option. So this will only work correctly if ds1.field_a and ds2.field_b contain the "same" data.
Then, your join code. I assume you've made a typo in this post, because your join code needs to be (to work at all):
Using both LOOKUP and LOCAL options is redundant because a LOOKUP JOIN is implicitly a LOCAL operation. That means your LOOKUP option does "override" the LOCAL in this instance.
So, all that means that you should either do it this way:
Or this way:
Because the LOOKUP option does copy the entire right-hand dataset (in memory) to every node, it makes the JOIN implicitly a LOCAL operation and you do not need to do the DISTRIBUTEs. Which way you choose to do it is up to you.
However, I see from your Language Reference version that you may be unaware of the SMART option on JOIN, which in my current Language Reference (8.10.10) says:
SMART -- Specifies to use an in-memory lookup when possible, but use a
distributed join if the right dataset is large.
So you could just do it this way:
and let the platform figure out which is best.
Thank you, Richard. Yes, I am notorious for typos. I apologize. As I use a lot of legacy code, I have not had a chance to work with the SMART option, but I will certainly keep that in mind for me and the team, so thank you for that!
However, I did run a test to evaluate how the compiler and the platform would handle this scenario. I ran the following code:
sd1:=DATASET(100000,TRANSFORM({unsigned8 num1},SELF.num1 := COUNTER ));
sd2:=DATASET(1000,TRANSFORM({unsigned8 num1, unsigned8 num2},SELF.num1 := COUNTER , SELF.num2 := COUNTER % 10 ));
j11:=JOIN(sd1,sd2,LEFT.num1=RIGHT.num1 ):independent;
j12:=JOIN(sd1,sd2,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j13:=JOIN(sd1,sd2,LEFT.num1=RIGHT.num1, LOCAL):independent;
j21:=JOIN(ds1,ds2,LEFT.num1=RIGHT.num1 ):independent;
j22:=JOIN(ds1,ds2,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j23:=JOIN(ds1,ds2,LEFT.num1=RIGHT.num1, LOCAL):independent;
j31:=JOIN(ds1,ds3,LEFT.num1=RIGHT.num1 ):independent;
j32:=JOIN(ds1,ds3,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j33:=JOIN(ds1,ds3,LEFT.num1=RIGHT.num1, LOCAL):independent;
j41:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1 ):independent;
j42:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j43:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1, LOCAL):independent;
j51:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1 ):independent;
j52:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j53:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1, LOCAL,HASH):independent;
] , {unsigned8 num, string lbl});
On a 400 node cluster, the results come back as:
If you look at the row 12 in the result ( lbl 34 ), you will notice the match rate drops substantially, suggesting the compiler does indeed distribute the file (with the wrong hashed field) and disregard the LOOKUP option.
My conclusion is therefore that as always, it remains the developer's responsibility to ensure the distribution is right ahead of the join REGARDLESS of which join options are being used.
The manual page could be better. LOOKUP by itself is properly documented, and LOCAL by itself is properly documented. However, they represent two different concepts and can be combined without issue, so that JOIN(,,, LOOKUP, LOCAL) makes sense and can be useful.
It is probably best to consider LOOKUP as a specific kind of JOIN matching algorithm and to consider LOCAL as a way to tell the compiler that you are not a novice and that you are absolutely sure the data is already where it needs to be to accomplish what you intend.
For a normal LOOKUP join the LEFT-hand side doesn't need to be sorted or distributed in any particular way and the whole RIGHT-hand side is copied to every slave. No matter what join value appears on the LEFT, if there is a matching value on the RIGHT then it will be found because the whole RIGHT dataset is present.
In a 400-way system with well-distributed join values, IF the LEFT side is distributed on the join value, then the LEFT dataset in each worker only contains 1/400th of the join values and only 1/400th of the values in the RIGHT dataset will ever be matched. Effectively, within each worker, 399/400th of the RIGHT data will be unused.
However, if both the LEFT and RIGHT datasets are distributed on the join value ... and you are not a novice and know that using LOCAL is what you want ... then you can specify a LOOKUP, LOCAL join. The RIGHT data is already where it needs to be. Any join value that appears in the LEFT data will, if the value exists, find a match locally in the RIGHT dataset. As a bonus, the RIGHT data only contains join values that could match ... it is only 1/400th of the LOOKUP only size.
This enables larger LOOKUP joins. Imagine your 400-way system and a 100GB RIGHT dataset that you would like to use in a LOOKUP join. Copying a 100GB dataset to each slave seems unlikely to work. However, if evenly distributed, a LOOKUP, LOCAL join only requires 250MB of RIGHT data per worker ... which seems quite reasonable.
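The behaviour described above can be mimicked with a toy model. The following is plain Python, not ECL, and only a sketch under stated assumptions: `distribute` and `join_matches` are hypothetical helpers, "distribution" is modelled as `key % NODES`, and the row counts mirror the sd1/sd2 test datasets from earlier in the thread.

```python
NODES = 400  # assumed cluster width, as in the question


def distribute(rows, key):
    """Toy stand-in for DISTRIBUTE: send each row to node key(row) % NODES."""
    parts = [[] for _ in range(NODES)]
    for row in rows:
        parts[key(row) % NODES].append(row)
    return parts


def join_matches(left_parts, right_parts):
    """Count matches when every node joins only the rows it holds locally."""
    total = 0
    for left_rows, right_rows in zip(left_parts, right_parts):
        right_keys = {num1 for num1, _num2 in right_rows}
        total += sum(1 for num1 in left_rows if num1 in right_keys)
    return total


left = list(range(1, 100_001))                   # like sd1: num1 = COUNTER
right = [(n, n % 10) for n in range(1, 1_001)]   # like sd2: (num1, num2)

left_parts = distribute(left, key=lambda num1: num1)
right_on_join_key = distribute(right, key=lambda r: r[0])
right_on_other_field = distribute(right, key=lambda r: r[1])

# Plain LOOKUP: every node holds a full copy of the right side.
lookup_only = join_matches(left_parts, [right] * NODES)
# LOOKUP, LOCAL with both sides distributed on the join value.
lookup_local_ok = join_matches(left_parts, right_on_join_key)
# LOOKUP, LOCAL with the right side distributed on a different field.
lookup_local_bad = join_matches(left_parts, right_on_other_field)

print(lookup_only, lookup_local_ok, lookup_local_bad)
print("largest right slice per node:", max(len(p) for p in right_on_join_key))
```

In this model the first two cases find every match, while the mismatched distribution loses most of them, which is the same pattern the test in this thread reported. The last line shows how small each node's share of the right side becomes when it is distributed on the join value.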

Is there a module in Arena Simulation that helps set days off for workers?

Here's the thing. I am trying to simulate a manufacturing plant. Machine A is operated by 5 workers, and statistics show that on average, every 20 working days, 2 workers will be off for 1 day. So how do I set up Arena so that, out of these 5 people, 2 will have a day off? I was thinking about using Failure, but I can't find a function to pick 2 workers at random out of 5.

Spark Geolocated Points Clustering

I have a dataset of points of interest on a map, like the following:
ID latitude longitude
1 48.860294 2.338629
2 48.858093 2.294694
3 48.8581965 2.2937403
4 48.8529717 2.3477134
The goal is to find those clusters of points that are very close to each other (distance less than 100m).
So the output I expect for this dataset would be:
(2, 3)
Points 2 and 3 are very close to each other, with a distance of less than 100m, while the others are far away, so they should be ignored.
Since the dataset is huge with all the points of interest in the world, I need to do it with Spark with some parallel processing.
What approach should I take for this case?
I actually solved this problem using the following 2 approaches:
DBSCAN algorithm implemented as a Spark job with partitioning
GeoSpark with spatial distance join
Both of them are based on Spark, so they work well with large-scale data.
However, I found that dbscan-on-spark consumes a lot of memory, so I ended up using GeoSpark with the distance join.
I would love to do a cross join here; however, that probably won't work since your data is huge.
One approach is to partition the data by region. That means you can change the input data to
ID latitude longitude latitude_int longitude_int group_unique_id
1 48.860294 2.338629 48 2 48_2
2 48.858093 2.294694 48 2 48_2
3 48.8581965 2.2937403 48 2 48_2
4 48.8529717 2.3477134 48 2 48_2
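The assumption here is that if the integer part of the latitude or longitude differs between two points, they are more than 100m apart (points lying close to a cell boundary are an exception and would need separate handling).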
Now you can partition the data w.r.t group_unique_id and then do a cross join per partition.
This will probably reduce the workload.
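As a rough, non-Spark illustration of the grouping idea, here is a plain Python sketch. It buckets points by the integer part of latitude and longitude, then compares each point only with points in its own cell and the 8 surrounding cells, so pairs that straddle a cell boundary are not missed. The function names, the haversine distance and the Earth radius constant are my own choices, not from the answers above, and the sketch ignores wrap-around at the antimeridian and the poles. In Spark, the cell key would play the role of the partition or join key.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, an approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def close_pairs(points, max_dist_m=100.0, cell_deg=1.0):
    """Return sorted (id_a, id_b) pairs closer than max_dist_m, with id_a < id_b."""
    cells = defaultdict(list)
    for pid, lat, lon in points:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        cells[cell].append((pid, lat, lon))

    pairs = set()
    for (ci, cj), members in cells.items():
        # Candidates: the cell itself plus its 8 neighbours.
        candidates = []
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                candidates.extend(cells.get((ci + di, cj + dj), []))
        for a_id, a_lat, a_lon in members:
            for b_id, b_lat, b_lon in candidates:
                if a_id < b_id and haversine_m(a_lat, a_lon, b_lat, b_lon) < max_dist_m:
                    pairs.add((a_id, b_id))
    return sorted(pairs)


points = [
    (1, 48.860294, 2.338629),
    (2, 48.858093, 2.294694),
    (3, 48.8581965, 2.2937403),
    (4, 48.8529717, 2.3477134),
]
print(close_pairs(points))
```

With one-degree cells every sample point lands in the same cell, so a real job would probably want a much smaller `cell_deg`, as long as a cell stays wider than the distance threshold at the latitudes of interest.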

Iterate cluster centers for K means in Python

I have 4 columns of data. For these Xs, I need to pick 3 cluster centers randomly and find the clustering with the least SSE. Why is it that the centers and inertia (SSE) turn out to be the same even with varying random states and init='random'?
kmeans1= KMeans(n_clusters=3, init='random', random_state=101)
On data that is too simple, many different initial seeds will converge to the same result.
Plus, the default of n_init is 10 if I remember correctly, and the best of those runs is reported, so it is enough if just 1 out of the ten runs reaches the same optimum...
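To see the effect of restarts in isolation, here is a dependency-free sketch of Lloyd's algorithm with random initial centers. It is not scikit-learn's implementation; `kmeans_sse`, `best_of` and the toy data are hypothetical names and values chosen for illustration. `best_of` keeps the lowest SSE over `n_init` restarts, which is the role n_init plays in the answer above.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_sse(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm from random data points; returns the final SSE."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def best_of(points, k, n_init, seed):
    """Lowest SSE over n_init restarts drawn from one seeded generator."""
    rng = random.Random(seed)
    return min(kmeans_sse(points, k, rng) for _ in range(n_init))


# Three small, well-separated blobs (toy data).
offsets = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
points = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 10), (20, 0)] for dx, dy in offsets]

for seed in (101, 7, 42):
    print(seed, best_of(points, 3, n_init=1, seed=seed), best_of(points, 3, n_init=10, seed=seed))
```

If the single-run column varies between seeds while the ten-restart column does not, the restarts are what hides the randomness. Setting `n_init=1` on the real `KMeans` is the equivalent experiment.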

Cassandra data modeling timeseries data

I have this data about visited users for an app/service:
contentId (assume uuid),
platform (eg. website, mobile etc),
softwareVersion (eg. sw1.1, sw2.5, ..etc),
regionId (eg. us144, uk123, etc..)
I have modeled it very simply like
id(String) time(Date) count(int)
contentid1-sw1.1 Feb06 30
contentid1-sw2.1 Feb06 20
contentid1-sw1.1 Feb07 10
contentid1-sw2.1 Feb07 10
contentid1-us144 Feb06 23
contentid1-sw1.1-us144 Feb06 10
The reason is that there's a popular query where someone can ask for contentId=foo,platform=bar,regionId=baz or any combination of those for a range of time (say between Jan 01 - Feb 05).
But another query that's not easily answerable is:
Return top K 'platform' for contentId=foo between Jan01 - Feb05. By top it means to be sorted by 'count's in that range. So for above data, query for top 2 platforms for contentId=contentId1 between Feb6-Feb8 must return:
sw1.1 40
sw2.1 30
I am not sure how to model that in C* to get answers for top K queries. Does anyone have any ideas?
PS: there are 1 billion+ entries for each day.
Also I am open to using Spark or any other frameworks along with C* to get these answers.
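To make the expected result of the top-K query concrete, here is a small in-memory Python sketch of the aggregation over the sample rows. It says nothing about how to model this in C*. The function name `top_k`, the year 2024 and the assumption that ids are dash-separated with no dashes inside the content id are all mine (a real uuid would need a different separator).

```python
from collections import Counter
from datetime import date


def top_k(rows, content_id, value_prefix, start, end, k):
    """Sum counts per value over [start, end] and return the k largest totals."""
    totals = Counter()
    for row_id, day, count in rows:
        parts = row_id.split("-")
        # Keep only single-dimension rows such as "contentid1-sw1.1".
        if (
            len(parts) == 2
            and parts[0] == content_id
            and parts[1].startswith(value_prefix)
            and start <= day <= end
        ):
            totals[parts[1]] += count
    return totals.most_common(k)


YEAR = 2024  # hypothetical; the sample data gives only month and day
rows = [
    ("contentid1-sw1.1", date(YEAR, 2, 6), 30),
    ("contentid1-sw2.1", date(YEAR, 2, 6), 20),
    ("contentid1-sw1.1", date(YEAR, 2, 7), 10),
    ("contentid1-sw2.1", date(YEAR, 2, 7), 10),
    ("contentid1-us144", date(YEAR, 2, 6), 23),
    ("contentid1-sw1.1-us144", date(YEAR, 2, 6), 10),
]

print(top_k(rows, "contentid1", "sw", date(YEAR, 2, 6), date(YEAR, 2, 8), k=2))
```

The point of the sketch is that the ranking depends on sums over an arbitrary date range, so it cannot be read off any single pre-aggregated daily row.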
