Say I have a cluster of 400 machines, and 2 datasets. some_dataset_1 has 100M records, some_dataset_2 has 1M. I then run:
ds1:=DISTRIBUTE(some_dataset_1,hash(field_a));
ds2:=DISTRIBUTE(some_dataset_2,hash(field_b));
Then, I run the join:
j1:=JOIN(ds1,ds2,LEFT.field_a=LEFT.field_b,LOOKUP,LOCAL);
Will the distribution of ds2 "mess up" the join, meaning parts of ds2 will be incorrectly scattered across the cluster leading to low match rate?
Or, will the LOOKUP keyword take precedence and the distributed ds2 will get copied in full to each node, thus rendering the distribution irrelevant, and allowing the join to find all the possible matches (as each node will have a full copy of ds2).
I know I can test this myself and come to my own conclusion, but I am looking for a definitive answer based on the way the language is written to make sure I understand and can use these options correctly.
For reference (from the Language Reference document v 7.0.0):
LOOKUP: Specifies the rightrecset is a relatively small file of lookup records that can be fully copied to every node.
LOCAL: Specifies the operation is performed on each supercomputer node independently, without requiring interaction with all other nodes to acquire data; the operation maintains the distribution of any previous DISTRIBUTE
It seems that with the LOCAL, the join completes more quickly. There does not seem to be a loss of matches on initial trials. I am working with others to run a more thorough test and will post the results here.
First, your code:
ds1:=DISTRIBUTE(some_dataset_1,hash(field_a));
ds2:=DISTRIBUTE(some_dataset_2,hash(field_b));
Since you're intending these results to be used in a JOIN, it is imperative that both datasets are distributed on the "same" data, so that the matching values end up on the same nodes so that your JOIN can be done with the LOCAL option. So this will only work correctly if ds1.field_a and ds2.field_b contain the "same" data.
Then, your join code. I assume you've made a typo in this post, because your join code needs to be (to work at all):
j1:=JOIN(ds1,ds2,LEFT.field_a=RIGHT.field_b,LOOKUP,LOCAL);
Using both LOOKUP and LOCAL options is redundant because a LOOKUP JOIN is implicitly a LOCAL operation. That means, your LOOKUP option does "override" the LOCAL in this insatnce.
So, all that means that you should either do it this way:
ds1:=DISTRIBUTE(some_dataset_1,hash(field_a));
ds2:=DISTRIBUTE(some_dataset_2,hash(field_b));
j1:=JOIN(ds1,ds2,LEFT.field_a=RIGHT.field_b,LOCAL);
Or this way:
j1:=JOIN(some_dataset_1,some_dataset_2,LEFT.field_a=RIGHT.field_b,LOOKUP);
Because the LOOKUP option does copy the entire right-hand dataset (in memory) to every node, it makes the JOIN implicitly a LOCAL operation and you do not need to do the DISTRIBUTEs. Which way you choose to do it is up to you.
However, I see from your Language Reference version that you may be unaware of the SMART option on JOIN, which in my current Language Reference (8.10.10) says:
SMART -- Specifies to use an in-memory lookup when possible, but use a
distributed join if the right dataset is large.
So you could just do it this way:
j1:=JOIN(some_dataset_1,some_dataset_2,LEFT.field_a=RIGHT.field_b,SMART);
and let the platform figure out which is best.
HTH,
Richard
Thank you, Richard. Yes, I am notorious for typo's. I apologize. As I use a lot of legacy code, I have not had a chance to work with the SMART option, but I will certainly keep that in mine for me and the team, - so thank you for that!
However, I did run a test to evaluate how the compiler and the platform would handles this scenario. I ran the following code:
sd1:=DATASET(100000,TRANSFORM({unsigned8 num1},SELF.num1 := COUNTER ));
sd2:=DATASET(1000,TRANSFORM({unsigned8 num1, unsigned8 num2},SELF.num1 := COUNTER , SELF.num2 := COUNTER % 10 ));
ds1:=DISTRIBUTE(sd1,hash(num1));
ds4:=DISTRIBUTE(sd1,random());
ds2:=DISTRIBUTE(sd2,hash(num1));
ds3:=DISTRIBUTE(sd2,hash(num2));
j11:=JOIN(sd1,sd2,LEFT.num1=RIGHT.num1 ):independent;
j12:=JOIN(sd1,sd2,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j13:=JOIN(sd1,sd2,LEFT.num1=RIGHT.num1, LOCAL):independent;
j14:=JOIN(sd1,sd2,LEFT.num1=RIGHT.num1,LOOKUP,LOCAL):independent;
j21:=JOIN(ds1,ds2,LEFT.num1=RIGHT.num1 ):independent;
j22:=JOIN(ds1,ds2,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j23:=JOIN(ds1,ds2,LEFT.num1=RIGHT.num1, LOCAL):independent;
j24:=JOIN(ds1,ds2,LEFT.num1=RIGHT.num1,LOOKUP,LOCAL):independent;
j31:=JOIN(ds1,ds3,LEFT.num1=RIGHT.num1 ):independent;
j32:=JOIN(ds1,ds3,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j33:=JOIN(ds1,ds3,LEFT.num1=RIGHT.num1, LOCAL):independent;
j34:=JOIN(ds1,ds3,LEFT.num1=RIGHT.num1,LOOKUP,LOCAL):independent;
j41:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1 ):independent;
j42:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j43:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1, LOCAL):independent;
j44:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1,LOOKUP,LOCAL):independent;
j51:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1 ):independent;
j52:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j53:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1, LOCAL,HASH):independent;
j54:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1,LOOKUP,LOCAL,HASH):independent;
dataset([{count(j11),'11'},{count(j12),'12'},{count(j13),'13'},{count(j14),'14'},
{count(j21),'21'},{count(j22),'22'},{count(j23),'23'},{count(j24),'24'},
{count(j31),'31'},{count(j32),'32'},{count(j33),'33'},{count(j34),'34'},
{count(j31),'41'},{count(j32),'42'},{count(j33),'43'},{count(j44),'44'},
{count(j51),'51'},{count(j52),'52'},{count(j53),'53'},{count(j54),'54'}
] , {unsigned8 num, string lbl});
On a 400 node cluster, the results come back as:
##
num
lbl
1
1000
11
2
1000
12
3
1000
13
4
1000
14
5
1000
21
6
1000
22
7
1000
23
8
1000
24
9
1000
31
10
1000
32
11
12
33
12
12
34
13
1000
41
14
1000
42
15
12
43
16
6
44
17
1000
51
18
1000
52
19
1
53
20
1
54
If you look at the row 12 in the result ( lbl 34 ), you will notice the match rate drops substantially, suggesting the compiler does indeed distribute the file (with the wrong hashed field) and disregard the LOOKUP option.
My conclusion is therefore that as always, it remains the developer's responsibility to ensure the distribution is right ahead of the join REGARDLESS of which join options are being used.
The manual page could be better. LOOKUP by itself is properly documented. and LOCAL by itself is properly documented. However, they represent two different concepts and can be combined without issue so that JOIN(,,, LOOKUP, LOCAL) makes sense and can be useful.
It is probably best to consider LOOKUP as a specific kind of JOIN matching algorithm and to consider LOCAL as a way to tell the compiler that you are not a novice and that you are absolutely sure the data is already where it needs to be to accomplish what you intend.
For a normal LOOKUP join the LEFT-hand side doesn't need to be sorted or distributed in any particular way and the whole RHS-hand side is copied to every slave. No matter what join value appears on the LEFT, if there is a matching value on the RIGHT then it will be found because the whole RIGHT dataset is present.
In a 400-way system with well-distributed join values, IF the LEFT side is distributed on the join value, then the LEFT dataset in each worker only contains 1/400th of the join values and only 1/400th of the values in the RIGHT dataset will ever be matched. Effectively, within each worker, 399/400th of the RIGHT data will be unused.
However, if both the LEFT and RIGHT datasets are distributed on the join value ... and you are not a novice and know that using LOCAL is what you want ... then you can specify a LOOKUP, LOCAL join. The RIGHT data is already where it needs to be. Any join value that appears in the LEFT data will, if the value exists, find a match locally in the RIGHT dataset. As a bonus, the RIGHT data only contains join values that could match ... it is only 1/400th of the LOOKUP only size.
This enables larger LOOKUP joins. Imagine your 400-way system and a 100GB RIGHT dataset that you would like to use in a LOOKUP join. Copying a 100GB dataset to each slave seems unlikely to work. However, if evenly distributed, a LOOKUP, LOCAL join only requires 250MB of RIGHT data per worker ... which seems quite reasonable.
HTH
Use case is as follows
We have list of faces in our system
User will upload one image
We want show list of faces which matches with uploaded image with say >0.8 confidence
Now looking at how to, i understood as follows
Using Face Detect API, We need to first upload all images including image with we want to verify
We can add all faces from our system in one of PersonGroupId
We then need to call Face-Verify API & pass image to verify & PersonGroupId to start comparing
In response we will get all faceId with isIdentical & confidence data ??
Is this is the right way?
After applying filters, our system can have say around 1000-3000 images.
BTW, in given link, it is mentioned that faceid will be expired after 24 hours after detection call :(
We also need to take care of performance in this case, so we are thinking of async call and then will return result somewhere in our system which can be retrieved later on.
What can be the best approach for this?
Pricing
i can see that 1st 30,000 transactions are free (with limitation of 20/m)
Face Storage cost is 16.53/m for 1000 images, does it means that Face-Detect API will store in Azure Blob storage? If yes and still faceId will be deleted after 24 hours ?
Face Storage - Stores images sized up to 4 MB each - whereas Face-Detect says, can store up to 6 MB
I might be missing something here, it would be great if someone can throw lights on it
Let's see the process that you will need to implement.
In the documentation here it says;
Face APIs cover the following categories:
...
FaceList: Used to manage a FaceList for Find Similar.
(Large)PersonGroup: Used to manage a (Large)PersonGroup dataset for Identification.
(Large)PersonGroup Person: Used to manage (Large)PersonGroup Person Faces for Identification.
In your case, it looks like you want to identify faces so you will use PersonGroup with PersonGroup Person items inside.
Step 1 - Generate your list of known faces
Details
So first you need to store your known faces in a group (called PersonGroup or LargePersonGroup given the number of items you have to store), in order to query these items with the image uploaded by your user. It will persist the items, there is no "24hour limit" with those groups.
If you want to understand the differences between "normal" and "large-scale" groups, see reference here: there are some differences that you must consider, in particular regarding the training process.
So let's use a normal PersonGroup, not large. Please note that the amount of items depend on your subscription:
Free-tier subscription quota: 1,000 person groups. Each holds up to 1,000 persons.
S0-tier subscription quota: 1,000,000 person groups. Each holds up to 10,000 persons.
Actions
Please also note that here I'm pointing to the API operations but all these actions can be performed in any language with those API calls, but also directly with the provided SDK for some languages (see list here)
Create a PersonGroup with PersonGroup - Create operation. You will specify a personGroupId in your request, that you will use below
Then for each person of your known faces:
Create a Person with PersonGroup Person - Create operation, giving the previous personGroupId in the request. You will got a personId guid value as a result, like "25985303-c537-4467-b41d-bdb45cd95ca1"
Add Faces of this user to its newly created Person by calling PersonGroup Person - Add Face operation and providing personGroupId, personId, additional optional information in the request and your image url in the body.
Note that for this operation:
Valid image size is from 1KB to 4MB. Only one face is allowed per
image.
Finally, once you have added your persons with their faces:
Call PersonGroup - Train operation
Check the training status with PersonGroup - Get Training Status operation
Then you are ready to identify people based on this group!
Step 2 - Search this FaceId inside your known faces
Easy, just 2 actions here:
Call Face - Detect operation to find faces inside your image. The result will be an array of item containing faceId and other attributes
If you have detected faces, call Face - Identify operation with the following parameters:
faceId, which is the value from the detect operation
personGroupId: the Id of the group you have created during step 1
confidenceThreshold: your confidence threshold, like 0.8
maxNumOfCandidatesReturned: Number of candidates returned (between 1 and 100, default is 10)
Request sample:
{
"personGroupId": "sample_group",
"faceIds": [
"c5c24a82-6845-4031-9d5d-978df9175426",
"65d083d4-9447-47d1-af30-b626144bf0fb"
],
"maxNumOfCandidatesReturned": 1,
"confidenceThreshold": 0.8
}
Other questions
Face Storage cost is 16.53/m for 1000 images, does it means that
Face-Detect API will store in Azure Blob storage? If yes and still
faceId will be deleted after 24 hours ?
Face-Detect API is not storing the image. The storage cost is about using PersonGroup or FaceLists
Face Storage - Stores images sized up to 4 MB each - whereas
Face-Detect says, can store up to 6 MB
As said, storage is about persisting faces like when you use PersonGroup Person - Add Face, where the limit is 4MB, not 6
I have created an MVP for a nodejs project, following are some of the features that are relevant to the question I am about to ask:
1-The application has a list of IP addresses with CRUD actions.
2-The application will ping each IP address after every 5 seconds.
3- And display against each IP address it's status i.e alive or dead and the uptime if alive
I created a working MVP on nodejs with the help of the library net-ping, express, mongo and angular. Now I have a new feature request that is:
"to calculate the round trip time(latency) for each ping that is generated for each IP address and populate a bar chart or any type of chart that will display the RTT(latency) history(1 months-1 year) of every connection"
I need to store the response of each ping in the database, Assuming the best case that if each document that I will store is of size 0.5 kb, that will make 9.5MB data to be stored in each day,285MB in each month and 3.4GB in a year for a single IP address and I am going to have 100-200 IP addresses in my application.
What is the best solution (including those which are paid) that will suit the best for my requirements considering the app can scale more?
Time series data require special treatment from a database perspective as they introduce challenges to the traditional database management from capacity, query performance, read/write optimisation targets, etc.
I wouldn't recommend you store this data in a traditional RDBMS, or object/document database.
Best option is to use a specialised time-series database engine, like InfluxDB, that can support downsampling (aggregation) and raw data retention rules
So I changed The schema design for the Time-series data after reading this and that reduced the numbers in my calculation of size massively
previous Schema looked like this:
{
timestamp: ISODate("2013-10-10T23:06:37.000Z"),
type: "Latency",
value: 1000000
},
{
timestamp: ISODate("2013-10-10T23:06:38.000Z"),
type: "Latency",
value: 15000000
}
Size of each document: 0.22kb
number of document created in an hour= 720
size of data generated in an hour=0.22*720 = 158.4kb
size of data generated by one IP address in a day= 158 *24 = 3.7MB
Since every next time_Stamp is just the increment of 5 seconds from the previous one, the schema can be optimized to cut the redundant data.
The new schema looks like this :
{
timestamp_hour: ISODate("2013-10-10T23:06:00.000Z"),// will contain hours
type: “Latency”,
values: {//will contain data for all pings in the specific hour
0: 999999,
…
37: 1000000,
38: 1500000,
…
720: 2000000
}
}
Size of each document: 0.5kb
number of document created in an hour= 1
size of data generated in an hour= 0.5kb
size of data generated by one IP address in a day= 0.5 *24 = 12kb
So I Am assuming the size of the data will not be an issue anymore, and I although there is a debate for what type of storage should be used in such scenarios to ensure best performance but I am going to trust mongoDB in my case.
I would like to process/update every document in a Mongodb collection periodically (every 5 mins or so) and save the results back to the DB. The update function requires actual code to execute on each document (as far as I know) because it needs to perform computations such as taking the difference in timestamps and taking exponents with Math.pow, which the standard MongoDB update operators do not cover.
What is the best way to do this in NodeJS?
Full context: I am trying to implement the Hacker News ranking algorithm, which is time-dependent. The discussion I've seen around this involves using a separate thread/process to periodically update the scores on documents.
without wasting back and forth investigation it seems you have fields that i will call points, time of initial creation created_date and, then the ycombinator result of (p - 1) / (t + 2)^1.5
the easiest is to write a very simple 3 liner mongo shell script.
db.ycombinator.find().forEach(function(doc) {
var diff = ISODate() - doc.created_date; // subtract date using some form of date ISODate is available in mongo shell
var hours = diff.tomagicalhours(); // some regulr javascript
var result = (doc.points - 1) / Math.pow((hours + 2), 1.5); // perform yc algo
db.ycombinator.update({"_id":doc._id}, {$set:{"result": result} }); // write back into same collection and field, result
})
that goes into a file ycombinator_update.js and then do a 5 minute crontab.
*/5 * * * * mongo ycombinator_update.js
the performance of your reads will be noticeably slower during the writes operation contingent on the number of records in that collection.
you could assign scores based on the document timestamp at lookup time, and only keep the raw timestamps in the database. Since the score is a function of the timestamp anyway, the scoring algorithm can incorporate the exponential decay logic on the unmodified data. Scores can be converted to timestamps if to search by score.
Another option that isn't represented here is the MongoDB MapReduce or Aggregation frameworks.
Both these frameworks provide a way to iterate over all elements in a collection and output some results into a different collection. The aggregation API does not directly include the primitives we need to calculate the 1.5 exponent in the HN algorithm (no $sqrt or $pow), but there is a workaround.
I'm not certain at this point which approach is the most performant for this use case (and how it compares to the MongoDB shell script suggested by Gabe Rainbow).
I believe the next step is to run the update operations in a separate process, which is either scheduled with something like cron, or it could be kicked off via the node app itself using fork with the following logic:
On request for front page:
# when did we last update the scores for the front page?
if last_update was within last X minutes:
return list sorted by score right away
else
fork a process to sort the front page
last_update := Date.Now
return list sorted by score (either right away [stale], or after the update completes [takes a while])