I have large amount of data that consists of users who visit web sites. I have time stamp for each visit. Using the http://jexp.de/blog/2012/10/parallel-batch-inserter-with-neo4j/ script, I created a graph that has a separate path for each page
U1-->T1-->P1
|
--->T2-->P2
etc.
Now, I would want to have the following structure:
U1->T1->P1->T2->P2...
Obviously, each user visits different number of pages. I have the file that looks like this:
person,time,place
U1,t1,P1
U1,t2,P2
U1,t3,P3
U2,t4,P1
U2,t5,P6
each user sequence is ordered by visit time, so t1about me->blog etc.
Is the above structure U1->T1->P1->T2->P2 a good approach? (I have around 30 million entries)
I need to modify the groovy script so that it can automatically add relationships and nodes in the same sequence. I was thinking to keep the previous user id in memory and if new user id=old id, then I will add only relationship and place. Otherwise, I will create a new user and build new path.
I assume that your nodes are labeled U for users, T for timestamps, and P for pages.
You do not need timestamp nodes. You can, instead, put the timestamp value in the relationship between a U and a P. This will greatly reduce the number of nodes and relationships.
For example, instead of this (I am making up the relationship
types):
(:U)-[:VISITED_AT]->(:T {timestamp: 123})-[:PAGE]->(:P)
you can use this, which saves you 1 node and 1 relationship per visit:
(:U)-[:VISITED {timestamp: 123}]->(:P)
What you describe seems reasonable, BUT you could create multiple nodes for the same page (e.g., P1 in your example file, since it appears twice), whereas you really want to have one node per page. Also, if the file were to contain another U1 row after the U2 rows, you'd create a second U1 node. To prevent such duplication, you should use MERGE instead of CREATE for your U and P nodes. MERGE will create a node only if it does not already exist, else it just returns the existing node. Once you have the nodes, you can go ahead and CREATE the relationship (with the timestamp as a property) linking them together.
Related
I want to create a new node (event nodes) among a set of nodes (report nodes) according to the indicator nodes (each report node has several indicator nodes related to it). I want to set the new event nodes with the rules:
a report nodes is only connected one event node
if more than one indicator nodes has the same property "pattern", then they belong to the same event node
here are my query code :
OPTIONAL MATCH
(indicator_1_1:indicator)<-[:REFERS_TO]-(report_1:report)-[:REFERS_TO]->(indicator_1_2:indicator),
(indicator_2_1:indicator)<-[:REFERS_TO]-(report_2:report)-[:REFERS_TO]->(indicator_2_2:indicator)
WHERE
indicator_1_1.pattern=indicator_2_1.pattern
and
indicator_1_2.pattern=indicator_2_2.pattern
MERGE
(report_1)-[:related_to]->(event:EVENT)<-[:related_to]-(report_2)
and get the result as below,
But i want the three report nodes belong to one event node.
I want to know what changes should I make to my query ,or what next step should I take after getting the two event nodes.
What's more , I want to know wheter there is a more efficient query code than mine.
Thanks!
I don't have any data to confirm, but I think a small change to your Cypher query will produce what you want.
From the Neo4j Cypher Manual chapter on MERGE (my emphasis added).
When using MERGE on full patterns, the behavior is that either the
whole pattern matches, or the whole pattern is created. MERGE will
not partially use existing patterns — it’s all or nothing. If
partial matches are needed, this can be accomplished by splitting a
pattern up into multiple MERGE clauses.
So, following this, I think if you change
MERGE (report_1)-[:related_to]->(event:EVENT)<-[:related_to]-(report_2)
to
MERGE (report_1)-[:related_to]->(event:EVENT)
MERGE (event)<-[:related_to]-(report_2)
... you will prevent the extra :EVENT nodes from being created and get the graph you are looking for.
Finally, I find the answer. My solution is merge the :event node ,and then the relaionships
step 1 : merge the :event nodes
MATCH ()-[r_1:related_to]->(event_1:EVENT)<-[r_2:related_to]-()-[r_3:related_to]->(event_2:EVENT)<-[r_4:related_to]-()
call apoc.refactor.mergeNodes([event_1,event_2]) YIELD node
RETURN node
step 2 : merge the dupicate relationships
MATCH (X)-[r]-(Y)
WITH X,Y, TAIL (collect(r)) as rr
FOREACH (r IN rr | DELETE r)
I've got a database that looks like this:
I want to reset the Patient_List and Current_Token children to 0 at a particular time (say 12 AM) automatically for all the users in the database.
I went through the Firebase cloud function docs and these projects on gitHub:
Deleting nodes and Delete unused user accounts using Cron.
Both of the above examples update and change data based on a particular UID i.e update only particular nodes based on the UID passed.
I would like to know if it is possible to update values for the above child elements in all the nodes and if so , how would it be done?
It's not possible to have a single update statement affect a child of all database nodes under some location. You will have to iterate all the nodes (by querying/reading them somehow), then update each one that you find. This is potentially a costly operation in terns of bandwidth.
Only use the general ref, and before that reference all nodes that you want to update, just like this:
var gralRef = firebase.database().ref();
var obj_update={};
obj_update[`users/${firstuser}/name`]='John';
obj_update[`users/${seconduser}/name`]='Mark';
obj_update[`users/${seconduser}/address`]='New York';
gralRef.update(obj_update);
I'm working with Azure Table (storage) in order to store information about websites I'm working with. So, I planned this structure:
Partition Key - domain name
Row key - Webpage address
Valid until (date time) - after this date, the record will be deleted.
Other crucial data here...
Those columns will be stored in a table called as the website address (e.g. "cnn.com").
I have two main use case (high to low):
1. Check if URL "x" is in the table - find by combination of Partition Key and Row Key - very efficient.
2. Delete old data - remove all expired data (according to "Valid until" column). This operation is taking place every mid-night and possibly delete millions of row - very heavy.
So, our first task (check if URL exists) is implemented in efficient way with this data model. The second task, not. I want to avoid batch deletion.
I also worry about making "hot-spots", which will make me low performance. This because the Partition Key. I expect that in some hours, I will query more question for specific domain. This will make this partition hotspot and hit my performance. In order to avoid this, I thought to use hash-function (on the URL) and the result will be the "partition key". Is this good idea?
I also thought about other implementation way and it's looks like they have some problems:
Storing the rows in table that named with the deletion date (e.g. "cnn.com-1-1-2016"). This provide us great deleting performance. But, bad searching experience (the row can be exists in more then one table. e.g. "cnn.com-1-1-2016" or "cnn.com-2-1-2016"...).
What is the right solution for my problem?
Have you seen the Azure Table Storage Design Guide? It describes principles and patterns for designing tables solutions at scale. For hot spots take a look at the prepend / append anti-pattern for some extra information. This is where all your operations occur within a single partition which prevents additional resources from being added. For these types of scenarios you will get better scale if you can distribute the operations across partitions instead.
Let's assume you have a site https://www.yahoo.com/news/death-omar-al-shishani-could-mean-war-against-203132664.html?nhp=1. You can keep PK as domainName + "/news/" + 2 letters of page address, summary https://www.yahoo.com/news/de. RK - other part of the full address. This will split your domain partition on near 1000 partitions. If that's not enough - use 3 first letter in PK.
Remove obsolete data every 15 minutes (create a separate service for it). Your millions will became just tens of thousands. Or keep less data (2 weeks instead of month for.ex.). And do not forget optimize deletion (get PK and RK only, update ETag to "*", remove as DynamicTableEntity, batch if possible).
So here is the problem. I need to update about 40 million entities in an azure table. Doing this with a single instance (select -> delete original -> insert with new partitionkey) will take until about Christmas.
My thought is use an azure worker role with many instances running. The problem here is the query grabs the top 1000 records. That's fine with one instance but with 20 running their selects will obviously overlap.. a lot. this would result in a lot of wasted compute trying to delete records that were already deleted by another instance and updating a record that has already been updated.
I've run through a few ideas, but the best option I have is to have the roles fill up a queue with partition and row keys then have the workers dequeue and do the actual processing?
Any better ideas?
Very interesting question!!! Extending #Brian Reischl's answer (and a lot of it is thinking out loud, so please bear with me :))
Assumptions:
Your entities are serializable in some shape or form. I would assume that you'll get raw data in XML format.
You have one separate worker role which is doing all the reading of entities.
You know how many worker roles would be needed to write modified entities. For the sake of argument, let's assume it is 20 as you mentioned.
Possible Solution:
First you will create 20 blob containers. Let's name them container-00, container-01, ... container-19.
Then you start reading entities - 1000 at a time. Since you're getting raw data in XML format out of table storage, you create an XML file and store those 1000 entities in container-00. You fetch next set of entities and save them in XML format in container-01 and so on and so forth till the time you hit container-19. Then the next set of entities go into container-00. This way you're evenly distributing your entities across all the 20 containers.
Once all the entities are written, your worker role for processing these entities would come into picture. Since we know that instances in Windows Azure are sequentially ordered, you get instance names like WorkerRole_IN_0, WorkerRole_IN_1, ... and so on.
What you would do is take the instance name, get the number "0", "1" etc. Based on this you would determine which worker role instance will read from which blob container...WorkerRole_IN_0 will read files from container-00, WorkerRole_IN_1 will read files from container-01 and so on.
Now your individual worker role instance will read the XML file, create the entities from that XML file, update those entities and save it back into table storage. Once this process is done, you would then delete the XML file and you move on to next file in that container. Once all files are read and processed, you can just delete the container.
As I said earlier, this is a lot "thinking out loud" kind of solution and some things must be considered like what happens when "reader" worker role goes down and other things.
If your PartitionKeys and/or RowKeys fall into a known range, you could attempt to divide them into disjoint sets of roughly equal size for each worker to handle. eg, Worker1 handles keys starting with 'A' through 'C', Worker2 handles keys starting with 'D' through 'F', etc.
If that's not feasible, then your queuing solution would probably work. But again, I would suggest that each queue message represent a range of keys if possible. eg, a single queue message specifies deleting everything in the range 'A' through 'C', or something like that.
In any case, if you have multiple entities in the same PartitionKey then use batch transactions to your advantage for both inserting and deleting. That could cut down the number of transactions by almost a factor of ten in the best case. You should also use parallelism within each worker role. Ideally use the async methods (either Begin/End or *Async) to do the writing, and run several transactions (12 is probably a good number) in parallel. You can also run multiple threads, but that's somewhat less efficient. In either case, a single worker can push a lot of transactions with table storage.
As a side note, your process should go "Select -> Insert New -> Delete Old". Going "Select -> Delete Old -> Insert New" could result in permanent data loss if a failure occurs between steps 2 & 3.
I think you should mark your question as the answer ;) I cant think of a better solution since I don't know what your partition and row keys look like. But to enhance your solution, you may choose to pump multiple partition/row keys into each queue message to save on transaction cost. Also when consuming from the queue, get them in batches of 32. Process asynchronously. I was able to transfer 170 million records from SQL server (Azure) to Table storage in less than a day.
I am just stuck in a design problem. I want to assign ranks to user records in a table. They do some action on the site and given a rank on basis of leader board. And the select I want on them could be on Top 10, User's position, Top 10 logged in today etc.
I just can not find a way to store it in Azure table. Than I thought about storing custom collection object (a sorted list) in blob.
Any suggestions?
Table entities are sorted by PartitionKey, RowKey. While you could continually delete and recreate users (thus allowing you to change the PK, RK) to give the correct order, it seems like a bad idea or at least overkill. Instead, I would probably store the data that you use the compute the rankings and periodically compute and store the rankings (as you say). We do this a lot in our work - pre-compute what the data should look like in JSON view, store it in a blob, and let the UI query it directly. The trick is to decide when to re-compute the view. After a user does an item that would cause the rankings to be re-computed, I would probably queue a message and let a worker process go and re-compute the view. This prevents too many workers from trying to update the data at once.