How to run batch transactions in Neo4j - Node.js

We are working on a recommendation engine use case at the production level.
We had a use case to fetch products sorted by geography (nearest first), so we thought of using the spatial functions distance and point to sort them.
For that, we need latitude and longitude properties on the Product nodes, which we don't have yet. We do have a postal address, so we figured we'd call apoc.spatial.geocodeOnce, fetch the coordinates, and set latitude and longitude properties on all the Product nodes.
The problem is that this is becoming an expensive operation. We have around 5,000 product nodes, and it takes on average 1,000 ms to update each node; by that calculation, on a single core, it would take around 90 minutes to update all of them. We are really curious to know if there's a smarter way to handle the transaction in chunks (say, updating 500 products in one tx, then the next 500, and so on). We thought apoc.periodic.iterate was the way to go, but we are looking for suggestions on how to solve this problem efficiently.
P.S. - When we tried a few apoc.spatial.geocodeOnce calls against a couple of products with the postal addresses we have in our DB, some calls returned no result. What could be the possible reasons for this? (Maybe we don't have standardized postal addresses for those products? If so, how should we address the problem - should we use the Google Geocoding API for those products, or are there smarter options built into Neo4j?)
Our query below for reference:
CALL apoc.periodic.iterate(
"MATCH (p:Product) return p",
"CALL apoc.spatial.geocodeOnce(p.postal_address) YIELD location SET p.latitude=location.latitude, p.longitude=location.longitude",
{batchSize:500, iterateList:true, parallel:true}
)
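If you prefer to keep the batching on the client side instead of inside apoc.periodic.iterate, the idea reduces to splitting the product list into chunks and running one transaction per chunk. Here is a minimal sketch of that pattern (shown in Python for brevity; the BATCH_UPDATE Cypher string and the run_tx callable are illustrative placeholders for whatever your driver's transaction function looks like):

```python
# Sketch of client-side batching, assuming the product ids were fetched up front.

def chunked(items, size):
    """Yield successive fixed-size chunks of `items`."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Hypothetical batch query: geocode and update one chunk of products per tx.
BATCH_UPDATE = """
UNWIND $ids AS id
MATCH (p:Product) WHERE id(p) = id
CALL apoc.spatial.geocodeOnce(p.postal_address) YIELD location
SET p.latitude = location.latitude, p.longitude = location.longitude
"""

def update_in_batches(product_ids, run_tx, batch_size=500):
    """Run one transaction per chunk; returns the number of transactions used."""
    tx_count = 0
    for batch in chunked(product_ids, batch_size):
        run_tx(BATCH_UPDATE, ids=batch)  # one transaction per 500 products
        tx_count += 1
    return tx_count
```

With 5,000 products and a batch size of 500 this commits 10 transactions, so a failure mid-way only loses the current chunk.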

It's due to throttling of the apoc.spatial.geocodeOnce API calls: there's a rate limiter on the number of requests you can make to the geocoding server.
You can get a Google Geocoding API key and configure it in the apoc.conf file like so:
apoc.spatial.geocode.provider=google
apoc.spatial.geocode.google.throttle=1
apoc.spatial.geocode.google.key={YOUR_API_KEY}
Note - the throttle value here is in milliseconds.

Related

Best way to Fetch N rows in ScyllaDB

I have a use case in which I utilize ScyllaDB to limit users' actions in the past 24h. Let's say a user is only allowed to place an order 3 times in the last 24h. I am using ScyllaDB's TTL and counting the number of records in the table to achieve this. I am also using https://github.com/spaolacci/murmur3 to get the hash for the partition key.
However, I would like to know the most efficient way to query the table. I have a few queries whose behavior I'd like to understand better and compare (please correct me if any of my statements are wrong):
using count()
count() will perform a full-scan query, meaning it may read more records from the table than necessary.
SELECT COUNT(1) FROM orders WHERE hash_id=? AND user_id=?;
using limit
limit only limits the number of records returned to the client, meaning it will still query all records that match its predicates but only return up to the limit.
SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?;
using paging
I'm a bit new to this, but if I read the docs correctly it should only read until it has received the first N records, without having to scan the whole table. So if I set the page size to the number of records I want to fetch and only query the first page, would it work correctly? And will it give a consistent result?
docs: https://java-driver.docs.scylladb.com/stable/manual/core/paging/index.html
my query is still using limit, but utilizing the driver to achieve this with https://github.com/gocql/gocql
iter := conn.Query(
    "SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?",
    hashID,
    userID,
    3,
).PageSize(3).PageState(nil).Iter()
Please let me know if my analysis is correct and which method would be best to choose.
Your client should always use paging - otherwise you risk adding pressure to the query coordinator, which may introduce latency and memory fragmentation. If you use the Scylla Monitoring stack (and you should if you don't!), refer to the CQL Optimization dashboard and - more specifically - to the Paged Queries panel.
Now, to your question. It seems that your example is a bit minimalist for what you actually want to achieve, and - even if it isn't - we have to consider such a setup at scale. E.g.: there may be a tenant allowed to place 3 orders within a day, but another tenant allowed to place 1 million orders within a week.
If the above assumption is correct - and given the options at hand - you are better off using LIMIT with paging. The reason is that there are some particular problems with the description you've given:
First, you want to retrieve N records within a particular time frame, but your queries don't specify such a time frame.
Second, either COUNT or LIMIT will initiate a partition scan, and it is not clear how a hash_id + user_id combination can determine the number of records within a time frame.
Of course, I may be wrong, but I'd like to suggest some different approaches which may or may not be applicable to your use case.
Consider a timestamp component part of the clustering key. This will allow you to avoid full partition scans, with queries such as:
SELECT something FROM orders WHERE hash_id=? AND user_id=? AND ts >= ? AND ts < ?;
If the above is not applicable, then perhaps a Counter Table would suit your needs. You could simply increment a counter after an order is placed, and afterwards query the counter table, as in:
SELECT count FROM counter_table WHERE hash_id=? AND user_id=? AND date=?;
I hope that helps!
I have a few points I want to add to what Felipe wrote already:
First, you don't need to hash the partition key yourself. You can use anything you want for the partition key, even consecutive numbers, the partition key doesn't need to be random-looking. Scylla will internally hash the partition key on its own to improve the load balancing. You don't need to know or care which hashing algorithm ScyllaDB uses, but interestingly, it's a variant of murmur3 too (which is not identical to the one you used - it's a modified algorithm originally picked by the Cassandra developers).
Second, you should know - and decide whether you care - that the limit you are trying to enforce is not a hard limit in the face of concurrent operations. Imagine that the given partition already has two records, and now two concurrent record-addition requests come in. Both can check that there are just two records, decide it's fine to add a third - and when both add their record, you end up with four records. You'll need to decide whether it's fine that a lucky user can get in 4 requests in a day, or whether it's a disaster. Note that theoretically you can get even more than 4: if the user manages to send N requests at exactly the same time, they may be able to get 2+N records into the database (though in the usual case they won't manage to get many superfluous records). If you want 3 to be a hard limit, you'll probably need to change your solution - perhaps to one based on LWT (lightweight transactions) rather than TTL.
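The race above can be replayed deterministically: if every client's read happens before any client's write, each one sees a count below the limit and all of them insert. A toy sketch of that interleaving (plain Python, no database involved):

```python
# Deterministic replay of the read-check-write race: two clients both read
# the record count before either writes, so the check passes for both.

LIMIT = 3

def interleaved_inserts(store, clients):
    """All reads happen first, then all writes; returns the final count."""
    observed = [len(store) for _ in range(clients)]  # every client reads now
    for count in observed:
        if count < LIMIT:          # check passes for every client...
            store.append("order")  # ...so every client inserts
    return len(store)

records = ["order", "order"]       # partition already holds two records
final = interleaved_inserts(records, clients=2)
# final is 4: the "3 per day" limit was exceeded
```

A serializing mechanism such as LWT closes this window by making the check and the write a single conditional operation.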
Third, I want to note that there is no important performance difference between COUNT and LIMIT when you know a priori that there will only be up to 3 (or, as explained above, 4 or some other similarly small number of) results. If the SELECT can only yield three or fewer results, and never a thousand, it doesn't really matter whether you retrieve them or count them - just do whichever is more convenient for you. In any case, I think paging is not a good fit for your need: for such short results you can just use the default page size and you'll never reach it anyway. Paging also hints to the server that you will likely continue reading the next page - so it caches the buffers it needs to do that - while in this case you know you'll never continue past the first three results. In short, don't use any special paging setup here; just use the default page size (which is 1 MB) and it will never be reached anyway.

How to create Accumulo Embedded Index with Rounds strategy?

I am a beginner in Accumulo and using Accumulo 1.7.2.
As an indexing strategy, I am planning to use the Embedded Index with Rounds strategy (http://accumulosummit.com/program/talks/accumulo-table-designs/, page 21), but I couldn't find any documentation for it. I am wondering if any of you could help me here.
My description of that strategy was mostly just to avoid sending a query to all the servers at once by simply querying one portion of the table at a time. Adding rounds to an existing 'embedded index' example might be the easiest place to start.
The Accumulo O'Reilly book includes an example that starts on page 284 in a section called 'Index Partitioned by Document' whose code lives here: https://github.com/accumulobook/examples/tree/master/src/main/java/com/accumulobook/designs/multitermindex
The query portion of that example is in the class WikipediaQueryMultiterm.java. It uses a BatchScanner configured with a single empty range to send the query to all tablet servers. To implement the by-rounds query strategy this could be replaced with something that goes from one tablet server to the next, either in a round-robin fashion, or perhaps going to 1, then if not enough results are found, going to the next 2, then 4 and so on, to mimic what Cassandra does.
Since you can't target servers directly with a query and since the table is using some partitioning IDs you could configure your scanners to scan all the key values within the first partition ID, then querying the next partition ID and so on, or perhaps visiting the partitions in random order to avoid congestion.
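The "1, then 2, then 4" widening described above boils down to a doubling loop over partition IDs that stops as soon as enough results are collected. A minimal sketch of that round logic (Python for illustration; scan_partition is a hypothetical stand-in for a scanner restricted to one partition ID's range):

```python
# Sketch of the by-rounds query strategy: scan 1 partition, then 2 more,
# then 4, doubling each round until enough results are found or the
# partitions run out.

def query_in_rounds(partition_ids, scan_partition, wanted):
    """Return up to `wanted` results, widening the scan round by round."""
    results, i, round_size = [], 0, 1
    while i < len(partition_ids) and len(results) < wanted:
        for pid in partition_ids[i:i + round_size]:
            results.extend(scan_partition(pid))
        i += round_size
        round_size *= 2            # 1, 2, 4, 8, ... partitions per round
    return results[:wanted]
```

In the Accumulo example this would replace the single empty range given to the BatchScanner with one range per partition ID visited in the current round.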
What some others have mentioned, adding additional indexes to help narrow the search space before sending a query to multiple servers hosting an embedded index, is beyond the scope of what I described and is a strategy that I believe is employed by the recently released DataWave project: https://github.com/NationalSecurityAgency/datawave

azure search lookup document count as query

If I look up a document to retrieve data from Azure Search, does that count against the queries per second per index indicated here?
I want to know if I can use Azure Search to host some data and access it without affecting search performance.
Thanks
Yes, a lookup is considered a query. Please note that we do not throttle your queries, and the number listed on the page you point to is only meant as a very rough indication of what a single search unit with an "average" index and an "average" set of queries could handle. In some cases (for example, if you were just doing lookups, which are very simple queries), you might well get more than 15 QPS with very good latency. In other cases (for example, if your queries have a huge number of facets), you might get less. Also note that although we do not throttle you, it is possible to exceed the resources of the units allocated to you, at which point you will start to receive throttling HTTP responses.
In general, the best thing to do is track the latency of your queries. If you start seeing latency go higher than what you find acceptable, that is typically a good time to consider adding another replica.
Ultimately, the only way to know for sure is to test your specific index with the types of queries and load you expect.
I hope that helps.
Liam

Cassandra count use case

I'm trying to figure out an appropriate use case for Cassandra's counter functionality. I thought of a situation and I was wondering if it would be feasible. I'm not quite sure, because I'm still experimenting with Cassandra, so any advice would be appreciated.
Let's say you had a small video service: you record the log of views in Cassandra, recording which video was played, which user played it, country, referrer, etc. You obviously want to show a count of how many times each video was played. Would incrementing a counter every time you insert a play event be a good solution, or would there be a better alternative? Counting all the events on every read would take a pretty big performance hit, and even if you cached the results, the cache would be invalidated pretty quickly on a busy site.
Any advice would be appreciated!
Counters can be used for whatever you need to count within an application -- both "frontend" and "backend" data. I personally use them to store user-behaviour information (for backend analysis) and frontend ratings (each operation a user does in my platform gives the user some points). There is no real limitation on the use case -- the limitations are technical; the biggest that come to mind:
a counter column family can contain only counter columns (except the PK, obviously)
counters can't be reset: to set a counter to 0 you need to read it and calculate the decrement before writing (with no guarantee that someone else hasn't updated it in the meantime)
no TTL and no indexing/deletion
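The "reset" limitation above amounts to reading the current value and then incrementing by its negative (in CQL, a SELECT followed by UPDATE ... SET c = c - ?), with a race window between the two steps. A toy sketch of the pattern, using a plain dict in place of a counter table:

```python
# "Resetting" a counter via read-then-negative-increment. Another writer can
# increment between the read and the write, so the result is not guaranteed
# to be exactly 0 under concurrency.

def reset_counter(counters, key):
    current = counters.get(key, 0)                      # 1. read the value
    counters[key] = counters.get(key, 0) + (-current)   # 2. increment by -value
    return counters[key]

scores = {"video:42": 17}
reset_counter(scores, "video:42")
```

If another client incremented "video:42" between steps 1 and 2, the final value would be that increment rather than 0, which is exactly the guarantee gap the answer describes.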
As far as your video service goes, it all depends on how you choose to model the data -- if you find a valid model that hits few partitions on each write/read and you have a good key distribution, I don't see any real problem with the implementation.
BTW: you tagged Cassandra 2.0, but if you have to use counters you should consider 2.1, for the reasons described here.

CloudTable.ExecuteQuery : how to get number of transactions

AFAIK, ExecuteQuery handles segmented queries internally, and every request (= call to storage) for a segment counts as a transaction against storage (I mean a billing transaction - http://www.windowsazure.com/en-us/pricing/details/storage/). Right?
Is there a way to find out into how many segments a query was split? If I run a query and get 5,000 items, I can assume the query was split into 5 segments (due to the limit of 1,000 items per segment). But for a complex query there is also the timeout of 5 seconds per call.
I don't believe there's a way to get at that in the API. You could set up an HTTPS proxy to log the requests, if you just want to check it in development.
If it's really important that you know, use the BeginExecuteSegmented and EndExecuteSegmented calls instead. Your code will get a callback for each segment, so you can easily track how many calls there are.
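However you drive the segmented calls, the pattern is a loop over continuation tokens where each iteration is one billable request. A language-neutral sketch of counting segments that way (Python for illustration; fetch_segment is a hypothetical stand-in for an ExecuteQuerySegmented-style call returning a page of items plus the next continuation token, or None when done):

```python
# Count how many segment requests a query needs by looping over
# continuation tokens until the service returns no further token.

def count_segments(fetch_segment):
    """Returns (number_of_segments, all_items)."""
    token, segments, items = None, 0, []
    while True:
        page, token = fetch_segment(token)  # one billable transaction
        items.extend(page)
        segments += 1
        if token is None:                   # no continuation: query finished
            break
    return segments, items
```

Note that a segment can return fewer than 1,000 items (e.g. when the 5-second server timeout is hit), so the segment count is not simply item_count / 1000.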
