I am a newbie to operating systems, and every answer I've found on Stack Overflow is so complicated that I am unable to understand it. Can someone provide a newbie-friendly explanation of what an atomic operation is?
My understanding: an atomic operation executes fully, with no interruption. I.e., it is a blocking operation with no scope for interruption?
Pretty much, yes. "Atom" comes from the Greek "atomos" = "uncuttable", and has been used in the sense of "indivisible smallest unit" for a very long time (until physicists found that, in fact, there are smaller things than atoms). In concurrent programming, it means that the operation cannot be interrupted or interleaved midway: nothing else can affect the execution of an atomic command or observe it half-done.
An example: a web poll with open-ended questions, where we want to sum up how many people give the same answer. You have a database table where you store each answer together with a count of how many times it was given. The code is straightforward:
get the row for the given answer
if the row didn't exist:
    create the row with answer and count 1
else:
    increment count
    update the row with new count
Or is it? See what happens when multiple people do it at the same time:
user A answers "ham and eggs"      user B answers "ham and eggs"
get the row: count is 1            get the row: count is 1
okay, we're updating!              okay, we're updating!
count is now 2                     count is now 2
store 2 for "ham and eggs"         store 2 for "ham and eggs"
"Ham and eggs" only jumped by 1 even though 2 people voted for it! This is clearly not what we wanted. If only there was an atomic operation "increment if it exists or make a new record"... for brevity, let's call it "upsert" (for "update or insert")
user A answers "ham and eggs"      user B answers "ham and eggs"
upsert by incrementing count       upsert by incrementing count
Here, each upsert is atomic: the first one left count at 2, the second one left it at 3. Everything works.
Note that "atomic" is contextual: in this case, the upsert operation only needs to be atomic with respect to operations on the answers table in the database; the computer can be free to do other things as long as they don't affect (or are affected by) the result of what upsert is trying to do.
I am looking for a method to return data at different resolutions that is stored in MongoDB. The most elegant solution I can envision is a query that returns every 'nth' (second, third, tenth, etc.) document from the collection.
I am storing data (say temperature) at a 5 second interval but want to look at different trends in the data.
To find the instantaneous trend, I look at the last 720 entries (1 hour). This part is easy.
If I want to look at a slightly longer trend, say 3 hours, I could retrieve the last 2160 entries, but that is more time to pull from the server and more time and memory to plot. When looking at larger trends, the small movements are noise anyway, so I would be better off retrieving the same number of documents (720) but only every 3rd one: still 3 hours of results, with the same resources used, for a minor sacrifice in detail.
This only gets more extreme when I want to look at weeks (120,960 documents) or months (500,000+ documents).
My current code collects every single document (n = 1):
db.collection(collection).find().sort({$natural:-1}).limit(limit)
I could then loop through the returned array and remove every document when:
index % n != 0
This at least saves the client from dealing with all the data, but it seems extremely inefficient, and I would rather the database handle this part.
Does anyone know a method to accomplish this?
Apparently, there is no built-in solution in MongoDB for your problem.
The way forward would be to archive your data smartly, in fragments.
So you can store your data in collections that each hold no more than a week's or a month's worth of data; a new month/week means storing data in a different collection. That way you won't be doing a full collection scan and won't be collecting every single document, as you mentioned in your problem. Your application code decides which collection to query.
If I were in your shoes, I would use a different tool, as MongoDB is more suited to general-purpose storage. Time-series data (storing something every 5 seconds) can be handled well by a database like Cassandra, which handles frequent writes with ease, just as in your case.
Alternate fragmentation (update):
Always write your current data to collection "week0", and in the background run a weekly scheduler that moves the data from "week0" to history collections "week1", "week2", and so on. The fragmentation logic depends on your requirements.
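A minimal sketch of such routing, in the same Node.js-driver style as the question (the naming scheme and payload are assumptions for illustration):

// Route each write to a per-week collection so reads never have to
// scan more than one week of data.
function weeklyCollectionName(date) {
  var msPerWeek = 7 * 24 * 60 * 60 * 1000;
  return "week" + Math.floor(date.getTime() / msPerWeek);
}

db.collection(weeklyCollectionName(new Date())).insertOne({
  time: new Date(),
  temperature: 21.4 // example payload
});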
I think the $bucketAuto stage might help you with it.
You can do something like,
db.collection.aggregate([
  {
    $bucketAuto: {
      groupBy: "$_id", // here you'll put the field you need; in your example, 'temperature'
      buckets: 5       // the number of documents you want back, so for a sample of 500 documents you can put 500 here
    }
  }
])
Each document in the result for the above query would be something like this,
"_id": {
"max": 3,
"min": 1
},
"count": 2
If you group by temperature instead, each result document will contain the minimum and maximum temperature found in that bucket.
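For the temperature case, that would look something like this (a sketch; the collection and field names are assumptions):

db.temperatures.aggregate([
  {
    $bucketAuto: {
      groupBy: "$temperature", // bucket on the measurement itself
      buckets: 720             // one result document per bucket
    }
  }
])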
You might have another problem. Docs state not to rely on natural ordering:
This ordering is an internal implementation feature, and you should
not rely on any particular structure within it.
You can instead save the epoch seconds in each document and do your mod arithmetic on it as part of a query, with limit and sort.
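For example (a sketch, assuming a ts field holding epoch seconds and the 5-second write interval from the question), the $mod query operator lets the server keep only every 3rd sample:

db.collection(collection).find({
  // keep documents whose epoch-seconds value is a multiple of 15,
  // i.e. every 3rd 5-second sample
  ts: { $mod: [15, 0] }
}).sort({ ts: -1 }).limit(720)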
I have a problem but no idea how to resolve it.
I've got a PointValues collection in MongoDB.
The PointValue schema has 3 fields:
dataPoint (ref to DataPoint schema)
value (Number)
time (Date)
There is one pointValue for every hour (24 per day).
I have an API method to get PointValues for a specified DataPoint and time range. The problem is that I need to limit it to at most 1000 points. The typical limit(1000) approach isn't a good fit, because I need points across the whole specified time range, with a time step that depends on the specified range and the point-value count.
So... for example:
Request data for 1 year = 1 * 365 * 24 = 8760 values
It should return 1000 values, i.e. approximately 1 value per 24 / (1000 / 365) ≈ 9 hours
I have no idea what method I should use to filter that data in MongoDB.
Thanks for any help.
Sampling exactly like that on the database side would be quite hard to do and likely not very performant. But an option that gives you a similar result would be an aggregation pipeline that $group's the $first value by $year, $dayOfYear, and $hour (and $minute and $second if you need smaller intervals); a sketch of the hourly version follows below. That way you can sample values by time steps, but your choice of step lengths is limited to what you have date operators for: "hourly" samples are easy, but "9-hourly" samples get complicated. When this query is performance-critical and frequent, you might want to consider creating additional collections with daily, hourly, minutely etc. DataPoints so you don't need to perform that aggregation on every request.
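A minimal sketch of that hourly grouping (the pointvalues collection name and the dataPointId/from/to placeholders are assumptions for illustration):

db.pointvalues.aggregate([
  // restrict to one DataPoint and the requested time range
  { $match: { dataPoint: dataPointId, time: { $gte: from, $lt: to } } },
  // sort so $first picks the earliest value within each hour
  { $sort: { time: 1 } },
  { $group: {
      _id: {
        year: { $year: "$time" },
        day:  { $dayOfYear: "$time" },
        hour: { $hour: "$time" }
      },
      value: { $first: "$value" },
      time:  { $first: "$time" }
  } },
  // return the hourly samples in chronological order
  { $sort: { time: 1 } }
])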
But your documents are quite lightweight, since the actual payload lives in a different collection. So you might consider getting all the results in the requested time range and doing the skipping on the application layer. You might want to combine this with the aggregation described above to pre-reduce the dataset: first use an aggregation pipeline to get hourly results into the application, then skip through the result set in steps of 9 documents, as sketched below. Whether or not this makes sense depends on how many documents you expect.
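The application-layer step could then be as small as this (a sketch; docs stands for the array returned by the pipeline above):

// keep every 9th hourly sample
var sampled = docs.filter(function (doc, i) { return i % 9 === 0; });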
Also remember to create a sorted index on the time field.
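For example, assuming the collection is named pointvalues:

db.pointvalues.createIndex({ time: 1 })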
In a graph where there are multiple paths from node (:A) to (:B) through nodes (:C), I'd like to extract the paths from (:A) to (:B) that pass through the node (c:C) whose c.Value is maximal. For instance: connect all movies with only their oldest common actors.
match p=(m1:Movie)<-[:ACTED_IN]-(a:Actor)-[:ACTED_IN]->(m2:Movie)
return m1.name, m2.name, a.name, max(a.age)
The above query returns the proper age for the oldest actor, but not always the correct name.
Conversely, I noticed that the following query returns both the correct age and name.
match p=(m1:Movie)<-[:ACTED_IN]-(a:Actor)-[:ACTED_IN]->(m2:Movie)
with m1, m2, a order by a.age desc
return m1.name, m2.name, max(a.age), head(collect(a.name))
Would this always be true? I guess so.
Is there a better way to do the job without sorting, which may cost a lot?
You need to use ORDER BY ... LIMIT 1 for this:
match p=(m1:Movie)<-[:ACTED_IN]-(a:Actor)-[:ACTED_IN]->(m2:Movie)
return m1.name, m2.name, a.name, a.age
order by a.age desc limit 1
Be aware that what you basically want is a weighted shortest path. Neo4j can do this more efficiently with Java code and the GraphAlgoFactory; see the chapter on this in the reference manual.
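A rough sketch of what that looks like in the embedded Java API (a sketch only: signatures vary across Neo4j versions, and the stock Dijkstra minimizes a relationship cost property, so a maximum-age search would need an inverted weight copied onto the ACTED_IN relationships first; the "cost" property is an assumption):

import org.neo4j.graphalgo.GraphAlgoFactory;
import org.neo4j.graphalgo.PathFinder;
import org.neo4j.graphalgo.WeightedPath;
import org.neo4j.graphdb.*;

// Find the cheapest weighted path between two movie nodes, where each
// ACTED_IN relationship carries a "cost" property.
PathFinder<WeightedPath> finder = GraphAlgoFactory.dijkstra(
        PathExpanders.forTypeAndDirection(
                RelationshipType.withName("ACTED_IN"), Direction.BOTH),
        "cost");
WeightedPath path = finder.findSinglePath(movie1, movie2);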
For those who are willing to do similar things, consider read this post from #_nicolemargaret which describe how to extract the n oldest actors acting in pairs of movies rather than just the first, as with head(collect()).
I have a monstrosity of a Cypher query and I need to paginate its results. What I am trying to do is get the total number of results before LIMIT is applied.
Here is my test graph: http://console.neo4j.org/?id=6hq9tj
I tried to use count(o) in all parts of the query but I always get the same result: 'total_count: 1'. Like in here: http://console.neo4j.org/?id=konr7. The result what I am trying to get should be: 'total_count: 6'.
I could always make another query just to count the results, but it makes no sense to execute two queries.
Can anyone help me with this? Thanks!
Something like this should work:
MATCH (o:Brand)
WITH o
ORDER BY o.name
WITH COLLECT({uuid: o.uuid, name: o.name}) AS brands, COUNT(DISTINCT o.uuid) AS total
UNWIND brands AS brand_row
WITH total, brand_row
SKIP 5
LIMIT 5
RETURN COLLECT(brand_row) AS brands, total;
Note: this is untested; something similar worked for me. Also, I'm not sure how performant it is.
The only way I've gotten this to work is by defining the query twice. I'm not sure what the impact on performance is; I would guess (or hope) that it is cached the first time. Be warned: this is not a real solution, as my comment above on the question states; if you use an offset out of range, nothing is returned!
// first query only to get count
MATCH (x:Brand)
WITH count(*) as total
// query again to get results :(
MATCH (o:Brand)
WITH total, o
ORDER BY o.name SKIP 5 LIMIT 5
WITH total, collect({uuid:o.uuid, name:o.name}) AS brands
RETURN {total:total, brands:brands}
If anyone comes up with a better solution, I would love to see it as well; I've spent enough time trying to get this to work properly.
A slightly better solution that can handle an offset out of range...
// first query to get results
MATCH (o:Brand)
WITH o
ORDER BY o.name SKIP 5 LIMIT 5
WITH collect({uuid:o.uuid, name:o.name}) AS brands
// then query again to get count
MATCH (x:Brand)
WITH brands, count(*) as total
RETURN {total:total, brands:brands}
But it's still two queries and isn't a valid answer to the original question.
We are doing the following to update the value of a counter; now we wonder if there is a straightforward way to get back the updated counter value immediately.
mutator.incrementCounter(rowid1, "cf1", "counter1", value);
There's no single 'incrementAndGet' operation in the Cassandra Thrift API.
Counters in Cassandra are eventually consistent and non-atomic. A fragile ConsistencyLevel.ALL read is required to get a "guaranteed to be updated" counter value, i.e. to perform a consistent read; ConsistencyLevel.QUORUM is not sufficient (as specified in the counters design document: https://issues.apache.org/jira/secure/attachment/12459754/Partitionedcountersdesigndoc.pdf).
To implement an incrementAndGet method that looks consistent, you might first read the counter value, then issue the increment mutation, and return (read value + inc).
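A minimal sketch of that idea in the same Java client style as the question (readCounter is a hypothetical stand-in for whatever consistent-read call your client exposes; mutator is the one from the question):

// "Looks consistent" increment-and-get: read first, then increment,
// then report the sum. Best effort, not a guarantee.
long incrementAndGet(String rowKey, long inc) {
    long current = readCounter(rowKey, "cf1", "counter1"); // consistent read (hypothetical helper)
    mutator.incrementCounter(rowKey, "cf1", "counter1", inc);
    return current + inc;
}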
For example, if the previous counter value is 10 or 20 (on different replicas) and one adds 50 to it, read-before-increment will return either 60 or 70, while read-after-increment might still return 10 or 20.
The only way to do it is to query for it. There is no increment-then-read functionality available in Cassandra.