MongoDB mapReduce performance - multithreading

I have a mapreduce function that I use to prepare data for my web app to be used in realtime.
It works fine, but it doesn't meet my performance requirements.
My aim (and I know map-reduce isn't really meant to be used this way) is to run it when the web app user requests it, more or less in real time.
I use mapReduce because the transformation of the data needs a lot of if/else conditions due to functional requirements.
My initial subset of data to be transformed is about 100k rich documents (< 1 kB each).
The result is stored in a collection (in Replace mode) that will be then used by the webapp.
Processing currently takes about 6-9 seconds, while CPU and RAM usage stay very low.
The acceptable waiting time for my users should be less than 5 seconds.
So, to make use of the idle CPU, I tried to divide my initial input data into subsets and run the mapReduce on each subset in a different thread (20k documents per thread).
For that I had to change the Replace mode to Merge mode so the results could be collected into the same collection.
But it didn't help. It consumes more CPU, but the total execution time is more or less the same.
Setting "nonAtomic" to true in my mapReduce calls didn't help either.
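For reference, each worker call looks roughly like this (a simplified sketch: the collection and output names are illustrative, and it assumes a driver version that still exposes collection.mapReduce):

// Sketch only: run one mapReduce per _id range in parallel, merging into one output collection.
const { MongoClient } = require('mongodb');

async function runPartitioned(map, reduce, ranges) {
    const client = await MongoClient.connect('mongodb://localhost:27017');
    const coll = client.db('shop').collection('productGroups');
    try {
        await Promise.all(ranges.map(range =>
            coll.mapReduce(map, reduce, {
                query: { _id: { $gte: range.from, $lt: range.to } }, // one 20k-document subset per call
                out: { merge: 'pricingResult', nonAtomic: true }     // merge instead of replace
            })
        ));
    } finally {
        await client.close();
    }
}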
I read somewhere that there are (at least) 2 issues with running it this way:
My threads are not running in parallel for the inserts, as each insert locks the output collection.
My threads are not running in parallel during processing, because the JavaScript engine used by MongoDB is not thread-safe.
Are these points correct? And do you know any other better solutions?
PS: My mapReduce doesn't group data, it only transforms it based on functional conditions (a lot of them). All emitted documents are unique, so the reduce step never actually combines anything.
EDIT:
Here is an example:
My input objects are product groups, i.e.:
{
  _id : "1",
  products : [
    {code : "P1", name : "P1", price : 22.1, ...., competitors : [{code : "c1", price : 22.2}, {code : "c2", price : 21.9}]},
    {code : "P2", name : "P2", price : 22.1, ...., competitors : [{code : "c1", price : 22.2}, {code : "c2", price : 21.9}]}
  ]
}
Users should be able to dynamically define functional groups based on some criteria applied to each product, and define a pricing strategy for each one of them.
As a simple example of functional grouping, they could define 4 groups like this:
Cheap Products (whose price is less than 20)
Products that are sold by both competitors "C1" and "C2"
Products that are sold only by the competitor "C3"
Products that are sold by the competitor "C4" and are not in promo
...
All these groups are defined based on properties of the Product object, and because one product can possibly fit more than one group, the first group encountered is the one used (if a product fits the first group, it must not appear in any other one).
Once the group criteria are defined, users can define for each group a strategy to calculate a new price for each product based on some conditions (these use the product's own properties BUT ALSO the properties of the other products in the same array of the original input object).
The result is a collection of separate products, each with its functional group, its new price and some other calculated stats and values.
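To make the shape of the transformation concrete, my map function looks roughly like this (a simplified sketch: soldBy and computeNewPrice stand in for the real predicates and pricing strategies, and the group names are made up):

var map = function () {
    var group = this; // one product-group document
    group.products.forEach(function (product) {
        // long if/else chain; the first matching functional group wins
        var functionalGroup;
        if (product.price < 20) {
            functionalGroup = "cheap";
        } else if (soldBy(product, "c1") && soldBy(product, "c2")) {
            functionalGroup = "c1AndC2";
        } else {
            functionalGroup = "other";
        }
        // every emitted key is unique, so reduce never has anything to combine
        emit(group._id + ":" + product.code, {
            code : product.code,
            functionalGroup : functionalGroup,
            newPrice : computeNewPrice(product, group.products, functionalGroup)
        });
    });
};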

Related

Every 'nth' document from a collection - MongoDB + NodeJS

I am looking for a method to return data stored in MongoDB at different resolutions. The most elegant solution I can envision is a query that returns every 'nth' (second, third, tenth, etc.) document from the collection.
I am storing data (say temperature) at a 5 second interval but want to look at different trends in the data.
To find the instantaneous trend, I look at the last 720 entries (1 hour). This part is easy.
If I want to look at a slightly longer trend, say 3 hours, I could retrieve the last 2160 entries, but that is more time to pull from the server and more time and memory to plot. When looking at larger trends the small movements are just noise, so I would be better off retrieving the same number of documents (720) but only every 3rd one, still giving me 3 hours of results with the same resources used, for a minor sacrifice in detail.
This only gets more extreme when I want to look at weeks (120,960 documents) or months (500,000+ documents).
My current code collects every single document (n = 1):
db.collection(collection).find().sort({$natural:-1}).limit(limit)
I could then loop through the returned array and remove every document when:
index % n != 0
This at least saves the client from dealing with all the data; however, it seems extremely inefficient, and I would rather have the database handle this part.
Does anyone know a method to accomplish this?
Apparently, there is no built-in solution in Mongo for your problem.
The way forward would be to archive your data smartly, in fragments.
So you can store your data in a collection that holds no more than a week's or a month's worth of data. A new month/week means storing the data in a different collection. That way you won't be doing a full collection scan and won't be fetching every single document, as you mentioned in your problem. Your application code will decide which collection to query.
If I were in your shoes, I would use a different tool, as Mongo is more suited to being a general-purpose database. Time-series data (storing something every 5 seconds) can be handled pretty well by a database like Cassandra, which can handle frequent writes with ease, just as in your case.
Alternate fragmentation (update):
Always write your current data to the collection "week0" and in the background run a weekly scheduler that moves the data from "week0" to history collections "week1", "week2" and so on. The fragmentation logic depends on your requirements.
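For illustration, the application-side choice of collection could be as simple as deriving the name from the date (a sketch; the collection prefix and weekly naming scheme are made up):

// Pick the collection by week so a query never scans more than one week of data.
function weeklyCollectionName(date) {
    const msPerWeek = 7 * 24 * 60 * 60 * 1000;
    return 'temps_week_' + Math.floor(date.getTime() / msPerWeek); // weeks since the epoch
}

db.collection(weeklyCollectionName(new Date()))
    .find()
    .sort({ $natural: -1 })
    .limit(720);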
I think the $bucketAuto stage might help you with it.
You can do something like,
db.collection.aggregate([
  {
    $bucketAuto: {
      groupBy: "$_id", // here you'll put the field you need - in your example, temperature
      buckets: 5       // the number of documents you want back, so for a sample of 500 documents put 500 here
    }
  }
])
Each document in the result of the above query would be something like this:
{
  "_id": {
    "max": 3,
    "min": 1
  },
  "count": 2
}
If you group by temperature, each document will have the minimum and maximum temperature found in that bucket.
You might have another problem. Docs state not to rely on natural ordering:
This ordering is an internal implementation feature, and you should
not rely on any particular structure within it.
You can instead save the epoch seconds in each document and do your mod arithmetic on it as part of a query, with limit and sort.
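For illustration, such a query could look like this (a sketch: the field name epochSeconds and the step value are assumptions):

var stepSeconds = 15; // e.g. every 3rd document of a 5-second feed
db.collection('temps')
    .find({ epochSeconds: { $mod: [stepSeconds, 0] } }) // [divisor, remainder]
    .sort({ epochSeconds: -1 })
    .limit(720);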

Best way to store high frequency, periodic time-series data?

I have created an MVP for a Node.js project; the following features are relevant to the question I am about to ask:
1. The application has a list of IP addresses with CRUD actions.
2. The application pings each IP address every 5 seconds.
3. Against each IP address it displays its status, i.e. alive or dead, and the uptime if alive.
I created a working MVP in Node.js with the help of the libraries net-ping, express, mongo and angular. Now I have a new feature request:
"to calculate the round trip time (latency) for each ping that is generated for each IP address and populate a bar chart, or any type of chart, that will display the RTT (latency) history (1 month-1 year) of every connection"
I need to store the response of each ping in the database. Assuming, in the best case, that each document I store is 0.5 kB in size, that makes 9.5 MB of data to be stored each day, 285 MB each month and 3.4 GB in a year for a single IP address, and I am going to have 100-200 IP addresses in my application.
What is the best solution (including paid ones) that will best suit my requirements, considering the app can scale further?
Time-series data requires special treatment from a database perspective, as it introduces challenges to traditional database management in terms of capacity, query performance, read/write optimisation targets, etc.
I wouldn't recommend storing this data in a traditional RDBMS, or in an object/document database.
The best option is to use a specialised time-series database engine, like InfluxDB, that supports downsampling (aggregation) and raw-data retention rules.
So I changed the schema design for the time-series data after reading this, and that massively reduced the numbers in my size calculation.
The previous schema looked like this:
{
  timestamp: ISODate("2013-10-10T23:06:37.000Z"),
  type: "Latency",
  value: 1000000
},
{
  timestamp: ISODate("2013-10-10T23:06:38.000Z"),
  type: "Latency",
  value: 15000000
}
Size of each document: 0.22 kB
Number of documents created in an hour = 720
Size of data generated in an hour = 0.22 * 720 = 158.4 kB
Size of data generated by one IP address in a day = 158.4 * 24 ≈ 3.7 MB
Since each timestamp is just a 5-second increment over the previous one, the schema can be optimized to cut out the redundant data.
The new schema looks like this :
{
  timestamp_hour: ISODate("2013-10-10T23:00:00.000Z"), // truncated to the hour
  type: "Latency",
  values: { // data for all pings in the specific hour
    0: 999999,
    ...
    37: 1000000,
    38: 1500000,
    ...
    720: 2000000
  }
}
Size of each document: 0.5 kB
Number of documents created in an hour = 1
Size of data generated in an hour = 0.5 kB
Size of data generated by one IP address in a day = 0.5 * 24 = 12 kB
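For illustration, each ping can be folded into its hourly document with an upsert along these lines (a sketch: the collection name and the 5-second slot index are my own choices):

var ts = new Date('2013-10-10T23:06:37Z');       // time of the ping
var hourStart = new Date(ts);
hourStart.setUTCMinutes(0, 0, 0);                // truncate to the hour
var slot = Math.floor((ts - hourStart) / 5000);  // 0..719, one slot per 5 seconds

db.collection('latency').updateOne(
    { timestamp_hour: hourStart, type: 'Latency' },
    { $set: { ['values.' + slot]: 1000000 } },   // write only this ping's slot
    { upsert: true }                             // create the hourly document on the first ping
);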
So I am assuming the size of the data will not be an issue anymore, and although there is a debate about what type of storage should be used in such scenarios to ensure the best performance, I am going to trust MongoDB in my case.

Find documents in MongoDB with non-typical limit

I have a problem but no idea how to resolve it.
I've got a PointValues collection in MongoDB.
The PointValue schema has 3 fields:
dataPoint (ref to DataPoint schema)
value (Number)
time (Date)
There is one pointValue for every hour (24 per day).
I have an API method to get PointValues for a specified DataPoint and time range. The problem is that I need to limit it to a maximum of 1000 points. The typical limit(1000) method isn't a good way to do it, because I need points for the whole specified time range, with a time step that depends on the time range and the point value count.
So... for example:
Requesting data for 1 year = 1 * 365 * 24 = 8760 points.
It should return 1000 values, i.e. approximately 1 value per 8760 / 1000 ≈ 9 hours.
I have no idea what method I should use to filter that data in MongoDB.
Thanks for the help.
Sampling exactly like that on the database side would be quite hard to do and likely not very performant. But an option which gives you a similar result would be to use an aggregation pipeline which $groups the $first value by $year, $dayOfYear, and $hour (and $minute and $second if you need smaller intervals). That way you can sample values by time steps, but your choice of step lengths is limited to what you have date operators for. So "hourly" samples are easy, but "9-hourly" samples get complicated. When this query is performance-critical and frequent, you might want to consider creating additional collections with daily, hourly, minutely etc. DataPoints so you don't need to perform that aggregation on every request.
But your documents are quite lightweight, as the actual payload lives in a different collection. So you might consider getting all the results in the requested time range and then doing the skipping on the application layer. You might want to combine this with the aggregation described above to pre-reduce the dataset: first use an aggregation pipeline to get hourly results into the application, then skip through the result set in steps of 9 documents. Whether or not this makes sense depends on how many documents you expect.
Also remember to create a sorted index on the time-field.
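A sketch of such a pipeline (field names follow the schema in the question; the DataPoint id and date range are placeholders):

var from = new Date('2017-01-01');
var to = new Date('2018-01-01');

// sorted index on the time field, as mentioned above
db.collection('pointvalues').createIndex({ dataPoint: 1, time: 1 });

db.collection('pointvalues').aggregate([
  { $match: { dataPoint: someDataPointId, time: { $gte: from, $lt: to } } }, // someDataPointId is a placeholder
  { $sort: { time: 1 } },
  { $group: {
      _id: { year: { $year: '$time' }, dayOfYear: { $dayOfYear: '$time' }, hour: { $hour: '$time' } },
      time: { $first: '$time' },   // first point value in each hour
      value: { $first: '$value' }
  } },
  { $sort: { time: 1 } }
]);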

Cassandra approach of RDBMS nested insertions

I receive regularly two types of sets of data:
Network flows, thousands per second:
{
  'stamp' : '2017-01-19 01:37:22',
  'host' : '192.168.2.6',
  'ip_src' : '10.29.78.3',
  'ip_dst' : '8.8.4.4',
  'iface_in' : 19,
  'iface_out' : 20,
  (... etc ..)
}
And interface tables, every hour:
[
  {
    'stamp' : '2017-01-19 03:00:00',
    'host' : '192.168.2.6',
    'iface_id' : 19,
    'iface_name' : 'Fa0/0'
  },
  {
    'stamp' : '2017-01-19 03:00:00',
    'host' : '192.168.2.6',
    'iface_id' : 20,
    'iface_name' : 'Fa0/1'
  },
  {
    'stamp' : '2017-01-19 03:00:00',
    'host' : '192.168.157.38',
    'iface_id' : 20,
    'iface_name' : 'Gi0/3'
  }
]
I want to insert those flows into Cassandra, with interface names instead of IDs, based on the latest matching host/iface_id value. I cannot rely on a memory-only solution, otherwise I may lose up to one hour of flows every time I restart the application.
What I had in mind, is to use two Cassandra tables: One that holds the flows, and one that holds the latest host/iface_id table. Then, when receiving a flow, I would use this data to properly fill interface name.
Ideally, I would like to let Cassandra take care of this. In my mind, it seems more efficient than pulling out interface names from the application side every time.
The thing is that I cannot figure out how to do that - and having never worked with NoSQL before, I am not even sure that this is the right approach... Could someone point me in the right direction?
Inserting data in the interface table and keeping only the latest version is quite trivial, but I cannot wrap my mind around the 'inserting interface name in flow record' part. In a traditional RDBMS I would use a nested query, but those don't seem to exist in Cassandra.
Reading your question, I assume the interface data received hourly is not too big, so we can keep that data (a single row) in memory as well as in the Cassandra database. Every hour, the in-memory data gets updated and a new row is inserted into the database. We can save the interface data with the table definition below:
create table interface_by_hour(
    year int,
    month int,
    day int,
    hour int,
    data text, -- entire JSON string for one hour of interface data
    primary key((year, month, day, hour))
);
A few insert statements:
insert into interface_by_hour (year,month,day,hour,data) values (2017,1,27,23,'{complete json.........}');
insert into interface_by_hour (year,month,day,hour,data) values (2017,1,27,00,'{complete json.........}');
insert into interface_by_hour (year,month,day,hour,data) values (2017,1,28,1,'{complete json.........}');
Keep every hour's interface data in this table and update it in memory as well. The benefit of having the in-memory data is that you don't have to read it from the table thousands of times every second. If the application goes down, you can read the current/previous hour's data from the table using the query below and rebuild the in-memory cache.
cqlsh:mykeyspace> select * from interface_by_hour where year=2017 and month=1 and day=27 and hour=0;
year | month | day | hour | data
------+-------+-----+------+--------------------------
2017 | 1 | 27 | 0 | {complete json.........}
Now comes the flow data.
As we have the current hour's interface data cached in memory, we can quickly map a host/iface_id pair to its interface name. Use the table below to save flow data.
create table flow(
    iface_name text,
    createdon bigint, -- timestamp in milliseconds
    host text,        -- optional; leave this column out if you don't need it
    flowdata text,    -- entire JSON string
    primary key(iface_name, createdon, host)
);
The only issue I see with the above table is that it will not distribute data evenly across partitions: if you have a lot of flow data for one interface name, it will all be inserted into a single partition.
I designed this table just to save the data; if you had specified how you are going to use it, I would have given it some more thought.
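For illustration, the in-memory cache described above could look like this (a sketch in JavaScript; the names and the key format are made up):

// Rebuild the cache from the hourly interface JSON (the 'data' column above).
const ifaceNameByHostAndId = new Map();

function rebuildCache(hourlyInterfaceJson) {
    ifaceNameByHostAndId.clear();
    for (const iface of JSON.parse(hourlyInterfaceJson)) {
        ifaceNameByHostAndId.set(iface.host + '/' + iface.iface_id, iface.iface_name);
    }
}

// Stamp an incoming flow with interface names before writing it to Cassandra.
function enrich(flow) {
    return Object.assign({}, flow, {
        iface_in_name:  ifaceNameByHostAndId.get(flow.host + '/' + flow.iface_in),
        iface_out_name: ifaceNameByHostAndId.get(flow.host + '/' + flow.iface_out)
    });
}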
Hope this helps.
Hi. As far as I can tell, the interface data is not so heavy on the writes that it would need partitioning by time. It changes only once per hour, so it's not necessary to save the data for every hour, just the latest version. Also, I will assume that you want to query this in some way; I'm not sure how, so I'll just propose something general for the interface table and treat the flows as time-series data:
create table interface(
    iface_name text primary key,
    iface_id int,
    host text,
    stamp timestamp
);
insert into interface(iface_name, iface_id, host, stamp) values ('Fa0/0', 19, '192.168.2.6', '2017-01-19 03:00:00');
insert into interface(iface_name, iface_id, host, stamp) values ('Fa0/1', 20, '192.168.2.6', '2017-01-19 03:00:00');
insert into interface(iface_name, iface_id, host, stamp) values ('Gi0/3', 20, '192.168.157.38', '2017-01-19 03:00:00');
Usually a full-table select like the one below is an antipattern with Cassandra:
cqlsh:test> select * from interface;
iface_name | host | iface_id | stamp
------------+----------------+----------+---------------------------------
Fa0/0 | 192.168.2.6 | 19 | 2017-01-19 02:00:00.000000+0000
Gi0/3 | 192.168.157.38 | 20 | 2017-01-19 02:00:00.000000+0000
Fa0/1 | 192.168.2.6 | 20 | 2017-01-19 02:00:00.000000+0000
But as far as I can see you don't have that many interfaces, so basically anything up to thousands will be OK here. In the worst case you might want to use the token function to page through the partitions, but the main point is that this saves you a lot of space and you don't need to save this by the hour.
I would simply keep this table in memory as well and then enrich the flow data as it comes in. If there are updates, update the in-memory cache, but also write them to Cassandra. If something fails, simply restore the cache from the interface table and continue.
Basically your flow info would then become:
{
  'stamp' : '2017-01-19 01:37:22',
  'host' : '192.168.2.6',
  'ip_src' : '10.29.78.3',
  'ip_dst' : '8.8.4.4',
  'iface_in' : 19,
  'iface_out' : 20,
  'iface_name' : 'key put from in memory cache'
}
This is how you will get the biggest performance. Saving flows then becomes plain time-series data. Take into account that you are hitting the cluster with thousands of writes per second, and that when you partition by time you get at least 7000 columns (if not more) every second with the model I'm proposing here. Usually you will want at most around 100 000 columns within a single partition, which means your partition would exceed the ideal size within 20 seconds or even less. So I would suggest also using random buckets (when inserting, just use some number in a defined range, say 0-10):
create table flow(
    time_with_minute text,
    artificial_bucket int,
    stamp timeuuid,
    host text,
    ip_src text,
    ip_dst text,
    iface_in int,
    iface_out int,
    iface_name text,
    primary key((time_with_minute, artificial_bucket), stamp)
);
When you want to fetch flows over time, you simply use the parts of a timestamp plus make 10 queries, at the same time or one by one, to access all the data. There are various techniques here; you simply need to tell us more about your use case.
Inserting is then something like:
insert into flow(time_with_minute, artificial_bucket, stamp, host, ip_src, ip_dst, iface_in, iface_out, iface_name)
values ('2017-01-19 01:37', 1, now(), '192.168.2.6', '10.29.78.3', '8.8.4.4', 19, 20, 'Fa0/0');
I used now() just for the example; use https://github.com/apache/cassandra/blob/cassandra-2.1/src/java/org/apache/cassandra/utils/UUIDGen.java to generate a timeuuid from the time the flow was actually received. Also, I inserted 1 into the artificial bucket; here you would insert a random number in a defined range, say 0-10 at least. Some people, depending on the load, use even more random buckets, 60 or more; it all depends on how heavy the writes are. If you just partition by minute, every minute one group of nodes within the cluster will be hot, and this will switch around. Having hot nodes is usually not a good idea.
With Cassandra you write the information you need right away; you don't do any joins during the write or anything similar. Keep the data you need for stamping the flows in memory, and just insert the already denormalised record. You can also model the solution in a relational way and just tell us how you would like to access the data, and then we can go into details.

Traversing the optimum path between nodes

In a graph where there are multiple paths from point (:A) to (:B) through nodes (:C), I'd like to extract the paths from (:A) to (:B) through nodes of type (c:C) where c.Value is maximum. For instance, connect all movies with only their oldest common actors.
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
return m1.Name, m2.Name, a.Name, max(a.Age)
The above query returns the proper age for the oldest actor, but not always the correct name.
Conversely, I noticed that the following query returns both correct age and name.
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
with m1, m2, a order by a.age desc
return m1.name, m2.name, a.name, max(a.age), head(collect(a.name))
Would this always be true? I guess so.
Is there a better way to do the job without the sorting, which may be costly?
You need to use ORDER BY ... LIMIT 1 for this:
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
return m1.Name, m2.Name, a.Name, a.Age order by a.Age desc limit 1
Be aware that you basically want to do a weighted shortest path. Neo4j can do this more efficiently using Java code and the GraphAlgoFactory; see the chapter on this in the reference manual.
For those who want to do similar things, consider reading this post from #_nicolemargaret, which describes how to extract the n oldest actors acting in pairs of movies, rather than just the first as with head(collect()).
