Based on the following collections:
data_invoices (document, 100,000 total records, 2 tenants)
    hash: tenantId
    persistent: createdOn
data_jobs (document, 10,000 total records, 2 tenants)
    hash: tenantId
    persistent: createdOn
data_links (edge, 100,000 total records)
    persistent: createdOn
    persistent (sparse): replacedOn
The links collection will connect one invoice to a random job, so a job may have zero or more invoices. An invoice should have one or more jobs, but in my data, each invoice is matched to only one job. The date filter does not actually filter out any data (they are all less than the specified date value) and neither does the tenantId filter since all the data is either xxx or yyy.
The generic structure of data_jobs and data_invoices is:
tenantId: string;
createdOn: number;
data: [{
    createdOn: number;
    values: {
        ...collection specific data here...
    };
}];
The collection specific data structure for data_invoices is:
number: number;
amount: number;
The collection specific data structure for data_jobs is:
name: string;
The structure of the data_links table is:
createdOn: number;
replacedOn?: number; // though I don't have any records with this value set
The createdOn field is a date stored as milliseconds since 1 January 1970, and is a random date from 01 Jan 2000 to today.
The amount field is a random currency value (2 decimal places) from 10 to 10,000.
The number field is an autonumber type field.
I have two queries that are (in my opinion) very similar. One direction (jobs to invoices) runs very quickly; the other takes ages.
This query takes 1.85 seconds:
LET date = 1495616898128
FOR job IN data_jobs
    FILTER job.tenantId IN ['xxx', 'yyy']
    FILTER job.createdOn <= date
    LET jobData = (job.data[* FILTER CURRENT.createdOn <= date LIMIT 1])[0]
    FILTER CONTAINS(jobData.values.name, 'a')
    LET invoices = (
        FOR invoice, link IN 1 INBOUND job data_links
            FILTER link.createdOn <= date AND (link.replacedOn == NULL OR link.replacedOn > date)
            LET invoiceData = (invoice.data[* FILTER CURRENT.createdOn <= date LIMIT 1])[0]
            FILTER invoiceData.values.amount > 1000
            COLLECT WITH COUNT INTO count
            RETURN {
                count
            }
    )[0]
    FILTER invoices.count > 0
    SORT jobData.values.name ASC
    LIMIT 0, 8
    RETURN job
This query takes 8.5 seconds:
LET date = 1495616898128
FOR invoice IN data_invoices
    FILTER invoice.tenantId IN ['xxx', 'yyy']
    FILTER invoice.createdOn <= date
    LET invoiceData = (invoice.data[* FILTER CURRENT.createdOn <= date LIMIT 1])[0]
    FILTER invoiceData.values.amount > 1000
    LET jobs = (
        FOR job, link IN 1 OUTBOUND invoice data_links
            FILTER link.createdOn <= date AND (link.replacedOn == NULL OR link.replacedOn > date)
            LET jobData = (job.data[* FILTER CURRENT.createdOn <= date LIMIT 1])[0]
            FILTER CONTAINS(jobData.values.name, 'a')
            COLLECT WITH COUNT INTO count
            RETURN {
                count
            }
    )[0]
    FILTER jobs.count > 0
    SORT invoiceData.values.amount ASC
    LIMIT 0, 8
    RETURN invoice
I realise that both queries are providing different data, but the processing time should be the same shouldn't it? They are both filtering both tables through the links table and both performing aggregations on the other. I don't understand why one way is much quicker than the other way. Is there anything I can do to increase the performance of these queries please?
Okay, this is strange, but I have stumbled upon a very counter-intuitive (at least to me) solution: sort first, then filter.
This query now takes 1.4 seconds:
LET date = 1495616898128
FOR invoice IN data_invoices
    FILTER invoice.tenantId IN ['xxx', 'yyy']
    FILTER invoice.createdOn <= date
    LET invoiceData = (invoice.data[* FILTER CURRENT.createdOn <= date LIMIT 1])[0]
    SORT invoiceData.values.amount ASC
    FILTER invoiceData.values.amount > 1000
    LET jobs = (
        FOR job, link IN 1 OUTBOUND invoice data_links
            FILTER link.createdOn <= date AND (link.replacedOn == NULL OR link.replacedOn > date)
            LET jobData = (job.data[* FILTER CURRENT.createdOn <= date LIMIT 1])[0]
            FILTER CONTAINS(jobData.values.name, 'a')
            COLLECT WITH COUNT INTO count
            RETURN {
                count
            }
    )[0]
    FILTER jobs.count > 0
    LIMIT 0, 8
    RETURN invoice
Even after I added a persistent index on data[*].values.amount, the query still does not use it. (I even tried SORT invoice.data[0].values.amount ASC, and it still didn't seem to use the index.)
Can anyone explain this please?
I have two large datasets. There are multiple groupings of the same ids, and each group has a score. I'm trying to broadcast the score to each id in each group. I have one convenient constraint: I don't care about groups with more than 1000 ids.
Unfortunately, Spark keeps reading the full grouping. I can't figure out a way to push down the limit so that Spark reads at most 1000 records per group and gives up on the group if there are more.
So far I've tried this:
import org.apache.spark.rdd.RDD
import scala.collection.mutable

def run: Unit = {
  // ...
  val scores: RDD[(GroupId, Score)] = readScores(...)
  val data: RDD[(GroupId, Id)] = readData(...)
  val idToScore: RDD[(Id, Score)] = scores.cogroup(data)
    .flatMap(maxIdsPerGroupFilter(1000))
  // ...
}

def maxIdsPerGroupFilter(maxIds: Int)(t: (GroupId, (Iterable[Score], Iterable[Id]))): Iterator[(Id, Score)] = {
  t match {
    case (groupId: GroupId, (scores: Iterable[Score], ids: Iterable[Id])) =>
      // Groups without a score produce nothing
      if (!scores.iterator.hasNext) {
        return Iterator.empty
      }
      val score: Score = scores.iterator.next()
      val iter = ids.iterator
      val uniqueIds: mutable.HashSet[Id] = new mutable.HashSet[Id]
      // Stop as soon as the group exceeds maxIds distinct ids
      while (iter.hasNext) {
        uniqueIds.add(iter.next())
        if (uniqueIds.size > maxIds) {
          return Iterator.empty
        }
      }
      uniqueIds.map((_, score)).iterator
  }
}
(Even with variants where the filter function just returns empty iterators, Spark still insists on reading all the data.)
The side effect of this is that because some groups have too many ids, I have a lot of skew in the data and the job can never finish when processing the full scale of data.
I want the reduce-side to only read in the data it needs, and not crap out because of data skew.
I have a feeling that somehow I need to create a transform that is able to push down a limit or take clause, but I can't figure out how.
Can't we just filter out the groups that have more than 1k records, using count() on the grouped data?
Or, if you also want to keep the groups that have more than 1k records but pick only up to 1k records from each, then in a Spark SQL query you can use ROW_NUMBER() OVER (PARTITION BY id ORDER BY someColumn DESC) AS rn and then add the condition rn<=1000.
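The "give up after 1000 ids" idea can also be expressed as a fold over a bounded per-group state, so that no group ever holds more than the limit plus one id in memory. The following is only a plain-Python sketch of that idea, not Spark code: the names `add_id`, `merge` and `bounded_groups` are hypothetical, and the limit is set to 3 instead of 1000 to keep the example small. In Spark, functions of this shape are what one would typically hand to a per-key aggregation such as `aggregateByKey`, but whether that avoids the skew in a particular job is an assumption that would need testing.

```python
from typing import Dict, Hashable, Iterable, List, Optional, Set, Tuple

MAX_IDS = 3  # the question uses 1000; kept small here for illustration

# None marks a group that has overflowed and must be dropped.
State = Optional[Set[Hashable]]


def add_id(state: State, new_id: Hashable, max_ids: int = MAX_IDS) -> State:
    """Fold one id into the group state, giving up once the limit is exceeded."""
    if state is None:
        return None
    state.add(new_id)
    return None if len(state) > max_ids else state


def merge(a: State, b: State, max_ids: int = MAX_IDS) -> State:
    """Combine two partial states (for example, from two partitions)."""
    if a is None or b is None:
        return None
    merged = a | b
    return None if len(merged) > max_ids else merged


def bounded_groups(pairs: Iterable[Tuple[Hashable, Hashable]]) -> Dict[Hashable, Set[Hashable]]:
    """Return the distinct ids of every group that stays within the limit."""
    groups: Dict[Hashable, State] = {}
    for group_id, item_id in pairs:
        groups[group_id] = add_id(groups.get(group_id, set()), item_id)
    return {g: ids for g, ids in groups.items() if ids is not None}


def id_scores(scores: Dict[Hashable, float],
              pairs: Iterable[Tuple[Hashable, Hashable]]) -> List[Tuple[Hashable, float]]:
    """Broadcast each group's score to the ids of the groups that were kept."""
    kept = bounded_groups(pairs)
    return sorted((i, scores[g]) for g, ids in kept.items() if g in scores for i in ids)


sample_pairs = [("a", 1), ("a", 2), ("b", 1), ("b", 2), ("b", 3), ("b", 4), ("a", 1)]
sample_scores = {"a": 0.5, "b": 0.9}
print(id_scores(sample_scores, sample_pairs))
```

Group "b" has four distinct ids, which exceeds the limit of 3, so only the ids of group "a" receive a score.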
We have a use case where we receive messages from Kafka that need to be aggregated. They have to be aggregated in such a way that if an update comes in for the same id, the existing value (if any) is subtracted and the new value is added.
From various forums I learned that Jet doesn't store the raw values, only the aggregated result and some internal data.
In that case, how can I achieve this?
Example
Balance 1 {id:1, amount:100} // aggregated result 100
Balance 2 {id:2, amount:200} // 300
Balance 3 {id:1, amount:400} // 600 after removing 100 and adding 400
I could achieve the simple case where every value is added, but I was not able to achieve the aggregation where the existing value needs to be subtracted and the new value added.
rollingAggregate(AggregateOperations.summingDouble(<logic to add/remove>))
.drainTo(Sinks.logger())
Balance 1, 2 and 3 are a sequence of messages.
The comment shows the aggregated value after Jet has processed each message.
My aim is to add the new amount if the id comes for the first time, and to subtract the old amount if an updated balance arrives, i.e. the id is the same as an earlier one.
You can try a custom aggregate operation which will emit the previous and currently seen values like this:
public static <T> AggregateOperation1<T, ?, Tuple2<T, T>> previousAndCurrent() {
    return AggregateOperation
            .withCreate(() -> new Object[2])
            .<T>andAccumulate((acc, current) -> {
                acc[0] = acc[1];
                acc[1] = current;
            })
            .andExportFinish((acc) -> tuple2((T) acc[0], (T) acc[1]));
}
The output should be a tuple of the form (previous, current). You can then apply a rolling aggregate again to that output. To simplify the problem, my input is a stream of (id, amount) pairs.
Pipeline p = Pipeline.create();
p.drawFrom(Sources.<Integer, Long>mapJournal("map", START_FROM_OLDEST)) // (id, amount)
 .groupingKey(Entry::getKey)
 .rollingAggregate(previousAndCurrent(), (key, val) -> val)
 .rollingAggregate(AggregateOperations.summingLong(e -> {
     long prevValue = e.f0() == null ? 0 : e.f0().getValue();
     long newValue = e.f1().getValue();
     return newValue - prevValue;
 }))
 .drainTo(Sinks.logger());

JetConfig config = new JetConfig();
config.getHazelcastConfig().addEventJournalConfig(new EventJournalConfig().setMapName("map"));
JetInstance jet = Jet.newJetInstance(config);
IMapJet<Object, Object> map = jet.getMap("map");
map.put(0, 1L);
map.put(0, 2L);
map.put(1, 10L);
map.put(1, 40L);
jet.newJob(p).join();
This should produce as output: 1, 2, 12, 42.
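To see why this yields a correct running total, the arithmetic of the two chained aggregations can be traced outside Jet. The snippet below is a plain-Python sketch of the same logic, not Jet code; the function name `rolling_total` is hypothetical. It remembers the latest amount per id and adds only the difference between the new and the previous amount to the total.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def rolling_total(updates: Iterable[Tuple[Hashable, int]]) -> List[int]:
    """Return the running total after each (id, amount) update.

    An update for an id that was seen before replaces its earlier amount,
    so only the delta (new - previous) is added to the total.
    """
    latest: Dict[Hashable, int] = {}  # last amount seen per id
    total = 0
    totals: List[int] = []
    for key, amount in updates:
        previous = latest.get(key, 0)  # 0 when the id is new
        total += amount - previous
        latest[key] = amount
        totals.append(total)
    return totals


# The map.put(...) calls from the answer
print(rolling_total([(0, 1), (0, 2), (1, 10), (1, 40)]))   # [1, 2, 12, 42]
# The balance messages from the question
print(rolling_total([(1, 100), (2, 200), (1, 400)]))       # [100, 300, 600]
```

The second call reproduces the totals given in the question's comments: 100, 300, and then 600 once the amount for id 1 is replaced.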
First of all, I am a beginner with MongoDB, and I have the following problem. I am using the model below with mongoengine:
class Stats(Document):
    Name = StringField(max_length=250)
    timestamp = LongField(default=mktime(datetime.now().timetuple()))
    count = IntField()
    <some other fields>
What I want is to filter by the name (that part is clear) and then sum the count field, but with the records grouped by hour, day or month.
For example, if we have records with the timestamps [1532970603, 1532972103, 153293600, 1532974500], then records 1-2 form the first group and records 3-4 form the second group.
And that is where I am stuck. I have some ideas about grouping every n records, or about dividing the timestamp by 3600
(1 hour = 3600 seconds), but how can I do that with mongoengine? Or how can I even insert a Python expression into a pipeline?
I would appreciate any help.
I would recommend using the ISO date format and storing the complete date in timestamp. Here is your model:
from datetime import datetime
from mongoengine import Document, StringField, DateTimeField, FloatField

class Stats(Document):
    Name = StringField(max_length=250)
    # ISO date recommended; pass the callable so each document gets its own default
    timestamp = DateTimeField(default=datetime.utcnow)
    count = FloatField()
    meta = {'strict': False}
Now you can aggregate them accordingly.
Stats.objects.aggregate(
    {
        '$group': {
            '_id': {
                'year': {'$year': '$timestamp'},
                'month': {'$month': '$timestamp'},
                'day': {'$dayOfMonth': '$timestamp'},
                'hour': {'$hour': '$timestamp'},
            },
            # Sum the count field within each hourly bucket
            'total': {'$sum': '$count'},
        }
    }
)
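The bucketing idea from the question (dividing the epoch timestamp by 3600) can be checked in plain Python before it is moved into a pipeline. This is a minimal sketch without MongoDB: the function name `sum_by_hour` and the count values are hypothetical, and the third timestamp is assumed to be 1532973600, since the value printed in the question appears to be missing a digit.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

SECONDS_PER_HOUR = 3600


def sum_by_hour(records: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Sum `count` per hour, keyed by the epoch second at which the hour starts."""
    totals: Dict[int, int] = defaultdict(int)
    for timestamp, count in records:
        # Round the timestamp down to the start of its hour
        bucket_start = timestamp - timestamp % SECONDS_PER_HOUR
        totals[bucket_start] += count
    return dict(totals)


# (timestamp, count) pairs; the counts are made up for the example
records = [
    (1532970603, 1),
    (1532972103, 2),
    (1532973600, 3),  # assumed value, see the note above
    (1532974500, 4),
]
print(sum_by_hour(records))
```

Records 1-2 land in the hour starting at 1532970000 and records 3-4 in the hour starting at 1532973600, which matches the grouping described in the question. If the timestamps stay numeric, the same rounding could probably be written as a `$group` key using the `$subtract` and `$mod` operators.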
My architecture:
1 EventHub with 8 Partitions & 2 TPUs
1 Streaming Analytics Job
6 Windows based on the same input (from 1mn to 6mn)
Sample Data:
{side: 'BUY', ticker: 'MSFT', qty: 1, price: 123, tradeTimestamp: 10000000000}
{side: 'SELL', ticker: 'MSFT', qty: 1, price: 124, tradeTimestamp:1000000000}
The EventHub PartitionKey is ticker
I would like to emit every second, the following data:
(Total quantity bought / Total quantity sold) in the last minute, last 2mn, last 3mn and more
What I tried:
WITH TradesWindow AS (
    SELECT
        windowEnd = System.Timestamp,
        ticker,
        side,
        totalQty = SUM(qty)
    FROM [Trades-Stream] TIMESTAMP BY tradeTimestamp PARTITION BY PartitionId
    GROUP BY ticker, side, PartitionId, HoppingWindow(second, 60, 1)
),
TradesRatio1MN AS (
    SELECT
        ticker = b.ticker,
        buySellRatio = b.totalQty / s.totalQty
    FROM TradesWindow b /* SHOULD I PARTITION HERE TOO ? */
    JOIN TradesWindow s /* SHOULD I PARTITION HERE TOO ? */
        ON s.ticker = b.ticker AND s.side = 'SELL'
        AND DATEDIFF(second, b, s) BETWEEN 0 AND 1
    WHERE b.side = 'BUY'
)
/* .... More windows.... */
/* FINAL OUTPUT: Joining all the windows */
SELECT
    buySellRatio1MN = bs1.buySellRatio,
    buySellRatio2MN = bs2.buySellRatio
    /* more windows */
INTO [output]
FROM TradesRatio1MN bs1 /* SHOULD I PARTITION HERE TOO ? */
JOIN TradesRatio2MN bs2 /* SHOULD I PARTITION HERE TOO ? */
    ON bs2.ticker = bs1.ticker
    AND DATEDIFF(second, bs1, bs2) BETWEEN 0 AND 1
Issues:
This requires 6 EventHub consumer groups (each one can only have 5 readers). Why? I don't have 5x6 SELECT statements on the input.
The output doesn't seem consistent (I don't know if my JOINs are correct).
Sometimes the job doesn't output anything at all (maybe a partitioning problem? See the comments in the code about partitioning).
Briefly, is there a better way to achieve this? I couldn't find anything in the docs or examples about defining multiple windows on a single input, joining them, and then joining the results of those joins.
For the first question, this depends on the internal implementation of the scale-out logic. See details here.
For the output of the join, I don't see the whole query, but if you join a query with a 1-minute window to a query with a 2-minute window with a 1s time "buffer", you will only get an output every 2 minutes. The UNION operator will be better for this.
From your sample and your goal, I think there is a much easier way to write this query using UDA (User Defined Aggregate).
For this I will define a UDA function called "ratio" first:
function main() {
    this.init = function () {
        this.sumSell = 0.0;
        this.sumBuy = 0.0;
    }
    this.accumulate = function (value, timestamp) {
        if (value.side == "BUY") { this.sumBuy += value.qty; }
        if (value.side == "SELL") { this.sumSell += value.qty; }
    }
    this.computeResult = function () {
        // Declare result locally and avoid dividing by zero when nothing was sold
        var result;
        if (this.sumSell == 0) {
            result = 0;
        }
        else {
            result = this.sumBuy / this.sumSell;
        }
        return result;
    }
}
Then I can simply use this SQL query for a 60 seconds window:
SELECT
    windowEnd = System.Timestamp,
    ticker,
    uda.ratio(iothub) as ratio
FROM iothub PARTITION BY PartitionId
GROUP BY ticker, PartitionId, SlidingWindow(second, 60)
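The accumulate/compute logic of the "ratio" aggregate can be tried out on a few sample trades outside Stream Analytics. The snippet below is only a plain-Python sketch of that logic: the names `Trade`, `RatioAggregate` and `ratio_in_window` are hypothetical, timestamps are assumed to be in seconds, and the window is assumed to cover the interval (window_end - 60, window_end], which may differ from the service's exact window semantics.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Trade:
    side: str         # 'BUY' or 'SELL'
    qty: float
    timestamp: float  # seconds (assumed unit)


class RatioAggregate:
    """Mirrors the init / accumulate / computeResult steps of the UDA."""

    def __init__(self) -> None:
        self.sum_buy = 0.0
        self.sum_sell = 0.0

    def accumulate(self, trade: Trade) -> None:
        if trade.side == "BUY":
            self.sum_buy += trade.qty
        elif trade.side == "SELL":
            self.sum_sell += trade.qty

    def compute_result(self) -> float:
        # Return 0 instead of dividing by zero when nothing was sold
        return 0.0 if self.sum_sell == 0 else self.sum_buy / self.sum_sell


def ratio_in_window(trades: Iterable[Trade], window_end: float, seconds: float = 60) -> float:
    """Buy/sell quantity ratio over the trades inside one window."""
    aggregate = RatioAggregate()
    for trade in trades:
        if window_end - seconds < trade.timestamp <= window_end:
            aggregate.accumulate(trade)
    return aggregate.compute_result()


trades = [Trade("BUY", 3, 10), Trade("SELL", 2, 20), Trade("BUY", 5, 100)]
print(ratio_in_window(trades, window_end=60))   # 3 bought / 2 sold
print(ratio_in_window(trades, window_end=120))  # no sells in this window
```

Because both sides are accumulated in a single pass, no self-join between a BUY stream and a SELL stream is needed, which is the main simplification the UDA brings.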
I have a data structure in Postgres like the following:
Donations
id  amount  status
1   5000    OK
2   7500    OK

Donation Items
id  donationId  amount  includeInTotals
1   1           2500    false
2   1           2500    false
3   2           7500    false
In pseudo SQL I'm trying to do something like:
SELECT * FROM donations
WHERE SUM(SELECT donation.items WHERE "donationId" = "donations"."id"
AND "includeInTotals" = false) >= "donations"."amount"
Essentially, I want to exclude donations where the total of the "includeInTotals = false" donation items is the same as the donation amount.
This is the code I've got so far for Sequelize:
Models.donations.findAndCountAll({
    where: { status: 'OK' },
    include: {
        model: Models.donation_items,
        as: 'items',
        duplicating: false,
    },
    having: sequelize.literal('COALESCE(SUM(CASE WHEN "items"."includeInTotals" = false THEN items.amount ELSE 0 END), 0) < "donations"."amount"'),
    group: ['"donations.id"', '"items.id"']
});
This is working for donation ID: 2 – it's not returned by the query, but donation ID: 1 is, despite the two amounts equaling 5000.
Is there something wrong I'm doing with SUM?
I can offer an actual query which should work, which is half the answer:
SELECT d1.*
FROM Donations d1
LEFT JOIN
(
    SELECT donationId, SUM(amount) AS amount
    FROM DonationItems di
    WHERE includeInTotals = false
    GROUP BY donationId
) d2
    ON d1.id = d2.donationId AND d1.amount = d2.amount
WHERE
    d2.donationId IS NULL;
While this doesn't touch your Sequelize problem, having the correct raw query is a prerequisite to writing application code that will bear fruit later on. There are other, more efficient ways we could write this query, but the above version would probably be the easiest to integrate into your code.
It looks like the GROUP clause was the problem: grouping by "items.id" as well as "donations.id" put every donation item in its own group, so the SUM only ever covered a single item. I switched to a sub-query in the WHERE clause instead. I added this to my where, which worked for me:
amount: {
    $gt: globals.sequelize.literal(`
        (SELECT COALESCE(SUM(amount), 0) FROM donation_items
        WHERE "donation_items"."includeInTotals" = false
        AND "donation_items"."donationId" = "donations"."id")
    `),
},
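The difference between the two approaches can be reproduced on the sample data with an in-memory SQLite database. This is only a sketch under stated assumptions: SQLite stands in for Postgres, booleans are stored as 0/1, and donation 3 with its single item is a hypothetical extra row added so that at least one donation passes the filter. The two SQL strings are written by hand and are not what Sequelize actually generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE donations (id INTEGER PRIMARY KEY, amount INTEGER, status TEXT);
CREATE TABLE donation_items (
    id INTEGER PRIMARY KEY, donationId INTEGER, amount INTEGER, includeInTotals INTEGER
);
-- Rows 1 and 2 come from the question; row 3 is a hypothetical addition
INSERT INTO donations VALUES (1, 5000, 'OK'), (2, 7500, 'OK'), (3, 1000, 'OK');
INSERT INTO donation_items VALUES
    (1, 1, 2500, 0), (2, 1, 2500, 0), (3, 2, 7500, 0), (4, 3, 400, 0);
""")

# Grouping by the item id as well: every group holds one item, so SUM sees
# 2500 (not 5000) for donation 1 and the donation wrongly passes the filter.
per_item_sql = """
SELECT DISTINCT d.id
FROM donations d
LEFT JOIN donation_items i ON i.donationId = d.id
WHERE d.status = 'OK'
GROUP BY d.id, d.amount, i.id
HAVING COALESCE(SUM(CASE WHEN i.includeInTotals = 0 THEN i.amount ELSE 0 END), 0) < d.amount
ORDER BY d.id
"""

# Correlated sub-query: the SUM covers all items of the donation.
subquery_sql = """
SELECT d.id
FROM donations d
WHERE d.status = 'OK'
  AND d.amount > (
      SELECT COALESCE(SUM(i.amount), 0)
      FROM donation_items i
      WHERE i.includeInTotals = 0 AND i.donationId = d.id
  )
ORDER BY d.id
"""

per_item_ids = [row[0] for row in conn.execute(per_item_sql)]
subquery_ids = [row[0] for row in conn.execute(subquery_sql)]
print(per_item_ids)  # [1, 3]
print(subquery_ids)  # [3]
```

With per-item grouping, donation 1 is returned because each 2500 item is compared to 5000 on its own, which matches the behaviour described in the question. With the sub-query, donations 1 and 2 are both excluded.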