We have a use case where we receive messages from Kafka that need to be aggregated. The aggregation has to work so that when an update arrives for an id that was seen before, the existing value (if any) is subtracted and the new value is added.
From various forums I learned that Jet does not store the raw values, only the aggregated result and some internal data.
In that case, how can I achieve this?
Example
Balance 1 {id:1, amount:100} // aggregated result 100
Balance 2 {id:2, amount:200} // 300
Balance 3 {id:1, amount:400} // 600 after removing 100 and adding 400
I could implement the simple case where every amount is added, but not the aggregation where the existing value has to be subtracted and the new value added.
.rollingAggregate(AggregateOperations.summingDouble(<logic to add/remove>))
.drainTo(Sinks.logger());
Balance 1, 2 and 3 are a sequence of messages.
The comments show the aggregated value after Jet has processed each message.
My aim is to add the new amount if an id arrives for the first time, and to subtract the old amount first if an updated balance arrives, i.e. the id is the same as an earlier one.
You can try a custom aggregate operation which will emit the previous and currently seen values like this:
public static <T> AggregateOperation1<T, ?, Tuple2<T, T>> previousAndCurrent() {
    return AggregateOperation
            // accumulator: slot 0 holds the previous item, slot 1 the current one
            .withCreate(() -> new Object[2])
            .<T>andAccumulate((acc, current) -> {
                acc[0] = acc[1]; // the old current item becomes the previous one
                acc[1] = current;
            })
            .andExportFinish((acc) -> tuple2((T) acc[0], (T) acc[1]));
}
The output should be a tuple of the form (previous, current). You can then apply a rolling aggregate again to that output. To simplify the problem, the input here is a stream of (id, amount) pairs.
Pipeline p = Pipeline.create();
p.drawFrom(Sources.<Integer, Long>mapJournal("map", START_FROM_OLDEST)) // (id, amount)
 .groupingKey(Entry::getKey)
 .rollingAggregate(previousAndCurrent(), (key, val) -> val)
 .rollingAggregate(AggregateOperations.summingLong(e -> {
     long prevValue = e.f0() == null ? 0 : e.f0().getValue();
     long newValue = e.f1().getValue();
     return newValue - prevValue;
 }))
 .drainTo(Sinks.logger());
JetConfig config = new JetConfig();
config.getHazelcastConfig().addEventJournalConfig(new EventJournalConfig().setMapName("map"));
JetInstance jet = Jet.newJetInstance(config);
IMapJet<Object, Object> map = jet.getMap("map");
map.put(0, 1L);
map.put(0, 2L);
map.put(1, 10L);
map.put(1, 40L);
jet.newJob(p).join();
This should produce as output: 1, 2, 12, 42.
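To check the arithmetic without a running cluster, the same "subtract the previous value, add the new one" logic can be sketched in plain Java. This is only an illustration of the idea, not Jet code: the class name ReplacingSum and its update method are made up for this sketch, and a HashMap stands in for the per-key state that Jet keeps in the accumulator.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Jet: keeps the last amount per id and a running total.
class ReplacingSum {
    private final Map<Integer, Long> lastById = new HashMap<>();
    private long total;

    // Applies one (id, amount) update and returns the running total after it.
    long update(int id, long amount) {
        Long previous = lastById.put(id, amount);      // null when the id is seen for the first time
        total += amount - (previous == null ? 0L : previous);
        return total;
    }

    public static void main(String[] args) {
        ReplacingSum sum = new ReplacingSum();
        // The balances from the question: prints 100, 300, 600
        System.out.println(sum.update(1, 100L));
        System.out.println(sum.update(2, 200L));
        System.out.println(sum.update(1, 400L));
    }
}
```

Feeding it the four map.put calls from the answer, (0, 1), (0, 2), (1, 10), (1, 40), gives the same running totals 1, 2, 12, 42.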
I have two large datasets. There are multiple groupings of the same ids. Each group has a score. I'm trying to broadcast the score to each id in each group. But I have a nice constraint that I don't care about groups with more than 1000 ids.
Unfortunately, Spark keeps reading the full grouping. I can't seem to figure out a way to push down the limit so that Spark only reads up to 1000 records, and if there are any more gives up.
So far I've tried this:
def run: Unit = {
  // ...
  val scores: RDD[(GroupId, Score)] = readScores(...)
  val data: RDD[(GroupId, Id)] = readData(...)
  val idToScore: RDD[(Id, Score)] = scores.cogroup(data)
    .flatMap(maxIdsPerGroupFilter(1000))
  // ...
}
def maxIdsPerGroupFilter(maxIds: Int)(t: (GroupId, (Iterable[Score], Iterable[Id]))): Iterator[(Id, Score)] = {
  t match {
    case (groupId: GroupId, (scores: Iterable[Score], ids: Iterable[Id])) =>
      if (!scores.iterator.hasNext) {
        return Iterator.empty
      }
      val score: Score = scores.iterator.next()
      val iter = ids.iterator
      val uniqueIds: mutable.HashSet[Id] = new mutable.HashSet[Id]
      while (iter.hasNext) {
        uniqueIds.add(iter.next())
        if (uniqueIds.size > maxIds) {
          return Iterator.empty
        }
      }
      uniqueIds.map((_, score)).iterator
  }
}
(Even with variants where the filter function just returns empty iterators, Spark still is insistent on reading all the data)
The side effect of this is that because some groups have too many ids, I have a lot of skew in the data and the job can never finish when processing the full scale of data.
I want the reduce-side to only read in the data it needs, and not crap out because of data skew.
I have a feeling that somehow I need to create a transform that is able to push down a limit or take clause, but I can't figure out how.
Can't you simply filter out the groups that have more than 1,000 records by using count() on the grouped data?
Or, if you also want to keep the groups with more than 1,000 records but pick at most 1,000 records from each, you can use ROW_NUMBER() OVER (PARTITION BY id ORDER BY someColumn DESC) AS rn in a Spark SQL query and then add the condition rn <= 1000.
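The first suggestion is a two-pass idea: count the records per group, then keep only the pairs whose group is small enough. A minimal sketch of that logic follows, with plain Java collections standing in for the RDDs; the class and method names are hypothetical, and in Spark the count would presumably be a per-key aggregation followed by a filter or join rather than an in-memory map.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration: (groupId, id) pairs are modelled as Map.Entry<String, String>.
class GroupSizeFilter {
    // Keeps only the pairs whose group has at most maxIds records.
    static List<Map.Entry<String, String>> keepSmallGroups(
            List<Map.Entry<String, String>> pairs, long maxIds) {
        // Pass 1: number of records per group id.
        Map<String, Long> sizes = pairs.stream()
                .collect(Collectors.groupingBy(p -> p.getKey(), Collectors.counting()));
        // Pass 2: drop every pair that belongs to an oversized group.
        return pairs.stream()
                .filter(p -> sizes.get(p.getKey()) <= maxIds)
                .collect(Collectors.toList());
    }
}
```

Note that this counts records, not distinct ids, so duplicates within a group would have to be removed first if the limit is meant to apply to unique ids as in the question.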
In my AutoQuery request I have a left join specified so I can query on properties in the joined table.
public class ProductSearchRequest : QueryDb<Book>
, ILeftJoin<Book, BookAuthor>, ILeftJoin<BookAuthor, Author>
{}
If I use the standard AutoQuery approach like so:
var q = AutoQuery.CreateQuery(request, base.Request);
var results = AutoQuery.Execute(request, q);
and 100 results are requested, then often fewer than 100 are returned, because Take() is applied to the rows produced by the left join.
To remedy this I am doing this instead:
var q = AutoQuery.CreateQuery(request, base.Request);
q.OrderByExpression = null; // the count query throws an error if an ORDER BY is present
var total = Db.Scalar<int>(q.Select(x => Sql.CountDistinct(x.Id))); // returns 0
var q1 = AutoQuery.CreateQuery(request, base.Request).GroupBy(x => x);
var results = Db.Select<Book>(q1);
return new QueryResponse<Book>
{
    Offset = q1.Offset.GetValueOrDefault(0),
    Total = total,
    Results = results
};
The GroupBy() appears to return the correct number of results, so paging works, but the Total comes back as 0.
I also tried:
var total2 = (int)Db.Count(q1);
But even though q1 has a GroupBy(), it returns the number of rows including the left join, not the number of results of the actual query.
How can I get the true total of the query?
(Getting some official docs on how to do paging and totals with autoquery & left join would be very helpful as right now it's a bit confusing)
Your primary issue stems from trying to return a different total than that of the actual query AutoQuery executes. If you have multiple left joins, the total is the total number of results of the query it executes, not the number of rows in your source table.
So you're not looking for the "true total"; rather, you want to execute a different query to get a different total than the one for the executed query, while still deriving it from the original query. First consider using normal INNER JOINs (IJoin<>) instead of LEFT JOINs, so that only results with related rows in the joined tables are returned, which the total will then reflect.
Your total query that returns 0 is likely returning no results, so I'd inspect it in an SQL profiler to see the query that is actually executed. You can also log OrmLite queries by enabling Debug logging and adding this to your AppHost:
OrmLiteUtils.PrintSql();
Also note that a GroupBy() on the entire table is unusual; you would normally group by one or more explicitly selected columns, e.g.:
.GroupBy(x => x.Id);
.GroupBy(x => new { x.Id, x.Name });
I have this autoquery implementation
var q = AutoQuery.CreateQuery(request, base.Request).SelectDistinct();
var results = Db.Select<ProductDto>(q);
return new QueryResponse<ProductDto>
{
    Offset = q.Offset.GetValueOrDefault(0),
    Total = (int)Db.Count(q),
    Results = results
};
The request has some joins:
public class ProductSearchRequest : QueryDb<GardnerRecord, ProductDto>
, ILeftJoin<GardnerRecord, RecordToBicCode>, ILeftJoin<RecordToBicCode, GardnerBicCode>
{
}
The records are returned correctly but the total is wrong. I can see 40,000 records in the database, but it tells me there are 90,000. There are multiple RecordToBicCode rows for each GardnerRecord, so it's giving me the number of records multiplied by the number of RecordToBicCode rows.
How do I match the total to the number of GardnerRecord rows matching the query?
I am using PostgreSQL, so I need the count statement to look like
select count(distinct r.id) from gardner_record r etc...
Does OrmLite have a way to do this?
I tried:
var q2 = q;
q2.SelectExpression = "select count(distinct \"gardner_record\".\"id\")";
q2.OrderByExpression = null;
var count = Db.Select<int>(q2);
But I get object reference not set error.
AutoQuery is returning the correct total count for your query, which has left joins and so will naturally return more results than there are rows in the original source table.
You can perform a distinct count with:
Total = Db.Scalar<long>(q.Select(x => Sql.CountDistinct(x.Id)));
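Why the plain count is inflated can be seen with a small in-memory sketch. It is written in plain Java rather than OrmLite, and the row layout (record id in column 0, joined code id in column 1) is an assumption made only for the illustration: a join emits one row per matching pair, so counting rows counts a record once per joined code, while counting distinct record ids counts it once.

```java
import java.util.List;

// Hypothetical illustration of COUNT(*) versus COUNT(DISTINCT id) over joined rows.
class JoinCount {
    // Every joined row counts, like COUNT(*) on the joined query.
    static long rowCount(List<long[]> joinedRows) {
        return joinedRows.size();
    }

    // Each record id counts once, like COUNT(DISTINCT r.id).
    static long distinctRecordCount(List<long[]> joinedRows) {
        return joinedRows.stream().mapToLong(row -> row[0]).distinct().count();
    }
}
```

With three records that have two, one and three joined codes respectively, rowCount gives 6 while distinctRecordCount gives 3.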
I'm having trouble with data being lost on insert into Cassandra.
I am doing large bulk inserts from CSV files, which I read via a Stream. The data is duplicated into two tables because of different queries. After every 30,000 elements I start a new partition (chunkCounter).
private PersistenceInformation persist(final String period, final String tradePartner, final Integer version, Stream<Transaction> transactions) {
    int elementsInChunkCounter = 0;
    int chunkCounter = 1;
    int elementCounter = 0;
    Iterator<Transaction> iterator = transactions.filter(beanValidator).iterator();
    List<List<?>> listImportData = new ArrayList<>(30000);
    List<List<?>> listGtins = new ArrayList<>(30000);
    while (iterator.hasNext()) {
        Transaction tr = iterator.next();
        List<Object> importTemp = new ArrayList<>(9);
        importTemp.add(period);
        importTemp.add(tradePartner);
        importTemp.add(version);
        importTemp.add(chunkCounter);
        importTemp.add(tr.getMdhId());
        importTemp.add(tr.getGtin());
        importTemp.add(tr.getQuantity());
        importTemp.add(tr.getTransactionId());
        importTemp.add(tr.getTimestamp());
        listImportData.add(importTemp);
        List<Object> gtinTemp = new ArrayList<>(8);
        gtinTemp.add(period);
        gtinTemp.add(tradePartner);
        gtinTemp.add(version);
        gtinTemp.add(chunkCounter);
        gtinTemp.add(tr.getMdhId());
        gtinTemp.add(tr.getGtin());
        gtinTemp.add(tr.getQuantity());
        gtinTemp.add(tr.getTimestamp());
        listGtins.add(gtinTemp);
        elementsInChunkCounter++;
        elementCounter++;
        if (elementsInChunkCounter == 30000) {
            elementsInChunkCounter = 0;
            chunkCounter++;
            ingestImportData(listImportData);
            listImportData.clear();
            ingestGtins(listGtins);
            listGtins.clear();
        }
    }
    if (!listImportData.isEmpty()) {
        ingestImportData(listImportData);
    }
    if (!listGtins.isEmpty()) {
        ingestGtins(listGtins);
    }
    return new PersistenceInformation();
}
private void ingestImportData(List<List<?>> list) {
    String cqlIngest = "INSERT INTO import_data (pd, tp , ver, chunk, mdh_id, gtin, qty, id, ts) VALUES (?,?,?,?,?,?,?,?,?)";
    cassandraOperations.ingest(cqlIngest, list);
}
private void ingestGtins(List<List<?>> list) {
    String cqlIngest = "INSERT INTO gtins (pd, tp, ver, chunk, mdh_id, gtin, qty, ts) VALUES (?,?,?,?,?,?,?,?)";
    cassandraOperations.ingest(cqlIngest, list);
}
This worked pretty well until I noticed that sometimes a dataset goes missing. There is an entry in the second table (gtins) but the data set in the main table was not inserted. The application counted it but the database did not write it.
The table is built this way:
CREATE TABLE import_data (
    tp text,
    pd text,
    ver int,
    chunk int,
    mdh_id uuid,
    gtin text,
    qty float,
    id text,
    ts timestamp,
    PRIMARY KEY ((tp, pd, ver, chunk), ts, mdh_id)) WITH CLUSTERING ORDER BY (ts DESC);
The mdh_id is a UUID generated by my application, so that every data set has a unique key and is not accidentally overwritten.
The Cassandra log files didn't even show a warning.
At the moment I am evaluating BatchStatement, but I have to execute the batch after every 8 data sets because of the 5 KB limit; otherwise the database loses even more entries.
Any suggestions as to what is going wrong in my application are highly appreciated. Thanks a lot.
I have a column named record_number of type varchar whose data has the following format: [currentYear]-[Number], e.g. 2015-11.
I need to find the maximum number in this column, i.e. if the value that holds the maximum is 2015-15, then the result should be 15; however, if the column has a value of 2016-2, then the maximum should be 2.
How can I do it in JPQL?
I'm using Postgres and EJB 3.1
You can use the JPQL SUBSTRING function:
select table From Table table order by SUBSTRING(table.record_number, 5) desc;
To get only the first result, call setMaxResults on the query, like this:
em.createQuery("select table From Table table order by SUBSTRING(table.record_number, 5) desc")
    .setMaxResults(1) // only the first result
    .getResultList();
I managed to fix the problem based on Dherik's comment.
I used the following query to get the object that holds the correct value, which seems more efficient than the one proposed by Dherik:
final TypedQuery<Table> query = createTypedQuery("from Table t where t.recordNumber = (select max(t2.recordNumber) from Table t2)", Table.class);
Table t= null;
try {
t = query.getSingleResult();
}catch(Exception e){
//handle Exception Here
}
return t;
The trick is that, since my app creates the record number, I changed the method that creates it to format the number with 2 digits. This avoids a wrong string comparison (the case where '9' is considered greater than '10').
// format numbers below 10 so that they have 2 digits
final String formattedNumber = String.format("%02d", number);
final int year = SomeUtilClass.getYearFromDate(new Date());
return new StringBuilder().append(year).append("-").append(formattedNumber).toString();
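The effect of the padding can be checked with a small sketch. The class RecordNumbers and its two methods are hypothetical helpers written for this illustration, not the code of the application above: format builds a record number with a two-digit suffix, and numberPart parses the part after the hyphen as an integer.

```java
// Hypothetical helpers illustrating why zero-padding fixes the string comparison.
class RecordNumbers {
    // Builds "<year>-<number>" with the number padded to at least two digits.
    static String format(int year, int number) {
        return String.format("%d-%02d", year, number);
    }

    // Extracts the numeric part after the hyphen, e.g. "2016-2" -> 2.
    static int numberPart(String recordNumber) {
        return Integer.parseInt(recordNumber.substring(recordNumber.indexOf('-') + 1));
    }

    public static void main(String[] args) {
        // Unpadded: '9' sorts after '1', so "2015-9" compares greater than "2015-10".
        System.out.println("2015-9".compareTo("2015-10") > 0);              // true
        // Padded: "2015-09" now compares less than "2015-10".
        System.out.println(format(2015, 9).compareTo(format(2015, 10)) < 0); // true
    }
}
```

Two-digit padding only helps up to 99 per year: "2015-100" would again compare as smaller than "2015-99", so a wider padding or a numeric comparison of the parsed suffix would be needed beyond that.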