We have a requirement to group by multiple fields, dynamically, on a huge data set stored in a Hazelcast Jet cluster. Example: the Person class contains four fields: age, name, city and country. We first need to group by city, then by country, and then we may group by name based on conditional parameters.
We already tried using distributed collections, which did not work. Even when we tried the Pipeline API, it threw an error.
Code:
IMap res = client.getMap("res"); // res is a distributed map
Pipeline p = Pipeline.create();
JobConfig jobConfig = new JobConfig();
p.drawFrom(Sources.<Person>list("inputList"))
.aggregate(AggregateOperations.groupingBy(Person::getCountry))
.drainTo(Sinks.map(res));
jobConfig = new JobConfig();
jobConfig.addClass(Person.class);
jobConfig.addClass(HzJetListClientPersonMultipleGroupBy.class);
Job job = client.newJob(p, jobConfig);
job.join();
Then we read from the map in the client and destroy it.
Error Message on the server:
Caused by: java.lang.ClassCastException: java.util.HashMap cannot be cast to java.util.Map$Entry
groupingBy aggregates all the input items into a HashMap where the key is extracted using the given function. In your case it aggregates a stream of Person items into a single HashMap<String, List<Person>> item.
You need to use this:
p.drawFrom(Sources.<Person>list("inputList"))
.groupingKey(Person::getCountry)
.aggregate(AggregateOperations.toList())
.drainTo(Sinks.map(res));
This will populate the res map with the list of persons for each country.
Remember, without groupingKey() the aggregation is always global. That is, all items in the input will be aggregated to one output item.
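The same distinction can be seen with plain java.util.stream, which is a useful mental model here (a minimal stdlib sketch, not Jet code; the Person record is illustrative): Collectors.groupingBy collapses the whole stream into a single HashMap, just as Jet's aggregate(groupingBy(...)) without a groupingKey() collapses the whole source into one item.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupingDemo {
    record Person(String name, String country) {}

    public static void main(String[] args) {
        List<Person> people = List.of(
                new Person("Alice", "DE"),
                new Person("Bob", "DE"),
                new Person("Carol", "FR"));

        // Global aggregation: the whole stream becomes ONE Map item,
        // analogous to aggregate(...) without groupingKey() in Jet.
        Map<String, List<Person>> grouped = people.stream()
                .collect(Collectors.groupingBy(Person::country));

        System.out.println(grouped.get("DE").size()); // 2
        System.out.println(grouped.get("FR").size()); // 1
    }
}
```

A map sink, by contrast, expects a stream of Map.Entry items, which is why feeding it one whole HashMap produces the ClassCastException above.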
I am new to DynamoDB and need some suggestions from experienced people here. There is a table created with the model below:
orderId - PartitionKey
stockId
orderDetails
There is a new requirement to fetch all the orderIds that include a particular stockId. An item in the table looks like:
{
"orderId":"ord_12234",
"stockId":[
123221,
234556,
123231
],
"orderDetails":{
"createdDate":"",
"dateOfDel":""
}
}
Given that stockId can be an array of IDs, it can't be made a GSI. Performing a scan would be heavy, as the table has a large number of records and keeps growing. What would be the best option here? How can the existing table be modified to achieve this efficiently?
You definitely want to avoid scanning the table. One option is to modify your schema to a Single Table Design where you have order items and order/stock items.
For example:
pk               sk               orderDetails                      stockId  ...
order#ord_12234  order#ord_12234  {createdDate:xxx, dateOfDel:yyy}           ...
order#ord_12234  stock#123221                                      123221   ...
order#ord_12234  stock#234556                                      234556   ...
order#ord_12234  stock#123231                                      123231   ...
You can then issue the following queries, as needed:
get the order details with a query on pk=order#ord_12234, sk=order#ord_12234
get the stocks for a given order with a query on pk=order#ord_12234, sk begins_with stock#
get everything associated with the order with a query on pk=order#ord_12234
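To make those access patterns concrete, here is a minimal plain-Java sketch (a TreeMap standing in for the table, not the AWS SDK; all names are illustrative) of why the composite pk/sk key works: a sort-key prefix condition such as "sk begins_with stock#" is just a range scan within one partition's sorted keys.

```java
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

public class SingleTableSketch {
    public static void main(String[] args) {
        // Model the table as a sorted map keyed by "pk|sk", mimicking
        // DynamoDB's partition key + sort key layout.
        NavigableMap<String, String> table = new TreeMap<>();
        table.put("order#ord_12234|order#ord_12234", "{createdDate:xxx, dateOfDel:yyy}");
        table.put("order#ord_12234|stock#123221", "123221");
        table.put("order#ord_12234|stock#234556", "234556");
        table.put("order#ord_12234|stock#123231", "123231");

        // "sk begins_with stock#" becomes a range query over the prefix.
        String prefix = "order#ord_12234|stock#";
        SortedMap<String, String> stocks = table.subMap(prefix, prefix + '\uffff');

        System.out.println(stocks.size()); // 3 stock items for this order
    }
}
```

Reversing the question ("which orders contain stock 123221?") would additionally need a GSI with stockId as the partition key of the stock items, which is possible here precisely because each stock is its own item rather than an element of a list attribute.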
I am using a JDBC poller to process some DB records, and when the workflow is finished I need to update those records. I could not find a way to make it work for tables with compound keys.
This is my example: table EVENTS, with primary key (DATETIME, EVENT_LOCATION, EVENT_TYPE). I cannot change the schema.
Rows are mapped into a POJO with the property names: dateTime, location, type.
<int-jdbc:inbound-channel-adapter
    query="select * from events where uploaded = 0"
    channel="fromdb" data-source="dataSource"
    max-rows="${app.maxrows}"
    row-mapper="eventRowMapper"
    update="update events set uploaded=1 where DATETIME = :dateTime AND EVENT_LOCATION = :location AND EVENT_TYPE = :type">
    <int:poller fixed-delay="${app.intervalmsecs}" />
</int-jdbc:inbound-channel-adapter>
But I get a syntax error from the server when the poller tries to update those records.
After reading the docs, it seems that the poller uses (:id) to update the rows, but that assumes a single-column PK. I could not find any information about updating rows whose primary key spans multiple columns.
Is there any way to update rows with a multiple-column primary key? Or should I use an outbound JDBC adapter, or code my own update solution?
Please show the complete stack trace, your Event object, and your row mapper. I just changed one of the tests from
JdbcPollingChannelAdapter adapter = new JdbcPollingChannelAdapter(embeddedDatabase,
"select * from item where id not in (select id from copy)");
adapter.setUpdateSql("insert into copy values(:id,10)");
to
JdbcPollingChannelAdapter adapter = new JdbcPollingChannelAdapter(embeddedDatabase,
"select * from item where id not in (select foo from copy)");
adapter.setUpdateSql("insert into copy values(:foo,:status)");
and it worked just fine.
As long as the column appears as a property of the result of the select query, it will work (the result being the object created by the row mapper).
I.e. dateTime, location and type must be properties on Event.
Also, based on your update query, it looks like you should set update-per-row to true, since the statement updates only one row.
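The key point is that each :name placeholder in the update SQL is resolved against a property of the object produced by the row mapper. This hypothetical sketch (not Spring's actual implementation; the bind method and a Map standing in for the bean's getters are illustrative) shows the mechanism, and why dateTime, location and type must exist on Event:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedParamSketch {
    // Resolve ":name" placeholders against the mapped row's properties
    // (a Map stands in for the bean's getters here).
    static String bind(String sql, Map<String, Object> row) {
        Matcher m = Pattern.compile(":(\\w+)").matcher(sql);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Object value = row.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("No property: " + m.group(1));
            }
            m.appendReplacement(out, "'" + value + "'");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> event = Map.of(
                "dateTime", "2023-01-01", "location", "BER", "type", "ARRIVAL");
        // Prints the update SQL with every placeholder bound from the row's properties.
        System.out.println(bind(
                "update events set uploaded=1 where DATETIME=:dateTime "
                + "AND EVENT_LOCATION=:location AND EVENT_TYPE=:type", event));
    }
}
```

If any placeholder has no matching property, binding fails, which is consistent with the poller erroring when the update references names the mapped object doesn't expose. (Real named-parameter binding uses a PreparedStatement rather than string substitution.)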
I have a requirement where I need to process log lines using Spark. One of the steps in the processing is to look up certain values in an external DB.
For example:
My log line contains multiple key-value pairs. One of the keys present in the log is "key1". This key needs to be used for the lookup call.
I don't want to make multiple sequential lookup calls to the external DB, one for each value of "key1" in the RDD. Rather, I want to create a list of all values of "key1" present in the RDD and then make a single lookup call to the external DB.
My code to extract the key from each log line looks as follows:
lines.foreachRDD { rdd =>
  rdd.map(line => extractKey(line))
  // next step is lookup
  // then further processing
}
The .map function will be called for each log line, so I am not sure how I can create a list of keys that can be used for the external lookup.
Thanks
Use collect.
lines.foreachRDD { rdd =>
  val keys = rdd.map(line => extractKey(line)).collect()
  // here you can use the keys array
}
You'll probably also want to use foreachPartition:
lines.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    val keys = iter.map(line => extractKey(line)).toArray
    // here you can use the keys array
  }
}
There will be one call per partition; this approach also avoids serialization problems.
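The batching idea itself is independent of Spark. This plain-Java sketch (all names, including extractKey and lookupBatch, are illustrative stand-ins for the user's key extraction and external-DB call) shows the pattern: gather and de-duplicate the keys of each partition first, then make one lookup call per partition instead of one per line.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class PartitionLookupSketch {
    public static void main(String[] args) {
        // Two "partitions" of log lines.
        List<List<String>> partitions = List.of(
                List.of("key1=a msg=x", "key1=b msg=y"),
                List.of("key1=a msg=z"));

        // Extract the value of the first key-value pair from a log line.
        Function<String, String> extractKey = line -> line.split(" ")[0].split("=")[1];

        int[] lookupCalls = {0};
        // Stand-in for the external DB: one round-trip resolves a whole batch of keys.
        Function<Set<String>, Map<String, String>> lookupBatch = keys -> {
            lookupCalls[0]++;
            Map<String, String> result = new HashMap<>();
            keys.forEach(k -> result.put(k, "value-of-" + k));
            return result;
        };

        for (List<String> partition : partitions) {
            Set<String> keys = new HashSet<>();
            partition.forEach(line -> keys.add(extractKey.apply(line)));
            Map<String, String> looked = lookupBatch.apply(keys); // one call per partition
            // ... further processing of the lines using `looked` ...
        }
        System.out.println(lookupCalls[0]); // 2 calls, one per partition
    }
}
```

With collect() the same pattern degenerates to a single call for the whole RDD, at the cost of pulling every key to the driver.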
It looks like you want this:
lines.groupByKey().filter(...)
Could you provide more info?
Looking for a way to do a batch update using Slick. Is there an updateAll equivalent to insertAll? Google research has failed me thus far.
I have a list of case classes with varying status, each one having a different numeric value, so I cannot run the typical update query. At the same time, I want to batch the multiple update requests, as there could be thousands of records I want to update at the same time.
Sorry to answer my own question, but what I ended up doing is just dropping down to JDBC and doing a batch update.
private def batchUpdateQuery = "update table set value = ? where id = ?"

/**
 * Dropping to JDBC because Slick doesn't support this batched update.
 */
def batchUpdate(batch: List[MyCaseClass])(implicit subject: Subject, session: Session) = {
  val pstmt = session.conn.prepareStatement(batchUpdateQuery)
  batch.foreach { myCaseClass =>
    pstmt.setString(1, myCaseClass.value)
    pstmt.setString(2, myCaseClass.id)
    pstmt.addBatch()
  }
  session.withTransaction {
    pstmt.executeBatch()
  }
}
It's not clear to me what you are trying to achieve. Insert and update are two different operations: a bulk function makes sense for insert, but in my opinion it doesn't for update; in fact, in SQL you can just write something like this:
UPDATE
SomeTable
SET SomeColumn = SomeValue
WHERE AnotherColumn = AnotherValue
Which translates to update SomeColumn with the value SomeValue for all the rows which have AnotherColumn equal to AnotherValue.
In Slick this is a simple filter combined with map and update:
table
  .filter(_.someColumn === someValue)
  .map(_.fieldToUpdate)
  .update(newValue)
If instead you want to update the whole row, just drop the map and pass a row object to the update function.
Edit:
If you want to update different case classes, I'm led to think that these case classes are rows defined in your schema, and if that's the case you can pass them directly to the update function, since it is defined as:
def update(value: T)(implicit session: Backend#Session): Int
For the second problem I can't suggest a solution; looking at the JdbcInvokerComponent trait, it looks like the update function invokes the execute method immediately:
def update(value: T)(implicit session: Backend#Session): Int = session.withPreparedStatement(updateStatement) { st =>
st.clearParameters
val pp = new PositionedParameters(st)
converter.set(value, pp, true)
sres.setter(pp, param)
st.executeUpdate
}
Probably because you can actually run only one update query at a time per table, not multiple updates on multiple tables, as also stated in this SO question; but you can of course update multiple rows on the same table.
Is there a way I can update multiple rows in a Cassandra database using a column family template, e.g. by supplying a list of keys?
Currently I am using an updater ColumnFamilyTemplate to loop through a list of keys and do an update for each row. I have seen queries like MultigetSliceQuery, but I don't know their equivalent for doing updates.
There is no utility method in ColumnFamilyTemplate that allows you to just pass a list of keys with a list of mutations in one call.
You can implement your own using mutators.
This is the basic code for how to do it in Hector:
Set<String> keys = MY_KEYS;
Map<String, String> pairsOfNameValues = MY_MUTATION_BY_NAME_AND_VALUE;

Set<HColumn<String, String>> columns = new HashSet<HColumn<String, String>>();
for (Entry<String, String> pair : pairsOfNameValues.entrySet()) {
    columns.add(HFactory.createStringColumn(pair.getKey(), pair.getValue()));
}

Mutator<String> mutator = template.createMutator();
String columnFamilyName = template.getColumnFamily();
for (String key : keys) {
    for (HColumn<String, String> column : columns) {
        mutator.addInsertion(key, columnFamilyName, column);
    }
}
mutator.execute();
Well, it should look like that. This is an example for insertion; be sure to use the following methods for batch mutations:
mutator.addInsertion
mutator.addDeletion
mutator.addCounter
mutator.addCounterDeletion
since these will execute right away, without waiting for mutator.execute():
mutator.incrementCounter
mutator.deleteCounter
mutator.insert
mutator.delete
As a last note: a mutator allows you to batch mutations on multiple rows and multiple column families at once, which is why I generally prefer to use them instead of CF templates. I do a lot of denormalization for functionality that uses the "push-on-write" pattern of NoSQL.
You can use a batch mutation to insert as many mutations as you want (within thrift_max_message_length_in_mb). See http://hector-client.github.com/hector//source/content/API/core/1.0-1/me/prettyprint/cassandra/model/MutatorImpl.html.