Quick processing of large file - multithreading

I have a Grails project for an e-commerce app, which also has a background job for data processing. Recently we received an XML file from a customer with around 300+ orders (roughly 35,000+ lines in the file). This file takes around 9 hours to process, though the customer expects it to be processed in 2-3 hours.
The most obvious solution that came to mind was to use threading to process 10-15 orders concurrently and reduce the processing time. I implemented this with the Java executors framework and created a pool of, say, 3 threads. The Grails project uses Hibernate as its ORM.
The process followed is as below:
1.) Address details are verified and, if not present, created in the DB.
2.) Item details are verified and, if not present, created in the DB.
3.) The order against the saved item is created/updated.
Now, for testing purposes, 3 orders are submitted to the 3 threads in the pool. The first thread to reach item creation (Step 2) locks the item record, and the other two threads keep waiting at that point. Once the first thread completes all of its operations, the other two threads fail with the exception below.
org.springframework.dao.CannotAcquireLockException: could not insert: [com.test.ABCITem]; SQL [insert into party_item_relations (version, active_flag, amazonasin, b2b_downloaded, baseuom, bom, card_size, case_size, colour, configured, cost, country_of_origin, currency, date_created, drop_ship, end_date, external_id, flex_field1, flex_field10, flex_field2, flex_field3, flex_field4, flex_field5, flex_field6, flex_field7, flex_field8, flex_field9, gender, gtin_code, height, heightuom, image_file, ingredients, internal_item, item_category, item_description, kanban_enabled, key_terms, last_updated, lead_time, length, lengthuom, long_description, manufacturer, model, moq, number_of_cards, over_ship_percent, pack_size, parent_item_category, parentitem_id, party1_id, party1item_number, party2_id, party2item_number, planner_code, price, product_family, product_group, product_type_name, sales_channel, scent, size, source_rank, start_date, status, tam_split, upc_code, variation_theme, weight, weightuom, width, widthuom) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)]; nested exception is org.hibernate.exception.LockAcquisitionException: could not insert: [com.test.ABCITem]
at org.springframework.orm.hibernate3.SessionFactoryUtils.convertHibernateAccessException(SessionFactoryUtils.java:639)
at org.springframework.orm.hibernate3.HibernateAccessor.convertHibernateAccessException(HibernateAccessor.java:412)
at org.springframework.orm.hibernate3.HibernateTemplate.doExecute(HibernateTemplate.java:411)
at org.springframework.orm.hibernate3.HibernateTemplate.execute(HibernateTemplate.java:339)
at org.codehaus.groovy.grails.orm.hibernate.metaclass.SavePersistentMethod.performSave(SavePersistentMethod.java:56)
at org.codehaus.groovy.grails.orm.hibernate.metaclass.AbstractSavePersistentMethod.doInvokeInternal(AbstractSavePersistentMethod.java:215)
at org.codehaus.groovy.grails.orm.hibernate.metaclass.AbstractDynamicPersistentMethod.invoke(AbstractDynamicPersistentMethod.java:63)
at sun.reflect.GeneratedMethodAccessor425.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.springsource.loaded.ri.ReflectiveInterceptor.jlrMethodInvoke(ReflectiveInterceptor.java:1259)
at org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite$PojoCachedMethodSite.invoke(PojoMetaMethodSite.java:189)
at org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite.call(PojoMetaMethodSite.java:53)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:124)
at org.codehaus.groovy.grails.orm.hibernate.HibernateGormInstanceApi.save(HibernateGormEnhancer.groovy:911)
at com.test.ABCITem.save(ABCITem.groovy)
at com.test.ABCITem$save$0.call(Unknown Source)
at com.test.MasterDataService.createPartyItemRelation(MasterDataService.groovy:259)
at grails.plugin.executor.PersistenceContextRunnableWrapper.run(PersistenceContextRunnableWrapper.groovy:34)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.hibernate.exception.LockAcquisitionException: could not insert: [com.test.ABCITem]
at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:107)
at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:66)
at org.hibernate.id.insert.AbstractReturningDelegate.performInsert(AbstractReturningDelegate.java:63)
at org.hibernate.persister.entity.AbstractEntityPersister.insert(AbstractEntityPersister.java:2346)
at org.hibernate.persister.entity.AbstractEntityPersister.insert(AbstractEntityPersister.java:2853)
at org.hibernate.action.EntityIdentityInsertAction.execute(EntityIdentityInsertAction.java:71)
at org.hibernate.engine.ActionQueue.execute(ActionQueue.java:273)
at org.codehaus.groovy.grails.orm.hibernate.support.ClosureEventTriggeringInterceptor.performSaveOrReplicate(ClosureEventTriggeringInterceptor.java:250)
at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:203)
at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:129)
at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:210)
at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:195)
at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:117)
at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:93)
at org.codehaus.groovy.grails.orm.hibernate.support.ClosureEventTriggeringInterceptor.onSaveOrUpdate(ClosureEventTriggeringInterceptor.java:108)
at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:685)
at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:677)
at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:673)
at org.codehaus.groovy.grails.orm.hibernate.metaclass.SavePersistentMethod$1.doInHibernate(SavePersistentMethod.java:58)
at org.springframework.orm.hibernate3.HibernateTemplate.doExecute(HibernateTemplate.java:406)
... 110 more
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.springsource.loaded.ri.ReflectiveInterceptor.jlrConstructorNewInstance(ReflectiveInterceptor.java:991)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
at com.mysql.jdbc.Util.getInstance(Util.java:386)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1064)
I have looked at various solutions for this error, but most of them suggest ordering the fields in the where clause. Other solutions, like increasing the InnoDB lock timeout, have not worked.
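(For clarity: the InnoDB timeout referred to above is presumably the lock wait timeout, which is typically raised along these lines; this did not help here.)
SET GLOBAL innodb_lock_wait_timeout = 120; -- default is 50 seconds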
My project structure looks like below. (This is pseudocode, not the exact classes.)
1.) Runnable implementation that executes each order as a separate task.
public class BackGroundTask implements Runnable{
    private DataUtilsService dataUtilsService;   // injected service that does the actual processing
    private List<JSONObject> requests;

    public BackGroundTask(final List<JSONObject> requests){
        this.requests = Lists.newArrayList(requests);   // defensive copy (Guava)
    }

    @Override
    public void run() {
        this.dataUtilsService.processData(this.requests);
    }
}
2.) Third-party service that receives the XML file and submits a new task to the executor pool for each order.
public class ThirdPartyService{
    def executorService   // thread pool provided by the Grails executor plugin

    public void process(File file){
        Document doc = xmlParser.parse(file);
        NodeList orders = doc.getElementsByTagName(TAG_ORDER);
        for (int i = 0; i < orders.getLength(); i++) {
            Element order = (Element) orders.item(i);
            // order converted to List<JSONObject> and passed below.
            executorService.submit(new BackGroundTask(requests));
        }
    }
}
3.) Processor service that actually processes the data.
public class DataUtilsService{
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void processData(List<JSONObject> requests){
        // Loops through the requests and processes the data.
        item.save();
    }
}
Please suggest what else, other than threads, can be used here for concurrent processing so as to reduce the processing time.
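One direction sometimes suggested for this kind of contention, as a sketch only (it assumes the deadlocks come from parallel transactions inserting the same item/address rows; the class and method names below are illustrative, not part of the original project): resolve the shared master data serially first, then fan the per-order work out to the pool so the parallel transactions only touch rows belonging to their own order.
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class OrderImportCoordinator {

    private final ExecutorService pool = Executors.newFixedThreadPool(3);

    public void importOrders(List<List<JSONObject>> orders) throws InterruptedException {
        // Phase 1: verify/create shared master data (items, addresses) serially,
        // so no two transactions race to insert the same row.
        for (List<JSONObject> order : orders) {
            preCreateItemsAndAddresses(order);
        }
        // Phase 2: process the orders in parallel; each task now only
        // inserts/updates rows belonging to its own order.
        for (List<JSONObject> order : orders) {
            pool.submit(new BackGroundTask(order));
        }
        pool.shutdown();
        pool.awaitTermination(3, TimeUnit.HOURS);
    }

    private void preCreateItemsAndAddresses(List<JSONObject> order) {
        // steps 1 and 2 from the flow above, executed on the submitting thread
    }
}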

Related

How to perform multiple inserts in a knex transaction with dependent validity checks between inserts?

I'm writing a multi-step knex transaction that needs to insert multiple records into a table. Each of these records needs to pass a validity check written in Node before it is inserted.
I would batch-perform the validity checks and then insert all at once, except that two records might invalidate each other. I.e. if I insert record 1, record 2 might no longer be valid.
Unfortunately, I can't seem to query the database in a transaction-aware fashion. During the validity check of the second record, my queries (used for the validity checks) do not show that the first insert exists.
I'm using the trx (transaction) connection object rather than the base knex object. I expected this would fix it, since the transaction connection object is supposed to be promise aware, but alas it did not.
await knex.transaction(async trx => {
  /* perform validity checks and insert if valid */
  /* Check the record slated to be created
   * against the rules to ensure validity. */
  relationshipValidityChecks = await Promise.all(
    relationshipsToCreate.map(async r => {
      const obj = {
        fromId: r.fromId,
        toId: r.toId,
        type: r.type,
        ...(await relationshipIsValid( // returns an object with validity boolean
          r.fromId,
          r.toId,
          r.type,
          trx // i can specify which connection obj to use (trx/knx)
        ))
      };
      if (obj.valid) {
        await trx.raw(
          `
          insert into t1.relationship
          (from_id, to_id, type) values
          (?, ?, ?)
          returning relationship_key;
          `,
          [r.fromId, r.toId, r.type]
        );
      }
      return obj;
    })
  );
});
When I feed in two records that are valid by themselves but invalidate each other, the first record should be inserted and the second record should return an invalid error. The relationshipIsValid function is somewhat complicated, so I left it out, but I'm certain it works as expected, because if I feed the aforementioned two records in separately (i.e. in two different endpoint calls) the second one returns the invalid error.
Any help would be greatly appreciated. Thanks!
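One observation worth making concrete, as a sketch rather than a confirmed fix: inside Promise.all all the validity checks are started before any of the inserts complete, so record 2's check can run before record 1's insert even on the trx connection. Processing the records one at a time inside the transaction (assuming the same knex instance, relationshipsToCreate array, and relationshipIsValid function as in the question) would look roughly like this:
const relationshipValidityChecks = await knex.transaction(async trx => {
  const results = [];
  for (const r of relationshipsToCreate) {
    // each check now runs after the previous record's insert, so it can see it
    const obj = {
      fromId: r.fromId,
      toId: r.toId,
      type: r.type,
      ...(await relationshipIsValid(r.fromId, r.toId, r.type, trx))
    };
    if (obj.valid) {
      await trx.raw(
        `insert into t1.relationship (from_id, to_id, type)
         values (?, ?, ?) returning relationship_key;`,
        [r.fromId, r.toId, r.type]
      );
    }
    results.push(obj);
  }
  return results;
});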

Node.js Firebird SQL connector - getting result from INSERT ... RETURNING

I am trying to write a function that lets me insert a value into the Firebird database. The query works well, only I get no callback result to tell me that the insert went well.
It is the first time I am using a Firebird connector. In the past, when using MySQL connectors, I recall having some sort of callback when inserting new values. Right now I am using the node-firebird library by Henri Gourvest to accomplish this:
https://github.com/hgourvest/node-firebird/
I tried adding 'RETURNING FEATURE_ID' at the end, but an error "Cursor is not open" was thrown. The feature ID is generated by a trigger.
Any advice would be very kind.
pool.get(function(error, db) {
    if (error) {
        console.log(error)
        res.status(403)
        res.send()
    }
    else {
        var date = moment(req.body.date, "DD/MM/YYYY")
        var values = " VALUES ('" + date.format("MM/DD/YYYY") + "','Requested','" + req.body.type + "','" + req.body.description + "','" + req.body.memo +"')"
        var query = 'INSERT INTO "Features" (FEATURE_REQUESTDATE, FEATURE_STATUS, FEATURE_TYPE, FEATURE_DESCRIPTION, FEATURE_MEMO)' + values
        db.query( query , function(err, result) {
            if (result) { //why is there no result here?
                res.status(200)
                res.send('ok')
            }
            if (err) {
                console.log(err)
                res.status(403)
                res.send('error')
            }
        })
        db.detach();
    }
})
I tried adding 'RETURNING FEATURE_ID' at the end, but an error "Cursor is not open" was thrown.
Sure, there can be no cursor. Cursors (AKA rowsets) are only created by queries - SELECT-type SQL statements.
As stated in the Firebird documentation, statements with a RETURNING clause are not of the query type; they are of the procedure-call type. You should execute them as you do regular DELETE-type statements, then read the parameters of the executed statement.
Right now I am using the node-firebird library by Henri Gourvest to accomplish this: https://github.com/hgourvest/node-firebird/
Any advice would be very kind.
There are two pieces of advice.
NEVER splice your data values into the SQL command text. It makes your program very fragile: it will give you all kinds of data conversion errors, and it also opens highways for database corruption caused by unexpected - erroneous or malicious - user input. See http://bobby-tables.com/ and http://en.wikipedia.org/wiki/SQL_injection
"Use the source, Luke." The library you mentioned is open source, so you should check the examples in that library. Henri is known to be laconic about documentation; however, he supplies his libraries with extensive sets of examples and/or tests. Both suit you here, since they use the library, so you can read how the library was intended to be used by its creator. This particular library has tests, and tests are always examples of intended use.
So go into the test folder and you will see the run.js file. Open it.
https://github.com/hgourvest/node-firebird/blob/master/test/run.js
Now press Ctrl+F and search for the word "RETURNING". One of its occurrences should be exactly the test for the SQL feature you need.
Here it is, the very first occurrence of it in the library code you already have on your machine. Granted, the first occurrence adds the complexity of working with BLOBs, which you do not need right away, so I would rather quote the third example in the library you downloaded. But even the first example shows how to properly execute queries with values and with RETURNING clauses.
function test_insert(next) {
    // ...skip...

    // Insert record without blob
    database.query('INSERT INTO test (ID, NAME, CREATED, PARENT) VALUES(?, ?, ?, ?) RETURNING ID', [3, 'Firebird 3', now, 862304020112911], function(err, r) {
        assert.ok(!err, name + ': insert without blob (buffer) (1) ' + err);
        assert.ok(r['id'] === 3, name + ': without blob (buffer) returning value');
        next();
    });

    // Insert record without blob (without returning value)
    database.query('INSERT INTO test (ID, NAME, CREATED) VALUES(?, ?, ?)', [4, 'Firebird 4', '2014-12-12 13:59'], function(err, r) {
        assert.ok(!err, name + ': insert without blob (buffer) (2) ' + err);
        assert.ok(err === undefined, name + ': insert without blob + without returning value');
        next();
    });
}
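Applied to the Features insert from the question, a parameterized version with RETURNING might look roughly like the following sketch (it assumes the db, date, req and res objects from the original snippet; the exact casing of the returned field name may differ):
var query = 'INSERT INTO "Features" (FEATURE_REQUESTDATE, FEATURE_STATUS, FEATURE_TYPE, FEATURE_DESCRIPTION, FEATURE_MEMO) ' +
            'VALUES (?, ?, ?, ?, ?) RETURNING FEATURE_ID';
var params = [date.format("MM/DD/YYYY"), 'Requested', req.body.type, req.body.description, req.body.memo];
db.query(query, params, function(err, result) {
    if (err) {
        console.log(err);
        res.status(403);
        res.send('error');
    } else {
        // with RETURNING, the generated key comes back on the result object instead of via a cursor
        res.status(200);
        res.send({ featureId: result.feature_id });
    }
    db.detach();
});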

PostGIS query with NodeJS, BookshelfJS and Knex

I'm on a project with NodeJS, BookshelfJS and ExpressJS.
My database is Postgres with Postgis installed.
My table 'organizations' has a 'lat_lon' geometry column.
I would like to query all the organizations within a fixed radius of a specific lat/long point.
I tried something like this:
var organizations = await Organization.query(function (qb) {
qb.where('ST_DWithin(lat_lon, ST_GeomFromText("POINT(45.43 10.99)", 4326), 1000 )')
}).fetchAll()
and a few other combinations, but it doesn't work.
It returns this error:
UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): TypeError: The operator "undefined" is not permitted
It seems it expects an operator inside the where condition, but I'm already working on the 'lat_lon' column.
How can I fix it?
Thanks
Have you tried using knex.raw()?
var organizations = await Organization.query(function (qb) {
qb.where(knex.raw('ST_DWithin(lat_lon, ST_GeomFromText("POINT(45.43 10.99)", 4326), 1000 )'))
}).fetchAll()
I found that whereRaw was the solution that I was looking for when I encountered a similar situation.
Basic Example
If we have the following query using where
qb.where('id', 2)
The whereRaw equivalent is
qb.whereRaw('id = ?', [2])
As Applied to Situation from the Question
I believe that this is roughly equivalent to your query
qb.whereRaw('ST_DWithin(lat_lon, ST_GeomFromText("POINT(45.43 10.99)", 4326), 1000 )')
which could possibly be parameterized as
qb.whereRaw(
'ST_DWithin(lat_lon, ST_GeomFromText("POINT(? ?)", 4326), ?)',
[45.43, 10.99, 1000]
)
if the Longitude, Latitude, or search radius were to change.
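A further sketch, not from the answers above: since double quotes denote identifiers in Postgres, the quoted WKT literal can cause trouble on its own, and the point can instead be built with ST_MakePoint so everything stays parameterized:
qb.whereRaw(
  'ST_DWithin(lat_lon, ST_SetSRID(ST_MakePoint(?, ?), 4326), ?)',
  [45.43, 10.99, 1000]
)
(ST_MakePoint takes x/longitude first, so the coordinate order should be checked against how lat_lon was created.)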

js-data-sql DSSqlAdapter create left join for hasOne Relationships

We are using js-data-sql DSSqlAdapter in our backend nodejs service.
Our model definition has a hasOne Relationship defined as follows:
module.exports = {
    name: 'covariance_predictions',
    idAttribute: 'id',
    relations: {
        belongsTo: {
            assets: {
                localKey: 'asset_1_id',
                localField: 'Covariance_Predictions'
            }
        },
        hasOne: {
            currencies: {
                localField: 'currency',
                localKey: 'currency_id'
            }
        }
    }
};
We call the adapter using:
covariancePredictions.findAll(params, {
with: ['currencies']
})
Question
After enabling knex debugging, we figured out that it does not use a left join statement, but a subsequent SQL query like:
sql: 'select "currencies".* from "currencies" where "id" in (?, ?, ?, ?, ?, ?, ?)' }
Does anyone have any idea how to make the js-data-sql DSSqlAdapter build a left join instead? Like:
select "covariance_predictions".id, [...], "currencies".code from "covariance_predictions" left join "currencies" on "currencies".id = "covariance_predictions".currency_id;
I'm one of the maintainers of js-data-sql. This is currently not supported, as all relations loaded via with are loaded using loadingWithRelations, which performs the subsequent select ... where "id" in (...) for each relation requested.
Loading hasOne and belongsTo as part of the original query should definitely be possible via a left outer join, but not an inner join, as loading the original entity does not depend on the existence of its relation(s).
I've created a GitHub issue to track this change, although I'm not sure when I'll be able to make it, as js-data-sql also needs to be ported over to extend js-data-adapter for js-data 3.0.

How will I know whether the record was a duplicate or it was inserted successfully?

Here is my CQL table:
CREATE TABLE user_login (
userName varchar PRIMARY KEY,
userId uuid,
fullName varchar,
password text,
blocked boolean
);
I have this DataStax Java driver code:
PreparedStatement prepareStmt= instances.getCqlSession().prepare("INSERT INTO "+ AppConstants.KEYSPACE+".user_info(userId, userName, fullName, bizzCateg, userType, blocked) VALUES(?, ?, ?, ?, ?, ?);");
batch.add(prepareStmt.bind(userId, userData.getEmail(), userData.getName(), userData.getBizzCategory(), userData.getUserType(), false));
PreparedStatement pstmtUserLogin = instances.getCqlSession().prepare("INSERT INTO "+ AppConstants.KEYSPACE+".user_login(userName, userId, fullName, password, blocked) VALUES(?, ?, ?, ?, ?) IF NOT EXIST");
batch.add(pstmtUserLogin.bind(userData.getEmail(), userId, userData.getName(), passwordEncoder.encode(userData.getPwd()), false));
instances.getCqlSession().executeAsync(batch);
The problem is that if I remove IF NOT EXISTS everything works fine, but if I put it back it simply does not insert records into the table, nor does it throw any error.
So how will I know that I am inserting a duplicate userName?
I am using Cassandra 2.0.1.
Use INSERT... IF NOT EXISTS, then you can use ResultSet#wasApplied() to check the outcome:
ResultSet rs = session.execute("insert into user (name) values ('foo') if not exists");
System.out.println(rs.wasApplied());
Notes:
this CQL query is a lightweight transaction, which carries performance implications. See this article for more information.
your example only has one statement, so you don't need a batch
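Applied to the user_login insert from the question, a rough sketch of the same idea (DataStax Java driver 2.x; the helper name tryCreateLogin is illustrative, AppConstants.KEYSPACE is taken from the question, and the types come from com.datastax.driver.core):
// Runs the conditional insert on its own statement and reports duplicates.
public boolean tryCreateLogin(Session session, String email, java.util.UUID userId,
                              String fullName, String encodedPwd) {
    PreparedStatement ps = session.prepare(
        "INSERT INTO " + AppConstants.KEYSPACE + ".user_login "
      + "(userName, userId, fullName, password, blocked) VALUES (?, ?, ?, ?, ?) IF NOT EXISTS");
    ResultSet rs = session.execute(ps.bind(email, userId, fullName, encodedPwd, false));
    if (!rs.wasApplied()) {
        // Rejected: a row with this userName already exists. The returned row carries
        // the existing column values alongside the [applied] flag.
        Row existing = rs.one();
        System.out.println("Duplicate userName: " + existing.getString("username"));
        return false;
    }
    return true;
}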
It looks like you need an ACID transaction, and Cassandra, simply put, is not ACID. You have absolutely no guarantee that, in the interval between checking whether the username exists and creating it, it will not be created by someone else.
Besides that, in CQL, INSERT and UPDATE do the same thing. They both write a "new" record, marking the old ones deleted. Whether or not there are old records is not important.
If you want to authenticate or create a new user on the fly, I suppose you can work with a composite key of username + password, and do your query as an update where username = datum AND password = datum.
In this way, if the user gives you a wrong password, your query fails.
If the user is new, he can't give a "wrong" password, and so his account is created.
You can then test for a field like "alreadysubscribed", which you only set after the first login, so it will be missing for a just-created user.
