I have one big Cassandra table (around 1 million rows), and I want to query all of the users table data. When I run the following query, it only returns around 10K records.
Could you please let me know the effective method to query all the rows of a Cassandra table?
I am using the https://www.npmjs.com/package/cassandra-driver npm package as the Cassandra driver.
Tried:
const query = 'select * from users';
db.users.executeQuery(query)
  .then(function (result2) {
    // result2 only ever contains ~10K of the 1M rows
  })
  .catch(function (error) {
    reject(error);
  });
The reason you cannot retrieve all the data at once is that the driver limits the number of rows it reads per request (the fetch size), which is understandable.
Looking at the documentation, you should use the stream or eachRow methods, which allow you to process the entries of the table over multiple iterations.
client.stream(query, parameters, options)
.on('readable', function () {
// readable is emitted as soon a row is received and parsed
let row;
while (row = this.read()) {
// process row
}
})
.on('end', function () {
// emitted when all rows have been retrieved and read
});
Or
client.eachRow(query, parameters, { prepare: true, autoPage : true }, function(n, row) {
// Invoked per each row in all the pages
}, callback);
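If you would rather pull one bounded page at a time instead of streaming or auto-paging, the driver also supports manual paging via pageState. A minimal sketch, assuming the same client, query and parameters as above (the fetchSize of 1000 is an arbitrary choice):
// Fetch a bounded page, then resume from result.pageState until exhausted
function fetchPage(pageState) {
  return client.execute(query, parameters, { prepare: true, fetchSize: 1000, pageState: pageState })
    .then(function (result) {
      result.rows.forEach(function (row) {
        // process row
      });
      if (result.pageState) {
        // more pages remain; pass the pageState to retrieve the following page
        return fetchPage(result.pageState);
      }
    });
}
fetchPage(); // start from the first page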
I would like to do multiple updates via my API using mssql transactions.
Example:
Shipping Table
Listing Table
User_Notes Table
Customer_Login Table
Push_Notification Table
What is the right way of doing it?
I was thinking at first of doing it with raw queries.
BEGIN TRANSACTION
CREATE IN SHIPPING
UPDATE IN LISTING
CREATE IN USER_NOTES
UPDATE IN CUSTOMER_LOGIN
CREATE IN PUSH_NOTIFICATION
COMMIT
But I want to avoid writing a big raw query like this.
Also, can I use mssql transactions together with queries via request.query?
const transaction = new sql.Transaction(/* [pool] */)
transaction.begin(err => {
// ... error checks
const request = new sql.Request(transaction)
request.query('create in shipping table', (err, result) => {
// ... error checks
transaction.commit(err => {
// ... error checks
console.log("Transaction committed.")
})
})
request.query('Update in Listing Table', (err, result) => {
// ... error checks
transaction.commit(err => {
// ... error checks
console.log("Transaction committed.")
})
})
// and so on...
})
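To avoid both the big raw query and the duplicated commit in the snippet above, one option is to await each query in sequence on a single transaction and commit once at the end. This is only a sketch using the mssql promise API; the table names, columns and parameters below are hypothetical placeholders:
const sql = require('mssql');

// One transaction, sequential queries, a single commit, rollback on any failure.
async function applyOrderUpdates(pool, orderId) {
  const transaction = new sql.Transaction(pool);
  await transaction.begin();
  try {
    await new sql.Request(transaction)
      .input('orderId', sql.Int, orderId)
      .query('INSERT INTO Shipping (order_id) VALUES (@orderId)');
    await new sql.Request(transaction)
      .input('orderId', sql.Int, orderId)
      .query('UPDATE Listing SET status = 2 WHERE order_id = @orderId');
    // ...repeat for User_Notes, Customer_Login and Push_Notification
    await transaction.commit();
    console.log('Transaction committed.');
  } catch (err) {
    await transaction.rollback(); // revert everything on any failure
    throw err;
  }
}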
I have an isolated sync server that pulls a tab-delimited text file from an external FTP server and updates (saves) it to MongoDB after processing.
My code looks like this:
//this function pulls file from external ftp server
async function upstreamFile() {
try {
let pythonProcess = spawn('python3', [configVar.ftpInbound, '/outbound/Items.txt', configVar.dataFiles.items], {encoding: 'utf8'});
logger.info('FTP SERVER LOGS...' + '\n' + pythonProcess.stdout);
await readItemFile();
logger.info('The process of file is done');
process.exit();
} catch (upstreamError) {
logger.error(upstreamError);
process.exit();
}
}
//this function connects to db and calls processing function for each row in the text file.
async function readItemFile(){
try{
logger.info('Reading Items File');
let dataArray = fs.readFileSync(configVar.dataFiles.items, 'utf8').toString().split('\n');
logger.info('No of Rows Read', dataArray.length);
await dbConnect.connectToDB(configVar.db);
logger.info('Connected to Database', configVar.db);
while (dataArray.length) {
await Promise.all( dataArray.splice(0, 5000).map(async (f) => {
const splitValues = f.split('|');
await processItemsFile(splitValues)
})
)
logger.info("Current batch finished processing")
}
logger.info("ALL batch finished processing")
}
catch(PromiseError){
logger.error(PromiseError)
}
}
async function processItemsFile(splitValues) {
try {
// Processing of the file is done here and I am using 'save' in Mongoose to write to the db
// data is cleaned and assigned to respective fields
// `exists` and `assignedValues` come from the elided cleaning step above
if(!exists){
let processedValues = new Products(assignedValues);
let productDetails = await processedValues.save();
}
return;
}
catch (error) {
throw error
}
}
upstreamFile()
So this takes about 3 hours to process 100,000 rows and update them in the database.
Is there any way I can speed this up? I am very much limited by the hardware: an EC2-based Linux server with 2 cores and 4 GB of RAM.
Should I use worker threads (e.g. microjob) to run this multi-threaded? If yes, how would I go about doing it?
Or is this the maximum performance?
Note: I can't do a bulk update in MongoDB, as Mongoose pre hooks get triggered on save.
You can always try a bulk update using bulkWrite with updateOne operations.
I would also consider using fs.createReadStream instead of fs.readFileSync.
With an event-driven architecture you could push, let's say, every 100k updates into array chunks and bulk-update them simultaneously.
You can trigger a pre('updateOne') hook (instead of a pre('save') hook) during this operation.
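As a sketch of that hook change, assuming the question's Products model is built from a schema called productSchema (the name is hypothetical):
// Query middleware: runs before each updateOne issued through Mongoose queries
productSchema.pre('updateOne', function (next) {
  // `this` is the Query; inspect or adjust the pending update here
  const update = this.getUpdate();
  // ...the cleaning logic that used to live in pre('save')
  next();
});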
I have solved a similar problem (updating 100k CSV rows) with the following solution:
Create a read stream with fs.createReadStream (thanks to that, your application won't consume much heap memory in the case of huge files).
I'm using the csv-parser npm library to deconstruct a CSV file into separate rows of data:
const fs = require('fs');
const csv = require('csv-parser');

let updates = [];
fs.createReadStream('/filePath').pipe(csv())
  .on('data', row => {
    // ...do anything with the data
    updates.push({
      updateOne: {
        filter: { /* here put the query */ },
        update: [ /* any data you want to update */ ],
        upsert: true /* in my case I want to create the record if it does not exist */
      }
    })
  })
  .on('end', async () => {
    await MyCollection.bulkWrite(updates)
      .catch(err => {
        logger.error(err);
      })
    updates = []; // I just clean up the huge array
  })
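Note that bulkWrite also accepts an options object: passing { ordered: false } lets MongoDB apply the operations in any order, which can be faster when the updates are independent of each other.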
I'm new to Node and am having problems reading from Oracle.
I have the basic examples all set up and can issue basic queries, and process the results etc..
The problem I'm having is that I need to:
Execute one query (Q1)
For each item in the results of Q1 I need to execute a second query (Q2)
I need to combine the results of Q1 and Q2s into an array to return as a promise
I am struggling to find an example where I can perform #2: call the same query multiple times, once for each item returned from Q1, using the same connection that was used for Q1.
My code is below. I first perform a read, then iterate through the results, storing connection.execute promises which I then run via the Promise.all line. For now I just output the result of that, as I want to get this working before I code the logic to combine the results of Q1 and Q2.
When I run this via mocha, the results don't contain any data: I see the column headings but no data.
So what am I missing here?
// placeholder for the connection
let conn;
// return case list array
var caseList = [];
var queryList = [];
return new Promise((resolve, reject) => {
// retrieve connection
oracledb.getConnection({
user: dbconfig.user,
password: dbconfig.password,
connectString: dbconfig.connectString
}) // the connection is returned as a promise
.then(connection => {
console.log('Connected to the DB!');
// assign connection
conn = connection;
// execute statement
return connection.execute(
`select caseid, casereference, startdate from caseheader inner join orgobjectlink on caseheader.ownerorgobjectlinkid = orgobjectlink.orgobjectlinkid where orgobjectlink.username = :username`,
[params.username], {
outFormat: oracledb.OBJECT // set the output format to be object
}
);
})
.then(result => {
// iterate around rows
result.rows.forEach(row => {
var caseObj = {
caseID: row.CASEID,
reference: row.CASEREFERENCE,
dateAssigned: moment(row.STARTDATE).format('YYYY-MM-DD'),
username: params.username,
}
caseList.push(caseObj);
console.log(caseObj.caseID)
queryList.push(conn.execute(`select concernroleid, concernrolename from concernrole inner join caseparticipantrole on concernrole.concernroleid = caseparticipantrole.participantroleid where caseparticipantrole.caseid = :caseID and (caseparticipantrole.typecode = 'PRI' or caseparticipantrole.typecode = 'MEM')`,
[caseObj.caseID], {
outFormat: oracledb.OBJECT
}));
});
// build up queries
return Promise.all(queryList).then(results => {
console.log(results);
Promise.resolve(results);
}, err => {
console.log(err);
});
}).then(() => {
if(conn){
console.log("Closing DB connection");
conn.close();
}
}).catch(err => {
console.log('Error', err);
});
});
Promise.all will not work for you, as you want to use a single connection, and as mentioned previously a connection will only do one thing at a time anyway. To solve this problem using promises, you'd have to build up and unwind a promise chain. I can show you an example, but it's nasty; probably better to just forget I mentioned it.
A better option would be to go with a simple for loop using async/await. I can show you an example of that too, but again, I think this is the wrong move. We call this row-by-row fetching (a.k.a. slow-by-slow).
It's likely the best solution for you will be to take the results from the first query and build up an array. Then execute the second query using one of these options to process the array. https://oracle.github.io/node-oracledb/doc/api.html#sqlwherein
You'll need to include the caseid column in the select clause and perhaps even order by that column so that post-processing of the result set is simplified in Node.js.
This solution has the potential to greatly improve performance and resource utilization, but that has to be balanced against the amount of data you have, the resources, etc. I could probably show you an example of this too, but it will take a bit longer and I'd want to get some more info from you to ensure we're on the right path.
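For illustration, here is a hedged sketch of that approach, reusing the conn and result variables from the question and building numbered bind variables for the IN list (see the sqlwherein link above):
// Build the IN list from the first query's rows, then run the second query once
const caseIds = result.rows.map(row => row.CASEID);
const binds = {};
caseIds.forEach((id, i) => { binds['id' + i] = id; });
const inClause = caseIds.map((_, i) => ':id' + i).join(', ');

conn.execute(
  `select caseparticipantrole.caseid, concernrole.concernroleid, concernrole.concernrolename
     from concernrole
     inner join caseparticipantrole on concernrole.concernroleid = caseparticipantrole.participantroleid
    where caseparticipantrole.caseid in (${inClause})
      and (caseparticipantrole.typecode = 'PRI' or caseparticipantrole.typecode = 'MEM')
    order by caseparticipantrole.caseid`,
  binds,
  { outFormat: oracledb.OBJECT }
).then(rolesResult => {
  // rolesResult.rows is ordered by caseid, so matching the roles back up
  // with the caseList entries is a single pass
});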
One problem is that the Promise.all().then... function doesn't return anything (and doesn't need the additional resolve()). The way to get this sorted is to build small, testable, promise-returning functions, and test them individually.
Starting simply, write a mocha test to connect to the database...
function connect() {
return oracledb.getConnection({
user: dbconfig.user,
password: dbconfig.password,
connectString: dbconfig.connectString
});
}
Here's one that can run a command on the db. Test this with a simple query that you know will return some results.
function executeCmd(connection, cmd, params) {
return connection.execute(cmd, params, { outFormat: oracledb.OBJECT });
}
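Tested individually, that might look like this (a hedged mocha sketch; dbconfig and the helpers above are assumed to be in scope):
const assert = require('assert');

it('connects and runs a simple query', async () => {
  const connection = await connect();
  const result = await executeCmd(connection, 'select 1 as n from dual', []);
  assert.strictEqual(result.rows[0].N, 1); // unquoted aliases come back upper-cased
  await connection.close();
});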
With just these two (and one more) we can outline a simple function that does the job: connect to the database, run a select, process each result asynchronously, then disconnect.
function connectAndQuery(username) {
let connection;
return connect().then(result => {
connection = result;
let cmd = `select caseid, casereference, startdate from caseheader inner join orgobjectlink on caseheader.ownerorgobjectlinkid = orgobjectlink.orgobjectlinkid where orgobjectlink.username = :username`;
return executeCmd(connection, cmd, [username]);
}).then(result => {
let promises = result.rows.map(row => processCaseRow(connection, row, username));
return Promise.all(promises);
}).then(result => {
// result should be an array of caseObj's
return connection.close().then(() => result);
});
}
The last thing to build and test is a promise-returning function which processes a row from the main function above.
I had to take some liberty with this, but I think the objective is -- given a row representing a "case" -- build a case object, including a collection of "concernedRoles" that can be queried with the caseID. (that last bit was my idea, but you can build a separate collection if you like)
// return a promise that resolves to an object with the following properties...
// caseID, reference, dateAssigned, username, concernedRoles
// get concernedRoles by querying the db
function processCaseRow(connection, row, username) {
var caseObj = {
caseID: row.CASEID,
reference: row.CASEREFERENCE,
dateAssigned: moment(row.STARTDATE).format('YYYY-MM-DD'),
username: username
}
let cmd = `select concernroleid, concernrolename from concernrole inner join caseparticipantrole on concernrole.concernroleid = caseparticipantrole.participantroleid where caseparticipantrole.caseid = :caseID and (caseparticipantrole.typecode = 'PRI' or caseparticipantrole.typecode = 'MEM')`;
return executeCmd(connection, cmd, [row.CASEID]).then(result => {
  caseObj.concernedRoles = result.rows // keep just the rows, matching the comment above
  return caseObj
})
}
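Hypothetical usage, e.g. from a mocha test (the username is a placeholder):
connectAndQuery('jsmith')
  .then(caseList => console.log(caseList)) // array of caseObj's, each with concernedRoles
  .catch(err => console.log('Error', err));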
Please keep in mind that I am new to Node.js and I am used to Android development.
My scenario is like this:
Run a query against the database that returns either null or a value
Call a web service with that database value; the service returns info paginated, meaning that on each call I get a parameter to pass to the next call if there is more info to fetch
After all the items are retrieved, store them in a database table
If everything goes well, for each item received previously, I need to make another web call and store the retrieved info in another table
If fetching any of the data sets fails, all data must be reverted from the database
So far, I've tried this:
getAllData: function(){
  self.getMainWebData(null)
    .then(function(result){
      //get secondary data for each result row and insert it into database
    });
}
getMainWebData: function(nextPage){
return new Promise(function(resolve, reject) {
module.getWebData(nextPage, function(errorReturned, response, values) {
if (errorReturned) {
reject(errorReturned);
}
nextPage = response.nextPageValue;
resolve(values);
})
}).then(function(result) {
//here I need to insert the returned values in database
//there's a new page, so fetch the next set of data
if (nextPage) {
//call again getMainWebData?
self.getMainWebData(nextPage)
}
})
There are a few things missing. From what I've tested, getAllData.then fires only once, for the first set of items and not for the others, so clearly my handling of the returned data is not right.
LATER EDIT: I've edited the scenario. Given some more research, my feeling is that I could use a chain of .then() calls to perform the operations in sequence.
Yes, it is happening because you are resolving the promise on the first call itself. You should put resolve(values) inside an if statement that checks whether more data needs to be fetched. You will also need to restructure the logic, as Node is asynchronous; the above code will not work unless you change the logic.
Solution 1:
You can either append the paginated response to another variable outside the context of the calls you are making. And later use that value after you are done with the response.
getAllData: function(){
  self.getMainWebData(null)
    .then(function(result){
      // make your database transaction if the result is not an error
    });
}
function getList(nextPage, result, callback){
  module.getWebData(nextPage, function(errorReturned, response, values) {
    if(errorReturned)
      return callback(errorReturned);
    result.push(values);
    nextPage = response.nextPageValue;
    if(nextPage)
      getList(nextPage, result, callback);
    else
      callback(null, result);
  })
}
getMainWebData: function(nextPage){
return new Promise(function(resolve, reject) {
var result = [];
getList(nextPage, result, function(err, results){
if(err)
reject(err);
else{
// Here all the items are retrieved, you can store them in a database table
// for each item received make your web call and store it into another variable or result set
// suggestion is to make the database transaction only after you have retrieved all your data
// other wise it will include database rollback which will depend on the database which you are using
// after all this is done resolve the promise with the returning value
resolve(results);
}
});
})
}
I have not tested it, but something like this should work. If the problem persists, let me know in the comments.
Solution 2:
You can remove the promises and try the same thing with callbacks, as they are easier to follow and will make sense to programmers who are familiar with procedural languages.
Looking at your problem, I have created code that loops through promises and only proceeds if there is more data to be fetched; the stored data remains available in an array.
I hope this helps. Don't forget to mark the answer if it does.
// fetchData simulates a paginated source: it resolves one page (slice) of data at a time
let fetchData = (offset = 0, limit = 10) => {
let addresses = [...Array(100).keys()];
return Promise.resolve(addresses.slice(offset, offset + limit))
}
// o => offset & l => limit
let o = 0, l = 10;
let results = [];
let process = p => {
if (!p) return p;
return p.then(data => {
// Process with data here;
console.log(data);
// increment the pagination
o += l;
results = results.concat(data);
// while there is data equal to limit set then fetch next page
// otherwise return the collected result
return (data.length == l) ? process(fetchData(o, l)) : results;
})
}
process(fetchData(o, l))
.then(data => {
// All the fetched data will be here
}).catch(err => {
// Handle Error here.
// All the retrieved data from database will be available in "results" array
});
If you want to do this more often, I have also created a gist for reference.
If you don't want to use any global variables and want to do it in a very functional way, you can check this example; however, it requires a little more complexity.
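As a hedged sketch of that more functional shape (this is not the referenced gist; it reuses the stubbed fetchData from above and threads the accumulator through the recursion instead of mutating an outer results array):
const fetchData = (offset = 0, limit = 10) => {
  let addresses = [...Array(100).keys()];
  return Promise.resolve(addresses.slice(offset, offset + limit));
};

// the accumulator `acc` travels with the recursion, so no outer state is needed
const processAll = (offset, limit, acc = []) =>
  fetchData(offset, limit).then(data => {
    const collected = acc.concat(data);
    // keep paging while full pages come back; otherwise return everything
    return data.length === limit
      ? processAll(offset + limit, limit, collected)
      : collected;
  });

processAll(0, 10).then(all => console.log(all.length)); // 100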
I am retrieving data from a Cassandra timeuuid column using the Node.js Cassandra driver. The data is coming back as a Buffer instead of a string, and I need it as a string.
While it's still difficult to understand what you're expecting to see, this will return your PostedDate in a more readable format:
SELECT DateOf(PostedDate) FROM facehq.timeline;
Also, as your data grows in size, unbound queries will become problematic and slow. So make sure you qualify queries like this with a WHERE clause whenever possible.
I need the result like "posteddate":"f6ca25d0-ffa4-11e4-830a-b395dbb548cd".
It sounds to me like your issue is in your node.js code. How are you executing your query? The Node.js driver documentation on the GitHub project page has an example that may help:
client.stream('SELECT time, val FROM temperature WHERE station_id=?', ['abc'])
.on('readable', function () {
//readable is emitted as soon a row is received and parsed
var row;
while (row = this.read()) {
console.log('time %s and value %s', row.time, row.val);
}
})
.on('end', function () {
//stream ended, there aren't any more rows
})
.on('error', function (err) {
//Something went wrong: err is a response error from Cassandra
});
The DataStax documentation also has a specific example for working with timeUUIDs:
const assert = require('assert');
const { TimeUuid, Uuid } = require('cassandra-driver').types;

client.execute('SELECT id, timeid FROM sensor', function (err, result) {
assert.ifError(err);
console.log(result.rows[0].timeid instanceof TimeUuid); // true
console.log(result.rows[0].timeid instanceof Uuid); // true, it inherits from Uuid
console.log(result.rows[0].timeid.toString()); // <- xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
console.log(result.rows[0].timeid.getDate()); // <- Date stored in the identifier
});