On the project I am currently working on, I have to read an Excel file (with over 1000 rows), extract all of them, and insert/update them into a database table.
In terms of performance, is it better to add all the records to a Doctrine_Collection and insert/update them afterwards using the fromArray() method? Another possible approach is to create a new object for each row (an Excel row becomes an object) and then save it, but I think that is worse in terms of performance.
Every time the Excel file is uploaded, its rows need to be compared to the existing objects in the database. If a row does not exist as an object, it should be inserted; otherwise it should be updated. My first approach was to turn both the objects and the rows into arrays (or Doctrine_Collections) and then compare the two arrays before performing the needed operations.
Can anyone suggest any other possible approach?
We did a bit of this in a project recently, with CSV data, and it was fairly painless. There's a symfony plugin, tmCsvPlugin, but we have extended it quite a bit since, so the version in the plugin repo is pretty out of date. Must add that to the #TODO list :)
Question 1:
I can't speak to performance specifically, but I would guess that adding the records to a Doctrine_Collection and then calling Doctrine_Collection::save() would be the neatest approach. It would also be handy if an exception were thrown somewhere and you had to roll back your last save.
Question 2:
If you can use a row field as a unique identifier (let's assume a username), then you can search for an existing record. If you find one, and assuming that your imported row is an array, use Doctrine_Record::synchronizeWithArray() to update that record, then add it to a Doctrine_Collection. When complete, just call Doctrine_Collection::save().
A fairly rough 'n' ready implementation:
// set up a new collection
$collection = new Doctrine_Collection('User');

// assuming $row is an associative
// array representing one imported row.
foreach ($importedRows as $row) {

    // try to find an existing record
    // based on a unique identifier.
    $user = Doctrine_Core::getTable('User')
        ->findOneByUsername($row['username']);

    // create a new user record if
    // no existing record is found.
    if (!$user instanceof User) {
        $user = new User();
    }

    // sync record with current data.
    $user->synchronizeWithArray($row);

    // add to collection.
    $collection->add($user);
}

// done. save collection.
$collection->save();
Pretty rough but something like this worked well for me. This is assuming that you can use your imported row data in some way to serve as a unique identifier.
NOTE: be wary of synchronizeWithArray() if you're using sf 1.2/Doctrine 1.0 - if I remember correctly, it was not implemented correctly there. It works fine in Doctrine 1.2, though.
I have never worked with Doctrine_Collections, but I can answer in terms of database queries and code logic in a broader sense. I would apply the following logic:
1. Fetch all the existing rows from the database in a single query and store them in an array $storedSheet.
2. Create a single array of all the rows of the uploaded Excel sheet and call it $uploadedSheet. The structures of $uploadedSheet and $storedSheet should be similar (both two-dimensional - rows and cells can be identified and compared).
3. Run foreach loops on $uploadedSheet as follows, and only identify which rows need to be inserted and which need to be updated (do the actual queries later):
$rowsToBeUpdated  = array();
$rowsToBeInserted = array();

foreach ($uploadedSheet as $row => $eachRow) {
    if (is_array($storedSheet[$row])) {
        foreach ($eachRow as $column => $value) {
            if ($value != $storedSheet[$row][$column]) {
                // This is a representation of the comparison.
                $rowsToBeUpdated[$row] = true;
                break; // No need to check this row any further - one difference detected.
            }
        }
    } else {
        $rowsToBeInserted[$row] = true;
    }
}
4. This way you end up with two arrays. Now perform two database queries:
Bulk insert all the rows of $uploadedSheet whose numbers are stored in the $rowsToBeInserted array.
Bulk update all the rows of $uploadedSheet whose numbers are stored in the $rowsToBeUpdated array.
These bulk queries are the key to faster performance.
Let me know if this helped, or if you want to know something else.
Here is my problem:
I have a firestore collection that has a number of documents. There are about 500 documents generated/updated every hour and saved to the collection.
I would like to query the collection and setup a real-time snapshot listener for a subset of document IDs, that are provided by the client.
I think maybe I could do something like this (the syntax is likely not correct... just trying to get a feel for whether it's even possible... but isn't "in" limited to an array of 10 items?):
const subbedDocs = ["doc1", "doc2", "doc3", "doc4", "doc5"];
docsRef.where('docID', 'in', subbedDocs).onSnapshot((doc) => {
    handleSnapshot(doc);
});
I'm sorry, that code probably doesn't make sense... I'm still trying to learn all the ins and outs of Firestore.
Essentially, what I am trying to do is take an array of IDs and set up an .onSnapshot listener for those IDs. The list of IDs could be upwards of 40-50 items. Is this even possible? I am trying to avoid setting up a listener on the whole collection and filtering out the things I am not "subscribed" to, as that seems wasteful from a resources perspective.
If you have the doc IDs in your array (and it looks like you have), you can loop over them and start a listener for each one:
const subbedDocs = ["doc1", "doc2", "doc3", "doc4", "doc5"];
for (let i = 0; i < subbedDocs.length; i++) {
    const docID = subbedDocs[i];
    docsRef.doc(docID).onSnapshot((doc) => {
        handleSnapshot(doc);
    });
}
It would be better to listen to a single query and get all the filtered docs at once. But if you want to listen to each of them with an explicit listener, the loop above will do the trick.
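If you prefer query listeners despite the 10-item limit on the 'in' operator, one possible workaround is to split the ID array into chunks of at most 10 and attach one listener per chunk. This is only a rough sketch, reusing subbedDocs, docsRef and handleSnapshot from the question and assuming docID is stored as a field on each document:
// Rough sketch: one query listener per chunk of up to 10 IDs,
// since Firestore's 'in' operator accepts at most 10 values.
const chunkSize = 10;
for (let i = 0; i < subbedDocs.length; i += chunkSize) {
    const chunk = subbedDocs.slice(i, i + chunkSize);
    docsRef.where('docID', 'in', chunk).onSnapshot((snapshot) => {
        // each listener fires with the matching docs from its own chunk
        snapshot.forEach((doc) => handleSnapshot(doc));
    });
}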
As you've discovered, Firestore's in operator only allows up to 10 entries in the array. I'm also guessing you've added docID as a field in the document, since I don't believe 'docID' references the actual document ID.
I would not take this approach, because of the 10-entry limitation. What I would do instead is, as the client selects documents to follow, set a field (the same field in each document) to a unique ID for that client, so your query completely avoids the limitation. You can allow an unlimited number of client listeners (up to the implementation limits of Firestore) if you add that client ID to an array field (called something like "ListenerArray") [again, as the client selects the documents]. Your query would look more like:
docsRef.where('ListenerArray', 'array-contains', clientID).onSnapshot((doc) => {
    handleSnapshot(doc);
});
array-contains checks a single value against all entries in a document array, without limit. Every client can mark any number of documents to subscribe to.
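For the marking step itself, here is a hedged sketch of how a client might add and later remove its ID from that array, assuming the v8-style JavaScript SDK is available as firebase and reusing the hypothetical ListenerArray field name; someDocID stands in for whichever document the client selects:
// Rough sketch: subscribe this client to one document.
// FieldValue.arrayUnion only adds the value if it is not already present.
docsRef.doc(someDocID).update({
    ListenerArray: firebase.firestore.FieldValue.arrayUnion(clientID)
});

// And to unsubscribe, remove the client ID again.
docsRef.doc(someDocID).update({
    ListenerArray: firebase.firestore.FieldValue.arrayRemove(clientID)
});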
I have lots of records in my Postgres database (using Sequelize to communicate with it).
I want to have a migration script, but due to locking, I have to make each change as atomic as possible.
So I don't want to select everything, then modify it, then save it all back.
In Mongo I have a forEach cursor which allows me to update a record, save it, and only then move on to the next one.
Is there anything similar in Sequelize/Postgres?
Currently, I am doing this in my code: getting the IDs, then performing a query for each one.
return migration.runOnAllUpdates((record) => {
    record.change = 'new value';
    return record.save();
});
where runOnAllUpdates will simply give me records one by one.
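For what it's worth, here is a minimal sketch of what such a helper might look like with plain Sequelize (v5+ for findByPk); runOnAllUpdates is the question's own helper name, and Record and its id attribute are placeholders:
// Rough sketch: fetch only the primary keys up front, then load, modify
// and save one record at a time so each UPDATE is its own short statement.
async function runOnAllUpdates(updateFn) {
    const ids = await Record.findAll({ attributes: ['id'], raw: true });
    for (const { id } of ids) {
        const record = await Record.findByPk(id);
        if (record) {
            await updateFn(record);
        }
    }
}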
I would like to insert a document if it doesn't exist (client_nr not found).
If it exists, replace the whole document with the new values.
The only other thing is that client_nr is not the primary key. The primary key is the default id created by the RethinkDB database.
I tried the code below in Node.js, but nothing happened. The data is in the variable jsonArray, and I use the for loop to go through the whole jsonArray.
Any idea how to solve this problem?
Thanks!!!
for (var Ticker in jsonArray) {
    r.db(db).table('trades').filter({client_nr: jsonArray[Ticker].client_nr}).forEach(function(post) {
        return r.branch(
            post.eq(null),
            r.db(db).table('log').insert(jsonArray[Ticker]),
            r.db(db).table('log').replace(jsonArray[Ticker])
        )
    }).run()
}
This is much easier to do if client_nr is your primary key. I'd consider doing that instead of using the autogenerated IDs. That will also enforce uniqueness on the field, which is probably what you want.
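For example, with client_nr as the primary key, each upsert collapses to a single query. A rough sketch, reusing db, jsonArray and Ticker from the question and assuming an open driver connection conn:
// Rough sketch: with client_nr as the primary key, insert() with
// conflict: 'replace' inserts a new document or replaces the existing
// one that shares the same key.
r.db(db).table('trades')
    .insert(jsonArray[Ticker], {conflict: 'replace'})
    .run(conn);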
I was also a little confused by your example because your description made it sound like you wanted to be inserting/replacing into the same table that you're filtering on, but your example is referencing two different tables.
Assuming you want to be using a single table, something like this should do it:
TABLE.filter({client_nr: jsonArray[Ticker].client_nr}).replace(function(row) {
    return r(jsonArray[Ticker]).merge(row.pluck('id'));
}).do(function(res) {
    return r.branch(
        res('replaced').add(res('unchanged')).eq(0),
        TABLE.insert(jsonArray[Ticker]),
        res);
})
I am scraping a 90K-record database using JSON-RPC and I am trying to put in some basic error checking. I want to start by scraping the database twice using two different settings and adding a prefix to the second scrape. This way I can check that the two settings are not producing different records (due to dropped updates, etc.). I wanted to implement the comparison using a view which compares each document from the first scrape with its twin produced by the second scrape and then emits the names of records with a difference between them.
However, I cannot quite figure out how to pull another doc into the view; everything I have read only discusses referencing external docs via the emit() function, which is too late to let me compare them. In the example below, the lookup() function would grab the referenced document.
Is this just not possible?
function(doc) {
    if (doc._id.slice(0,1) !== '$' && doc._id.slice(0,1) !== "_") {
        var otherDoc = lookup('$test' + doc._id);
        if (otherDoc) {
            var keys = doc.value.keys();
            var same = true;
            keys.forEach(function(key) {
                if ((key.slice(0,1) !== '_') && (key.slice(0,1) !== '$') && (key !== 'expires')) {
                    if (!Object.equal(otherDoc[key], doc[key])) {
                        same = false;
                    }
                }
            });
            if (!same) {
                emit(doc._id, 1);
            }
        }
    }
}
Context
You are correct that this is not possible in CouchDB. The whole point of the map function is that it must be idempotent; otherwise you lose all the other nice benefits of a pre-calculated index.
This is why you cannot access external resources in the map function, whether they be other records or the clock. Any time you run a map you must always get the same result if you put the same record into it. Since there are no relationships between records in CouchDB, you cannot promise that this is possible.
Solution
However, you can still achieve your end goal, just by different means. Some possibilities...
Assuming there is some meaningful numeric value in each doc, you could use a view to take the sum of all those values and group them by which import you did ({key: <batch id>, value: <meaningful number>}). Then compare the two numbers in your client or the browser to see if they match.
A brute force approach would be to use a view to pair the docs that should match. Each doc is on a different row, but they're grouped by a common field. Then iterate through the entire index comparing the pairs. This would certainly be the quickest to code and doesn't depend on your application or data.
Implement a validation function to enforce a schema on your data. Just be warned that this will reduce your write throughput since each written record will be piped out of Erlang and into the JS engine. Also, this is only applicable if you're worried about properly formed records instead of their precise content, which might not be the case.
Instead of your different batch jobs creating different docs, have them place their data into the same doc. The structure might look like this: { "_id": "something meaningful", "batch_one": { ..data.. }, "batch_two": { ..data.. } } Then your validation function could compare them, or you could create a view that indexes all the docs that don't match (see the sketch after this list). It all depends on where in your pipeline you want to do the error checking and correction.
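To illustrate that last option, here is a rough sketch of a map function for such a view, assuming each document carries batch_one and batch_two as above and that the view server provides JSON.stringify; the naive string comparison also assumes a consistent key order in both batches:
// Rough sketch: emit the id of any doc whose two imports disagree.
function(doc) {
    if (doc.batch_one && doc.batch_two) {
        if (JSON.stringify(doc.batch_one) !== JSON.stringify(doc.batch_two)) {
            emit(doc._id, null);
        }
    }
}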
Personally I like the last option better, but only if you don't plan to use the database as-is in production, i.e., you wouldn't want to carry around all that extra data in each record.
Hope that helps.
Cheers.
I'm trying to create a pagination index view in CouchDB that lists the doc._id for every Nth document found.
I wrote the following map function, but the pageIndex variable doesn't reliably start at 1 - in fact it seems to change arbitrarily depending on the emitted value or the index length (e.g. 50, 55, 10, 25 - all start with a different file, though I seem to get the correct number of files emitted).
function(doc) {
    if (doc.type == 'log') {
        if (!pageIndex || pageIndex > 50) {
            pageIndex = 1;
            emit(doc.timestamp, null);
        }
        pageIndex++;
    }
}
What am I doing wrong here? How would a CouchDB expert build this view?
Note that I don't want to use the "startkey + count + 1" method that's been mentioned elsewhere, since I'd like to be able to jump to a particular page or the last page (user expectations and all), I'd like to have a friendly "?page=5" URI instead of "?startkey=348ca1829328edefe3c5b38b3a1f36d1e988084b", and I'd rather CouchDB did this work instead of bulking up my application, if I can help it.
Thanks!
View functions (map and reduce) are purely functional. Side-effects such as setting a global variable are not supported. (When you move your application to BigCouch, how could multiple independent servers with arbitrary subsets of the data know what pageIndex is?)
Therefore the answer will have to involve a traditional map function, perhaps keyed by timestamp.
function(doc) {
    if (doc.type == 'log') {
        emit(doc.timestamp, null);
    }
}
How can you get every 50th document? The simplest way is to add a skip=0, skip=50, or skip=100 parameter. However, that is not ideal (see below).
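For instance, a request for the third page of 50 rows might look like the sketch below (the design document and view names are hypothetical; limit and skip are standard view query parameters):
// Browser-side sketch: fetch rows 101-150 of the view.
fetch('/db/_design/paging/_view/by_timestamp?limit=50&skip=100')
    .then(function(res) { return res.json(); })
    .then(function(body) { console.log(body.rows); });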
A way to pre-fetch the exact IDs of every 50th document is a _list function which only outputs every 50th row. (In practice you could use Mustache.JS or another template library to build HTML.)
function() {
    var ddoc = this,
        pageIndex = 0,
        first = true,
        row;
    send("[");
    while (row = getRow()) {
        if (pageIndex % 50 == 0) {
            // separate entries with commas so the output is valid JSON
            if (!first) { send(","); }
            send(JSON.stringify(row));
            first = false;
        }
        pageIndex += 1;
    }
    send("]");
}
This will work for many situations; however, it is not perfect. Here are some considerations I have in mind - not necessarily showstoppers, but it depends on your specific situation.
There is a reason the pretty URLs are discouraged. What does it mean if I load page 1, then a bunch of documents are inserted within the first 50, and then I click through to page 2? If the data is changing a lot, there is no perfect user experience; the user must somehow feel the data changing.
The skip parameter and example _list function have the same problem: they do not scale. With skip you are still touching every row in the view starting from the beginning: finding it in the database file, reading it from disk, and then ignoring it, over and over, row by row, until you hit the skip value. For small values that's quite convenient but since you are grouping pages into sets of 50, I have to imagine that you will have thousands or more rows. That could make page views slow as the database is spinning its wheels most of the time.
The _list example has a similar problem, however you front-load all the work, running through the entire view from start to finish, and (presumably) sending the relevant document IDs to the client so it can quickly jump around the pages. But with hundreds of thousands of documents (you call them "log" so I assume you will have a ton) that will be an extremely slow query which is not cached.
In summary, for small data sets you can get away with the page=1, page=2 form; however, you will bump into problems as your data set gets big. With the release of BigCouch, CouchDB is even better for log storage and analysis, so (if that is what you are doing) you will definitely want to consider how far you need to scale.