Sync Elasticsearch on connection with database - Node.js

Aim: sync Elasticsearch with a Postgres database
Why: sometimes the network or the cluster/server breaks, so updates made in the meantime need to be recorded and synced later.
This article https://qafoo.com/blog/086_how_to_synchronize_a_database_with_elastic_search.html suggests that I should create a separate updates table that tracks what has been synced to Elasticsearch, allowing me to select the new data (from the database) since the last record (in Elasticsearch). So I thought: what if I recorded Elasticsearch's failed and successful connections? If the client pings back successfully (the returned promise resolves), I could launch a function to sync records with my database.
Here's my elasticConnect.js
import elasticsearch from 'elasticsearch'
import syncProcess from './sync'
const client = new elasticsearch.Client({
  host: 'localhost:9200',
  log: 'trace'
});

client.ping({
  requestTimeout: Infinity,
  hello: "elasticsearch!"
})
  .then(() => syncProcess) // successful connection
  .catch(err => console.error(err))
export default client
This way, I don't even need to worry about running a cron job (if question 1 is correct), since I know that the cluster is running.
Questions
Will syncProcess run before export default client? I don't want any requests coming in while syncing...
syncProcess should run only once (since it's cached/not exported), no matter how many times I import elasticConnect.js. Correct?
Are there any advantages to using the method with an updates table, instead of just selecting data from the parent/source table?
The article's comments say "don't use timestamps to compare new data!". Ehhh... why? It should be OK since the database is blocking, right?

For 1: As it is, you have no guarantee that syncProcess will have run by the time the client is exported. Instead you should do something like in this answer and export a promise instead (see the sketch after this list).
For 2: With the solution I linked to in the above question, this would be taken care of.
For 3: An updates table would also catch record deletions, while simply selecting from the DB would not, since you don't know which records have disappeared.
For 4: The second comment after the article you linked to provides the answer (hint: timestamps are not strictly monotonic).
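For point 1, here is a minimal sketch of the export-a-promise idea, following the names in the question's elasticConnect.js and assuming syncProcess is an (async) function exported by ./sync:

import elasticsearch from 'elasticsearch'
import syncProcess from './sync'

const client = new elasticsearch.Client({
  host: 'localhost:9200',
  log: 'trace'
});

// Export a promise that resolves to the client only after the ping succeeded
// and the sync has actually finished, so no requests run against a stale index.
const clientReady = client.ping({ requestTimeout: Infinity })
  .then(() => syncProcess())   // run the sync and wait for it to complete
  .then(() => client);

export default clientReady

Consumers then import clientReady and await it before issuing any requests, which also guarantees the sync runs only once no matter how many modules import the file.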

Related

How do I set up my backend server without overwriting data each time it restarts

I have created my first backend server using Node, Express, Sequelize and nodemon. I have also set up the tables I want, and I initialise the data I want inserted into those tables when it runs (to save me time entering it again). There is also separate data that I have entered manually in the pgAdmin browser, and other data I have entered via the CRUD functions. At the minute I am unable to enter large volumes of data, as my tables keep flagging errors and it looks like they are being overwritten.
I am having trouble getting my server to run stably, and I wasn't sure if it is related to the bit of code below from my server.js file. Does 'force: true' mean that the existing table data will be overwritten each time the server is run, or do I need to add an "if tables exist" type check? If I were to add, say, a foreign key, would the whole table be overwritten?
I assumed the best practice was to keep the code for creating my tables; I have previously created my own tables in phpMyAdmin when I used PHP. However, this is my first time creating a backend server and using the Sequelize ORM, and I don't want to keep losing the data I have entered.
db.sequelize.sync({ force: true })
  .then(() => {
    console.log(`Drop and resync database with { force: true }`);
    initial();
  });
You are using { force: true }, which is equivalent to DROP TABLE IF EXISTS. So whenever you restart your server it drops your existing tables along with their data. If you don't want to lose your data, I suggest you remove { force: true }.
The { force: true } option recreates everything from a clean state whenever you restart, so it is expected that your data goes away. That can be handy in development, when you are testing things out and changing your schema as you build the application, but it is not ideal for production.
You could disable it in the first place, or add a check so the tables are not dropped when you are in production mode.
Here is a good example:
if (process.env.NODE_ENV === "production") {
  // in production, refuse to force-sync (or do a non-forced sync here instead)
  throw new Error("Forced sync is disabled in production");
}
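Another common pattern, sketched below under the assumption that db.sequelize and initial() are the objects from the question, is to derive the force flag from the environment so that only non-production runs drop and recreate the tables:

// Hedged sketch: force-recreate tables only outside production.
const force = process.env.NODE_ENV !== "production";

db.sequelize.sync({ force })
  .then(() => {
    console.log(`Database synced with { force: ${force} }`);
    if (force) initial();   // reseed only when the tables were just recreated
  });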

node-mysql2: resultset not reflecting the latest results

I'm using node-mysql2 with a connection pool and a connection limit of 10. When I restart the application, the results are good - they match what I have on the db. But when I start inserting new records and redo the same select queries, then I get intermittent results missing the latest record I just added.
If I do check the database directly, I can see the records I just added through my application. It's only the application that cannot see it somehow.
I think this is a bug, but here's how I have my code set up:
module.exports.getDB = function (dbName) {
  if (!(dbName in dbs)) {
    console.log(`Initiating ${dbName}`);
    let config = dbConfigs[dbName];
    dbs[dbName] = mysql.createPool({
      host: config.host,
      port: config.port || 3306,
      user: config.user,
      password: config.password,
      connectionLimit: 10,
      database: config.database,
      debug: config.debug
    });
  }
  return dbs[dbName]; // I just initialize each database once
};
This is my select query:
let db = dbs.getDB('myDb');
const [rows] = await db.query(`my query`);
console.log(rows[0]); // this one starts to show my results inconsistently once I insert records
And this is my insert query:
module.exports = {
  addNote: async function(action, note, userID, expID) {
    let db = dbs.getDB('myDb');
    await db.query(`INSERT INTO experiment_notes (experiment_id, action, created_by, note)
                    VALUES (?, ?, ?, ?)`, [expID, action, userID, note]);
  }
};
If I set the connectionLimit to 1, I cannot reproduce the problem... at least not yet
Any idea what I'm doing wrong?
Setting your connectionLimit to 1 has an interesting side-effect: it serializes all access from your Node program to your database. Each operation, be it an INSERT or a SELECT, must run to completion before the next one starts, because it has to wait for the single connection in the pool to free up.
It's likely that your intermittently missing rows are due to concurrent access to your DBMS from different connections in your pool. If you do a SELECT on one connection while MySQL is still handling the INSERT on another connection, the SELECT won't always see the row being inserted. This is by design: it's part of ACID (atomicity, consistency, isolation, durability), and it's vital to making DBMSs scale up.
In more complex applications than the one you showed us, the same thing can happen when you use DBMS transactions and forget to COMMIT them.
Edit: Multiple database connections, even connections from the same pool in the same program, work independently of each other. So, if you're performing a not-yet-committed transaction on one connection and a query on another connection, the query will (usually) reflect the database's state from before the transaction started. The query cannot force the transaction to roll back unless it somehow causes a deadlock; but deadlocks generate error messages, and you probably are not seeing any.
You can sometimes control what a query sees by preceding it, on the same connection, with SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;. On a busy DBMS that can improve query performance a little and prevent some deadlocks, as long as you're willing to have your query see only part of a transaction. I use it for historical queries (what happened yesterday). It's documented here. The default, and the one that explains what you see, is SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;.
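To make "on the same connection" concrete when you're working with a pool, here is a hedged sketch using the getDB pool and experiment_notes table from the question (the SELECT itself is illustrative, and the code must sit inside an async function):

const db = dbs.getDB('myDb');
const conn = await db.getConnection();   // reserve a single connection from the pool
try {
  // applies to the next transaction started on this connection only
  await conn.query('SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED');
  const [rows] = await conn.query(
    'SELECT * FROM experiment_notes WHERE experiment_id = ?', [expID]);
  console.log(rows);
} finally {
  conn.release();   // always hand the connection back to the pool
}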
But, avoid that kind of isolation-level stuff until you need it. (That advice comes under the general heading of "too smart is dumb.")

How to efficiently sync Apollo's cache using subscriptions and AWS AppSync

I'm using aws-appsync in a Node.js client to keep a cached list of data items. This cache must be available at all times, including when not connected to the internet.
When my Node app starts, it calls a query which returns the entire list of items from the AppSync data source. This is cached by Apollo's cache storage, which allows future queries (using the same GraphQL query) to be made using only the cache.
The app also makes a subscription to the mutations which are able to modify the list on other clients. When an item in the list is changed, the new data is sent to the app. This can trigger the original query for the entire list to be re-fetched, thus keeping the cache up to date.
Fetching the entire list when only one item has changed is not efficient. How can I keep the cache up to date, while minimising the amount of data that has to be fetched on each change?
The solution must provide a single point to access cached data. This can either be a GraphQL query or access to the cache store directly. However, using results from multiple queries is not an option.
The Apollo documentation hints that this should be possible:
In some cases, just using [automatic store updates] is not enough for your application ... to update correctly. For example, if you want to add something to a list of objects without refetching the entire list ... Apollo Client cannot update existing queries for you.
The alternatives it suggests are refetching (essentially what I described above) and using an update callback to manually update the cached query results in the store.
Using update gives you full control over the cache, allowing you to make changes to your data model in response to a mutation in any way you like. update is the recommended way of updating the cache after a query.
However, here it is referring to mutations made by the same client, rather than syncing between clients using subscriptions. The update callback option doesn't appear to be available for a subscription (which provides the updated item data) or a query (which could fetch the updated item data).
As long as your subscription includes the full resource that was added, it should be possible by reading from and writing to the cache directly. Let's assume we have a subscription like this one from the docs:
const COMMENTS_SUBSCRIPTION = gql`
  subscription onCommentAdded {
    commentAdded {
      id
      content
    }
  }
`;
The Subscription component exposes an onSubscriptionData prop, so we should be able to do something along these lines:
<Subscription
  subscription={COMMENTS_SUBSCRIPTION}
  onSubscriptionData={({ client, subscriptionData: { data, error } }) => {
    if (!data) return
    const current = client.readQuery({ query: COMMENTS_QUERY })
    client.writeQuery({
      query: COMMENTS_QUERY,
      data: {
        comments: [...current.comments, data.commentAdded],
      },
    })
  }}
/>
Or, if you're using plain JavaScript instead of React:
const observable = client.subscribe({ query: COMMENTS_SUBSCRIPTION })
observable.subscribe({
  next: ({ data }) => {
    if (!data) return
    const current = client.readQuery({ query: COMMENTS_QUERY })
    client.writeQuery({
      query: COMMENTS_QUERY,
      data: {
        comments: [...current.comments, data.commentAdded],
      },
    })
  },
  complete: console.log,
  error: console.error
})
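One caveat with both snippets: depending on your Apollo Client version, client.readQuery can throw if COMMENTS_QUERY has never been fetched (i.e. the list is not in the cache yet), so you may want to wrap the read in a try/catch or make sure the initial query has completed before the subscription starts delivering data.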

Resource Conflict after syncing with PouchDB

I am new to CouchDB / PouchDB, and so far I have somehow managed to get everything started. I am using the couchdb-python library to send initial values to my CouchDB before I start developing the actual application. I have one database with templates of the data I want to include, and the actual database with all the data I will use in the application.
couch = couchdb.Server()
templates = couch['templates']
couch.delete('data')
data = couch.create('data')
In Python I have a loop in which I send one value after another to CouchDB:
value = templates['Template01']
value.update({ '_id' : 'Some ID' })
value.update({'Other Attribute': 'Some Value'})
...
data.save(value)
It was working fine the whole time; I needed to run this several times as my data had to be adjusted. After I was satisfied with the results, I started to create my application in JavaScript. I then synced PouchDB with the data database, and that was also working. However, I found out that I needed to change something in the Python code, so I ran the first Python script again, but now I get this error:
couchdb.http.ResourceConflict: (u'conflict', u'Document update conflict.')
I tried to destroy() the PouchDB database data and delete the CouchDB database as well, but I still get this error at this part of the code:
data.save(value)
What I also don't understand is that a few values are actually written to the database before this error appears, so some values do get save()d into the db.
I read that it has something to do with the _rev values of the documents, but I cannot find an answer. I hope someone can help here.
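For context: CouchDB rejects any update to an existing document that does not include the document's current _rev, which is exactly the conflict error above. Here is a hedged sketch of the usual workaround on the PouchDB/JavaScript side, where db and upsert are hypothetical names rather than code from the question:

const PouchDB = require('pouchdb')
const db = new PouchDB('data')

// Fetch the current revision (if any) and carry it over before writing,
// so the update is accepted instead of raising a conflict.
async function upsert(doc) {
  try {
    const existing = await db.get(doc._id)
    doc._rev = existing._rev
  } catch (err) {
    if (err.status !== 404) throw err   // 404 simply means the document is new
  }
  return db.put(doc)
}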

Collection name in Mongoose

Why would a database named 'blog' not allow a record insert, and also return no error, while a database named 'blogs' allows record inserts and returns errors?
I just spent several hours going through all my code thinking I did something wrong. I have written many Mongoose-connected apps, but when using the following it would report success yet not insert the record, and return no error as to why:
mongoose.connect('mongodb://localhost:27017/blog');
After banging my head against a wall for a bit I decided to change the database name:
mongoose.connect('mongodb://localhost:27017/blogs');
It works! But why would this naming convention matter? I can't find anything in the documentation for MongoDB or Mongoosejs.
So I'm fairly certain MongoDB doesn't care about the database name "blog" vs "blogs". However, do note that Mongoose has the questionably-helpful feature of silently queueing up operations while the database connection is not yet established, and then firing them off if/when the connection is ready. That could be causing your confusion. To test that theory, pass a callback to mongoose.connect and put a console.log in the callback, so you know exactly when the connection is ready.
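A minimal sketch of that test (whether connect takes a callback or returns a promise depends on your Mongoose version; this uses the promise form):

const mongoose = require('mongoose')

mongoose.connect('mongodb://localhost:27017/blog')
  .then(() => {
    console.log('connection ready')   // only issue queries/inserts after this point
  })
  .catch(err => console.error('connection failed', err))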
