How can I persist dialogData across different nodes? - node.js

As the documentation for the Microsoft Bot Framework says, there are several kinds of state data: dialogData, privateConversationData, conversationData and userData.
By default, userData is meant to handle persistence across nodes, whereas dialogData should only be used for temporary data.
As it says here: https://learn.microsoft.com/en-us/bot-framework/nodejs/bot-builder-nodejs-dialog-waterfall
If the bot is distributed across multiple compute nodes, each step of
the waterfall could be processed by a different node, therefore it's
important to store bot data in the appropriate data bag
So, basically, if I have two nodes, how/why should I use dialogData at all, given that I cannot guarantee it will be kept across nodes? It seems that if you have more than one node, you should just use userData.

I've asked the docs team to remove the last portion of the sentence: "therefore it's important to store bot data in the appropriate data bag". It is misleading. The Bot Builder SDK is RESTful and stateless. Each of dialogData, privateConversationData, conversationData and userData is stored in the State Service, so any "compute node" will be able to retrieve the data from any of these objects.
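For example, here is a minimal sketch with the Node Bot Builder SDK (v3), using a local ConsoleConnector just to keep it self-contained: dialogData written in one waterfall step is read back in the next, regardless of which compute node handles the next message, because the SDK round-trips it through the state service.

var builder = require('botbuilder');

var connector = new builder.ConsoleConnector().listen();
var bot = new builder.UniversalBot(connector, [
    function (session) {
        session.dialogData.answer = 42; // temporary, per-dialog state
        builder.Prompts.text(session, 'What is your name?');
    },
    function (session, results) {
        session.userData.name = results.response; // long-lived, per-user state
        session.endDialog('Hi ' + session.userData.name + ', the answer is still ' + session.dialogData.answer);
    }
]);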
Please note: the default Connector State Service is intended only for prototyping, and should not be used with production bots. Please use the Azure Extensions or implement a custom state client.
This blog post might also be helpful: Saving State data with BotBuilder-Azure in Node.js
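For instance, a rough sketch of swapping in your own Azure Table storage with botbuilder-azure (the table name, storage account name and key below are placeholders):

var builder = require('botbuilder');
var azure = require('botbuilder-azure');

// Placeholder credentials: replace with your own storage account details
var tableClient = new azure.AzureTableClient('botstate', '<storageAccountName>', '<storageAccountKey>');
var tableStorage = new azure.AzureBotStorage({ gzipData: false }, tableClient);

var connector = new builder.ChatConnector({
    appId: process.env.MICROSOFT_APP_ID,
    appPassword: process.env.MICROSOFT_APP_PASSWORD
});

// dialogData, conversationData, privateConversationData and userData are now
// persisted to your own Azure Table instead of the default state service.
var bot = new builder.UniversalBot(connector).set('storage', tableStorage);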

Related

Spark results accessible through API

We would really like some input here on how the results of a Spark query can be made accessible to a web application. Given how widely Spark is used in the industry, I would have thought this part would have lots of answers/tutorials, but I didn't find anything.
Here are a few options that come to mind
Spark results are saved in another DB (perhaps a traditional one) and a request for a query returns the new table name for access through a paginated query. That seems doable, although a bit convoluted, as we need to handle the completion of the query.
Spark results are pumped into a messaging queue from which a socket-server-like connection is made.
What confuses me is that other connectors to Spark, like those for Tableau, using something like JDBC, should have all the data (not just the top 500 that we typically get via Livy or other REST interfaces to Spark). How do those connectors get all the data through a single connection?
Can someone with expertise help in that sense?
The standard way I think would be to use Livy, as you mention. Since it's a REST API you wouldn't expect to get a JSON response containing the full result (could be gigabytes of data, after all).
Rather, you'd use pagination with ?from=500 and issue multiple requests to get the number of rows you need. A web application would only need to display or visualize a small part of the data at a time anyway.
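As a rough sketch (the endpoint below is a placeholder; check the exact path and parameters your Livy version or API layer exposes, and note that global fetch assumes Node 18+), the caller would simply request one page at a time:

// Placeholder endpoint and parameters - adjust to the actual REST API you expose
const BASE_URL = 'http://livy-host:8998/sessions/0/statements/0/result';
const PAGE_SIZE = 500;

async function fetchPage(from) {
    const res = await fetch(BASE_URL + '?from=' + from + '&size=' + PAGE_SIZE);
    if (!res.ok) throw new Error('Request failed: ' + res.status);
    return res.json();
}

// Display one page at a time; ask for the next page only when the user needs it
fetchPage(500).then(function (page) {
    console.log(page);
});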
But from what you mentioned in your comment to Raphael Roth, you didn't mean to call this API directly from the web app (with good reason). So you'll have an API layer that is called by the web app and which then invokes Spark. But in this case, you can still use Livy+pagination to achieve what you want, unless you specifically need to have the full result available. If you do need the full results generated on the backend, you could design the Spark queries so they materialize the result (ideally to cloud storage) and then all you need is to have your API layer access the storage where Spark writes the results.

Application-side join ORM for Node?

To start: I've tried Loopback. Loopback is nice but does not allow for relations across multiple REST data services, but rather makes a call to the initial data service and passes query parameters that ask it to perform the joined query.
Before I go reinventing the wheel and writing a massive wrapper around Loopback's loopback-rest-connector, I need to find out if there are any existing libraries or frameworks that already tackle this. My extensive Googling has turned up nothing so far.
In a true microservice environment, there is a service per database.
http://microservices.io/patterns/data/database-per-service.html
From this article:
Implementing queries that join data that is now in multiple databases
is challenging. There are various solutions:
Application-side joins - the application performs the join rather than
the database. For example, a service (or the API gateway) could
retrieve a customer and their orders by first retrieving the customer
from the customer service and then querying the order service to
return the customer’s most recent orders.
Command Query Responsibility Segregation (CQRS) - maintain one or more
materialized views that contain data from multiple services. The views
are kept by services that subscribe to events that each service
publishes when it updates its data. For example, the online store
could implement a query that finds customers in a particular region
and their recent orders by maintaining a view that joins customers and
orders. The view is updated by a service that subscribes to customer
and order events.
EXAMPLE:
I have 2 data microservices:
GET /pets - Returns an object like
{
    "name": "ugly",
    "type": "dog",
    "owner": "chris"
}
and on a completely different microservice....
GET /owners/{OWNER_NAME} - Returns the owner info
{
    "owner": "chris",
    "address": "under a bridge",
    "phone": "123-456-7890"
}
And I have an API-level microservice that is going to call these two data services. This is the microservice that I will be applying this at.
I'd like to be able to establish a model for Pet such that, when I query pet, upon a successful response from GET /pets, it will "join" with owners (send a GET /owners/{OWNERS_NAME} for all responses), and to the user, simply return a list of pets that includes their owner's data.
So GET /pets (maybe something like Pets.find()) would return
{
    "name": "ugly",
    "type": "dog",
    "owner": "chris",
    "address": "under a bridge",
    "phone": "123-456-7890"
}
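A minimal sketch of that application-side join in the API-level service, assuming Node 18+ (global fetch), placeholder base URLs, and that GET /pets returns an array:

// Placeholder base URLs for the two data microservices
const PETS_URL = 'http://pets-service/pets';
const OWNERS_URL = 'http://owners-service/owners';

async function findPetsWithOwners() {
    const pets = await (await fetch(PETS_URL)).json();
    // For every pet, look up its owner and merge the owner's fields in
    return Promise.all(pets.map(async function (pet) {
        const owner = await (await fetch(OWNERS_URL + '/' + pet.owner)).json();
        return { name: pet.name, type: pet.type, owner: pet.owner, address: owner.address, phone: owner.phone };
    }));
}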
Applying model/domain logic at your API gateway is a bad decision and is generally considered bad practice. An API gateway should only handle your system's CAS (relying on an Auth service that holds that logic), convert incoming external requests into internal system requests (different headers/requester payload data), proxy the formatted requests to services for the actual work, receive the responses, take care of encapsulating errors, and present every response in the proper external form.
Another point: if a lot of joins between two models are required for your application's core flow (validation/scoping etc.), then perhaps you should reconsider which Business Domain your models/services are bound to. If it's the same business domain, perhaps they should live together. The principles of Domain-Driven Design helped me understand where the real boundaries between microservices are.
If you work with LoopBack (like we do, and you face the same problem we faced - that LoopBack has no proper join implementation), you can have a separate Report/Combined data service, which is the only one allowed to access all the service databases, and does so only for READ purposes - i.e. queries. Provide it with separately set-up, read-only, wide access to the databases: instead of having only one datasource configured (a single database), it should be able to read from all the databases that are in scope of this query-join DB user.
Such a service should be able to generate the proper joins with the expected output schema from a configuration JSON - like LoopBack models (that's what I did in the same situation). Once the abstraction is done, it's pretty simple to build/add any query with any complex joins. It's clean, and it's easy to reason about. Also, it's DBA friendly. For me, this approach has worked well so far.
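For illustration only, such a read-only report service's LoopBack datasources.json might look roughly like this (hosts, database names and credentials are placeholders):

{
    "petsDb": {
        "name": "petsDb",
        "connector": "postgresql",
        "host": "pets-db-host",
        "port": 5432,
        "database": "pets",
        "username": "report_reader",
        "password": "<read-only password>"
    },
    "ownersDb": {
        "name": "ownersDb",
        "connector": "postgresql",
        "host": "owners-db-host",
        "port": 5432,
        "database": "owners",
        "username": "report_reader",
        "password": "<read-only password>"
    }
}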

Real-Time Database Messaging

We've got an application in Django running against a PGSQL database. One of the functions we've grown to support is real-time messaging to our UI when data is updated in the backend DB.
So... for example we show the contents of a customer table in our UI, as records are added/removed/updated from the backend customer DB table we echo those updates to our UI in real-time via some redis/socket.io/node.js magic.
Currently we've rolled our own solution for this entire thing using overloaded save() methods on the Django table models. That actually works pretty well for our current needs, but as tables continue to grow into GBs of data, it is starting to slow down on some of the larger tables: our engine has to dig through the currently 'subscribed' UIs and work out which updates need to be sent to which clients.
Curious what other options might exist here. I believe MongoDB and other no-sql type engines support some constructs like this out of the box but I'm not finding an exact hit when Googling for better solutions.
Currently we've rolled our own solution for this entire thing using
overloaded save() methods on the Django table models.
Instead of working on the app level you might want to work on the lower, database level.
Add a PostgreSQL trigger after row insertion, and use pg_notify to notify external apps of the change.
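For example, here is a sketch of setting up such a trigger from Node with the pg client (the customer table, function and channel names are placeholders; adapt the SQL to your schema):

var pg = require('pg');
var client = new pg.Client('postgres://username@localhost/database');

// Wrap the new row as JSON and hand it to pg_notify on every insert/update
var setupSql =
    "CREATE OR REPLACE FUNCTION notify_customer_change() RETURNS trigger AS $$ " +
    "BEGIN " +
    "  PERFORM pg_notify('channelName', row_to_json(NEW)::text); " +
    "  RETURN NEW; " +
    "END; " +
    "$$ LANGUAGE plpgsql; " +
    "DROP TRIGGER IF EXISTS customer_change ON customer; " +
    "CREATE TRIGGER customer_change AFTER INSERT OR UPDATE ON customer " +
    "FOR EACH ROW EXECUTE PROCEDURE notify_customer_change();";

client.connect(function (err) {
    if (err) throw err;
    client.query(setupSql, function (err) {
        if (err) throw err;
        client.end();
    });
});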
Then in NodeJS:
var PGPubsub = require('pg-pubsub');
var pubsubInstance = new PGPubsub('postgres://username@localhost/database');

pubsubInstance.addChannel('channelName', function (channelPayload) {
    // Handle the notification and its payload
    // If the payload was JSON it has already been parsed for you
});
See that and that.
And you will be able to do the same in Python: https://pypi.python.org/pypi/pgpubsub/0.0.2.
Finally, you might want to use data partitioning in PostgreSQL. Long story short, PostgreSQL already has everything you need :)

Fetching Initial Data from CloudKit

Here is a common scenario: app is installed the first time and needs some initial data. You could bundle it in the app and have it load from a plist or something, or a CSV file. Or you could go get it from a remote store.
I want to get it from CloudKit. Yes, I know that CloudKit is not to be treated as a remote database but rather a hub. I am fine with that. Frankly I think this use case is one of the only holes in that strategy.
Imagine I have an object graph I need to get that has one class at the base and then 3 or 4 related classes. I want the new user to install the app and then get the latest version of this class. If I use CloudKit, I have to load each entity with a separate fetch and assemble the whole. It's ugly and not generic. Once I do that, I will go into change tracking mode. Listening for updates and syncing my local copy.
In some ways this is similar to the challenge that you have using Services on Android: suppose I have a service for the weather forecast. When I subscribe to it, I will not get the weather until tomorrow when it creates its next new forecast. To handle the deficiency of this, the Android Services SDK allows me to make 'sticky' services where I can get the last message that service produced upon subscribing.
I am thinking of doing something similar in a generic way: making it possible to hold a snapshot of some object graph, probably in JSON, with a version token, and then for initial loads, just being able to fetch those and turn them into CoreData object graphs locally.
Question is does this strategy make sense or should I hold my nose and write pyramid of doom code with nested queries? (Don't suggest using CoreData syncing as that has been deprecated.)
Your question is a bit old, so you probably already moved on from this, but I figured I'd suggest an option.
You could create a record type called Data in the Public database in your CloudKit container. Within Data, you could have a field named structure that is a String (or a CKAsset if you wanted to attach a JSON file).
Then on every app load, you query the public database and pull down the structure string that holds your class definitions, and use it however you like. Since it's in the public database, all your users would have access to it. Good luck!
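Purely as an illustration (and since the rest of this page is JavaScript), the same fetch expressed with CloudKit JS might look roughly like this; in the iOS app itself you would issue the equivalent CKQuery against the public database. The container identifier and API token are placeholders, and the Data/structure names come from the suggestion above.

CloudKit.configure({
    containers: [{
        containerIdentifier: 'iCloud.com.example.MyApp',   // placeholder
        apiTokenAuth: { apiToken: '<api token>', persist: true },
        environment: 'development'
    }]
});

var publicDB = CloudKit.getDefaultContainer().publicCloudDatabase;

// Fetch the "Data" record and read its "structure" field (a JSON string)
publicDB.performQuery({ recordType: 'Data' }).then(function (response) {
    if (response.hasErrors) throw response.errors[0];
    var structure = JSON.parse(response.records[0].fields.structure.value);
    console.log(structure);
});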

Getting the next node id for a Neo4j Node using the REST API

EDIT
When I am talking about a node and a node id, I am specifically talking about the Neo4j representation of a node, not "node" as in Node.js.
I am building an application on top of Neo4j with Node, using the thingdom wrapper on top of the REST API, and I am attempting to add my own custom id property that will be a hash of the id, to be used in the URL for example.
What I am currently doing is creating the node and then, once the id is returned, hashing it and saving it back to the node, so in effect I am calling the REST API twice to create a single node.
This is a long shot, but is there a way to get a reliable next id from Neo4j using the REST API, so that I can do this all in one request?
If not, does anyone know of a better approach to what I am doing?
The internal id of Neo4j nodes is not supposed to be used in external interfaces, as noted in the documentation. In particular, it's not a good idea to try to guess the next id.
It's recommended to use application-specific ids to reference nodes. If you use UUIDs (especially UUID version 4), there is only a minimal chance of collisions, and you can compute them on node creation, before storing them in the database.
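A small sketch of that idea with the thingdom node-neo4j wrapper the question mentions, generating the UUID before the single save (the property names are up to you):

var neo4j = require('neo4j');        // thingdom's node-neo4j wrapper
var uuid = require('node-uuid');     // or the newer 'uuid' package

var db = new neo4j.GraphDatabase('http://localhost:7474');

// Compute the application-level id up front, so one save() call is enough
var node = db.createNode({
    uuid: uuid.v4(),
    name: 'example'
});

node.save(function (err, saved) {
    if (err) throw err;
    console.log('Created node with application id', saved.data.uuid);
});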
Out of curiosity, may I ask why you need the id stored in the node?
In any case, it's quite common in Node.js to call a succession of APIs, and you will find that with Neo4j it is often required.
If you don't already use it, I can only suggest you take a look at Async: https://github.com/caolan/async
In particular, the "waterfall" method lets you chain calls where each one uses the result of the previous call.
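A tiny illustrative sketch of the waterfall pattern (the two steps below just stand in for your two REST calls):

var async = require('async');

async.waterfall([
    function createNode(callback) {
        // first call, e.g. create the node and get its id back
        callback(null, { id: 123 });
    },
    function saveHash(node, callback) {
        // second call, using the id returned by the previous step
        node.hash = 'hash-of-' + node.id;
        callback(null, node);
    }
], function (err, result) {
    if (err) throw err;
    console.log(result); // { id: 123, hash: 'hash-of-123' }
});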
