Bulk Import using nodeJS with Azure DocumentDB. Partition Key issue - node.js

I'm attempting to bulk import a lot of data into DocumentDB using Node. I've created the recommended stored procedure (the one that has been posted numerous times here and on MSDN). However, when I try executing it via Node, I get the following response: "Requests originating from scripts cannot reference partition keys other than the one for which client request was submitted."
The partitionKey on the Collection is "/id", and my request in Node looks like this:
client.executeStoredProcedure('dbs/blah/blah/sprocs/blah', [docs], {partitionKey: '/id'}, ...);
I can't find any concise documentation on this as it relates to Node specifically. Has anyone encountered this issue, and is there a solution? I'm completely open to the idea that I'm making a silly mistake, but this is my first rodeo, so to speak, with DocumentDB. Thank you all.

I suspect that the set you are trying to import contains data with an /id other than the one you specify in the request. If that's the case, I can think of two options:
Don't use the sproc; load the documents one at a time from your node.js script. I recommend using one of the async.js functions, or the equivalent from some other parallelization library, so you can have many requests going in parallel (a sketch of this follows below).
Separate the data client-side by /id and then call the sproc once for each set.
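A minimal sketch of the first option, assuming the documentdb Node SDK and async.js, and assuming docs is the array you were passing to the sproc; the endpoint, key, and collection link are placeholders:

var documentdb = require('documentdb');
var async = require('async');

// Placeholders - substitute your own endpoint, key and collection link
var client = new documentdb.DocumentClient('https://youraccount.documents.azure.com:443/', { masterKey: 'yourKey' });
var collectionLink = 'dbs/yourDb/colls/yourCollection';

// Insert the documents one at a time, with up to 10 requests in flight at once
async.eachLimit(docs, 10, function (doc, done) {
  client.createDocument(collectionLink, doc, function (err) { done(err); });
}, function (err) {
  if (err) { console.error('Import failed:', err); }
  else { console.log('Imported', docs.length, 'documents'); }
});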

Related

CosmosDB return data from external API on read

I am attempting to write an Azure CosmosDB integration (Core SQL API) that integrates with an external service to provide some of the query data. As an example, I need a query made on Cosmos DB to convert some of the data returned by the query (e.g. IDs) into real data by calling an external service via a REST API. This should only happen when querying certain columns.
I initially investigated using a JS stored procedure and/or a UDF to make this external call, but the JS environment seems to be extremely limited and doesn't provide any way to make external calls. I then tried the https://github.com/Oblarg/cosmosdb-storedprocs-ts repository, which uses webpack to bundle all of node.js into the stored procedure, allowing node modules to be used in stored procedures. While this does allow some node modules to be used, whenever I try to use the "https", "fetch", or "axios" modules to make an HTTP GET request I get errors (the same code works fine in a normal node environment, but I'm not a JS expert and can't seem to work past these errors). After a day of attempts it seems like the stored procedure approach is not possible.
Is this the case or is there some way of making HTTP GET requests from a JS stored procedure? If not possible with stored procedures, are there any other techniques to achieve the requirement of reading data from a remote API when querying cosmos DB?
Thanks
There is no way to achieve this from Cosmos DB directly. For queries you also cannot use the change feed, since the documents don't change, so really your only option is to use a function or some preprocessor app to handle it. As you say, it's not ideal, but there is no other solution here. If it were an insert or an update, the change feed would allow you to do this, but for plain queries it's not possible.
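For what it's worth, a rough sketch of that preprocessor approach using the @azure/cosmos SDK and Node 18+'s global fetch; the account, container names, the externalId field, and the external endpoint are all assumptions:

const { CosmosClient } = require('@azure/cosmos');

// Assumed configuration - substitute your own account, key, database and container
const client = new CosmosClient({ endpoint: 'https://youraccount.documents.azure.com', key: 'yourKey' });
const container = client.database('yourDb').container('yourContainer');

async function queryWithEnrichment(querySpec) {
  // Run the normal Cosmos DB query first
  const { resources } = await container.items.query(querySpec).fetchAll();

  // Then resolve each returned ID against the external service (hypothetical endpoint)
  return Promise.all(resources.map(async (item) => {
    const res = await fetch('https://external-service.example.com/items/' + item.externalId);
    item.externalData = await res.json();
    return item;
  }));
}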

Spark results accessible through API

We would really like some input here on how the results from a Spark query can be made accessible to a web application. Given that Spark is well used in the industry, I would have thought this part would have lots of answers/tutorials about it, but I didn't find anything.
Here are a few options that come to mind
Spark results are saved in another DB ( perhaps a traditional one) and a request for query returns the new table name for access through a paginated query. That seems doable, although a bit convoluted as we need to handle the completion of the query.
Spark results are pumped into a messaging queue from which a socket server like connection is made.
What confuses me is that other connectors to Spark, like those for Tableau, using something like JDBC, seem to have access to all the data (not just the top 500 that we typically get via Livy or other REST interfaces to Spark). How do those connectors get all the data through a single connection?
Can someone with expertise help in that sense?
The standard way I think would be to use Livy, as you mention. Since it's a REST API you wouldn't expect to get a JSON response containing the full result (could be gigabytes of data, after all).
Rather, you'd use pagination with ?from=500 and issue multiple requests to get the number of rows you need. A web application would only need to display or visualize a small part of the data at a time anyway.
But from what you mentioned in your comment to Raphael Roth, you didn't mean to call this API directly from the web app (with good reason). So you'll have an API layer that is called by the web app and which then invokes Spark. But in this case, you can still use Livy+pagination to achieve what you want, unless you specifically need to have the full result available. If you do need the full results generated on the backend, you could design the Spark queries so they materialize the result (ideally to cloud storage) and then all you need is to have your API layer access the storage where Spark writes the results.
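As an illustration of the second variant (materialized results plus an API layer), here is a deliberately naive Express sketch; the result path, page size, and JSON-lines format are assumptions, and in practice the file would live in cloud storage and be streamed rather than read whole:

const express = require('express');
const fs = require('fs');

const app = express();

// Hypothetical: Spark has materialized the query result as JSON lines at this path
const RESULT_PATH = '/data/spark-results/query-123.jsonl';

app.get('/results', (req, res) => {
  const from = parseInt(req.query.from || '0', 10);
  const size = parseInt(req.query.size || '500', 10);

  // Naive: fine for small results; a real implementation would stream or index the file
  const rows = fs.readFileSync(RESULT_PATH, 'utf8')
    .split('\n')
    .filter(Boolean)
    .map(function (line) { return JSON.parse(line); });

  // Return only the requested page, plus the total so the web app can paginate
  res.json({ total: rows.length, rows: rows.slice(from, from + size) });
});

app.listen(3000);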

Import data from Clio to Azure database using API v4

Let me start out by saying I am a SQL Server database expert, not a coder, so making API calls is certainly not an everyday task for me.
Having said that, I am trying to use Azure Data Factory's data copy tool to import data from Clio into an Azure SQL Server database. I have had some limited success: data is copied over using the API and inserted into the target table, but paging really seems to be an issue. I am testing this with the billable_clients call, and the first 25 records with the fields I specify are inserted along with the paging record. As I understand it, the billable_clients call is eligible for bulk actions, which may be the solution, although I've not been able to figure out how it works. The URL I am calling is below:
https://app.clio.com/api/v4/billable_clients.json?fields=id,unbilled_hours,name
Using Postman I've tried to make the same call while adding X-BULK: true to the headers, but that returns no results. If anyone can shed some light on how the X-BULK header flag is used when making a call, or has any experience loading Clio data into a SQL Server database, I'd love some feedback on your methods.
If any additional information regarding my attempts or setup would help please let me know.
Thanks!
You need to download the JSON files with the Bulk API and then load them into the DB.
It isn't possible to insert the data directly.
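To make that concrete, here is a heavily hedged Node sketch of the flow: the X-BULK polling behaviour and the response fields (data.status, data.records[].url) are assumptions to be checked against Clio's bulk actions documentation, and the SQL table/columns are placeholders. It uses Node 18+'s global fetch and the mssql package:

const sql = require('mssql');

const CLIO_URL = 'https://app.clio.com/api/v4/billable_clients.json?fields=id,unbilled_hours,name';
const headers = { Authorization: 'Bearer YOUR_TOKEN', 'X-BULK': 'true' };

async function importBillableClients() {
  // 1. Kick off the bulk action, then poll until Clio reports the export file is ready
  //    (assumed response shape: { data: { status, records: [{ url }] } })
  let status;
  do {
    const res = await fetch(CLIO_URL, { headers });
    status = await res.json();
    if (status.data.status !== 'completed') await new Promise(r => setTimeout(r, 5000));
  } while (status.data.status !== 'completed');

  // 2. Download the generated JSON file
  const fileRes = await fetch(status.data.records[0].url);
  const clients = (await fileRes.json()).data;

  // 3. Insert into Azure SQL (placeholder connection string, table and columns)
  await sql.connect('mssql://user:pass@yourserver.database.windows.net/yourdb');
  for (const c of clients) {
    await new sql.Request()
      .input('id', sql.Int, c.id)
      .input('name', sql.NVarChar, c.name)
      .input('hours', sql.Float, c.unbilled_hours)
      .query('INSERT INTO BillableClients (Id, Name, UnbilledHours) VALUES (@id, @name, @hours)');
  }
}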

How to execute HTTP DELETE request in AWS Lambda Nodejs function

I am trying to create an AWS Lambda transform function for a Firehose stream that sends records directly to an Elasticsearch cluster.
Currently, there is no way to specify an ES document id in a Firehose stream record, so all records, even duplicates, are inserted. However, Firehose does support transformation functions hosted in Lambda, which gave me an idea:
My solution is to create a Lambda transform function that executes a DELETE request to Elasticsearch for every record during transformation and then returns all records unmodified, thereby achieving "delete-insert" behaviour (I am OK with a record disappearing for a short period).
However, I know very little about Nodejs and even though this is such a simple thing, I can't figure out how to do it.
Is there a Node package available to Lambda that I can use to do this? (Preferably an AWS Elasticsearch API, but a simple HTTP package would do).
Do I have to package up some other module to get this done?
Can something like Apex help me out here? My preferred language is Go, but so far I have been unable to get Apex functions to execute or log anything to CloudWatch...
Thanks in advance.
This seems like a simple task, so I don't think a framework is needed.
A few lines of code in Node.js will get it done.
There are two packages that can help you:
elasticsearch-js
http-aws-es (If your ES domain is protected and you need to sign the requests)
The API doc: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html
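For reference, a sketch of what that transform function might look like with the elasticsearch package and http-aws-es; the ES endpoint, index/type, and the doc.id field are assumptions, and depending on the http-aws-es version you may also need to pass region/credentials config. The recordId/result/data shape is the standard Firehose transformation contract:

'use strict';

const elasticsearch = require('elasticsearch');

// http-aws-es signs requests with the Lambda role's credentials
// (some versions also require amazonES/region configuration)
const client = new elasticsearch.Client({
  hosts: ['https://your-es-domain.us-east-1.es.amazonaws.com'],
  connectionClass: require('http-aws-es')
});

exports.handler = async (event) => {
  const records = await Promise.all(event.records.map(async (record) => {
    // Firehose delivers the payload base64-encoded
    const doc = JSON.parse(Buffer.from(record.data, 'base64').toString('utf8'));

    // Delete any existing document with the same id (assumed field: doc.id)
    try {
      await client.delete({ index: 'myindex', type: '_doc', id: doc.id });
    } catch (e) {
      // A 404 just means the document didn't exist yet
      if (e.status !== 404) throw e;
    }

    // Return the record unmodified so Firehose still inserts it afterwards
    return { recordId: record.recordId, result: 'Ok', data: record.data };
  }));

  return { records };
};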

Real-Time Database Messaging

We've got an application in Django running against a PGSQL database. One of the functions we've grown to support is real-time messaging to our UI when data is updated in the backend DB.
So... for example we show the contents of a customer table in our UI, as records are added/removed/updated from the backend customer DB table we echo those updates to our UI in real-time via some redis/socket.io/node.js magic.
Currently we've rolled our own solution for this entire thing using overloaded save() methods on the Django table models. That actually works pretty well for our current functions, but as tables continue to grow into GBs of data, it is starting to slow down on some larger tables as our engine digs through the currently 'subscribed' UIs and messages out which updates are needed by which clients.
Curious what other options might exist here. I believe MongoDB and other no-sql type engines support some constructs like this out of the box but I'm not finding an exact hit when Googling for better solutions.
Currently we've rolled our own solution for this entire thing using
overloaded save() methods on the Django table models.
Instead of working at the app level, you might want to work at the lower, database level.
Add a PostgreSQL trigger after row insertion, and use pg_notify to notify external apps of the change.
Then in NodeJS:
var PGPubsub = require('pg-pubsub');

// Connection string format: postgres://user@host/database
var pubsubInstance = new PGPubsub('postgres://username@localhost/database');

// Listen for NOTIFY events published on the given channel
pubsubInstance.addChannel('channelName', function (channelPayload) {
  // Handle the notification and its payload
  // If the payload was JSON it has already been parsed for you
});
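For completeness, a sketch of the trigger side mentioned above, created here from Node with the pg client; the customer table, function name, and channel name are placeholders, and the channel must match the addChannel call:

var pg = require('pg');

var client = new pg.Client('postgres://username@localhost/database');

// Create a trigger function that publishes the new row on 'channelName',
// and attach it to the customer table (names are placeholders)
client.connect(function (err) {
  if (err) throw err;
  client.query(
    "CREATE OR REPLACE FUNCTION notify_customer_change() RETURNS trigger AS $$ " +
    "BEGIN PERFORM pg_notify('channelName', row_to_json(NEW)::text); RETURN NEW; END; " +
    "$$ LANGUAGE plpgsql; " +
    "DROP TRIGGER IF EXISTS customer_notify ON customer; " +
    "CREATE TRIGGER customer_notify AFTER INSERT OR UPDATE ON customer " +
    "FOR EACH ROW EXECUTE PROCEDURE notify_customer_change();",
    function (err) {
      if (err) throw err;
      client.end();
    }
  );
});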
You will be able to do the same in Python with https://pypi.python.org/pypi/pgpubsub/0.0.2.
Finally, you might want to use data partitioning in PostgreSQL. Long story short, PostgreSQL already has everything you need :)
