What is the best way to stream data in real time into BigQuery (using Node)?

I want to stream HTTP requests into BigQuery, in real time (or near real time).
Ideally, I would like to use a tool that provides an endpoint to stream HTTP requests to and allows me to write simple Node such that:
1. I can add the appropriate insertId so BigQuery can dedupe requests if necessary and
2. I can batch the data so I don't send a single row at a time (which would result in unnecessary GCP costs)
I have tried using AWS Lambda or Google Cloud Functions, but the setup those platforms require far exceeds the needs of this use case. I assume many developers have this same problem and there must be a better solution.

Since you are looking for a way to stream HTTP requests to BigQuery and also send them in batches to minimize Google Cloud Platform costs, you might want to take a look at the public documentation on streaming data into BigQuery, where this is explained.
You can also find a Node.js sample showing how to perform a streaming insert into BigQuery:
// Imports the Google Cloud client library
const {BigQuery} = require('@google-cloud/bigquery');

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = "your-project-id";
// const datasetId = "my_dataset";
// const tableId = "my_table";
// const rows = [{name: "Tom", age: 30}, {name: "Jane", age: 32}];

// Creates a client
const bigquery = new BigQuery({
  projectId: projectId,
});

// Inserts data into a table (await is only valid inside an async function)
async function insertRows() {
  await bigquery
    .dataset(datasetId)
    .table(tableId)
    .insert(rows);
  console.log(`Inserted ${rows.length} rows`);
}

insertRows();
As for the batch part, the recommendation is to keep it to around 500 rows per request, even though the hard limit is 10,000 rows per request. More information can be found in the Quotas & Limits for streaming inserts section of the public documentation.
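To cover both points in the question (deduplication via insertId and batching), a minimal sketch along these lines might work; the 500-row chunking, the requestId field and the dataset/table names are placeholder assumptions, and it relies on the client's raw rows option so each row can carry its own insertId:
// Hypothetical sketch: buffer incoming rows, then flush them in batches
// of up to 500, each row carrying an explicit insertId for best-effort dedup.
const {BigQuery} = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function flushRows(bufferedRows) {
  // Wrap each row in the "raw" format so insertId can be set per row.
  const rawRows = bufferedRows.map((row) => ({
    insertId: row.requestId, // placeholder: any stable, unique id from the request
    json: row,
  }));

  // Send at most 500 rows per insert call.
  for (let i = 0; i < rawRows.length; i += 500) {
    const batch = rawRows.slice(i, i + 500);
    await bigquery
      .dataset('my_dataset') // placeholder dataset
      .table('my_table')     // placeholder table
      .insert(batch, {raw: true});
  }
}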

You can make use of Cloud Functions. With a Cloud Function you can create your own API in Node.js and then use it for streaming data into BigQuery.
The target architecture for streaming would look like this:
Pub/Sub subscriber (push type) -> Google Cloud Function -> Google BigQuery
You can use the same API in batch mode as well, with the help of Cloud Composer (i.e. Apache Airflow) or Cloud Scheduler to schedule it as per your requirements.
The target architecture for batch would look like this:
Cloud Scheduler/Cloud Composer -> Google Cloud Function -> Google BigQuery
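For illustration, a rough sketch of the streaming leg might look like the following, assuming an HTTP-triggered Cloud Function used as the Pub/Sub push endpoint; the function, dataset and table names are placeholders:
// Hypothetical HTTP-triggered Cloud Function acting as the Pub/Sub push endpoint.
const {BigQuery} = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

exports.streamToBigQuery = async (req, res) => {
  // A Pub/Sub push request body looks like:
  // {message: {data: '<base64>', messageId: '...'}, subscription: '...'}
  const message = req.body.message;
  const payload = JSON.parse(Buffer.from(message.data, 'base64').toString());

  // Reuse the Pub/Sub messageId as the insertId for best-effort deduplication.
  await bigquery
    .dataset('my_dataset') // placeholder dataset
    .table('my_table')     // placeholder table
    .insert([{insertId: message.messageId, json: payload}], {raw: true});

  res.status(204).send();
};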

Related

How to create a Flutter Stream using MongoDB (watch collection?) with Firebase Cloud Function

I've been trying out MongoDB as the database for my Flutter project lately, since I want to migrate away from a pure Firebase database (some Firebase limitations are an issue for my project, like the limit of 10 values for "in-array" queries).
I have already written some CRUD methods in Firebase Cloud Functions using MongoDB. I'm now able to save data and display it as a Future in a Flutter app (a simple ListView of users in a FutureBuilder).
My question is: how could I create a StreamBuilder backed by MongoDB and Firebase Cloud Functions? I saw some material about watching collections and change streams, but nothing clear enough for me (usually I read a lot of examples or tutorials to understand).
Maybe some of you have clues or tutorials I can read/watch to learn a little more about the subject?
For now, I have this as an example (a Node.js Cloud Function deployed to Firebase), which obviously produces a Future in my Flutter app (not realtime):
// Imports implied by the original snippet
const functions = require('firebase-functions');
const { MongoClient } = require('mongodb');

exports.getUsers = functions.https.onCall(async (data, context) => {
  const uri = "mongodb+srv://....";
  const client = new MongoClient(uri);
  await client.connect();
  const results = await client.db("myDB").collection("user").find({}).toArray();
  await client.close();
  return results;
});
What would you advise to obtain a Stream instead of a Future, perhaps using MongoDB's collection watch and change streams? Please provide an example if possible!
Thank you very much!
Cloud Functions are meant for short-lived operations, not for long-term listeners. It is not possible to create long-lived connections from Cloud Functions, neither to other services (such as the one you're trying to open to MongoDB here) nor from Cloud Functions back to the calling client.
Also see:
If I implement onSnapshot real-time listener to Firestore in Cloud Function will it cost more?
Can a Firestore query listener "listen" to a cloud function?
The documentation on Eventarc, which is the platform that allows you to build custom triggers. It'll be a lot more involved, though.
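For context, the reason this matters here is that a MongoDB change stream needs a cursor that stays open, which only a long-lived process (e.g. on a VM or Cloud Run, not a Cloud Function) can provide. A minimal sketch of such a watcher, with placeholder connection string and collection names:
// Hypothetical long-lived watcher; this is exactly the kind of open-ended
// listener a short-lived Cloud Function cannot keep running.
const { MongoClient } = require('mongodb');

async function watchUsers() {
  const client = new MongoClient('mongodb+srv://...'); // placeholder URI
  await client.connect();

  const changeStream = client.db('myDB').collection('user').watch();
  changeStream.on('change', (change) => {
    // Push the change somewhere the Flutter client can stream from,
    // e.g. write it into Firestore or fan it out over a websocket.
    console.log('change event:', change.operationType);
  });
}

watchUsers().catch(console.error);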

Why does a simple SQL query cause a significant slowdown in my Lambda function?

I built a basic node.js API hosted on AWS Lambda and served over AWS API Gateway. This is the code:
'use strict';

// Require and initialize outside of your main handler
const mysql = require('serverless-mysql')({
  config: {
    host     : process.env.ENDPOINT,
    database : process.env.DATABASE,
    user     : process.env.USERNAME,
    password : process.env.PASSWORD
  }
});

// Import the Dialogflow module from the Actions on Google client library.
const {dialogflow} = require('actions-on-google');

// Instantiate the Dialogflow client.
const app = dialogflow({debug: true});

// Handle the Dialogflow intent named 'trip name'.
// The intent collects a parameter named 'tripName'.
app.intent('trip name', async (conv, {tripName}) => {
  // Run your query
  let results = await mysql.query('SELECT * FROM tablename where field = ? limit 1', tripName);

  // Respond with the user's lucky number and end the conversation.
  conv.close('Your lucky number is ' + results[0].id);

  // Run clean up function
  await mysql.end();
});

// Set the DialogflowApp object to handle the HTTPS POST request.
exports.fulfillment = app;
It receives a parameter (a trip name), looks it up in MySQL and returns the result.
My issue is that the API takes more than 5 seconds to respond, which is slow.
I'm not sure why it's slow; the database is a powerful Amazon Aurora instance and Node.js is supposed to be fast.
I tested the function from the same AWS region as the database (Mumbai) and it still times out, so I don't think it has to do with distance between regions.
The slowness comes from carrying out any SQL query (even a dead simple SELECT). It does bring back the correct result, just slowly.
When I remove the SQL part it becomes blazing fast. I increased the Lambda memory to the maximum and redeployed Aurora to a far more powerful instance.
Lambda functions will run faster if you configure more memory. The less memory configured, the worse the performance.
This means if you have configured your function to use 128MB, it's going to be run in a very low profile hardware.
On the other hand, if you configure it to use 3GB, it will run in a very decent machine.
At 1792MB, your function runs on hardware with a dedicated core, which will speed up your code significantly given that you are making IO calls (network requests, for example). This relationship between memory and CPU is described in the AWS Lambda documentation.
There's no magic formula, though. You have to run a few tests and see which memory configuration best suits your application. I would start with 3GB and then decrease it in chunks of 128MB until you find the right configuration.
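If it helps, here is a rough sketch of how one might try a few memory settings programmatically with the AWS SDK for JavaScript; the function name, region and test payload are placeholder assumptions, and this is just one way to benchmark, not an official recipe:
// Hypothetical benchmark: bump the memory size, invoke once, and time it.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda({region: 'ap-south-1'}); // Mumbai, as in the question

async function benchmark() {
  for (const memorySize of [1024, 1792, 3008]) {
    await lambda.updateFunctionConfiguration({
      FunctionName: 'fulfillment', // placeholder function name
      MemorySize: memorySize,
    }).promise();
    // In practice, wait for the update to finish before invoking
    // (e.g. poll getFunctionConfiguration until LastUpdateStatus is 'Successful').

    const start = Date.now();
    await lambda.invoke({
      FunctionName: 'fulfillment',
      Payload: JSON.stringify({tripName: 'test'}), // placeholder payload
    }).promise();
    console.log(`${memorySize}MB -> ${Date.now() - start}ms`);
  }
}

benchmark().catch(console.error);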

Export Firebase data node to PDF report

What I have is a mobile app emergency message system which uses Firebase as a backend. When an emergency event ends, I would like to capture the message log in a PDF document. I have not been able to find any report editors that work with Firebase, which means I may have to export the data to PHP/MySQL. The Firebase PHP SDK looks to be overkill for this task. I have been googling "php get from firebase" and most responses have to do with using the Firebase PHP SDK. Is this the only way it can be accomplished?
You could use PDFKit (...) on Cloud Functions (it's all Node.js; no PHP is available there).
On npmjs.com there are several packages for Firebase ops, googleapis and @google-cloud.
In order to read from Firebase and write to a Storage bucket or Datastore, that example script would still need a database reference and a storage destination, so it can render the PDF content (possibly from a template) and put it where it belongs. Also see firebase / functions-samples (especially the package.json, which defines the dependencies). npm install -g firebase-tools installs the tools required for deployment; the required modules also need to be installed locally so they are known during development (much like Composer in PHP), while remotely they are installed during the deployment process.
You'd need a) a Firebase onUpdate() event as the trigger, b) to check the endTime of the returned DeltaSnapshot for a value, and c) to then render and store the PDF document. The code may vary; the following just gives a coarse idea of how it works within the given environment:
'use strict';

const admin = require('firebase-admin');
const functions = require('firebase-functions');
const PDFDocument = require('pdfkit');
const gcs = require('@google-cloud/storage')();
const bucket = gcs.bucket('some-bucket');
const fs = require('fs');

// TODO: obtain a handle to the delta snapshot

// TODO: render the report
var pdf = new PDFDocument({
  size: 'A4',
  info: {Title: 'Title of File', Author: 'Author'}
});
pdf.text('Emergency Incident Report');
pdf.pipe(
  // TODO: figure out how / where to store the file
  fs.createWriteStream('./path/to/file.pdf')
).on('finish', function () {
  console.log('PDF closed');
});
pdf.end();
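To tie the snippet above to the trigger described in a) and b), the wiring might look roughly like the sketch below; the database path, field names and upload destination are assumptions, and it uses the Change object of newer firebase-functions versions rather than the older DeltaSnapshot:
// Hypothetical trigger wiring (reuses the requires and `bucket` from the
// snippet above): run when an incident is updated and its endTime is set,
// then render the PDF and stream it into the bucket.
exports.incidentReport = functions.database
  .ref('/incidents/{incidentId}') // placeholder path
  .onUpdate((change, context) => {
    const incident = change.after.val();
    if (!incident || !incident.endTime) {
      return null; // the event has not ended yet
    }

    const pdf = new PDFDocument({size: 'A4'});
    const file = bucket.file(`reports/${context.params.incidentId}.pdf`);
    const upload = pdf.pipe(file.createWriteStream({contentType: 'application/pdf'}));

    pdf.text('Emergency Incident Report');
    pdf.text(JSON.stringify(incident.messages || {}, null, 2)); // placeholder rendering
    pdf.end();

    return new Promise((resolve, reject) => {
      upload.on('finish', resolve).on('error', reject);
    });
  });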
Externally running PHP code would, in this case, still not run on the server side. The problem with that approach is that an external server won't receive any realtime trigger, so the file would not appear instantly upon a timestamp update (as one would expect from a Realtime Database). One could also add external web-hooks (or interface them with PHP), e.g. to fetch these PDF files over HTTPS (or even generate them on HTTPS request, for externally triggered generation). For local testing, the command firebase serve saves a lot of time compared to firebase deploy.
The point is that one can teach a Cloud Function what the PDF files should look like, when they should be created and where to put them: a micro-service that does nothing but render these files. Writing one such script should still be within acceptable range, given all the clues provided.

Node app that fetches, processes, and formats data for consumption by a frontend app on another server

I currently have a frontend-only app that fetches 5-6 different JSON feeds, grabs some necessary data from each of them, and then renders a page based on said data. I'd like to move the data fetching / processing part of the app to a server-side node application which outputs one simple JSON file which the frontend app can fetch and easily render.
There are two noteworthy complications for this project:
1) The new backend app will have to live on a different server than its frontend counterpart
2) Some of the feeds change fairly often, so I'll need the backend processing to constantly check for changes (every 5-10 seconds). Currently with the frontend-only app, the browser fetches the latest versions of the feeds on load. I'd like to replicate this behavior as closely as possible
My thought process for solving this took me in two directions:
The first is to setup an express application that uses setTimeout to constantly check for new data to process. This data is then sent as a response to a simple GET request:
const express = require('express');

let app = express();
const port = process.env.PORT || 3000; // port was undefined in the original snippet
let processedData = {};

const getData = () => {...} // returns a promise that fetches and processes data

/* use an immediately invoked function with setTimeout to fetch the data
 * when the program starts and then once every 5 seconds after that */
(function refreshData() {
  getData().then((data) => { // note: getData must be called, not just referenced
    processedData = data;
  });
  setTimeout(refreshData, 5000);
})();

app.get('/', (req, res) => {
  res.send(processedData);
});

app.listen(port, () => {
  console.log(`Started on port ${port}`);
});
I would then run a simple get request from the client (after properly adjusting CORS headers) to get the JSON object.
My questions about this approach are pretty generic: Is this even a good solution to this problem? Will this drive up hosting costs based on processing / client GET requests? Is setTimeout a good way to have a task run repeatedly on the server?
The other solution I'm considering would involve setting up an AWS Lambda that writes the resulting JSON to an S3 bucket. It looks like the minimum interval for scheduling an AWS Lambda function is 1 minute, however. I imagine I could set up 3 or 4 identical Lambda functions and offset them by 10-15 seconds; however, that seems so hacky that it makes me physically uncomfortable.
Any suggestions / pointers / solutions would be greatly appreciated. I am not yet a super experienced backend developer, so please ELI5 wherever you deem fit.
A few pointers.
Use cron tasks for periodic processing of data. This is far preferable, especially if you are formatting a lot of data (see the sketch after these pointers).
Don't set up multiple Lambda functions for the same task. It's going to be messy to maintain all those functions.
After fetching and processing the feeds, you can store the JSON file on your own server or in S3. Note that if it's S3, you are paying for and waiting on a network operation. You can read the file from your Express app and just send the response back to your clients.
Depending on the file size and the load on your server, you might want to add a caching layer so that you can cache the response until new JSON data is available.
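As a rough illustration of the cron pointer, here is a minimal sketch using the node-cron package (one possible interpretation of "cron tasks"; node-cron accepts an optional seconds field, which standard crontab does not), with getData() standing in for the fetching/processing code from the question:
// Hypothetical sketch: refresh the processed feed data every 5 seconds
// with node-cron instead of a hand-rolled setTimeout loop.
const cron = require('node-cron');
const express = require('express');

const app = express();
let processedData = {};

const getData = async () => { /* fetch and process the feeds */ return {}; };

// Six-field expression: run at every 5th second.
cron.schedule('*/5 * * * * *', async () => {
  try {
    processedData = await getData();
  } catch (err) {
    console.error('feed refresh failed:', err);
  }
});

app.get('/', (req, res) => res.json(processedData));
app.listen(process.env.PORT || 3000);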

Google Cloud Datastore slow (>800ms) with simple query from Compute Engine

When I try to query the Google Cloud Datastore from a (micro) compute engine, it usually takes >800ms to get a reply. The best I got was 450ms, the worst was >3 seconds.
I was under the impression that latency should be much, much lower (like 20-80ms), so I'm guessing I'm doing something wrong.
This is the (node.js) code I'm using to query (from a simple datastore with just a single entity):
const Datastore = require('@google-cloud/datastore');

const projectId = '<my-project-id>';
const datastoreClient = Datastore({
  projectId: projectId
});

var query = datastoreClient.createQuery('Test').limit(1);

console.time('query');
query.run(function (err, test) {
  if (err) {
    console.log(err);
    return;
  }
  console.timeEnd('query');
});
Not sure if it's relevant, but my app-engine project is in the US-Central region, as is the compute engine I'm running the query from.
UPDATE
After some more testing I found out that the default authentication (token?) that you get when using the Node.js library provided by Google expires after about 4 minutes.
So in other words: if you use the same process, but you wait 4 minutes or more between requests, query times are back to >800ms.
I also tried authenticating using a keyfile, and that seemed to do better: subsequent requests are still faster, but the initial request only takes half the time (>400ms).
The latency that you see on your initial requests to the Datastore is most likely due to caches being warmed up. The Datastore uses a distributed architecture to manage scaling, which allows your queries to scale with the size of your result set. The more often you run the same query, the better prepared the Datastore is to serve it, and the more consistent your result speeds will be.
If you want similar result speeds at low Datastore access rates, it is recommended to configure your own caching layer. Google App Engine provides Memcache, which is optimized for use with the Datastore. Since you are making requests from Compute Engine, you can use third-party solutions such as Redis or Memcached instead.
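To make the caching suggestion concrete, a minimal read-through cache in front of the Datastore query might look like this sketch; the Redis location, cache key and 60-second TTL are placeholders, and it assumes a node-redis v4 client and a Datastore client version where query.run() returns a promise:
// Hypothetical read-through cache: serve the last result from Redis and
// only hit the Datastore when the cached value has expired.
const redis = require('redis');
const Datastore = require('@google-cloud/datastore');

const datastore = Datastore({projectId: '<my-project-id>'});
const cache = redis.createClient({url: 'redis://localhost:6379'}); // placeholder location
const cacheReady = cache.connect(); // connect once at startup

async function getTestEntity() {
  await cacheReady;

  const cached = await cache.get('test-entity');
  if (cached) {
    return JSON.parse(cached);
  }

  const [entities] = await datastore.createQuery('Test').limit(1).run();
  await cache.setEx('test-entity', 60, JSON.stringify(entities)); // cache for 60 seconds
  return entities;
}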
