How to keep a Firebase database in sync with BigQuery? - node.js

We are working on a project that involves a lot of data, and we recently read about Google BigQuery. But how can we export our data to this platform? We have seen the sample that imports logs into Google BigQuery, but it only covers inserting data, not updating or deleting it.
Our objects can update their data, and we only have a limited number of queries available on the BigQuery tables. How can we synchronize our data without exceeding the BigQuery quota limits?
Our current function code:
'use strict';

// Default imports.
const functions = require('firebase-functions');
const bigQuery = require('@google-cloud/bigquery')();

// If you want to change the nodes to listen to, REMEMBER TO change the constants below.
// The 'id' field is AUTOMATICALLY added to the values, so you CANNOT add it.
const ROOT_NODE = 'categories';
const VALUES = [
    'name'
];

// This function listens to the supplied root node.
// When the root node is completely empty, all of the Google BigQuery rows will be removed.
// This function should only activate when the root node is deleted.
exports.root = functions.database.ref(ROOT_NODE).onWrite(event => {
    if (event.data.exists()) {
        return;
    }
    return bigQuery.query({
        query: [
            'DELETE FROM `stampwallet.' + ROOT_NODE + '`',
            'WHERE true'
        ].join(' '),
        params: []
    });
});

// This function listens to the supplied root node, but on child added/removed/changed.
// When an object is inserted/deleted/updated the appropriate action will be taken.
exports.children = functions.database.ref(ROOT_NODE + '/{id}').onWrite(event => {
    const id = event.params.id;
    if (!event.data.exists()) {
        return bigQuery.query({
            query: [
                'DELETE FROM `stampwallet.' + ROOT_NODE + '`',
                'WHERE id = ?'
            ].join(' '),
            params: [
                id
            ]
        });
    }
    const item = event.data.val();
    if (event.data.previous.exists()) {
        let update = [];
        for (let index = 0; index < VALUES.length; index++) {
            const value = VALUES[index];
            update.push(item[value]);
        }
        update.push(id);
        return bigQuery.query({
            query: [
                'UPDATE `stampwallet.' + ROOT_NODE + '`',
                'SET ' + VALUES.join(' = ?, ') + ' = ?',
                'WHERE id = ?'
            ].join(' '),
            params: update
        });
    }
    let template = [];
    for (let index = 0; index < VALUES.length; index++) {
        template.push('?');
    }
    let create = [];
    create.push(id);
    for (let index = 0; index < VALUES.length; index++) {
        const value = VALUES[index];
        create.push(item[value]);
    }
    return bigQuery.query({
        query: [
            'INSERT INTO `stampwallet.' + ROOT_NODE + '` (id, ' + VALUES.join(', ') + ')',
            'VALUES (?, ' + template.join(', ') + ')'
        ].join(' '),
        params: create
    });
});
What would be the best way to sync Firebase to BigQuery?

BigQuery supports UPDATE and DELETE, but not frequent ones - BigQuery is an analytical database, not a transactional one.
To synchronize a transactional database with BigQuery you can use approaches like:
• Export a daily dump and import it into BigQuery.
• Treat updates and deletes as new events, and keep appending events to your BigQuery event log.
• Use a tool like https://github.com/MemedDev/mysql-to-google-bigquery.
• Follow approaches like "BigQuery at WePay part III: Automating MySQL exports every 15 minutes with Airflow, and dealing with updates".
With Firebase you could schedule a daily load to BigQuery from their daily backups:
https://firebase.googleblog.com/2016/10/announcing-automated-daily-backups-for-the-firebase-database.html
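For the daily-backup route, here is a minimal, hedged sketch of a load job (the bucket, dataset, table and file names are placeholders; it also assumes the backup has already been flattened into newline-delimited JSON, since the raw Firebase backup is a single JSON tree, and that current client libraries are used):

// Sketch: load a (pre-flattened) daily Firebase backup from Cloud Storage into BigQuery.
// Bucket/dataset/table/file names are hypothetical.
const { BigQuery } = require('@google-cloud/bigquery');
const { Storage } = require('@google-cloud/storage');
const bigquery = new BigQuery();
const storage = new Storage();

async function loadDailyBackup() {
    const file = storage.bucket('my-firebase-backups').file('categories-backup.ndjson');
    const [job] = await bigquery
        .dataset('stampwallet')
        .table('categories')
        .load(file, {
            sourceFormat: 'NEWLINE_DELIMITED_JSON',
            // Replace the previous day's contents instead of appending to them.
            writeDisposition: 'WRITE_TRUNCATE'
        });
    console.log('Load job ' + job.id + ' completed');
}

Triggering this once a day (for example from a scheduled job) keeps the BigQuery copy fresh without spending any UPDATE/DELETE quota.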

... way to sync firebase to bigquery?
I recommend considering streaming all your data into BigQuery as historical data. You can mark entries as new (insert), update, or delete. Then, on the BigQuery side, you can write a query that resolves the most recent values for a specific record based on whatever logic you have.
So your code can be reused almost 100% - just change the UPDATE/DELETE logic so it performs an INSERT instead (see the sketch below the quoted comment).
// When an object is inserted/deleted/updated the appropriate action will be taken.
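A minimal sketch of that idea, adapted from the children function in the question (the _events table name, the operation and timestamp columns, and the use of streaming inserts via table.insert() are assumptions, not part of the original code):

// Sketch: append every change as an event row instead of running UPDATE/DELETE DML.
// The events table and its operation/timestamp columns are hypothetical.
exports.children = functions.database.ref(ROOT_NODE + '/{id}').onWrite(event => {
    const id = event.params.id;
    const operation = !event.data.exists() ? 'delete'
        : event.data.previous.exists() ? 'update'
        : 'insert';
    const row = {
        id: id,
        operation: operation,
        timestamp: new Date().toISOString()
    };
    if (event.data.exists()) {
        const item = event.data.val();
        VALUES.forEach(value => row[value] = item[value]);
    }
    // Streaming inserts are cheap and are not subject to the daily DML quotas.
    return bigQuery
        .dataset('stampwallet')
        .table(ROOT_NODE + '_events')
        .insert(row);
});

On the BigQuery side, the latest state per id can then be resolved with something like ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC), filtering out ids whose most recent event is a delete.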
So our objects are able to update their data. And we have a limited amount of queries on the BigQuery tables. How can we synchronize our data without exceeding the BigQuery quota limits?
Yes, BigQuery supports UPDATE, DELETE and INSERT as part of its Data Manipulation Language (DML).
General availability in BigQuery Standard SQL was announced on March 8, 2017.
Before considering using this feature for syncing BigQuery with transactional data – please take a look at Quotas, Pricing and Known Issues.
Below are some excerpts!
Quotas (excerpts)
DML statements are significantly more expensive to process than SELECT statements.
• Maximum UPDATE/DELETE statements per day per table: 96
• Maximum UPDATE/DELETE statements per day per project: 1,000
Pricing (excerpts, extra highlighting + comment added)
BigQuery charges for DML queries based on the number of bytes processed by the query.
The number of bytes processed is calculated as follows:
UPDATE Bytes processed = sum of bytes in referenced fields in the scanned tables + the sum of bytes for all fields in the updated table at the time the UPDATE starts.
DELETE Bytes processed = sum of bytes of referenced fields in the scanned tables + sum of bytes for all fields in the modified table at the time the DELETE starts.
Comment by the answer's author: As you can see, you will be charged for a whole-table scan even though you update just one row! This is key for the decision, I think!
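As a rough illustration of what that implies (assuming on-demand pricing of about $5 per TB processed, which may differ for your project): if the stampwallet.categories table held 10 GB and you issued the per-table daily maximum of 96 UPDATE statements, you would process on the order of 960 GB, i.e. roughly $4-5 for that single day and table, no matter how few rows each statement actually changed.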
Known Issues (excerpts)
• DML statements cannot be used to modify tables with REQUIRED fields in their schema.
• Each DML statement initiates an implicit transaction, which means that changes made by the statement are automatically committed at the end of each successful DML statement. There is no support for multi-statement transactions.
• The following combinations of DML statements are allowed to run concurrently on a table:
UPDATE and INSERT
DELETE and INSERT
INSERT and INSERT
Otherwise one of the DML statements will be aborted.
For example, if two UPDATE statements execute simultaneously against the table then only one of them will succeed.
• Tables that have been written to recently via BigQuery Streaming (tabledata.insertall) cannot be modified using UPDATE or DELETE statements. To check if the table has a streaming buffer, check the tables.get response for a section named streamingBuffer. If it is absent, the table can be modified using UPDATE or DELETE statements.
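To check for that streaming buffer from Node.js before attempting DML, here is a small hedged sketch (the dataset and table names are placeholders):

// Sketch: inspect the tables.get metadata for a streamingBuffer section.
// Dataset/table names are hypothetical.
function canRunDml(bigQuery) {
    return bigQuery
        .dataset('stampwallet')
        .table('categories')
        .getMetadata()
        .then(([metadata]) => {
            // If streamingBuffer is present, UPDATE/DELETE against this table will be rejected.
            return metadata.streamingBuffer === undefined;
        });
}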

The reason you didn't find update and delete functions in the BigQuery samples is that frequent row-level modification is not what BigQuery is built for. Traditionally BigQuery offered only append and truncate operations; to update or delete a row you would have to rewrite the whole table with the modified row, or without it, which is not a good idea.
BigQuery is meant to store big amounts of data and give you quick access to it - for example, it is good for collecting data from different sensors. But for your customer database you should use a transactional database such as MySQL or a NoSQL database.

Related

Node.js and Oracle DB select query getting empty array in rows

const result = await connection.execute(
    `SELECT * from no_example`, [], { maxRows: 1000 } // bind value for :id
);
but in the result I always get empty rows.
If you are inserting rows in another tool or another program, make sure that you COMMIT the data:
SQL> create table t (c number);
Table created.
SQL> insert into t (c) values (1);
1 row created.
SQL> commit;
Commit complete.
If you are inserting using Node.js, look at the autoCommit attribute and the connection.commit() function. Also see the node-oracledb documentation on Transaction Management.
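A minimal hedged sketch of that in node-oracledb (the connection details are placeholders; the table matches the SQL*Plus example above):

// Sketch: insert a row and commit it so other sessions and queries can see it.
// Connection details are hypothetical.
const oracledb = require('oracledb');

async function insertAndCommit() {
    const connection = await oracledb.getConnection({
        user: 'scott',
        password: 'tiger',
        connectString: 'localhost/XEPDB1'
    });
    try {
        // autoCommit: true commits this statement immediately;
        // alternatively, call connection.commit() explicitly afterwards.
        await connection.execute(
            `INSERT INTO t (c) VALUES (:c)`,
            { c: 1 },
            { autoCommit: true }
        );
    } finally {
        await connection.close();
    }
}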
Unrelated to your problem, but you almost certainly shouldn't be using maxRows. By default node-oracledb will return all rows. If you don't want all, then add some kind of WHERE clause or row-limiting clause to your query. If you expect a big number of rows, then use a result set so you can access consecutive batches of rows.

Cosmos Db Sql Query produces drastically different results when using order by

I have a Cosmos DB instance with more than 1 million JSON documents stored in it.
I am trying to pull data for a certain time frame, based on when the document was created, using the _ts property that is auto-generated when the document is inserted; it represents the UNIX timestamp of that moment.
I cannot understand why these two queries produce drastically different results:
Query 1:
Select *
from c
where c._ts > TimeStamp1
AND c._ts < TimeStamp2
Produces 0 results
Query 2:
Select *
from c
where c._ts > TimeStamp1
AND c._ts < TimeStamp2
order by c._ts desc
Produces the correct number of results.
What I have tried:
I suspected that it might be because of the default Cosmos DB index on the data, so I rewrote the index policy to index only that property. Still the same problem.
Since my end goal is to group the returned data, I also tried to use GROUP BY with ORDER BY, alone or in a subquery. Surprisingly, according to the docs, Cosmos DB does not yet support using GROUP BY together with ORDER BY.
What I need help with:
Why am I observing such behavior?
Is there a way to index the DB in such a way that the rows are returned?
Beyond this, is there a way to still use group by and order by together (please don't link this question to another one because of this point; I have gone through them and their answers are not valid in my case)?
@Andy and @Tiny-wa, thanks for replying.
I was able to track down the unintended behavior: it was showing up because of the GetCurrentTimestamp() used to calculate the timestamps. The documentation states that
This system function will not utilize the index. If you need to
compare values to the current time, obtain the current time before
query execution and use that constant string value in the WHERE
clause.
Although I don't fully understand why, I was able to solve this by creating a stored procedure in which the timestamp is fetched before the SQL API query is formed and executed, and with that I got the rows as expected.
Stored Procedure Pseudocode for that is like:
function FetchData() {
    ..
    ..
    ..
    var Current_TimeStamp = Date.now();
    var CDbQuery =
        `Select *
         FROM c
         where (c._ts * 10000000) > DateTimeToTicks(DateTimeAdd("day", -1, TicksToDateTime(` + Current_TimeStamp + ` * 10000)))
         AND (c._ts * 10000000) < (` + Current_TimeStamp + ` * 10000)`;
    var isAccepted = collection.queryDocuments(
        collection.getSelfLink(),
        CDbQuery,
        function (err, feed, options) {
            ..
            ..
            ..
        });
}
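The same advice can also be applied without a stored procedure, by computing the timestamp in the application and binding it as a query parameter; a hedged sketch, assuming the @azure/cosmos JavaScript SDK and an existing container reference:

// Sketch: compute the time window up front and bind it as parameters,
// so the query only compares c._ts against constants (index-friendly).
// The container reference and the one-day window are hypothetical.
async function fetchLastDay(container) {
    const nowInSeconds = Math.floor(Date.now() / 1000); // _ts is in seconds
    const oneDayAgo = nowInSeconds - 24 * 60 * 60;
    const querySpec = {
        query: 'SELECT * FROM c WHERE c._ts > @from AND c._ts < @to',
        parameters: [
            { name: '@from', value: oneDayAgo },
            { name: '@to', value: nowInSeconds }
        ]
    };
    const { resources } = await container.items.query(querySpec).fetchAll();
    return resources;
}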

Getting AutoQuery pagination to work with left join

In my AutoQuery request I have a left join specified so I can query on properties in the joined table.
public class ProductSearchRequest : QueryDb<Book>
, ILeftJoin<Book, BookAuthor>, ILeftJoin<BookAuthor, Author>
{}
If I use the standard way of AutoQuery, like so:
var q = AutoQuery.CreateQuery(request, base.Request);
var results = AutoQuery.Execute(request, q);
If 100 rows are requested, then often fewer than 100 will be returned, because the Take() is applied to the results of the left join.
To remedy this I am doing this instead:
var q = AutoQuery.CreateQuery(request, base.Request);
q.OrderByExpression = null; // throws an error if an order by exists
var total = Db.Scalar<int>(q.Select(x => Sql.CountDistinct(x.Id))); //returns 0
var q1 = AutoQuery.CreateQuery(request, base.Request).GroupBy(x => x);
var results = Db.Select<Book>(q1);
return new QueryResponse<Book>
{
    Offset = q1.Offset.GetValueOrDefault(0),
    Total = total,
    Results = results
};
The group by appears to return correct number of results so paging works but the Total returns 0.
I also tried:
var total2 = (int)Db.Count(q1);
But even though q1 has a GroupBy(), it returns the number of rows produced by the left join, not the count for the actual query.
How can I get the true total of the query?
(Getting some official docs on how to do paging and totals with autoquery & left join would be very helpful as right now it's a bit confusing)
Your primary issue stems from trying to return a different total than the one for the actual query AutoQuery executes. If you have multiple left joins, the total is the total number of results of the query it executes, not the number of rows in your source table.
So you're not looking for the "true total"; rather, you're looking to execute a different query to get a different total than the query that's executed, while still deriving from the original query as its basis. First consider using normal INNER JOINs (IJoin<>) instead of LEFT JOINs, so the query only returns results that have related rows in the joined tables, which the total will then reflect accordingly.
Your total query that returns 0 is likely returning no results, so I'd inspect the query in an SQL profiler so you can see exactly what is executed. You can also enable logging of OrmLite queries with Debug logging enabled and, in your AppHost:
OrmLiteUtils.PrintSql();
Also note that GroupBy() over the entire table is unusual; you would normally group by one or more explicitly selected columns, e.g.:
.GroupBy(x => x.Id);
.GroupBy(x => new { x.Id, x.Name });

How to query count for each column in DynamoDB

I have a DynamoDB table with 50 different columns labeled question1 - question50. Each of these columns has either a, b, c, or d as the answer to a multiple-choice question. What is the most efficient way of getting the count of how many people answered 'a' for question1?
I'm trying to return the count of a, b, c, and d for ALL questions, so I want to see how many answered a for question1, how many answered b for question1, and so on. In the end I should have a count for each question and each answer.
Currently I have this, but I don't feel like it's efficient to type everything out. Is there a simplified way of doing this?
exports.handler = async function(event, ctx, callback) {
    const params = {
        ScanFilter: {
            'question1': {
                ComparisonOperator: 'EQ',
                AttributeValueList: [
                    { S: 'a' }
                ]
            }
        },
        TableName: 'app',
        Select: 'COUNT'
    };
    try {
        const data = await dynamoDb.scan(params).promise();
        console.log(data);
    }
    catch (err) {
        console.log(err);
    }
};
You have missed mentioning two things: is this a one-time operation for you or do you need to do it regularly, and how many records do you have?
If this is a one-time operation:
Since you have 50 questions and 4 options for each (200 combinations), and assuming you have a lot of data, the easiest solution is to export the entire table to a CSV and build a pivot table there. This is easier than scanning the entire table and doing aggregation operations in memory. Or you can export the table to S3 as JSON and use Athena to run queries on the data.
If you need to do this regularly, you can do one of the following:
Save your aggregate counts as a GSI in the same table, in a new table, or somewhere else entirely. Enable streams, send them to a Lambda function, and increment these counts as the new data comes in.
Use Elasticsearch - enable streams on your DynamoDB table and have a Lambda function send them to an Elasticsearch index. Index the current data as well, and then run aggregate queries on that index.
RDBMSs aggregate quite easily... DDB, not so much.
The usual answer with DDB is to enable streams and have a Lambda attached to the stream that calculates the needed aggregations and stores them in a separate record in DDB.
Read through the Using Global Secondary Indexes for Materialized Aggregation Queries section of the docs.
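A hedged sketch of the stream-and-Lambda approach described in these answers (the aggregate table name, its key layout, and the counter attribute are all assumptions):

// Sketch: DynamoDB Streams handler that increments per-question/answer counters.
// The aggregate table, its key and the counter attribute are hypothetical.
const AWS = require('aws-sdk');
const dynamoDb = new AWS.DynamoDB.DocumentClient();

exports.handler = async function(event) {
    for (const record of event.Records) {
        if (record.eventName !== 'INSERT') continue; // only count newly added survey items here
        const item = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.NewImage);
        for (let i = 1; i <= 50; i++) {
            const answer = item['question' + i];
            if (!answer) continue;
            // One counter item per (question, answer) pair, e.g. pk = 'question1#a'.
            await dynamoDb.update({
                TableName: 'app_aggregates',
                Key: { pk: 'question' + i + '#' + answer },
                UpdateExpression: 'ADD #count :one',
                ExpressionAttributeNames: { '#count': 'count' },
                ExpressionAttributeValues: { ':one': 1 }
            }).promise();
        }
    }
};

Reading the counts then becomes a handful of GetItem/Query calls against the aggregate table instead of a full scan.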

Pass column name as argument - Postgres and Node JS

I have a query (Update statement) wrapped in a function and will need to perform the same statement on multiple columns during the course of my script
async function update_percentage_value(value, id) {
    (async () => {
        const client = await pool.connect();
        try {
            const res = await client.query('UPDATE fixtures SET column_1_percentage = ($1) WHERE id = ($2) RETURNING *', [value, id]);
        } finally {
            client.release();
        }
    })().catch(e => console.log(e.stack));
}
I then call this function
update_percentage_value(50, 2);
I have many columns to update at various points of my script, each at the appropriate time. I would like to be able to just call the one function, passing the column name, value and id.
My table looks like below
CREATE TABLE fixtures (
ID SERIAL PRIMARY KEY,
home_team VARCHAR,
away_team VARCHAR,
column_1_percentage INTEGER,
column_2_percentage INTEGER,
column_3_percentage INTEGER,
column_4_percentage INTEGER
);
Is it at all possible to do this?
I'm going to post the solution that was advised by Sehrope Sarkuni via the node-postgres GitHub repo. This helped me a lot and works for what I require:
No - column names are identifiers, and they can't be specified as parameters; they have to be included in the text of the SQL command.
It is possible but you have to build the SQL text with the column names. If you're going to dynamically build SQL you should make sure to escape the components using something like pg-format or use an ORM that handles this type of thing.
So something like:
const format = require('pg-format');

async function updateFixtures(id, column, value) {
    const sql = format('UPDATE fixtures SET %I = $1 WHERE id = $2', column);
    await pool.query(sql, [value, id]);
}
Also if you're doing multiple updates to the same row back-to-back then you're likely better off with a single UPDATE statement that modifies all the columns rather than separate statements as they'd be both slower and generate more WAL on the server.
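For example, the earlier call from the question would then become something like (the column name is taken from the question's schema):

// Update column_1_percentage of the fixture with id 2 to 50.
await updateFixtures(2, 'column_1_percentage', 50);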
To get the column names of the table, you can query the information_schema.columns view, which stores the details of the column structure of your table; this can help you frame a dynamic query that updates a specific column based on a specific result.
You can get the column names of the table with the help of following query:
select column_name from information_schema.columns where table_name='fixtures' and table_schema='public';
The above query would give you the list of columns in the table.
Now, to update each one for a specific purpose, you can store the resulting column names in a variable and pass them to the function that performs the required action.
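A hedged sketch of combining that query with the updateFixtures() function from the previous answer (pool is assumed to be an existing pg Pool, and the LIKE filter is an assumption about which columns you want):

// Sketch: fetch the percentage column names dynamically, then update each one.
// Assumes the pool and updateFixtures() shown above.
async function updateAllPercentages(id, value) {
    const { rows } = await pool.query(
        `SELECT column_name
           FROM information_schema.columns
          WHERE table_name = 'fixtures'
            AND table_schema = 'public'
            AND column_name LIKE '%_percentage'`
    );
    for (const { column_name } of rows) {
        await updateFixtures(id, column_name, value);
    }
}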
