I am shifting my database from MongoDB to DynamoDB. I have a problem with the delete function on a table where labName is the partition key, serialNumber is the sort key, and each item also has an id called feedId. I want to delete all the records from the table where labName is a given value and feedId is NOT IN (an array of ids).
I am doing it in Mongo with the code below.
Is there a way with BatchWriteItem where I can add a condition on feedId without supplying the sort key?
let dbHandle = await getMongoDbHandle(dbName);
let query = {
    feedid: {$nin: feedObjectIds}
};
let output = await dbModule.removePromisify(dbHandle,
    dbModule.collectionNames.feeds, query);
While working with DynamoDB, you can perform a conditional retrieval (GET) or deletion (DELETE) of records if and only if you provide all of the attributes of the primary key. For example:
• For a simple primary key, you only need to provide a value for the partition key.
• For a composite primary key, you must provide values for both the partition key and the sort key.
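So there is no way to express a NOT IN condition on feedId inside BatchWriteItem itself; the usual workaround is to Query the labName partition first (with a filter on feedId) and then delete the matches by their full key. Below is a minimal boto3 sketch of that pattern, matching the key schema from the question; the table name 'Feeds' and the pagination details are assumptions, not your actual setup:

import boto3
from boto3.dynamodb.conditions import Key, Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Feeds')  # hypothetical table name

def delete_feeds_not_in(lab_name, keep_feed_ids):
    # 1. Query the labName partition; the FilterExpression drops items whose
    #    feedId is in keep_feed_ids (the IN operator accepts at most 100
    #    values, so chunk the list if it can be longer).
    kwargs = {
        'KeyConditionExpression': Key('labName').eq(lab_name),
        'FilterExpression': ~Attr('feedId').is_in(keep_feed_ids),
        'ProjectionExpression': 'labName, serialNumber',
    }
    items = []
    while True:
        response = table.query(**kwargs)
        items.extend(response['Items'])
        if 'LastEvaluatedKey' not in response:
            break
        kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']

    # 2. Delete each matched item by its full primary key; batch_writer
    #    wraps BatchWriteItem and handles batching for you.
    with table.batch_writer() as batch:
        for item in items:
            batch.delete_item(Key={
                'labName': item['labName'],
                'serialNumber': item['serialNumber'],
            })

Note that the filter is applied after the items are read, so the query still consumes read capacity for everything under that labName; only the matched items are then deleted.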
I have one table, TestTable, with partition key TestColumn.
Input dates:
from_date = "2017-04-20T16:31:54.451071+00:00"
to_date = "2018-04-20T16:31:54.451071+00:00"
When I query with an equality condition on the date, it works:
key_expr = Key('TestColumn').eq(to_date)
query_resp = table.query(KeyConditionExpression=key_expr)
But when I use a between query, it does not work:
key_expr = Key('TestColumn').between(from_date, to_date)
query_resp = table.query(KeyConditionExpression=key_expr)
Error:
Unknown err_msg while querying dynamodb: An error occurred (ValidationException) when calling the Query operation: Query key condition not supported
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html
DynamoDB Query will return data from one and only one partition, meaning you have to supply a single partition key in the request.
KeyConditionExpression
The condition that specifies the key value(s) for items to be retrieved by the Query action.
The condition must perform an equality test on a single partition key value.
You can optionally use a BETWEEN operator on a sort key (but you still have to supply a single partition key).
If you use a Scan, you can use a FilterExpression and apply the BETWEEN operator to TestColumn.
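For example, a short boto3 sketch of both options, assuming the same TestTable from the question (the sort-key variant at the end is hypothetical, since the question's table has TestColumn as its partition key):

import boto3
from boto3.dynamodb.conditions import Key, Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('TestTable')

from_date = "2017-04-20T16:31:54.451071+00:00"
to_date = "2018-04-20T16:31:54.451071+00:00"

# Option 1: Scan with a FilterExpression. BETWEEN is allowed here on any
# attribute, but the scan still reads the whole table and filters afterwards.
# (Paginate with LastEvaluatedKey for large tables.)
resp = table.scan(
    FilterExpression=Attr('TestColumn').between(from_date, to_date)
)
items = resp['Items']

# Option 2 (hypothetical key schema): if the date were the *sort* key,
# BETWEEN would be valid inside the key condition, combined with an
# equality test on the partition key:
# resp = table.query(
#     KeyConditionExpression=Key('pk').eq('some-partition')
#         & Key('TestColumn').between(from_date, to_date)
# )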
I am new to Cosmos DB and facing issues querying a collection. I have a partitioned collection with 100,000 RU/s (unlimited storage capacity). The partition key is '/BId', which is a GUID. I am querying the collection by a partition key value that has 10,000 records (the collection has more than 28,942,445 documents across all partitions). I am using the following query to get the documents, but it takes around 50 seconds to execute, which is not feasible.
object partitionkey = new object();
partitionkey = "2359c59a-f730-40df-865c-d4e161189f5b";

// Now execute the same query via direct SQL
var DistinctBColumn = this.client.CreateDocumentQuery<BColumn>(
    BordereauxColumnCollection.SelfLink,
    "SELECT * FROM BColumn_UL c WHERE c.BId = '2359c59a-f730-40df-865c-d4e161189f5b'",
    new FeedOptions { EnableCrossPartitionQuery = true, PartitionKey = new PartitionKey("2359c59a-f730-40df-865c-d4e161189f5b") },
    partitionkey).ToList();
I also tried other querying options, which likewise took around 50 seconds.
But the same query takes less than a second in the Azure portal.
Kindly help me optimize the query and correct me if I am wrong. Many thanks.
When we fetch data from Document Db via the change feed, we only want it per partition, and we have tried adding PartitionKey to the code.
do
{
    FeedResponse<PartitionKeyRange> pkRangesResponse = await client.ReadPartitionKeyRangeFeedAsync(
        collectionUri,
        new FeedOptions
        {
            RequestContinuation = pkRangesResponseContinuation,
            PartitionKey = new PartitionKey("KEY"),
        });

    partitionKeyRanges.AddRange(pkRangesResponse);
    pkRangesResponseContinuation = pkRangesResponse.ResponseContinuation;
}
while (pkRangesResponseContinuation != null);
It returns a single range, and when we perform the second query
IDocumentQuery<Document> query = client.CreateDocumentChangeFeedQuery(
    collectionUri,
    new ChangeFeedOptions
    {
        PartitionKeyRangeId = pkRange.Id,
        StartFromBeginning = true,
        RequestContinuation = continuation,
        MaxItemCount = -1,
    });
It returns all the results from all partitions. Is there a way to restrict the results from single partition only?
Changefeed works at a PartitionKey Range level.
What are partition key ranges?
Document Db currently uses 10 GB physical partitions.
The partition key that you specify is the logical partition key.
Document Db internally maps this logical partition key to a physical partition using a hash.
So it's possible that a bunch of logical partitions share the same physical partition; a physical partition is assigned a range of these hashes.
The minimum grain that you are allowed to read from the changefeed is a partition key range.
So you would have to query the partition key range id for the partition you are interested in, then query the changefeed for that range id and filter out the data that is not associated with your partition key.
Note: Document Db transparently creates new physical partitions if a particular partition gets full. So the partition key range id for a given logical partition could change over time.
This link explains this in good detail:
https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data#partitioning-in-azure-cosmos-db
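Roughly, the pattern looks like this in the azure-cosmos Python SDK (a hedged sketch only: the account, database, container and partition key property names are placeholders, and the change feed keyword arguments may differ between SDK versions):

from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("mycoll")

target_pk = "KEY"   # the logical partition key you care about
pk_range_id = "0"   # range id for that key, obtained as in the question's first snippet

# Read the change feed for one partition key range. The range may host
# several logical partitions that hash to the same physical partition.
changes = container.query_items_change_feed(
    partition_key_range_id=pk_range_id,
    is_start_from_beginning=True,
)

# Filter out documents belonging to other logical partitions in the same range.
for doc in changes:
    if doc.get("pk") == target_pk:   # "pk" stands in for your partition key property
        print(doc["id"])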
We are working on a project where a lot of data is involved. We recently read about Google BigQuery, but how can we export our data to this platform? We have seen the sample for importing logs into Google BigQuery, but it does not cover updating and deleting data (only inserting).
Our objects are able to update their data, and we have a limited number of queries on the BigQuery tables. How can we synchronize our data without exceeding the BigQuery quota limits?
Our current function code:
'use strict';
// Default imports.
const functions = require('firebase-functions');
const bigQuery = require('@google-cloud/bigquery')();
// If you want to change the nodes to listen to REMEMBER TO change the constants below.
// The 'id' field is AUTOMATICALLY added to the values, so you CANNOT add it.
const ROOT_NODE = 'categories';
const VALUES = [
'name'
];
// This function listens to the supplied root node.
// When the root node is completed empty all of the Google BigQuery rows will be removed.
// This function should only activate when the root node is deleted.
exports.root = functions.database.ref(ROOT_NODE).onWrite(event => {
    if (event.data.exists()) {
        return;
    }
    return bigQuery.query({
        query: [
            'DELETE FROM `stampwallet.' + ROOT_NODE + '`',
            'WHERE true'
        ].join(' '),
        params: []
    });
});
// This function listens to the supplied root node, but on child added/removed/changed.
// When an object is inserted/deleted/updated the appropriate action will be taken.
exports.children = functions.database.ref(ROOT_NODE + '/{id}').onWrite(event => {
    const id = event.params.id;
    if (!event.data.exists()) {
        return bigQuery.query({
            query: [
                'DELETE FROM `stampwallet.' + ROOT_NODE + '`',
                'WHERE id = ?'
            ].join(' '),
            params: [
                id
            ]
        });
    }
    const item = event.data.val();
    if (event.data.previous.exists()) {
        let update = [];
        for (let index = 0; index < VALUES.length; index++) {
            const value = VALUES[index];
            update.push(item[value]);
        }
        update.push(id);
        return bigQuery.query({
            query: [
                'UPDATE `stampwallet.' + ROOT_NODE + '`',
                'SET ' + VALUES.join(' = ?, ') + ' = ?',
                'WHERE id = ?'
            ].join(' '),
            params: update
        });
    }
    let template = [];
    for (let index = 0; index < VALUES.length; index++) {
        template.push('?');
    }
    let create = [];
    create.push(id);
    for (let index = 0; index < VALUES.length; index++) {
        const value = VALUES[index];
        create.push(item[value]);
    }
    return bigQuery.query({
        query: [
            'INSERT INTO `stampwallet.' + ROOT_NODE + '` (id, ' + VALUES.join(', ') + ')',
            'VALUES (?, ' + template.join(', ') + ')'
        ].join(' '),
        params: create
    });
});
What would be the best way to sync Firebase to BigQuery?
BigQuery supports UPDATE and DELETE, but not frequent ones - BigQuery is an analytical database, not a transactional one.
To synchronize a transactional database with BigQuery you can use approaches like:
• Export a daily dump and import it into BigQuery.
• Treat updates and deletes as new events, and keep appending events to your BigQuery event log.
• Use a tool like https://github.com/MemedDev/mysql-to-google-bigquery.
• Follow an approach like "BigQuery at WePay part III: Automating MySQL exports every 15 minutes with Airflow, and dealing with updates".
With Firebase you could schedule a daily load to BigQuery from their daily backups:
https://firebase.googleblog.com/2016/10/announcing-automated-daily-backups-for-the-firebase-database.html
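As a rough illustration of the daily-load route (a sketch only: the bucket, file and table names are made up, and you would typically transform the Firebase backup into newline-delimited JSON before loading):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # replace yesterday's snapshot
    autodetect=True,
)

# Load the (pre-processed) daily export from Cloud Storage into BigQuery.
load_job = client.load_table_from_uri(
    "gs://my-backup-bucket/categories/latest.ndjson",
    "stampwallet.categories",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish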
... way to sync firebase to bigquery?
I recommend considering streaming all your data into BigQuery as historical data. You can mark entries as new (insert), update or delete. Then, on the BigQuery side, you can write a query that resolves the most recent value for a specific record based on whatever logic you have.
So your code can be reused almost 100% - just change the UPDATE/DELETE logic so that it performs an INSERT instead.
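For example, if each change is appended as a row carrying the record id, a changed_at timestamp and an op flag ('insert' / 'update' / 'delete') - all assumed column names, and `stampwallet.categories_events` is a hypothetical events table - the latest state can be resolved on the BigQuery side with a window function (sketched in Python, but the SQL is the interesting part):

from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT * EXCEPT(rn)
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY changed_at DESC) AS rn
      FROM `stampwallet.categories_events`
    )
    WHERE rn = 1          -- keep only the most recent event per id
      AND op != 'delete'  -- drop records whose latest event is a delete
"""

for row in client.query(sql).result():
    print(row["id"], row["name"])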
// When an object is inserted/deleted/updated the appropriate action will be taken.
So our objects are able to update their data. And we have a limited amount of queries on the BigQuery tables. How can we synchronize our data without exceeding the BigQuery quota limits?
Yes, BigQuery supports UPDATE, DELETE and INSERT as part of its Data Manipulation Language (DML).
General availability in BigQuery Standard SQL was announced on March 8, 2017.
Before using this feature for syncing BigQuery with transactional data, please take a look at Quotas, Pricing and Known Issues.
Below are some excerpts!
Quotas (excerpts)
DML statements are significantly more expensive to process than SELECT statements.
• Maximum UPDATE/DELETE statements per day per table: 96
• Maximum UPDATE/DELETE statements per day per project: 1,000
Pricing (excerpts, extra highlighting + comment added)
BigQuery charges for DML queries based on the number of bytes processed by the query.
The number of bytes processed is calculated as follows:
UPDATE Bytes processed = sum of bytes in referenced fields in the scanned tables + the sum of bytes for all fields in the updated table at the time the UPDATE starts.
DELETE Bytes processed = sum of bytes of referenced fields in the scanned tables + sum of bytes for all fields in the modified table at the time the DELETE starts.
Comment by post author: As you can see, you will be charged for a whole table scan even though you update just one row! That is the key point for decision making, I think!
Known Issues (excerpts)
• DML statements cannot be used to modify tables with REQUIRED fields in their schema.
• Each DML statement initiates an implicit transaction, which means that changes made by the statement are automatically committed at the end of each successful DML statement. There is no support for multi-statement transactions.
• The following combinations of DML statements are allowed to run concurrently on a table:
UPDATE and INSERT
DELETE and INSERT
INSERT and INSERT
Otherwise one of the DML statements will be aborted.
For example, if two UPDATE statements execute simultaneously against the table then only one of them will succeed.
• Tables that have been written to recently via BigQuery Streaming (tabledata.insertall) cannot be modified using UPDATE or DELETE statements. To check if the table has a streaming buffer, check the tables.get response for a section named streamingBuffer. If it is absent, the table can be modified using UPDATE or DELETE statements.
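The streaming-buffer check from the last bullet can be done with the Python client like this (a sketch; the table name is assumed):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("stampwallet.categories")  # calls tables.get under the hood

if table.streaming_buffer is None:
    print("No streaming buffer - safe to run UPDATE/DELETE")
else:
    print("Rows are still in the streaming buffer - DML would fail for now")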
The reason you didn't find update and delete functions in BigQuery is that they are not supported by BigQuery. BigQuery has only append and truncate operations. If you want to update or delete a row in BigQuery, you'll need to delete the whole database and write it again with the modified row, or without it. That is not a good idea.
BigQuery is used to store big amounts of data and to have quick access to it; for example, it is good for collecting data from different sensors. But for your customer database you need to use MySQL or a NoSQL database.