Insert bulk data into BigQuery without keeping it in the streaming buffer - node.js

My goal is as follows:
Insert bulk records into BigQuery every half hour
Delete the record if it already exists
The records are transactions whose status changes between pending, success, fail and expire.
BigQuery does not allow me to delete rows that were inserted just half an hour ago, as they are still in the streaming buffer.
Can anyone suggest a workaround? I am getting duplicate rows in my table.

A better course of action would be to:
Perform periodic loads into a staging table (loading is a free operation)
After the load completes, execute a MERGE statement.
You would want something like this:
MERGE dataset.TransactionTable dt
USING dataset.StagingTransactionTable st
ON dt.tx_id = st.tx_id
WHEN MATCHED THEN
UPDATE SET status = st.status
WHEN NOT MATCHED THEN
INSERT (tx_id, status) VALUES (st.tx_id, st.status)
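To make the effect of the MERGE concrete, here is a minimal plain-Python model of its two branches. It does not talk to BigQuery; the function name merge_transactions and the row layout (dicts with tx_id and status) are assumptions made for illustration only.

```python
def merge_transactions(target_rows, staging_rows):
    """Model the MERGE: update the status of known tx_ids, insert unknown ones."""
    # Index the target table by tx_id (copying rows so the input is not mutated).
    by_id = {row["tx_id"]: dict(row) for row in target_rows}
    for row in staging_rows:
        if row["tx_id"] in by_id:
            # WHEN MATCHED THEN UPDATE SET status = st.status
            by_id[row["tx_id"]]["status"] = row["status"]
        else:
            # WHEN NOT MATCHED THEN INSERT (tx_id, status)
            by_id[row["tx_id"]] = {"tx_id": row["tx_id"], "status": row["status"]}
    return list(by_id.values())
```

Each tx_id ends up with exactly one row, which is why this approach avoids the duplicates that repeated streaming inserts produce.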

How to get Salesforce REST API to paginate?

I'm using the simple_salesforce python wrapper for the Salesforce REST API. We have hundreds of thousands of records, and I'd like to split up the pull of the salesforce data so all records are not pulled at the same time.
I've tried passing a query like:
results = salesforce_connection.query_all("SELECT my_field FROM my_model limit 2000 offset 50000")
to see records 50K through 52K but receive an error that offset can only be used for the first 2000 records. How can I use pagination so I don't need to pull all records at once?

You're looking to use salesforce_connection.query(query=SOQL) and then .query_more(nextRecordsUrl, True).
Since .query() returns at most 2000 records per call, you need to use .query_more to get the next page of results.
From the simple-salesforce docs:
SOQL queries are done via:
sf.query("SELECT Id, Email FROM Contact WHERE LastName = 'Jones'")
If, due to an especially large result, Salesforce adds a nextRecordsUrl to your query result, such as "nextRecordsUrl" : "/services/data/v26.0/query/01gD0000002HU6KIAW-2000", you can pull the additional results with either the ID or the full URL (if using the full URL, you must pass ‘True’ as your second argument)
sf.query_more("01gD0000002HU6KIAW-2000")
sf.query_more("/services/data/v26.0/query/01gD0000002HU6KIAW-2000", True)
Here is an example of using this:
data = []  # list to hold all the records
SOQL = "SELECT my_field FROM my_model"
results = sf.query(query=SOQL)  # API call
# loop through the results and add the records
for rec in results['records']:
    rec.pop('attributes', None)  # remove extra data
    data.append(rec)  # add the record to the list
# check the 'done' attribute in the response to see if there are more records
# while 'done' is False (more records to fetch), get the next page of records
while not results['done']:
    # attribute 'nextRecordsUrl' holds the URL of the next page of records
    results = sf.query_more(results['nextRecordsUrl'], True)
    # repeat the loop of adding the records
    for rec in results['records']:
        rec.pop('attributes', None)
        data.append(rec)
Looping through the records and using the data:
# loop through the records and get their attribute values
for rec in data:
    # the attribute name will always be the same as the Salesforce API name for that value
    print(rec['my_field'])
Like the other answer says, though, this can start to use up a lot of resources. But it is what you're looking for if you want to achieve pagination.
Maybe create a more focused SOQL statement to get only the records needed for your use case at that specific moment.

LIMIT and OFFSET aren't really meant to be used like that. What if somebody inserts or deletes a record at an earlier position (not to mention that you don't have an ORDER BY in there)? SF will open a proper cursor for you, so use it.
https://pypi.org/project/simple-salesforce/ docs for "Queries" say that you can either call query and then query_more or you can go query_all. query_all will loop and keep calling query_more until you exhaust the cursor - but this can easily eat your RAM.
Alternatively look into the bulk query stuff, there's some magic in the API but I don't know if it fits your use case. It'd be asynchronous calls and might not be implemented in the library. It's called PK Chunking. I wouldn't bother unless you have millions of records.
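One way to keep memory use flat is to wrap the query / query_more loop in a generator, so that only one page is held at a time. The sketch below assumes the response shape quoted above ('records', 'done', 'nextRecordsUrl'). The helper iter_records and the FakeSalesforce class are hypothetical names invented here so the example runs without a Salesforce org; in real use you would pass your simple_salesforce connection instead.

```python
def iter_records(sf, soql):
    """Yield records one at a time, fetching the next page only when needed."""
    results = sf.query(soql)
    while True:
        for rec in results["records"]:
            rec.pop("attributes", None)  # drop the metadata Salesforce attaches
            yield rec
        if results["done"]:
            break
        # second argument True: the identifier passed is the full URL
        results = sf.query_more(results["nextRecordsUrl"], True)


class FakeSalesforce:
    """Hypothetical stand-in for a connection; serves pre-built pages."""

    def __init__(self, pages):
        self._pages = pages

    def _page(self, index):
        last = index == len(self._pages) - 1
        page = {"records": [dict(r) for r in self._pages[index]], "done": last}
        if not last:
            page["nextRecordsUrl"] = f"/fake/query/{index + 1}"
        return page

    def query(self, soql):
        return self._page(0)

    def query_more(self, next_records_identifier, identifier_is_url=False):
        return self._page(int(next_records_identifier.rsplit("/", 1)[1]))
```

A caller can then write "for rec in iter_records(sf, SOQL): ..." and stop early without ever pulling the remaining pages.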

How to cleanup the JdbcMetadataStore?

Initially, our flow for communicating with Google Pub/Sub was as follows:
1. Application accepts a message
2. Checks that it doesn't exist in the idempotencyStore
3.1 If it doesn't exist - put it into the idempotency store (the key is the value of a unique header, the value is the current timestamp)
3.2 If it exists - just ignore this message
4. When processing is finished - send an acknowledge
5. In the acknowledge successful callback - remove this msg from the metadatastore
Point 5 is wrong because, in theory, we can get a duplicate message even after the message has been processed. Moreover, we found out that sometimes a message might not be removed even though the success callback was invoked (Message is received from Google Pub/Sub subscription again and again after acknowledge [Heisenbug]). So we decided to update the value after the message is processed and replace the timestamp with the "FINISHED" string.
But sooner or later this table will become overcrowded, so we have to clean up messages in the MetadataStore. We can remove messages that were processed more than 1 day ago.
As was mentioned in the comments of https://stackoverflow.com/a/51845202/2674303, I can add an additional column to the metadataStore table where I could mark whether a message is processed. That is not a problem at all. But how can I use this flag in my cleaner? MetadataStore has only a key and a value.

In the acknowledge successful callback - remove this msg from the metadatastore
I don't see a reason for this step at all.
Since you say that you store a timestamp in the value, you can analyze this table from time to time and remove entries that are definitely old.
In one of my projects we have a daily DB job that archives a table for better main-process performance, simply because we don't need the old data any more. For this reason we check a timestamp in the row to determine whether it should go into the archive or not. I wouldn't remove data immediately after processing, because there is a chance of redelivery from the external system.
On the other hand, for better performance I would add an extra indexed column of timestamp type to that metadata table and populate it via a trigger on each update or insert. Well, the MetadataStore entry is simply inserted by the MetadataStoreSelector:
return this.metadataStore.putIfAbsent(key, value) == null;
So, you need an on_insert trigger to populate that date column. This way you will know at the end of the day whether you need to remove an entry or not.
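The trigger-plus-cleanup idea can be sketched with Python's built-in sqlite3 module. This is only an illustration of the technique under stated assumptions: the table and column names (metadata_store, metadata_key, metadata_value, created_at) are hypothetical, not the real Spring Integration schema, and trigger syntax differs between databases.

```python
import sqlite3
import time

# Hypothetical schema: key/value plus an indexed timestamp filled in by a trigger.
SCHEMA = """
CREATE TABLE metadata_store (
    metadata_key   TEXT PRIMARY KEY,
    metadata_value TEXT,
    created_at     INTEGER
);
CREATE INDEX metadata_store_created_at ON metadata_store (created_at);
CREATE TRIGGER metadata_store_on_insert
AFTER INSERT ON metadata_store
WHEN NEW.created_at IS NULL
BEGIN
    UPDATE metadata_store
       SET created_at = CAST(strftime('%s', 'now') AS INTEGER)
     WHERE metadata_key = NEW.metadata_key;
END;
"""


def purge_older_than(conn, max_age_seconds, now=None):
    """Delete entries whose created_at is older than max_age_seconds; return the count."""
    if now is None:
        now = int(time.time())
    cursor = conn.execute(
        "DELETE FROM metadata_store WHERE created_at < ?",
        (now - max_age_seconds,),
    )
    conn.commit()
    return cursor.rowcount
```

A daily job would call purge_older_than(conn, 24 * 60 * 60). Because the application only ever writes the key and value, the store itself needs no changes; the timestamp column is maintained entirely by the database.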

Return the item number X in DynamoDB

I would like to provide one piece of content per day, storing all items in DynamoDB. I will add new content from time to time, but only one piece of content needs to be read per day.
It seems that using an incremental ID as the primary key is not recommended in DynamoDB.
Here is what I have at the moment:
content_table
id, content_title, content_body, content_author, view_count
1b657df9-8582-4990-8250-f00f2194abe9, title_1, body_1, author_1, view_count_1
810162c7-d954-43ff-84bf-c86741d594ee, title_2, body_2, author_2, view_count_2
4fdac916-0644-4237-8124-e3c5fb97b142, title_3, body_3, author_3, view_count_3
The database will have a low rate of new items, as I will add new content manually myself.
How can I get item number X without querying the whole database in Node.js?
Should I switch back to a MySQL database?
Should I use a homemade auto-increment even if it's an anti-pattern?
Should I use a time-based UUID and do a query like: get all IDs, sort them, and take number X in the array?
Should I use a tool like http://www.stateful.co/ ?
Thanks for your help

I would make the date your hash key. You can then simply get the content for any particular day using GetItem.
date, content_title, content_body, content_author, view_count
20180208, title_1, body_1, author_1, view_count_1
20180207, title_2, body_2, author_2, view_count_2
20180206, title_3, body_3, author_3, view_count_3
If you think you might have more than one piece of content for any one day in future, you could add a datetime attribute and make this the range key.
date, datetime, content_title, content_body, content_author, view_count
20180208, 20180208101010, title_1, body_1, author_1, view_count_1
20180208, 20180208111111, title_2, body_2, author_2, view_count_2
20180206, 20180206101010, title_3, body_3, author_3, view_count_3
It's then still very fast and simple to execute a Query to get the content for a particular day.
Note that due to the way DynamoDB distributes throughput, if you choose the second option, you might want to archive old content into another table.
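For the first layout (date as the hash key), the daily lookup could be sketched as follows. The code is written against the get_item(Key=...) call shape of a boto3 Table resource, but it runs here against FakeTable, a hypothetical in-memory stand-in. Storing the date as a "YYYYMMDD" string and the function name content_for_day are assumptions for illustration.

```python
from datetime import date


def content_for_day(table, day):
    """Fetch the single content item stored under the given day's hash key."""
    key = {"date": day.strftime("%Y%m%d")}
    response = table.get_item(Key=key)
    # The response has no 'Item' entry when nothing is stored for that day.
    return response.get("Item")


class FakeTable:
    """Hypothetical stand-in that mimics get_item for demonstration only."""

    def __init__(self, items):
        self._items = {item["date"]: item for item in items}

    def get_item(self, Key):
        item = self._items.get(Key["date"])
        return {"Item": item} if item is not None else {}
```

Because the key is derived from the calendar date, no counter or scan is needed to find "today's" item.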

Sequelize/Postgres - how to update each row individually on migrate?

I have lots of records in my Postgres database (using Sequelize to communicate).
I want to have a migration script, but due to locking, I have to make each change as atomic as possible.
So I don't want to selectAll, then modify, and then saveAll.
In Mongo I have a forEach cursor, which allows me to update a record, save it, and only then move on to the next one.
Is there anything similar in Sequelize/Postgres?
Currently, I am doing that in my code: getting the IDs, then performing a query for each one.
return migration.runOnAllUpdates((record) => {
  record.change = 'new value';
  return record.save();
});
where runOnAllUpdates will simply give me the records one by one.

Why is GetPaged() Executing two database calls?

I'm a bit new to SubSonic (i.e. evaluating 3.0.0.3) and have come across a strange behavior in GetPaged(int pageIndex, int pageSize). When I execute the method, it makes two SQL calls. Any ideas why?
Details
Let's say I have a "Cultures" table with 200 rows. In my code I do something like ...
var sonicCollection = from c in RTE.Culture.GetPaged(1, 25)
select c;
Now, I would expect this executes a single query returning the first 25 entries in my cultures table. When I watch SQL profiler I see two queries run by.
First this:
SELECT [dbo].[Cultures].[cultureCode], [dbo].[Cultures].[cultureName]
FROM [dbo].[Cultures]
Then this:
SELECT *
FROM (SELECT ROW_NUMBER() OVER (ORDER BY cultureID ASC) AS Row,
             [dbo].[Cultures].[cultureCode], [dbo].[Cultures].[cultureName]
      FROM [dbo].[Cultures]
     ) AS PagedResults
WHERE Row >= 1 AND Row <= 25
I expect the 2nd query to roll by, as it is the one returning the 25 rows I politely requested of subsonic. The first query, however, appears to return 200 rows (at least according to SQL profiler).
Any ideas what's going on?

It's a bug in the code. The code actually queries every record and then iterates over each one for the count. I've created an issue in the github repo here:
https://github.com/subsonic/SubSonic-3.0/issues/259
You can download the source, fix the issue, and recompile pretty easily. I've done this and it's fixed my issue.

You just want to use RTE.Culture.GetPaged() - it runs the paged query for you.
