Right now I'm loading files from Google Cloud Storage into BigQuery using the following lines in Node.js:
const bigquery = new BigQuery();
const storage = new Storage(); // from @google-cloud/storage
bigquery.dataset(xx).table(xx).load(storage.bucket(bucketName).file(fileName));
But now I'd like to add a new timestamp column to this data before it lands in BigQuery. How can I do this?
Two questions come to mind:
First, read the file into some data structure such as an array:
array = FunctionToReadFileNameToArray(FileName);
Does such a function exist? Supposing it does, it would then be quite easy to manipulate the array and add the timestamp column.
Second, load the new array data into BigQuery. But the only way I can find to insert data this way is the streaming insert:
bigquery.dataset(xx).table(xx).insert(rows);
And here rows is a different data structure, like a dictionary/map, not an array. So how can we load an array into BigQuery?
Thanks
I'm going to assume you have a file (object) of structured records (JSON, XML, CSV). The first task would appear to be opening that GCS object for reading. You would then read one record at a time, augment that record with your desired extra column (the timestamp), and then invoke the insert() API. This API can take a single object to be inserted or an array of objects.
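For example, here is a rough, untested sketch of that approach, assuming the file is a CSV with a header row; the dataset, table, column and timestamp-field names are placeholders you would adapt:

const { BigQuery } = require('@google-cloud/bigquery');
const { Storage } = require('@google-cloud/storage');
const readline = require('readline');

const bigquery = new BigQuery();
const storage = new Storage();

async function loadWithTimestamp(bucketName, fileName) {
  const table = bigquery.dataset('my_dataset').table('my_table'); // placeholder names
  const lines = readline.createInterface({
    input: storage.bucket(bucketName).file(fileName).createReadStream(),
  });

  let header;
  const rows = [];
  for await (const line of lines) {
    if (!header) { header = line.split(','); continue; }  // first line holds the column names
    const values = line.split(',');                       // naive split; quoted commas not handled
    const row = Object.fromEntries(header.map((h, i) => [h, values[i]]));
    row.load_time = new Date().toISOString();             // the extra timestamp column
    rows.push(row);
  }
  // insert() accepts a single row object or an array of them;
  // for big files you would insert in smaller batches rather than all at once.
  await table.insert(rows);
}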
However ... if this is a one-time event or can be performed in batch ... you may find it cheaper to read the GCS object, write a new GCS object containing your desired data, and THEN load the data into BQ as a unit. Looking at the pricing for BQ, streaming inserts are charged at $0.01 per 200 MB on top of the storage costs, a charge that is bypassed when the GCS object is loaded as a unit. My own thinking is that doing extra work to save pennies is a poor use of time/money, but if you are processing TB of data over months, it may add up.
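And a sketch of that batch alternative (again untested, with placeholder names; it assumes a CSV with a header row and a destination table whose schema already includes the extra column):

const { BigQuery } = require('@google-cloud/bigquery');
const { Storage } = require('@google-cloud/storage');

const bigquery = new BigQuery();
const storage = new Storage();

async function rewriteAndLoad(bucketName, fileName) {
  const src = storage.bucket(bucketName).file(fileName);
  const dst = storage.bucket(bucketName).file(fileName + '.with_ts.csv');

  const [contents] = await src.download();                  // fine for small/medium objects
  const [header, ...rows] = contents.toString().trim().split('\n');
  const ts = new Date().toISOString();
  const output = [header + ',load_time']
    .concat(rows.map(r => r + ',' + ts))
    .join('\n');
  await dst.save(output);                                   // write the augmented object back to GCS

  // Load the new object as a unit -- no streaming-insert charges.
  await bigquery.dataset('my_dataset').table('my_table').load(dst, {
    sourceFormat: 'CSV',
    skipLeadingRows: 1,
  });
}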
I am using Data Factory to copy a collection from Mongo Atlas to ADLS Gen2.
By default Data Factory creates one JSON file per collection, but that leaves me with one huge JSON file.
I checked data flows and transformations, but they work on a file that is already present in ADLS. Is there a way I can split the data as it comes into ADLS, rather than first getting a huge file and then post-processing and splitting it into smaller files?
If the collection size is 5 GB, is it possible for Data Factory to split it into chunks of 100 MB as the copy runs?
I would suggest using the Partition option on the Sink (under the Optimize tab) so the output is written as multiple files.
Refer - https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#optimize-tab
I am using a data flow activity to convert MongoDB data to SQL.
As of now, MongoDB/Atlas is not supported as a source in data flows, so I am converting the MongoDB data to a JSON file in Azure Blob Storage and then using that JSON file as the source in the data flow.
For a JSON source file whose size is around or more than 4 GB, whenever I try to import the projection, the Azure Integration Runtime throws an error.
I have changed the core size to 16+16 and cluster type to memory optimized.
Is there any other way to import the projection?
Since your source data is one large file that contains lots of rows with a possibly complex schema, you can create a temporary sample file with a few rows that contain all the columns you want to read, and then do the following:
1. In the data flow source Debug Settings, point to the sample file, then select Import projection to get the complete schema.
2. Next, roll back the Debug Settings to use the original source dataset for the remaining data movement/transformation.
If you want to map data types as well, you can follow this official MS recommendation doc, as mapping data types is not directly supported for a JSON source.
The workaround for this was:
Instead of pulling all the data from Mongo into a single blob, I pulled small chunks (500 MB-1 GB each) by using the limit and skip options in the Copy Data activity,
and stored them in different JSON blobs.
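For illustration only, here is the same skip/limit chunking idea sketched with the Node.js mongodb driver (connection string, collection name and chunk size are hypothetical); in my case the equivalent settings went into the Copy Data activity source:

const { MongoClient } = require('mongodb');

async function* readInChunks(uri, dbName, collName, chunkSize) {
  const client = new MongoClient(uri);
  await client.connect();
  try {
    const coll = client.db(dbName).collection(collName);
    for (let skip = 0; ; skip += chunkSize) {
      const docs = await coll.find({}).skip(skip).limit(chunkSize).toArray();
      if (docs.length === 0) break;
      yield docs; // each chunk can then be written to its own JSON blob
    }
  } finally {
    await client.close();
  }
}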
We have one Lambda that updates a DynamoDB table after some operation.
Now we want to export the whole DynamoDB table into an S3 bucket in CSV format.
Is there any efficient way to do this?
I have also found the below way of exporting directly from DynamoDB to S3:
https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/
But the above stores the data in JSON format, and I cannot find a way to do this efficiently for 10 GB of data.
As far as I can tell you have a couple of "simple" options.
Option #1: Program that does a Scan
It is fairly simple to write a program that does a (parallel) scan of your table and then outputs the result as a CSV. A no-bells-and-whistles version of this is about 100-150 lines of code in Python or Go (a rough sketch follows at the end of this option).
Advantages:
Easy to develop
Can be run easily multiple times from local machines or CI/CD pipelines or whatever.
Disadvantages:
It will cost you a bit of money. Scanning the whole table will use up some read units. Depending on the amount you are reading, this might get costly fast.
Depending on the amount of data this can take a while.
Note: If you want to run this in a Lambda, then remember that Lambdas can run for a maximum of 15 minutes. So once you have more data than can be processed within those 15 minutes, you probably need to switch to Step Functions.
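For what it's worth, here is a rough, untested Node.js sketch of option #1 (the structure is the same in Python or Go); the table name, column list, output file and segment count are placeholders:

const { DynamoDBClient, paginateScan } = require('@aws-sdk/client-dynamodb');
const { unmarshall } = require('@aws-sdk/util-dynamodb');
const fs = require('fs');

const client = new DynamoDBClient({});
const TABLE_NAME = 'my-table';                 // placeholder
const COLUMNS = ['pk', 'sk', 'payload'];       // placeholder CSV columns

async function exportSegment(segment, totalSegments, out) {
  // Each worker scans its own slice of the table (a "parallel scan").
  const pages = paginateScan({ client }, {
    TableName: TABLE_NAME,
    Segment: segment,
    TotalSegments: totalSegments,
  });
  for await (const page of pages) {
    for (const item of page.Items ?? []) {
      const row = unmarshall(item);            // DynamoDB JSON -> plain JS object
      // Naive CSV quoting via JSON.stringify; good enough for a sketch.
      out.write(COLUMNS.map(c => JSON.stringify(row[c] ?? '')).join(',') + '\n');
    }
  }
}

(async () => {
  const out = fs.createWriteStream('export.csv');
  out.write(COLUMNS.join(',') + '\n');
  const totalSegments = 4;                     // degree of parallelism
  await Promise.all(
    [...Array(totalSegments).keys()].map(s => exportSegment(s, totalSegments, out))
  );
  out.end();
})();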
Option #2: Process an S3 backup
DynamoDB allows you to create backups (exports) of your table to S3 (as the article you linked describes). Those backups will be either plain JSON or a JSON-like AWS format. You can then write a program that converts those JSON files to CSV (a sketch follows at the end of this option).
Advantages:
(A lot) cheaper than a scan
Disadvantages:
Requires more "plumbing" because you need to first create the backup, then download it from S3 to wherever you want to process it, etc.
Probably will take longer than option #1
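Again just a sketch for option #2: converting one exported file to CSV, assuming the newline-delimited DynamoDB-JSON export format where each line looks like {"Item": {...}}; file names and columns are placeholders:

const fs = require('fs');
const zlib = require('zlib');
const readline = require('readline');
const { unmarshall } = require('@aws-sdk/util-dynamodb');

const COLUMNS = ['pk', 'sk', 'payload'];       // placeholder CSV columns

async function convert(inputFile, outputFile) {
  const out = fs.createWriteStream(outputFile);
  out.write(COLUMNS.join(',') + '\n');

  const lines = readline.createInterface({
    input: fs.createReadStream(inputFile).pipe(zlib.createGunzip()), // export files are gzipped
  });
  for await (const line of lines) {
    if (!line.trim()) continue;
    const record = unmarshall(JSON.parse(line).Item); // DynamoDB JSON -> plain object
    out.write(COLUMNS.map(c => JSON.stringify(record[c] ?? '')).join(',') + '\n');
  }
  out.end();
}

convert('export-chunk.json.gz', 'export-chunk.csv');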
Hi, I've got a simple collection with 40k records in it. It's just an import of a CSV (c. 4 MB), so it has a consistent object per document, and it is for an Open Data portal.
I need to be able to offer a full download of the data as well as the capabilities of AQL for querying, grouping, aggregating etc.
If I set batchSize to the full dataset then it takes around 50 seconds to return, and the response is unsurprisingly about 12 MB because the column names are repeated in every document, e.g.:
{"query":"for x in dataset return x","batchSize":50000}
I've tried things like caching and balancing between a larger batchSize and using the cursor to build up the whole dataset, but I can't get the response time down very much.
Today I came across the ATTRIBUTES() and VALUES() functions and created this AQL statement:
{"query":"return union(
for x in dataset limit 1 return attributes(x,true),
for x in dataset return values(x,true))","batchSize":50000}
It will mean I have to unparse the result into CSV, but I use PapaParse so that should be no issue (not proved yet; a rough sketch is below).
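Something like this is what I have in mind for the unparse step (untested; it assumes the cursor result is the single array produced by return union(...), i.e. [attributeNames, row1Values, row2Values, ...]):

const Papa = require('papaparse');

function toCsv(unionResult) {
  const [fields, ...rows] = unionResult;        // first element: column names, rest: value arrays
  return Papa.unparse({ fields, data: rows });  // PapaParse accepts a { fields, data } object
}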
Is this the best / only way to have an option to output the full CSV and still have a response that performs well?
I am trying to avoid having to store the data multiple times, e.g. once as the raw CSV and again as documents in a collection. I guess there may be datasets that are too big to cope with this approach, but this is one of our bigger ones.
Thanks
I wrote a console application that reads a list of flat files, parses the data types on a row-by-row basis, and inserts the records one after another into the respective tables.
There are a few flat files that contain about 63k records (rows). For such files, my program takes about 6 hours to complete one file of 63k records.
This is a test data file; in production I have to deal with 100 times more load.
I am worried about whether I can do this any better to speed it up.
Can anyone suggest a good way to handle this job?
The workflow is as below:
Read the flat file from the local machine using File.ReadAllLines("location")
Create a record entity object after parsing each field of the row.
Insert the current row into the respective table via the entity.
The purpose of making this a console application is that it should run as a scheduled application on a weekly basis, and there is conditional logic in it: based on some variable there will be either
a full table replace, or
an update of an existing table, or
a deletion of records in a table.
You can try to use a 'bulk insert' operation (for example, SqlBulkCopy if the target is SQL Server) instead of row-by-row inserts when loading a huge amount of data into the database.