Is there any way to append two files in GCS? Suppose the first file is a full
load and the second file is an incremental load. How can we append
the two?
Secondly, gsutil compose appends the two files including the attribute
names (the header row) as well. In the final file I want only the data of the two files.
You can append two separate files using compose in the Google Cloud Shell and write the output under the first file's name, like this:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/obj1
This command is meant for parallel uploads, in which you divide a large file into smaller objects. These are uploaded to Google Cloud Storage and then composed to recreate the original file. You can find more information in the documentation on Composite Objects and Parallel Uploads.
I've come up with two possible solutions:
Google Cloud Function solution
The option I would go for is a Cloud Function, doing something like the following:
Create an empty bucket like append_bucket.
Upload the first file.
Create a Cloud Function that is triggered by newly uploaded files in the bucket.
Upload the second file.
Read the first and the second file (you will have to download them as strings first).
Make the append operation.
Upload the result to the bucket.
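The append step above can be sketched as follows. This is a minimal sketch that assumes both files are CSV text starting with the same header row; append_csv is a hypothetical helper, not part of any Google library. Inside a Cloud Function the two strings would come from the storage client (for example blob.download_as_text() in the Python client) and the result would be written back with blob.upload_from_string().

```python
def append_csv(full_text: str, incremental_text: str) -> str:
    """Append incremental rows to the full load, keeping a single header row."""
    full_lines = full_text.splitlines()
    incr_lines = incremental_text.splitlines()
    if not full_lines:
        # Nothing to append to: the incremental file is the result.
        return incremental_text
    header = full_lines[0]
    # Drop the incremental file's first line only if it repeats the header.
    if incr_lines and incr_lines[0] == header:
        incr_lines = incr_lines[1:]
    return "\n".join(full_lines + incr_lines) + "\n"
```

Unlike gsutil compose, which concatenates the objects byte for byte, this keeps the attribute names only once in the final file.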
Google Dataflow solution
You can also do it with Dataflow for BigQuery (keep in mind it’s still in beta).
Create a BigQuery dataset and table.
Create a Dataflow job from the template Cloud Storage Text to BigQuery.
Create a JavaScript file with the logic to transform the text.
Upload your files in JSON format to the bucket.
Dataflow will read the JSON file, execute the JavaScript code and append the new data to the BigQuery table.
Finally, export the BigQuery query result to Cloud Storage.
I have a requirement wherein I will receive file content which I need to load into BigQuery tables. The standard API shows how to load data from a local file, but I don't see any variant of the load method that accepts file content as a string rather than a file path. Any idea how I can achieve this?
As we can see in the source code and the official documentation, the load function loads data only from a local file or a Storage File. The allowed formats are:
AVRO,
CSV,
JSON,
ORC,
PARQUET
The load job is created and runs your data load asynchronously. If you would like immediate access to your data, insert it using the table's insert function, where you provide the rows to insert into the table:
// Insert a single row
table.insert({
  INSTNM: 'Motion Picture Institute of Michigan',
  CITY: 'Troy',
  STABBR: 'MI'
}, insertHandler);
If you want to load, e.g., a CSV file, you first need to save the data to a CSV file in Node.js manually. Then load it as a single-column CSV using the load() method. That will load the whole string as a single column.
Additionally, I can recommend Dataflow templates, e.g. Cloud Storage Text to BigQuery, which read text files stored in Cloud Storage, transform them using a JavaScript User Defined Function (UDF), and output the result to BigQuery. In that case, however, the data to load needs to be stored in Cloud Storage.
I’m building out a pipeline that should execute and train fairly frequently. I’m following this: https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-create-your-first-pipeline
I’ve got a Stream Analytics job dumping telemetry into .json files on blob storage (soon to be ADLS Gen2). I want to find all .json files and use all of them to train with. I could possibly use just the new .json files as well (an interesting option, honestly).
Currently I just have the store mounted to a data lake and available; the code iterates the mount for the data files and loads them up.
How can I use data references for this instead?
What do data references do for me that mounting timestamped data does not?
a. From an audit perspective, I have version control, execution time, and timestamped read-only data. Doing a replay on this would require additional coding, but is doable.
As mentioned, the input to the step can be a DataReference to the blob folder.
You can use the default store or add your own store to the workspace.
Then add that as an input. When you get a handle to that folder in your train code, just iterate over the folder as you normally would. I wouldn't dynamically add steps for each file; I would just read all the files from your storage in a single step.
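from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import PythonScriptStep

# ws (the Workspace), compute and run_config are assumed to be defined already
ds = ws.get_default_datastore()
blob_input_data = DataReference(
    datastore=ds,
    data_reference_name="data1",
    path_on_datastore="folder1/")

step1 = PythonScriptStep(name="1step",
                         script_name="train.py",
                         compute_target=compute,
                         source_directory='./folder1/',
                         arguments=['--data-folder', blob_input_data],
                         runconfig=run_config,
                         inputs=[blob_input_data],
                         allow_reuse=False)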
Then inside your train.py you access the path as:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder')
args = parser.parse_args()
print('Data folder is at:', args.data_folder)
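Iterating over the folder inside train.py could then look roughly like this. It is a minimal sketch that assumes each .json file holds a single JSON document; load_json_files is a hypothetical helper name, and if the Stream Analytics output is line-separated JSON you would parse each line instead.

```python
import glob
import json
import os


def load_json_files(data_folder: str) -> list:
    """Load every .json file found under data_folder, including subfolders."""
    records = []
    pattern = os.path.join(data_folder, "**", "*.json")
    # Sort the paths so the load order is deterministic between runs.
    for path in sorted(glob.glob(pattern, recursive=True)):
        with open(path, "r", encoding="utf-8") as handle:
            # Assumption: one JSON document per file.
            records.append(json.load(handle))
    return records
```

In the training script this would be called as load_json_files(args.data_folder).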
Regarding benefits, it depends on how you are mounting. For example if you are dynamically mounting in code, then the credentials to mount need to be in your code, whereas a DataReference allows you to register credentials once, and we can use KeyVault to fetch them at runtime. Or, if you are statically making the mount on the machine, you are required to run on that machine all the time, whereas a DataReference can dynamically fetch the credentials from any AMLCompute, and will tear that mount down right after the job is over.
Finally, if you want to train at a regular interval, it's pretty easy to schedule the pipeline to run regularly. For example:
from azureml.pipeline.core.schedule import ScheduleRecurrence, Schedule

# pipeline_run1 and ws are assumed to be defined already
pub_pipeline = pipeline_run1.publish_pipeline(name="Sample 1", description="Some desc", version="1", continue_on_step_failure=True)
recurrence = ScheduleRecurrence(frequency="Hour", interval=1)
schedule = Schedule.create(workspace=ws, name="Schedule for sample",
                           pipeline_id=pub_pipeline.id,
                           experiment_name='Schedule_Run_8',
                           recurrence=recurrence,
                           wait_for_provisioning=True,
                           description="Scheduled Run")
You could pass a pointer to the folder as an input parameter for the pipeline, and then your step can mount the folder to iterate over the JSON files.
I have a requirement whereby I need to convert all the JSON files in my bucket into one newline-delimited JSON file for a 3rd party to consume. However, I need to make sure that each newly created newline-delimited JSON file only includes files that were received in the last 24 hours, in order to avoid picking up the same files over and over again. Can this be done inside the s3.getObject(getParams, function(err, data) function? Any advice regarding a different approach is appreciated.
Thank you
You could try the S3 ListObjects operation and filter the result by the LastModified metadata field. For new objects, LastModified holds the time the file was created; for changed files, it holds the time of the last modification.
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#listObjectsV2-property
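The filtering logic can be sketched as follows. The thread uses the JavaScript SDK; this Python sketch only illustrates the idea and makes no AWS calls. It assumes the input is a list shaped like the Contents array of a ListObjectsV2 response (entries with Key and a timezone-aware LastModified). recent_keys and to_ndjson are hypothetical helper names.

```python
import json
from datetime import datetime, timedelta, timezone


def recent_keys(contents, now=None, window=timedelta(hours=24)):
    """Return the keys whose LastModified falls within the last `window`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - window
    return [obj["Key"] for obj in contents if obj["LastModified"] >= cutoff]


def to_ndjson(documents):
    """Serialize JSON-compatible objects as newline-delimited JSON."""
    return "".join(
        json.dumps(doc, separators=(",", ":")) + "\n" for doc in documents
    )
```

Each key returned by recent_keys would then be fetched, parsed, and passed to to_ndjson to build the combined file.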
There is a more complicated approach, using Amazon Athena with AWS Glue, but it requires changing your S3 object keys so that they are split into partitions, where the partition key is the date-time.
For example:
s3://bucket/reports/date=2019-08-28/report1.json
s3://bucket/reports/date=2019-08-28/report2.json
s3://bucket/reports/date=2019-08-28/report3.json
s3://bucket/reports/date=2019-08-29/report1.json
This approach can be implemented in two ways, depending on your file schema. If all your JSON files have the same format/properties/schema, you can create a Glue table, add the root reports path as the source for this table, add the date partition value (2019-08-28), and query the data in Amazon Athena with a regular SELECT * FROM reports WHERE date='2019-08-28'. If not, create a Glue crawler with a JSON classifier, which will populate your tables, and then use Athena in the same way to query the data into a combined JSON file.
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-legislators.html
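Building keys in the partitioned layout shown above could be sketched like this. partitioned_key is a hypothetical helper, and the reports root simply mirrors the example paths; nothing here calls AWS.

```python
from datetime import datetime, timezone


def partitioned_key(file_name, received_at, root="reports"):
    """Build a date-partitioned object key such as reports/date=2019-08-28/x.json."""
    # The date=YYYY-MM-DD segment is what becomes the partition value.
    return f"{root}/date={received_at:%Y-%m-%d}/{file_name}"
```

Writing each incoming file under the key for its arrival date means a daily export only has to read one partition.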
I am trying to figure out how to iterate over the objects under a directory in Google Cloud Storage. The addresses are similar to this:
gs://project_ID/bucket_name/DIRECTORY/file1
gs://project_ID/bucket_name/DIRECTORY/file2
gs://project_ID/bucket_name/DIRECTORY/file3
gs://project_ID/bucket_name/DIRECTORY/file4
...
The DIRECTORY on the GCS bucket has a bunch of different files that I need to iterate over, so that I can check when it was last updated (to see if it is a new file there) so that I can pull the contents.
Example function
from google.cloud import storage

def getNewFiles():
    storage_client = storage.Client(project='project_ID')
    try:
        bucket = storage_client.get_bucket('bucket_name')
    except Exception:
        # Fall back to creating the bucket if it cannot be fetched
        bucket = storage_client.create_bucket('bucket_name')
    for blob in bucket.list_blobs(prefix='DIRECTORY'):
        if blob.name == 'DIRECTORY/':
            # Iterate through this directory
            # CODE NEEDED HERE
            # Figure out how to iterate through all files here
            pass
I have gone through the Python API and the client library, and can't find any examples of this working.
According to Google Cloud Client Library for Python docs, blob.name:
This corresponds to the unique path of the object in the bucket
Therefore blob.name will return something like this:
DIRECTORY/file1
If you are already including the parameter prefix='DIRECTORY' when using the list_blobs() method you can get all your files in your directory by doing:
for blob in bucket.list_blobs(prefix='DIRECTORY'):
    print(blob.name)
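You can use something like blob.name[len('DIRECTORY/'):], os.path.basename(blob.name), or the standard library re module to clean the string and get only the file name. Avoid blob.name.lstrip('DIRECTORY'): lstrip treats its argument as a set of characters rather than a prefix, so it leaves the leading slash here, and lstrip('DIRECTORY/') would also strip leading characters of the file name (DIRECTORY/DATA.csv becomes ATA.csv).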
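Putting the pieces together, picking out the files updated after a given time could be sketched as follows. This is a minimal sketch: file_name and new_files are hypothetical helpers, and it assumes each item exposes name and updated attributes the way blobs from the Python client library do, with updated being a timezone-aware datetime.

```python
from datetime import datetime, timezone


def file_name(blob_name, prefix="DIRECTORY/"):
    """Strip a literal prefix from an object name (unlike str.lstrip)."""
    if blob_name.startswith(prefix):
        return blob_name[len(prefix):]
    return blob_name


def new_files(blobs, since, prefix="DIRECTORY/"):
    """Return file names of blobs updated after `since`."""
    return [
        file_name(blob.name, prefix)
        for blob in blobs
        # Skip the placeholder object that represents the folder itself.
        if blob.name != prefix and blob.updated > since
    ]
```

A call such as new_files(bucket.list_blobs(prefix='DIRECTORY/'), since=last_check) would then give the names to pull, where last_check is whatever timestamp you stored from the previous run.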
However, you said: "so that I can check when it was last updated (to see if it is a new file there)". If you are looking for a function to be triggered when new files arrive in your bucket, you can use Google Cloud Functions. The docs describe how to use them with Cloud Storage when new objects are created. Note that as of the current date (Feb/2018) you can only write Cloud Functions in Node.js.
I am using Stream Analytics to join streaming data (via IoT Hub) and reference data (via blob storage). The reference data blob file is generated every minute with latest data and is in a format "filename-{date} {time}.csv". The reference blob file data is used in the Azure Machine Learning function as parameters in SA job. The output of stream analytics job (into Azure SQL or Power BI) seems to be generating multiple rows instead of one for Azure Machine Learning function's output, one each for parameter values from previous blob files. My understanding is that it should only use the latest blob file content but looks like it is using all the blob files and generating multiple rows from AML output. Here is the query I am using:
SELECT
AMLFunction(Ref.Input1, Ref.Input2), *
FROM IoTInput Stream
LEFT JOIN RefBlobInput Ref ON Stream.DeviceId = Ref.[DeviceID]
Could you please advise whether the query or the file path needs changing to avoid duplicate records? Thanks
To make the job use only the latest file, you need to store your files in a particular folder structure.
As you may have noticed, when you configure a reference data input, the input dialog asks for a path pattern along with date and time formats.
Stream Analytics always looks for the reference file in the latest {date}/{time} folder, i.e. you need to store your file like this:
2018-01-25/07-30/filename.json (YYYY-MM-DD/HH-mm/filename.json)
NOTE: The time folder needs to be unique for each minute, and likewise the date folder needs to be unique for each date. Whenever you create a new file, create it under a new timestamp folder inside the current date folder.
You can use any date and time format that the reference data input supports.
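Generating the blob path for each new reference file could be sketched like this. It is a minimal sketch that assumes the YYYY-MM-DD/HH-mm pattern from the example and UTC timestamps; reference_blob_path is a hypothetical helper, and the formats must match whatever is configured on the reference data input.

```python
from datetime import datetime, timezone


def reference_blob_path(timestamp, file_name="filename.json"):
    """Build a {date}/{time}/file path with one folder per minute."""
    # %H-%M yields a distinct time folder for every minute of the day.
    return f"{timestamp:%Y-%m-%d}/{timestamp:%H-%M}/{file_name}"
```

The job that writes the reference data every minute would upload each file to reference_blob_path(datetime.now(timezone.utc)) instead of encoding the date and time in the file name.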