I am creating a pipeline in Azure Data Factory where I use a Function App activity to transform data and store it as CSV in an append blob container. I run 50 batches in a ForEach loop, so the function app processes data 50 times, once per order. I append the header to the CSV file with the logic below.
// First I create the file as per business logic
// csveventcontent is my source data
var dateAndTime = DateTime.Now.AddDays(-1);
string FileDate = dateAndTime.ToString("ddMMyyyy");
string FileName = _config.ContainerName + FileDate + ".csv";
StringBuilder csveventcontent = new StringBuilder();
OrderEventService obj = new OrderEventService();
// Now check whether today's file exists; if it doesn't, create it.
if (await appBlob.ExistsAsync() == false)
{
await appBlob.CreateOrReplaceAsync();
//Append Header
csveventcontent.AppendLine(obj.GetHeader());
}
Now the problem is that the header is appended many times in the CSV file, and sometimes it is not at the top, probably because the function app runs 50 times in parallel.
How can I make sure the header is written only once, at the top?
I have tried Data Flow and Logic Apps as well but could not solve it there. If it can be handled through code, that would be easier, I guess.
I think you are right there. It's the concurrency of the function app that is causing the problem. The best approach would be to use a queue and process the messages one by one. Alternatively, you could use a distributed lock to ensure only one function instance writes to the file at a time; blob leases work well for this.
The Lease Blob operation creates and manages a lock on a blob for write and delete operations. The lock duration can be 15 to 60 seconds, or can be infinite.
Refer: Lease Blob Request Headers
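For illustration, here is a rough sketch of the lease idea against the same CloudAppendBlob API the question uses. It assumes a CloudBlobContainer variable named container and takes the lease on a separate, illustrative "header.lock" blob (you cannot lease a blob that does not exist yet); treat it as a starting point rather than a drop-in fix.

// Sketch only: serialise the exists-check + header write behind a blob lease.
CloudBlockBlob lockBlob = container.GetBlockBlobReference("header.lock");

// Make sure the lock blob exists before trying to lease it.
if (!await lockBlob.ExistsAsync())
{
    try { await lockBlob.UploadTextAsync(string.Empty); }
    catch (StorageException) { /* created or already leased by another instance - fine */ }
}

// Acquire a 30-second lease; other instances get 409 (Conflict) until it is released.
string leaseId = null;
while (leaseId == null)
{
    try { leaseId = await lockBlob.AcquireLeaseAsync(TimeSpan.FromSeconds(30), null); }
    catch (StorageException ex) when (ex.RequestInformation.HttpStatusCode == 409)
    {
        await Task.Delay(500); // lease held elsewhere - wait and retry
    }
}

try
{
    // Only the lease holder runs the exists-check, so the header is written exactly once.
    if (!await appBlob.ExistsAsync())
    {
        await appBlob.CreateOrReplaceAsync();
        await appBlob.AppendTextAsync(obj.GetHeader() + Environment.NewLine);
    }
}
finally
{
    await lockBlob.ReleaseLeaseAsync(AccessCondition.GenerateLeaseCondition(leaseId));
}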
I am accepting multiple zip files which I want to process in a durable orchestrator. My durable orchestrator is HTTP-triggered.
I am able to access the files in the HTTP trigger as a MultipartMemoryStreamProvider, but when I pass it to the durable orchestrator, the orchestrator triggers yet cannot get the files for further processing.
Below is my HTTP trigger function code that reads the multiple files and passes them to the orchestrator:
var data = await req.Content.ReadAsMultipartAsync();
string instanceId = await starter.StartNewAsync("ParentOrchestrator", data);
Orchestrator Trigger code:
public static async Task<List<string>> RunOrchestrator(
    [OrchestrationTrigger] IDurableOrchestrationContext context)
{
    var files = context.GetInput<System.Net.Http.MultipartMemoryStreamProvider>();
    // files cannot be read here for further processing
}
To read the input I also tried creating a class and passing the stream to a property so the data could be serialized as JSON, but that did not work either.
Am I missing anything in the code?
The issue is how to get the zip files into the orchestrator for processing.
When I inspect the raw input under the orchestrator context, I can see the file name and other details.
Passing the files themselves as input seems like a bad idea to me.
Those inputs will be loaded by the orchestrator from Table Storage/Blob Storage each time it replays.
Instead, I would recommend that you upload the zip files to Blob Storage and pass the blob URLs as input to the orchestrator.
Then you use the URLs as inputs to the activities where the files are actually processed, along the lines of the sketch below.
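To make that concrete, here is a rough sketch under a few assumptions: an Azure.Storage.Blobs BlobContainerClient called containerClient is available in the HTTP starter, and a hypothetical "ProcessZip" activity does the actual work; adapt names to your project.

// HTTP starter: persist each uploaded part to Blob Storage and pass only the blob names.
var provider = await req.Content.ReadAsMultipartAsync();
var blobNames = new List<string>();
foreach (var part in provider.Contents)
{
    string blobName = part.Headers.ContentDisposition?.FileName?.Trim('"') ?? $"{Guid.NewGuid()}.zip";
    await containerClient.UploadBlobAsync(blobName, await part.ReadAsStreamAsync());
    blobNames.Add(blobName);
}
string instanceId = await starter.StartNewAsync("ParentOrchestrator", blobNames);

// Orchestrator: only small, serialisable blob names go through the replay history.
[FunctionName("ParentOrchestrator")]
public static async Task<List<string>> RunOrchestrator(
    [OrchestrationTrigger] IDurableOrchestrationContext context)
{
    var names = context.GetInput<List<string>>();
    var tasks = names.Select(name => context.CallActivityAsync<string>("ProcessZip", name));
    return (await Task.WhenAll(tasks)).ToList();
}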
The orchestrator accepts only data that can be serialized. Since a memory stream is not serializable, the orchestrator was not able to retrieve the data using GetInput<MultipartMemoryStreamProvider>().
I converted the memory stream to a byte array, because a byte array can be serialized.
I read multiple articles claiming that if we convert a file to a byte array we lose the file metadata. Actually, if you read the file as a stream and then convert that stream to a byte array, the file data along with its metadata is converted into the byte array.
Here are the steps:
1) Read the HTTP request message with ReadAsMultipartAsync(); this gives a MultipartMemoryStreamProvider object.
2) Convert the data to byte arrays.
3) Pass them to the orchestrator.
4) Receive the files as byte arrays by using GetInput<byte[]>().
5) In the orchestrator, convert the byte array back to a stream: MemoryStream ms = new MemoryStream(<input byte array>)
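For reference, a condensed sketch of those steps, using List<byte[]> so that multiple files survive the round trip; the ZipArchive usage and variable names are illustrative, not from the original post.

// HTTP starter: flatten each multipart section to a byte array, which serialises cleanly.
var provider = await req.Content.ReadAsMultipartAsync();
var files = new List<byte[]>();
foreach (var part in provider.Contents)
{
    files.Add(await part.ReadAsByteArrayAsync());
}
string instanceId = await starter.StartNewAsync("ParentOrchestrator", files);

// Orchestrator side: rebuild a stream from each byte array.
var inputs = context.GetInput<List<byte[]>>();
foreach (var bytes in inputs)
{
    using (var ms = new MemoryStream(bytes))
    using (var zip = new ZipArchive(ms, ZipArchiveMode.Read))
    {
        // process zip.Entries here - ideally inside an activity function,
        // since large inputs are replayed with the orchestration history
    }
}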
I am new to GCP. I am able to get one file into GCS from my VM and then transfer it to BigQuery.
How do I transfer multiple files from GCS to BigQuery? I know a wildcard URI is the solution, but what other changes are also needed in the code below?
def hello_gcs(event, context):
    from google.cloud import bigquery
    # Construct a BigQuery client object.
    client = bigquery.Client()
    # TODO(developer): Set table_id to the ID of the table to create.
    table_id = "test_project.test_dataset.test_Table"
    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        skip_leading_rows=1,
        # The source format defaults to CSV, so the line below is optional.
        source_format=bigquery.SourceFormat.CSV,
    )
    uri = "gs://test_bucket/*.csv"
    load_job = client.load_table_from_uri(
        uri, table_id, job_config=job_config
    )  # Make an API request.
    load_job.result()  # Waits for the job to complete.
    destination_table = client.get_table(table_id)  # Make an API request.
    print(f"Processing file: {event['name']}.")  # name of the file that triggered the function
As there could be multiple uploads, I cannot hard-code a specific table name or file name. Is it possible to do this task automatically?
This function is triggered by Pub/Sub whenever there is a new file in the GCS bucket.
Thanks
To transfer multiple files from GCS to BigQuery, you can simply loop through all the files. A sample of the working code with comments is below.
I believe event and context (the function arguments) are handled by Google Cloud Functions by default, so there is no need to modify that part. Alternatively, you can simplify the code by leveraging event instead of a loop.
def hello_gcs(event, context):
    import re
    from google.cloud import storage
    from google.cloud import bigquery
    from google.cloud.exceptions import NotFound
    bq_client = bigquery.Client()
    bucket = storage.Client().bucket("bucket-name")
    for blob in bucket.list_blobs(prefix="folder-name/"):
        if ".csv" in blob.name:  # Check for csv blobs, as list_blobs also returns the folder name
            job_config = bigquery.LoadJobConfig(
                autodetect=True,
                skip_leading_rows=1,
                source_format=bigquery.SourceFormat.CSV,
            )
            csv_filename = re.findall(r".*/(.*).csv", blob.name)  # Extract the file name for BQ's table id
            bq_table_id = "project-name.dataset-name." + csv_filename[0]  # Determine the table name
            try:  # Check if the table already exists and skip uploading it.
                bq_client.get_table(bq_table_id)
                print("Table {} already exists. Not uploaded.".format(bq_table_id))
            except NotFound:  # If the table is not found, upload it.
                uri = "gs://bucket-name/" + blob.name
                print(uri)
                load_job = bq_client.load_table_from_uri(
                    uri, bq_table_id, job_config=job_config
                )  # Make an API request.
                load_job.result()  # Waits for the job to complete.
                destination_table = bq_client.get_table(bq_table_id)  # Make an API request.
                print("Table {} uploaded.".format(bq_table_id))
Correct me if I am wrong: I understand that your cloud function is triggered by a finalize event (Google Cloud Storage Triggers) when a new file (or object) appears in a storage bucket. That means there is one event for each "new" object in the bucket, and thus at least one invocation of the cloud function for every object.
The link above has an example of data which comes in the event dictionary. Plenty of information there including details of the object (file) to be loaded.
You might like to have some configuration with mapping between a file name pattern and a target BigQuery table for data loading, for example. Using that map you will be able to make a decision on which table should be used for loading. Or you may have some other mechanism for choosing the target table.
Some other things to think about:
Exception handling: what are you going to do with the file if the data is not loaded (for any reason)? Who is to be informed, and how? What is to be done to (correct the source data or the target table and) repeat the loading, etc.?
What happens if the loading takes more time than the cloud function timeout (maximum 540 seconds at the present moment)?
What happens if there is more than one cloud function invocation from one finalize event, or from different events but from semantically the same source file (repeated data, duplications, etc.)?
No need to answer me here; just think about such cases if you have not done so yet.
If your data source is GCS and your destination is BigQuery, you can use the BigQuery Data Transfer Service to ETL your data into BigQuery. Every transfer job targets a certain table, and you can select whether you want to append or overwrite data in that table, with streaming mode.
You can schedule this job as well: daily, weekly, etc.
To load multiple GCS files onto multiple BQ tables on a single Cloud Function invocation, you’d need to list those files and then iterate over them, creating a load job for each file, just as you have done for one. But doing all that work inside a single function call, kind of breaks the purpose of using Cloud Functions.
If your requirements do not force you to do so, you can leverage the power of Cloud Functions and let a single CF be triggered by each of those files once they are added to the bucket as it is an event driven function. Please refer https://cloud.google.com/functions/docs/writing/background#cloud-storage-example. It would be triggered every time there is a specified activity, for which there would be event metadata.
So, in your application, rather than taking the entire bucket contents in the URI, we can take the name of the file which triggered the event and load only that file into a BigQuery table, as shown in the code sample below.
Here is how you can resolve the issue. Try the following changes in your code.
You can extract the details about the event, and about the file which triggered it, from the cloud function event dictionary. In your case, we can get the file name as event['name'] and update the uri variable.
Generate a new unique table_id (here, as an example, the table_id is the same as the file name). You can use other schemes to generate unique table names as required.
Refer to the code below:
def hello_gcs(event, context):
    from google.cloud import bigquery
    client = bigquery.Client()  # Construct a BigQuery client object.
    print(f"Processing file: {event['name']}.")  # name of the file which triggered the function
    if ".csv" in event['name']:
        # bq job config
        job_config = bigquery.LoadJobConfig(
            autodetect=True,
            skip_leading_rows=1,
            source_format=bigquery.SourceFormat.CSV,
        )
        file_name = event['name'].split('.')
        table_id = "<project_id>.<dataset_name>." + file_name[0]  # generating a new id for each table
        uri = "gs://<bucket_name>/" + event['name']
        load_job = client.load_table_from_uri(
            uri, table_id, job_config=job_config
        )  # Make an API request.
        load_job.result()  # Waits for the job to complete.
        destination_table = client.get_table(table_id)  # Make an API request.
        print("Table {} uploaded.".format(table_id))
Azure Function utilising Azure Table Storage
I have an Azure Function which is triggered from Azure Service Bus topic subscription, let's call it "Process File Info" function.
The message on the subscription contains file information to be processed. Something similar to this:
{
"uniqueFileId": "adjsdakajksajkskjdasd",
"fileName":"mydocument.docx",
"sourceSystemRef":"System1",
"sizeBytes": 1024,
... and other data
}
The function carries out the following two operations:
Check the individual file storage table for the existence of the file. If it exists, update that file. If it's new, add the file to the storage table (stored on a per system | per fileId basis).
Capture metrics on the file size in bytes and store them in a second storage table, called metrics (constantly incrementing the bytes, stored on a per system | per year/month basis).
The following diagram gives a brief summary of my approach:
The difference between the individualFileInfo table and the fileMetric table is that the individual table has one record per file, whereas the metric table stores one record per month that is constantly updated (incremented), gathering the total bytes that pass through the function.
Data in the fileMetrics table is stored as follows:
The issue...
Azure Functions are brilliant at scaling; in my setup I have a maximum of 6 of these functions running at any one time. Presuming each file message getting processed is unique, updating (or inserting) the record in the individualFileInfo table works fine as there are no race conditions.
However, updating the fileMetric table is proving problematic: say all 6 functions fire at once, they all try to update the metrics table at the same time (constantly incrementing the new file counter or incrementing the existing file counter).
I have tried using the ETag for optimistic updates, along with a little bit of recursion to retry should a 412 response come back from the storage update (code sample below). But I can't seem to avoid this race condition. Has anyone any suggestions on how to work around this constraint, or has anyone come up against something similar before?
Sample code that is executed in the function for storing the fileMetric update:
internal static async Task UpdateMetricEntry(IAzureTableStorageService auditTableService,
string sourceSystemReference, long addNewBytes, long addIncrementBytes, int retryDepth = 0)
{
const int maxRetryDepth = 3; // only recursively attempt a max of 3 times
var todayYearMonth = DateTime.Now.ToString("yyyyMM");
try
{
// Attempt to get existing record from table storage.
var result = await auditTableService.GetRecord<VolumeMetric>("VolumeMetrics", sourceSystemReference, todayYearMonth);
// If the volume metrics table existing in storage - add or edit the records as required.
if (result.TableExists)
{
VolumeMetric volumeMetric = result.RecordExists ?
// Existing metric record.
(VolumeMetric)result.Record.Clone()
:
// Brand new metrics record.
new VolumeMetric
{
PartitionKey = sourceSystemReference,
RowKey = todayYearMonth,
SourceSystemReference = sourceSystemReference,
BillingMonth = DateTime.Now.Month,
BillingYear = DateTime.Now.Year,
ETag = "*"
};
volumeMetric.NewVolumeBytes += addNewBytes;
volumeMetric.IncrementalVolumeBytes += addIncrementBytes;
await auditTableService.InsertOrReplace("VolumeMetrics", volumeMetric);
}
}
catch (StorageException ex)
{
if (ex.RequestInformation.HttpStatusCode == 412)
{
// Retry to update the volume metrics.
if (retryDepth < maxRetryDepth)
await UpdateMetricEntry(auditTableService, sourceSystemReference, addNewBytes, addIncrementBytes, retryDepth + 1); // increase the depth for this retry attempt
}
else
throw;
}
}
The ETag keeps track of conflicts, and if this code gets a 412 HTTP response it will retry, up to a max of 3 times (an attempt to mitigate the issue). My issue here is that I cannot guarantee the updates to table storage across all instances of the function.
Thanks for any tips in advance!
You can put the second part of the work into a second queue and function, and maybe even put a trigger on the file updates.
Since the other operation sounds like it might take most of the time anyway, it could also remove some of the heat from the second step.
You can then solve any remaining race conditions by focusing only on that function. You can use sessions to limit the concurrency effectively. In your case, the source system id could be a possible session key. If you use that, you will only have one Azure Function instance processing data from one system at a time, effectively solving your race conditions.
https://dev.to/azure/ordered-queue-processing-in-azure-functions-4h6c
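For illustration, enabling sessions is mostly a trigger setting plus a SessionId on the sender; the topic, subscription and connection names below are placeholders.

// Sketch: session-enabled Service Bus trigger so messages that share a SessionId
// (e.g. the source system id) are processed one at a time per session.
[FunctionName("ProcessFileInfo")]
public static void Run(
    [ServiceBusTrigger("file-topic", "file-subscription",
        Connection = "ServiceBusConnection", IsSessionsEnabled = true)] string message,
    ILogger log)
{
    // ... existing parsing + UpdateMetricEntry call; no two messages from the same
    // source system are processed concurrently, so the metric increment no longer races.
}

On the sending side, each message needs SessionId set to the source system id, and the subscription must be created with sessions enabled.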
Edit: If you can't use Sessions to logically lock the resource, you can use locks via blob storage:
https://www.azurefromthetrenches.com/acquiring-locks-on-table-storage/
I currently have a timer-triggered Azure Function that checks a data endpoint to determine whether any new data has been added. If new data has been added, I generate an output blob (which I return).
However, returning output appears to be mandatory. Whereas I'd only like to generate an output blob under specific conditions, I must do it every time, clogging up my storage.
Is there any way to generate output only under specified conditions?
If you have the blob output binding set to your return value, but you do not want to generate a blob, simply return null to ensure the blob is not created.
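A minimal sketch of that pattern; the container path, schedule and the data-check helper are illustrative, not from the question.

[FunctionName("CheckEndpoint")]
[return: Blob("output-container/{DateTime}.json", FileAccess.Write, Connection = "AzureWebJobsStorage")]
public static async Task<string> Run([TimerTrigger("0 */5 * * * *")] TimerInfo timer, ILogger log)
{
    string payload = await CheckEndpointForNewDataAsync(); // hypothetical helper; returns null when nothing is new
    // Returning null makes the output binding a no-op, so no blob is created.
    return payload;
}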
You're free to execute whatever logic you want in your functions. You may need to remove the output binding from your function (this is what makes the output required) and construct the connection to blob storage inside the function instead. Then you can conditionally create and save the blob.
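A sketch of that variant with Azure.Storage.Blobs; the container name, schedule and the helper are again illustrative.

[FunctionName("CheckEndpoint")]
public static async Task Run([TimerTrigger("0 */5 * * * *")] TimerInfo timer, ILogger log)
{
    string payload = await CheckEndpointForNewDataAsync(); // hypothetical helper; null means nothing new
    if (payload == null)
        return; // no binding involved, so nothing is written at all

    var container = new BlobContainerClient(
        Environment.GetEnvironmentVariable("AzureWebJobsStorage"), "output-container");
    await container.CreateIfNotExistsAsync();
    await container.UploadBlobAsync($"{DateTime.UtcNow:yyyyMMddHHmmss}.json", BinaryData.FromString(payload));
}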
I'm making a program that stores data from CSV files in Azure tables and reads it back. The CSV files can have a varying number of columns and between 3k and 50k rows. What I need to do is upload that data into an Azure table. So far I have managed to both upload data and retrieve it.
I'm using the REST API, and for uploading I'm creating an XML batch request with 100 rows per request. That works fine, except it takes a bit too long to upload; for example, 3k rows take around 30 seconds. Is there any way to speed that up? I noticed that most of the time is spent processing the response (in the ReadToEnd() call). I read somewhere that setting the proxy to null could help, but it doesn't do much in my case.
I also found somewhere that it is possible to upload the whole XML request to a blob and then execute it from there, but I couldn't find any example of doing that.
using (Stream requestStream = request.GetRequestStream())
{
requestStream.Write(content, 0, content.Length);
}
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
Stream dataStream = response.GetResponseStream();
using (var reader = new StreamReader(dataStream))
{
String responseFromServer = reader.ReadToEnd();
}
}
As for retrieving data from Azure tables, I managed to get 1000 entities per request; a CSV with 3k rows takes me around 9 seconds. Again, most of the time is spent reading from the stream, when I'm calling this part of the code (again ReadToEnd()):
response = request.GetResponse() as HttpWebResponse;
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
string result = reader.ReadToEnd();
}
Any tips?
As you mention, you are using the REST API, so you have to write extra code and rely on your own methods to implement performance improvements, which is quite different from using the client library. In your case, using the Storage Client Library would be best, as you can use its already built features to expedite inserts, upserts, etc., as described here.
However, if you were using the Storage Client Library and ADO.NET, you can use the article below, written by the Windows Azure Table team, as the supported way to improve Azure table access performance (a rough sketch with the client library follows the link):
.NET and ADO.NET Data Service Performance Tips for Windows Azure Tables
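For illustration, a rough sketch of the batched client-library approach; the entity type, table name and the rows variable are placeholders, and the ServicePointManager tweaks are commonly recommended settings for table storage traffic.

// One-time settings at startup.
ServicePointManager.Expect100Continue = false;
ServicePointManager.UseNagleAlgorithm = false;
ServicePointManager.DefaultConnectionLimit = 100;

CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
CloudTable table = account.CreateCloudTableClient().GetTableReference("CsvRows");
await table.CreateIfNotExistsAsync();

// Entity group transaction: up to 100 entities per request,
// and all entities in one batch must share the same PartitionKey.
var batch = new TableBatchOperation();
foreach (CsvRowEntity row in rows.Take(100))
{
    batch.InsertOrReplace(row);
}
await table.ExecuteBatchAsync(batch);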