ADF Copy Pipeline not reading/writing all rows

I have a simple copy pipeline that reads from a CSV file and writes to an Azure SQL database. The pipeline finishes with no errors. On inspection, however, I can see that only 107,506 of the 129,601 rows are actually being read/written. The giveaway that this was not working is that the last row written to the table was incomplete.
The format of the file looks to be intact and formatted correctly. What is causing this issue (timeout issue?), and what options do I have to detect this in the future?
**** EDIT
It appears the issue is further up my pipeline. I am using a Logic App to check for new CSV files on an FTP site and then copy these files to Azure Blob Storage. On two occasions, the Logic App's write has cut off the file at 10 MB. This appears to be new, as prior to 5/1 I was often exceeding 10 MB. Is this a change in Azure?

Check the settings in the copy activity. I suspect 'Skip incompatible rows' is selected in the Fault tolerance option.
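If that is the case, the copy activity JSON will contain fault tolerance settings along these lines (a rough sketch only; the linked service name and path are placeholders and the rest of the typeProperties block is omitted):

"typeProperties": {
    ...,
    "enableSkipIncompatibleRow": true,
    "redirectIncompatibleRowSettings": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "path": "errorcontainer/copyactivity-errorlogs"
    }
}

To detect this in the future, compare the rows read and rows written counts in the copy activity's monitoring output after each run, and use redirect settings like the above so skipped rows are logged to blob storage instead of being silently dropped.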

Related

Copy files from AWS s3 sub folder to Azure Blob

I am trying to copy files out of an S3 bucket using Azure Data Factory. Firstly I want a list of the directories.
Using the CLI I would use aws s3 ls.
From there I can iterate over the list in a ForEach and push each value into a variable.
In ADF, I have tried to use Get Metadata, and although this works in theory, in practice there are 76 files in each directory and the loop is over 1.5 million iterations. This just isn't worth it; it takes far too long, especially as listing only the directories takes about 20 seconds for 20,000 directories.
Is there a method to do this listing? When creating the dataset at the bucket root we get a no-permissions error, however when we use a specific location it works.
Many thanks
I have found another way of completing this task.
So to begin with I am using Get Metadata with the Child items option. It produces an array.
I push this into a string variable. With this variable you can then create a stored procedure to pick it apart, using OPENJSON to get just the value. This can then be pulled apart further to get the directory names (a sketch follows below).
I then merge these into a table.
Using a Lookup I can then run another stored procedure to return the value I require from the table. This whole process runs in a couple of minutes.
Anyone who wants a further explanation, please ask; I will try and create a walkthrough to assist.
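A rough sketch of the stored procedure step described above, assuming the Get Metadata childItems array has been pushed into a string parameter (the procedure and table names are illustrative, not the original ones):

CREATE PROCEDURE dbo.ParseChildItems @childItems NVARCHAR(MAX)
AS
BEGIN
    -- Each element of childItems looks like {"name":"<directory>","type":"Folder"}
    INSERT INTO dbo.DirectoryList (DirectoryName)
    SELECT JSON_VALUE([value], '$.name')
    FROM OPENJSON(@childItems)
    WHERE JSON_VALUE([value], '$.type') = 'Folder';
END

A Lookup activity can then call a second procedure (or a plain SELECT) against dbo.DirectoryList to return the directory names needed downstream.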

How to ignore files with invalid schema Azure Data Factory?

I am new to Azure Data Factory. Currently, I am trying to copy some data from files in blob storage to Azure Data Explorer.
The pipeline reads multiple files from a specific cosmos root directory and copies the data to ADE. Currently, the pipeline successfully handles the copy activities. I now want to add a validation component so that the pipeline only reads files with the correct schema and ignores the rest of the files.
The schema is as follows: Name, Type, Value, Date.
The pipeline itself should not fail/stop if the files are of an invalid type or schema; it should just skip them and continue on with the rest of the files.
What component can I use for this validation?
Thanks!
You can use a ForEach activity to achieve this requirement. The ForEach activity (in this case) can be used to copy each source file into the destination. In case the schema doesn't match, the copy activity usually fails.
But when a copy activity inside a ForEach fails, the pipeline execution does not fail/stop. It throws an error for that particular iteration, but it continues executing the copy activity for the remaining files.
Look at the following demonstration for a clearer understanding.
The following are the files I am using for this sample. The file sample_3.csv does not match the schema of my destination table.
I used a Get Metadata activity to get the file names of the above files using the Child items field list. I passed these values to the ForEach as @activity('Get Metadata1').output.childItems.
Inside the ForEach, I have a copy activity along with a Wait activity (the Wait executes only if the copy activity fails); a sketch of this setup follows below.
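A minimal sketch of how this ForEach could look in the pipeline JSON, assuming the activities are named Get Metadata1, Copy data1 and Wait1 (the names and the omitted source/sink details are illustrative):

"activities": [
    {
        "name": "ForEach1",
        "type": "ForEach",
        "typeProperties": {
            "items": {
                "value": "@activity('Get Metadata1').output.childItems",
                "type": "Expression"
            },
            "activities": [
                { "name": "Copy data1", "type": "Copy" },
                {
                    "name": "Wait1",
                    "type": "Wait",
                    "dependsOn": [
                        { "activity": "Copy data1", "dependencyConditions": [ "Failed" ] }
                    ],
                    "typeProperties": { "waitTimeInSeconds": 1 }
                }
            ]
        }
    }
]

The Wait on the Failed dependency is the pattern this answer relies on so that an iteration with a schema mismatch is handled without failing the whole pipeline.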
When I execute this pipeline, it completes successfully without any errors. The following is the debug output.
The third iteration of copy activity fails as sample_3.csv has incorrect schema. And yet the pipeline runs successfully without failing.
UPDATE:
Mapping need not be specified in the copy activity.

NPOI sometimes only reading 10 lines from a spreadsheet

PROBLEM: I've hit a troubleshooting wall and hoping for suggestions on what to check to get past an issue I'm having with an internet site I'm working on. When reading data from a spreadsheet using NPOI (C#), sometimes (not all the time) the row reading stops after just ten rows.
Sorry for the very long post but not sure what is/isn't useful. Primary reason for posting here is that I don't know the right question to ask the Great Google Machine.
I have an intranet site where I'm reading an XLSX file and pushing its contents into an Oracle table. As you can tell by the subject line, I'm using NPOI. For the most part, it's just working, but only sometimes...
In Oracle, I have a staging table, which is truncated and is supposed to be filled with data from the spreadsheet.
In my app (ASPX), users upload their spreadsheet to the server (this just works), then the app calls a WebMethod that truncates data from the Oracle staging table (this just works), then another WebMethod is called that is supposed to read data from the spreadsheet and load the staging table (this, kinda works).
It's this "kinda works" piece that I need help with.
The spreadsheet has 170 data rows. When I run the app in VS, it reads/writes all 170 records most of the time, but sometimes it reads just 10 records. When I run the app from the web server, the first time it fails (I haven't been able to catch a specific error); the second and subsequent times, it reads just ten records from the spreadsheet and successfully loads all ten. I've checked the file uploaded to the server and it does have 170 data records.
Whether the process reads 10 records or 170 records, there are no error messages and no indication why it stopped reading after just ten. (I'll mention here that the file today has 170 rows but tomorrow could have 180 or 162, so the count is not fixed.)
So, I've described what it's supposed to do and what it's actually doing. I think it's time for code snippet.
/* snowSource below is the path/filename assembled separately */
/* SnowExcelFormat below is a class that basically maps row data with a specific data class */
IWorkbook workbook;
try
{
    using (FileStream file = new FileStream(snowSource, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        workbook = WorkbookFactory.Create(file);
    }
    var importer = new Mapper(workbook);
    var items = importer.Take<SnowExcelFormat>(0);
    /* at this point, items should have 170 rows but sometimes it contains only 10 with no indication why */
    /* I don't see anything in the workbook or importer objects that sheds any light on what's happening. */
    /* rest of the method (staging-table load and the catch block) omitted from this snippet */
Again, this works perfectly fine most of the time when running from VS. That tells me this is workable code. When running this on the web server, it fails the first time I try the process, but subsequently it runs while only picking up the first 10 records and ignoring the rest. Also, all the data that's read (10 or 170 rows) is successfully inserted into the staging table, which tells me that Oracle is perfectly okay with the data, its format, and this process. All I need is to figure out why my code doesn't read all the data from Excel.
I have verified numerous times that the local DLL and webserver DLL are the same. And I'm reading the same Excel file.
I'm hitting a serious wall here and have run out of ideas on how to troubleshoot where the code is failing, when it fails. I don't know if there's something limiting memory available to the FileStream object causing it to stop reading the file prematurely - and didn't run across anything that looked like a resource limiter. I don't know if there's something limiting the number of rows pulled by the importer.Take method. Any suggestions would be appreciated.
I faced the same issue on some files and after analyzing the problem, this is what worked for me.
importer.Take<SnowExcelFormat>(0) has three parameters, and one of them is maxErrorRows. Its default value is 10.
Your parsing stops when it hits more than 10 error rows; at that point the function stops reading.
What you have to do is set maxErrorRows explicitly instead of taking the default value of 10, for example:
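A minimal sketch of the fix, assuming the Npoi.Mapper Take overload described in this answer (the parameter name comes from the answer above; check the exact signature for your package version):

var importer = new Mapper(workbook);
/* raise maxErrorRows so parsing does not stop after the default 10 error rows */
var items = importer.Take<SnowExcelFormat>(0, maxErrorRows: int.MaxValue);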

Filter recent files in Logic Apps' SFTP when files are added/modified trigger

I have this Logic App that connects to an SFTP server and it's triggered by the "files are added or modified" trigger. It's set to run every 10 minutes, looking for new/modified files and copying them to an Azure storage account.
The problem is that this SFTP server path is set to overwrite a set of files every X minutes (I have no control over this) and so, pretty often the Logic App overlaps with the update process of these files and downloads files that are still being written. The result is corrupted files.
Is there a way to add a filter to the "When files are added or modified (properties only)" trigger so that it only takes into consideration files with a modified date that is at least 1 minute old?
That way, files that are currently being written won't be added to the list of files to download. The next run of the Logic App would then fetch these ignored files, and so on.
UPDATE
I've found a Trigger Conditions option in the trigger's settings, but I can't find any documentation about it.
Based on testing the trigger "When files are added or modified", it seems we cannot add a filter inside the trigger itself to only pick up records modified at least 1 minute ago. We can instead get the LastModified datetime of the listed files, loop over them, and use an "If" condition to judge whether we should download each one.
Update:
The expression in the screenshot is:
sub(ticks(utcNow()), ticks(triggerBody()?['LastModified']))
Update workaround
Is it possible to add a "Delay" action when the last modified time is less than 1 minute old? For example, if the last modified time is less than 60 seconds ago, use "Delay" to wait 5 minutes until the overwrite operation completes, then do the download.
I checked the sample @equals(triggers().code, 'InternalServerError'); it uses the condition functions from the logical comparison functions, so the key is to make sure the property you want to filter on is available in the trigger or triggerBody, or you will get the error below.
So I changed the expression to something like @greater(triggerBody().LastModified,'2020-04-20T11:23:00Z'); with this, a file modified earlier than 2020-04-20T11:23:00Z does not trigger the flow.
You could also use other functions like less, greaterOrEquals, etc. from the logical comparison functions.
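For the original requirement (only pick up files whose LastModified is at least 1 minute in the past), a trigger condition along these lines should work; it combines the ticks expression above with greaterOrEquals, where 600000000 ticks equals 60 seconds. Treat it as a sketch to verify against your own trigger payload:

@greaterOrEquals(sub(ticks(utcNow()), ticks(triggerBody()?['LastModified'])), 600000000)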

Using Logic Apps to get specific files from all sub(sub)folders, load them to SQL-Azure

I'm quite new to Data Factory and Logic Apps (but I have many years of experience with SSIS).
I succeeded in loading a folder with 100 text files into SQL Azure with Data Factory,
but the files themselves are left untouched.
Now, another requirement is that I loop through the folders to get all files with a certain file extension.
In the end I should move (= copy & delete) all the files from the 'To_be_processed' folder to the 'Processed' folder.
I can not find where to put 'wildcards' and such:
for example, get all files with file extensions .001, 002, 003, 004, 005, ...until..., 996, 997, 998, 999 (a thousand files),
--> also searching in the subfolders.
Is it possible to call a Data Factory from within a Logic App? (although this seems unnecessary)
Please find some more detailed information in this screenshot:
Thanks in advance for helping me out exploring this new technology!
Interesting situation.
I agree that using Logic Apps just for this additional layer of file handling seems unnecessary, but Azure Data Factory may currently be unable to deal with exactly what you need...
In terms of adding wild cards to your Azure Data Factory datasets you have 3 attributes available within the JSON type properties block, as follows.
Folder Path - to specify the directory. Which can work with a partition by clause for a time slice start and end. Required.
File Name - to specify the file. Which again can work with a partition by clause for a time slice start and end. Not required.
File Filter - this is where wildcards can be used for single and multiple characters. (*) for multi and (?) for single. Not required.
More info here: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-onprem-file-system-connector
I have to say that separately none of the above are ideal for what you require and I've already fed back to Microsoft that we need a more flexible attribute that combines the 3 above values into 1, allowing wildcards in various places and a partition by condition that works with more than just date time values.
That said, try something like the below.

"typeProperties": {
    "folderPath": "TO_BE_PROCESSED",
    "fileFilter": "17-SKO-??-MD1.*" // looks like the 2 middle values in the image above
}
On a side note, there is already a Microsoft feedback item that's been raised for a file move activity, which is currently under review.
See here: https://feedback.azure.com/forums/270578-data-factory/suggestions/13427742-move-activity
Hope this helps
We have used a C# application which we call through App Services -> WebJobs.
It is much easier to iterate through folders that way. To load SQL we used SQL bulk insert; a sketch of the approach follows below.
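A minimal sketch of that kind of WebJob code, assuming plain text files with the numeric .001 - .999 extensions are bulk-loaded into a staging table dbo.Staging with FileName and Line columns (all names, the connection string and the one-line-per-row parsing are illustrative assumptions, not the answerer's actual code):

using System.Data;
using System.Data.SqlClient;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

static void LoadFolder(string rootFolder, string connectionString)
{
    // Recurse through all subfolders and pick up the numeric extensions .001 - .999
    var files = Directory.EnumerateFiles(rootFolder, "*.*", SearchOption.AllDirectories)
        .Where(f => Regex.IsMatch(Path.GetExtension(f), @"^\.\d{3}$"));

    // Stage everything as FileName + raw line; adapt the parsing to the real file format
    var table = new DataTable();
    table.Columns.Add("FileName", typeof(string));
    table.Columns.Add("Line", typeof(string));

    foreach (var file in files)
        foreach (var line in File.ReadLines(file))
            table.Rows.Add(Path.GetFileName(file), line);

    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = "dbo.Staging";
            bulkCopy.WriteToServer(table);
        }
    }
}

After the load succeeds, the same job can move each processed file with File.Move (copy to the 'Processed' folder and delete the original), which covers the move requirement from the question.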
