I can't ignore existing files when using Delta Live Tables - databricks

I created a DLT pipeline targeting a terabyte-scale directory with the file notifications option turned on. I set "cloudFiles.includeExistingFiles": false to ignore existing files and ingest data starting from the first run.
What I expect to happen is that on the first run (t0) no data is ingested, while on the second run (t1) the data that arrived between t0 and t1 is ingested. I also expect the first run to complete almost instantly, and since I am using file notifications, I expect the second run to complete pretty fast as well.
I started the first run and it has now been running for 7 hours :) No data has been ingested, as I expected, but I have no idea what the pipeline is doing right now. I guess it is doing something with the existing files even though I explicitly stated that I want to ignore them.
Any ideas why the behavior I expected isn't happening?
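For reference, here is a minimal sketch of the kind of DLT table definition described above; the source path, table name, and file format are hypothetical stand-ins, while the two cloudFiles options are the ones from the question:

import dlt
from pyspark.sql import functions as F

@dlt.table(name="raw_events", comment="Ingest only files that arrive after the first run")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")
        # Options from the question: use file notifications and skip files
        # that already exist at the time of the first run.
        .option("cloudFiles.format", "json")
        .option("cloudFiles.useNotifications", "true")
        .option("cloudFiles.includeExistingFiles", "false")
        # Hypothetical path standing in for the terabyte-scale directory.
        .load("abfss://logs@mystorageaccount.dfs.core.windows.net/events/")
        .withColumn("ingest_time", F.current_timestamp())
    )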

Related

Is there a way to stop Azure ML throwing an error when exporting zero lines of data?

I am currently developing an Azure ML pipeline that, as one of its outputs, maintains a SQL table holding all of the unique items that are fed into it. There is no way to know in advance whether the data fed into the pipeline contains new unique items or repeats of previous items, so before updating the table it pulls the data already in that table and drops any of the new items that already appear.
However, there are cases where this self-reference results in zero new items being found, and as such there is nothing to export to the SQL table. When this happens Azure ML throws an error, as it considers having zero lines of data to export to be an error. In my case, however, this is expected behaviour and absolutely fine.
Is there any way for me to suppress this error, so that when it has zero lines of data to export it just skips the export module and moves on?
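As a rough illustration of the step described above, here is a minimal pandas sketch of the dedup-then-export logic with a guard for the zero-row case; the table name, key column, and connection string are hypothetical:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string and table/column names.
engine = create_engine("mssql+pyodbc://user:pass@myserver/mydb?driver=ODBC+Driver+17+for+SQL+Server")

def export_new_items(incoming: pd.DataFrame) -> None:
    # Pull the items already in the table and drop repeats from the new batch.
    existing = pd.read_sql("SELECT item_id FROM unique_items", engine)
    new_items = incoming[~incoming["item_id"].isin(existing["item_id"])]

    # Zero new rows is expected behaviour, so skip the export instead of failing.
    if new_items.empty:
        print("No new items; skipping export.")
        return

    new_items.to_sql("unique_items", engine, if_exists="append", index=False)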
It sounds as if you are struggling to orchestrate a data pipeline because orchestration is happening in two places. My advice would be to either move more of the orchestration into Azure ML, or make the separation between the two greater. One way to do this would be to have a regular export to blob of the table you want to use for training. Then you can use a Logic App to trigger a pipeline whenever a non-empty blob lands in that location.
This issue has been resolved by an update to Azure Machine Learning: you can now run pipelines with a "Continue on Failure Step" flag set, which means that steps following the failed data export will continue to run.
This does mean you will need to design your pipeline so that its downstream modules can handle upstream failures; this must be done very carefully.

CSV Playback with Node Red

Disclaimer - I am not a software guy so please bear with me while I learn.
I am looking to use Node-RED as a parser/translator by taking data from a CSV file and sending out the rows of data at 1 Hz. Let's say 5-10 rows of data being read and published per second.
Eventually, I will publish that data to some Modbus registers but I'm not there yet.
I have scoured the web and tried several examples; however, as soon as I trigger the flow, Node-RED stops responding and I have to delete the source CSV (so the flow can't run any more) and restart Node-RED to get it back up and running.
I have many of the Big Nodes from this guy installed and have tried a variety of different methods, but I just can't seem to get it working.
If I can get a single column of data from a CSV file being sent out one row at a time, I think that would keep me busy for a bit.
There is a file node that will read a file one line at a time; you can then feed this through the csv node to parse the fields of each line into an object you can work with.
The delay node has a rate-limiting function that can be used to limit the flow to one message per second, which gives you the rate you want.
All the nodes I've mentioned should be in the core set that ships with Node-RED.
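For what it's worth, here is the same playback logic spelled out in a small Python sketch outside of Node-RED, just to make the idea concrete: read the CSV one row at a time and emit one row per second. The file name and column are hypothetical, and the print call stands in for publishing to Modbus or MQTT:

import csv
import time

# Hypothetical input file; replace with the real CSV path.
with open("playback.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Stand-in for publishing the row (e.g. to Modbus registers).
        print(row["value"])
        # Rate-limit to one row per second, like the delay node's rate limit.
        time.sleep(1.0)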

Azure Data Factory prohibit Pipeline Double Run

I know it might be a bit of a confusing title, but I couldn't come up with anything better.
The problem ...
I have an ADF pipeline with 3 activities: first a Copy to a DB, then two stored procedure activities. All are triggered daily and use WindowEnd to read the right directory or to pass a date to the SP.
There is no way I can get an import date into the XML files that we are receiving.
So I'm trying to add it in the first SP.
The problem is that once the first activity of the pipeline is done, the 2 others are started.
The 2nd activity in the same slice is the SP that adds the dates, but when history is being loaded the same pipeline also starts a copy for another slice at the same time.
So I'm getting mixed-up data.
As you can see in the 'Last Attempt Start' column of the ADF Monitoring view.
Does anybody have an idea how to avoid this?
In case somebody hits a similar problem..
I've solved the problem by working with daily named tables.
Each slice puts its data into a staging table with a _YYYYMMDD suffix, which can be set as "tableName": "$$Text.Format('[stg].[filesin_1_{0:yyyyMMdd}]', SliceEnd)".
So there is no longer any problem with parallelism.
The only disadvantage is that the SPs that run after this first step have to use dynamic SQL, since the table name they select from is variable.
But that wasn't a big coding problem.
Works like a charm !
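For illustration, a small Python sketch of the same naming scheme: derive the dated staging table name from the slice end, the way $$Text.Format does above, and build the dynamic SQL a downstream step would run. The column list and target table are hypothetical:

from datetime import datetime

def staging_table_name(slice_end: datetime) -> str:
    # Mirrors $$Text.Format('[stg].[filesin_1_{0:yyyyMMdd}]', SliceEnd)
    return f"[stg].[filesin_1_{slice_end:%Y%m%d}]"

def build_insert_sql(slice_end: datetime) -> str:
    # The table name varies per slice, so downstream steps need dynamic SQL.
    table = staging_table_name(slice_end)
    return (
        f"INSERT INTO [dbo].[filesin] (FileName, ImportDate) "
        f"SELECT FileName, '{slice_end:%Y-%m-%d}' FROM {table};"
    )

print(build_insert_sql(datetime(2017, 5, 1)))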

Incremental load in Azure Data Lake

I have a big blob storage full of log files organized according to their identifiers at a number of levels: repository, branch, build number, build step number.
These are JSON files that contain an array of objects, each object has a timestamp and an entry value. I've already implemented a custom extractor (extending IExtractor) that takes an input stream and produces a number of plain-text lines.
Initial load
Now I am trying to load all of that data to ADL Store. I created a query that looks similar to this:
@entries =
    EXTRACT
        repo string,
        branch string,
        build int,
        step int,
        Line int,
        Entry string
    FROM @"wasb://my.blob.core.windows.net/{repo}/{branch}/{build}/{step}.json"
    USING new MyJSONExtractor();
When I run this extraction query I get a compiler error - it exceeds the limit of 25 minutes of compilation time. My guess is: too many files. So I add a WHERE clause in the INSERT INTO query:
INSERT INTO Entries
    (Repo, Branch, Build, Step, Line, Entry)
SELECT * FROM @entries
WHERE (repo == "myRepo") AND (branch == "master");
Still no luck - compiler times out.
(It does work, however, when I process a single build, leaving {step} as the only wildcard, and hard-coding the rest of names.)
Question: Is there a way to perform a load like that in a number of jobs - but without the need to explicitly (manually) "partition" the list of input files?
Incremental load
Let's assume for a moment that I succeeded in loading those files. However, a few days from now I'll need to perform an update - how am I supposed to specify the list of files? I have a SQL Server database where all the metadata is kept, and I could extract exact log file paths - but U-SQL's EXTRACT query forces me to provide a static string that specifies the input data.
A straightforward scenario would be to define a top-level directory for each date and process them day by day. But the way the system is designed makes this very difficult, if not impossible.
Question: Is there a way to identify files by their creation time? Or maybe there is a way to combine a query to a SQL Server database with the extraction query?
For your first question: it sounds like your FileSet pattern is generating a very large number of input files. To deal with that, you may want to try the FileSets v2 preview, which is documented under the U-SQL Preview Features section in:
https://github.com/Azure/AzureDataLake/blob/master/docs/Release_Notes/2017/2017_04_24/USQL_Release_Notes_2017_04_24.md
Input File Set scales orders of magnitude better (opt-in statement is now provided)
Previously, U-SQL's file set pattern on EXTRACT expressions ran into compile-time time-outs around 800 to 5000 files.
U-SQL's file set pattern now scales to many more files and generates more efficient plans.
For example, a U-SQL script querying over 2500 files in our telemetry system previously took over 10 minutes to compile and now compiles in 1 minute, and the script now executes in 9 minutes instead of over 35 minutes, using a lot fewer AUs. We also have compiled scripts that access 30'000 files.
The preview feature can be turned on by adding the following statement to your script:
SET @@FeaturePreviews = "FileSetV2Dot5:on";
If you wanted to generate multiple extract statements based on partitions of your filepaths, you'd have to do it with some external code that generates one or more U-SQL scripts.
I don't have a good answer to your second question so I will get a colleague to respond. Hopefully the first part can get you unblocked for now.
To address your second question:
You could read your data from the SQL Server database using a federated query, and then use the information in a join with the virtual columns that you create from the fileset. The problem with that is that the values are only known at execution time and not at compile time, so you would not get the reduction in the accessed files.
Alternatively, you could write a SQL query that gets you the data you need and then parameterize your U-SQL script so you can pass that information into the U-SQL script.
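As a rough sketch of that second suggestion, external code could query the metadata database and stamp the values into the U-SQL script as DECLARE parameters before submitting it; the metadata table, column names, and script template below are hypothetical:

import pyodbc

# Hypothetical connection to the metadata database.
conn = pyodbc.connect("DSN=LogMetadata")
cursor = conn.cursor()
cursor.execute("SELECT DISTINCT repo, branch FROM builds WHERE created_at >= ?", "2017-05-01")

template = """
DECLARE @repo string = "{repo}";
DECLARE @branch string = "{branch}";

@entries =
    EXTRACT repo string, branch string, build int, step int, Line int, Entry string
    FROM @"wasb://my.blob.core.windows.net/{{repo}}/{{branch}}/{{build}}/{{step}}.json"
    USING new MyJSONExtractor();

INSERT INTO Entries
SELECT * FROM @entries
WHERE (repo == @repo) AND (branch == @branch);
"""

# Write one script per (repo, branch) partition; submit them with your usual tooling.
for repo, branch in cursor.fetchall():
    script = template.format(repo=repo, branch=branch)
    with open(f"load_{repo}_{branch}.usql", "w") as f:
        f.write(script)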
As to the ability to select files based on their creation time: this is a feature on our backlog. I would recommend upvoting and commenting on the following feature request to say that you also want to query on file properties over a fileset: https://feedback.azure.com/forums/327234-data-lake/suggestions/10948392-support-functionality-to-handle-file-properties-fr

Unloading Large Sql Anywhere Table

An old box running SQL Anywhere 9 has been brought back to life.
I need to retrieve the data from it to migrate to SQL Server, and then I can kill the old box again.
I ran an unload on 533 tables, which all ran fine. I have 1 table that does not unload.
I run dbunload from the command line, and since it worked for 533 tables, in theory it should work.
I don't see anything specifically wrong with this table. The unload runs but gives no errors (and no file is written either). I checked the event log and there are no errors there either. I just wonder how to diagnose the problem and find out what is wrong. One thing I do notice is that this is the only table without a primary key, but I don't know if that matters for unload.
The production version of this table contains 50,000,000 entries. The archive version only contains 15 million entries, but both versions refuse to be unloaded, or better said: they give no indication of what is wrong and simply don't generate an unload file.
I wonder if the size is the problem, maybe memory or temp disk space, or perhaps a time-out that is set somewhere.
I also wonder if there is an alternative.
P.S. The GUI (Sybase Central) just shows an endless spinning wheel that never finishes.
Update: I saw that dbunload works in the user's Local Settings/Temp directory while running. Possibly that is why it fails, or maybe there is some other place where it temporarily saves items.
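If dbunload keeps failing silently for this one table, one possible alternative is to pull the rows out in batches over ODBC and write the CSV yourself. A minimal Python sketch, assuming a working SQL Anywhere ODBC DSN and a hypothetical table name:

import csv
import pyodbc

# Hypothetical DSN pointing at the SQL Anywhere 9 database.
conn = pyodbc.connect("DSN=OldBoxASA9")
cursor = conn.cursor()
cursor.execute("SELECT * FROM big_table")

with open("big_table.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # Header row taken from the cursor metadata.
    writer.writerow([col[0] for col in cursor.description])
    # Fetch in modest batches so memory use stays flat even at 50 million rows.
    while True:
        rows = cursor.fetchmany(10000)
        if not rows:
            break
        writer.writerows(rows)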
