Delimited File with Varying Number of Rows Azure Data Factory - azure

I have a delimited file separated by hashes that looks somewhat like this,
value#value#value#value#value#value##value
value#value#value#value##value#####value#####value
value#value#value#value###value#value####value##value
As you can see, when split on hashes, the 2nd and 3rd rows have more columns than the first. I want to ingest this into a database using an ADF Data Flow after some transformations. However, whenever I try to do any kind of mapping, I only ever see 7 columns (the number of columns in the first row).
Is there any way to get all of the values, with as many columns as there are in the row with the most items? I do not mind the nulls.
Note: I do not have a header row for this.

Azure Data Factory cannot directly import the schema from the row with the maximum number of columns. Hence, it is important to make sure every row in your file has the same number of columns.
You can use Azure Functions to validate the file and update it so that all rows have an equal number of columns.
Alternatively, you could try keeping a local file that contains the row with the maximum number of columns and importing the schema from that file. Otherwise, you will have to use Azure Functions to convert the file and then trigger the pipeline.
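As a rough illustration of the "update the file so every row has the same number of columns" step, here is a minimal Python sketch. The function name `pad_rows` and the sample rows are made up for the example; it assumes the whole file fits in memory and that `#` never appears inside a value. The same logic could run inside an Azure Function before the pipeline is triggered.

```python
def pad_rows(lines, delimiter="#"):
    """Pad every row with trailing empty fields so all rows share one column count."""
    rows = [line.rstrip("\n").split(delimiter) for line in lines]
    # Width of the widest row; 0 if the input is empty.
    width = max((len(row) for row in rows), default=0)
    return [delimiter.join(row + [""] * (width - len(row))) for row in rows]


# Hypothetical sample data, not taken from the asker's file.
sample = [
    "a#b#c",
    "a#b##d#e",
    "a#b#c#d",
]
padded = pad_rows(sample)
```

After padding, the first row is as wide as every other row, so a schema inferred from it should cover all columns; the extra fields arrive as empty values.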

Related

Excel data tables: Multiple outputs with only one input column

I am trying to create a data table with multiple outputs across periods, but for the same scenarios.
Is it possible to create that without inserting an extra column between each output column to deliver input for the data table (i.e. input column = index 50-110)?
Is this in any way possible? See picture of what I would usually mark to create the data table (this does only cover one period/output though). But if I were to make the scenario for FY23, then I would need to insert a column between FY22 and FY23 where I copy the index 50-110 again. I would like to not have to do that.

Delete bottom two rows in Azure Data Flow

I would like to delete the bottom two rows of an Excel file in ADF, but I don't know how to do it.
The flow I am thinking of is this.
*I intend to filter -> delete the rows to be deleted in yellow.
The file has over 40,000 rows of data and is updated once a month. (The number of rows changes with each update, so the condition must be specified with a function.)
The contents of the file are also shown here.
The bottom two lines contain spaces and asterisks.
Any help would be appreciated.
I'm new to Azure and having trouble.
I need your help.
Add a surrogate key transformation to put a row number on each row. Add a new branch to duplicate the stream and in that new branch, add an aggregate.
Use the aggregate transformation to find the max() value of the surrogate key counter.
Then subtract 2 from that max number and filter for just the rows up to that max-2.
Let me provide a more detailed answer here ... I think I can get it in here without writing a separate blog.
The simplest way to filter out the final 2 rows is a pattern depicted in the screenshot here. Instead of the new branch, I just created 2 sources both pointing to the same data source. The 2nd stream is there just to get a row count and store it in a cached sink. For the aggregation expression I used this: "count(1)" as the row count aggregator.
In the first stream, that is the primary data processing stream, I add a Surrogate Key transformation so that I can have a row number for each row. I called my key column "sk".
Finally, set the Filter transformation to only allow rows with a row number <= the max row count from the cached sink minus 2.
The Filter expression looks like this: sk <= cachedSink#output().rowcount-2
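The logic of that pattern (row count from one stream, surrogate key on the other, then a filter) can be sketched outside ADF. The Python below is only an illustration of the idea, not Data Flow code; `drop_last_rows` and the sample list are hypothetical names.

```python
def drop_last_rows(rows, n=2):
    """Keep every row except the final n, mirroring the count + surrogate key + filter pattern."""
    row_count = len(rows)             # stands in for count(1) stored in the cached sink
    keyed = enumerate(rows, start=1)  # stands in for the surrogate key column "sk"
    # Equivalent of the filter: sk <= rowcount - n
    return [row for sk, row in keyed if sk <= row_count - n]


data = ["r1", "r2", "r3", "r4", "r5"]
kept = drop_last_rows(data)
```

Because the threshold is derived from the row count on each run, the filter keeps working when the monthly file changes length.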

Azure Data Factory Copy Activity is dropping columns on the floor

first time, long time.
I'm running an import of a csv file that has 734 columns in Azure Data Factory Copy Activity. Data factory is not reading the last 9 columns and is populating with NULL. Even in the preview I can see that the columns have no values but the schema for those columns is detected. Is there a limit of columns in Copy to 725?
As Joel said, there is no restriction at 725 or so columns. I suggest the following:
Go to the mapping tab and pick only the 726th column (if you have a header this will be easy; otherwise ADF will generate a header, most probably something like Prop_726), then copy the data to a blob as the sink. If the blob has the field, that means you have a data type issue on the table.
Let me know how it goes. If you are still facing the issue, please share some dummy data for the 726th column.
Here is what happened. I had the file in zip folders, and I thought I had to unzip the files first to process them. It turns out that when unzipping through ADF, it stripped the quotation marks from my columns, and then one of the columns had an escape character in it. That escape character shifted everything over, and resulted in me losing nine columns.
But I did learn a bunch of things NOT to do, so it wasn't a total waste of time. Thanks for the answers!
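For readers wondering how missing quotation marks can shift columns, here is a small Python sketch. The asker's actual data and escape character are not shown, so this uses an embedded comma as a stand-in; the sample line is invented.

```python
import csv
import io

# Hypothetical row: the second field contains the delimiter and relies on quotes.
quoted = 'id,"Smith, John",city\n'
# Simulate the quotes being stripped before parsing.
stripped = quoted.replace('"', "")

fields_quoted = next(csv.reader(io.StringIO(quoted)))
fields_stripped = next(csv.reader(io.StringIO(stripped)))

print(len(fields_quoted), fields_quoted)
print(len(fields_stripped), fields_stripped)
```

With the quotes intact the row parses into three fields; without them the same row parses into four, so every later value lands one column to the right.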

How to fetch the column count from dat file in Azure data lake analytics files

I have different DAT and CSV files. They contain more than 255 columns and use '|' and tab as delimiters. How can I fetch the column count? Could anyone please share sample U-SQL code?
I know this was down voted, so I hope it is still OK to supply an answer (although I'm not including a code sample).
Extract just the first row in your file (using FETCH 1 ROWS) into a single column rowset. You should then be able to use String.Split to get a column count.
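The same idea (read only the first row, split it, count the pieces) looks like this in Python. This is not U-SQL, just a sketch of the logic; `column_count` is a hypothetical helper and it assumes the delimiter never appears inside a quoted value.

```python
import io


def column_count(stream, delimiter="|"):
    """Return the number of delimited fields in the first line of a text stream."""
    first_line = stream.readline().rstrip("\r\n")
    # An empty file (or empty first line) is treated as having no columns.
    return len(first_line.split(delimiter)) if first_line else 0


# Hypothetical in-memory files standing in for the DAT and CSV inputs.
pipe_file = io.StringIO("a|b|c\n1|2|3\n")
tab_file = io.StringIO("a\tb\n1\t2\n")

pipe_columns = column_count(pipe_file)
tab_columns = column_count(tab_file, delimiter="\t")
```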

Custom parallel extractor - U-SQL

I am trying to create a custom parallel extractor, but I have no idea how to do it correctly. I have big files (more than 250 MB) where the data for each row is stored in 4 lines; each line of the file stores the data for one column. Is it possible to create a working parallel extractor for such large files? I am afraid that the data for one row will end up in different extents after the file is split.
Example:
...
Data for first row
Data for first row
Data for first row
Data for first row
Data for second row
Data for second row
Data for second row
Data for second row
...
Sorry for my English.
I think you can process this data using U-SQL sequentially, not in parallel. You have to write a custom applier that takes single or multiple rows and returns single or multiple rows, and then invoke it with CROSS APPLY. You can use this applier as a reference.
U-SQL Extractors by default are scaled out to work in parallel over smaller parts of the input files, called extents. These extents are about 250MB in size each.
Today, you have to upload your files as row-structured files to make sure that the rows are aligned with the extent boundaries (although we are going to provide support for rows spanning extent boundaries in the near future). Either way, though, the extractor UDO model would not know whether your 4 rows are all inside the same extent or spread across extents.
So you have two options:
1. Mark the extractor as operating on the whole file by adding the following line before the extractor class:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
Now the extractor will see the full file, but you lose the scale-out of the file processing.
2. Extract one row per line and use a U-SQL statement (e.g. using window functions or a custom REDUCER) to merge the rows into a single row.
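The merge step in the second option amounts to grouping every 4 consecutive lines into one logical row. The Python sketch below only illustrates that grouping; it is not U-SQL, `merge_lines` is a hypothetical name, and it assumes the lines arrive in their original file order.

```python
from itertools import islice


def merge_lines(lines, lines_per_row=4):
    """Yield one tuple per logical row, built from consecutive physical lines."""
    it = iter(lines)
    while True:
        chunk = [line.rstrip("\n") for line in islice(it, lines_per_row)]
        if not chunk:
            return
        if len(chunk) < lines_per_row:
            # A trailing partial group means the input was truncated or misaligned.
            raise ValueError("incomplete row: expected %d lines, got %d"
                             % (lines_per_row, len(chunk)))
        yield tuple(chunk)


# Hypothetical input: two logical rows of four column values each.
physical = ["a1", "a2", "a3", "a4", "b1", "b2", "b3", "b4"]
merged = list(merge_lines(physical))
```

In a distributed setting the ordering assumption is the hard part, which is why the answer suggests carrying a line number and merging with window functions or a reducer.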
I have discovered that I can't use a static method to get an instance of the IExtractor implementation in the USING statement if I want AtomicFileProcessing set to true.
