Custom parallel extractor - U-SQL - azure

I am trying to create a custom parallel extractor, but I have no idea how to do it correctly. I have big files (more than 250 MB) where the data for each row is stored across 4 lines, and each file line holds the data for one column. Is it possible to create an extractor that works in parallel on large files? I am afraid that the data for one row will end up in different extents after the file is split.
Example:
...
Data for first row
Data for first row
Data for first row
Data for first row
Data for second row
Data for second row
Data for second row
Data for second row
...
Sorry for my English.

I think you can process this data with U-SQL sequentially rather than in parallel. You have to write a custom applier that takes a single row and returns one or more rows, and then invoke it with CROSS APPLY. You can use this applier as a reference.
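As a rough illustration (not the referenced applier itself), here is a minimal C# sketch of such a U-SQL applier. It assumes the four physical lines of one logical row have already been collected into a single string column named block; the class and column names are hypothetical:

using System.Collections.Generic;
using Microsoft.Analytics.Interfaces;

// Hypothetical applier: splits a "block" column that holds the four
// original lines (joined with '\n') into four output columns.
[SqlUserDefinedApplier]
public class FourLineApplier : IApplier
{
    public override IEnumerable<IRow> Apply(IRow input, IUpdatableRow output)
    {
        var parts = input.Get<string>("block").Split('\n');

        output.Set<string>("col1", parts.Length > 0 ? parts[0] : null);
        output.Set<string>("col2", parts.Length > 1 ? parts[1] : null);
        output.Set<string>("col3", parts.Length > 2 ? parts[2] : null);
        output.Set<string>("col4", parts.Length > 3 ? parts[3] : null);

        yield return output.AsReadOnly();
    }
}

In the U-SQL script you would then call it from a CROSS APPLY clause on the rowset that contains the block column.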

U-SQL Extractors by default are scaled out to work in parallel over smaller parts of the input files, called extents. These extents are about 250MB in size each.
Today, you have to upload your files as row-structured files to make sure that the rows are aligned with the extent boundaries (although we are going to provide support for rows spanning extent boundaries in the near future). Either way though, the extractor UDO model would not know whether your 4 lines are all inside the same extent or split across extents.
So you have two options:
Option 1: Mark the extractor as operating on the whole file by adding the following line before the extractor class (a sketch of such an extractor is shown after these options):
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
Now the extractor will see the full file, but you lose the scale-out of the file processing.
Option 2: Extract one row per line and use a U-SQL statement (e.g. using window functions or a custom reducer) to merge the rows into a single row.
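A minimal C# sketch of option 1, assuming that every four consecutive lines form one logical row; the class and column names (col1 through col4) are illustrative, not prescribed:

using System.Collections.Generic;
using System.IO;
using Microsoft.Analytics.Interfaces;

// Illustrative whole-file extractor: reads the input line by line and
// folds every four lines into a single output row.
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class FourLineExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        var buffer = new List<string>(4);
        using (var reader = new StreamReader(input.BaseStream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                buffer.Add(line);
                if (buffer.Count == 4)
                {
                    output.Set<string>("col1", buffer[0]);
                    output.Set<string>("col2", buffer[1]);
                    output.Set<string>("col3", buffer[2]);
                    output.Set<string>("col4", buffer[3]);
                    yield return output.AsReadOnly();
                    buffer.Clear();
                }
            }
        }
    }
}

In the U-SQL script you would then reference this class in the EXTRACT statement's USING clause and declare the four columns in the schema.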

I have discovered that I can't use a static method to get an instance of my IExtractor implementation in the USING statement if I want AtomicFileProcessing set to true.

Related

Delete bottom two rows in Azure Data Flow

I would like to delete the bottom two rows of an Excel file in ADF, but I don't know how to do it.
The flow I am thinking of is shown in the attached screenshot: I intend to use a filter to remove the rows to be deleted (highlighted in yellow).
The file has over 40,000 rows of data and is updated once a month. (The number of rows changes with each update, so the condition must be specified with a function.)
The contents of the file are also shown here.
The bottom two lines contain spaces and asterisks.
Any help would be appreciated.
I'm new to Azure and having trouble.
I need your help.
Add a surrogate key transformation to put a row number on each row. Add a new branch to duplicate the stream and in that new branch, add an aggregate.
Use the aggregate transformation to find the max() value of the surrogate key counter.
Then subtract 2 from that max number and filter for just the rows up to that max-2.
Let me provide a more detailed answer here ... I think I can get it in here without writing a separate blog.
The simplest way to filter out the final 2 rows is the pattern depicted in the screenshot here. Instead of a new branch, I just created 2 sources, both pointing to the same data source. The 2nd stream is there just to get a row count and store it in a cached sink; for the aggregate expression I used "count(1)" as the row count aggregator.
In the first stream, which is the primary data processing stream, I added a Surrogate Key transformation so that I have a row number for each row. I called my key column "sk".
Finally, set the Filter transformation to only allow rows with a row number <= the max row count from the cached sink minus 2.
The Filter expression looks like this: sk <= cachedSink#output().rowcount-2

Delimited File with Varying Number of Rows Azure Data Factory

I have a delimited file separated by hashes that looks somewhat like this,
value#value#value#value#value#value##value
value#value#value#value##value#####value#####value
value#value#value#value###value#value####value##value
As you can see, when separated by hashes, there are more columns in the 2nd and 3rd rows than there are in the first. I want to be able to ingest this into a database using an ADF Data Flow after some transformations. However, whenever I try to do any kind of mapping, I only ever see 7 columns (the number of columns in the first row).
Is there any way to get all of the values? As many columns as there are in the row with most number of items? I do not mind the nulls.
Note: I do not have a header row for this.
Azure Data Factory will not be able to directly import the schema from the row with the maximum number of columns. Hence, it is important to make sure you have the same number of columns in every row of your file.
You can use Azure Functions to validate your file and update it so that all rows have an equal number of columns.
You could try keeping a local file whose first row has the maximum number of columns and importing the schema from that file; otherwise you have to go with Azure Functions, where you convert the file and then trigger the pipeline.
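If you take the Azure Functions route, the core of the conversion is just padding every line to the widest row's delimiter count before the pipeline reads the file. A minimal C# sketch of that normalization step, assuming local input/output paths for simplicity (in a real function you would read from and write to blob storage):

using System.IO;
using System.Linq;

// Illustrative normalization: pad every '#'-delimited line so that all rows
// end up with the same number of columns as the widest row.
class NormalizeDelimitedFile
{
    static void Main(string[] args)
    {
        var inputPath = args[0];
        var outputPath = args[1];

        var lines = File.ReadAllLines(inputPath);
        var maxColumns = lines.Max(l => l.Split('#').Length);

        var padded = lines.Select(l =>
            l + new string('#', maxColumns - l.Split('#').Length));

        File.WriteAllLines(outputPath, padded);
    }
}

With all rows padded to the same width, the Data Flow mapping will see every column, and the extra empty values come through as blanks/nulls, which the question says is acceptable.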

Azure Data Factory Copy Activity is dropping columns on the floor

first time, long time.
I'm running an import of a CSV file that has 734 columns in an Azure Data Factory Copy Activity. Data Factory is not reading the last 9 columns and is populating them with NULL. Even in the preview I can see that those columns have no values, although their schema is detected. Is there a 725-column limit in Copy?
As Joel said, there is no restriction at 725 or so columns. I suggest:
Go to the mapping tab and pick only the 726th column (if you have a header it will be easy; otherwise ADF will most probably generate a header like Prop_726), then copy the data to blob as the sink. If the blob has the field, that means you have a data type issue on the table.
Let me know how it goes; if you are still facing the issue, please share some dummy data for the 726th column.
Here is what happened. I had the file in zip folders, and I thought I had to unzip the files first to process them. It turns out that when unzipping through ADF, it stripped the quotation marks from my columns, and then one of the columns had an escape character in it. That escape character shifted everything over, and resulted in me losing nine columns.
But I did learn a bunch of things NOT to do, so it wasn't a total waste of time. Thanks for the answers!

How to overwrite fields when you COPY FROM data?

Is it possible to somehow overwrite existing counter fields when loading data with COPY FROM (from CSV), or to completely delete rows from the database?
When I COPY FROM data into existing rows, the counters are summed up.
I also can't completely DELETE these rows: although it seems that the rows are deleted, when you re-run COPY FROM with the data from the CSV, the counter fields continue to increase.
You can't set counters to a specific value - the only supported operations on them are increment and decrement. To set a counter to a specific value you either need to decrement it by its current value and then increment it by the desired value, which requires reading the current value first; or you need to delete the corresponding cells (or the whole row) and perform an increment operation with the desired number.
The second approach is easier to implement, but it requires that you first generate a file of CQL DELETE commands based on the content of your CSV file, run it, and then use COPY FROM - if nobody has incremented the values since the deletion, the counters will end up with the correct values.
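A minimal C# sketch of that pre-processing step; the keyspace, table, and key column position are assumptions, and it simply treats the first CSV column as the partition key:

using System.IO;

// Illustrative generator: for every CSV row, emit a CQL DELETE for the
// corresponding counter row. Keyspace, table and key position are assumed.
class GenerateDeletes
{
    static void Main()
    {
        using (var reader = new StreamReader("data.csv"))
        using (var writer = new StreamWriter("deletes.cql"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var key = line.Split(',')[0];   // assumption: first column is the key
                writer.WriteLine($"DELETE FROM my_keyspace.my_counters WHERE id = '{key}';");
            }
        }
    }
}

You could then run the generated file with cqlsh -f deletes.cql before re-running COPY FROM.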

Show graph for data over 10 billions lines

I have a 40 GB data file containing over 10 billion lines.
Each line contains 2 columns, separated by a comma; both columns are of type 'float', and the first column increases with each row.
I want to show this data as a line graph, and I've tried Excel, but it said there were too many lines and could not display them all.
Are there any other tools that can handle data this large and show a line graph of it?
Excel is not the right tool for files of that size. There is no perfect rule, but I usually move the data into a relational database if it is over 200,000 rows. In SQL you can perform the logic to create a smaller dataset that aggregates the data into a more manageable size.
If you cannot get it down to an Excel-worthy size, then you'll need to look into options like Tableau that can graph data in a relational database.
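If loading the file into a database is not convenient, the same idea (aggregate before plotting) can be applied to the file directly. A hedged C# sketch that averages every N consecutive rows of the x,y file into one point, producing a file small enough for Excel or Tableau; the file names and bucket size are illustrative:

using System.Globalization;
using System.IO;

// Illustrative pre-aggregation: average every BucketSize consecutive rows of
// the "x,y" file so the result is small enough to chart.
class DownsampleCsv
{
    const int BucketSize = 100_000;   // assumption: tune to the size you need

    static void Main()
    {
        using (var reader = new StreamReader("data.csv"))
        using (var writer = new StreamWriter("downsampled.csv"))
        {
            double sumX = 0, sumY = 0;
            int count = 0;
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var parts = line.Split(',');
                sumX += double.Parse(parts[0], CultureInfo.InvariantCulture);
                sumY += double.Parse(parts[1], CultureInfo.InvariantCulture);
                if (++count == BucketSize)
                {
                    writer.WriteLine($"{sumX / count},{sumY / count}");
                    sumX = sumY = 0;
                    count = 0;
                }
            }
            if (count > 0)
                writer.WriteLine($"{sumX / count},{sumY / count}");
        }
    }
}

Because the first column increases monotonically, averaging consecutive rows preserves the overall shape of the line.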
