I'm working with some files that have a few complexities:
multiple tab-delimited files concatenated into one
CSV files with some metadata prior to the CSV data
CSV files with an extra row after the header that should be ignored
CSV files with log information interspersed throughout the file
My question is whether Auto Loader can split the stream (i.e. one input file to two or more output files) based on pattern matching, or whether it has some other mechanism for dealing with these scenarios.
Ignoring the metadata using skipRows isn't an option, as I want to retain the metadata in a separate output file.
The rescuedDataColumn option doesn't appear to be a valid approach either, as the data doesn't fall into the three scenarios identified in the docs, i.e.:
The column is missing from the schema.
Type mismatches.
Case mismatches.
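The thread doesn't settle this, but one possible workaround is to let Auto Loader ingest the raw lines and split the stream afterwards by pattern matching. A minimal sketch, assuming the cloudFiles text format, placeholder paths and table names, a Databricks session exposed as spark, and a hypothetical rule that metadata lines start with '#':

```python
from pyspark.sql import functions as F

# Read every line of every landed file as raw text (no CSV parsing yet).
raw = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "text")
    .load("/mnt/landing/incoming/")               # placeholder path
    .withColumn("source_file", F.input_file_name())
)

# Hypothetical pattern: metadata lines start with '#'; everything else is data.
is_meta = F.col("value").rlike(r"^#")
meta = raw.filter(is_meta)
data = raw.filter(~is_meta)

# One input stream, two output tables (names are placeholders).
meta.writeStream.option("checkpointLocation", "/chk/meta").toTable("bronze_metadata")
data.writeStream.option("checkpointLocation", "/chk/data").toTable("bronze_data")
```

The data lines would still need a second pass (for example with from_csv) to parse them as CSV once they are separated from the metadata and log lines.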
Related
After several successful pipelines which move .txt files from my Azure file share to my Azure SQL Server, I am experiencing problems with moving one specific file to a SQL Server table. I get the following error code:
ErrorCode=DelimitedTextMoreColumnsThanDefined,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error found when processing 'Csv/Tsv Format Text' source 'chauf_afw.txt' with row number 2: found more columns than expected column count 5.,Source=Microsoft.DataTransfer.Common,'
Azure Data Factory sees 5 columns on both the sink and source side. On both sides (source and sink) I have a schema with 5 columns; the final schema mapping looks like the following screenshot.
[schema mapping screenshot]
The .txt file contains 6 columns when counting the tabs.
The source file is a UTF-8 .txt file with tab-separated data, nothing special, and in the same format as the other successfully imported files.
Regarding the delimiter: the file uses tabs; in Notepad++ it looks like this.
[Notepad++ screenshot]
I am afraid I am missing something, but I can't find the cause of the error code.
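Not part of the original post, but a quick way to confirm where the extra column comes from is to count the tab-separated fields per row outside ADF. A minimal Python sketch, assuming the file name from the error message and a tab delimiter:

```python
# Report rows whose tab-separated field count differs from the header row.
with open("chauf_afw.txt", encoding="utf-8") as f:
    expected = len(f.readline().rstrip("\r\n").split("\t"))
    for lineno, line in enumerate(f, start=2):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != expected:
            print(f"row {lineno}: {len(fields)} fields (expected {expected})")
```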
For example:
Persons.csv
name, last_name
-----------------------
jack, jack_lastName
luc, luc_lastname
FileExample.csv
id
243
123
Result:
name, last_name, exampleId
-------------------------------
jack, jack_lastName, 243
luc, luc_lastname, 123
I want to combine any number of columns from another data source in this way, and insert the final result into a file or a database table.
I have tried many approaches, but I can't get it to work.
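The desired result above is a positional (row-by-row) join rather than a key-based join. As a minimal sketch outside ADF, using pandas with the file and column names from the example (and assuming both files have the same number of rows):

```python
import pandas as pd

persons = pd.read_csv("Persons.csv", skipinitialspace=True)   # name, last_name
examples = pd.read_csv("FileExample.csv")                     # id

# Align purely by row order; this assumes equal row counts in both files.
result = persons.copy()
result["exampleId"] = examples["id"].values

result.to_csv("Result.csv", index=False)
```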
You can try to make use of the Merge files copy behavior in an Azure Data Factory pipeline to merge two CSV files.
Select the Copy Data activity and, in the source settings, use the wildcard entry *.csv to search for CSV files in storage (configure the linked storage service to ADF in this process).
Then create an output CSV in the same container if required (as in my case) to merge the files, storing the result under a name such as examplemerge.csv.
Check 'First row as header'.
Validate and debug the pipeline.
You should then be able to see the merged data in the resulting file in the output folder.
You can check this reference video, Merge Multiple CSV files to single CSV, for more details, and also this video on Load Multiple CSV Files to a Table in Azure Data Factory if required.
But if you want to join the files, there must be some common column to join on.
Also check this thread from Q&A: Azure Data Factory merge 2 csv files with different schema.
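For the join case mentioned above (as opposed to merging), a minimal sketch with pandas, where the shared column id and both file names are hypothetical placeholders:

```python
import pandas as pd

# Join two CSVs on a hypothetical common key column instead of concatenating them.
left = pd.read_csv("file_a.csv")
right = pd.read_csv("file_b.csv")

joined = left.merge(right, on="id", how="inner")
joined.to_csv("joined.csv", index=False)
```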
We have 500 CSV files uploaded to an Azure storage container. These files use 4 different schemas, meaning that they have a few different columns, and some columns are common across all files.
We are using ADF and schema drift to map columns in the sink and source and to be able to write the files.
But this is not working: it only uses the schema of the first file it processes for every file, and this is causing data issues. Please advise on this issue.
We ran the pipeline for three scenarios, but the issue is not resolved. The same issues mentioned below occur in all three cases:
1. Incorrect mapping, i.e. the Description and PayClass from the A type get mapped to WBSname and Activity Name.
2. If one of the files has one column less (a missing column), that also disturbs the mapping, i.e. one file does not have Resource Type, which maps Group incorrectly to another column.
Case 1:
No Schema Drift at source and sink
Empty Dummy File with all columns created and uploaded at source
Derived table with column Pattern
Case 2:
Schema Drift at source and sink
Dummy File with all columns created and uploaded at source
Derived table with column Pattern
Case 3: Schema Drift at Source / No Schema Drift at Sink
Dummy File with all columns created and uploaded at source
Derived table with column Pattern
This is because you have different schemas inside the files being read by the single source transformation.
Schema Drift will automatically handle occasions when that source's schema changes over different invocations from your pipeline.
The way to solve this in your case is to have 4 sources: 1 for each of your CSV schema types. You can always Union the results back together into a single stream and sink them together at the end.
If you use schema drift in this scenario with 4 different source types, data flow will automatically handle cases where more columns are found and columns change per pipeline execution of this data flow.
BTW, this schemaMerge feature you're asking for is available today with Parquet sources in ADF's data flow. We're working on adding native schemaMerge to CSV sources. Until then, you'll need to use an approach like the one I described above.
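Outside ADF data flows, the same "one source per schema, then union" pattern can be sketched in PySpark; the four folder paths are assumptions, and unionByName with allowMissingColumns stands in for the Union transformation described above:

```python
from functools import reduce

# One read per schema type, then a union by column name so that columns
# missing from a given schema are filled with nulls.
# Folder paths and the output location are placeholders.
paths = ["/data/schema_a/", "/data/schema_b/", "/data/schema_c/", "/data/schema_d/"]
frames = [spark.read.option("header", True).csv(p) for p in paths]

combined = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), frames)
combined.write.mode("overwrite").parquet("/data/combined/")
```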
I want to upload my Excel workbook into Azure Machine Learning Studio. The reason is that I have some data that I would like to join with my other .csv files to create a training data set.
When I upload my Excel file, I don't get .xlsx or .xls, only other extensions such as .csv, .txt, etc.
This is how it looks:
[upload dialog screenshot]
I uploaded it anyway, and now I am getting weird characters. How can I get the Excel workbook uploaded and get my sheets, so I can join the data and do data preparation? Any suggestions?
You could save the workbook as a (set of) CSV file(s) and upload them separately.
A CSV file, a 'Comma Separated Values' file, is exactly that: a flat file with some values separated by commas. If you load an Excel file, it will get messed up, since there's way more information in an Excel file than just values separated by commas. Have a look at File -> Save As -> Save as type, where you can select 'CSV (comma delimited) (*.csv)'.
Disclaimer: no, it's not always a comma...
In addition, the term "CSV" also denotes some closely related delimiter-separated formats that use different field delimiters. These include tab-separated values and space-separated values. A delimiter that is not present in the field data (such as tab) keeps the format parsing simple. These alternate delimiter-separated files are often even given a .csv extension despite the use of a non-comma field separator.
Edit
So apparently Excel files are supported: Supported data sources for Azure Machine Learning data preparation
Excel (.xls/.xlsx)
Read an Excel file one sheet at a time by specifying sheet name or number.
But also, only UTF-8 is supported: Import Data - Technical notes
Azure Machine Learning requires UTF-8 encoding. If the data you are importing uses a different encoding, or was exported from a data source that uses a different default encoding, various problems might appear in the text.
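If converting locally before upload, a minimal pandas sketch that writes every sheet of the workbook as a separate UTF-8 CSV (the workbook name is a placeholder; reading .xlsx also requires the openpyxl package):

```python
import pandas as pd

# Export each sheet of the workbook as its own UTF-8 encoded CSV file,
# matching both the "save as CSV" advice and the UTF-8 requirement above.
sheets = pd.read_excel("workbook.xlsx", sheet_name=None)  # dict of {sheet name: DataFrame}
for name, df in sheets.items():
    df.to_csv(f"{name}.csv", index=False, encoding="utf-8")
```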
I have a lot of CSV files in an Azure Data Lake, consisting of data of various types (e.g., pressure, temperature, true/false). They are all time-stamped, and I need to collect them in a single file according to timestamp for machine learning purposes. This is easy enough to do in Java: start a file stream, run a loop over the folder that opens each file and compares timestamps to write the relevant values to the output file, starting a new column (going to the end of the first line) for each file.
While I've worked around the timestamp problem in U-SQL, I'm having trouble coming up with syntax that will help me run this on the whole folder. The wildcard syntax {*} treats all files as the same fileset, while I need to run some sort of loop to join a column from each file individually.
Is there any way to do this, perhaps using virtual columns?
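For reference, the column-per-file join described above could be sketched outside U-SQL like this; the folder path and the timestamp/value column names are assumptions:

```python
import glob
import os
from functools import reduce
import pandas as pd

# Give each file its own value column (named after the file), then
# outer-join everything on the timestamp column.
frames = []
for path in glob.glob("/datalake/sensors/*.csv"):        # placeholder path
    name = os.path.splitext(os.path.basename(path))[0]
    df = pd.read_csv(path, usecols=["timestamp", "value"])
    frames.append(df.rename(columns={"value": name}))

combined = reduce(lambda a, b: a.merge(b, on="timestamp", how="outer"), frames)
combined.sort_values("timestamp").to_csv("combined.csv", index=False)
```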
First, you have to think about your problem functionally/declaratively, and not in terms of procedural paradigms such as loops.
Let me try to rephrase your question to see if I can help. You have many csv files with data that is timestamped. Different files can have rows with the same timestamp, and you want to have all rows for the same timestamp (or range of timestamps) output to a specific file? So you basically want to repartition the data?
What is the format of each of the files? Do they all have the same schema or different schemas? In the latter case, how can you differentiate them? Based on filename?
Let me know in the comments if that is a correct declarative restatement and the answers to my questions and I will augment my answer with the next step.