Error in U-SQL Job on Azure Data Lake

I have lots of json files in my Azure Data Lake account. They are organized as: Archive -> Folder 1 -> JSON Files.
What I want to do is extract a particular field, timestamp, from each JSON file and then just put it in a CSV file.
My issue is:
I started with this script:
CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json] FROM "correct_path/Assemblies/JSON/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats] FROM "correct_path/Assemblies/JSON/Microsoft.Analytics.Samples.Formats.dll";
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
DECLARE @INPUT_FILE string = @"correct_path/Tracking_3e9.json";
//Extract the different properties from the Json file using a JsonExtractor
@json =
EXTRACT Partition string, Custom string
FROM @INPUT_FILE
USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();
OUTPUT @json
TO "correct_path/Output/simple.csv"
USING Outputters.Csv(quoting : false);
I get error:
E_STORE_USER_FILENOTFOUND: File not found or access denied
But I do have access to the file in the data explorer of the Azure Data Lake, so how can it be?
I don't want to run it for each file one by one. I just want to point it at all the files in a folder (like Tracking*.json) or a bunch of folders (like Folder*), and it should go through them and put the output for each file in a single row of the output CSV.
Haven't found any tutorials on this.
Right now I am reading the entire JSON. How can I read just one field, such as a timestamp that is nested inside another field, like data: {timestamp:"xxx"}?
Thanks for your help.

1) Without more information it's hard to say why you're running into that error - is it the input file that's missing, or the assemblies?
2) You can use a file set to extract data from a set of files. Just use {} to denote the wildcard in your input string; the matched value is then saved into a new column. So for example, your input string could be @"correct_path/{day}/{hour}/{id}.json", and then your extract statement becomes:
EXTRACT
column1 string,
column2 string,
day int,
hour int,
id int
FROM @input;
3) You'll have to read the entire JSON in your SELECT statement, but you can refine it down to only the rows you want in future rowsets. For example:
@refine =
SELECT timestamp FROM @json;
OUTPUT @refine
...
It sounds like some of your JSON data is nested, however (like the timestamp field). You can find information on reading nested JSON data on our GitHub (Using the JSON UDFs) and in this blog.
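Putting points 2 and 3 together, a rough sketch might look like the following. The paths reuse the question's placeholders; the data column and the JsonFunctions.JsonTuple helper from the samples library are assumptions based on the nested data: {timestamp:"xxx"} shape, not code from the original post:

```usql
// Sketch only: file set over many folders/files, then drill into the nested object.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

DECLARE @input string = @"correct_path/Archive/{FolderName}/{FileName}.json";

@json =
    EXTRACT data string,        // nested JSON object, kept as a serialized string
            FolderName string,  // virtual columns captured from the path
            FileName string
    FROM @input
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();

// JsonTuple turns the serialized object into a map we can index into.
@refine =
    SELECT Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple(data)["timestamp"] AS timestamp,
           FileName
    FROM @json;

OUTPUT @refine
TO "correct_path/Output/simple.csv"
USING Outputters.Csv(quoting : false);
```

Because FileName is a virtual column from the file set, each output row also records which file the timestamp came from.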
Hope this helps, and please let me know if you have additional questions!

Related

KUSTO split txt when ingesting

I created a table in my Azure Data Explorer with the following command:
.create table MyLogs ( Level:string, Timestamp:datetime, UserId:string, TraceId:string, Message:string, ProcessId:int32 )
I then created my Storage Account --> Container and uploaded a simple txt file with the following content:
Level Timestamp UserId TraceId Message ProcessId
I then generated a SAS for the container holding that txt file and used it in the query section of my Azure Data Explorer like the following:
.ingest into table MyLogs (
h'...sas for my txt file ...')
Now, when I read the table I see something like this:
Level TimeStamp UserId TraceID MEssage ProcessId
Level Timestamp UserId TraceId Message ProcessId
So it basically put all the content into the first column.
I was expecting some automatic splitting. I tried with tab, spaces, commas and many other separators.
I tried to configure an ingestion mapping with csv format but had no luck.
From what I understand, each new line in the txt file becomes a new row in the table. But how do I split the same line on a specific separator?
I read many pages of documentation but had no luck.
You can specify any of the formats you want to try using the format argument; see the list of supported formats and an ingestion command syntax example that specifies the format here.
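For instance, if the file really is tab-separated, the format can be set directly on the ingest command from the question (the SAS placeholder is left elided here just as in the question; tsv is one of the supported format values):

```kusto
.ingest into table MyLogs (
    h'...sas for my txt file ...')
    with (format="tsv")
```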
In addition, you can use the "one click ingestion" from the web interface.
This should work (I have done it before with the Python SDK):
.create table MyLogs ingestion csv mapping 'MyLogs_CSV_Mapping' ```
[
{"Name":"Level","datatype":"string","Ordinal":0},
{"Name":"Timestamp","datatype":"datetime","Ordinal":1},
{"Name":"UserId","datatype":"string","Ordinal":2},
{"Name":"TraceId","datatype":"string","Ordinal":3},
{"Name":"Message","datatype":"string","Ordinal":4},
{"Name":"ProcessId","datatype":"long","Ordinal":5}
]```
https://learn.microsoft.com/de-de/azure/data-explorer/kusto/management/data-ingestion/ingest-from-storage
.ingest into table MyLogs SourceDataLocator with (
format="csv",
ingestionMappingReference = "MyLogs_CSV_Mapping"
)
Hopefully this will help a bit :)

Azure Data Factory: output dataset file name from input dataset folder name

I'm trying to solve following scenario in Azure Data Factory:
I have a large number of folders in Azure Blob Storage. Each folder contains a varying number of files in Parquet format. The folder name contains the date when the data in the folder was generated, something like this: DATE=2021-01-01. I need to filter the files, save them into another container in delimited format, and each file should have the date indicated by the source folder name in its file name.
So when my input looks something like this...
DATE=2021-01-01/
data-file-001.parquet
data-file-002.parquet
data-file-003.parquet
DATE=2021-01-02/
data-file-001.parquet
data-file-002.parquet
...my output should look something like this:
output-data/
data_2021-01-01_1.csv
data_2021-01-01_2.csv
data_2021-01-01_3.csv
data_2021-01-02_1.csv
data_2021-01-02_2.csv
Reading files from subfolders, filtering them and saving them is easy. Problems start when I try to set the output dataset file name dynamically. I can get the folder names using a Get Metadata activity and then use a ForEach activity to set them into variables. However, I can't figure out how to use this variable in the filtering data flow sink's dataset.
Update:
In my Get Metadata1 activity, I set the container input as follows:
My debug info is as follows:
I think I've found the solution. I'm using csv files as an example.
My input looks something like this:
container:input
2021-01-01/
data-file-001.csv
data-file-002.csv
data-file-003.csv
2021-01-02/
data-file-001.csv
data-file-002.csv
My debug result is as follows:
Use the Get Metadata1 activity to get the folder list and then the ForEach1 activity to iterate over this list.
Inside the ForEach1 activity, we now use a Data Flow to move the data.
Set the source dataset to the container and declare a parameter FolderName.
Then add the dynamic content @dataset().FolderName to the source dataset.
Back in the ForEach1 activity, we can add the dynamic content @item().name to the parameter FolderName.
Key in File_Name as the column to store the file name. It will store the file name in a column, e.g. /2021-01-01/data-file-001.csv.
Then we can process this column to get the file name we want via DerivedColumn1.
Add the expression concat('data_',substring(File_Name,2,10),'_',split(File_Name,'-')[5]).
In the Settings of the sink, we can select 'Name file as column data' and choose the File_Name column.
That's all.

Writing in sub directory - Azure Mapping Data Flow

In my Data flow, the sink is ADLS.
My source files are present in SoureDump/Data and I am reading from the path SoureDump/Data. I am doing a few transformations and trying to write the output files into SoureDump/Rawzone.
Output file name is created from the data.
When I trigger the pipeline, the output files are generated as expected but are written in the parent directory SoureDump.
My work:
DataSet screenshot
Please let me know if I have configured anything wrong.
Thanks.
"As data in column" defaults back to your dataset container object. Check the info bubble next to the column name field:
Just set your target folder with a Derived Column prior to your sink and append the value like this:
tgt_file_name_w_path = '/mypath/output/'+tgt_file_name
Then use tgt_file_name_w_path instead in your column with file name property.

What is the proper way to validate datatype of csv data in spark?

We have a JSON file as input to the Spark program (it describes the schema definition and the constraints we want to check on each column), and I want to perform some data quality checks (NOT NULL, UNIQUE) as well as datatype validations (i.e. check whether the CSV file contains data that conforms to the JSON schema).
JSON File:
{
"id":"1",
"name":"employee",
"source":"local",
"file_type":"text",
"sub_file_type":"csv",
"delimeter":",",
"path":"/user/all/dqdata/data/emp.txt",
"columns":[
{"column_name":"empid","datatype":"integer","constraints":["not null","unique"],"values_permitted":["1","2"]},
{"column_name":"empname","datatype":"string","constraints":["not null","unique"],"values_permitted":["1","2"]},
{"column_name":"salary","datatype":"double","constraints":["not null","unique"],"values_permitted":["1","2"]},
{"column_name":"doj","datatype":"date","constraints":["not null","unique"],"values_permitted":["1","2"]},
{"column_name":"location","string":"number","constraints":["not null","unique"],"values_permitted":["1","2"]}
]
}
Sample CSV input :
empId,empname,salar,dob,location
1,a,10000,11-03-2019,pune
2,b,10020,14-03-2019,pune
3,a,10010,15-03-2019,pune
a,1,10010,15-03-2019,pune
Keep in mind that:
1) I have intentionally put invalid data in the empId and name fields (check the last record).
2) The number of columns in the JSON file is not fixed.
Question:
How can I ensure that an input data file contains all the records as per the given datatype(in JSON) file or not?
We have tried below things:
1) If we try to load the data from the CSV file into a DataFrame with an external schema applied, the Spark program immediately throws a cast exception (NumberFormatException, etc.) and terminates abnormally. But I want to continue the execution flow and log the specific error as "Datatype mismatch error for column empID".
The above scenario is only triggered when we call some RDD action on the DataFrame, which felt like a weird way to validate the schema.
Please guide me: how can we achieve this in Spark?
I don't think there is a free lunch here; you have to write this process yourself. But what you can do is:
1. Read the CSV file as a Dataset of strings, so that every row reads successfully
2. Parse the Dataset using the map function, checking each column for nulls or datatype problems
3. Add two extra columns: a boolean such as validRow and a string such as message or description
4. In the parser from step 2, use some sort of try/catch or a Try/Success/Failure for each value in each column; catch any exception and set the validRow and description columns accordingly
5. Filter and write the successful rows (validRow set to true) as one DataFrame/Dataset to a success location, and write the error DataFrame/Dataset to an error location
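The per-value validation at the heart of steps 2-4 can be sketched in plain Python. In Spark this logic would run inside a map/UDF over the Dataset of raw CSV lines; the names here (validate_row, CASTERS) are illustrative, not from the original post:

```python
# Sketch of the per-row validation logic: try to cast every value according
# to the JSON schema and collect errors instead of failing the whole job.
from datetime import datetime

# Map the JSON schema's datatype names to parser functions.
CASTERS = {
    "integer": int,
    "double": float,
    "string": str,
    "date": lambda s: datetime.strptime(s, "%d-%m-%Y"),
}

def validate_row(values, columns):
    """Return (valid_row, message) for one CSV row against the schema."""
    errors = []
    for value, col in zip(values, columns):
        if "not null" in col.get("constraints", []) and value == "":
            errors.append("Null value for column %s" % col["column_name"])
            continue
        try:
            CASTERS[col["datatype"]](value)
        except ValueError:
            errors.append("Datatype mismatch error for column %s" % col["column_name"])
    return (len(errors) == 0, "; ".join(errors))

columns = [
    {"column_name": "empid", "datatype": "integer", "constraints": ["not null"]},
    {"column_name": "empname", "datatype": "string", "constraints": ["not null"]},
]
print(validate_row(["1", "a"], columns))  # (True, '')
print(validate_row(["a", "1"], columns))  # (False, 'Datatype mismatch error for column empid')
```

In Spark you would call something like df.map(lambda row: row + validate_row(row, columns)) to append the validRow flag and message columns, then filter on the flag as in step 5. Uniqueness checks need a grouped aggregation rather than a per-row map, so they are left out of this sketch.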

Bulk Insert or any other insert options to insert file contents without providing file path

I want to know if it is possible to import a file's contents without knowing the exact file location/path. Instead of the file path, the file contents would be passed in directly as a string to, say, a stored procedure.
How can I utilize any of the insert options such as BULK INSERT or OPENROWSET to import/copy data directly from an input string (instead of giving a file path)? Please give some examples or links, with code if possible.
If more details are needed, I can provide them. Or do I have to perform string manipulation with looping of some kind to load the huge string blob into the respective tables/columns?
