Azure data lake analytics empty output file - azure

I need help; I can't see what is causing this problem.
I have provisioned an Azure Data Lake Store that is fed by a Stream Analytics job.
The files are tab-separated and the job runs without errors.
I deployed an Azure Data Lake Analytics service to aggregate the data like this:
@input =
EXTRACT [applicationname] string,
[clientip] string,
[continent] string,
[country] string,
[province] string,
[city] string,
[latitude] string,
[longitude] string
FROM "adl://mydatalakesotre.azuredatalakestore.net/instrumentationoutput/mystore/2017-10-22/{*}"
USING Extractors.Text(delimiter: '\t', skipFirstNRows: 1, silent: true);
OUTPUT @input
TO "output/PowerBI_output.tsv"
USING Outputters.Tsv(outputHeader: true);
I can't find a way to make it work. I have over 5 MB of input data, but the output contains only the headers, as specified in the query. What am I missing?
Thanks for the help.

As Bob mentions in the comments, you are most likely getting an empty result because your schema definition does not match the actual files you are extracting from.
I suggest that you open the file in the ADL Tools for Visual Studio and use the CREATE EXTRACT statement wizard to generate the EXTRACT statement for you. If you still get error messages (after removing silent: true), please update your question with the error message and we will give you an updated answer.
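One quick way to test the schema-mismatch theory is to count the fields per line in a sample of the input. With silent: true the extractor skips rows whose column count differs from the declared schema instead of failing, so a file where every row has, say, 9 fields against an 8-column EXTRACT would produce a header-only output. The sketch below is a local Python check, not part of the U-SQL job; the sample data and the function name are made up for illustration.

```python
from collections import Counter

EXPECTED_COLUMNS = 8  # the EXTRACT statement in the question declares 8 columns


def field_count_histogram(text, delimiter="\t"):
    """Count how many non-empty lines have each number of delimited fields."""
    counts = Counter()
    for line in text.splitlines():
        if line:  # ignore blank lines
            counts[len(line.split(delimiter))] += 1
    return counts


# Hypothetical sample: three lines with 9 tab-separated fields each.
sample = "\n".join("\t".join(f"v{i}" for i in range(9)) for _ in range(3))

histogram = field_count_histogram(sample)
mismatched = sum(n for width, n in histogram.items() if width != EXPECTED_COLUMNS)
print(histogram, "lines not matching the schema:", mismatched)
```

If the histogram shows a width other than 8, the EXTRACT column list needs to be adjusted to match the file.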

Related

Azure stream analytics split at comma

I have an input to my stream analytics job as a CSV string such as follows:
jon,41,111 treadmill lane,07831231123,aa,bb,123...etc.
I'd like to sort this data into columns of an SQL table with column headings:
name,age,address,phone,result1,result2,result3...etc.
I've tried several SQL split functions, but none of them seems to be compatible with the Azure Stream Analytics query language. Could anyone help me split my string into the appropriate columns? Many thanks.
If your events are coming in with a CSV format, you don't have to do anything in your query to work with it. The trick is to set the correct serialisation for your input: when you create your IoT Hub input, set the serialisation to CSV.
This will work if your CSV message has the headers included in the message:
name,age,address,phone,result1,result2,result3
jon,41,111 treadmill lane,07831231123,aa,bb,123
The input preview then shows the message split into columns.
When the headers are present, you can use them in your queries.
SELECT
name,
age
INTO
target
FROM
[csv-input]
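To illustrate what the CSV serialisation does with a header line, here is a small local Python analogy using the standard csv module. It is not Stream Analytics code; it only shows how a header row turns each event into named fields that a query can then reference.

```python
import csv
import io

# The sample message from the answer, header line included.
message = (
    "name,age,address,phone,result1,result2,result3\n"
    "jon,41,111 treadmill lane,07831231123,aa,bb,123\n"
)


def parse_events(payload):
    """Parse a CSV payload whose first line holds the column names."""
    reader = csv.DictReader(io.StringIO(payload))
    return list(reader)


events = parse_events(message)
print(events[0]["name"], events[0]["age"])
```

Every value arrives as text here; any type conversion is a separate step.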

KUSTO split txt when ingesting

I created a table in my Azure Data Explorer with the following command:
.create table MyLogs ( Level:string, Timestamp:datetime, UserId:string, TraceId:string, Message:string, ProcessId:int32 )
I then created a Storage Account with a container and uploaded a simple txt file with the following content:
Level Timestamp UserId TraceId Message ProcessId
I then generated a SAS for the container holding that txt file and used it in the query section of my Azure Data Explorer like this:
.ingest into table MyLogs (
h'...sas for my txt file ...')
Now, when I read the table I see something like this:
Level TimeStamp UserId TraceID MEssage ProcessId
Level Timestamp UserId TraceId Message ProcessId
So it basically put all the content into the first column.
I was expecting some automatic splitting. I tried tabs, spaces, commas and many other separators.
I tried to configure an ingestion mapping with the csv format but had no luck.
From what I understand, each new line in the txt file is a new row in the table. But how do I split a line on a specific separator?
I have read many pages of documentation without success.
You can specify the format you want to try with the format argument; see the list of supported formats and the ingestion command syntax example in the documentation.
In addition, you can use the "one click ingestion" from the web interface.
This should work (I have done it before with the Python SDK):
.create table MyLogs ingestion csv mapping 'MyLogs_CSV_Mapping' ```
[
{"Name":"Level","datatype":"string","Ordinal":0},
{"Name":"Timestamp","datatype":"datetime","Ordinal":1},
{"Name":"UserId","datatype":"string","Ordinal":2},
{"Name":"TraceId","datatype":"string","Ordinal":3},
{"Name":"Message","datatype":"string","Ordinal":4},
{"Name":"ProcessId","datatype":"long","Ordinal":5}
]```
https://learn.microsoft.com/de-de/azure/data-explorer/kusto/management/data-ingestion/ingest-from-storage
.ingest into table MyLogs SourceDataLocator with (
format="csv",
ingestionMappingReference = "MyLogs_CSV_Mapping"
)
Hopefully this will help a bit :)
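A mapping like this is easy to get wrong by hand, so one option is to generate the JSON from a single column list. The sketch below is plain Python that only builds the mapping text; it does not call any Kusto SDK. The helper name is hypothetical, and the column types follow the answer's mapping except that Level is typed as string, which is what the .create table command in the question declares.

```python
import json

# (name, datatype) pairs in file order. Level is "string" to match the table
# definition in the question; ProcessId is kept as "long" as in the answer.
COLUMNS = [
    ("Level", "string"),
    ("Timestamp", "datetime"),
    ("UserId", "string"),
    ("TraceId", "string"),
    ("Message", "string"),
    ("ProcessId", "long"),
]


def build_csv_mapping(columns):
    """Build the mapping entries, assigning ordinals by position."""
    return [
        {"Name": name, "datatype": datatype, "Ordinal": ordinal}
        for ordinal, (name, datatype) in enumerate(columns)
    ]


mapping = build_csv_mapping(COLUMNS)
print(json.dumps(mapping, indent=2))
```

The printed JSON can be pasted between the triple backticks of the .create table ... ingestion csv mapping command.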

Error in U-SQL Job on Azure Data Lake

I have lots of JSON files in my Azure Data Lake account. They are organized as: Archive -> Folder 1 -> JSON files.
What I want to do is extract a particular field, timestamp, from each JSON file and then put it in a CSV file.
I started with this script:
CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json] FROM "correct_path/Assemblies/JSON/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats] FROM "correct_path/Assemblies/JSON/Microsoft.Analytics.Samples.Formats.dll";
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
DECLARE @INPUT_FILE string = @"correct_path/Tracking_3e9.json";
//Extract the different properties from the Json file using a JsonExtractor
@json =
EXTRACT Partition string, Custom string
FROM @INPUT_FILE
USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();
OUTPUT @json
TO "correct_path/Output/simple.csv"
USING Outputters.Csv(quoting : false);
I get error:
E_STORE_USER_FILENOTFOUND: File not found or access denied
But I do have access to the file in the data explorer of the Azure Data Lake, so how can that be?
I don't want to run it for each file one by one. I want to give it all the files in a folder (like Tracking*.json) or a set of folders (like Folder*), and it should go through them and put the output for each file in a single row of the output CSV.
I haven't found any tutorials on this.
Right now I am reading the entire JSON. How do I read just one field, such as a timestamp nested inside another field, like data : {timestamp:"xxx"}?
Thanks for your help.
1) Not sure why you're running into that error without more information - are you specifically missing the input file or is it the assemblies?
2) You can use a fileset to extract data from a set of files. Just use {} to denote the wildcard character in your input string, and then save that character in a new column. So for example, your input string could be @"correct_path/{day}/{hour}/{id}.json", and then your extract statement becomes:
EXTRACT
column1 string,
column2 string,
day int,
hour int,
id int
FROM @input
3) You'll have to read the entire JSON in your SELECT statement, but you can refine it down to only the rows you want in future rowsets. For example:
@refine =
SELECT timestamp FROM @json;
OUTPUT @refine
...
It sounds like some of your JSON data is nested however (like the timestamp field). You can find information on our GitHub (Using the JSON UDFs) and in this blog for how to read nested JSON data.
Hope this helps, and please let me know if you have additional questions!
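To make the fileset idea from point 2 more concrete, here is a rough local emulation in Python of how a pattern with {placeholders} can be matched against file paths and turned into extra column values. This is only an illustration of the concept, not how U-SQL implements filesets; the function names and the sample path are made up, and the values stay as strings here, whereas the EXTRACT above declares them as int.

```python
import re


def fileset_to_regex(pattern):
    """Turn 'root/{day}/{hour}/{id}.json' into a regex with one named group
    per {placeholder}."""
    # re.split with a capture group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)  # literal path text
        else:
            regex += f"(?P<{part}>[^/]+)"  # one path segment per placeholder
    return re.compile(f"^{regex}$")


def virtual_columns(pattern, path):
    """Return the placeholder values for a matching path, or None."""
    match = fileset_to_regex(pattern).match(path)
    return match.groupdict() if match else None


pattern = "correct_path/{day}/{hour}/{id}.json"
print(virtual_columns(pattern, "correct_path/22/07/42.json"))
print(virtual_columns(pattern, "correct_path/22/42.json"))
```

Paths that do not fit the pattern are simply left out, which is why a single script can cover a whole folder tree.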

Copy String by Azure Data Factory Failed

I am trying to do the following:
With Azure Data Factory, a pipeline copies a string from a JSON file in Blob Storage to Azure SQL.
I am facing the following problem:
The string copied to Azure SQL is displayed as "???" while the original string is "圃場1" (ASCII format).
How do I copy the original string to Azure SQL correctly? (Maybe I need to set up an encoding format within the LinkedService file.)
You have to set the correct encoding in the input dataset of your pipeline. You can do this in the format property, with type TextFormat and encodingName. Read more about these properties here: https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-blob-storage#dataset-properties
Your linked service is working fine, as you can get data from your blob storage so no need to change that.
Your format json would look something like this:
"format": {
"type": "TextFormat",
"encodingName": "gb2312"
}
In this example I used gb2312 because I think those characters are Chinese, but I'm not really sure. You can check other encodings here: https://msdn.microsoft.com/library/system.text.encoding.aspx
Also reading this might be useful, to get to know a bit more about other text format properties: https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#text-format
Hope this helped! :)
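As background on why question marks appear at all, the following Python sketch shows the general mechanism: when text is converted to a character set that cannot represent some characters, each of those characters is typically replaced with "?". It is a standalone illustration and makes no claim about which step of the Data Factory pipeline performs the lossy conversion.

```python
original = "圃場1"

# Encoding to a character set that lacks these characters substitutes "?"
# for each one that cannot be represented.
lossy = original.encode("ascii", errors="replace").decode("ascii")

# A round trip through a matching encoding preserves the text.
utf8_bytes = original.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# Decoding the same bytes with the wrong encoding gives garbled text instead.
wrong = utf8_bytes.decode("latin-1")

print(lossy, round_trip, wrong != original)
```

Once characters have been replaced by "?", the original text cannot be recovered, so the encoding has to be right at the point where the data is read and written.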

Creating a Hive schema on the raw data not working as expected

I am trying to teach myself Hive and familiarise myself with HDInsight and HDFS on Microsoft Azure using the tutorial From Raw Data to Insight using HDP and Microsoft Business Intelligence.
I have managed to stage the data on HDFS and am now using Azure PowerShell and the Microsoft Azure HDInsight Query Console to create a Hive schema on the raw data.
I am trying to use the DDL statement below to create the table 'price_data':
create external table price_data (stock_exchange string, symbol string, trade_date string, open float, high float, low float, close float, volume int, adj_close float)
row format delimited
fields terminated by ','
stored as textfile
location '/nyse/nyse_prices';
The blob files are located in the container "nyse" and each blob file within the container is named 'nyse_prices/NYSE_daily_prices_.csv'.
I have made sure that the format conforms to Processing data with Hive documentation on MSDN.
When I run the above query it executes successfully and creates the table.
The external table should be pointing to the underlying files and should therefore be populated with the data in each csv file.
However when I run the query:
select count(*) from price_data
It returns 0, which is not correct. Can someone let me know what I am doing wrong here?
Cheers
I think the location you are specifying may be incorrect.
You have a default container, which is the container you specify or create when creating an HDInsight cluster. For example, 'mycontainer'. If I put all the csv files in that container as nyse_prices/filename.csv, then my location would be just '/nyse_prices', that is, only the directory that contains the files. The container is treated as the root ('/') in this case.
If the files are not in the default container, or on a different storage account, you can use a location of 'wasb://container@storagename.blob.core.windows.net/nyse_prices'.
As a test, I just created nyse_prices/ on my default container and uploaded some of the csv's to it. Then modified your query to use location '/nyse_prices'; and was able to do selects against the data after that.
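The two location forms described in this answer can be summarised in a small Python helper. It is only a sketch for building the string to paste into the LOCATION clause; the function name is hypothetical and 'storagename' stands in for a real storage account.

```python
def hive_location(directory, container=None, storage_account=None):
    """Return a LOCATION value for a directory of blobs.

    With no container given, the default container acts as the root ('/').
    Otherwise a full wasb:// URI is built for the named container and account.
    """
    directory = directory.strip("/")
    if container is None and storage_account is None:
        return f"/{directory}"
    if not container or not storage_account:
        raise ValueError("container and storage_account must be given together")
    return f"wasb://{container}@{storage_account}.blob.core.windows.net/{directory}"


# Files in the default container under nyse_prices/
print(hive_location("nyse_prices"))
# Files in a separate container named 'nyse'
print(hive_location("nyse_prices", "nyse", "storagename"))
```

For the question above, where the blobs live in a container called "nyse", the second form is the one that would apply unless "nyse" is also the cluster's default container.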
