How to use trino to get prefix path in GCS - presto

I'm using Trino with Hive+GCS connector to access my JSON files inside GCS. If I direct the external table to the path with the JSON it works, like:
create table transactions(
date DATE,
transaction1 DOUBLE,
TransactionType VARCHAR(255)
) WITH (
external_location = 'gs://bucket/transaction_type/date',
format = 'PARQUET'
);
But I would like to point the table only at transaction_type and have it read the files inside the date "folders" recursively.
I know that GCS does not treat what is inside a bucket as real folders, and I believe that's the problem, but I don't know how to work around it.
Has anyone done something similar?

You should set the hive.recursive-directories property in the catalog file hive.properties.
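As a minimal sketch of that catalog setting, the fragment below assumes the catalog file is named hive.properties, as in the answer, and that the property takes a boolean value. The rest of the catalog configuration (connector name, metastore, GCS credentials) is left out because the question does not show it.

```properties
# Catalog file hive.properties (other existing settings omitted).
# Assumption: "true" enables reading from subdirectories of the table location.
hive.recursive-directories=true
```

With this in place, the external_location in the table definition could point at 'gs://bucket/transaction_type' instead of one date directory.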

Related

KUSTO split txt when ingesting

I created a table in my Azure Data Explorer with the following command:
.create table MyLogs ( Level:string, Timestamp:datetime, UserId:string, TraceId:string, Message:string, ProcessId:int32 )
I then created my Storage Account --> Container and uploaded a simple txt file with the following content:
Level Timestamp UserId TraceId Message ProcessId
I then generated a SAS for the container holding that txt file and used it in the query section of my Azure Data Explorer like the following:
.ingest into table MyLogs (
h'...sas for my txt file ...')
Now, when I read the table I see something like this:
Level TimeStamp UserId TraceID MEssage ProcessId
Level Timestamp UserId TraceId Message ProcessId
So it basically put all the content into the first column.
I was expecting some automatic splitting. I tried with tab, spaces, commas and many other separators.
I tried to configure an ingestion mapping with csv format but had no luck.
From what I understood, each new line in the txt is a new row in the table. But how do I split a single line on a specific separator?
I read many pages of documentation but had no luck.
You can specify any of the formats that you want to try using the format argument; see the list of supported formats and the ingestion command syntax example that specifies the format in the documentation.
In addition, you can use the "one click ingestion" from the web interface.
This should work (I have done it before with Python SDK)
.create table MyLogs ingestion csv mapping 'MyLogs_CSV_Mapping' ```
[
{"Name":"Level","datatype":"string","Ordinal":0},
{"Name":"Timestamp","datatype":"datetime","Ordinal":1},
{"Name":"UserId","datatype":"string","Ordinal":2},
{"Name":"TraceId","datatype":"string","Ordinal":3},
{"Name":"Message","datatype":"string","Ordinal":4},
{"Name":"ProcessId","datatype":"int","Ordinal":5}
]```
https://learn.microsoft.com/de-de/azure/data-explorer/kusto/management/data-ingestion/ingest-from-storage
.ingest into table MyLogs SourceDataLocator with (
format="csv",
ingestionMappingReference = "MyLogs_CSV_Mapping"
)
Hopefully this will help a bit :)

AWS Lambda Nodejs: Get all objects created in the last 24hours from a S3 bucket

I have a requirement whereby I need to convert all the JSON files in my bucket into one newline-delimited JSON file for a 3rd party to consume. However, I need to make sure that each newly created newline-delimited JSON file only includes files that were received in the last 24 hours, in order to avoid picking up the same files over and over again. Can this be done inside the s3.getObject(getParams, function(err, data) function? Any advice regarding a different approach is appreciated.
Thank you
You could try the S3 ListObjects operation and filter the result by the LastModified metadata field. For new objects, LastModified holds the time the file was created; for changed files, it holds the time of the last modification.
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#listObjectsV2-property
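The filtering step described above can be sketched as follows. This is a Python illustration of the logic only, not the Node.js Lambda code from the question; the function names and the sample listing are hypothetical, and the entries are assumed to carry a Key and a timezone-aware LastModified, as the items in a ListObjectsV2 response do.

```python
import json
from datetime import datetime, timedelta, timezone


def keys_modified_since(objects, now, window=timedelta(hours=24)):
    """Return the keys whose LastModified falls within `window` before `now`."""
    cutoff = now - window
    return [obj["Key"] for obj in objects if obj["LastModified"] >= cutoff]


def to_ndjson(documents):
    """Serialize each document on its own line (newline-delimited JSON)."""
    return "\n".join(json.dumps(doc) for doc in documents)


# Hypothetical listing entries, shaped like the items of a ListObjectsV2 response.
now = datetime(2019, 8, 29, 12, 0, tzinfo=timezone.utc)
listing = [
    {"Key": "reports/old.json", "LastModified": now - timedelta(hours=30)},
    {"Key": "reports/new.json", "LastModified": now - timedelta(hours=2)},
]

recent = keys_modified_since(listing, now)
print(recent)
print(to_ndjson([{"id": 1}, {"id": 2}]))
```

In a real function the listing would come from the S3 API, which returns at most 1000 keys per call, so the filter would need to run over every page of results.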
There is a more complicated approach using Amazon Athena with AWS Glue, but it requires changing your S3 object keys so that they are split into partitions, with the date as the partition key.
For example:
s3://bucket/reports/date=2019-08-28/report1.json
s3://bucket/reports/date=2019-08-28/report2.json
s3://bucket/reports/date=2019-08-28/report3.json
s3://bucket/reports/date=2019-08-29/report1.json
This approach can be implemented in two ways, depending on your file schema. If all your JSON files share the same format, properties, and schema, you can create a Glue table, add the root reports path as the source for this table, add the date partition value (2019-08-28), and query the data from Amazon Athena with a regular SELECT * FROM reports WHERE date='2019-08-28'. If not, create a Glue crawler with a JSON classifier, which will populate your tables, and then use Athena in the same way to query the data into a combined JSON file.
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-legislators.html

Error in U-SQL Job on Azure Data Lake

I have lots of json files in my Azure Data Lake account. They are organized as: Archive -> Folder 1 -> JSON Files.
What I want to do is extract a particular field, timestamp, from each JSON file and then just put it in a csv file.
My issue is:
I started with this script:
CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json] FROM "correct_path/Assemblies/JSON/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats] FROM "correct_path/Assemblies/JSON/Microsoft.Analytics.Samples.Formats.dll";
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
DECLARE @INPUT_FILE string = @"correct_path/Tracking_3e9.json";
//Extract the different properties from the Json file using a JsonExtractor
@json =
EXTRACT Partition string, Custom string
FROM @INPUT_FILE
USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();
OUTPUT @json
TO "correct_path/Output/simple.csv"
USING Outputters.Csv(quoting : false);
I get error:
E_STORE_USER_FILENOTFOUND: File not found or access denied
But I do have access to the file in the data explorer of the Azure Data Lake, so how can it be?
I don't want to run it for each file one by one. I just want to give it all the files in a folder (like Tracking*.json) or just a bunch of folders (like Folder*) and it should go through them and put the output for each file in a single row of the output csv.
Haven't found any tutorials on this.
Right now, I am reading the entire json, how to read just one field like time stamp which is a field within a particular field, like data : {timestamp:"xxx"}?
Thanks for your help.
1) Not sure why you're running into that error without more information - are you specifically missing the input file or is it the assemblies?
2) You can use a fileset to extract data from a set of files. Just use {} to denote the wildcard character in your input string, and then save that character in a new column. So for example, your input string could be @"correct_path/{day}/{hour}/{id}.json", and then your extract statement becomes:
EXTRACT
column1 string,
column2 string,
day int,
hour int,
id int
FROM @input
3) You'll have to read the entire JSON in your SELECT statement, but you can refine it down to only the rows you want in future rowsets. For example:
@refine=
SELECT timestamp FROM @json;
OUTPUT @refine
...
It sounds like some of your JSON data is nested however (like the timestamp field). You can find information on our GitHub (Using the JSON UDFs) and in this blog for how to read nested JSON data.
Hope this helps, and please let me know if you have additional questions!
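To make the fileset idea from point 2 concrete, here is a rough Python analogy (not U-SQL) of how a pattern such as correct_path/{day}/{hour}/{id}.json maps path segments onto named columns. The helper names are hypothetical, and unlike the EXTRACT statement, which types the columns as int, this sketch returns the captured values as strings.

```python
import re


def fileset_to_regex(pattern):
    """Turn a pattern with {name} placeholders into a regex with named groups."""
    # re.split with a capture group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", pattern)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pieces.append(re.escape(part))         # literal path text
        else:
            pieces.append(f"(?P<{part}>[^/]+)")    # one path segment per placeholder
    return re.compile("^" + "".join(pieces) + "$")


def extract_virtual_columns(pattern, path):
    """Return the placeholder values found in `path`, or None if it does not match."""
    match = fileset_to_regex(pattern).match(path)
    return match.groupdict() if match else None


columns = extract_virtual_columns(
    "correct_path/{day}/{hour}/{id}.json",
    "correct_path/28/13/42.json",
)
print(columns)
```

Every file whose path fits the pattern contributes rows, and the captured segments become extra columns next to the data read from the file itself.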

how to access multiple json files using dataframe from S3

I am using Apache Spark. I want to access multiple JSON files from Spark by date. How can I pick multiple files, i.e. provide a range from files ending with 1034.json up to files ending with 1434.json? I am trying this:
DataFrame df = sql.read().json("s3://..../..../.....-.....[1034*-1434*]");
But I am getting the following error:
at java.util.regex.Pattern.error(Pattern.java:1924)
at java.util.regex.Pattern.range(Pattern.java:2594)
at java.util.regex.Pattern.clazz(Pattern.java:2507)
at java.util.regex.Pattern.sequence(Pattern.java:2030)
at java.util.regex.Pattern.expr(Pattern.java:1964)
at java.util.regex.Pattern.compile(Pattern.java:1665)
at java.util.regex.Pattern.<init>(Pattern.java:1337)
at java.util.regex.Pattern.compile(Pattern.java:1022)
at org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:156)
at org.apache.hadoop.fs.GlobPattern.<init>(GlobPattern.java:42)
at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:67)
Please specify a way out.
You can read the files like this:
sqlContext.read().json("s3n://bucket/filepath/*.json")
Also, you can use wildcards in the file path.
For example:
sqlContext.read().json("s3n://*/*/*-*[1034*-1434*]")
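One caveat on bracket expressions: in Hadoop glob syntax, [...] matches a single character, so a numeric range such as 1034 to 1434 cannot be written inside one pair of brackets. A possible alternative is brace alternation, {a,b,c}, which matches any one of the listed strings. The sketch below builds such a pattern in Python; the helper name is hypothetical, and the bucket and filepath come from the placeholder path in the answer, since the real path is elided in the question.

```python
def suffix_range_glob(prefix, start, end, extension=".json"):
    """Build a glob matching names that end in any number from start to end."""
    if start > end:
        raise ValueError("start must not exceed end")
    alternatives = ",".join(str(number) for number in range(start, end + 1))
    # Doubled braces in the f-string produce the literal { and } of the glob.
    return f"{prefix}*{{{alternatives}}}{extension}"


# Short range for readability; the question's range would be 1034 to 1434.
pattern = suffix_range_glob("s3n://bucket/filepath/", 1034, 1036)
print(pattern)
```

The resulting string can be passed to the json(...) reader in place of the bracketed pattern. For a range as wide as the one in the question the alternation becomes long, so it may be worth checking whether the file names allow a shorter pattern.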

Creating a Hive schema on the raw data not working as expected

I am trying to teach myself Hive and familiarise myself with HDInsight and HDFS on Microsoft Azure using the tutorial From Raw Data to Insight using HDP and Microsoft Business Intelligence.
I have managed to stage the data on HDFS and am now using Azure PowerShell and the Microsoft Azure HDInsight Query Console to create a Hive schema on the raw data.
I am trying to use the DDL statement below to create the table 'price_data':
create external table price_data (stock_exchange string, symbol string, trade_date string, open float, high float, low float, close float, volume int, adj_close float)
row format delimited
fields terminated by ','
stored as textfile
location '/nyse/nyse_prices';
The blob files are located in the container "nyse" and each blob file within the container is named 'nyse_prices/NYSE_daily_prices_.csv'.
I have made sure that the format conforms to Processing data with Hive documentation on MSDN.
When I run the above query it executes successfully and creates the table.
The external table should be pointing to the underlying files and therefore should be populated with the data within each csv file.
However when I run the query:
select count(*) from price_data
It returns 0. This is not correct. Can someone let me know what I am doing wrong here?
Cheers
I think the location you are specifying may be incorrect.
You have a default container, which is the container you specify or create when creating an HDInsight cluster. For example, 'mycontainer'. If I put all the csv files in that container as nyse_prices/filename.csv, then my location would be just '/nyse_prices', that is, just the directory that contains the files. The container is treated as the root ('/') in this case.
If the files are not in the default container, or on a different storage account, you can use a location of 'wasb://container@storagename.blob.core.windows.net/nyse_prices'.
As a test, I just created nyse_prices/ on my default container and uploaded some of the csv's to it. Then modified your query to use location '/nyse_prices'; and was able to do selects against the data after that.
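For files outside the default container, the fully qualified location has the shape wasb://<container>@<account>.blob.core.windows.net/<path>. A small Python sketch of assembling it is shown below; the helper name is hypothetical, the container nyse and the directory nyse_prices come from the question, and storagename is the placeholder account name used in the answer.

```python
def wasb_location(container, storage_account, path, secure=False):
    """Assemble a WASB URI for a directory in an Azure Blob Storage container."""
    scheme = "wasbs" if secure else "wasb"
    # Drop any leading slash so the URI has exactly one separator before the path.
    return f"{scheme}://{container}@{storage_account}.blob.core.windows.net/{path.lstrip('/')}"


location = wasb_location("nyse", "storagename", "/nyse_prices")
print(location)
```

The resulting string would go into the location clause of the create external table statement in place of '/nyse/nyse_prices'.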
