I uploaded parquet files to blob storage and created a data asset via the Azure ML GUI. The steps are precise and clear and the outcome is as desired. For future use I would like to use the CLI to create the data asset and new versions of it.
The base command would be az ml data create -f <file-name>.yml. The docs provide a minimal example of an MLTable file, which should reside next to the parquet files.
# directory in blobstorage
├── data
│   ├── MLTable
│   ├── file_1.parquet
│   ├── ...
│   └── file_n.parquet
I am still not sure how to properly specify those files in order to create a tabular dataset with column conversion.
Do I need to specify the full path or the pattern in the yml file?
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
type: mltable
name: Test data
description: Basic example for parquet files
path: azureml://datastores/workspaceblobstore/paths/*/*.parquet # pattern or path to dir?
I am even more confused about the MLTable file:
type: mltable
paths:
  - pattern: ./*.parquet
transformations:
  - read_parquet:
      # what comes here?
E.g. I have a column with dates in the format %Y-%m-%d %H:%M:%S which should be converted to a timestamp. (I can provide this information in the GUI, at least!)
Any help on this topic, or links to documentation I may have overlooked, would be great.
A working MLTable file to convert string columns from parquet files looks like this:
---
type: mltable
paths:
  - pattern: ./*.parquet
transformations:
  - read_parquet:
      include_path_column: false
  - convert_column_types:
      - columns: column_a
        column_type:
          datetime:
            formats:
              - "%Y-%m-%d %H:%M:%S"
  - convert_column_types:
      - columns: column_b
        column_type:
          datetime:
            formats:
              - "%Y-%m-%d %H:%M:%S"
(By the way, at the time of writing, specifying multiple columns as an array did not work, e.g. columns: [column_a, column_b].)
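If the YAML array syntax keeps fighting you, the same transformation can be prototyped with the mltable Python SDK, which accepts a plain dict of column names. This is only a minimal sketch, assuming the mltable package is installed and the script runs from the folder that contains the parquet files; column_a and column_b are the placeholder column names from above:

import mltable
from mltable import DataType

# Same glob pattern as in the MLTable file above
paths = [{"pattern": "./*.parquet"}]
tbl = mltable.from_parquet_files(paths, include_path_column=False)

# convert_column_types takes a dict, so several columns can be converted in one call
tbl = tbl.convert_column_types({
    "column_a": DataType.to_datetime(formats=["%Y-%m-%d %H:%M:%S"]),
    "column_b": DataType.to_datetime(formats=["%Y-%m-%d %H:%M:%S"]),
})

df = tbl.to_pandas_dataframe()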
To perform this operation, we first need to check the installation and requirements for the experiment: a valid subscription and an Azure ML workspace.
Install the required mltable library.
There are 4 different supported path types as parameters in Azure ML:
• Local computer path
• Path on a public server (HTTP/HTTPS)
• Path on Azure Storage (blob, as in this case)
• Path on a datastore
Create a YAML file in the folder that was created as an asset.
The filename can be anything (filename.yml).
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
type: uri_folder
name: <name_of_data>
description: <description goes here>
path: <path>
To create the data asset using the CLI:
az ml data create -f filename.yml
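For completeness: if you would rather do this from Python than from the CLI, the azure-ai-ml (v2) SDK can register the same asset. A rough sketch with placeholder subscription and workspace values:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# Placeholder workspace details -- replace with your own values
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription_id>",
    resource_group_name="<resource_group>",
    workspace_name="<workspace_name>",
)

# Equivalent of the uri_folder YAML above
my_data = Data(
    path="<path>",
    type=AssetTypes.URI_FOLDER,
    name="<name_of_data>",
    description="<description goes here>",
)
ml_client.data.create_or_update(my_data)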
To create a specific file as the data asset
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
# Supported paths include:
# local: ./<path>/<file>
# blob: https://<account_name>.blob.core.windows.net/<container_name>/<path>/<file>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>/<file>
type: uri_file
name: <name>
description: <description>
path: <uri>
Adjust the paths to match your own storage account and workspace.
To create an MLTable file as the data asset, create a YAML file with a pattern like the one below, adapted to your data:
type: mltable
paths:
  - pattern: ./*.filetypeextension
transformations:
  - read_delimited:
      delimiter: ','
      encoding: ascii
      header: all_files_same_headers
Use the Python code below to load the MLTable:
import mltable

# Load the MLTable file from the ./data folder and read it into a pandas DataFrame
table1 = mltable.load(uri="./data")
df = table1.to_pandas_dataframe()
To create the MLTable data asset, use the YAML below.
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
# path must point to the **folder** containing the MLTable artifact (MLTable file + data)
# Supported paths include:
# blob: https://<account_name>.blob.core.windows.net/<container_name>/<path>
type: mltable
name: <name_of_data>
description: <description goes here>
path: <path>
Blob storage is the storage mechanism in the current requirement.
The same command as before is then used to create a data asset of type MLTable:
az ml data create -f <file-name>.yml
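After registration, the asset can also be consumed from Python; a hedged sketch assuming the azure-ai-ml and mltable packages, with placeholder workspace details and asset name/version:

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Placeholder workspace details -- replace with your own values
ml_client = MLClient(DefaultAzureCredential(), "<subscription_id>", "<resource_group>", "<workspace_name>")

# Fetch the registered MLTable asset; data_asset.path points at the folder
# that holds the MLTable file, so mltable.load can read it directly
data_asset = ml_client.data.get(name="<name_of_data>", version="1")
tbl = mltable.load(data_asset.path)
df = tbl.to_pandas_dataframe()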
Related
I am trying to set up a data flow to de-duplicate data, but there seems to be a simple error that I cannot fix. The dataset is set up like so, and the preview works correctly (updated: file path and imported schema).
The data flow is set up like this:
However, I get this error when trying Data preview (updated: got the columns from importing the schema in the dataset).
I have tried editing the wildcard paths in the source options, and so far that has not worked.
I tried:
daily/*csv and *.csv
The structure of the blob account looks like this:
Source options from the answer given:
Still the error: "Path does not resolve to any file(s). Please make sure the file/folder exists and is not hidden. At the same time, ensure special character is not included in file/folder name, for example, name start with _"
Each directory is from an Azure export and creates its own monthly folder. This works in Data Factory, but each CSV has month-to-date cost, so I was trying to use data flows to take only the new data instead of all of the data with duplicates.
You can use the wildcard path below to get the files of the required type.
Input folder path:
Azure data flow:
Source dataset
Source transformation: In source options provide the wildcard path to get the files of the required extension type. I have also included columns to store filenames to verify the data from all the files.
Wildcard paths: daily/*.csv
Import projections and preview data.
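If the wildcard still resolves to no files, it can help to list what is actually in the container from outside ADF and compare it with the wildcard. A hypothetical check using the azure-storage-blob SDK; the connection string and container name are placeholders:

from azure.storage.blob import ContainerClient

# Placeholder connection details
container = ContainerClient.from_connection_string(
    conn_str="<storage_connection_string>",
    container_name="<container_name>",
)

# List every blob under daily/ and keep the CSVs; compare this with what the
# wildcard path is expected to match
csv_blobs = [
    blob.name
    for blob in container.list_blobs(name_starts_with="daily/")
    if blob.name.endswith(".csv")
]
print(csv_blobs)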
I connect to an Elasticsearch instance through Spark code, which requires passing the truststore file location and keystore file location while instantiating the Spark session, as below:
.config("es.net.ssl.keystore.location", truststore_location)
.config("es.net.ssl.keystore.pass", truststore_password)
.config("es.net.ssl.truststore.location", truststore_location)
.config("es.net.ssl.truststore.pass", truststore_password)
I do have a file location, but the challenge here is that the value in the truststore.jks file is basically the base64-encoded version of the original value. This was done when the ask was to copy the truststore.jks content and upload it as a secret in the Kubernetes pod.
I extracted that value with:
cat truststore.jks | base64
Now, when the file location is passed to the Spark session builder, it gives an invalid format error, which is expected. So is there any way I can extract the value, decode it, and then pass the value itself rather than a location?
Below is the way I defined the volume and volume mount for this:
volumes:
  - name: elasticsearch-truststore
    secret:
      secretName: "env-elasticsearch-truststore"
      items:
        - key: truststore.jks
          path: truststore.jks
volumeMounts:
  - name: elasticsearch-truststore
    mountPath: /etc/encrypted
If anyone can suggest any other way I can approach this issue, that would be great.
Thanks
There is a problem with the secret object you've created. The encoded value is only relevant while the object exists as a manifest in the etcd database of the Kubernetes API server. The encoding has no effect on the actual contents of the secret.
I think what caused this is that you encoded the contents and then created a secret from the already-encoded contents, which is what you're observing here.
A simple fix would be to delete the existing secret and create a new secret simply from your truststore.jks file, as follows:
kubectl create secret generic env-elasticsearch-truststore --from-file=truststore.jks=/path/to/truststore
This will create a secret named env-elasticsearch-truststore containing one key, truststore.jks, whose value is the contents of the /path/to/truststore file.
You can then use this secret as a file by mounting it in your pod; the specification will look like this:
...
volumes:
  - name: elasticsearch-truststore
    secret:
      secretName: env-elasticsearch-truststore
volumeMounts:
  - name: elasticsearch-truststore
    mountPath: "/etc/encrypted"
...
This will ensure that the file truststore.jks will be available at the path /etc/encrypted/truststore.jks and will contain the contents of the original truststore.jks file.
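With the secret mounted like this, the Spark session can point directly at the decoded file instead of a base64 string. Below is a sketch along the lines of the original builder; how the password is supplied (environment variable, another secret) is an assumption, and the env var name is a placeholder:

import os
from pyspark.sql import SparkSession

# file:// URI for the truststore mounted by the secret volume at /etc/encrypted
truststore_location = "file:///etc/encrypted/truststore.jks"
# Assumed to come from an environment variable or another secret
truststore_password = os.environ.get("ES_TRUSTSTORE_PASSWORD", "")

spark = (
    SparkSession.builder
    .appName("es-connect")
    .config("es.net.ssl", "true")
    # Mirrors the original builder, which reuses the same file for both stores
    .config("es.net.ssl.keystore.location", truststore_location)
    .config("es.net.ssl.keystore.pass", truststore_password)
    .config("es.net.ssl.truststore.location", truststore_location)
    .config("es.net.ssl.truststore.pass", truststore_password)
    .getOrCreate()
)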
I'm trying to store a list of the file names within an Azure Blob container into a SQL DB. The pipeline runs successfully, but it does not output the values (file names) into the sink database, and the sink table doesn't get updated even after the pipeline completes. The following are the steps I went through to implement the pipeline; I wonder in which step I made a mistake.
I have followed the solutions given in the following links as well:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview#add-additional-columns-during-copy
Transfer the output of 'Set Variable' activity into a json file [Azure Data Factory]
Steps:
1- Validate the file exists, get the files' metadata and child items, and iterate over the files with a ForEach.
2- Variable defined at the pipeline level to hold the filenames
Variable Name: Files, Type: string
3- parameter defined to dynamically specify the dataset directory name. Parameter name: dimName, parameter type: string
4- Get Metadata configurations
5- Foreach settings
@activity('MetaGetFileNames').output.childItems
6- ForEach Activity overview. A Set Variable activity to set each filename into the defined variable 'Files', and a Copy Activity to store the set value into the DB.
7- set variable configuration
8- Copy Activity source configuration. Excel Dataset refers to an empty excel file in azure blob container.
9- Copy Activity sink configuration
10-Copy Activity: mapping configuration
Instead of selecting an empty excel file, refer to a dummy excel file with dummy data.
Source: dummy excel file
You can skip the Set Variable activity, as you can use the ForEach current item (e.g. @item().name) directly in the Additional column dynamic expression.
Add additional columns in the Mapping.
Sink results in SQL database.
I have a zip file. I would like to uncompress the file, get the CSV file, and push it to the blob. I can achieve this with .gz, but with .zip files we are not able to.
Could you please assist here?
Thanks
Richard
You could set the Binary format as the source and sink dataset in an ADF copy activity. Select the compression type ZipDeflate, following this link: https://social.msdn.microsoft.com/Forums/en-US/a46a62f2-e211-4a5f-bf96-0c0705925bcf/working-with-zip-files-in-azure-data-factory
Source:
Sink:
Test result in sink path:
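If you would rather handle the decompression outside ADF (for example in an Azure Function or a small script), the same result can be sketched with plain Python; the connection string, container name, and zip file name below are placeholders:

import zipfile
from azure.storage.blob import BlobServiceClient

# Placeholder connection details
service = BlobServiceClient.from_connection_string("<storage_connection_string>")
container = service.get_container_client("<target_container>")

# Extract every CSV from the local zip and upload it under the same name
with zipfile.ZipFile("input.zip") as archive:
    for name in archive.namelist():
        if name.lower().endswith(".csv"):
            with archive.open(name) as csv_file:
                container.upload_blob(name=name, data=csv_file, overwrite=True)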
I am looking to validate the list of filenames present in a control file with the structure mentioned below, and to check whether those files are present in the folder, using Azure Data Factory.
Control File Structure: SerialNo, FileName, RecordCount.
Folder Path: companysftp.xyz.io
So, for example, if the control file contains:
1 data.csv 124
2 productdetails.csv 50
We need to check whether data.csv and productdetails.csv are present in the folder path mentioned above.
Thanks in advance.
Kind regards,
Arjun Rathinam
Data Factory doesn't support control files. Ref: Supported data stores.
In Data Factory, only Get Metadata can help us list all the filenames. See: Supported connectors.
Get Metadata gets all the file names from the source folder.
Then an If Condition inside the ForEach checks whether the filenames data.csv and productdetails.csv exist.
For example:
1. Get all the file names in the container backup:
2. ForEach items settings: send all the filenames to the ForEach:
@activity('Get Metadata1').output.childItems
3. Use an If Condition expression to filter the filenames:
@or(equals(item().name,'data.csv'),equals(item().name,'productdetails.csv'))
This expression checks whether the name equals data.csv or productdetails.csv and returns true/false.
4. Then you can add the activities to the True condition and the False condition.
Hope this helps.
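For reference, the comparison itself is simple to express in code; a small illustration with hard-coded names (in practice the first set would come from the control file and the second from the folder listing, e.g. the Get Metadata childItems):

# Names listed in the control file (SerialNo, FileName, RecordCount)
control_file_names = {"data.csv", "productdetails.csv"}

# Names actually found in the folder
folder_names = {"data.csv", "otherfile.csv"}

missing = control_file_names - folder_names
print("Missing files:", sorted(missing) if missing else "none")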