Azure ML Dataset Versioning: What is Different if it Points to the Same Data?

Context
In AzureML, we are facing an error when running a pipeline. It fails on to_pandas_dataframe with the error that a particular dataset "could not be read beyond end of stream". On its own, this seems to be an issue with the parquet file that is being registered, perhaps special characters being misinterpreted.
However, when we explicitly load a previous "version" of this Dataset, which points to the exact same location of data, it works as expected. In the documentation (here), Azure says that "when you load data from a dataset, the current data content referenced by the dataset is always loaded." This makes me think that a new version of the dataset with the same schema should behave, well, the same.
Questions
What makes a Dataset version different from another version when both point to the same location? Is it only the schema definition?
Based on these differences, is there a way to figure out why one version would be succeeding and another failing?
Attempts
The schemas of the two versions are identical. We can profile both in AzureML, and all the fields have the same profile information.
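For reference, each version can be loaded and inspected explicitly like this (a minimal sketch assuming azureml-core and a configured workspace; the dataset name and version numbers are placeholders):

from azureml.core import Dataset, Workspace

ws = Workspace.from_config()

# Load two versions of the same registered dataset explicitly.
v_old = Dataset.get_by_name(ws, name="my_dataset", version=1)
v_new = Dataset.get_by_name(ws, name="my_dataset", version=2)

# Each version carries the definition captured at registration time
# (source path, parse settings), which is worth comparing, since that
# definition is what can differ even when the underlying files are identical.
print(v_old)
print(v_new)

df_old = v_old.to_pandas_dataframe()  # succeeds
df_new = v_new.to_pandas_dataframe()  # fails with "could not be read beyond end of stream"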

As rightly suggested by Anand Sowmithiran in the comment section, this looks more like a bug with the SDK.
You can raise an Azure support ticket.

Related

lakeFS, Hudi, Delta Lake merge and merge conflicts

I'm reading the lakeFS documentation and right now don't clearly understand what a merge, or even a merge conflict, is in terms of lakeFS.
Let's say I use Apache Hudi for ACID support over a single table. I'd like to introduce multi-table ACID support and for this purpose would like to use lakeFS together with Hudi.
If I understand everything correctly, lakeFS is a data-agnostic solution and knows nothing about the data itself. lakeFS only establishes boundaries (version control) and somehow moderates concurrent access to the data.
So the reasonable question is: if lakeFS is data agnostic, how does it support the merge operation? What does a merge itself mean in terms of lakeFS? And is it possible to have a merge conflict there?
You do understand everything correctly. You can see on the branching model page that lakeFS is currently data agnostic and relies simply on the hierarchical directory structure. A conflict occurs when two branches update the same file.
This behavior fits most data engineers' CI/CD use cases.
If you are working with Delta Lake and made changes to the same table from two different branches, there will still be a conflict, because both branches changed the log file. To resolve the conflict you would need to forgo one of the change sets.
Admittedly this is not the best user experience, and it's currently being worked on. You can read more about it in the roadmap documentation.

Create Data Catalog column tags by inspecting BigQuery data with Cloud Data Loss Prevention

I want to use DLP to inspect my tables in BigQuery, and then write the findings to policy tags on the columns of the table. For example, I have a (test) table that contains data including an email address and a phone number for individuals. I can use DLP to find those fields and identify them as emails and phone numbers, and I can do this in the console or via the API (I'm using NodeJS). When creating this inspection job, I know I can configure it to automatically write the findings to the Data Catalog, but this generates a tag on the table, not on the columns. I want to tag the columns with the specific type of PII that has been identified.
I found this tutorial that appears to achieve exactly that - but tutorial is a strong word; it's a script written in Java and a basic explanation of what that script does, with the only actual instructions being to clone the git repo and run a few commands. There's no information about which API calls are being made, not a lot of comments in the code, and no links to pertinent documentation. I have zero experience with Java, so I'm not able to work out the process and translate it into NodeJS for my own purposes.
I also found this similar tutorial which also utilises Dataflow, and again the instructions are simply "clone this repo, run this script". I've included the link because it features a screenshot showing what I want to achieve: tagging columns with PII data found by DLP
So, what I want to do appears to be possible, but I can't find useful documentation anywhere. I've been through the DLP and Data Catalog docs, and through the API references for NodeJS. If anyone could help me figure out how to do this, I'd be very grateful.
UPDATE: I've made some progress and changed my approach as a result.
DLP provides two methods to inspect data: dlp.inspectContent() and dlp.createDlpJob(). The latter takes a storageItem which can be a BigQuery table, but it doesn't return any information about the columns in the results, so I don't believe I can use it.
inspectContent() cannot be run on a BigQuery table directly; it can inspect structured text, which is what the Java script I linked above utilises: that script queries the BigQuery table, constructs a Table from the results, then passes that Table into inspectContent(), which returns a Findings object containing field names. I want to do exactly that, but in NodeJS. I'm struggling to convert the BigQuery results into the format of a Table, as NodeJS doesn't appear to have a constructor for that type the way Java does.
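For reference, the structure involved here, a table-shaped item going into inspectContent and per-column findings coming back, looks roughly like the sketch below. It uses the Python DLP client purely to illustrate the request shape (the NodeJS client takes the same underlying structure); the project ID, headers, and values are placeholders.

from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project

# A small table built from query results: column headers plus rows of values.
item = {
    "table": {
        "headers": [{"name": "email"}, {"name": "phone"}],
        "rows": [
            {"values": [{"string_value": "jane@example.com"},
                        {"string_value": "555-0100"}]},
        ],
    }
}

response = client.inspect_content(
    request={
        "parent": parent,
        "inspect_config": {
            "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
        },
        "item": item,
    }
)

# For table items, each finding points back at the column it came from.
for finding in response.result.findings:
    column = finding.location.content_locations[0].record_location.field_id.name
    print(column, finding.info_type.name)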
I was unable to find Node.js documentation for implementing column-level tags.
However, you might find the official Policy Tags documentation helpful to point you in the right direction. Specifically, you might be lacking some roles needed to manage column-level tags.

AWS Athena row cast fails when key is a reserved keyword despite double quotes

I'm working with data in AWS Athena, and I'm trying to match the structure of some input data. This involves a nested structure where "from" is a key. This consistently throws errors.
I've narrowed the issue down to the fact that Athena queries don't work when you try to use reserved keywords as keys in rows. The following examples demonstrate this behavior.
This simple case, SELECT CAST(ROW(1) AS ROW("from" INTEGER)), fails with the following error: GENERIC_INTERNAL_ERROR: Unable to create class com.facebook.presto.execution.TaskInfo from JSON response: [io.airlift.jaxrs.JsonMapperParsingException: Invalid json for Java type
This simple case runs successfully: SELECT CAST(ROW(1) AS ROW("work" INTEGER))
The Athena documentation says to enclose reserved keywords in double quotes to use them in SELECT statements, but the examples above show that queries still fail when using keywords as keys in rows.
I know I have other options, but this way is by far the most convenient. Is there a way to use reserved keywords in this scenario?
As Piotr mentions in a comment, this is a Presto bug, and given that it was posted just days ago it's unlikely to be fixed in Athena anytime soon. When the bug is fixed in Presto it might find its way into Athena; I know the Athena team sometimes applies upstream patches even though Athena is based on an old version of Presto. This one might not be significant enough to appear on their radar, but if you open a support ticket with AWS it might happen (be sure to make clear that you don't need a workaround and just want to report a bug, otherwise you'll have a support person spending way too much time trying to help you turn things off and on again).

Google Datastore returns incomplete data via official client library for nodejs

Here is some information about the context of the problem I'm facing:
We have semi-structured data (JSON from a Node.js backend) in Datastore.
After saving an entity
and getting a list of entities shortly afterwards, and even a while later,
the returned data does not have one indexed property,
yet I can find the entity by that property's value.
I use Google Datastore via the Node.js client library, @google-cloud/datastore: "^2.0.0".
How can this be possible? I understand that, due to eventual consistency, some updates can appear incomplete for a while. But why am I getting the same inconsistency, a whole property missing from an entity saved, e.g., an hour ago?
I have gone through this scenario multiple times for the same kind.
I do not have such issues with other kinds or with other properties of that kind.
How can I avoid this type of issue with Google Datastore?
An answer for anyone who may run into such an issue.
We mostly do not use DTOs (data-transfer objects) or any other wrappers for our kinds in this project, but for this one a DTO was used, mostly to be sure the result objects have default values for properties omitted or absent in the entity, which usually happens for entities created by an older version of the code.
After reviewing my own code more carefully, I found a piece of code that was out of sync with the other related pieces: there was no line to copy this property from the entity to the DTO object.
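To make the failure mode concrete, the pattern was roughly the following (a hypothetical Python sketch with made-up property names, not the actual project code):

# The entity as stored (and indexed) in Datastore.
entity = {"name": "Alice", "status": "active", "category": "vip"}

def to_dto(entity):
    # Map the raw entity to a result object with defaults for absent properties.
    return {
        "name": entity.get("name", ""),
        "status": entity.get("status", "new"),
        # Bug: "category" was never copied here, so every result object appeared
        # to lack the property even though it was stored and indexed.
    }

print(to_dto(entity))  # {'name': 'Alice', 'status': 'active'} -- no 'category'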
Side note: this whole situation reminds me of the story (or meme) about a guy who claimed he had found a bug in the compiler just because he could not find the mistake in his own code.

How to apply a dprep package to incoming data in score.py Azure Workbench

I had been wondering if it were possible to apply "data preparation" (.dprep) files to incoming data in score.py, similar to how Pipeline objects may be applied. This would be very useful for model deployment. To find out, I asked this question on the MSDN forums and received a response confirming it was possible, but with little explanation about how to actually do it. The response was:
in your score.py file, you can invoke the dprep package from Python
SDK to apply the same transformation to the incoming scoring data.
make sure you bundle your .dprep file in the image you are building.
So my questions are:
What function do I apply to invoke this dprep package?
Is it: run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) ?
How do I bundle it into the image when creating a web-service from the CLI?
Is there a switch to -f for score files?
I have scanned through the entire documentation and Workbench Repo but cannot seem to find any examples.
Any suggestions would be much appreciated!
Thanks!
EDIT:
Scenario:
I import my data from a live database and let's say this data set has 10 columns.
I then feature engineer this (.dsource) data set using the Workbench, resulting in a .dprep file which may have 13 columns.
This .dprep data set is then imported as a pandas DataFrame and used to train and test my model.
Now I have a model ready for deployment.
This model is deployed via Model Management to a Container Service and will be fed data from a live database which once again will be of the original format (10 columns).
Obviously this model has been trained on the transformed data (13 columns) and will not be able to make a prediction on the 10 column data set.
What function may I use in the 'score.py' file to apply the same transformation I created in workbench?
I believe I may have found what you need.
From this documentation you would import from the azureml.dataprep package.
There aren't any examples there, but searching on GitHub, I found this file which has the following to run data preparation.
from azureml.dataprep import package
df = package.run('Data analysis.dprep', dataflow_idx=0)
Hope that helps!
To me, it looks like this can be achieved by using the run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) method from the azureml.dataprep.package module.
From the documentation:
run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) runs the specified data flow based on an in-memory data source and returns the results as a dataframe. The user_config argument is a dictionary that maps the absolute path of a data source (.dsource file) to an in-memory data source represented as a list of lists.
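Putting the two answers together, a score.py could apply the transformation roughly like the sketch below. This is only a sketch based on the signature quoted above; the .dsource path, the .dprep file name, and the input format are placeholders, and it has not been verified against a real deployment.

from azureml.dataprep.package import run_on_data

def transform(raw_rows):
    # raw_rows: the incoming records (the original 10 columns) as a list of lists.
    user_config = {
        # Absolute path of the .dsource the package was built against, mapped to
        # the in-memory data source, as described in the documentation quoted above.
        "/var/azureml-app/mydata.dsource": raw_rows,
    }
    # Runs the data flow from the bundled .dprep file and returns a pandas
    # DataFrame with the engineered (13-column) shape the model was trained on.
    return run_on_data(user_config, "myflow.dprep", dataflow_idx=0)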

Resources