I have created an ML model and I want to publish the predictions for the test set to a web page, for better visualization for non-technical team members.
I have combined the predictions into a DataFrame together with the case numbers of the test set and the original target values:
Predictions = pd.DataFrame({'Case.Number': CN_test, 'Org_Data': y_test, 'Predictions': y_pred})
As I am new to this, my experience with APIs is limited to creating a basic "hello world" API.
Requesting guidance on how to do this using an API, or any other way to get this done.
Regards
Sudhir
Since a DataFrame can't be rendered directly, it has to be converted into a list first; below is the code for the same.
I got the solution in another query:
Return Pandas dataframe as JSONP response in Python Flask
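A minimal sketch of that approach (the route name and sample data are illustrative):

from flask import Flask, jsonify
import pandas as pd

app = Flask(__name__)

# Illustrative stand-ins for CN_test, y_test and y_pred from the question
predictions = pd.DataFrame({
    'Case.Number': [101, 102, 103],
    'Org_Data': [1, 0, 1],
    'Predictions': [1, 0, 0],
})

@app.route('/predictions')
def get_predictions():
    # A DataFrame can't be returned directly, so convert it to a list of
    # row dictionaries first; jsonify turns that into a JSON response
    return jsonify(predictions.to_dict(orient='records'))

if __name__ == '__main__':
    app.run(debug=True)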
I want to use DLP to inspect my tables in BigQuery, and then write the findings to policy tags on the columns of the table. For example, I have a (test) table that contains data including an email address and a phone number for individuals. I can use DLP to find those fields and identify them as emails and phone numbers, and I can do this in the console or via the API (I'm using NodeJS). When creating this inspection job, I know I can configure it to automatically write the findings to the Data Catalog, but this generates a tag on the table, not on the columns. I want to tag the columns with the specific type of PII that has been identified.
I found this tutorial that appears to achieve exactly that - but tutorial is a strong word; it's a script written in Java and a basic explanation of what that script does, with the only actual instructions being to clone the git repo and run a few commands. There's no information about which API calls are being made, not a lot of comments in the code, and no links to pertinent documentation. I have zero experience with Java, so I'm not able to work out the process and translate it into NodeJS for my own purposes.
I also found this similar tutorial which also utilises Dataflow, and again the instructions are simply "clone this repo, run this script". I've included the link because it features a screenshot showing what I want to achieve: tagging columns with PII data found by DLP
So, what I want to do appears to be possible, but I can't find useful documentation anywhere. I've been through the DLP and Data Catalog docs, and through the API references for NodeJS. If anyone could help me figure out how to do this, I'd be very grateful.
UPDATE: I've made some progress and changed my approach as a result.
DLP provides two methods to inspect data: dlp.inspectContent() and dlp.createDlpJob(). The latter takes a storageItem which can be a BigQuery table, but it doesn't return any information about the columns in the results, so I don't believe I can use it.
inspectContent() cannot be run on a BigQuery table directly; it inspects structured text, which is what the Java script linked above relies on: that script queries the BigQuery table, constructs a Table object from the results, and passes that Table into inspectContent(), which returns a Findings object containing field names. I want to do exactly that, but in NodeJS. I'm struggling to convert the BigQuery results into the Table format, as NodeJS doesn't appear to have a constructor for that type the way Java does.
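Sketched in Python for illustration (I'm after the NodeJS equivalent), the flow the Java script follows looks roughly like this; the project ID, headers, and rows below are placeholders for real BigQuery query results:

from google.cloud import dlp_v2

# Placeholders: in practice the headers/rows come from querying the BigQuery table
project = "my-project"
headers = ["email", "phone"]
rows = [["alice@example.com", "555-0100"]]

dlp = dlp_v2.DlpServiceClient()

# Build a table-shaped ContentItem from the query results; the Python client
# accepts a plain dict matching the Table message, no constructor required
item = {
    "table": {
        "headers": [{"name": h} for h in headers],
        "rows": [
            {"values": [{"string_value": str(cell)} for cell in row]}
            for row in rows
        ],
    }
}

response = dlp.inspect_content(
    request={
        "parent": f"projects/{project}",
        "inspect_config": {
            "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
            "include_quote": True,
        },
        "item": item,
    }
)

for finding in response.result.findings:
    # For table items, each finding's location carries the field (column) name
    field = finding.location.content_locations[0].record_location.field_id.name
    print(field, finding.info_type.name)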
I was unable to find Node.js documentation for implementing column-level tags.
However, you might find the official Policy Tags documentation helpful to point you in the right direction. Specifically, you might be missing some of the roles required to manage column-level tags.
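That said, once you know which column a finding belongs to, attaching a policy tag is a schema update on the table. A rough sketch with the Python BigQuery client (the table ID and the policy tag resource name, which you would create beforehand in Data Catalog, are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table and policy tag
table = client.get_table("my-project.my_dataset.my_table")
email_tag = "projects/my-project/locations/us/taxonomies/1234/policyTags/5678"

new_schema = []
for field in table.schema:
    if field.name == "email":  # e.g. a column DLP flagged as EMAIL_ADDRESS
        field = bigquery.SchemaField(
            field.name,
            field.field_type,
            mode=field.mode,
            policy_tags=bigquery.PolicyTagList(names=[email_tag]),
        )
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])  # writes the column-level policy tag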
In my backend, I have data attributes labeled in camelCase:
customerStats: {
ownedProducts: 100,
usedProducts: 50,
},
My UI code is set up so that an array of ["label", data] pairs works best most of the time, i.e. it is the most convenient for frontend coding. In my frontend, I need these labels to be in proper English spelling so they can be used as-is in the UI:
customerStats: [
["Owned products", 100],
["Used products", 50],
],
My question is about best practices or standards in web development. I have been inconsistent in my past projects where I would convert the data at random places, sometimes client-side, sometimes right on the backend, sometimes converting it one way and then back again because I needed the JSON data structure.
Is there a coding convention for how the data should be supplied to the frontend?
Right now all my data is transferred to the frontend as JSON. Is it best practice to convert the data into the needed form on the frontend or on the backend? And what if I need the JSON attributes to do further calculations right on the client?
Technologies I am using:
Frontend: Javascript / React
Backend: Javascript / Node.js + Java / Java Spring
Is there a coding convention for how to transfer data to the front end?
If your front end is JavaScript based, then JSON (JavaScript Object Notation) is the simplest form to consume, as it is a stringified version of the objects in memory. See this healthy discussion for more information on JSON.
Given that the most popular front-end development language these days is JavaScript (see the latest SO Survey on technology), it is very common and widely accepted to use the JSON format to transfer data between the back and front ends of solutions. The decision to use JSON in non-JavaScript based solutions is influenced by the development and deployment tools that you use; since more developers are using JavaScript, most tools are engineered to support JavaScript in some capacity.
It is however equally acceptable to use other structured formats, like XML.
JSON is generally more lightweight than XML, as there is less provision made to transfer metadata about the schema. For in-house data streams, it can be redundant to transfer a fully specced XML schema with each data transmission, so JSON is a good choice where the structure of the data is not in question between the sender and receiver.
XML is a more formal format for data transmission; it can include full schema documentation that allows receivers to utilize the information with little or no additional documentation.
CSV or other custom formats can reduce the bytes sent across the wire, but make it hard to visually inspect the data when you need to, and there is an overhead at both the sending and receiving end to format and parse the data.
Is it best practice to convert the data to the form that is needed on the frontend or backend?
The best practice is to reduce the number of times that a data element needs to be converted. Ideally you never have to convert between a label and the data property name... This is also the primary reason for using JSON as the data transfer format.
Because JSON can be natively interpreted by a JavaScript front end, we can essentially reduce conversion to just the server-side boundary where data is serialized/deserialized. There is in effect no conversion in the front end at all.
How to refer to data by the property name and not the visual label
The generally accepted convention in this space is to separate the concerns between the data model and the user experience, the view. Importantly, the view is the closest layer to the user; it represents a given 'point of view' of the data model.
It is hard to tailor a code solution for OP without any language or code provided for context, but in the abstract, applying this concept means not having the data model carry the final information about how the data should be displayed; instead, another piece of code provides the information needed to present the data.
In different technologies and platforms we refer to this in different ways but the core concept of separating the Model from the View or Presentation is consistently represented through these design patterns:
Exploring the MVC, MVP, and MVVM design patterns
MVP vs MVC vs MVVM vs VIPER
For OP's specific scenario, this might involve a mapping structure like the following:
customerStatsLabels: {
ownedProducts: "Owned products",
usedProducts: "Used products",
}
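Applied to OP's data, the transform is then a single mapping step. Since no UI code was provided, here is a language-agnostic sketch (written in Python; JavaScript's Object.entries() yields the same key/value pairs):

# The data model keeps its property names; the label map belongs to the view
customer_stats = {"ownedProducts": 100, "usedProducts": 50}
customer_stats_labels = {"ownedProducts": "Owned products", "usedProducts": "Used products"}

# Convert the model into the [label, value] pairs the UI consumes, leaving
# the model itself untouched for further client-side calculations
pairs = [[customer_stats_labels[k], v] for k, v in customer_stats.items()]
print(pairs)  # [['Owned products', 100], ['Used products', 50]]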
If this question is updated with some code around how the UI is constructed I will update this response with something more specific.
NOTE:
In JavaScript, objects convert readily to arrays of [key, value] pairs (Object.entries()) and back again (Object.fromEntries()), so it is very easy to tweak existing code that is based on arrays into code based on objects, and vice versa.
I had been wondering if it is possible to apply "data preparation" (.dprep) files to incoming data in score.py, similar to how Pipeline objects may be applied. This would be very useful for model deployment. To find out, I asked this question on the MSDN forums and received a response confirming it was possible, but with little explanation about how to actually do it. The response was:
in your score.py file, you can invoke the dprep package from Python
SDK to apply the same transformation to the incoming scoring data.
make sure you bundle your .dprep file in the image you are building.
So my questions are:
What function do I apply to invoke this dprep package?
Is it: run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) ?
How do I bundle it into the image when creating a web-service from the CLI?
Is there a switch to -f for score files?
I have scanned through the entire documentation and Workbench Repo but cannot seem to find any examples.
Any suggestions would be much appreciated!
Thanks!
EDIT:
Scenario:
I import my data from a live database and let's say this data set has 10 columns.
I then feature-engineer this (.dsource) data set using the Workbench, resulting in a .dprep file which may have 13 columns.
This .dprep data set is then imported as a pandas DataFrame and used to train and test my model.
Now I have a model ready for deployment.
This model is deployed via Model Management to a Container Service and will be fed data from a live database which once again will be of the original format (10 columns).
Obviously this model has been trained on the transformed data (13 columns) and will not be able to make a prediction on the 10 column data set.
What function may I use in the 'score.py' file to apply the same transformation I created in Workbench?
I believe I may have found what you need.
From this documentation you would import from the azureml.dataprep package.
There aren't any examples there, but searching on GitHub, I found this file, which uses the following to run data preparation.
from azureml.dataprep import package

# Runs the data flow defined in the .dprep package and returns a pandas DataFrame
df = package.run('Data analysis.dprep', dataflow_idx=0)
Hope that helps!
To me, it looks like this can be achieved by using the run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) method from the azureml.dataprep.package module.
From the documentation:
run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) runs the specified data flow based on an in-memory data source and returns the results as a dataframe. The user_config argument is a dictionary that maps the absolute path of a data source (.dsource file) to an in-memory data source represented as a list of lists.
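Putting the two answers together, a hypothetical score.py could apply the transformation like this (the paths, the 10-column input layout, and the model object are all placeholders):

from azureml.dataprep import package

def run(raw_rows):
    # raw_rows: incoming records in the original 10-column layout, as a list of lists
    user_config = {
        # maps the absolute path of the original .dsource file to the in-memory data
        '/azureml-share/original.dsource': raw_rows,
    }
    # Applies the same transformation built in Workbench and returns the
    # 13-column pandas DataFrame the model was trained on
    df = package.run_on_data(user_config, '/azureml-share/transform.dprep', dataflow_idx=0)
    return model.predict(df)  # `model` is assumed to be loaded in init()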
I am trying to build a recommendation engine based on collaborative filtering using Apache Spark. I have been able to run recommendation_example.py on my data, with quite good results (MSE ~ 0.9). Some of the specific questions that I have are:
How to make recommendations for users who have not done any activity on the site. Isn't there some API call for popular items that would give me the most popular items based on user actions? One way to do this is to identify the popular items ourselves, catch the java.util.NoSuchElementException, and return those popular items.
How to reload the model after some data has been added to the input file. I am trying to reload the model using another function which tries to save the model, but it fails with org.apache.hadoop.mapred.FileAlreadyExistsException. One way to do this is to listen for the incoming data on a parallel thread, save it using model.save(sc, "target/tmp/<some target>"), and then reload the model after significant data has been received. I am lost here on how to achieve that.
It would be very helpful if I could get some direction here.
For the first part, you can count, for each item_id, the number of times that item_id appeared; the map and reduceByKey functions of Spark work well for that, as sketched below. After that, take the top 10/20 items with the highest counts. You can also weight items depending on their recency.
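A minimal PySpark sketch of that fallback (the sample ratings are illustrative; in practice the RDD comes from the same input file as recommendation_example.py):

from pyspark import SparkContext

sc = SparkContext(appName="PopularItems")
# Illustrative (user_id, item_id, rating) tuples
ratings = sc.parallelize([(1, "a", 5.0), (2, "a", 4.0), (3, "b", 3.0)])

# Count interactions per item, then take the most popular ones
item_counts = ratings.map(lambda r: (r[1], 1)).reduceByKey(lambda a, b: a + b)
top_items = item_counts.takeOrdered(10, key=lambda pair: -pair[1])
print(top_items)  # serve these to users with no activity yet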
For the second part, you can save the model with a new name every time; see the sketch below. I generally create a folder name on the fly using the current date and time, and use the same name to reload the model from the saved folder. You will always have to train the model again, using the past data plus the newly received data, and then use that model to predict.
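For example (a sketch assuming `model` and `sc` already exist, as in recommendation_example.py):

import time
from pyspark.mllib.recommendation import MatrixFactorizationModel

# A timestamped folder name means model.save() never collides with an
# earlier run, avoiding FileAlreadyExistsException
model_path = "target/tmp/model-" + time.strftime("%Y%m%d-%H%M%S")
model.save(sc, model_path)

# Later, e.g. after retraining on past + new data, reload from the same path
same_model = MatrixFactorizationModel.load(sc, model_path)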
Independent of platforms like Spark, there are some very good link-prediction techniques (for example, non-negative matrix factorization) that predict links between two sets.
Other very effective techniques for recommendations are Thompson Sampling and MABs (Multi-Armed Bandits); a minimal sketch of the former follows below. A lot depends on the raw dataset and how it is distributed. I would recommend applying the above methods to 5% of the raw dataset, building a hypothesis, using A/B testing, predicting links, and moving forward.
Again, all these techniques are independent of the platform. I would also recommend starting from scratch instead of using platforms like Spark, which are mainly useful for large datasets. You can always move to these platforms in the future for scalability.
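A minimal Beta-Bernoulli Thompson Sampling sketch (item names and feedback are illustrative):

import random

# Each candidate item keeps (successes, failures) counts from user feedback
counts = {"item_a": [1, 1], "item_b": [1, 1], "item_c": [1, 1]}

def pick_item():
    # Sample a plausible success rate per item from its Beta posterior
    samples = {item: random.betavariate(s, f) for item, (s, f) in counts.items()}
    return max(samples, key=samples.get)

def record_feedback(item, clicked):
    counts[item][0 if clicked else 1] += 1

choice = pick_item()
record_feedback(choice, clicked=True)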
Hope it helps!
I have a client who has asked me to take a look at the spreadsheet they use for manipulating AWS data in order to import sales invoices into Xero. I'm just wondering: is it possible to directly query AWS from Excel? This would streamline the process by cutting out the manual AWS export, plus I would be able to create a query that puts the data into the format that Xero needs to see.
Moving on from this, I guess the next logical step would be to create an API that Xero can hook up to... unless this is already a thing?
Darren
There is a sample Excel VBA / VB6 Helper Routines project that might get you going in the right direction.