Reading DSL data file in python - python-3.x

Is there a way to read a data file generated in .dsl format by a simulation and analyze it in Python? I have attached a .dsl file from a simulator below and want to read the generated data in Python for data analysis, but I have no clue how to extract the data from the file. Any help would be highly appreciated.
Link to dsl file: simulated data in dsl format

Related

How to parse big XML in google cloud function efficiently?

I have to extract data from XML files several hundred MB in size in a Google Cloud Function, and I was wondering if there are any best practices.
Since I am used to Node.js, I was looking at some popular libraries like fast-xml-parser, but it seems cumbersome if you only want specific data from a huge XML. I am also not sure if there are any performance issues when the XML is too big. Overall this does not feel like the best solution to parse and extract data from huge XMLs.
Then I was wondering if I could use BigQuery for this task, where I simply convert the XML to JSON and throw it into a Dataset, where I can then use a query to retrieve the data I want.
Another solution could be to use Python for the job, since it is good at parsing and extracting data from XML. Even though I have no experience in Python, I was wondering if this path could still be the best solution?
If anything above does not make sense or if one solution is preferable to the other or if anyone can share any insights I would highly appreciate it!
I suggest checking this article, in which they discuss how to load XML data into BigQuery using Python Dataflow. I think that this approach may work in your situation.
Basically what they suggest is:
Parse the XML into a Python dictionary using the package xmltodict.
Specify a schema of the output table in BigQuery.
Use a Beam pipeline to take an XML file and use it to populate a BigQuery table.
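For files of several hundred MB, the parsing step may be easier on memory if the XML is streamed rather than loaded whole. The following is only a minimal sketch of that idea using the standard library's xml.etree.ElementTree.iterparse instead of xmltodict (a swapped-in technique, not what the article uses). The element names (orders, order, id, total) and the helper iter_records are made up for illustration; each yielded dict is the kind of row you could then hand to whatever BigQuery loading step you choose.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, record_tag):
    """Yield one flat dict per <record_tag> element without loading the whole file."""
    # Request "start" events too, so the very first event hands us the root element.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            # Flatten the record's direct children into column -> text pairs.
            yield {child.tag: (child.text or "").strip() for child in elem}
            # Drop finished records from the tree so memory use stays bounded.
            root.clear()


# Hypothetical sample data; in practice `source` would be a file path or file object.
sample = io.BytesIO(
    b"<orders>"
    b"<order><id>1</id><total>9.50</total></order>"
    b"<order><id>2</id><total>12.00</total></order>"
    b"</orders>"
)

rows = list(iter_records(sample, "order"))
for row in rows:
    print(row)
```

Because iter_records is a generator, rows can be processed or written out one at a time instead of being collected into a list as done here for the demo.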

Workflow for interpreting linked data in .ttl files with Python RDFLib

I am using turtle files containing biographical information for historical research. Those files are provided by a major library and most of the information in the files is not explicit. While people's professions, for instance, are sometimes stated alongside links to the library's URIs, I only have URIs in the majority of cases. This is why I will need to retrieve the information behind them at some point in my workflow, and I would appreciate some advice.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
I have also seen that there are ways to convert RDFs directly to CSV, but although CSV is nice to work with, I would get a lot of unwanted "background noise" by simply converting all the data.
What would you recommend?
RDFLib is all about working with RDF data. If you have RDF data, my suggestion is to do as much RDF-native stuff as you can and only export to CSV if you want to do something like print tabular results or load into Pandas DataFrames. Of course there is always more than one way to do things, so you could manipulate data in CSV, but RDF, by design, holds far more information than a CSV file can, so when you're manipulating RDF data, you have more things to get hold of.
most of the information in the files is not explicit
Better phrased: most of the information is indicated with objects identified by URIs, not given as literal values.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
No! You should store the ttl files you can get, and then you may indeed retrieve all the other data referred to by URI. Presumably that data is also in RDF form, so you should download it into the same graph you loaded the initial ttl files into. Then you have the full graph, with links and literal values in it, at your disposal to manipulate with SPARQL queries.

Is there an equivalent to SAS sas7bdat table files in Python?

Is there a Python equivalent to reading and writing tabular files such as SAS sas7bdat files?
My team is moving away from SAS and we'd like to replicate the SAS process in Python with our methodology as follows:
1) Pull data from various sources i.e. Excel, CSV, DBs etc.
2) Update our Data Warehouse with the new information and export this data as a Python table file (to be used next)
3) Rather than pulling data from our warehouse (super slow) we'd like to read in those Python table files and then do some data matching on a bigger set of data.
We're trying to avoid sas7bdat files (and SASPy) altogether, since we won't have SAS for much longer.
Any advice or insight is greatly appreciated!
Unlike SAS, Python doesn't have a native data format. However, there are modules that implement binary protocols for serializing and de-serializing a Python object. Consider using the HDF5 format to save and read files (https://www.h5py.org/). Another possibility is Pickle (https://docs.python.org/3/library/pickle.html).
Parquet is also worth considering.
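As a rough illustration of the Pickle option, here is a minimal sketch of writing a table-like object to disk and reading it back. The rows, column names, and file name are hypothetical stand-ins for data exported from the warehouse; the same pattern applies to other picklable objects.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical rows standing in for a table exported from the warehouse.
table = [
    {"customer_id": 1, "name": "Ada", "balance": 120.5},
    {"customer_id": 2, "name": "Grace", "balance": 87.0},
]

# A temporary directory keeps the demo self-cleaning; use a real path in practice.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "customers.pkl"

    # Step 2 of the workflow: serialize the object to a binary file.
    with path.open("wb") as f:
        pickle.dump(table, f, protocol=pickle.HIGHEST_PROTOCOL)

    # Step 3: read it back later without querying the warehouse again.
    with path.open("rb") as f:
        restored = pickle.load(f)

print(restored == table)
```

Note that pickle files are Python-specific and should only be loaded from sources you trust, since unpickling can execute arbitrary code. HDF5 or Parquet may be a better fit if the files need to be shared with other tools.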

How to apply a dprep package to incoming data in score.py Azure Workbench

I had been wondering whether it is possible to apply "data preparation" (.dprep) files to incoming data in the score.py, similar to how Pipeline objects may be applied. This would be very useful for model deployment. To find out, I asked this question on the MSDN forums and received a response confirming it was possible, but with little explanation of how to actually do it. The response was:
in your score.py file, you can invoke the dprep package from Python SDK to apply the same transformation to the incoming scoring data. make sure you bundle your .dprep file in the image you are building.
So my questions are:
What function do I apply to invoke this dprep package?
Is it: run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) ?
How do I bundle it into the image when creating a web-service from the CLI?
Is there a switch to -f for score files?
I have scanned through the entire documentation and Workbench Repo but cannot seem to find any examples.
Any suggestions would be much appreciated!
Thanks!
EDIT:
Scenario:
I import my data from a live database and let's say this data set has 10 columns.
I then feature engineer this (.dsource) data set using the Workbench resulting in a .dprep file which may have 13 columns.
This .dprep data set is then imported as a pandas DataFrame and used to train and test my model.
Now I have a model ready for deployment.
This model is deployed via Model Management to a Container Service and will be fed data from a live database which once again will be of the original format (10 columns).
Obviously this model has been trained on the transformed data (13 columns) and will not be able to make a prediction on the 10 column data set.
What function may I use in the 'score.py' file to apply the same transformation I created in workbench?
I believe I may have found what you need.
From this documentation you would import from the azureml.dataprep package.
There aren't any examples there, but searching on GitHub, I found this file which has the following to run data preparation.
from azureml.dataprep import package
df = package.run('Data analysis.dprep', dataflow_idx=0)
Hope that helps!
To me, it looks like this can be achieved by using the run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) method from the azureml.dataprep.package module.
From the documentation:
run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) runs the specified data flow based on an in-memory data source and returns the results as a dataframe. The user_config argument is a dictionary that maps the absolute path of a data source (.dsource file) to an in-memory data source represented as a list of lists.

How to read input data from an excel spreadsheet and pass it JSON payload in karate framework?

I need to create data driven unit tests for different APIs in karate framework. The various elements to be passed in the JSON payload should be taken as input from an excel file.
A few points:
I recommend you look at Karate's built-in data-table capabilities. They are far more readable, integrate into your test script, and mean you won't need to depend on other software. Refer to these examples: call-table.feature and dynamic-params.feature
Next, I would recommend using JSON instead of an Excel or CSV file, since it is natively supported by Karate: call-json-array.feature
Finally, if you really wanted to, you can call any Java code and if you return data in a Map / List form, it will be ready for Karate to use. This example shows how to read a database via JDBC: dogs.feature. So although this is not built into Karate, just write a simple utility to read a CSV or Excel file and you can do pretty much anything Java can do.
EDIT: Karate now supports CSV files that can be used to even do data-driven testing: https://github.com/intuit/karate#csv-files
