unit conversion when importing datasets from excel files in brightway

unit conversion when importing datasets from excel files in brightway - brightway

I am trying to create some activities using the excel importer. My activity has a technosphere flow of 0.4584 MWh of Production of electricity by gas from the previously imported EXIOBASE 3.3.17 hybrid database. The activity of Production of electricity by gas is in TJ in the database.
I ran without problems the import, something like:
ei = ExcelImporter(path_to_my_excel)
ei.apply_strategies()
ei.match_database(fields = ['name','location'])
ei.match_database(db_name = 'EXIOBASE 3.3.17 hybrid', fields = ['name','location'])
ei.match_database(db_name = 'biosphere3', fields = ['name','categories'])
ei.write_project_parameters()
ei.write_database(activate_parameters=True)
but if I iterate over the technosphere flows of my activity consuming natural gas it says it uses 0.4584 TJ of Production of electricity by gas (the same unit as the activity of production of electricty by gas, but the same amount I put in MWh). I was kind of hoping some unit conversion under the hood. Perhaps using bw2io.units.UNITS_NORMALIZATION.
Should we always express the units of exchanges with the same units as the activity they link ? is there an existing strategy to do the unit conversion for us? Thanks!

This line: ei.match_database(db_name = 'EXIOBASE 3.3.17 hybrid', fields = ['name','location']) is telling the program to match, but not to match based on units.
You can get the desired result with a migration, see an example here (in the section Fixing units for passenger cars).

Related

Azure cost analysis for a particular subscription using Python SDK

So I'm trying to automate fetching the current cost and cost forecast (Like it is shown under cost analysis for a particular subscription) for a particular subscription using python SDK but I haven't been able to find a single API that does this yet.
I've tried using UsageAggregate and Rate card but I haven't really figured out a way to find the cost for the current month to date. If there is an API that I'm missing or if I need to calculate monthly costs myself, I'd appreciate any code snippets or help.

If you already have the usage and the ratecard data, then you must combine them.
Take the meterId of the usage data and get the related ratecard data.
The ratecard data contains the MeterRates and the IncludedQuantity which you must take.
There are probably multiple meter rates and the included quantity because there are probably different costs per usage (e.g. first 10 calls for free, 3 GB for free, ...).
The consumption starts/is reseted at the 14th of the month. That's the reason why you have to read the data from the whole billing period (begins with 14th of each month), because that's the only way how you get the correct consumption.
So, if you are using e.g. Azure Functions and you have a usage of 100.000 units per day and you want the costs from 20th - 30th, then the calculation works as follows:
read data from 14th - 30th. These are 17 days and therefore it used 1.700.000 units. The first 400.000 are for free = IncludedQuantity (so in this sample the first 4 days).
From the 400.001 unit on, you have to take the meter rate (0,0000134928 €) and calculate the costs. 1.300.000 * 0,0000134928 = ~17,54€.
Fortunately, the azure functions have only one rate. If the rate changes e.g. after 5.000.000 units, then you also have to take this into account. If you have the whole costs, then you can filter on your date which is 20.-30. and you will get the result.
Its calculation implemented in C# and published it as a NuGet package here. It also contains a sample console which you could use to export the data.

I know I am bit late to the party, but after struggling with the same problem, I managed to create the code for getting the cost of a resource group using
azure.mgmt.costmanagement
Link to cost management API
Code sample is in my answer here

How to acces output folder from a PythonScriptStep?

I'm new to azure-ml, and have been tasked to make some integration tests for a couple of pipeline steps. I have prepared some input test data and some expected output data, which I store on a 'test_datastore'. The following example code is a simplified version of what I want to do:
ws = Workspace.from_config('blabla/config.json')
ds = Datastore.get(ws, datastore_name='test_datastore')
main_ref = DataReference(datastore=ds,
data_reference_name='main_ref'
)
data_ref = DataReference(datastore=ds,
data_reference_name='main_ref',
path_on_datastore='/data'
)
data_prep_step = PythonScriptStep(
name='data_prep',
script_name='pipeline_steps/data_prep.py',
source_directory='/.',
arguments=['--main_path', main_ref,
'--data_ref_folder', data_ref
],
inputs=[main_ref, data_ref],
outputs=[data_ref],
runconfig=arbitrary_run_config,
allow_reuse=False
)
I would like:
my data_prep_step to run,
have it store some data on the path to my data_ref), and
I would then like to access this stored data afterwards outside of the pipeline
But, I can't find a useful function in the documentation. Any guidance would be much appreciated.

two big ideas here -- let's start with the main one.
main ask
With an Azure ML Pipeline, how can I access the output data of a PythonScriptStep outside of the context of the pipeline?
short answer
Consider using OutputFileDatasetConfig (docs example), instead of DataReference.
To your example above, I would just change your last two definitions.
data_ref = OutputFileDatasetConfig(
name='data_ref',
destination=(ds, '/data')
).as_upload()
data_prep_step = PythonScriptStep(
name='data_prep',
script_name='pipeline_steps/data_prep.py',
source_directory='/.',
arguments=[
'--main_path', main_ref,
'--data_ref_folder', data_ref
],
inputs=[main_ref, data_ref],
outputs=[data_ref],
runconfig=arbitrary_run_config,
allow_reuse=False
)
some notes:
be sure to check out how DataPaths work. Can be tricky at first glance.
set overwrite=False in the `.as_upload() method if you don't want future runs to overwrite the first run's data.
more context
PipelineData used to be the defacto object to pass data ephemerally between pipeline steps. The idea was to make it easy to:
stitch steps together
get the data after the pipeline runs if need be (datastore/azureml/{run_id}/data_ref)
The downside was that you have no control over where the pipeline is saved. If you wanted to data for more than just as a baton that gets passed between steps, you could have a DataTransferStep to land the PipelineData wherever you please after the PythonScriptStep finishes.
This downside is what motivated OutputFileDatasetConfig
auxilary ask
how might I programmatically test the functionality of my Azure ML pipeline?
there are not enough people talking about data pipeline testing, IMHO.
There are three areas of data pipeline testing:
unit testing (the code in the step works?
integration testing (the code works when submitted to the Azure ML service)
data expectation testing (the data coming out of the meets my expectations)
For #1, I think it should be done outside of the pipeline perhaps as part of a package of helper functions
For #2, Why not just see if the whole pipeline completes, I think get more information that way. That's how we run our CI.
#3 is the juiciest, and we do this in our pipelines with the Great Expectations (GE) Python library. The GE community calls these "expectation tests". To me you have two options for including expectation tests in your Azure ML pipeline:
within the PythonScriptStep itself, i.e.
run whatever code you have
test the outputs with GE before writing them out; or,
for each functional PythonScriptStep, hang a downstream PythonScriptStep off of it in which you run your expectations against the output data.
Our team does #1, but either strategy should work. What's great about this approach is that you can run your expectation tests by just running your pipeline (which also makes integration testing easy).

Best way to store high frequency, periodic time-series data?

I have created an MVP for a nodejs project, following are some of the features that are relevant to the question I am about to ask:
1-The application has a list of IP addresses with CRUD actions.
2-The application will ping each IP address after every 5 seconds.
3- And display against each IP address it's status i.e alive or dead and the uptime if alive
I created a working MVP on nodejs with the help of the library net-ping, express, mongo and angular. Now I have a new feature request that is:
"to calculate the round trip time(latency) for each ping that is generated for each IP address and populate a bar chart or any type of chart that will display the RTT(latency) history(1 months-1 year) of every connection"
I need to store the response of each ping in the database, Assuming the best case that if each document that I will store is of size 0.5 kb, that will make 9.5MB data to be stored in each day,285MB in each month and 3.4GB in a year for a single IP address and I am going to have 100-200 IP addresses in my application.
What is the best solution (including those which are paid) that will suit the best for my requirements considering the app can scale more?

Time series data require special treatment from a database perspective as they introduce challenges to the traditional database management from capacity, query performance, read/write optimisation targets, etc.
I wouldn't recommend you store this data in a traditional RDBMS, or object/document database.
Best option is to use a specialised time-series database engine, like InfluxDB, that can support downsampling (aggregation) and raw data retention rules

So I changed The schema design for the Time-series data after reading this and that reduced the numbers in my calculation of size massively
previous Schema looked like this:
{
timestamp: ISODate("2013-10-10T23:06:37.000Z"),
type: "Latency",
value: 1000000
},
{
timestamp: ISODate("2013-10-10T23:06:38.000Z"),
type: "Latency",
value: 15000000
}
Size of each document: 0.22kb
number of document created in an hour= 720
size of data generated in an hour=0.22*720 = 158.4kb
size of data generated by one IP address in a day= 158 *24 = 3.7MB
Since every next time_Stamp is just the increment of 5 seconds from the previous one, the schema can be optimized to cut the redundant data.
The new schema looks like this :
{
timestamp_hour: ISODate("2013-10-10T23:06:00.000Z"),// will contain hours
type: “Latency”,
values: {//will contain data for all pings in the specific hour
0: 999999,
…
37: 1000000,
38: 1500000,
…
720: 2000000
}
}
Size of each document: 0.5kb
number of document created in an hour= 1
size of data generated in an hour= 0.5kb
size of data generated by one IP address in a day= 0.5 *24 = 12kb
So I Am assuming the size of the data will not be an issue anymore, and I although there is a debate for what type of storage should be used in such scenarios to ensure best performance but I am going to trust mongoDB in my case.

Getting Multiple Last Price Quotes from Interactive Brokers's API

I have a question regarding the Python API of Interactive Brokers.
Can multiple asset and stock contracts be passed into reqMktData() function and obtain the last prices? (I can set the snapshots = TRUE in reqMktData to get the last price. You can assume that I have subscribed to the appropriate data services.)
To put things in perspective, this is what I am trying to do:
1) Call reqMktData, get last prices for multiple assets.
2) Feed the data into my prediction engine, and do something
3) Go to step 1.
When I contacted Interactive Brokers, they said:
"Only one contract can be passed to reqMktData() at one time, so there is no bulk request feature in requesting real time data."
Obviously one way to get around this is to do a loop but this is too slow. Another way to do this is through multithreading but this is a lot of work plus I can't afford the extra expense of a new computer. I am not interested in either one.
Any suggestions?

You can only specify 1 contract in each reqMktData call. There is no choice but to use a loop of some type. The speed shouldn't be an issue as you can make up to 50 requests per second, maybe even more for snapshots.
The speed issue could be that you want too much data (> 50/s) or you're using an old version of the IB python api, check in connection.py for lock.acquire, I've deleted all of them. Also, if there has been no trade for >10 seconds, IB will wait for a trade before sending a snapshot. Test with active symbols.
However, what you should do is request live streaming data by setting snapshot to false and just keep track of the last price in the stream. You can stream up to 100 tickers with the default minimums. You keep them separate by using unique ticker ids.

Azure Machine learning error

I am trying to create a Percentage win calculator using a multi class decision forest
I have two data sets one of existing win/loss data and another of pending items (same column structure) (win/loss/pending are all in one column call status)
the experiment runs no problems on the test data (win/loss) and has about a 90% accuracy rating
but when i move it over to a web service and try to run it with the other data set I get an error "
Apply Transformation Error Cannot process column "NAICS" of type
System.Double. The type is not supported by the module. . ( Error 0017
)"
The naic code is no different in one data set than the other
I am lost any help would be greatly appreciated

I guess you have already known that 0017 means Exception occurs if one or more specified columns have type unsupported by current module. (https://msdn.microsoft.com/en-us/library/azure/dn905850.aspx).
From the suggested resolution, you can do [Convert to Dataset][2]

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string