I am looking at the Kedro library, as my team is looking into using it for our data pipeline.
While going through the official tutorial (Spaceflights), I came across this function:
def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Preprocess the data for companies.

    Args:
        companies: Source data.
    Returns:
        Preprocessed data.
    """
    companies["iata_approved"] = companies["iata_approved"].apply(_is_true)
    companies["company_rating"] = companies["company_rating"].apply(_parse_percentage)
    return companies
companies is the name of the CSV file containing the data.
Looking at the function, my assumption is that (companies: pd.DataFrame) is shorthand to read the "companies" dataset as a DataFrame. If so, I do not understand what -> pd.DataFrame at the end means.
I tried looking at the Python documentation for this style of code, but I did not manage to find anything.
Any help in understanding this would be much appreciated.
Thank you
This is the way of declaring the types of your inputs: in (companies: pd.DataFrame), companies is the argument and pd.DataFrame is its type. In the same way, -> pd.DataFrame declares the type of the output.
Overall, the signature says that the function takes companies of type pd.DataFrame and returns a variable of type pd.DataFrame.
I hope that helps.
The -> notation is type hinting, as is the : part in the companies: pd.DataFrame function definition. This is not essential in Python, but many people like to include it. The function definition would work exactly the same if it didn't contain this but instead read:
def preprocess_companies(companies):
This is a general Python thing rather than anything kedro-specific.
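For example, the annotations are purely informational at runtime; a quick sketch:
def double(x: int) -> int:
    return x * 2

double(2)       # 4
double("ab")    # "abab" -- no error, the hints are not enforced at runtime
Static type checkers such as mypy can use these annotations to flag mismatches, but plain Python ignores them when the code runs.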
The way that kedro registers companies as a kedro dataset is completely separate from this function definition and is done through the catalog.yml file:
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
There will then be a node defined (in pipeline.py) to specify that the preprocess_companies function should take as input the kedro dataset companies:
node(
    func=preprocess_companies,
    inputs="companies",  # THIS LINE REFERS TO THE DATASET NAME
    outputs="preprocessed_companies",
    name="preprocessing_companies",
),
In theory the name of the parameter in the function itself could be completely different, e.g.
def preprocess_companies(anything_you_want):
... although it is very common to give it the same name as the dataset.
In this situation companies is technically any DataFrame. However, when wrapped in a Kedro Node object the correct dataset will be passed in:
Node(
    func=preprocess_companies,  # The function posted above
    inputs='raw_companies',  # Kedro will read from a catalog entry called 'raw_companies'
    outputs='processed_companies',  # Kedro will write to a catalog entry called 'processed_companies'
)
In essence the parameter name isn't really important here; it has been named this way so that the person reading the code knows that it is semantically about companies, but the function name does that too.
The above is technically a simplification since I'm not getting into MemoryDataSets but hopefully it covers the main points.
I need to get the individual contributions of the processes and emissions I entered into my database, similar to this problem: Brightway2 - Get LCA scores of immediate exchanges.
It works for single methods, but I was wondering how to get these results for several methods, as with ordinary calculations, and then save them as CSV. Is there a way to create a loop for this?
Thank you so much!
Miriam
There is a function called multi_traverse_tagged_databases in bw2analyzer which should do what you need. It was part of a pull request, so it's not in the docs.
I've copied in the docstring at the bottom, which should give you some pointers. It's basically the same as the traverse_tagged_database function used in the question you've linked to, but for multiple methods. You'd use it like this:
results, graph = multi_traverse_tagged_databases(functional_unit, list_of_methods, label='name')
You should be able to use pandas to export the dictionary you get in results to a csv file.
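For example, something along these lines (a rough sketch, assuming results comes back as a dictionary whose values are per-method score lists):
import pandas as pd

results, graph = multi_traverse_tagged_databases(functional_unit, list_of_methods, label='name')

# Assumed shape: {tag: [score for method 1, score for method 2, ...], ...}
df = pd.DataFrame(results, index=[str(m) for m in list_of_methods])
df.to_csv('tagged_results.csv')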
def multi_traverse_tagged_databases(
    functional_unit, methods, label="tag", default_tag="other", secondary_tags=[]
):
    """Traverse a functional unit throughout its foreground database(s), and
    group impacts (for multiple methods) by tag label.

    Input arguments:
        * ``functional_unit``: A functional unit dictionary, e.g. ``{("foo", "bar"): 42}``.
        * ``methods``: A list of method names, e.g. ``[("foo", "bar"), ("baz", "qux"), ...]``
        * ``label``: The label of the tag classifier. Default is ``"tag"``
        * ``default_tag``: The tag classifier to use if none was given. Default is ``"other"``
        * ``secondary_tags``: List of tuples in the format (secondary_label, secondary_default_tag). Default is empty list.

    Returns:
        Aggregated tags dictionary from ``aggregate_tagged_graph``, and tagged supply chain graph from ``recurse_tagged_database``.
    """
I have a hierarchy of data that I would like to build using classes instead of hardcoding it. The structure is like so:
Unit (has name, abbreviation, subsystems [5 different types of subsystems])
Subsystem (has type, block diagram (photo), ParameterModels [20 different sets of ParameterModels])
ParameterModel (has 30 or so parameters, each with parameter name, value, units, and model index)
I'm not sure how to do this using classes, but what I have made kind of work so far is creating nested dictionaries:
{'Unit': {'Unit1': {'Subsystem': {'Generator': {'Parameter': {'Name': 'param1', 'Value': 1, 'Units': 'seconds'}}}}}}
like this, but with 10-15 units, 5-6 subsystems, and 30 or so parameters per subsystem. I know using dictionaries is not the best way to go about it, but I cannot figure out the class structure or where to start building it.
I want to be able to create, read, update, and delete parameters in a tkinter GUI that I have built, as well as export/import these system parameters and do calculations on them. I can handle the calculations and the import/export, but I need to create classes that will build out this structure and be able to reference each individual unit/subsystem/parameter/value/etc.
I know that's a lot, but any advice? I've been looking into the factory and abstract factory patterns in hopes of figuring out how to create the code structure, but to no avail. I have experience with MATLAB, Visual Basic, C++, and various Arduino projects, so I know most basic programming, but this inheritance/class structure is something I cannot figure out how to do in an abstract way without hardcoding each parameter with giant names like Unit1_Generator_parameterName_parameter = ____, and I really don't want to do that.
Thanks,
-A
EDIT: Here is one way I've done the implementation using a dictionary, but I would like to do this using a class that can take a list, make a bunch of empty attributes, and have those be editable/callable generally, like setParamValue(unit, subsystem, param), where I can pass the unit, the subsystem, and then the parameter (such as 'Td') and then be able to change the value of the key/value pair within this hierarchy.
def create_keys(keys):
    return {key: None for key in keys}

unit_list = ['FL','ES','NN','SF','CC','HD','ND','TH']  # unit abbreviations
sub_list = ['Gen','Gov','Exc','PSS','Rel','BlkD']  # subsystem abbreviations
params_GENROU = ["T'do","T''do","T'qo","T''qo",'H','D','Xd','Xq',"Xd'","Xq'","X''d=X''q",'Xl','S(1.0)','S(1.2)','Ra']  # parameter names

units = create_keys(unit_list)
for key in units:
    units[key] = create_keys(sub_list)
    units[key]['Gen'] = create_keys(params_GENROU)
and inside each units[unit]['Gen'][param] there should be a dict containing Value, Units (seconds, degrees, etc.), Description, and CON (basically an index for another program we use).
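Something like this is the rough shape of the interface I'm imagining, if that helps clarify what I'm after (an untested sketch; the names are placeholders, and here every subsystem gets the same parameter list, which isn't quite right):
class System:
    """Nested units -> subsystems -> parameters, built from lists of names."""
    def __init__(self, unit_list, sub_list, param_names):
        # Build an empty placeholder record for every unit/subsystem/parameter
        self.units = {
            u: {s: {p: {'Value': None, 'Units': None, 'Description': None, 'CON': None}
                    for p in param_names}
                for s in sub_list}
            for u in unit_list
        }

    def set_param_value(self, unit, subsystem, param, value):
        self.units[unit][subsystem][param]['Value'] = value

    def get_param(self, unit, subsystem, param):
        return self.units[unit][subsystem][param]

model = System(unit_list, sub_list, params_GENROU)
model.set_param_value('FL', 'Gen', 'H', 3.2)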
I have discovered the pandas DataFrame.query method, and it almost does exactly what I needed (I had implemented my own parser for this since I hadn't realized query existed, but really I should be using the standard method).
I would like my users to be able to specify the query in a configuration file. The syntax seems intuitive enough that I can expect my non-programmer (but engineer) users to figure it out.
There's just one thing missing: a way to select everything in the dataframe. Sometimes what my users want is every row, so they would put 'All' or something similar into that configuration option. In fact, that will be the default option.
I tried df.query('True') but that raised a KeyError. I tried df.query('1') but that returned the row with index 1. The empty string raised a ValueError.
The only things I can think of are 1) put an if clause every time I need to do this type of query (probably 3 or 4 times in the code), or 2) subclass DataFrame and either reimplement query or add a query_with_all method:
import pandas as pd

class MyDataFrame(pd.DataFrame):
    def query_with_all(self, query_string):
        if query_string.lower() == 'all':
            return self
        else:
            return self.query(query_string)
And then use my own class every time instead of the pandas one. Is this the only way to do this?
Keep things simple, and use a function:
def query_with_all(data_frame, query_string):
    if query_string == "all":
        return data_frame
    return data_frame.query(query_string)
Whenever you need to use this type of query, just call the function with the data frame and the query string. There's no need for any extra if statements or to subclass pd.DataFrame.
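For example:
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
query_with_all(df, 'all')      # returns the full frame unchanged
query_with_all(df, 'x > 1')    # returns only the rows where x > 1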
If you're restricted to using df.query, you can use a global variable
ALL = slice(None)
df.query('@ALL', engine='python')
If you're not allowed to use global variables, and if your DataFrame isn't MultiIndexed, you can use
df.query('tuple()')
All of these will properly handle NaN values.
df.query('ilevel_0 in ilevel_0') will always return the full dataframe, even when the index contains NaN values or the dataframe is completely empty.
In your particular case you could then define a global variable all_true = 'ilevel_0 in ilevel_0' (as suggested in the comments by Zero) so that your engineers could use the name of the global variable in their config file instead.
This statement is just a dirty way to properly query True, like you already tried. ilevel_0 is a more formal way of making sure you are referring to the index. See the docs here for more details on using in and ilevel_0: https://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method
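As a quick sanity check of the claim above (minimal sketch):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]}, index=[0.0, float('nan'), 2.0])
all_true = 'ilevel_0 in ilevel_0'
df.query(all_true)   # returns all three rows, including the NaN-indexed one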
I am writing Python scripts to extract data from multiple sources and put it in a graph in a certain structure.
I am using bulbs models for all the data. I have models for all relevant node types and relationships. My edge models have no additional properties except 'label'.
As it is in development, I run the same script multiple times. I use get_or_create to prevent duplicate nodes, but edges do not have that method, and I do not have the object for an existing edge since it was created in a previous run of the script.
I saw several questions about similar things, with answers from espeed like this one, but I could not find a satisfactory answer for my specific issue.
What would be the simplest code for this method?
Presently I am trying to do this by loading a Gremlin script, as suggested by Stephen, with the following function:
def is_connected(parent, child, edge_label) {
    return g.v(parent).out(edge_label).retain([g.v(child)]).hasNext()
}
And the following Python code:
g.scripts.update('gremlin_scripts/gremlin.groovy')
script = g.scripts.get('gremlin:is_connected')
params = dict(parent=parent_node.eid, child=menu_item_v.eid, edge_label='has_sub_menu_item')
response = g.gremlin.execute(script, params)
I can't quite figure out how to get the bool result into Python. I've also tried g.gremlin.query(script, params).
Here's one way to do it:
parent_v.out(rel_label).retain(child_v).hasNext()
So, from the parent, traverse out to all children (I assume that "out" is the direction of your relationship - how you choose to implement that is specific to your domain) and determine if that child is present at any point via retain.
Hello, I'm trying to write a function which reads a certain type of spreadsheet, creates vectors dynamically from its data, and returns said vectors to the workspace.
My xlsx is structured by rows: in each row, the first cell is a string which should become the name of the vector, and the rest of the cells contain the numbers which make up the vector.
Here is my code:
function [ B ] = read_excel(filename)
%read_excel  A function to read time series data from a spreadsheet.

    % Get the contents of the first cell to know what to name the vector
    [nr, name] = xlsread(filename, 'sheet1', 'A2:A2');
    % Transform it to a string
    name_str = char(name);
    % Create a valid variable name from it
    varname = genvarname(name_str);
    % Get the numbers which will make up the vector
    A = xlsread(filename, 'B2:CT2');
    % Create the vector with the correct name and data
    eval([varname '= A;']);
end
As far as I can tell the vector is created correctly, but I have no idea how to return it to the workspace.
Preferably the solution should be able to return an indeterminate number of vectors, as this is just a prototype and I want the function to return a number of vectors of the user's choice at once.
To be more precise: the vector varname is created and I can use it in the script. If I add:
eval(['plot(',varname,')'])
it will plot the vector, but for my purposes I need the vector varname to be returned to the workspace so that it persists after the script is run.
I think you're looking for evalin:
evalin('base', [varname '= B;']);
(which will not work quite right as-is; but please read on)
However, I strongly advise against using it.
It is often a lot less error-prone, usually considered good practice, and in fact very common to have predictable outcomes of functions.
From all sorts of perspectives it is very undesirable to have a function that manipulates data beyond its own scope (i.e., in another workspace than its own), let alone assigns unpredictable data to unpredictable variable names. This is unnecessarily hard to debug and maintain, and it is not very portable. Also, using this function inside other functions does not do what someone who doesn't know your function would think it does.
Why not use something like a structure:
function B = read_excel(filename)
    ...
    B.data = xlsread(filename, 'B2:CT2');
    B.name = genvarname(name_str);
end
Then you always have the same name as output (B) which contains the same data (B.data) and whose name you can also use to reference other things dynamically (i.e., A.(B.name)).
Because this is a function, you need to pass the variables you create to an output variable. I suggest you do it through a struct, as you don't know upfront how many variables you want to output. So change the eval line to this:
% Create the vector with the correct name and data
eval(['B.' varname '= A;']);
Now you should have a struct called B that persists in the workspace after running the function, with field names equal to your dynamically created variable names. Say, for example, one varname is X; you can now access it in your workspace as B.X.
But you should think very carefully about this code design; dynamically creating variable names is very unlikely to be the best way to go.
An alternative to evalin is the function assignin. It is less powerful than evalin, but does exactly what you want: assign a variable in a workspace.
Usage:
assignin('base', 'var', val)