es.normalize_entity error variable not found in entity - featuretools

I am using the featuretools documentation to learn entitysets and am currently getting the error KeyError: 'Variable: device not found in entity' for the following piece of code:
import featuretools as ft
data = ft.demo.load_mock_customer()
customers_df = data["customers"]
customers_df
sessions_df = data["sessions"]
sessions_df.sample(5)
transactions_df = data["transactions"]
transactions_df.sample(10)
products_df = data["products"]
products_df
### Creating an entity set
es = ft.EntitySet(id="transactions")
### Adding entities
es = es.entity_from_dataframe(entity_id="transactions",
                              dataframe=transactions_df,
                              index="transaction_id",
                              time_index="transaction_time",
                              variable_types={"product_id": ft.variable_types.Categorical})
es
es["transactions"].variables
es = es.entity_from_dataframe(entity_id="products",dataframe=products_df,index="product_id")
es
### Adding new relationship
new_relationship = ft.Relationship(es["products"]["product_id"],
                                   es["transactions"]["product_id"])
es = es.add_relationship(new_relationship)
es
### Creating entity from existing table
es = es.normalize_entity(base_entity_id="transactions",
                         new_entity_id="sessions",
                         index="session_id",
                         additional_variables=["device", "customer_id", "zip_code"])
This is as per the URL - https://docs.featuretools.com/loading_data/using_entitysets.html
From the API docs for es.normalize_entity, it appears that the function should create a new entity 'sessions' with 'session_id' as its index, plus the other 3 variables. However the error is:
C:\Users\s_belvi\AppData\Local\Continuum\Anaconda2\lib\site-packages\featuretools\entityset\entity.pyc in _get_variable(self, variable_id)
    250                 return v
    251
--> 252         raise KeyError("Variable: %s not found in entity" % (variable_id))
    253
    254     @property

KeyError: 'Variable: device not found in entity'
Do we need to create the entity "sessions" separately before using es.normalize_entity? It looks like something has gone wrong syntactically in the flow, some minor mistake.

The error arises because device is not a column in your transactions_df. The "transactions" table referenced on that page of the documentation has more columns than the one in the dictionary returned by ft.demo.load_mock_customer. You can get the remaining columns by using the return_single_table argument. Here's a full working example of normalize_entity, only slightly modified from the code you tried:
import featuretools as ft

data = ft.demo.load_mock_customer(return_single_table=True)

es = ft.EntitySet(id="Mock Customer")
es = es.entity_from_dataframe(entity_id="transactions",
                              dataframe=data,
                              index="transaction_id",
                              time_index="transaction_time",
                              variable_types={"product_id": ft.variable_types.Categorical})
es = es.normalize_entity(base_entity_id="transactions",
                         new_entity_id="sessions",
                         index="session_id",
                         additional_variables=["device", "customer_id", "zip_code"])
This will return an EntitySet with two Entities and one Relationship:
Entityset: Mock Customer
  Entities:
    transactions [Rows: 500, Columns: 8]
    sessions [Rows: 35, Columns: 5]
  Relationships:
    transactions.session_id -> sessions.session_id
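If you keep following the tutorial's flow, the same call can be repeated to split the customer-level columns out of sessions. A minimal sketch, assuming zip_code ended up on the sessions entity as in the call above:
es = es.normalize_entity(base_entity_id="sessions",
                         new_entity_id="customers",
                         index="customer_id",
                         additional_variables=["zip_code"])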

Related

kedro dynamic catalog creation only for specific nodes before their run

I have several thousand files of different types to process, and I am creating catalog entries dynamically with hooks. I first tried the after_catalog_created hook, but it fires too early in the run, and I need those entries only for specific nodes. My current attempt uses before_node_run for specific node tags, returning a dictionary with just the dynamically created entries. The node function takes **kwargs only. It partly works, in that the node receives the updated inputs, but the problem is that the node specification must reference an already existing catalog entry. So I have a fake placeholder entry, and I use it to build a dictionary of the same length as the dictionary returned by the hook.
Pipeline code
for doc in docs["Type1_documents"]:
item = doc["name"]
item_name, _ = os.path.splitext(item)
type1_datasets_dict[item_name] = "brace_dictionary"
return Pipeline(
[
node(
func=func1,
inputs=type1_datasets_dict,
outputs=[
f"output1",
f"output2",
],
name=f"type1_eta",
tags=["dynamic-catalog", "type1", "data-engineering"],
)
]
)
Hook code
@hook_impl
def before_node_run(
    self, node: Node, catalog: DataCatalog
) -> Optional[Dict[str, Any]]:
    self.node = node
    self.catalog = catalog
    if "dynamic-catalog" in node.tags:
        input_catalog_name = node.name
        catalog_string = f"params:{input_catalog_name}.full_name"
        if self.catalog.exists(catalog_string):
            true_datasets_dict = {}
            catalog_properties = self.catalog.load(f"params:{input_catalog_name}")
            catalog_name = catalog_properties["full_name"]
            type = catalog_properties["type"]
            subtype = catalog_properties["subtype"]
            datasets_dict = self.catalog.load(f"params:{catalog_name}")
            for dataset in datasets_dict:
                doc_name, _ = os.path.splitext(dataset["name"])
                self.add_text_dataset(
                    name=doc_name,
                    folder=f"parsed/{type}/{subtype}",
                )
                true_datasets_dict[doc_name] = doc_name
            return true_datasets_dict
    return true_datasets_dict
But I am getting a ValueError for this:
line 487, in _run_with_dict
raise ValueError(
ValueError: Node type1_eta: func1([brace_dictionary,brace_dictionary,brace_dictionary,..,brace_dictionary]) -> [output1, output2] expected 1 input(s) ['brace_dictionary'], but got the following 1497 input(s) instead: ['file1', 'file2', ...].
Is there another way to do this conditionally?
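For what it's worth, the ValueError reflects kedro's contract for before_node_run: the returned dictionary must be keyed by the dataset names the node already declares as inputs, and the values become the data passed to the node. Since the node declares only the placeholder 'brace_dictionary', one way to stay within that contract is to bundle everything under that single key. A minimal sketch of the idea, not a tested solution, assuming the same params structure as in the hook above:
import os
from typing import Any, Dict, Optional

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline.node import Node

class BundledDynamicCatalogHooks:
    @hook_impl
    def before_node_run(
        self, node: Node, catalog: DataCatalog
    ) -> Optional[Dict[str, Any]]:
        if "dynamic-catalog" not in node.tags:
            return None
        catalog_properties = catalog.load(f"params:{node.name}")
        datasets_dict = catalog.load(f"params:{catalog_properties['full_name']}")
        # bundle the dynamic entries into one value under the single
        # declared input name, so the input counts match
        bundled = {}
        for dataset in datasets_dict:
            doc_name, _ = os.path.splitext(dataset["name"])
            bundled[doc_name] = doc_name
        return {"brace_dictionary": bundled}
The node function, being **kwargs only, would then receive a single brace_dictionary keyword argument holding the whole mapping.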

How to add multiple fields' reference to "unique_together" error message

I have a model with multiple fields being checked for uniqueness:
class AudQuestionList(BaseTimeStampModel):
    aud_ques_list_id = models.AutoField(primary_key=True, ...)
    aud_ques_list_num = models.CharField(max_length=26, ...)
    aud_ques_list_doc_type = models.ForeignKey(DocType, ...)
    short_text = models.CharField(max_length=55, ...)
    aud_scope_standards = models.ForeignKey(ScopeStandard, ...)
    aud_freqency = models.ForeignKey(AuditFrequency, ...)
    aud_process = models.ForeignKey(AuditProcesses, ...)

    class Meta:
        unique_together = [['aud_scope_standards', 'aud_freqency', 'aud_process']]
My model form is as described below:
class CreateAudQuestionListForm(forms.ModelForm):
    class Meta:
        model = AudQuestionList
        fields = ('aud_ques_list_doc_type', 'aud_scope_standards', 'aud_freqency', 'aud_process', 'short_text', ...)

    def validate_unique(self):
        try:
            self.instance.validate_unique()
        except ValidationError:
            self._update_errors({'aud_scope_standards': _('Record exists for the combination of key values.')})
This works well, except that the field names (labels) themselves are missing from the message.
Is there a way to add the field labels to the message above, say something like:
Record exists for the combination of key fields + %(field_labels)s.
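One way to do this (a sketch, not a drop-in solution) is to read the labels off self.fields inside validate_unique and interpolate them into the message. The unique_fields tuple and the message text here are assumptions based on the model above:
from django import forms
from django.core.exceptions import ValidationError
from django.utils.translation import gettext as _

class CreateAudQuestionListForm(forms.ModelForm):
    class Meta:
        model = AudQuestionList
        fields = ('aud_ques_list_doc_type', 'aud_scope_standards', 'aud_freqency', 'aud_process', 'short_text')

    def validate_unique(self):
        try:
            self.instance.validate_unique()
        except ValidationError:
            # collect the human-readable labels of the unique_together fields
            unique_fields = ('aud_scope_standards', 'aud_freqency', 'aud_process')
            field_labels = ", ".join(
                str(self.fields[name].label or name) for name in unique_fields
            )
            self._update_errors({
                'aud_scope_standards': _('Record exists for the combination of %(field_labels)s.')
                % {'field_labels': field_labels},
            })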

Peewee-async - How to do a simple JOIN (or subquery / prefetch)

I'm stuck on a pretty simple issue with peewee-async regarding JOINs, or perhaps I need to use a subquery or prefetch. I can't figure out what kind of query I need.
I have 2 database tables (parent/child):
class Group(PeeweeModel):
    id = peewee.AutoField()
    name = peewee.TextField()

class Channel(PeeweeModel):
    id = peewee.AutoField()
    name = peewee.TextField()
    group = peewee.ForeignKeyField(Group, backref="channels")
I need to fetch 1 group object, and this object has multiple channel objects.
I tried:
q = Group.select(Group, Channel).join(Channel)
But my backref 'channels' is always a ModelSelect instance, not the actual result set.
Full code
import asyncio
import peewee
import peewee_async
from peewee_async import Manager, PooledPostgresqlDatabase

database = PooledPostgresqlDatabase('test', max_connections=4, user='postgres', password='', host='127.0.0.1')
objects = peewee_async.Manager(database)

class PeeweeModel(peewee.Model):
    class Meta:
        database = database

class Group(PeeweeModel):
    id = peewee.AutoField()
    name = peewee.TextField()

class Channel(PeeweeModel):
    id = peewee.AutoField()
    name = peewee.TextField()
    group = peewee.ForeignKeyField(Group, backref="channels")

Group.create_table()
Channel.create_table()
database.set_allow_sync(False)

async def handler():
    # create 1 group object
    group = await objects.create(Group, name="TestGroup")
    # create 2 channel objects, assign to group
    await objects.create(Channel, name="TestName1", group=group)
    await objects.create(Channel, name="TestName2", group=group)
    # Query 1 group, and hopefully it will have the channels
    q = Group.select(Group, Channel).join(Channel)
    results = await objects.execute(q)
    for result in results:
        print(result.channels)  # problem: channels is not a list of Channel objects, but a `ModelSelect` instead
    with objects.allow_sync():
        Channel.drop_table(True)
        Group.drop_table(True)

loop = asyncio.get_event_loop()
loop.run_until_complete(handler())
loop.close()
I was able to get help from an expert™ and the solution is to use prefetch():
async def handler():
    # create 1 group object
    group = await objects.create(Group, name="TestGroup")
    # create 2 channel objects, assign to group
    await objects.create(Channel, name="TestName", group=group)
    await objects.create(Channel, name="TestName", group=group)
    # Query the groups; prefetch will attach the channels
    q = Group.select(Group)
    groups = await objects.prefetch(q, Channel.select(Channel))
    for group in groups:
        print(group, group.channels)  # channels is a list of channels
    with objects.allow_sync():
        Channel.drop_table(True)
        Group.drop_table(True)
Peewee will figure out the relationship (backref) by itself.
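For reference, prefetch issues one query per model and stitches the child rows onto their parents via the foreign key, rather than doing a SQL JOIN. The synchronous equivalent in plain peewee looks like this (a sketch, assuming the same models):
from peewee import prefetch

groups = prefetch(Group.select(), Channel.select())
for group in groups:
    print(group.name, [channel.name for channel in group.channels])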

Flask and SQLAlchemy's Column type

I'm creating a Flask app with SQLAlchemy.
The Odoo ORM has a field type named 'Selection'.
I want to do something like this in SQLAlchemy:
STATUS = [('admin','Admin'),('guest','Guest')]
status = db.Column(db.Selection, selection=STATUS)
With Odoo's ORM:
status = fields.Selection(STATUS, default='guest')
I don't want to create a new model just to store these values: 'admin', 'guest', 'superadmin'.
Thanks for any help.
You can use SQLAlchemy's Enum type. See the SQLAlchemy 1.3 documentation: class sqlalchemy.types.Enum(*enums, **kw).
import enum

class Status(enum.Enum):
    ADMIN = "admin"
    GUEST = "guest"

class MyTable(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    status = db.Column(db.Enum(Status), default=Status.GUEST)
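A usage sketch (assuming the MyTable model above and a configured Flask-SQLAlchemy db session); note that you filter with the enum member, not the raw string:
# create a row with an explicit status
row = MyTable(status=Status.ADMIN)
db.session.add(row)
db.session.commit()

# query by enum member
admins = MyTable.query.filter_by(status=Status.ADMIN).all()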

Featuretools - LookupError: Time index not found in dataframe

I have an input dataframe which I have split into 3 entities based on the attributes. When I try to generate features using featuretools I get the above-mentioned error.
The input dataframe's columns:
in_df = ['UniqueID', 'disbursed_amount', 'asset_cost', 'ltv', 'branch_id', 'supplier_id', 'manufacturer_id', 'Current_pincode_ID', 'Date.of.Birth', 'Employment.Type', 'DisbursalDate', 'State_ID', 'Employee_code_ID', 'MobileNo_Avl_Flag', 'Aadhar_flag', 'PAN_flag', 'VoterID_flag', 'Driving_flag', 'Passport_flag', 'PERFORM_CNS.SCORE', 'PERFORM_CNS.SCORE.DESCRIPTION', 'PRI.NO.OF.ACCTS', 'PRI.ACTIVE.ACCTS', 'PRI.OVERDUE.ACCTS', 'PRI.CURRENT.BALANCE', 'PRI.SANCTIONED.AMOUNT', 'PRI.DISBURSED.AMOUNT', 'SEC.NO.OF.ACCTS', 'SEC.ACTIVE.ACCTS', 'SEC.OVERDUE.ACCTS', 'SEC.CURRENT.BALANCE', 'SEC.SANCTIONED.AMOUNT', 'SEC.DISBURSED.AMOUNT', 'PRIMARY.INSTAL.AMT', 'SEC.INSTAL.AMT', 'NEW.ACCTS.IN.LAST.SIX.MONTHS', 'DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS', 'AVERAGE.ACCT.AGE', 'CREDIT.HISTORY.LENGTH', 'NO.OF_INQUIRIES', 'loan_default']
I have split this up into 3 entities based on the information available on the dataset:
cust_cols = ['UniqueID','Current_pincode_ID', 'Employment.Type', 'State_ID', 'MobileNo_Avl_Flag', 'branch_id',
'Aadhar_flag', 'PAN_flag', 'VoterID_flag', 'Driving_flag', 'Passport_flag', 'asset_cost', 'Date.of.Birth']
customers_df = df_raw_train[cust_cols]
loan_info_cols = ['UniqueID', 'disbursed_amount', 'asset_cost', 'ltv', 'branch_id', 'supplier_id', 'manufacturer_id',
'Employee_code_ID', 'loan_default', 'DisbursalDate']
loan_info_df = df_raw_train[loan_info_cols]
bureau_cols = ['UniqueID','PERFORM_CNS.SCORE', 'PERFORM_CNS.SCORE.DESCRIPTION', 'PRI.NO.OF.ACCTS', 'PRI.ACTIVE.ACCTS',
'PRI.OVERDUE.ACCTS', 'PRI.CURRENT.BALANCE', 'PRI.SANCTIONED.AMOUNT', 'PRI.DISBURSED.AMOUNT',
'SEC.NO.OF.ACCTS', 'SEC.ACTIVE.ACCTS', 'SEC.OVERDUE.ACCTS', 'SEC.CURRENT.BALANCE', 'SEC.SANCTIONED.AMOUNT',
'SEC.DISBURSED.AMOUNT', 'PRIMARY.INSTAL.AMT', 'SEC.INSTAL.AMT', 'NEW.ACCTS.IN.LAST.SIX.MONTHS',
'DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS', 'NO.OF_INQUIRIES']
bureau_df = df_raw_train[bureau_cols]
customers_df.set_index(['UniqueID', 'branch_id'],inplace = True, append = True)
loan_info_df.set_index(['UniqueID', 'branch_id'], inplace = True, append = True)
entities = {"customers" : (customers_df, "UniqueID", "branch_id"), "loans" : (loan_info_df, "UniqueID", "branch_id"),
"bureau" : (bureau_df, "UniqueID")
}
relationships = [("loans", "UniqueID", "bureau", "UniqueID"),
("customers", "branch_id", "loans", "branch_id")]
feature_matrix_customers, features_defs = ft.dfs(entities=entities, relationships=relationships, target_entity="customers")
I am getting the error: LookupError: Time index not found in dataframe
Can someone help with why this error occurs, as the featuretools docs do not mention any need to specify a time index?
Got this resolved by creating entity sets from dataframes.
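For context: in the entities dictionary passed to ft.dfs, the third element of each tuple is read as that entity's time index, so ("customers": (customers_df, "UniqueID", "branch_id")) asks featuretools for a branch_id time column, which the set_index(..., append=True) calls have already moved out of the columns. Building an EntitySet explicitly avoids this. A sketch assuming the dataframes above, with the index columns left as ordinary columns:
import featuretools as ft

es = ft.EntitySet(id="loans")
es = es.entity_from_dataframe(entity_id="loans",
                              dataframe=loan_info_df,
                              index="UniqueID",
                              time_index="DisbursalDate")
es = es.entity_from_dataframe(entity_id="bureau",
                              dataframe=bureau_df,
                              index="UniqueID")
Relationships can then be added with ft.Relationship, as in the first question above.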
