Featuretools - LookupError: Time index not found in dataframe - python-3.x

I have an input dataframe which I have split into 3 entities based on the attributes. When I try to generate features using featuretools I get the error mentioned in the title.
input dataframe in_df = ['UniqueID', 'disbursed_amount', 'asset_cost', 'ltv', 'branch_id', 'supplier_id', 'manufacturer_id', 'Current_pincode_ID', 'Date.of.Birth', 'Employment.Type', 'DisbursalDate', 'State_ID', 'Employee_code_ID', 'MobileNo_Avl_Flag', 'Aadhar_flag', 'PAN_flag', 'VoterID_flag', 'Driving_flag', 'Passport_flag', 'PERFORM_CNS.SCORE', 'PERFORM_CNS.SCORE.DESCRIPTION', 'PRI.NO.OF.ACCTS', 'PRI.ACTIVE.ACCTS', 'PRI.OVERDUE.ACCTS', 'PRI.CURRENT.BALANCE', 'PRI.SANCTIONED.AMOUNT', 'PRI.DISBURSED.AMOUNT', 'SEC.NO.OF.ACCTS', 'SEC.ACTIVE.ACCTS', 'SEC.OVERDUE.ACCTS', 'SEC.CURRENT.BALANCE', 'SEC.SANCTIONED.AMOUNT', 'SEC.DISBURSED.AMOUNT', 'PRIMARY.INSTAL.AMT', 'SEC.INSTAL.AMT', 'NEW.ACCTS.IN.LAST.SIX.MONTHS', 'DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS', 'AVERAGE.ACCT.AGE', 'CREDIT.HISTORY.LENGTH', 'NO.OF_INQUIRIES', 'loan_default']
I have split this up into 3 entities based on the information available in the dataset:
cust_cols = ['UniqueID','Current_pincode_ID', 'Employment.Type', 'State_ID', 'MobileNo_Avl_Flag', 'branch_id',
'Aadhar_flag', 'PAN_flag', 'VoterID_flag', 'Driving_flag', 'Passport_flag', 'asset_cost', 'Date.of.Birth']
customers_df = df_raw_train[cust_cols]
loan_info_cols = ['UniqueID', 'disbursed_amount', 'asset_cost', 'ltv', 'branch_id', 'supplier_id', 'manufacturer_id',
'Employee_code_ID', 'loan_default', 'DisbursalDate']
loan_info_df = df_raw_train[loan_info_cols]
bureau_cols = ['UniqueID','PERFORM_CNS.SCORE', 'PERFORM_CNS.SCORE.DESCRIPTION', 'PRI.NO.OF.ACCTS', 'PRI.ACTIVE.ACCTS',
'PRI.OVERDUE.ACCTS', 'PRI.CURRENT.BALANCE', 'PRI.SANCTIONED.AMOUNT', 'PRI.DISBURSED.AMOUNT',
'SEC.NO.OF.ACCTS', 'SEC.ACTIVE.ACCTS', 'SEC.OVERDUE.ACCTS', 'SEC.CURRENT.BALANCE', 'SEC.SANCTIONED.AMOUNT',
'SEC.DISBURSED.AMOUNT', 'PRIMARY.INSTAL.AMT', 'SEC.INSTAL.AMT', 'NEW.ACCTS.IN.LAST.SIX.MONTHS',
'DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS', 'NO.OF_INQUIRIES']
bureau_df = df_raw_train[bureau_cols]
customers_df.set_index(['UniqueID', 'branch_id'],inplace = True, append = True)
loan_info_df.set_index(['UniqueID', 'branch_id'], inplace = True, append = True)
entities = {"customers" : (customers_df, "UniqueID", "branch_id"), "loans" : (loan_info_df, "UniqueID", "branch_id"),
"bureau" : (bureau_df, "UniqueID")
}
relationships = [("loans", "UniqueID", "bureau", "UniqueID"),
("customers", "branch_id", "loans", "branch_id")]
feature_matrix_customers, features_defs = ft.dfs(entities=entities, relationships=relationships, target_entity="customers")
I am getting the error: LookupError: Time index not found in dataframe
Can someone help me understand why there is an error, since the featuretools docs do not mention any need to specify a time index?

Got this resolved by creating an EntitySet from the dataframes (entity_from_dataframe) instead of passing the entities dictionary directly to ft.dfs.
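For reference, a minimal sketch of that approach (assumptions: the legacy entity_from_dataframe API used in the question, UniqueID still present as a regular column in each dataframe rather than moved into the index, and DisbursalDate parseable as a datetime). In the entities-dictionary form, the third element of each tuple appears to be read as the time index, which would explain why "branch_id" triggered the lookup error once it had been moved into the index:
import featuretools as ft

# build an EntitySet explicitly instead of passing the entities dict to dfs
es = ft.EntitySet(id="vehicle_loans")
es = es.entity_from_dataframe(entity_id="customers", dataframe=customers_df,
                              index="UniqueID")
es = es.entity_from_dataframe(entity_id="loans", dataframe=loan_info_df,
                              index="UniqueID", time_index="DisbursalDate")
es = es.entity_from_dataframe(entity_id="bureau", dataframe=bureau_df,
                              index="UniqueID")
# relationships can then be added with ft.Relationship / es.add_relationship
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="customers")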


pyspark: concatenate multiple csv files into one

I need to use the function concat(Path trg, Path[] psrcs) from org.apache.hadoop.fs with pyspark.
My code is:
orig1_fs = spark._jvm.org.apache.hadoop.fs.Path(f'{tmp_path}{filename1}')
orig2_fs = spark._jvm.org.apache.hadoop.fs.Path(f'{tmp_path}{filename2}')
dest_fs = spark._jvm.org.apache.hadoop.fs.Path(dest_path)
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
fs.concat(dest_fs, list((orig1_fs , orig2_fs)))
But I get an error when I call concat this way.
How can I use the function?
That's because the second argument of the concat method is a Java array (Path[]), not an ArrayList or a Python list.
# transform the Python list into a Java `Path[]` array via the py4j gateway
py_paths = [orig1_fs, orig2_fs]
gateway = spark.sparkContext._gateway
java_paths = gateway.new_array(spark._jvm.org.apache.hadoop.fs.Path, len(py_paths))
for i in range(len(py_paths)):
    java_paths[i] = py_paths[i]
# you can use the new array now
fs.concat(dest_fs, java_paths)
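If this conversion is needed in more than one place, it can be wrapped in a small helper (a sketch; to_java_path_array is a hypothetical name, and it assumes the same spark session and FileSystem objects as above):
def to_java_path_array(spark, paths):
    """Convert a Python list of Hadoop Path objects into a Java Path[] array."""
    gateway = spark.sparkContext._gateway
    arr = gateway.new_array(spark._jvm.org.apache.hadoop.fs.Path, len(paths))
    for i, p in enumerate(paths):
        arr[i] = p
    return arr

fs.concat(dest_fs, to_java_path_array(spark, [orig1_fs, orig2_fs]))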

Making a copy of an object pandas column as a category type gives a warning

I'm copying a pandas object column into a separate ordered categorical column, but I'm getting a warning and haven't figured out for the life of me how to do it the correct way.
I can't post the entire data frame up but here is the syntax I'm using:
from pandas.api.types import CategoricalDtype
marriage_cat_type = CategoricalDtype(categories= ['M_22', 'M_23', 'M_24', 'M_25', 'M_26', 'M_27', 'M_28', 'M_29', 'M_30'
, 'M_31', 'M_32', 'M_33', 'M_34', 'M_35', 'M_36', 'M_37', 'M_38', 'M_39'
, 'M_40', 'M_41', 'M_42', 'M_43', 'M_44', 'M_45', 'M_46', 'M_47', 'M_48'
, 'M_49', 'M_50', 'M_51', 'M_52', 'M_53', 'M_54', 'M_55', 'M_56', 'M_57'
, 'M_58', 'M_59', 'M_60', 'M_61', 'M_62', 'M_63', 'M_64', 'M_65', 'M_66'
, 'M_67', 'M_68', 'M_69', 'M_70', 'M_71', 'M_72', 'M_73', 'M_74', 'M_75'
, 'M_76', 'M_77', 'M_78', 'M_79', 'M_80', 'M_81', 'M_82', 'M_999', 'S_18'
, 'S_19', 'S_20', 'S_21', 'S_22', 'S_23', 'S_24', 'S_25', 'S_26', 'S_27'
, 'S_28', 'S_29', 'S_30', 'S_31', 'S_32', 'S_33', 'S_34', 'S_35', 'S_36'
, 'S_37', 'S_38', 'S_39', 'S_40', 'S_41', 'S_42', 'S_43', 'S_44', 'S_45'
, 'S_46', 'S_47', 'S_48', 'S_49', 'S_50', 'S_51', 'S_52', 'S_53', 'S_54'
, 'S_55', 'S_56', 'S_57', 'S_58', 'S_59', 'S_60', 'S_61', 'S_62', 'S_63'
, 'S_64', 'S_65', 'S_66', 'S_67', 'S_68', 'S_69', 'S_70', 'S_71', 'S_72'
, 'S_73', 'S_74', 'S_75', 'S_77', 'S_79', 'S_999'], ordered = True)
coll_train['marriage_statusXage_codes'] = coll_train['marital_statusXage2'].astype(marriage_cat_type)
I'm getting this warning.
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:2:
SettingWithCopyWarning: A value is trying to be set on a copy of a
slice from a DataFrame. Try using .loc[row_indexer,col_indexer] =
value instead
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
And I've tried this resulting in failure:
coll_train['marriage_statusXage_codes'] = coll_train.loc[:,
'marital_statusXage2'].astype(marriage_cat_type)
Can anyone point me in the right direction?
This is a chained assignment issue. The behaviour of the check can be controlled with pd.set_option('chained_assignment', None|'warn'|'raise').
The warning is turned on, and pandas suspects that coll_train is itself a copy of a slice of another DataFrame.
There are a couple of options: make sure coll_train is the source dataframe you intend to modify (which you are doing by placing a new column named marriage_statusXage_codes in it). If it is, and pandas is wrong, set pd.set_option('chained_assignment', None). Can pandas be wrong about that? I don't know.
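If coll_train was itself produced by slicing or filtering a larger DataFrame, the usual way to make the warning go away legitimately is to take an explicit copy before adding columns (a sketch, using the same column names as in the question):
# make coll_train an independent DataFrame rather than a view of its source
coll_train = coll_train.copy()
coll_train['marriage_statusXage_codes'] = coll_train['marital_statusXage2'].astype(marriage_cat_type)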
Here is an illustration of setting a value on a slice.
import pandas as pd
from pandas.compat import StringIO
print(pd.__version__)
csvdata = StringIO("""date,LASTA,LASTB,LASTC
1999-03-15,2.5597,8.20145,16.900
1999-03-31,2.7724,7.73057,16.955
1999-04-01,2.8321,7.63714,17.500
1999-04-06,2.8537,7.63703,17.750""")
df = pd.read_csv(csvdata, sep=",", index_col="date", parse_dates=True, infer_datetime_format=True)
pd.set_option('chained_assignment','warn')
a_slice = df['1999-03-31':'1999-04-01']
print(id(df), id(a_slice))
# generates the warning
a_slice['LASTA'] = 10
# original does not have the data set on a slice!
print(df[df['LASTA'] == 10]['LASTA'].any())
# create a new object to which values can be set, no warning.
a_slice = a_slice.copy()
a_slice['LASTA'] = 10
print(a_slice[a_slice['LASTA'] == 10]['LASTA'].any())
Results
0.20.3
(4549520208, 4594637776)
slicecopy.py:20: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
a_slice['LASTA'] = 10
False
True
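For completeness, when the goal really is to modify the original frame for the sliced rows, the .loc form the warning suggests writes to df directly (a brief sketch using the same toy frame as above):
# write through the original DataFrame via .loc instead of through the slice
df.loc['1999-03-31':'1999-04-01', 'LASTA'] = 10
print(df[df['LASTA'] == 10]['LASTA'].any())  # True: the original now holds the value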

InvalidResponse error in bigquery load_table_from_file

I am trying to upload a set of CSV data into a BigQuery table from a BytesIO object, but keep getting the error InvalidResponse: Response headers must contain header 'location'.
Here is my code
# self.database = authenticated bigquery.Client
config = bigquery.LoadJobConfig()
config.skip_leading_rows = 1
config.source_format = bigquery.SourceFormat.CSV
config.allow_jagged_rows = True
schema = [
    bigquery.SchemaField("date", "DATE", mode="REQUIRED"),
    bigquery.SchemaField("page_id", "STRING", mode="REQUIRED")
]
# ... Appending a list of bigquery.SchemaField("name", "INTEGER")
config.schema = schema
table = self.get_or_create_table(name, config.schema)  # returns TableReference
file = self.clip_data(local_fp, cutoff_date)  # returns BytesIO
job = self.database.load_table_from_file(
    file, table,
    num_retries=self.options.num_retries,
    job_id=uuid.uuid4().int,
    job_config=config
)  # Error is here.
I have tried searching around but I cannot find any reason or fix for this exception.
InvalidResponse: ('Response headers must contain header', 'location')
The problem was caused by not providing a location in the load_table_from_file call. Adding location="US" to the call was enough to fix the problem.
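For reference, a sketch of the fixed call (same objects as in the question; the location value has to match the dataset's region, and job_id is expected to be a string, so the uuid is converted here):
job = self.database.load_table_from_file(
    file, table,
    num_retries=self.options.num_retries,
    job_id=str(uuid.uuid4()),   # job IDs are strings
    job_config=config,
    location="US"               # must match the dataset's location
)
job.result()  # wait for the load job to complete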

es.normalize_entity error variable not found in entity

I am using the featuretools documentation to learn entitysets and am currently getting the error KeyError: 'Variable: device not found in entity' for the following piece of code:
import featuretools as ft
data = ft.demo.load_mock_customer()
customers_df = data["customers"]
customers_df
sessions_df = data["sessions"]
sessions_df.sample(5)
transactions_df = data["transactions"]
transactions_df.sample(10)
products_df = data["products"]
products_df
### Creating an entity set
es = ft.EntitySet(id="transactions")
### Adding entities
es = es.entity_from_dataframe(entity_id="transactions", dataframe=transactions_df, index="transaction_id", time_index="transaction_time", variable_types={"product_id": ft.variable_types.Categorical})
es
es["transactions"].variables
es = es.entity_from_dataframe(entity_id="products",dataframe=products_df,index="product_id")
es
### Adding new relationship
new_relationship = ft.Relationship(es["products"]["product_id"],
                                   es["transactions"]["product_id"])
es = es.add_relationship(new_relationship)
es
### Creating entity from existing table
es = es.normalize_entity(base_entity_id="transactions",
                         new_entity_id="sessions",
                         index="session_id",
                         additional_variables=["device", "customer_id", "zip_code"])
This is as per the URL - https://docs.featuretools.com/loading_data/using_entitysets.html
From the API es.normalize_entity it appears that the function should create a new entity 'sessions' with 'session_id' as its index and the other 3 variables, however the error is:
C:\Users\s_belvi\AppData\Local\Continuum\Anaconda2\lib\site-packages\featuretools\entityset\entity.pyc in _get_variable(self, variable_id)
250 return v
251
--> 252 raise KeyError("Variable: %s not found in entity" % (variable_id))
253
254 @property
KeyError: 'Variable: device not found in entity'
Do we need to create entity "sessions" separately before using es.normalize_entity? Looks like something syntactically has gone wrong in the flow, some minor mistake..
The error here arises from device not being a column in your transactions_df. The "transactions" table referenced in that page of the documentation has more columns than demo.load_mock_customer in its dictionary form. You can find the rest of the columns using the return_single_table argument. Here's a full working example of normalize_entity which is only slightly modified from the code that you tried:
import featuretools as ft
data = ft.demo.load_mock_customer(return_single_table=True)
es = ft.EntitySet(id="Mock Customer")
es = es.entity_from_dataframe(entity_id="transactions",
dataframe=data,
index="transaction_id",
time_index="transaction_time",
variable_types={"product_id": ft.variable_types.Categorical})
es = es.normalize_entity(base_entity_id="transactions",
new_entity_id="sessions",
index = "session_id",
additional_variables=["device","customer_id","zip_code"])
This will return an EntitySet with two Entities and one Relationship:
Entityset: Mock Customer
Entities:
transactions [Rows: 500, Columns: 8]
sessions [Rows: 35, Columns: 5]
Relationships:
transactions.session_id -> sessions.session_id
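To see why the original code failed, it can help to compare the columns of the two forms of load_mock_customer (a quick check sketch; the exact column lists may vary by featuretools version):
import featuretools as ft

data = ft.demo.load_mock_customer()
print(data["transactions"].columns.tolist())   # no 'device' column in the dictionary form
single = ft.demo.load_mock_customer(return_single_table=True)
print(single.columns.tolist())                 # includes 'device', 'customer_id', 'zip_code'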

Melting a list in a column to multiple rows in a dataframe

What I am trying to do: I have a list of star_cast and a list of genres for each movie entity. I want to melt these lists down into repeating rows in the data frame so that I can store it in a database system.
import pandas as pd

director_name = ['chris','guy','bryan']
genre = [['mystery','thriller'],['comedy','crime'],['action','adventure','sci -fi']]
gross_vlaue = [2544,236544,265888]
imdb_ratings = [8.5,5.4,3.2]
metascores = [80.0,55.0,64.0]
movie_names = ['memento','snatch','x-men']
runtime = [113.0,102.0,104.0]
star_cast = [['abc','ced','gef'],['aaa','abc'],['act','cst','gst','hhs']]
votes = [200,2150,2350]
sample_data = pd.DataFrame({"movie_names": movie_names,
                            "imdb_ratings": imdb_ratings,
                            "metscores": metascores,
                            "votes": votes,
                            "runtime": runtime,
                            "genre": genre,
                            "director_name": director_name,
                            "star_cast": star_cast,
                            "gross_value": gross_vlaue})
The above generates a sample of the DataFrame I have.
director_name = ['chris','chris','chris','chris','chris','chris','guy','guy','guy','guy','bryan','bryan','bryan','bryan','bryan','bryan','bryan','bryan','bryan','bryan','bryan','bryan']
genre = ['mystery','thriller','mystery','thriller','mystery','thriller','comedy','crime','comedy','crime','action','adventure','sci -fi','action','adventure','sci -fi','action','adventure','sci -fi','action','adventure','sci -fi']
gross_vlaue = [2544,2544,2544,2544,2544,2544,236544,236544,236544,236544,265888,265888,265888,265888,265888,265888,265888,265888,265888,265888,265888,265888]
imdb_ratings = [8.5,8.5,8.5,8.5,8.5,8.5,5.4,5.4,5.4,5.4,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2]
metascores = [80.0,80.0,80.0,80.0,80.0,80.0,55.0,55.0,55.0,55.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0]
movie_names = ['memento','memento','memento','memento','memento','memento','snatch','snatch','snatch','snatch','x-men','x-men','x-men','x-men','x-men','x-men','x-men','x-men','x-men','x-men','x-men','x-men']
runtime = [113.0,113.0,113.0,113.0,113.0,113.0,102.0,102.0,102.0,102.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0]
star_cast = ['abc','ced','gef','abc','ced','gef','aaa','abc','aaa','abc','act','cst','gst','hhs','act','cst','gst','hhs','act','cst','gst','hhs']
votes = [200,200,200,200,200,200,2150,2150,2150,2150,2350,2350,2350,2350,2350,2350,2350,2350,2350,2350,2350,2350]
sample_result = pd.DataFrame({"movie_names": movie_names,
                              "imdb_ratings": imdb_ratings,
                              "metscores": metascores,
                              "votes": votes,
                              "runtime": runtime,
                              "genre": genre,
                              "director_name": director_name,
                              "star_cast": star_cast,
                              "gross_value": gross_vlaue})
This generates the format I want to convert my data into.
I tried using melt() but had no luck there. Please help with how this can be achieved in an effective way. My dataset is fairly large, so using for loops will be very slow. Is there another way to solve this?
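One vectorized way to get there (a sketch, not from the original thread; it assumes pandas 0.25 or newer, where DataFrame.explode is available): exploding genre and then star_cast turns each movie into one row per genre/star combination, giving the same 22 rows as the target above, though possibly in a different order.
# each movie expands into len(genre) * len(star_cast) rows,
# with the scalar columns repeated on every row
exploded = (sample_data
            .explode("genre")
            .explode("star_cast")
            .reset_index(drop=True))
print(exploded.shape)   # (22, 9) for the sample above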
