I'm copying a pandas object column into a separate ordered categorical column, but I'm getting a warning and can't for the life of me figure out the correct way to do it.
I can't post the entire data frame, but here is the syntax I'm using:
# CategoricalDtype lives in pandas.api.types
from pandas.api.types import CategoricalDtype

marriage_cat_type = CategoricalDtype(categories=[
    'M_22', 'M_23', 'M_24', 'M_25', 'M_26', 'M_27', 'M_28', 'M_29', 'M_30',
    'M_31', 'M_32', 'M_33', 'M_34', 'M_35', 'M_36', 'M_37', 'M_38', 'M_39',
    'M_40', 'M_41', 'M_42', 'M_43', 'M_44', 'M_45', 'M_46', 'M_47', 'M_48',
    'M_49', 'M_50', 'M_51', 'M_52', 'M_53', 'M_54', 'M_55', 'M_56', 'M_57',
    'M_58', 'M_59', 'M_60', 'M_61', 'M_62', 'M_63', 'M_64', 'M_65', 'M_66',
    'M_67', 'M_68', 'M_69', 'M_70', 'M_71', 'M_72', 'M_73', 'M_74', 'M_75',
    'M_76', 'M_77', 'M_78', 'M_79', 'M_80', 'M_81', 'M_82', 'M_999', 'S_18',
    'S_19', 'S_20', 'S_21', 'S_22', 'S_23', 'S_24', 'S_25', 'S_26', 'S_27',
    'S_28', 'S_29', 'S_30', 'S_31', 'S_32', 'S_33', 'S_34', 'S_35', 'S_36',
    'S_37', 'S_38', 'S_39', 'S_40', 'S_41', 'S_42', 'S_43', 'S_44', 'S_45',
    'S_46', 'S_47', 'S_48', 'S_49', 'S_50', 'S_51', 'S_52', 'S_53', 'S_54',
    'S_55', 'S_56', 'S_57', 'S_58', 'S_59', 'S_60', 'S_61', 'S_62', 'S_63',
    'S_64', 'S_65', 'S_66', 'S_67', 'S_68', 'S_69', 'S_70', 'S_71', 'S_72',
    'S_73', 'S_74', 'S_75', 'S_77', 'S_79', 'S_999'], ordered=True)
coll_train['marriage_statusXage_codes'] = coll_train['marital_statusXage2'].astype(marriage_cat_type)
I'm getting this warning:
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:2:
SettingWithCopyWarning: A value is trying to be set on a copy of a
slice from a DataFrame. Try using .loc[row_indexer,col_indexer] =
value instead
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I've also tried this, which fails the same way:
coll_train['marriage_statusXage_codes'] = coll_train.loc[:,
'marital_statusXage2'].astype(marriage_cat_type)
Can anyone point me in the right direction?
This is a chained-assignment issue. The behaviour can be controlled with pd.set_option('chained_assignment', None|'warn'|'raise').
The warnings are turned on, and pandas suspects that coll_train is itself a copy of a slice of another DataFrame, so your assignment might not reach the frame you intend.
There are a couple of options. First, make sure coll_train is the source dataframe you intend to modify (which you are doing by placing a new column named marriage_statusXage_codes in it). If it is, and the warning is a false positive, set pd.set_option('chained_assignment', None). Can pandas be wrong about that? It can; the check is heuristic, so false positives do happen.
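The safer option is to break the chain explicitly: if coll_train was produced by slicing or filtering a larger DataFrame, take an explicit copy before adding columns. A minimal sketch, assuming a hypothetical origin for coll_train (the frame and filter names below are made up for illustration):
# Hypothetical origin of coll_train: a filtered slice of a bigger frame.
# Written like this, the later column assignment triggers the warning:
# coll_train = all_data[all_data['split'] == 'train']
# Taking an explicit copy breaks the chain, so no warning is raised:
coll_train = all_data[all_data['split'] == 'train'].copy()
coll_train['marriage_statusXage_codes'] = coll_train['marital_statusXage2'].astype(marriage_cat_type)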
Here is an illustration of setting a value on a slice.
import pandas as pd
# pandas.compat.StringIO works on older pandas (this example ran on 0.20.3);
# on pandas >= 0.25 use `from io import StringIO` instead
from pandas.compat import StringIO
print(pd.__version__)
csvdata = StringIO("""date,LASTA,LASTB,LASTC
1999-03-15,2.5597,8.20145,16.900
1999-03-31,2.7724,7.73057,16.955
1999-04-01,2.8321,7.63714,17.500
1999-04-06,2.8537,7.63703,17.750""")
df = pd.read_csv(csvdata, sep=",", index_col="date", parse_dates=True, infer_datetime_format=True)
pd.set_option('chained_assignment','warn')
a_slice = df['1999-03-31':'1999-04-01']
print(id(df), id(a_slice))
# generates the warning
a_slice['LASTA'] = 10
# the original DataFrame never received the value that was set on the slice
print(df[df['LASTA'] == 10]['LASTA'].any())
# create a new object to which values can be set, no warning.
a_slice = a_slice.copy()
a_slice['LASTA'] = 10
print(a_slice[a_slice['LASTA'] == 10]['LASTA'].any())
Results
0.20.3
(4549520208, 4594637776)
slicecopy.py:20: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
a_slice['LASTA'] = 10
False
True
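For completeness, here is the fix the warning text itself suggests, continuing the example above: write through .loc on the original frame instead of assigning to the slice.
# Setting values through .loc on the original DataFrame raises no warning,
# and the change is visible in df itself.
df.loc['1999-03-31':'1999-04-01', 'LASTA'] = 10
print(df[df['LASTA'] == 10]['LASTA'].any())  # True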
I am using the featuretools documentation to learn EntitySet and am currently getting the error KeyError: 'Variable: device not found in entity' for the following piece of code:
import featuretools as ft
data = ft.demo.load_mock_customer()
customers_df = data["customers"]
customers_df
sessions_df = data["sessions"]
sessions_df.sample(5)
transactions_df = data["transactions"]
transactions_df.sample(10)
products_df = data["products"]
products_df
### Creating an entity set
es = ft.EntitySet(id="transactions")
### Adding entities
es = es.entity_from_dataframe(entity_id="transactions", dataframe=transactions_df, index="transaction_id", time_index="transaction_time", variable_types={"product_id": ft.variable_types.Categorical})
es
es["transactions"].variables
es = es.entity_from_dataframe(entity_id="products",dataframe=products_df,index="product_id")
es
### Adding new relationship
new_relationship = ft.Relationship(es["products"]["product_id"],
es["transactions"]["product_id"])
es = es.add_relationship(new_relationship)
es
### Creating entity from existing table
es = es.normalize_entity(base_entity_id="transactions",
new_entity_id="sessions",
index = "session_id",
additional_variables=["device", "customer_id", "zip_code"])
This follows the documentation at https://docs.featuretools.com/loading_data/using_entitysets.html
From the API docs for es.normalize_entity, it appears the function should create a new entity 'sessions' with 'session_id' as its index and carry over the remaining three variables, but instead the error is:
C:\Users\s_belvi\AppData\Local\Continuum\Anaconda2\lib\site-packages\featuretools\entityset\entity.pyc in _get_variable(self, variable_id)
250 return v
251
--> 252 raise KeyError("Variable: %s not found in entity" % (variable_id))
253
254 @property
KeyError: 'Variable: device not found in entity'
Do we need to create the entity "sessions" separately before using es.normalize_entity? It looks like something minor has gone wrong syntactically in the flow.
The error arises because device is not a column in your transactions_df. The "transactions" table referenced in that page of the documentation has more columns than the one in the dictionary returned by ft.demo.load_mock_customer(). You can get all of the columns by using the return_single_table argument. Here's a full working example of normalize_entity, only slightly modified from the code that you tried:
import featuretools as ft
data = ft.demo.load_mock_customer(return_single_table=True)
es = ft.EntitySet(id="Mock Customer")
es = es.entity_from_dataframe(entity_id="transactions",
dataframe=data,
index="transaction_id",
time_index="transaction_time",
variable_types={"product_id": ft.variable_types.Categorical})
es = es.normalize_entity(base_entity_id="transactions",
new_entity_id="sessions",
index = "session_id",
additional_variables=["device","customer_id","zip_code"])
This will return an EntitySet with two Entities and one Relationship:
Entityset: Mock Customer
Entities:
transactions [Rows: 500, Columns: 8]
sessions [Rows: 35, Columns: 5]
Relationships:
transactions.session_id -> sessions.session_id
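To sanity-check what normalize_entity did, you can inspect the dataframe behind each entity. A small sketch, assuming a featuretools release of this vintage, where an Entity exposes its data through the .df attribute:
# "device", "customer_id" and "zip_code" now live on the sessions entity;
# transactions keeps session_id as the link back to it.
print(es["sessions"].df.head())
print(es["transactions"].df.columns)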
What I am trying to do: I have a list of star_cast and a list of genre for each movie entity, and I want to melt these lists down into repeating rows in the data frame so that I can store it in a database system.
import pandas as pd

director_name = ['chris','guy','bryan']
genre = [['mystery','thriller'],['comedy','crime'],['action','adventure','sci-fi']]
gross_value = [2544,236544,265888]
imdb_ratings = [8.5,5.4,3.2]
metascores = [80.0,55.0,64.0]
movie_names = ['memento','snatch','x-men']
runtime = [113.0,102.0,104.0]
star_cast = [['abc','ced','gef'],['aaa','abc'],['act','cst','gst','hhs']]
votes = [200,2150,2350]
sample_data = pd.DataFrame({"movie_names":movie_names,
"imdb_ratings":imdb_ratings,
"metscores":metascores,
"votes":votes,
"runtime":runtime,
"genre":genre,
"director_name": director_name,
"star_cast": star_cast,
"gross_value":gross_vlaue
})
The above generates a sample of the data frame I have.
director_name = ['chris','chris','chris','chris','chris','chris','guy','guy','guy','guy','bryan','bryan','bryan','bryan','bryan','bryan','bryan','bryan','bryan','bryan','bryan','bryan']
genre = ['mystery','thriller','mystery','thriller','mystery','thriller','comedy','crime','comedy','crime','action','adventure','sci-fi','action','adventure','sci-fi','action','adventure','sci-fi','action','adventure','sci-fi']
gross_value = [2544,2544,2544,2544,2544,2544,236544,236544,236544,236544,265888,265888,265888,265888,265888,265888,265888,265888,265888,265888,265888,265888]
imdb_ratings = [8.5,8.5,8.5,8.5,8.5,8.5,5.4,5.4,5.4,5.4,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2]
metascores = [80.0,80.0,80.0,80.0,80.0,80.0,55.0,55.0,55.0,55.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0]
movie_names = ['memento','memento','memento','memento','memento','memento','snatch','snatch','snatch','snatch','x-men','x-men','x-men','x-men','x-men','x-men','x-men','x-men','x-men','x-men','x-men','x-men']
runtime = [113.0,113.0,113.0,113.0,113.0,113.0,102.0,102.0,102.0,102.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0]
star_cast = ['abc','ced','gef','abc','ced','gef','aaa','abc','aaa','abc','act','cst','gst','hhs','act','cst','gst','hhs','act','cst','gst','hhs']
votes = [200,200,200,200,200,200,2150,2150,2150,2150,2350,2350,2350,2350,2350,2350,2350,2350,2350,2350,2350,2350]
sample_result = pd.DataFrame({"movie_names":movie_names,
"imdb_ratings":imdb_ratings,
"metscores":metascores,
"votes":votes,
"runtime":runtime,
"genre":genre,
"director_name": director_name,
"star_cast": star_cast,
"gross_value":gross_vlaue
})
This generates the format I want to convert my data into.
I tried using melt(), but no luck there. How can this be achieved effectively? My dataset is fairly large, so for loops will be very slow. Is there another way to solve this?
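For reference, a minimal sketch of one loop-free approach, assuming pandas >= 0.25 where DataFrame.explode is available: exploding the two list columns one after the other fans each movie out into one row per (genre, star_cast) pair, repeating the scalar columns automatically.
# Explode genre first, then star_cast: each movie becomes
# len(genre) * len(star_cast) rows, with scalar columns repeated.
flattened = (sample_data
             .explode('genre')
             .explode('star_cast')
             .reset_index(drop=True))
print(flattened.shape)  # (22, 9) for the sample data above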