pandas.DataFrame.astype is not converting dtype - python-3.x

I am trying to convert some columns from object dtype to categorical.
# dtyp_cat = 'category'
# mapper = {'Segment':dtyp_cat,
# "Sub-Category":dtyp_cat,
# "Postal Code":dtyp_cat,
# "Region":dtyp_cat,
# }
df.astype({'Segment':'category'})
df.dtypes
But the output still shows the object dtype.
Dataset is hosted at:
url = r"https://raw.githubusercontent.com/jaegarbomb/TSF_GRIP/main/Retail_EDA/Superstore.csv"
df = pd.read_csv(url)

astype returns a new DataFrame rather than modifying df in place, and your code never assigns the result back. Do this:
df['Segment'] = df.Segment.astype('category')
after which df.info() shows:
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Ship Mode 9994 non-null object
1 Segment 9994 non-null category
2 Country 9994 non-null object
3 City 9994 non-null object
4 State 9994 non-null object
5 Postal Code 9994 non-null int64
6 Region 9994 non-null object
7 Category 9994 non-null object
8 Sub-Category 9994 non-null object
9 Sales 9994 non-null float64
10 Quantity 9994 non-null int64
11 Discount 9994 non-null float64
12 Profit 9994 non-null float64
dtypes: category(1), float64(3), int64(2), object(7)
memory usage: 946.9+ KB
EDIT
If you want to convert several columns (in your case, I suppose, all of the object columns), you need to drop those that aren't objects, convert what's left, and then reattach the other columns.
df2 = df.drop([ 'Postal Code', 'Sales', 'Quantity', 'Discount', 'Profit'], axis=1)
df3 = df2.apply(lambda x: x.astype('category'))
which gives
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Ship Mode 9994 non-null category
1 Segment 9994 non-null category
2 Country 9994 non-null category
3 City 9994 non-null category
4 State 9994 non-null category
5 Region 9994 non-null category
6 Category 9994 non-null category
7 Sub-Category 9994 non-null category
dtypes: category(8)
memory usage: 115.2 KB
I'll leave reattaching the other columns to you. A hint would be:
df4 = pd.concat([df3, df], axis=1, sort=False)
df_final = df4.loc[:,~df4.columns.duplicated()]
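A shorter alternative (a sketch of the same idea, not part of the original answer): build the mapper from the object columns and convert them in a single astype call, remembering to assign the result back, since astype returns a new DataFrame.
import pandas as pd

url = "https://raw.githubusercontent.com/jaegarbomb/TSF_GRIP/main/Retail_EDA/Superstore.csv"
df = pd.read_csv(url)

# every object column becomes category in one call; the result must be assigned back
obj_cols = df.select_dtypes(include='object').columns
df = df.astype({col: 'category' for col in obj_cols})
df.dtypes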

Related

Filtering multiple rows to fix OutOfBoundsDatetime: Out of bounds nanosecond timestamp (multiple values)

I have tried this code to convert one of my columns, search_departure_date, in the dataframe df into datetime format, and I get the following error.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182005 entries, 0 to 182004
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 182005 non-null datetime64[ns]
1 device_type 182005 non-null object
2 search_origin 182005 non-null object
3 search_destination 182005 non-null object
4 search_route 182005 non-null object
5 search_adult_count 157378 non-null float64
6 search_child_count 157378 non-null float64
7 search_cabin_class 157378 non-null object
8 search_type 182005 non-null object
9 search_departure_date 182005 non-null object
10 search_arrival_date 97386 non-null datetime64[ns]
df["search_departure_date"] = pd.to_datetime(df.loc[:, 'search_departure_date'], format='%Y-%m-%d')
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1478-06-14 17:17:56
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1479-03-23 17:17:56
So I am trying to filter out the rows with these timestamp values:
df.loc[df['search_departure_date'] != '1478-06-14 17:17:56']
df.loc[df['search_departure_date'] != '1479-03-23 17:17:56']
How can I do it for multiple timestamps? I noticed they all begin with 1478 or 1479; joining lots of them with the pipe (|) operator is cumbersome.
You can try errors='coerce', which turns values that cannot be parsed (including out-of-bounds dates) into NaT:
df["search_departure_date"] = pd.to_datetime(df['search_departure_date'], errors='coerce')
If you want to filter out the rows instead, you can use str.match:
m = df['search_departure_date'].str.match('147(8|9)')
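and then drop the flagged rows with boolean indexing before converting (a sketch, assuming search_departure_date is still stored as strings at that point):
df = df.loc[~m]  # keep only rows whose departure date does not start with 1478 or 1479
df['search_departure_date'] = pd.to_datetime(df['search_departure_date'])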

How does pandas.merge scale?

I have two dataframes A and B. A consumes about 15 times more memory and has about 30 times more rows than B. However, if I merge A to itself it takes only about 15 seconds, whereas merging B to itself takes about 15 minutes (pandas version 1.1.5).
# this chain of operations takes about 15 seconds with A
(A # with B here and in the line below it takes about 15 minutes
.merge(A, how='left', indicator=True)
.query("_merge == 'left_only'")
.drop(columns='_merge')
.reset_index(drop=True))
Does anyone know how merges scale in pandas? Why is the merge of the large dataframe with itself so much faster than that of the small one with itself?
More details on the two dataframes:
DataFrame A:
RangeIndex: 7250117 entries, 0 to 7250116
Data columns (total 6 columns):
# Column Dtype
--- ------ -----
0 Order_ID int64
1 Color object
2 Size object
3 Country object
4 pieces_demand float64
5 count int64
dtypes: float64(1), int64(2), object(3)
memory usage: 331.9+ MB
DataFrame B:
RangeIndex: 234805 entries, 0 to 234804
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Order_ID 234805 non-null int64
1 Color 234805 non-null object
2 Size 234805 non-null object
3 pieces_demand 234805 non-null float64
4 pieces_ordered 234805 non-null float64
5 Size_Index 234805 non-null int64
6 dist_demand 234805 non-null float64
7 dist_ordered 234805 non-null float64
8 dist_discrete_demand 234805 non-null float64
9 dist_discrete_ordered 234805 non-null float64
10 count 234805 non-null int64
dtypes: float64(6), int64(3), object(2)
memory usage: 19.7+ MB

seaborn barplot TypeError

This is the info for my dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 550068 non-null int64
1 pID 550068 non-null object
2 Gender 550068 non-null category
3 Age 550068 non-null category
4 Occupation 550068 non-null category
5 City 550068 non-null category
6 Years 550068 non-null int64
7 Married 550068 non-null category
8 pCat1 550068 non-null category
9 pCat2 376430 non-null category
10 pCat3 166821 non-null category
11 Purchase 550068 non-null int64
dtypes: category(8), int64(3), object(1)
memory usage: 18.9+ MB
I am trying to plot a bar chart between City and Purchase using seaborn:
sns.barplot(data = train_df, x = "City", y = "Purchase")
I am getting the following error:
TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'
What am I doing wrong, and how can I fix it?
Thank you in advance.
NOTE: I am using the Anaconda Spyder 4.0 IDE and Python 3.7.

Printing info() in pandas: the reported number of entries and the index range are not the same

In a Jupyter notebook I printed df.info() and the result is:
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20620 entries, 0 to 24867
Data columns (total 3 columns):
neighborhood 20620 non-null object
bedrooms 20620 non-null float64
price 20620 non-null float64
dtypes: float64(2), object(1)
memory usage: 644.4+ KB
Why does it show 20620 entries from 0 to 24867? Shouldn't the last number be 20620 or 20619?
It means that not every possible index value has been used.
For example,
In [13]: df = pd.DataFrame([10,20], index=[0,100])
In [14]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 100
Data columns (total 1 columns):
0 2 non-null int64
dtypes: int64(1)
memory usage: 32.0 bytes
df has 2 entries, but the Int64Index "ranges" from 0 to 100.
DataFrames can easily end up like this if rows have been deleted, or if df is a sub-DataFrame of another DataFrame.
If you reset the index, the index labels will be renumbered in order, starting from 0:
In [17]: df.reset_index(drop=True)
Out[17]:
0
0 10
1 20
In [18]: df.reset_index(drop=True).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
0 2 non-null int64
dtypes: int64(1)
memory usage: 96.0 bytes
To be more precise, as Chris points out, the line
Int64Index: 2 entries, 0 to 100
is merely reporting the first and last value in the Int64Index. It's not reporting min or max values. There can be higher or lower integers in the index:
In [32]: pd.DataFrame([10,20,30], index=[50,0,50]).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 50 to 50 # notice index value 0 is not mentioned
Data columns (total 1 columns):
0 3 non-null int64
dtypes: int64(1)
memory usage: 48.0 bytes
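If you want the actual extremes of the index rather than just its first and last labels, you can query the index directly (a small sketch continuing the example above):
In [33]: idx = pd.DataFrame([10,20,30], index=[50,0,50]).index
In [34]: idx.min(), idx.max()
Out[34]: (0, 50)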

Merge changes Pandas types

I'm using Python 3 (don't know if the info is relevant).
I have 2 Pandas DataFrames (coming from read_csv()): Compact and SDSS_DR7_to_DR8. Before the merge, they contain types as follows:
Compact.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2070 entries, 0 to 2069
Data columns (total 8 columns):
Group 2070 non-null int64
Id 2070 non-null int64
RA 2070 non-null float64
Dec 2070 non-null float64
z 2070 non-null float64
R 2070 non-null float64
G 2070 non-null float64
objid 2070 non-null int64
dtypes: float64(5), int64(3)
memory usage: 129.5 KB
And
SDSS_DR7_to_DR8.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243500 entries, 0 to 243499
Data columns (total 5 columns):
specobjid 243500 non-null int64
dr8objid 243500 non-null int64
dr7objid 243500 non-null int64
ra 243500 non-null float64
dec 243500 non-null float64
dtypes: float64(2), int64(3)
memory usage: 9.3 MB
I perform Compact = pd.merge(Compact, SDSS_DR7_to_DR8, left_on=['objid'], right_on=['dr8objid'], how='left'). It executes without error, but the result is a mess. When I check the types in the new DataFrame, I get this:
Compact.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2070 entries, 0 to 2069
Data columns (total 13 columns):
Group 2070 non-null int64
Id 2070 non-null int64
RA 2070 non-null float64
Dec 2070 non-null float64
z 2070 non-null float64
R 2070 non-null float64
G 2070 non-null float64
objid 2070 non-null int64
specobjid 1275 non-null float64
dr8objid 1275 non-null float64
dr7objid 1275 non-null float64
ra 1275 non-null float64
dec 1275 non-null float64
dtypes: float64(10), int64(3)
memory usage: 226.4 KB
So during the merge, dr8objid (and some other columns) have been cast to float64. How is this possible, and what can I do to prevent it (hoping this is the origin of the mess in the merge)?
EDIT
So, to be more specific: if I create df
df=pd.DataFrame(data=[[1000000000000000000,1]], columns=['key','data'])
key and data are both int64. I create a transcoding df:
trans=pd.DataFrame(data=[[1000000000000000000,2000000000000000000]],
columns=['key','key2'])
whose two keys are int64. Then
df2 = pd.merge(df, trans, on=['key'], how='left')
gives a nice result, and key, key2 and data are still int64.
Nevertheless, if I define
df=pd.DataFrame(data=[[1000000000000000000,1],[1000000000000000001,2]],
columns=['key','data'])
Now, after the merge, key2 has switched to float64. How can I prevent this? Is it because NaN must be represented as a float? If so, is it possible to make the merge write 0 or -1 where there is no correspondence, keeping the whole column as int64?
Update: as of pandas 0.24, there are nullable integer data types.
As of this writing, pandas does not seem to choose the nullable int data type for the result of the merge, but it is possible to convert both frames to the nullable integer type Int64 before the merge.
Consider
df=pd.DataFrame(data=[[1000000000000000000,1],[1000000000000000001,2]],
columns=['key','data']).astype("Int64")
trans=pd.DataFrame(data=[[1000000000000000000,2000000000000000000]],
columns=['key','key2']).astype("Int64")
df2 = pd.merge(df, trans, on=['key'], how='left')
Result:
>>> df2
key data key2
0 1000000000000000000 1 2000000000000000000
1 1000000000000000001 2 <NA>
>>> df2.dtypes
key Int64
data Int64
key2 Int64
dtype: object
Original answer, for Pandas < v0.24:
Is it because NaN must be connected with a float ?
Correct. NaN cannot be represented in an integer dtype, so a column with missing values can only be stored as float.
You could either filter your data before merging, making sure that no NaNs are created, or fill in the NaNs with a value of your choosing after the merge and then restore the dtype.
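A sketch of that second option, using the -1 sentinel mentioned in the question and then restoring int64:
df2 = pd.merge(df, trans, on=['key'], how='left')      # key2 comes back as float64 because of the NaN
df2['key2'] = df2['key2'].fillna(-1).astype('int64')   # -1 marks rows with no correspondence
Bear in mind that int64 values larger than 2**53 can lose precision while they sit in a float64 column, which may well be the source of the mess seen with dr8objid above, so for large identifier columns the nullable Int64 route is the safer one.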
