seaborn barplot TypeError - python-3.x

This is the info for my dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 550068 non-null int64
1 pID 550068 non-null object
2 Gender 550068 non-null category
3 Age 550068 non-null category
4 Occupation 550068 non-null category
5 City 550068 non-null category
6 Years 550068 non-null int64
7 Married 550068 non-null category
8 pCat1 550068 non-null category
9 pCat2 376430 non-null category
10 pCat3 166821 non-null category
11 Purchase 550068 non-null int64
dtypes: category(8), int64(3), object(1)
memory usage: 18.9+ MB
I am trying to plot a bar chart between City and Purchase using seaborn:
sns.barplot(data = train_df, x = "City", y = "Purchase")
I am getting the following error:
TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'
What am I doing wrong, and how can I fix it?
Thank you in advance.
NOTE: I am using the Anaconda Spyder 4.0 IDE and Python 3.7
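No answer is included on this page, but a workaround sometimes suggested for this class of int64-to-int32 casting error (it tends to appear on Windows, where numpy's default integer is 32-bit) is to relax the dtypes before plotting. A hedged sketch, not a confirmed fix for this environment:

import seaborn as sns

# train_df is the DataFrame from the question. Casting the categorical
# grouping column to plain strings and the int64 value column to float is a
# hypothetical workaround: the idea is to remove the int64 array that appears
# to trigger the unsafe downcast inside the plotting call.
plot_df = train_df.assign(City=train_df["City"].astype(str),
                          Purchase=train_df["Purchase"].astype(float))
sns.barplot(data=plot_df, x="City", y="Purchase")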

Related

How to parse correctly to date in python from specific ISO format

I connected to a table in a database that has two columns with dates.
I had no problem parsing the column with values formatted like this: 2017-11-03
But I can't find a way to parse the other column, with dates formatted like this: 2017-10-03 05:06:52.840 +02:00
My attempts
If I parse a single value through the strptime method
dt.datetime.strptime("2017-12-14 22:16:24.037 +02:00", "%Y-%m-%d %H:%M:%S.%f %z")
I get the correct output
datetime.datetime(2017, 12, 14, 22, 16, 24, 37000, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200)))
but if I try to use the same format string while parsing the table into the dataframe, the column dtype is object:
Licenze_FromGY = pd.read_sql(query, cnxn, parse_dates={"EndDate":"%Y-%m-%d", "LastUpd":"%Y-%m-%d %H:%M:%S.%f %z"})
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Tenant 1000 non-null int64
1 IdService 1000 non-null object
2 Code 1000 non-null object
3 Aggregate 1000 non-null object
4 Bundle 991 non-null object
5 Status 1000 non-null object
6 Value 1000 non-null int64
7 EndDate 258 non-null datetime64[ns]
8 Trial 1000 non-null bool
9 LastUpd 1000 non-null object
I also tried changing the format string, both in the read_sql call and in pd.to_datetime(), but then all the values become NaT:
Licenze_FromGY["LastUpd"] = pd.to_datetime(Licenze_FromGY["LastUpd"], format="%Y-%m-%d %H:%M:%S.%fZ", errors="coerce")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Tenant 1000 non-null int64
1 IdService 1000 non-null object
2 Code 1000 non-null object
3 Aggregate 1000 non-null object
4 Bundle 991 non-null object
5 Status 1000 non-null object
6 Value 1000 non-null int64
7 EndDate 258 non-null datetime64[ns]
8 Trial 1000 non-null bool
9 LastUpd 0 non-null datetime64[ns]
dtypes: bool(1), datetime64[ns](2), int64(2), object(5)
memory usage: 71.4+ KB
Can anyone help?
pandas cannot handle mixed UTC offsets in one Series (column). Assuming you have data like this
import pandas as pd
df = pd.DataFrame({"datetime": ["2017-12-14 22:16:24.037 +02:00",
                                "2018-08-14 22:16:24.037 +03:00"]})
if you just parse to datetime,
df["datetime"] = pd.to_datetime(df["datetime"])
df["datetime"]
0 2017-12-14 22:16:24.037000+02:00
1 2018-08-14 22:16:24.037000+03:00
Name: datetime, dtype: object
you get dtype object. The elements of the Series are Python datetime.datetime objects, which limits the datetime functionality compared to the pandas datetime64 dtype.
You can get the pandas dtype by parsing to UTC, for example:
df["datetime"] = pd.to_datetime(df["datetime"], utc=True)
df["datetime"]
0 2017-12-14 20:16:24.037000+00:00
1 2018-08-14 19:16:24.037000+00:00
Name: datetime, dtype: datetime64[ns, UTC]
You can then convert to an appropriate time zone to re-create the original UTC offsets:
df["datetime"] = pd.to_datetime(df["datetime"], utc=True).dt.tz_convert("Europe/Athens")
df["datetime"]
0 2017-12-14 22:16:24.037000+02:00
1 2018-08-14 22:16:24.037000+03:00
Name: datetime, dtype: datetime64[ns, Europe/Athens]
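Applied to the LastUpd column from the question, a minimal sketch of the same approach might look like this (the stand-in frame replaces the real read_sql result; "Europe/Athens" is the same illustrative zone as above):

import pandas as pd

# Stand-in for the result of pd.read_sql(query, cnxn, ...); the real frame
# would come from the database.
Licenze_FromGY = pd.DataFrame({"LastUpd": ["2017-10-03 05:06:52.840 +02:00",
                                           "2017-12-14 22:16:24.037 +02:00"]})

# Parse after reading, normalising the fixed offsets to UTC ...
Licenze_FromGY["LastUpd"] = pd.to_datetime(Licenze_FromGY["LastUpd"], utc=True)
# ... then convert to a time zone if local wall-clock times are wanted.
Licenze_FromGY["LastUpd"] = Licenze_FromGY["LastUpd"].dt.tz_convert("Europe/Athens")

print(Licenze_FromGY["LastUpd"].dtype)  # datetime64[ns, Europe/Athens]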

Filtering multiple rows when fixing OutOfBoundsDatetime: Out of bounds nanosecond timestamp multiple values

I tried the code below to convert one of my columns, search_departure_date, in the dataframe df to datetime format, and got the following error.
df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182005 entries, 0 to 182004
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 182005 non-null datetime64[ns]
1 device_type 182005 non-null object
2 search_origin 182005 non-null object
3 search_destination 182005 non-null object
4 search_route 182005 non-null object
5 search_adult_count 157378 non-null float64
6 search_child_count 157378 non-null float64
7 search_cabin_class 157378 non-null object
8 search_type 182005 non-null object
9 search_departure_date 182005 non-null object
10 search_arrival_date 97386 non-null datetime64[ns]
df["search_departure_date"] = pd.to_datetime(df.loc[:, 'search_departure_date'], format='%Y-%m-%d')
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1478-06-14 17:17:56
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1479-03-23 17:17:56
So I am trying to filter out the rows with these timestamp values:
df.loc[df['search_departure_date'] != '1478-06-14 17:17:56']
df.loc[df['search_departure_date'] != '1479-03-23 17:17:56']
How can I do this for multiple timestamps? I noticed they all begin with 1478 or 1479, and joining lots of conditions with the pipe (|) operator is cumbersome.
You can try errors='coerce', which turns out-of-bounds timestamps into NaT:
df["search_departure_date"] = pd.to_datetime(df['search_departure_date'], errors='coerce')
If instead you want to filter out the rows, you can build a boolean mask with str.match:
m = df['search_departure_date'].str.match('147(8|9)')
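For completeness, a minimal self-contained sketch of using such a mask to drop the problem rows before converting (the data here are made up to mirror the question):

import pandas as pd

df = pd.DataFrame({"search_departure_date": ["2023-06-14 17:17:56",
                                             "1478-06-14 17:17:56",
                                             "1479-03-23 17:17:56"]})

# Mask the rows whose year starts with 1478 or 1479 ...
m = df["search_departure_date"].str.match("147(8|9)")
# ... keep the rest, then convert safely.
df = df.loc[~m].copy()
df["search_departure_date"] = pd.to_datetime(df["search_departure_date"],
                                             format="%Y-%m-%d %H:%M:%S")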

Pandas csv reader - how to force a column to be a specific data type (and replace NaN with null)

I am just getting started with Pandas and I am reading a csv file using the read_csv() method. The difficulty I am having is that I need to set a column to a specific data type.
df = pd.read_csv('test_data.csv', delimiter=',', index_col=False)
my df looks like this:
RangeIndex: 4 entries, 0 to 3
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 product_code 4 non-null object
1 store_code 4 non-null int64
2 cost1 4 non-null float64
3 start_date 4 non-null int64
4 end_date 4 non-null int64
5 quote_reference 0 non-null float64
6 min1 4 non-null int64
7 cost2 2 non-null float64
8 min2 2 non-null float64
9 cost3 1 non-null float64
10 min3 1 non-null float64
dtypes: float64(6), int64(4), object(1)
memory usage: 480.0+ bytes
You can see that I have multiple 'min' columns: min1, min2, min3.
min1 is correctly detected as int64, but min2 and min3 are float64.
This is because min1 is fully populated, whereas min2 and min3 are sparsely populated; as the info output above shows, min2 has 2 NaN values.
Trying to change the data type using
df['min2'] = df['min2'].astype('int')
I get this error:
IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
Ideally I want to change the data type to Int and have NaN replaced by a null (i.e. I don't want a 0).
I have tried a variety of methods, e.g. fillna, but can't crack this.
All help greatly appreciated.
Since pandas 1.0, you can use the generic pandas.NA in place of numpy.nan; it serves as an integer-compatible NA.
To perform the conversion, use the nullable "Int64" dtype (note the capital I).
df['min2'] = df['min2'].astype('Int64')
Example:
s = pd.Series([1, 2, None, 3])
s.astype('Int64')
Or:
pd.Series([1, 2, None, 3], dtype='Int64')
Output:
0 1
1 2
2 <NA>
3 3
dtype: Int64
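If you would rather handle this at load time, read_csv can be given the nullable dtype up front; a sketch assuming the file and column names from the question and pandas >= 1.0:

import pandas as pd

# min2 and min3 are read directly as nullable integers, so missing values
# become <NA> instead of forcing the columns to float64.
df = pd.read_csv('test_data.csv', delimiter=',', index_col=False,
                 dtype={'min2': 'Int64', 'min3': 'Int64'})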

How does pandas.merge scale?

I have two dataframes A and B. A consumes about 15 times more memory and has about 30 times more rows than B. However, if I merge A with itself it takes only about 15 seconds, whereas merging B with itself takes about 15 minutes (pandas version 1.1.5).
# this chain of operations takes about 15 seconds with A
(A  # with B here and in the line below it takes about 15 minutes
 .merge(A, how='left', indicator=True)
 .query("_merge == 'left_only'")
 .drop(columns='_merge')
 .reset_index(drop=True))
Does anyone know how merges scale in pandas? Why is the merge of the large dataframe with itself so much faster than that of the small one with itself?
More details on the two dataframes:
Dataframe A:
RangeIndex: 7250117 entries, 0 to 7250116
Data columns (total 6 columns):
# Column Dtype
--- ------ -----
0 Order_ID int64
1 Color object
2 Size object
3 Country object
4 pieces_demand float64
5 count int64
dtypes: float64(1), int64(2), object(3)
memory usage: 331.9+ MB
DataFrame B:
RangeIndex: 234805 entries, 0 to 234804
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Order_ID 234805 non-null int64
1 Color 234805 non-null object
2 Size 234805 non-null object
3 pieces_demand 234805 non-null float64
4 pieces_ordered 234805 non-null float64
5 Size_Index 234805 non-null int64
6 dist_demand 234805 non-null float64
7 dist_ordered 234805 non-null float64
8 dist_discrete_demand 234805 non-null float64
9 dist_discrete_ordered 234805 non-null float64
10 count 234805 non-null int64
dtypes: float64(6), int64(3), object(2)
memory usage: 19.7+ MB
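For reference, the chain in the question is a left anti-join (keep the rows of the left frame that have no match in the right); a minimal sketch of the same pattern on toy data, which does not address the timing difference but shows what is being computed:

import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3], "val": ["b", "c"]})

# indicator=True adds a _merge column recording where each row matched;
# keeping 'left_only' rows gives the anti-join.
anti = (left
        .merge(right, how='left', indicator=True)
        .query("_merge == 'left_only'")
        .drop(columns='_merge')
        .reset_index(drop=True))
print(anti)  # only the key=1 row remains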

pandas.dataframe.astype is not converting dtype

I am trying to convert some columns from object to Categorical columns.
# dtyp_cat = 'category'
# mapper = {'Segment': dtyp_cat,
#           "Sub-Category": dtyp_cat,
#           "Postal Code": dtyp_cat,
#           "Region": dtyp_cat,
#           }
df.astype({'Segment':'category'})
df.dtypes
But the output is still object type.
Dataset is hosted at:
url = r"https://raw.githubusercontent.com/jaegarbomb/TSF_GRIP/main/Retail_EDA/Superstore.csv"
df = pd.read_csv(url)
Do this (astype returns a new DataFrame, so you need to assign the result back):
df['Segment'] = df.Segment.astype('category')
Which returns
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Ship Mode 9994 non-null object
1 Segment 9994 non-null category
2 Country 9994 non-null object
3 City 9994 non-null object
4 State 9994 non-null object
5 Postal Code 9994 non-null int64
6 Region 9994 non-null object
7 Category 9994 non-null object
8 Sub-Category 9994 non-null object
9 Sales 9994 non-null float64
10 Quantity 9994 non-null int64
11 Discount 9994 non-null float64
12 Profit 9994 non-null float64
dtypes: category(1), float64(3), int64(2), object(7)
memory usage: 946.9+ KB
EDIT
If you want to convert several columns (in your case, I suppose all of those that are objects), you can drop the ones that aren't, convert what's left, and then reattach the other columns.
df2 = df.drop([ 'Postal Code', 'Sales', 'Quantity', 'Discount', 'Profit'], axis=1)
df3 = df2.apply(lambda x: x.astype('category'))
which gives
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Ship Mode 9994 non-null category
1 Segment 9994 non-null category
2 Country 9994 non-null category
3 City 9994 non-null category
4 State 9994 non-null category
5 Region 9994 non-null category
6 Category 9994 non-null category
7 Sub-Category 9994 non-null category
dtypes: category(8)
memory usage: 115.2 KB
I'll leave reattaching the other columns to you. A hint would be:
df4 = pd.concat([df3, df], axis=1, sort=False)
df_final = df4.loc[:,~df4.columns.duplicated()]
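Alternatively, the commented-out mapper from the question works as written, as long as the result of astype is assigned back (astype returns a new DataFrame rather than modifying df in place):

import pandas as pd

url = r"https://raw.githubusercontent.com/jaegarbomb/TSF_GRIP/main/Retail_EDA/Superstore.csv"
df = pd.read_csv(url)

dtyp_cat = 'category'
mapper = {'Segment': dtyp_cat,
          'Sub-Category': dtyp_cat,
          'Postal Code': dtyp_cat,
          'Region': dtyp_cat}

# astype returns a converted copy; assign it back to keep the new dtypes
df = df.astype(mapper)
print(df.dtypes)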
