Printing df.info() in pandas: the number of entries and the index range are not the same - python-3.x

In a Jupyter notebook I printed df.info() and the result is:
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20620 entries, 0 to 24867
Data columns (total 3 columns):
neighborhood 20620 non-null object
bedrooms 20620 non-null float64
price 20620 non-null float64
dtypes: float64(2), object(1)
memory usage: 644.4+ KB
Why does it show 20620 entries from 0 to 24867? Shouldn't the last number be 20619 (or 20620)?

It means that not every possible index value has been used.
For example,
In [13]: df = pd.DataFrame([10,20], index=[0,100])
In [14]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 100
Data columns (total 1 columns):
0 2 non-null int64
dtypes: int64(1)
memory usage: 32.0 bytes
df has 2 entries, but the Int64Index "ranges" from 0 to 100.
DataFrames can easily end up like this if rows have been deleted, or if df is a sub-DataFrame of another DataFrame.
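For example (a minimal sketch with made-up data), deleting rows from a DataFrame that starts with the default 0..n-1 index leaves gaps, so info() reports fewer entries than the last label suggests:
import pandas as pd

df = pd.DataFrame({'a': range(5)})   # default index 0, 1, 2, 3, 4
df = df.drop([1, 3])                 # delete two rows; labels 0, 2, 4 remain
df.info()                            # reports 3 entries, 0 to 4
# (the index class in the output is Int64Index in older pandas, plain Index in newer versions)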
If you reset the index, the index labels will be renumbered in order, starting from 0:
In [17]: df.reset_index(drop=True)
Out[17]:
0
0 10
1 20
In [18]: df.reset_index(drop=True).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
0 2 non-null int64
dtypes: int64(1)
memory usage: 96.0 bytes
To be more precise, as Chris points out, the line
Int64Index: 2 entries, 0 to 100
is merely reporting the first and last value in the Int64Index. It's not reporting min or max values. There can be higher or lower integers in the index:
In [32]: pd.DataFrame([10,20,30], index=[50,0,50]).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 50 to 50 # notice index value 0 is not mentioned
Data columns (total 1 columns):
0 3 non-null int64
dtypes: int64(1)
memory usage: 48.0 bytes

Related

How to correctly parse dates in Python from a specific ISO format

I connected to a table in a database that has two columns with dates.
I had no problem parsing the column with values formatted like this: 2017-11-03
But I can't find a way to parse the other column, with dates formatted like this: 2017-10-03 05:06:52.840 +02:00
My attempts
If I parse a single value through the strptime method
dt.datetime.strptime("2017-12-14 22:16:24.037 +02:00", "%Y-%m-%d %H:%M:%S.%f %z")
I get the correct output
datetime.datetime(2017, 12, 14, 22, 16, 24, 37000, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200)))
but if I try to use the same format string while parsing the table into the dataframe, the column dtype stays object:
Licenze_FromGY = pd.read_sql(query, cnxn, parse_dates={"EndDate":"%Y-%m-%d", "LastUpd":"%Y-%m-%d %H:%M:%S.%f %z"})
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Tenant 1000 non-null int64
1 IdService 1000 non-null object
2 Code 1000 non-null object
3 Aggregate 1000 non-null object
4 Bundle 991 non-null object
5 Status 1000 non-null object
6 Value 1000 non-null int64
7 EndDate 258 non-null datetime64[ns]
8 Trial 1000 non-null bool
9 LastUpd 1000 non-null object
I also tried changing the format string, either in the read_sql method or in the pd.to_datetime() method, but then all the values become NaT:
Licenze_FromGY["LastUpd"] = pd.to_datetime(Licenze_FromGY["LastUpd"], format="%Y-%m-%d %H:%M:%S.%fZ", errors="coerce")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Tenant 1000 non-null int64
1 IdService 1000 non-null object
2 Code 1000 non-null object
3 Aggregate 1000 non-null object
4 Bundle 991 non-null object
5 Status 1000 non-null object
6 Value 1000 non-null int64
7 EndDate 258 non-null datetime64[ns]
8 Trial 1000 non-null bool
9 LastUpd 0 non-null datetime64[ns]
dtypes: bool(1), datetime64[ns](2), int64(2), object(5)
memory usage: 71.4+ KB
None
Can anyone help?
pandas cannot handle mixed UTC offsets in one Series (column). Assuming you have data like this
import pandas as pd
df = pd.DataFrame({"datetime": ["2017-12-14 22:16:24.037 +02:00",
"2018-08-14 22:16:24.037 +03:00"]})
if you just parse to datetime,
df["datetime"] = pd.to_datetime(df["datetime"])
df["datetime"]
0 2017-12-14 22:16:24.037000+02:00
1 2018-08-14 22:16:24.037000+03:00
Name: datetime, dtype: object
you get dtype object. The elements of the Series are Python datetime.datetime objects, which limits the datetime functionality compared to the pandas datetime64 dtype.
You can get the pandas dtype by parsing to UTC, for example:
df["datetime"] = pd.to_datetime(df["datetime"], utc=True)
df["datetime"]
0 2017-12-14 20:16:24.037000+00:00
1 2018-08-14 19:16:24.037000+00:00
Name: datetime, dtype: datetime64[ns, UTC]
You might set an appropriate time zone to re-create the UTC offset:
df["datetime"] = pd.to_datetime(df["datetime"], utc=True).dt.tz_convert("Europe/Athens")
df["datetime"]
0 2017-12-14 22:16:24.037000+02:00
1 2018-08-14 22:16:24.037000+03:00
Name: datetime, dtype: datetime64[ns, Europe/Athens]
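Applied to the LastUpd column from the question (a sketch, assuming read_sql returns the values as strings; the time zone here is only illustrative):
# parse the mixed-offset strings into a single tz-aware dtype (UTC)
Licenze_FromGY["LastUpd"] = pd.to_datetime(Licenze_FromGY["LastUpd"], utc=True)
# optionally convert to a fixed time zone to recover local wall-clock times
Licenze_FromGY["LastUpd"] = Licenze_FromGY["LastUpd"].dt.tz_convert("Europe/Rome")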

Pandas csv reader - how to force a column to be a specific data type (and replace NaN with null)

I am just getting started with Pandas and I am reading a csv file using the read_csv() method. The difficulty I am having is that I need to set a column to a specific data type.
df = pd.read_csv('test_data.csv', delimiter=',', index_col=False)
my df looks like this:
RangeIndex: 4 entries, 0 to 3
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 product_code 4 non-null object
1 store_code 4 non-null int64
2 cost1 4 non-null float64
3 start_date 4 non-null int64
4 end_date 4 non-null int64
5 quote_reference 0 non-null float64
6 min1 4 non-null int64
7 cost2 2 non-null float64
8 min2 2 non-null float64
9 cost3 1 non-null float64
10 min3 1 non-null float64
dtypes: float64(6), int64(4), object(1)
memory usage: 480.0+ bytes
You can see that I have multiple 'min' columns: min1, min2, min3.
min1 is correctly detected as int64, but min2 and min3 are float64.
This is because min1 is fully populated, whereas min2 and min3 are sparsely populated.
As you can see in my df, min2 has 2 NaN values.
Trying to change the data type using
df['min2'] = df['min2'].astype('int')
I get this error:
IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
Ideally I want to change the data type to Int and have NaN replaced by a NULL (i.e. I don't want a 0).
I have tried a variety of methods, e.g. fillna, but can't crack this.
All help greatly appreciated.
Since pandas 1.0, you can use the generic pandas.NA to replace numpy.nan; it can serve as an integer NA.
To perform the conversion, use the "Int64" dtype (note the capital I).
df['min2'] = df['min2'].astype('Int64')
Example:
s = pd.Series([1, 2, None, 3])
s.astype('Int64')
Or:
pd.Series([1, 2, None, 3], dtype='Int64')
Output:
0 1
1 2
2 <NA>
3 3
dtype: Int64
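You can also ask for the nullable dtype directly when reading the file, so the column never becomes float64 in the first place (a sketch, assuming the same test_data.csv and column names as above):
df = pd.read_csv('test_data.csv', delimiter=',', index_col=False,
                 dtype={'min2': 'Int64', 'min3': 'Int64'})
df.info()   # min2 and min3 now show the nullable Int64 dtype, with <NA> where values are missing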

How does pandas.merge scale?

I have two dataframes A and B. A consumes about 15 times more memory and has about 30 times more rows than B. However, if I merge A with itself it takes only about 15 seconds, whereas merging B with itself takes about 15 minutes (pandas version 1.1.5).
# this chain of operations takes about 15 seconds with A
(A # with B here and in the line below it takes about 15 minutes
.merge(A, how='left', indicator=True)
.query("_merge == 'left_only'")
.drop(columns='_merge')
.reset_index(drop=True))
Does anyone know how merges scale in pandas? Why is the merge of the large dataframe with itself so much faster than that of the small one with itself?
More details on the two dataframes:
DataFrame A:
RangeIndex: 7250117 entries, 0 to 7250116
Data columns (total 6 columns):
# Column Dtype
--- ------ -----
0 Order_ID int64
1 Color object
2 Size object
3 Country object
4 pieces_demand float64
5 count int64
dtypes: float64(1), int64(2), object(3)
memory usage: 331.9+ MB
DataFrame B:
RangeIndex: 234805 entries, 0 to 234804
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Order_ID 234805 non-null int64
1 Color 234805 non-null object
2 Size 234805 non-null object
3 pieces_demand 234805 non-null float64
4 pieces_ordered 234805 non-null float64
5 Size_Index 234805 non-null int64
6 dist_demand 234805 non-null float64
7 dist_ordered 234805 non-null float64
8 dist_discrete_demand 234805 non-null float64
9 dist_discrete_ordered 234805 non-null float64
10 count 234805 non-null int64
dtypes: float64(6), int64(3), object(2)
memory usage: 19.7+ MB

Merge changes Pandas types

I'm using Python 3 (don't know if the info is relevant).
I have 2 Pandas DataFrames (coming from read_csv()): Compact and SDSS_DR7_to_DR8. Before the merge, they contain the following types:
Compact.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2070 entries, 0 to 2069
Data columns (total 8 columns):
Group 2070 non-null int64
Id 2070 non-null int64
RA 2070 non-null float64
Dec 2070 non-null float64
z 2070 non-null float64
R 2070 non-null float64
G 2070 non-null float64
objid 2070 non-null int64
dtypes: float64(5), int64(3)
memory usage: 129.5 KB
And
SDSS_DR7_to_DR8.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243500 entries, 0 to 243499
Data columns (total 5 columns):
specobjid 243500 non-null int64
dr8objid 243500 non-null int64
dr7objid 243500 non-null int64
ra 243500 non-null float64
dec 243500 non-null float64
dtypes: float64(2), int64(3)
memory usage: 9.3 MB
I perform Compact = pd.merge(Compact, SDSS_DR7_to_DR8, left_on=['objid'], right_on=['dr8objid'], how='left'). It executes without error, but the result is a mess. When I check the types in the new DataFrame, I get this:
Compact.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2070 entries, 0 to 2069
Data columns (total 13 columns):
Group 2070 non-null int64
Id 2070 non-null int64
RA 2070 non-null float64
Dec 2070 non-null float64
z 2070 non-null float64
R 2070 non-null float64
G 2070 non-null float64
objid 2070 non-null int64
specobjid 1275 non-null float64
dr8objid 1275 non-null float64
dr7objid 1275 non-null float64
ra 1275 non-null float64
dec 1275 non-null float64
dtypes: float64(10), int64(3)
memory usage: 226.4 KB
So during the merge, dr8objid (and some other columns) have been cast to float64. How is this possible, and what can I do to prevent it (hoping this is the origin of the mess in the merge)?
EDIT
So, to be more specific: if I create df
df=pd.DataFrame(data=[[1000000000000000000,1]], columns=['key','data'])
key and data are both int64. I create a transcoding df:
trans=pd.DataFrame(data=[[1000000000000000000,2000000000000000000]],
columns=['key','key2'])
whose two keys are int64. Then
df2 = pd.merge(df, trans, on=['key'], how='left')
Gives a nice result, and key, key2 and data are still int64.
Nevertheless, if I define
df=pd.DataFrame(data=[[1000000000000000000,1],[1000000000000000001,2]],
columns=['key','data'])
Now after the merge, key2 has switched to float64. How can I prevent this? Is it because NaN must be represented as a float? If so, is it possible to make the merge fill in 0 or -1 where there is no correspondence, keeping the whole column as int64?
Update: as of Pandas 0.24, there are nullable integer data types.
As of this writing, Pandas does not seem to choose the nullable int data type for the result of the merge, but it's possible to convert both DataFrames to the nullable Int64 type before the merge.
Consider
df=pd.DataFrame(data=[[1000000000000000000,1],[1000000000000000001,2]],
columns=['key','data']).astype("Int64")
trans=pd.DataFrame(data=[[1000000000000000000,2000000000000000000]],
columns=['key','key2']).astype("Int64")
df2 = pd.merge(df, trans, on=['key'], how='left')
Result:
>>> df2
key data key2
0 1000000000000000000 1 2000000000000000000
1 1000000000000000001 2 <NA>
>>> df2.dtypes
key Int64
data Int64
key2 Int64
dtype: object
Original answer, for Pandas < v0.24:
Is it because NaN must be represented as a float?
Correct. There is no NaN representation for an int64 column, so missing values can only be represented in floats.
You could either filter your data before merging, making sure that there are no NaNs created.
Or you could fill in the NaNs with a value of your choosing after the merge, and then restore the dtype.
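A minimal sketch of the second option, reusing the plain int64 df and trans from the question's EDIT and using -1 as an arbitrary sentinel:
df2 = pd.merge(df, trans, on=['key'], how='left')
# non-matching keys produce NaN in key2, which forces float64;
# replace them with a sentinel and restore the integer dtype
df2['key2'] = df2['key2'].fillna(-1).astype('int64')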

Offset a part of a dataframe by hours

I have a portion of measurements where the instrument time was configured wrong, and this needs to be corrected.
So I'm trying to offset or shift part of my dataframe by 2 hours, but I can't seem to get it to work with the following code:
dfs['2015-03-23 10:45:00':'2015-03-23 13:15:00'].shift(freq=datetime.timedelta(hours=2))
I don't know if it can be done this easily though.
Hope someone understands my issue :)
>>> dfs.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 11979 entries, 2015-03-23 10:45:05 to 2015-03-23 16:19:32
Data columns (total 11 columns):
CH-1[V] 11979 non-null float64
CH-2[V] 11979 non-null float64
CH-3[V] 11979 non-null float64
CH-4[V] 11979 non-null float64
CH-5[V] 11979 non-null float64
CH-6[V] 11979 non-null float64
CH-7[V] 11979 non-null float64
CH-9[C] 11979 non-null float64
CH-10[C] 11979 non-null float64
Event 11979 non-null int64
Unnamed: 11 0 non-null float64
dtypes: float64(10), int64(1)
memory usage: 1.1 MB
Pandas Indexes are not mutable, so we can't change the index in-place.
We could, however, move the index into a DataFrame column, modify the column, and then set it back as the index:
import numpy as np
import pandas as pd
# Suppose this is your `dfs`:
index = pd.date_range('2015-03-23 10:45:05', '2015-03-23 16:19:32', freq='T')
N = len(index)
dfs = pd.DataFrame(np.arange(N), index=index)
# move the index into a column
dfs = dfs.reset_index()
mask = (index >= '2015-03-23 10:45:00') & (index <= '2015-03-23 13:15:00')
# shift the masked values in the column
dfs.loc[mask, 'index'] += pd.Timedelta(hours=2)
# use the index column as the index
dfs = dfs.set_index(['index'])
This shows the index has been shifted by 2 hours:
In [124]: dfs.iloc[np.where(mask)[0].max()-1:].head(5)
Out[124]:
0
index
2015-03-23 15:13:05 148
2015-03-23 15:14:05 149 <-- shifted by 2 hours
2015-03-23 13:15:05 150 <-- unchanged
2015-03-23 13:16:05 151
2015-03-23 13:17:05 152
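An alternative approach (a sketch, not part of the original answer; assuming dfs still has its original DatetimeIndex) is to shift only the affected slice with shift(freq=...) and recombine it with the untouched rows. The slice-and-shift in the question produces a shifted copy but never assigns it back, so here the pieces are explicitly put together again:
mask = (dfs.index >= '2015-03-23 10:45:00') & (dfs.index <= '2015-03-23 13:15:00')
shifted = dfs[mask].shift(freq=pd.Timedelta(hours=2))   # with freq, only the index is shifted
dfs = pd.concat([shifted, dfs[~mask]]).sort_index()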
