Pandas Mixed Type to Integer - python-3.x

Given the following data frame:
import pandas as pd
df = pd.DataFrame({'A': ['A', 'B', 'C', 'D'],
                   'C': ['1', '12', '*', '8']})
df
   A   C
0  A   1
1  B  12
2  C   *
3  D   8
I'd like to remove all instances of '*' and convert the rest to integer.
There may be some instances of 'nan' or 'NaN' in my actual data.

You could use pd.to_numeric to convert the C column to numeric values. Passing errors='coerce' tells pd.to_numeric to set non-numeric values to NaN.
import pandas as pd
df = pd.DataFrame({'A': ['A', 'B', 'C', 'D'],
                   'C': ['1', '12', '*', '8']})
df['C'] = pd.to_numeric(df['C'], errors='coerce')
print(df)
prints
   A     C
0  A   1.0
1  B  12.0
2  C   NaN
3  D   8.0
Since NaN values are only allowed in columns with a floating-point (or object) dtype, the column cannot be cast to a plain integer dtype.
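If you do need an integer column that can hold missing values, newer pandas versions (0.24+) provide the nullable 'Int64' extension dtype; a minimal sketch:
import pandas as pd
df = pd.DataFrame({'A': ['A', 'B', 'C', 'D'],
                   'C': ['1', '12', '*', '8']})
# 'Int64' (capital I) is pandas' nullable integer dtype, unlike plain int64
df['C'] = pd.to_numeric(df['C'], errors='coerce').astype('Int64')
print(df)
#    A     C
# 0  A     1
# 1  B    12
# 2  C  <NA>
# 3  D     8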

int() is the standard Python built-in for converting a string to an integer, and pandas applies it column-wide via astype(int).
Since astype(int) fails on values like '*', filter first: the string method isdigit() returns True only when a string consists entirely of digits, so you can keep the rows where column C is numeric and cast the rest to int, as in the sketch below.
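A minimal sketch of this filter-then-convert approach (string values like 'nan' or 'NaN' also fail str.isdigit, so they are dropped along with '*'):
import pandas as pd
df = pd.DataFrame({'A': ['A', 'B', 'C', 'D'],
                   'C': ['1', '12', '*', '8']})
# Keep only the rows whose C value consists solely of digits
df = df[df['C'].str.isdigit()].copy()
df['C'] = df['C'].astype(int)
print(df)
#    A   C
# 0  A   1
# 1  B  12
# 3  D   8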

Related

pandas Series index vs. reindex

The functionality of reindexing in pandas can also be achieved by passing an index to the Series constructor, as below.
import pandas as pd
order = ['a','c','b']
series_data = pd.Series([1,2,3],index=order)
series_data
In that case, why do we explicitly use reindex?
Let's take an example using the index parameter of Series:
s = pd.Series([1,2,3], index=['k','f','t'])
s
# k 1
# f 2
# t 3
# dtype: int64
The series above is assigned the given index, and its values have dtype int64.
Now let's proceed with reindex:
order = ['k','c','b']
s.reindex(order)
# k 1.0
# c NaN
# b NaN
# dtype: float64
As you can observe, we passed two new labels, c and b, which were not in the original series, so their values are set to NaN. Since NaN is a float, the resulting series contains only the three labels k, c and b, with dtype float64.
I hope this clarifies how the index parameter of the Series constructor differs from reindex.
You can refer to the link below to learn more about reindexing.
https://www.tutorialspoint.com/python_pandas/python_pandas_reindexing.htm
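If the float upcast is unwanted, reindex also accepts a fill_value that is used instead of NaN for new labels; a minimal sketch:
import pandas as pd
s = pd.Series([1, 2, 3], index=['k', 'f', 't'])
s.reindex(['k', 'c', 'b'], fill_value=0)
# k    1
# c    0
# b    0
# dtype: int64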

Dividing a pandas dataframe column of dtype float64 by a floating-point number returns only integers

I am using python 3.6.5 and I am getting this problem:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 7]}, dtype=np.float64)
and if I do any kind of division, for instance
df['a'] / 2.0
0    0
1    1
2    2
Name: a, dtype: float64
And actually, even if I try to build a dataframe from an array containing decimals, pd.DataFrame turns everything into integers, even though it says the dtype is float64. Thanks for your help.
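One way to check what is actually happening (a sketch; given that the dtype is float64, the values are almost certainly correct and only the display is being truncated, e.g. by a display.precision or display.float_format option):
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 7]}, dtype=np.float64)
print((df['a'] / 2.0).values)   # [0.5 1.  1.5] -- the underlying data is fine
# Reset any display options that round floats to zero decimals
pd.reset_option('display.float_format')
pd.reset_option('display.precision')
print(df['a'] / 2.0)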

Why is a call to sum() on a data frame generating wrong numbers?

I want to sum the numerical values in each row (Store A to Store D) for the month of June and place them in an appended column 'Sum'. But the resulting sums are huge values, which are wrong. How do I get the correct sum?
This code was run using Python 3.6 :
import pandas as pd
import numpy as np
data = np.array([
['', 'week','storeA','storeB','storeC','storeD'],
[0,"2014-05-04",2643,8257,3893,6231],
[1,"2014-05-11",6444,5736,5634,7092],
[2,"2014-05-18",9646,2552,4253,5447],
[3,"2014-05-25",5960,10740,8264,6063],
[4,"2014-06-04",5960,10740,8264,6063],
[5,"2014-06-12",7412,7374,3208,3985]
])
df = pd.DataFrame(data=data[1:, 1:],
                  index=data[1:, 0],
                  columns=data[0, 1:])
print(df)
# get rows of table which match Year,Month for last month
df2 = df[df['week'].str.contains("2014-06")].copy()
print(df2)
# generate col summing up each row
col_list = list(df2)
print(col_list)
col_list.remove('week')
print(col_list)
df2['Sum'] = df2[col_list].sum(axis=1)
print(df2)
Output of Sum column for rows 4 and 5:
Row4 - 5.960107e+16
Row5 - 7.412737e+15
Use astype to convert those strings to ints, and sum works properly:
df2['Sum'] = df2[col_list].astype(int).sum(axis=1)
Output:
week storeA storeB storeC storeD Sum
4 2014-06-04 5960 10740 8264 6063 31027
5 2014-06-12 7412 7374 3208 3985 21979
What was happening: you were summing (concatenating) strings.
Because your array is defined with a mix of strings and numbers, NumPy coerces everything to string. Take a look at this:
df.dtypes
week object
storeA object
storeB object
storeC object
storeD object
dtype: object
You have columns of strings, and sum on string dataframes results in concatenation.
The solution is to convert these to integers first -
df2[col_list] = df2[col_list].astype(int)
Your code then works.
df2[col_list].sum(axis=1)
4 31027
5 21979
dtype: int64
Alternatively, declare data as an object array -
data = np.array([[...], [...], ...], dtype=object)
df = pd.DataFrame(data=data[1:,1:], index=data[1:,0], columns=data[0,1:])
Next, perform a soft conversion using infer_objects (new in v0.21):
df = df.infer_objects()
df.dtypes
week object
storeA int64
storeB int64
storeC int64
storeD int64
dtype: object
Works like a charm.
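Another equivalent option (a sketch) is to let pd.to_numeric handle the per-column conversion before summing:
df2[col_list] = df2[col_list].apply(pd.to_numeric)
df2['Sum'] = df2[col_list].sum(axis=1)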

How to change data types "object" in Pandas dataframe after importing a CSV?

I have imported a CSV file as a Pandas dataframe. When I run df.dtypes I get most columns as "object", which is unhelpful for passing into Bokeh for charts.
I need to change a column as int, another column as date, and the rest as strings.
I only see the data types once I import the file. Would you recommend changing them during import (how?), or after import?
For datetime columns you need the parse_dates parameter in read_csv.
If you have an int column and don't get int64 dtype, there are probably some bad values in it, maybe empty strings, because read_csv casts dtypes automatically.
Then you need to convert the bad data to NaN with to_numeric - but that yields a float column, because NaN is a float. So replace NaN with some int (e.g. 0) and then cast to int:
df['col_int'] = pd.to_numeric(df['col_int'], errors='coerce').fillna(0).astype(int)
Sample:
import pandas as pd
from io import StringIO
temp=u"""a;b;c;d
A;2015-01-01;3;e
S;2015-01-03;4;r
D;2015-01-05;5r;t"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=";", parse_dates=[1])
print (df)
a b c d
0 A 2015-01-01 3 e
1 S 2015-01-03 4 r
2 D 2015-01-05 5r t
print (df.dtypes)
a object
b datetime64[ns]
c object
d object
dtype: object
df['c'] = pd.to_numeric(df['c'], errors='coerce').fillna(0).astype(int)
print (df)
a b c d
0 A 2015-01-01 3 e
1 S 2015-01-03 4 r
2 D 2015-01-05 0 t
print (df.dtypes)
a object
b datetime64[ns]
c int32
d object
dtype: object
To set dtypes at read time, use the dtype parameter:
temp=u"""a;b;c;d
A;10;3;e
S;2;4;r
D;6;1;t"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=";", dtype={'b':str, 'c':float})
print (df)
a b c d
0 A 10 3.0 e
1 S 2 4.0 r
2 D 6 1.0 t
print (df.dtypes)
a object
b object
c float64
d object
dtype: object
During reading a csv file:
Use the dtype or converters parameters of read_csv in pandas:
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv', dtype={'a': np.float64, 'b': np.int32})
Here the columns are automatically read with the datatypes you specified.
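The converters parameter, mentioned above, instead maps a column name to a function applied to each raw string value; a minimal sketch (the column names and the lambda are illustrative):
import pandas as pd
from io import StringIO
temp = u"""a;b
A;10
S;
D;6"""
# Empty cells in 'b' become 0 instead of NaN, so the column stays integer
df = pd.read_csv(StringIO(temp), sep=';',
                 converters={'b': lambda v: int(v) if v.strip() else 0})
print(df.dtypes)
# a    object
# b     int64
# dtype: object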
After having read the csv file:
Use astype function to change the column types.
For example, say you have two columns:
df[['a', 'b']] = df[['a', 'b']].astype(float)
The advantage of this is you change type of multiple columns at once.
Use category codes.
Here the dtype converts from object to category, and cat.codes then assigns an integer code to each category value. (Note this is label encoding, not one-hot encoding.) This method is meant for categorical data.
import pandas as pd
dataframe = pd.read_csv('path//filename.csv')
dataframe["attribute"] = dataframe["attribute"].astype('category')
dataframe["attribute_cat"] = dataframe["attribute"].cat.codes
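If you actually want one-hot encoding, pandas can do it directly with get_dummies; a minimal sketch (the column name and data are illustrative):
import pandas as pd
dataframe = pd.DataFrame({'attribute': ['red', 'green', 'red', 'blue']})
# One indicator column per category value (dtype bool or uint8, depending
# on the pandas version)
one_hot = pd.get_dummies(dataframe, columns=['attribute'])
print(one_hot.columns.tolist())
# ['attribute_blue', 'attribute_green', 'attribute_red']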

Convert pandas dataframe column containing Excel general numbers into datetime object

I have a dataframe that I constructed by pulling data from SQL using pd.read_sql_query(). One column holds dates, but in Excel's general number format. How do I convert this column into datetime objects?
I can convert one value with the xlrd library but looking for the best way to convert the entire column.
datetime_value = datetime(*xlrd.xldate_as_tuple(42369, 0))
You can use map to apply a lambda function performing that operation to every entry in a column:
import pandas as pd
import xlrd
from datetime import datetime
# Create dummy dataframe
df = pd.DataFrame({
"date": [42369, 42370, 42371, 42372]
})
print(df.to_string())
# Convert values into a new column named "converted"
df["converted"] = df["date"].map(lambda x: datetime(*xlrd.xldate_as_tuple(x, 0)))
print(df.to_string())
Before conversion:
date
0 42369
1 42370
2 42371
3 42372
After:
date converted
0 42369 2015-12-31
1 42370 2016-01-01
2 42371 2016-01-02
3 42372 2016-01-03
Is this what you are looking for?
Update:
To make this work with string entries, you could either tell Pandas to treat the column as ints or floats:
# int
df["converted"] = df["date"].astype(int).map(lambda x: datetime(*xlrd.xldate_as_tuple(x, 0)))
# float
df["converted"] = df["date"].astype(float).map(lambda x: datetime(*xlrd.xldate_as_tuple(x, 0)))
or just cast x to int or float within the lambda function:
# int
df["converted"] = df["date"].map(lambda x: datetime(*xlrd.xldate_as_tuple(int(x), 0)))
# float
df["converted"] = df["date"].map(lambda x: datetime(*xlrd.xldate_as_tuple(float(x), 0)))
