I have a dataframe that I constructed by pulling data from SQL with pd.read_sql_query(). One column contains dates, but in Excel's general number format (serial numbers). How do I convert this column into datetime objects?
I can convert a single value with the xlrd library, but I am looking for the best way to convert the entire column.
datetime_value = datetime(*xlrd.xldate_as_tuple(42369, 0))
You can use map to apply a lambda function performing that operation to every entry in a column:
import pandas as pd
import xlrd
from datetime import datetime
# Create dummy dataframe
df = pd.DataFrame({
"date": [42369, 42370, 42371, 42372]
})
print(df.to_string())
# Convert values into a new column named "converted"
df["converted"] = df["date"].map(lambda x: datetime(*xlrd.xldate_as_tuple(x, 0)))
print(df.to_string())
Before conversion:
date
0 42369
1 42370
2 42371
3 42372
After:
date converted
0 42369 2015-12-31
1 42370 2016-01-01
2 42371 2016-01-02
3 42372 2016-01-03
Is this what you are looking for?
Update:
To make this work with string entries, you could either tell Pandas to treat the column as ints or floats:
# int
df["converted"] = df["date"].astype(int).map(lambda x: datetime(*xlrd.xldate_as_tuple(x, 0)))
# float
df["converted"] = df["date"].astype(float).map(lambda x: datetime(*xlrd.xldate_as_tuple(x, 0)))
or just cast x to int or float within the lambda function:
# int
df["converted"] = df["date"].map(lambda x: datetime(*xlrd.xldate_as_tuple(int(x), 0)))
# float
df["converted"] = df["date"].map(lambda x: datetime(*xlrd.xldate_as_tuple(float(x), 0)))
Related
I have a following column in a dataframe:
COLUMN_NAME
1
0
1
1
65280
65376
65280
I want to convert the 5-digit values in the column to their corresponding binary values. I know how to convert a value with the bin() function, but I don't know how to apply it only to the rows that have 5 digits.
Note that the column contains only values with either 1 or 5 digits. Values with 1 digit are only 1 or 0.
import pandas as pd

data = {'c': [1, 0, 1, 1, 65280, 65376, 65280]}
df = pd.DataFrame(data, columns=['c'])
# create another column 'clen' holding the string length of 'c'
df['clen'] = df['c'].astype(str).map(len)
# check the condition and apply the bin function only to the matching rows
df.loc[df['clen'] == 5, 'c'] = df.loc[df['clen'] == 5, 'c'].apply(bin)
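A variant without the helper column (my own sketch, not from the original answer): since the values are either 0/1 or 5-digit numbers, a numeric mask avoids the round trip through strings.
# 5-digit values are exactly those >= 10000, given the constraints above.
mask = df['c'] >= 10000
df.loc[mask, 'c'] = df.loc[mask, 'c'].map(bin)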
I have a "pandas.core.frame.DataFrame" "seied_log":
seied_log
Out[155]:
0
0 5.264761
1 5.719328
2 6.420809
3 6.129704
...
What I run is ARIMA model:
model = ARIMA(seied_log, order=(2, 1, 0))
However, I receive the following error:
ValueError: Given a pandas object and the index does not contain dates
What I need is to define a "date" column. These are yearly observations. How can I define a column with dates starting from 1978?
If your index is 0 through n_obs-1, then simply
from datetime import datetime
seied_log["date"] = seied_log.index.map(lambda idx: datetime(year=1978+idx, month=1, day=1)
I want to sum the numerical values in each row (Store A to Store D) for the month of June and place them in an appended column 'Sum'. But the resulting sums are huge and clearly wrong. How do I get the correct sums?
This code was run using Python 3.6 :
import pandas as pd
import numpy as np
data = np.array([
['', 'week','storeA','storeB','storeC','storeD'],
[0,"2014-05-04",2643,8257,3893,6231],
[1,"2014-05-11",6444,5736,5634,7092],
[2,"2014-05-18",9646,2552,4253,5447],
[3,"2014-05-25",5960,10740,8264,6063],
[4,"2014-06-04",5960,10740,8264,6063],
[5,"2014-06-12",7412,7374,3208,3985]
])
df = pd.DataFrame(data=data[1:, 1:],
                  index=data[1:, 0],
                  columns=data[0, 1:])
print(df)
# get rows of table which match Year,Month for last month
df2 = df[df['week'].str.contains("2014-06")].copy()
print(df2)
# generate col summing up each row
col_list = list(df2)
print(col_list)
col_list.remove('week')
print(col_list)
df2['Sum'] = df2[col_list].sum(axis=1)
print(df2)
Output of Sum column for rows 4 and 5:
Row4 - 5.960107e+16
Row5 - 7.412737e+15
Use astype to convert those strings to ints, and sum works properly:
df2['Sum'] = df2[col_list].astype(int).sum(axis=1)
Output:
week storeA storeB storeC storeD Sum
4 2014-06-04 5960 10740 8264 6063 31027
5 2014-06-12 7412 7374 3208 3985 21979
What was happening: you were summing (concatenating) strings.
Because of the way your array is defined, with mixed strings and objects, everything is coerced to string. Take a look at this:
df.dtypes
week object
storeA object
storeB object
storeC object
storeD object
dtype: object
You have columns of strings, and sum on string dataframes results in concatenation.
The solution is to convert these to integers first -
df2[col_list] = df2[col_list].astype(int)
Your code then works.
df2[col_list].sum(axis=1)
4 31027
5 21979
dtype: int64
Alternatively, declare data as an object array -
data = np.array([[...], [...], ...], dtype=object)
df = pd.DataFrame(data=data[1:,1:], index=data[1:,0], columns=data[0,1:])
Next, perform a soft conversion using infer_objects (new in v0.21):
df = df.infer_objects()
df.dtypes
week object
storeA int64
storeB int64
storeC int64
storeD int64
dtype: object
Works like a charm.
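Yet another option (my own sketch, not from either answer) is to let pd.to_numeric handle the conversion column by column:
# apply runs pd.to_numeric once per column, returning numeric dtypes.
df2[col_list] = df2[col_list].apply(pd.to_numeric)
df2['Sum'] = df2[col_list].sum(axis=1)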
I have imported a CSV file as a Pandas dataframe. When I run df.dtypes I get most columns as "object", which is useless for charting with Bokeh.
I need to change a column as int, another column as date, and the rest as strings.
I see the data types only once I import it. Would you recommend changing it during import (how?), or after import?
For datetime columns, you need the parse_dates parameter in read_csv.
If you have an int column and don't get the int64 dtype, there are probably some strings in it, maybe empty strings, because read_csv casts dtypes automatically.
Then you need to convert the bad data to NaN with to_numeric. But you get a float column, because NaN has float type. So replace NaN with some int (e.g. 0) and then cast to int:
df['col_int'] = pd.to_numeric(df['col_int'], errors='coerce').fillna(0).astype(int)
Sample:
import pandas as pd
from io import StringIO
temp=u"""a;b;c;d
A;2015-01-01;3;e
S;2015-01-03;4;r
D;2015-01-05;5r;t"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=";", parse_dates=[1])
print (df)
a b c d
0 A 2015-01-01 3 e
1 S 2015-01-03 4 r
2 D 2015-01-05 5r t
print (df.dtypes)
a object
b datetime64[ns]
c object
d object
dtype: object
df['c'] = pd.to_numeric(df['c'], errors='coerce').fillna(0).astype(int)
print (df)
a b c d
0 A 2015-01-01 3 e
1 S 2015-01-03 4 r
2 D 2015-01-05 0 t
print (df.dtypes)
a object
b datetime64[ns]
c int32
d object
dtype: object
To change dtypes, you need the dtype parameter:
temp=u"""a;b;c;d
A;10;3;e
S;2;4;r
D;6;1;t"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=";", dtype={'b':str, 'c':float})
print (df)
a b c d
0 A 10 3.0 e
1 S 2 4.0 r
2 D 6 1.0 t
print (df.dtypes)
a object
b object
c float64
d object
dtype: object
When reading a CSV file:
Use the dtype or converters parameter of read_csv in pandas.
import pandas as pd
import numpy as np
# the dtype keys refer to column names, so the file needs a header row
df = pd.read_csv('data.csv', dtype={'a': np.float64, 'b': np.int32})
Here, the columns are automatically read with the datatypes you specified.
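The converters parameter mentioned above takes a function per column instead of a dtype; a small sketch (the empty-string handling is a hypothetical example, not from the original answer):
# converters maps column name -> function applied to each raw string value,
# e.g. treating empty strings as 0 before converting to int.
df = pd.read_csv('data.csv', converters={'b': lambda s: int(s) if s else 0})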
After having read the csv file:
Use the astype function to change the column types.
Check this code.
Suppose you have two columns:
df[['a', 'b']] = df[['a', 'b']].astype(float)
The advantage of this is you change type of multiple columns at once.
Use categorical encoding.
Here the datatype is converted from object to category, and the category codes are then stored as integers.
But this method should only be used on categorical data.
import pandas as pd

dataframe = pd.read_csv('path//filename.csv')
dataframe["attribute"] = dataframe["attribute"].astype('category')
dataframe["attribute_cat"] = dataframe["attribute"].cat.codes
Given the following data frame:
import pandas as pd
df = pd.DataFrame(
{'A':['A','B','C','D'],
'C':['1','12','*','8']
})
df
A C
0 A 1
1 B 12
2 C *
3 D 8
I'd like to remove all instances of '*' and convert the rest to integer.
There may be some instances of 'nan' or 'NaN' in my actual data.
You could use pd.to_numeric to convert the C column to numeric values. Passing errors='coerce' tells pd.to_numeric to set non-numeric values to NaN.
import pandas as pd
df = pd.DataFrame(
{'A':['A','B','C','D'],
'C':['1','12','*','8'] })
df['C'] = pd.to_numeric(df['C'], errors='coerce')
print(df)
prints
A C
0 A 1.0
1 B 12.0
2 C NaN
3 D 8.0
Since NaN values are only allowed in columns with a floating-point dtype (or the object dtype), the column cannot be cast to a plain integer dtype.
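If you are on pandas 0.24 or later (an assumption; the original answer may predate it), the nullable 'Int64' extension dtype can hold integers alongside missing values:
# 'Int64' (capital I) is the nullable integer dtype, so the NaN produced
# by errors='coerce' survives without forcing the column to float.
df['C'] = pd.to_numeric(df['C'], errors='coerce').astype('Int64')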
int() is the standard Python built-in function for converting a string into an integer value.
To parse integers instead of floats, you can use the isdigit() method of string objects to filter the rows first.
If you filter with isdigit() before calling int(), you keep only the rows where the value in column C is a valid integer, and those can then be converted safely.
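A minimal sketch of that filter-then-convert idea, assuming column C holds strings as in the example above:
# Keep only the rows where C consists of digits, then convert safely.
mask = df['C'].astype(str).str.isdigit()
df = df[mask].copy()
df['C'] = df['C'].astype(int)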