I would like to convert an int to string in dataframe - python-3.x

I would like to convert a column in a dataframe to a string.
It looks like this:
company department id family name start_date end_date
abc sales 38221925 Levy nali 16/05/2017 01/01/2018
I want to convert the id from int to string
I tried
data['id']=data['id'].to_string()
and
data['id']=data['id'].astype(str)
but got dtype('O').
I expected the dtype to be string.

This is intended behaviour. This is how pandas stores strings.
From the docs
Pandas uses the object dtype for storing strings.
For a simple test, you can make a dummy dataframe and check its dtype too.
import pandas as pd
df = pd.DataFrame(["abc", "ab"])
df[0].dtype
#Output:
dtype('O')

You can do that by using the apply() function:
data['id'] = data['id'].apply(lambda x: str(x))
This will convert all the values of id column to string.
You can ensure the type of the values like this:
type(data['id'][0]) (this checks the first value of the 'id' column)
This will give the output str.
data['id'].dtype will still show dtype('O'), that is, object.
You can also use data.info() to check all the information about that DataFrame.
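As a quick sanity check, here is a throwaway example (not the asker's data) confirming that every element becomes a Python str even though the column dtype still reports object:

import pandas as pd

data = pd.DataFrame({'id': [38221925]})
data['id'] = data['id'].astype(str)       # or .apply(str)
print(data['id'].dtype)                   # object
print(data['id'].map(type).unique())      # [<class 'str'>]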

str(12)
>>> '12'
str() can easily convert an int to a string.

Related

sqlalchemy.exc.StatementError: (builtins.TypeError) SQLite Date type only accepts Python date objects as input

I have a pandas DataFrame that contains month and year values in a yyyy-mm format. I am using pd.to_sql with a dtype mapping to send it to a .db file.
I keep getting this error:
sqlalchemy.exc.StatementError: (builtins.TypeError) SQLite Date type only accepts Python date objects as input.
Is there a way to set a 'Date' data type for the 'MonthYear' (yyyy-mm) column, or should it be a VARCHAR? I tried changing it to different pandas datetime types, but none of them seem to work.
I don't have any issues with 'full_date'; it is assigned properly. Its data type in pandas is datetime64[ns].
MonthYear full_date
2015-03 2012-03-11
2015-04 2013-08-19
2010-12 2012-06-29
2012-01 2018-01-01
df.to_sql('MY_TABLE', con=some_connection,
dtype={'MonthYear':sqlalchemy.types.Date(),
'full_date':sqlalchemy.types.Date()})
My opinion is that you shouldn't store the extra column in your database unnecessarily when you can derive it from the 'full_date' column.
One issue you'll run into is that SQLite doesn't have a DATE type. So, you need to parse the dates upon extraction with your query. Full example:
import datetime as dt
import numpy as np
import pandas as pd
import sqlite3
# I'm using datetime64[ns] because that's what you say you have
df = pd.DataFrame({'full_date': [np.datetime64('2012-03-11')]})
con = sqlite3.connect(":memory:")
df.to_sql("MY_TABLE", con, index=False)
new_df = pd.read_sql_query("SELECT * FROM MY_TABLE;", con,
parse_dates={'full_date':'%Y-%m-%d'})
Result:
In [111]: new_df['YearMonth'] = new_df['full_date'].dt.strftime('%Y-%m')
In [112]: new_df
Out[112]:
full_date YearMonth
0 2012-03-11 2012-03
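If you do want to keep the yyyy-mm column in the table anyway, storing it as plain text (VARCHAR) is the simplest option, since SQLite has no native DATE type. A minimal sketch, assuming a SQLAlchemy engine rather than the raw sqlite3 connection used above:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite:///:memory:")
df = pd.DataFrame({'MonthYear': ['2015-03', '2015-04']})
# String() maps to VARCHAR/TEXT in SQLite
df.to_sql('MY_TABLE', con=engine, index=False,
          dtype={'MonthYear': sqlalchemy.types.String()})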

Converting dates to numbers on Python [duplicate]

I have one field in a pandas DataFrame that was imported as string format.
It should be a datetime variable. How do I convert it to a datetime column and then filter based on date?
Example:
df = pd.DataFrame({'date': ['05SEP2014:00:00:00.000']})
Use the to_datetime function, specifying a format to match your data.
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
If you have more than one column to be converted you can do the following:
df[["col1", "col2", "col3"]] = df[["col1", "col2", "col3"]].apply(pd.to_datetime)
You can use the DataFrame method .apply() to operate on the values in Mycol:
>>> df = pd.DataFrame(['05SEP2014:00:00:00.000'],columns=['Mycol'])
>>> df
Mycol
0 05SEP2014:00:00:00.000
>>> import datetime as dt
>>> df['Mycol'] = df['Mycol'].apply(lambda x: dt.datetime.strptime(x, '%d%b%Y:%H:%M:%S.%f'))
>>> df
Mycol
0 2014-09-05
Use the pandas to_datetime function to parse the column as datetime. By passing infer_datetime_format=True, it will automatically detect the format and convert the column to datetime.
import pandas as pd
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], infer_datetime_format=True)
chrisb's answer works:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
however, it results in a warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I would guess this is due to some chained indexing.
Time Saver:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'])
To silence SettingWithCopyWarning
If you got this warning, then that means your dataframe was probably created by filtering another dataframe. Make a copy of your dataframe before any assignment and you're good to go.
df = df.copy()
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f')
errors='coerce' is useful
If some rows are not in the correct format or are not datetimes at all, the errors= parameter is very useful: you can convert the valid rows and handle the rows that contained invalid values later.
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
# for multiple columns
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime, format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
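After coercion, rows that failed to parse come back as NaT, so you can inspect or drop them afterwards. A small illustration, reusing the 'date' column from the examples above:

# rows whose 'date' could not be parsed (now NaT)
bad_rows = df[df['date'].isna()]

# or drop them entirely
df = df.dropna(subset=['date'])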
Setting the correct format= is much faster than letting pandas find it out [1]
Long story short, passing the correct format= from the beginning, as in chrisb's post, is much faster than letting pandas figure out the format, especially if the format contains a time component. The runtime difference for dataframes greater than 10k rows is huge (roughly 25 times faster, so a couple of minutes vs. a few seconds). All valid format options can be found at https://strftime.org/.
[1] Code used to produce the timeit test plot:
import pandas as pd
import perfplot
from random import choices
from datetime import datetime

mdYHMSf = range(1,13), range(1,29), range(2000,2024), range(24), *[range(60)]*2, range(1000)

perfplot.show(
    kernels=[lambda x: pd.to_datetime(x),
             lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M:%S.%f'),
             lambda x: pd.to_datetime(x, infer_datetime_format=True),
             lambda s: s.apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))],
    labels=["pd.to_datetime(df['date'])",
            "pd.to_datetime(df['date'], format='%m/%d/%Y %H:%M:%S.%f')",
            "pd.to_datetime(df['date'], infer_datetime_format=True)",
            "df['date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))"],
    n_range=[2**k for k in range(20)],
    setup=lambda n: pd.Series([f"{m}/{d}/{Y} {H}:{M}:{S}.{f}"
                               for m,d,Y,H,M,S,f in zip(*[choices(e, k=n) for e in mdYHMSf])]),
    equality_check=pd.Series.equals,
    xlabel='len(df)'
)
Just like we convert an object data type to float or int, use astype():
raw_data['Mycol']=raw_data['Mycol'].astype('datetime64[ns]')

Converting Pandas DataFrame OrderedSet column into list

I have a Pandas DataFrame where one column is an OrderedSet, like this:
df
OrderedSetCol
0 OrderedSet([1721754, 3622558, 2550234, 2344034, 8550040])
This is:
from ordered_set import OrderedSet
I am just trying to convert this column into a list:
df['OrderedSetCol_list'] = df['OrderedSetCol'].apply(lambda x: ast.literal_eval(str("\'" + x.replace('OrderedSet(','').replace(')','') + "\'")))
The code executes successfully, but my column type is still str and not list:
type(df.loc[0]['OrderedSetCol_list'])
str
What am I doing wrong?
EDIT: My OrderedSetCol is actually a string column, as I am reading a file from disk that was originally saved from an OrderedSet column.
Expected Output:
[1721754, 3622558, 2550234, 2344034, 8550040]
You can apply list, just as you would call list on the OrderedSet itself:
df = pd.DataFrame({'OrderedSetCol':[OrderedSet([1721754, 3622558, 2550234, 2344034, 8550040])]})
df.OrderedSetCol.apply(list)
Output:
[1721754, 3622558, 2550234, 2344034, 8550040]
If your column holds the values as strings instead:
df.OrderedSetCol.str.findall(r'\d+')
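Note that str.findall returns lists of strings; if you need lists of integers, as in the expected output, one more step is required. A minimal sketch, assuming the column really holds the text "OrderedSet([...])":

import pandas as pd

df = pd.DataFrame({'OrderedSetCol': ["OrderedSet([1721754, 3622558, 2550234, 2344034, 8550040])"]})
df['OrderedSetCol_list'] = (df.OrderedSetCol.str.findall(r'\d+')
                              .apply(lambda xs: [int(x) for x in xs]))
print(df.loc[0, 'OrderedSetCol_list'])  # [1721754, 3622558, 2550234, 2344034, 8550040]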

How do I convert multiple `string` columns in my dataframe to datetime columns?

I am in the process of converting multiple string columns to date time columns, but I am running into the following issues:
Example column 1:
1/11/2018 9:00:00 AM
Code:
df = df.withColumn(df.column_name, to_timestamp(df.column_name, "MM/dd/yyyy hh:mm:ss aa"))
This works okay
Example column 2:
2019-01-10T00:00:00-05:00
Code:
df = df.withColumn(df.column_name, to_date(df.column_name, "yyyy-MM-dd'T'HH:mm:ss'-05:00'"))
This works okay
Example column 3:
20190112
Code:
df = df.withColumn(df.column_name, to_date(df.column_name, "yyyyMMdd"))
This does not work. I get this error:
AnalysisException: "cannot resolve 'unix_timestamp(t.`date`,
'yyyyMMdd')' due to data type mismatch: argument 1 requires (string or
date or timestamp) type, however, 't.`date`' is of int type.
I feel like it should be straightforward, but I am missing something.
The error is pretty self-explanatory: you need your column to be a String.
Are you sure your column is already a String? It seems not. You can cast it to String first with .cast:
import org.apache.spark.sql.functions.{col, to_date}
import org.apache.spark.sql.types.StringType

df = df.withColumn("column_name", to_date(col("column_name").cast(StringType), "yyyyMMdd"))
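If you are actually working in PySpark (the assignment-style code in the question suggests it may be), the equivalent cast-then-parse would look roughly like this; the column name "date" is taken from the error message and may need adjusting:

from pyspark.sql.functions import col, to_date

df = df.withColumn("date", to_date(col("date").cast("string"), "yyyyMMdd"))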

Pandas is messing with a high resolution integer on read_csv

EDIT: This was Excel changing the data type, not Pandas.
When I read a CSV using pd.read_csv(file), a column of very long ints gets converted to a low-resolution float. These ints are datetimes in microseconds.
Example:
CSV column of some values:
15555071095204000
15555071695202000
15555072295218000
15555072895216000
15555073495207000
15555074095206000
15555074695212000
15555075295202000
15555075895210000
15555076495216000
15555077095230000
15555077695206000
15555078295212000
15555078895218000
15555079495209000
15555080095208000
15555080530515000
15555086531880000
15555092531889000
15555098531886000
15555104531886000
15555110531890000
15555116531876000
15555122531873000
15555128531884000
15555134531884000
15555140531887000
15555146531874000
pd.read_csv produces: 1.55551e+16
How do I get it to report the exact int?
I've tried using: float_precision='high'
It's possible that this is caused by the way Pandas handles missing values: your column is being imported as floats so that the missing values can be coded as NaN.
A simple solution would be to force the column to import as str, then impute or remove missing values, and then convert to int:
import pandas as pd

df = pd.read_csv(file, dtype={'col1': str})  # edit to use the appropriate column reference

# If you want to just remove rows with missing values, something like:
df = df[df.col1.notna()]

# Then convert to integer
df.col1 = df.col1.astype('int64')
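Alternatively, if the column is guaranteed to have no missing values, you can ask read_csv to keep it as a 64-bit integer directly. A small sketch with an inline CSV; the column name 'ts' is only for illustration:

import io
import pandas as pd

csv = io.StringIO("ts\n15555071095204000\n15555071695202000\n")
df = pd.read_csv(csv, dtype={'ts': 'int64'})
print(df['ts'].dtype)    # int64
print(df['ts'].iloc[0])  # 15555071095204000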
With a Minimal, Complete and Verifiable Example we can pinpoint the problem and update the code to accurately solve it.
