I would like to convert a column in a DataFrame to a string.
It looks like this:
company department id family name start_date end_date
abc sales 38221925 Levy nali 16/05/2017 01/01/2018
I want to convert the id from int to string
I tried
got dtype('O')
I expect to receive string

This is intended behaviour; it is how pandas stores strings.
From the docs:
Pandas uses the object dtype for storing strings.
For a simple test, you can make a dummy DataFrame and check its dtype too.
import pandas as pd
df = pd.DataFrame(["abc", "ab"])

You can do that with the apply() function:
data['id'] = data['id'].apply(lambda x: str(x))
This converts all the values of the id column to strings.
You can confirm the type of the values like this:
type(data['id'][0])  # checks the first value of the 'id' column
This gives the output str.
data['id'].dtype still gives dtype('O'), that is, object.
You can also use data.info() to check all the information about that DataFrame.
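As a rough sketch of the point made in both answers: the values become Python str objects even though the column dtype is reported separately. The one-row frame below is hypothetical and only modelled on the question's sample. The last step uses the dedicated "string" extension dtype (pandas 1.0 and later), which may be closer to what the asker expected to see.

```python
import pandas as pd

# Hypothetical one-row frame modelled on the question's sample data.
data = pd.DataFrame({"company": ["abc"], "id": [38221925]})

# astype(str) converts every value in the column to a Python str.
data["id"] = data["id"].astype(str)
first_value = data["id"].iloc[0]
print(type(first_value))  # type of the element itself
print(data["id"].dtype)   # dtype of the column, reported separately

# Optional: the dedicated "string" extension dtype (pandas 1.0 and later).
id_as_string = data["id"].astype("string")
print(id_as_string.dtype)
```

Checking type() on a single element, as suggested above, is the reliable way to confirm that the conversion happened.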

sqlalchemy.exc.StatementError: (builtins.TypeError) SQLite Date type only accepts Python date objects as input

I have a pandas DataFrame that contains month and year values in yyyy-mm format. I am using df.to_sql to set the column data types and send the data to a .db file.
I keep getting error:
sqlalchemy.exc.StatementError: (builtins.TypeError) SQLite Date type only accepts Python date objects as input.
Is there a way to set the 'Date' data type for the 'MonthYear' (yyyy-mm) column, or should it be stored as VARCHAR? I tried changing it to several of pandas's datetime data types, but none of them seems to work.
I don't have any issues with 'full_date', it assigns it properly. Data type for 'full_date' is datetime64[ns] in pandas.
MonthYear full_date
2015-03 2012-03-11
2015-04 2013-08-19
2010-12 2012-06-29
2012-01 2018-01-01
df.to_sql('MY_TABLE', con=some_connection,
My opinion is that you shouldn't store the extra column in your database unnecessarily when you can derive it from the 'full_date' column.
One issue you'll run into is that SQLite doesn't have a DATE type. So, you need to parse the dates upon extraction with your query. Full example:
import datetime as dt
import numpy as np
import pandas as pd
import sqlite3
# I'm using datetime64[ns] because that's what you say you have
df = pd.DataFrame({'full_date': [np.datetime64('2012-03-11')]})
con = sqlite3.connect(":memory:")
df.to_sql("MY_TABLE", con, index=False)
new_df = pd.read_sql_query("SELECT * FROM MY_TABLE;", con,
In [111]: new_df['YearMonth'] = new_df['full_date'].dt.strftime('%Y-%m')
In [112]: new_df
full_date YearMonth
0 2012-03-11 2012-03
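For reference, a self-contained sketch of the round trip described in this answer. It assumes an in-memory SQLite database and uses the parse_dates argument of pd.read_sql_query to turn the stored text back into datetime values; the two sample dates are taken from the question's table.

```python
import sqlite3

import pandas as pd

# Two of the question's full_date values, stored as datetime64[ns].
df = pd.DataFrame({"full_date": pd.to_datetime(["2012-03-11", "2013-08-19"])})

con = sqlite3.connect(":memory:")
df.to_sql("MY_TABLE", con, index=False)

# SQLite keeps the values as text, so ask pandas to parse them on the way out.
new_df = pd.read_sql_query(
    "SELECT * FROM MY_TABLE;", con, parse_dates=["full_date"]
)
con.close()

# Derive the year-month label instead of storing it as a separate column.
new_df["YearMonth"] = new_df["full_date"].dt.strftime("%Y-%m")
print(new_df)
```

The derived YearMonth column is plain text, so if it does have to be stored, a text column type is the natural fit.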

I have one field in a pandas DataFrame that was imported as a string.
It should be a datetime variable. How do I convert it to a datetime column and then filter based on date?
df = pd.DataFrame({'date': ['05SEP2014:00:00:00.000']})
Use the to_datetime function, specifying a format to match your data.
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
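The question also asks about filtering by date. Once the column is datetime, an ordinary boolean mask should do. This is a minimal sketch; the second row and the cut-off date are made up for illustration.

```python
import pandas as pd

# First value comes from the question; the second row is hypothetical.
raw_data = pd.DataFrame(
    {"Mycol": ["05SEP2014:00:00:00.000", "17OCT2014:12:30:00.000"]}
)
raw_data["Mycol"] = pd.to_datetime(raw_data["Mycol"], format="%d%b%Y:%H:%M:%S.%f")

# Keep rows on or after 1 October 2014 (cut-off chosen only for illustration).
after_cutoff = raw_data[raw_data["Mycol"] >= pd.Timestamp("2014-10-01")]
print(after_cutoff)
```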
If you have more than one column to be converted you can do the following:
df[["col1", "col2", "col3"]] = df[["col1", "col2", "col3"]].apply(pd.to_datetime)
You can use the Series method .apply() to operate on the values in Mycol:
>>> df = pd.DataFrame(['05SEP2014:00:00:00.000'],columns=['Mycol'])
>>> df
0 05SEP2014:00:00:00.000
>>> import datetime as dt
>>> df['Mycol'] = df['Mycol'].apply(lambda x:
>>> df
0 2014-09-05
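A per-value conversion of this kind can be sketched with datetime.strptime from the standard library. The format string is borrowed from the to_datetime answer and the helper name parse_stamp is hypothetical, so treat this as one possible version rather than the exact lambda used above. Note that %b relies on English month abbreviations in the active locale.

```python
from datetime import datetime


def parse_stamp(text):
    """Parse strings such as '05SEP2014:00:00:00.000' (hypothetical helper)."""
    return datetime.strptime(text, "%d%b%Y:%H:%M:%S.%f")


parsed = parse_stamp("05SEP2014:00:00:00.000")
print(parsed.date())
```

With a DataFrame, the helper could be passed to df['Mycol'].apply(parse_stamp).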
Use the pandas to_datetime function to parse the column as datetime. With infer_datetime_format=True, pandas detects the format automatically and converts the column. (In pandas 2.0 and later this argument is deprecated, because strict format inference is the default behaviour.)
import pandas as pd
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], infer_datetime_format=True)
chrisb's answer works:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
however it results in a Python warning of
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I would guess this is due to some chained indexing.
Time Saver:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'])
To silence SettingWithCopyWarning
If you got this warning, then that means your dataframe was probably created by filtering another dataframe. Make a copy of your dataframe before any assignment and you're good to go.
df = df.copy()
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f')
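To make the situation concrete, here is a small sketch of how the warning typically arises and how the copy avoids it. The frame, the value column and the filter condition are all hypothetical.

```python
import pandas as pd

# Hypothetical source frame; only the date format follows the question.
full = pd.DataFrame(
    {
        "date": ["05SEP2014:00:00:00.000", "06SEP2015:00:00:00.000"],
        "value": [1, 2],
    }
)

# Filtering produces a frame that pandas may still link to `full`.
subset = full[full["value"] > 1]

# An explicit copy makes the later assignment unambiguous.
subset = subset.copy()
subset["date"] = pd.to_datetime(subset["date"], format="%d%b%Y:%H:%M:%S.%f")
print(subset.dtypes)
```

The original frame is left untouched, which is usually what you want after filtering.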
errors='coerce' is useful
If some rows are not in the correct format or not datetime at all, errors= parameter is very useful, so that you can convert the valid rows and handle the rows that contained invalid values later.
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
# for multiple columns
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime, format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
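A short sketch of what errors='coerce' does in practice: values that do not match the format become NaT, so they can be separated from the valid rows afterwards. The sample values are made up.

```python
import pandas as pd

# One valid value, one malformed value and one missing value (all hypothetical).
df = pd.DataFrame({"date": ["05SEP2014:00:00:00.000", "not a date", None]})
df["date"] = pd.to_datetime(
    df["date"], format="%d%b%Y:%H:%M:%S.%f", errors="coerce"
)

# Rows that failed to parse are NaT and can be inspected or dropped later.
bad_rows = df[df["date"].isna()]
good_rows = df.dropna(subset=["date"])
print(len(bad_rows), len(good_rows))
```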
Setting the correct format= is much faster than letting pandas find out [1]
Long story short, passing the correct format= from the beginning as in chrisb's post is much faster than letting pandas figure out the format, especially if the format contains time component. The runtime difference for dataframes greater than 10k rows is huge (~25 times faster, so we're talking like a couple minutes vs a few seconds). All valid format options can be found at https://strftime.org/.
[1] Code used to produce the timeit test plot.
import pandas as pd
import perfplot
from random import choices
from datetime import datetime
# value ranges for month, day, year, hour, minute, second, millisecond
mdYHMSf = range(1,13), range(1,29), range(2000,2024), range(24), *[range(60)]*2, range(1000)
perfplot.plot(
    kernels=[lambda x: pd.to_datetime(x),
             lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M:%S.%f'),
             lambda x: pd.to_datetime(x, infer_datetime_format=True),
             lambda s: s.apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))],
    labels=["pd.to_datetime(df['date'])",
            "pd.to_datetime(df['date'], format='%m/%d/%Y %H:%M:%S.%f')",
            "pd.to_datetime(df['date'], infer_datetime_format=True)",
            "df['date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))"],
    n_range=[2**k for k in range(20)],
    # build a Series of n random timestamp strings
    setup=lambda n: pd.Series([f"{m}/{d}/{Y} {H}:{M}:{S}.{f}"
                               for m,d,Y,H,M,S,f in zip(*[choices(e, k=n) for e in mdYHMSf])]),
    # assumption: compare the kernel outputs as Series
    equality_check=pd.Series.equals)
Converting Pandas DataFrame OrderedSet column into list

I have a pandas DataFrame in which one column holds OrderedSet values like this:
0 OrderedSet([1721754, 3622558, 2550234, 2344034, 8550040])
This is:
from ordered_set import OrderedSet
I am just trying to convert this column into list:
df['OrderedSetCol_list'] = df['OrderedSetCol'].apply(lambda x: ast.literal_eval(str("\'" + x.replace('OrderedSet(','').replace(')','') + "\'")))
The code executes successfully, but my column type is still str and not list.
What am I doing wrong?
EDIT: My OrderedSetCol is also a string column as I am reading a file from a disk, which was originally saved from OrderedSet column.
Expected Output:
[1721754, 3622558, 2550234, 2344034, 8550040]
You can apply list to the column, just as you would call it on the OrderedSet itself:
df = pd.DataFrame({'OrderedSetCol':[OrderedSet([1721754, 3622558, 2550234, 2344034, 8550040])]})
[1721754, 3622558, 2550234, 2344034, 8550040]
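Since the EDIT says the column actually holds text read from disk, a list call alone will not help there. The attempt in the question wraps the text in extra quotes, so ast.literal_eval evaluates a string literal and hands back a str. A minimal sketch for the text case, assuming every value looks like 'OrderedSet([...])'; the helper name ordered_set_text_to_list is hypothetical.

```python
import ast


def ordered_set_text_to_list(text):
    """Turn the text 'OrderedSet([1, 2, 3])' into the list [1, 2, 3]."""
    # Keep only what sits between the outermost parentheses.
    start = text.index("(") + 1
    end = text.rindex(")")
    inner = text[start:end].strip()
    # An empty OrderedSet() has nothing between the parentheses.
    return list(ast.literal_eval(inner)) if inner else []


result = ordered_set_text_to_list(
    "OrderedSet([1721754, 3622558, 2550234, 2344034, 8550040])"
)
print(result)
```

On the DataFrame this could be used as df['OrderedSetCol'].apply(ordered_set_text_to_list). If the column holds real OrderedSet objects, df['OrderedSetCol'].apply(list) is enough.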
How do I convert multiple `string` columns in my dataframe to datetime columns?

I am in the process of converting multiple string columns to date time columns, but I am running into the following issues:
Example column 1:
1/11/2018 9:00:00 AM
df = df.withColumn(df.column_name, to_timestamp(df.column_name, "MM/dd/yyyy hh:mm:ss aa"))
This works okay
Example column 2:
df = df.withColumn(df.column_name, to_date(df.column_name, "yyyy-MM-dd'T'HH:mm:ss'-05:00'"))
This works okay
Example column 3:
df = df.withColumn(df.column_name, to_date(df.column_name, "yyyyMMdd"))
This does not work. I get this error:
AnalysisException: "cannot resolve 'unix_timestamp(t.`date`,
'yyyyMMdd')' due to data type mismatch: argument 1 requires (string or
date or timestamp) type, however, 't.`date`' is of int type.
I feel like it should be straightforward, but I am missing something.
The error is pretty self-explanatory: your column needs to be a String.
Are you sure your column is already a String? It seems not. You can cast it to String first with column.cast
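from pyspark.sql.functions import to_date
from pyspark.sql.types import StringType
# withColumn expects the column name as a string; cast the int column to string before parsing
df = df.withColumn("column_name", to_date(df.column_name.cast(StringType()), "yyyyMMdd"))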
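The same idea, shown outside Spark as a plain-Python sketch (this is not PySpark code): a date stored as an integer has to become text before a date pattern can be applied to it. The value 20180111 and the variable names are made up for illustration.

```python
from datetime import datetime

# Hypothetical date stored as an integer, as in the failing column.
date_as_int = 20180111

# The parser needs text, so convert the integer to a string first.
parsed = datetime.strptime(str(date_as_int), "%Y%m%d").date()
print(parsed)
```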

Pandas is messing with a high resolution integer on read_csv

EDIT: This was Excel's fault changing the data type, not Pandas.
When I read a CSV using pd.read_csv(file) a column of super long ints gets converted to a low res float. These ints are a date time in microseconds.
CSV Columns of some values:
pd.read_csv produces: 1.55551e+16
how do I get it to report the exact int?
I've tried using: float_precision='high'
It's possible that this is caused by the way Pandas handles missing values, meaning that your column is importing as floats, to allow the missing values to be coded as NaN.
A simple solution would be to force the column to import as a str, then impute or remove missing values, and then convert to int:
import pandas as pd
df = pd.read_csv(file, dtype={'col1': str}) # Edit to use appropriate column reference
# Empty fields are still read as NaN, so to just remove rows with missing values, something like:
df = df.dropna(subset=['col1'])
# Then convert to integer
df.col1 = df.col1.astype('int64')
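A self-contained version of the steps above, assuming a small in-memory CSV in place of the real file. The column names, the values and the final conversion to datetime with unit='us' are illustrative only.

```python
import io

import pandas as pd

# Hypothetical CSV: microsecond timestamps with one missing value.
csv_text = "col1,label\n1555510000123456,a\n,b\n1555510000654321,c\n"

# Read the column as text so no digits are lost to float rounding.
df = pd.read_csv(io.StringIO(csv_text), dtype={"col1": str})

# Drop the rows with missing values, then convert to 64-bit integers.
df = df.dropna(subset=["col1"])
df["col1"] = df["col1"].astype("int64")

# The integers are microseconds since the epoch, so they convert directly.
df["when"] = pd.to_datetime(df["col1"], unit="us")
print(df)
```

Reading as text first sidesteps the float conversion entirely, which matters here because these values are too large for a 64-bit float to represent exactly.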
