TypeError on Pandas DataFrame - python-3.x

I get an error when trying to convert integer values to datetime format in a CSV file using pandas.
The code I'm using is:
import pandas as pd
from datetime import datetime,timedelta
data=pd.read_csv("Dataset.csv",low_memory=False)
data.Date = data.Date.apply(lambda x:datetime.strptime(x, '%Y-%m-%d'))
The DataFrame is:
The error is:
TypeError: strptime() argument 1 must be str, not int
Does anyone know what is wrong here?
Thank you!!
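The error message says the Date column was read as integers, so strptime() receives an int instead of a str. The DataFrame itself is not shown, so the following is only a sketch under the assumption that the integers encode dates as YYYYMMDD (for example 20190528); the sample values and the format string are made up for illustration and would need adjusting to the real data.

```python
import pandas as pd

# Hypothetical sample: the real Dataset.csv is not shown in the question.
data = pd.DataFrame({"Date": [20190528, 20190529]})

# Cast the integers to strings first, then parse them with an explicit format.
# "%Y%m%d" is an assumption about how the integers encode the date.
data["Date"] = pd.to_datetime(data["Date"].astype(str), format="%Y%m%d")
```

pd.to_datetime works on the whole column at once, so the apply/lambda is not needed.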

Related

How to change first row values of csv file?

I am trying to change all negative values to 0 in Excel files.
However, it seems that pandas skips the first row.
How can I prevent the first row from being skipped? Thank you.
Here is the code:
# importing pandas module
import pandas as pd
import numpy as np
import csv
from pandas import DataFrame
df = pd.read_csv("FAPI-N2-rere_2D_modified.csv")
df[df < 0 ] = 0
df.to_csv('FAPI-N2-rere_2D_modified2.csv')
==========================================================
I tried adding header=None to the code above:
# importing pandas module
import pandas as pd
import numpy as np
import csv
from pandas import DataFrame
df = pd.read_csv("FAPI-N2-rere_2D_modified.csv", header=None)
df[df < 0 ] = 0
df.to_csv('FAPI-N2-rere_2D_modified2.csv')
However, I keep getting this TypeError:
TypeError: '<' not supported between instances of 'str' and 'int'
I would really appreciate it if anyone could help me.
Thank you so much!
Some columns of your DataFrame contain strings.
If their content is numeric, you can try converting them to a numeric dtype. How to do that is explained here:
Change column type in pandas
If some columns genuinely contain strings, you need to replace the negative values only in the numeric columns.
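As a minimal sketch of that last suggestion: read_csv treats the first row as the header by default, which is why it looks skipped, and with header=None that row becomes string data and triggers the TypeError. One way around both is to keep the header and touch only the numeric columns. The column names and values below are invented for illustration.

```python
import pandas as pd

# Hypothetical data standing in for the CSV: one text column, two numeric ones.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "x": [-1.5, 2.0, -3.0],
    "y": [4, -5, 6],
})

# Select only the numeric columns, so string columns are never compared with 0.
num_cols = df.select_dtypes(include="number").columns

# clip(lower=0) replaces every negative value with 0 and leaves the rest as-is.
df[num_cols] = df[num_cols].clip(lower=0)
```

When writing the result back, to_csv also writes the index by default; passing index=False avoids an extra column.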

Turn numeric text string with powers of ten nomenclator (e+) into float in python pandas

I've got a DataFrame with more than 30000 rows and almost 40 columns, loaded from a CSV file.
Most of it mixes str and int features:
-integers are int
-floats and powers of ten are str
It looks like this:
Id A B
1 2.5220019e+008 1742087
2 1.7766118e+008 2223964.5
3 3.3750285e+008 2705867.8
4 97782360 2.5220019e+008
I've tried the following code:
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point, LineString, shape
df = pd.read_csv('mycsvfile.csv').astype(float)
Which yields this error message:
ValueError: could not convert string to float: '-1.#IND'
I guess it has to do with the exponential notation for powers of ten (e+), which the Python libraries aren't able to convert.
Is there a way to fix it?
Following my conversation with QuangHoang, I should apply the function:
pd.to_numeric(df['column'], errors='coerce')
Since almost the whole DataFrame consists of str objects, I ran the following line:
df2 = df.apply(lambda x : pd.to_numeric(x, errors='coerce'))
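To illustrate what that line does, here is a small sketch on made-up values modelled on the table above. It suggests that the exponent notation itself parses fine and that the '-1.#IND' marker (an indeterminate-value token) is what cannot be converted; with errors='coerce' it simply becomes NaN.

```python
import pandas as pd

# Hypothetical sample mimicking the question: numbers stored as strings,
# plus one '-1.#IND' entry like the one named in the error message.
df = pd.DataFrame({
    "A": ["2.5220019e+008", "-1.#IND", "97782360"],
    "B": ["1742087", "2223964.5", "2.5220019e+008"],
})

# Convert every column; values that cannot be parsed are replaced with NaN.
df2 = df.apply(pd.to_numeric, errors="coerce")
```

Calling df2.isna().sum() afterwards shows how many values per column were coerced to NaN.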

Convert panda column to a string

I am trying to run the script below to add two columns to the left of a file; however, it keeps giving me
ValueError: header must be integer or list of integers
Below is my code:
import pandas as pd
import numpy as np
read_file = pd.read_csv("/home/ex.csv",header='true')
df=pd.DataFrame(read_file)
def add_col(x):
    df.insert(loc=0, column='Creation_DT', value=pd.to_datetime('today'))
    df.insert(loc=1, column='Creation_By', value="Sean")
    df.to_parquet("/home/sample.parquet")
add_col(df)
Is there any way to make the Creation_DT column a string?
According to the pandas docs, header is the row number(s) to use as the column names and as the start of the data, and it must be an int or a list of ints. So you have to pass header=0 to read_csv instead of header='true'.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Also, read_csv already returns a DataFrame, so you don't need to wrap the result in pd.DataFrame again. Just use
df = pd.read_csv("/home/ex.csv", header=0)
You can try:
import pandas as pd
import numpy as np
read_file = pd.read_csv("/home/ex.csv")
df=pd.DataFrame(read_file)
def add_col(x):
    df.insert(loc=0, column='Creation_DT', value=str(pd.to_datetime('today')))
    df.insert(loc=1, column='Creation_By', value="Sean")
    df.to_parquet("/home/sample.parquet")
add_col(df)
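If the exact text of the timestamp matters, strftime gives more control than str(). This is only a sketch: the sample frame and the format string are assumptions, and the parquet write is left out so the snippet runs without a file path.

```python
import pandas as pd

# Hypothetical stand-in for the frame read from /home/ex.csv.
df = pd.DataFrame({"a": [1, 2]})

# Format the current timestamp as text; the format string is an assumption.
created = pd.Timestamp.today().strftime("%Y-%m-%d %H:%M:%S")

# A scalar value is broadcast to every row of the new column.
df.insert(loc=0, column="Creation_DT", value=created)
df.insert(loc=1, column="Creation_By", value="Sean")
```

Each entry in Creation_DT is then a plain Python string rather than a Timestamp.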

Pandas plotting graph with timestamp

pandas 0.23.4
python 3.5.3
I have some code that looks like below
import pandas as pd
from datetime import datetime
from matplotlib import pyplot
def dateparse():
    return datetime.strptime("2019-05-28T00:06:20,927", '%Y-%m-%dT%H:%M:%S,%f')
series = pd.read_csv('sample.csv', delimiter=";", parse_dates=True,
                     date_parser=dateparse, header=None)
series.plot()
pyplot.show()
The CSV file looks like below
2019-05-28T00:06:20,167;2070
2019-05-28T00:06:20,426;147
2019-05-28T00:06:20,927;453
2019-05-28T00:06:22,688;2464
2019-05-28T00:06:27,260;216
As you can see, 2019-05-28T00:06:20,167 is the timestamp with milliseconds and 2070 is the value that I want plotted.
When I run this, the graph is drawn; however, on the X-axis I see plain numbers, which is a bit odd. I was expecting to see actual timestamps (as in MS Excel). Can someone tell me what I am doing wrong?
You did not set the datetime column as the index. Also, you don't need a date parser; just pass the columns you want to parse:
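from io import StringIO  # pd.compat.StringIO is not available in newer pandas
import pandas as pd
import matplotlib.pyplot as plt
dfstr = '''2019-05-28T00:06:20,167;2070
2019-05-28T00:06:20,426;147
2019-05-28T00:06:20,927;453
2019-05-28T00:06:22,688;2464
2019-05-28T00:06:27,260;216'''
df = pd.read_csv(StringIO(dfstr), sep=';',
                 header=None, parse_dates=[0])
# column 0 holds the parsed timestamps, column 1 the values
plt.plot(df[0], df[1])
plt.show()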
This plots the values against the timestamps. Alternatively, setting the timestamp column as the index:
df.set_index(0)[1].plot()
gives a slightly better plot.
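If automatic date parsing does not recognise the comma before the milliseconds, the column can be parsed explicitly. The sketch below reuses the sample rows from the question; the column names timestamp and value are made up for illustration, and the plotting call is left out so the snippet runs without matplotlib.

```python
from io import StringIO

import pandas as pd

raw = """2019-05-28T00:06:20,167;2070
2019-05-28T00:06:20,426;147
2019-05-28T00:06:20,927;453
2019-05-28T00:06:22,688;2464
2019-05-28T00:06:27,260;216"""

# Read both columns as-is; "timestamp" and "value" are hypothetical names.
df = pd.read_csv(StringIO(raw), sep=";", header=None,
                 names=["timestamp", "value"])

# Parse the timestamps with an explicit format; %f consumes the milliseconds.
df["timestamp"] = pd.to_datetime(df["timestamp"],
                                 format="%Y-%m-%dT%H:%M:%S,%f")

# With a DatetimeIndex, plotting the series puts real timestamps on the X-axis.
series = df.set_index("timestamp")["value"]
```

Calling series.plot() followed by pyplot.show() should then label the X-axis with times instead of row numbers.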

numpy busday_count for Days difference gives TypeError: dtype('<M8[us]') to dtype('<M8[D]') according to the rule 'safe'

I am trying to calculate the number of days between two dates and I get an error:
TypeError: ("Iterator operand 0 dtype could not be cast from dtype('<M8[us]') to dtype('<M8[D]') according to the rule 'safe'", 'occurred at index 0')
The column REASSIGN_DATE has blank values in a few rows, which I fill from another column called SETUP_DATE so that every row has a date.
Then I try to calculate the number of days between the two dates using the code below.
import pandas as pd
import numpy as np
import datetime
from datetime import date
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
import xlrd
import workdays
import defusedxml
from defusedxml import EntitiesForbidden  # used in secure_open_workbook below
from xlrd import open_workbook
defusedxml.defuse_stdlib()
def secure_open_workbook(**kwargs):
    try:
        return open_workbook(**kwargs)
    except EntitiesForbidden:
        raise ValueError('Please use a xlsx file without XEE')
#loading Raw Data
releases = pd.read_excel(r'C:\Desktop\Releases.xlsx',
                         sheet_name='Releases',
                         header=0
                         )
releases.loc[releases['REASSIGN_DATE'].isnull(),'REASSIGN_DATE']=releases['SETUP_DATE']
releases['REASSIGN_DATE']=pd.to_datetime(releases['REASSIGN_DATE'])
releases['RELEASED_DATE']=pd.to_datetime(releases['RELEASED_DATE'])
releases['RELEASED_DAYS'] = releases.apply(
    lambda x: np.busday_count(x.REASSIGN_DATE, x.RELEASED_DATE), axis=1)
releases_2=releases.drop(['SETUPDATE','RELEASEDDATE','REASSIGNDATE'],axis=1)
This is where I get the error.
I also tried adding astype('datetime64[D]'); however, I then get an error saying that datetime64[ns] cannot be converted to datetime64[D].
How can I avoid this error?
Regards,
Ren.
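np.busday_count expects day-resolution dates (datetime64[D]), while pandas stores timestamps at a finer resolution, and NumPy refuses to downcast them under its 'safe' casting rule. One possible workaround, sketched here on invented dates, is to convert the underlying NumPy arrays explicitly and call busday_count once on whole columns instead of row by row with apply.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real Releases.xlsx is not shown in the question.
releases = pd.DataFrame({
    "SETUP_DATE": pd.to_datetime(["2019-05-27", "2019-05-20"]),
    "REASSIGN_DATE": pd.to_datetime([None, "2019-05-28"]),
    "RELEASED_DATE": pd.to_datetime(["2019-06-03", "2019-05-31"]),
})

# Fill blank reassign dates from the setup date, as in the question.
releases["REASSIGN_DATE"] = releases["REASSIGN_DATE"].fillna(
    releases["SETUP_DATE"])

# Convert on the NumPy side: astype() on an ndarray allows the downcast
# to day resolution that the 'safe' rule inside busday_count rejects.
start = releases["REASSIGN_DATE"].to_numpy().astype("datetime64[D]")
end = releases["RELEASED_DATE"].to_numpy().astype("datetime64[D]")

# Vectorised count of business days in [start, end) for every row.
releases["RELEASED_DAYS"] = np.busday_count(start, end)
```

The sketch assumes no dates are missing once REASSIGN_DATE has been filled; rows that still contain NaT would need to be handled before the call.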
