Cannot convert object to np.int64 with Numpy - python-3.x

I have a dataframe with 3 columns with the following dtypes:
df.info()
tconst       object
directors    object
writers      object
Now, I have to change the column tconst to dtype int64. I tried this code, but it throws errors:
df = pd.read_csv('title.crew.tsv',
                 header=None, sep='\t',
                 encoding='latin1',
                 names=['tconst', 'directors', 'writers'],
                 dtype={'tconst': np.int64, 'directors': np.int64})
Error 1: ValueError: invalid literal for int() with base 10: 'tconst'
Error 2: TypeError: Cannot cast array from dtype('O') to dtype('int64') according to the rule 'safe'
What is going wrong here?

In my opinion, the problem here is the parameter header=None, which is meant for reading a file with no CSV header.
The solution is to remove it, because the first row of the file is the header, and it is then converted to the DataFrame's column names:
df = pd.read_csv('title.crew.tsv',
                 sep='\t',
                 encoding='latin1')
Another problem is the tt and nm prefixes in the columns, so the values cannot be converted to integers directly.
The solution is to strip the prefix first:
df['tconst'] = df['tconst'].str[2:].astype(int)
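Putting both fixes together, a minimal sketch (assuming the standard IMDb title.crew.tsv layout, where directors and writers hold comma-separated, nm-prefixed IDs and therefore have to stay as strings):

import pandas as pd

# Let the first row become the header instead of passing header=None.
df = pd.read_csv('title.crew.tsv', sep='\t', encoding='latin1')

# Drop the 'tt' prefix before casting; the remaining characters are digits.
df['tconst'] = df['tconst'].str[2:].astype(int)
print(df.dtypes)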

Related

Try to convert string to date row by row in pandas, or similar

I need to join dataframes with dates in the format '%Y%m%d'. Some of the data is wrong or missing, and when I use pandas with:
try:
    df['data'] = pd.to_datetime(df['data'], format='%Y%m%d')
except:
    pass
if even one row is wrong, it fails to convert the whole column. I would like it to skip only the rows with errors, leaving them unconverted.
I could solve this by looping with datetime, but my question is: is there a better solution for this with pandas?
Pass errors='coerce' to pd.to_datetime to convert the values with the wrong date format to NaT. Then use Series.fillna to fill those NaT with the original input values.
df['data'] = (
    pd.to_datetime(df['data'], format='%Y%m%d', errors='coerce')
      .fillna(df['data'])
)
From the docs
errors : {‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’
If 'raise', then invalid parsing will raise an exception.
If 'coerce', then invalid parsing will be set as NaT.
If 'ignore', then invalid parsing will return the input.
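A quick demonstration on made-up values (the middle one deliberately malformed); note that after the fillna the column has object dtype, mixing Timestamps with the surviving original strings:

import pandas as pd

df = pd.DataFrame({'data': ['20210101', '2021x102', '20210103']})
df['data'] = (
    pd.to_datetime(df['data'], format='%Y%m%d', errors='coerce')
      .fillna(df['data'])
)
print(df['data'])  # rows 0 and 2 become Timestamps; row 1 keeps '2021x102'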

DF - data conversion issues

I have a question regarding a project that I am doing. I am getting a conversion error, "cannot convert the series to <class 'int'>", and I cannot see why: the values I have are int64, yet the conversion to a plain Python int fails.
I have a csv file called "test.csv" and it is structured like this:
date,value
2016-05-09,1201
2016-05-10,2329
2016-05-11,1716
2016-05-12,10539
...
I import the data, parse the dates, and set the index column to 'date':
df = pd.read_csv("test.csv", parse_dates=True)
df = df.set_index('date')
Afterwards I trim off the bottom and top 2.5% of the values:
df = df[(df['value'] >= df['value'].quantile(0.025)) & (df['value'] <= df['value'].quantile(0.975))]
I print the data types that I've got and find only one:
print (df.dtypes)
value int64
dtype: object
If I run it against this code (as part of a test):
actual = int(time_series_visualizer.df.count(numeric_only=True))
I get this error:
TypeError: cannot convert the series to <class 'int'>
I tried converting to another type, to see if it was an issue with int64:
df.value.astype(int)
df.value.astype(float)
but neither worked.
Does anyone have any suggestions that I could try?
Thanks
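For context on the message itself: DataFrame.count() returns a Series with one entry per column, and int() can convert that Series only when it holds exactly one element, so any extra column in the frame the test sees makes the conversion raise exactly this TypeError. A minimal sketch of the distinction, using hypothetical values:

import pandas as pd

df = pd.DataFrame({'value': [1201, 2329, 1716]})

print(df.count(numeric_only=True))  # a Series, indexed by column name

# Counting a single column yields a scalar, which int() accepts.
actual = int(df['value'].count())
print(actual)  # 3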

Data frame data type conflict... conversion

I am receiving the error below upon running a Python file:
invalid literal for int() with base 10: 'data missing'
It looks as though some of the data in my dataframe is not of a type compatible with an arithmetic operation I would like to perform.
Can someone advise on how I might locate the position of the data that is triggering the error? Or how to bypass the error entirely with a preprocessing step that lets the normalization step run?
I am confused, because missing data should have been dropped by df1.dropna and so should no longer be there.
The original line throwing the error was the one used to normalize the data (the last line below).
I've tried to convert the dataframe with:
df1 = df1.astype(int)
df1 = pd.concat([df2, df3], axis=1, join_axes=[df2.index])
df1 = df1.fillna(method='bfill')
df1 = df1.dropna(axis=0)
df1 = df1.astype(int)
df1 = (df1 - df1.min()) / (df1.max() - df1.min())
I think you should try df1.dtypes to check the data type of each column first.
Here is the documentation for that:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html
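To actually locate entries like the 'data missing' string from the error, one option (a sketch, assuming the placeholders are non-numeric strings) is to coerce everything to numeric and compare against the original:

import pandas as pd

# Cells that were non-null but turn into NaN under coercion are non-numeric.
coerced = df1.apply(pd.to_numeric, errors='coerce')
bad_mask = coerced.isna() & df1.notna()
print(df1[bad_mask.any(axis=1)])  # rows holding values like 'data missing'

# Once located, blank those cells out so the existing dropna step removes them:
df1 = df1.mask(bad_mask)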

How to handle min_itemsize exception in writing to pandas HDFStore

I am using pandas HDFStore to store dfs which I have created from data.
import gzip
import json

import pandas as pd
from pandas import json_normalize

store = pd.HDFStore(storeName, ...)
for file in downloaded_files:
    try:
        with gzip.open(file) as f:
            data = json.loads(f.read())
        df = json_normalize(data)
        store.append(storekey, df, format='table', append=True)
    except TypeError:
        pass  # file error
I have received the error:
ValueError: Trying to store a string with len [82] in [values_block_2] column but
this column has a limit of [72]!
Consider using min_itemsize to preset the sizes on these columns
I found that it is possible to set min_itemsize for the column involved, but this is not a viable solution, as I know neither the maximum length I will encounter nor all the columns in which the problem will occur.
Is there a way to automatically catch this exception and handle it each time it occurs?
I think you can do it this way:
store.append(storekey, df, format='table', append=True, min_itemsize={'Long_string_column': 200})
Basically it's very similar to the following CREATE TABLE SQL statement:
create table df(
    id int,
    str varchar(200)
);
where 200 is the maximum allowed length for the str column.
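If the maximum length is unknown up front, one workaround (a sketch, not a guarantee: the sizes must be set before the table's first append, and a later frame with still-longer strings would fail again, hence the doubling as headroom) is to measure the string columns of the frame itself:

# Build a min_itemsize mapping from the frame's own string columns,
# doubling the observed maximum as headroom for later appends.
str_cols = df.select_dtypes(include='object').columns
sizes = {c: int(df[c].astype(str).str.len().max()) * 2 for c in str_cols}
store.append(storekey, df, format='table', append=True, min_itemsize=sizes)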
The following links might be very helpful:
https://www.google.com/search?q=pandas+ValueError%3A+Trying+to+store+a+string+with+len+in+column+but+min_itemsize&pws=0&gl=us&gws_rd=cr
HDFStore.append(string, DataFrame) fails when string column contents are longer than those already there
Pandas pytable: how to specify min_itemsize of the elements of a MultiIndex

Pandas read_csv mixed types columns as string

Is there any option in pandas' read_csv function that can automatically convert every item of an object-dtype column to str?
For example, I get the following when trying to read a CSV file:
mydata = pandas.read_csv(myfile, sep="|", header=None)
C:\...\pandas\io\parsers.py:1159: DtypeWarning: Columns (6,635) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
Is there a way such that (i) the warning is suppressed from printing, but (ii) I can capture the warning message in a string from which I can extract the specific columns (6 and 635 in this case), so that I can fix the dtypes afterwards? Or, alternatively, can I tell read_csv to convert the values in any mixed-type column to str?
I'm using Python 3.4.2 and Pandas 0.15.2
The DtypeWarning is a Warning, which can be caught and acted on. To catch it, we need to wrap the execution in a warnings.catch_warnings block. The warning message and the affected columns can then be extracted with a regex and used to set the correct column type via .astype(target_type):
import re
import warnings

import pandas

myfile = 'your_input_file_here.txt'
target_type = str  # the desired output type

with warnings.catch_warnings(record=True) as ws:
    warnings.simplefilter("always")
    mydata = pandas.read_csv(myfile, sep="|", header=None)

print("Warnings raised:", ws)

# We had a warning on specific columns; try to load them as strings.
for w in ws:
    s = str(w.message)
    print("Warning message:", s)
    match = re.search(r"Columns \(([0-9,]+)\) have mixed types\.", s)
    if match:
        columns = match.group(1).split(',')  # the column indices, as a list
        columns = [int(c) for c in columns]
        print("Applying %s dtype to columns:" % target_type, columns)
        mydata.iloc[:, columns] = mydata.iloc[:, columns].astype(target_type)
The result should be the same DataFrame with the problematic columns set to a str type. It is worth noting that string columns in a Pandas DataFrame are reported as object.
As noted in the warning message itself, the simplest way to keep pd.read_csv from producing mixed dtypes is to set low_memory=False:
df = pd.read_csv(..., low_memory=False)
This luxury, however, is not available when concatenating multiple dataframes using pd.concat.
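And if treating every column as text is acceptable, a blanket dtype at read time sidesteps the mixed-type inference entirely:

mydata = pandas.read_csv(myfile, sep="|", header=None, dtype=str)
# Every column comes back as str; missing values still appear as NaN.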
