Data frame data type conflict...conver - python-3.x

I am receiving the error below upon running a Python file.
invalid literal for int() with base 10: 'data missing'
It looks as though some of the data in my dataframe is not of a type compatible with an arithmetic operation I would like to perform.
Can someone advise on how I might locate the position of the data that is giving the error? And/or bypass the error entirely with a preprocessing step that still allows the normalization step to run?
I am confused because missing data should be dropped by df1.dropna, so it should not still be there.
The original line throwing the error is the one used to normalize the data (the last line below).
I've tried to convert the dataframe with
df1 = df1.astype(int)
df1 = pd.concat([df2,df3], axis = 1, join_axes = [df2.index])
df1 = df1.fillna(method = 'bfill')
df1 = df1.dropna(axis =0)
df1 = df1.astype(int)
df1 = (df1 - df1.min())/(df1.max() - df1.min())

I think you should try df1.dtypes to check the data type of each column first.
Here is the documentation for it:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html
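To actually locate the offending cells, here is a minimal sketch (not from the original answer, and assuming df1 already holds the concatenated data): pd.to_numeric with errors='coerce' flags every value that cannot be parsed as a number, such as the string 'data missing' that dropna will never remove.

import pandas as pd

# Coercing turns non-numeric cells into NaN; comparing against the original
# values shows exactly where the offending entries sit.
coerced = df1.apply(pd.to_numeric, errors='coerce')
bad_mask = coerced.isna() & df1.notna()
print(df1[bad_mask.any(axis=1)])  # rows containing non-numeric values

# One possible preprocessing step: keep the coerced frame, drop the bad rows,
# then run the same normalization as before.
df1 = coerced.dropna(axis=0)
df1 = (df1 - df1.min()) / (df1.max() - df1.min())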

Related

Join different DataFrames using loop in Pyspark

I have 5 CSV files in a folder and want to join them into one data frame in PySpark. I use the code below:
name_file =['A', 'B', 'C', 'D', 'V']
for n in name_file:
n= spark.read.csv(fullpath+n+'.csv'
,header=False,
inferSchema= True)
full_data=full_data.join(n,["id"])
Error: I got an unexpected result; the last dataframe is joined just with itself.
Expected result: there should be 6 columns. Each CSV has 2 columns, one of which is common to all of them, and the join should be on that column. As a result, the final data frame should have the common column plus 5 distinct columns, one from each CSV file.
There seem to be several things wrong with the code, or perhaps you have not provided the complete code.
Have you defined fullpath?
You have set header=False, so how will Spark know that there is an "id" column?
Your indentation looks wrong under the for loop.
full_data has not been defined yet, so how are you using it on the right side of the assignment inside the for loop? I suspect you have initialized it to the first CSV file and are then attempting to join it with the first CSV again.
I ran a small test with the code below, which worked for me and addresses the questions I've raised above. You can adjust it to your needs.
fullpath = '/content/sample_data/'
full_data = spark.read.csv(fullpath + 'Book1.csv',
                           header=True,
                           inferSchema=True)
name_file = ['Book2', 'Book3']
for n in name_file:
    n = spark.read.csv(fullpath + n + '.csv',
                       header=True,
                       inferSchema=True)
    full_data = full_data.join(n, ["id"])
full_data.show(5)

Why SettingWithCopyWarning is raised using .loc?

I have checked similar questions on SO about the SettingWithCopyWarning raised when using .loc, but I still don't understand why I get the warning in the following example.
It appears on line 3; I managed to make it disappear with .copy(), but I would like to understand why .loc didn't work here specifically.
Does making a conditional slice create a view even if it's done with .loc?
df = pd.DataFrame( data=[0,1,2,3,4,5], columns=['A'])
df.loc[:,'B'] = df.loc[:,'A'].values
dfa = df.loc[df.loc[:,'A'] < 4,:] # here .copy() removes the error
dfa.loc[:,'C'] = [3,2,1,0]
Edit : pandas version is 1.2.4
dfa = df.loc[df.loc[:,'A'] < 4,:]
dfa is a slice of the df dataframe, still referencing the original dataframe: a view. .copy() creates a separate copy, not just a view of the first dataframe.
dfa.loc[:,'C'] = [3,2,1,0]
When it's a view and not a copy, you get the warning: "A value is trying to be set on a copy of a slice from a DataFrame."
.loc locates the rows matching the condition you give it, but the result is still a view that you're setting values on if you don't make it a copy of the dataframe.
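A minimal sketch of the fix the question already hints at (the same example, just with an explicit copy of the conditional slice):

import pandas as pd

# Copying the slice breaks the link back to df, so adding a column to dfa
# no longer triggers SettingWithCopyWarning.
df = pd.DataFrame(data=[0, 1, 2, 3, 4, 5], columns=['A'])
df.loc[:, 'B'] = df.loc[:, 'A'].values
dfa = df.loc[df.loc[:, 'A'] < 4, :].copy()
dfa.loc[:, 'C'] = [3, 2, 1, 0]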

ValueError: could not convert string to float: 'Pregnancies'

import csv

def loadCsv(filename):
    lines = csv.reader(open('diabetes.csv'))
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset
Hello, I'm trying to implement Naive Bayes, but it gives me this error even though I've manually changed the type of each column to float; it still gives me the error.
Above is the function used for the conversion.
The ValueError is because the code is trying to cast (convert) the items in the CSV header row, which are strings, to floats. You could just skip the first row of the CSV file, for example:
for i in range(1, len(dataset)):  # specifying 1 here will skip the first row
    dataset[i] = [float(x) for x in dataset[i]]
Note: that would leave the first item in dataset as the headers (str).
Personally, I'd use pandas, which has a read_csv() method, which will load the data directly into a dataframe.
For example:
import pandas as pd
dataset = pd.read_csv('diabetes.csv')
This will give you a dataframe though, not a list of lists. If you really want a list of lists, you could use dataset.values.tolist().
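For example, a minimal sketch of that route (assuming diabetes.csv has its column names in the first row):

import pandas as pd

# read_csv treats the first row as the header, so the remaining rows are
# parsed as numbers automatically; tolist() then recovers a list of lists.
dataset = pd.read_csv('diabetes.csv')
rows = dataset.values.tolist()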

Cannot convert object to np.int64 with Numpy

I have a dataframe with 3 columns with the following dtypes:
df.info()
tconst object
directors object
writers object
Please see the data itself:
Now, I have to change the column tconst to dtype:int64. I tried this code but it throws an error:
df = pd.read_csv('title.crew.tsv',
                 header=None, sep='\t',
                 encoding='latin1',
                 names=['tconst', 'directors', 'writers'],
                 dtype={'tconst': np.int64, 'directors': np.int64})
Error 1: ValueError: invalid literal for int() with base 10: 'tconst'
Error 2: TypeError: Cannot cast array from dtype('O') to dtype('int64') according to the rule 'safe'
What is going wrong here?
In my opinion, the problem here is the parameter header=None, which is meant for reading a file that has no CSV header.
The solution is to remove it, because the first row of the file is the header, which is converted to the DataFrame's column names:
df = pd.read_csv('title.crew.tsv',
                 sep='\t',
                 encoding='latin1')
Another problem is the tt and nm prefixes in the columns, so the values cannot be converted to integers directly.
The solution is:
df['tconst'] = df['tconst'].str[2:].astype(int)
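Putting the two fixes together, a minimal sketch (assuming the file really does have a header row and tt-prefixed IDs in tconst):

import pandas as pd

# Keep the header row (no header=None) and strip the two-character prefix
# such as 'tt' before casting the IDs to integers.
df = pd.read_csv('title.crew.tsv', sep='\t', encoding='latin1')
df['tconst'] = df['tconst'].str[2:].astype(int)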

How to handle min_itemsize exception in writing to pandas HDFStore

I am using pandas HDFStore to store dfs which I have created from data.
store = pd.HDFStore(storeName, ...)
for file in downloaded_files:
    try:
        with gzip.open(file) as f:
            data = json.loads(f.read())
            df = json_normalize(data)
            store.append(storekey, df, format='table', append=True)
    except TypeError:
        pass  # File Error
I have received the error:
ValueError: Trying to store a string with len [82] in [values_block_2] column but this column has a limit of [72]!
Consider using min_itemsize to preset the sizes on these columns
I found that it is possible to set min_itemsize for the column involved, but this is not a viable solution because I do not know the maximum length I will encounter, nor all the columns in which the problem will occur.
Is there a way to automatically catch this exception and handle it each time it occurs?
I think you can do it this way:
store.append(storekey, df, format='table', append=True, min_itemsize={'Long_string_column': 200})
Basically it's very similar to the following CREATE TABLE SQL statement:
create table df (
    id int,
    str varchar(200)
);
where 200 is the maximum allowed length for the str column.
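If the maximum lengths are not known up front, as the question notes, one option (a sketch under that assumption, not part of the original answer) is to measure the string columns of the incoming DataFrame and pass a generous min_itemsize on the first append, since the limit also has to cover every later append:

# Hypothetical sizing step: measure every object (string) column, add some
# headroom, and hand the resulting dict to HDFStore.append().
str_cols = df.select_dtypes(include='object').columns
sizes = {col: max(int(df[col].astype(str).str.len().max()) * 2, 64)
         for col in str_cols}
store.append(storekey, df, format='table', append=True, min_itemsize=sizes)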
The following links might be very helpful:
https://www.google.com/search?q=pandas+ValueError%3A+Trying+to+store+a+string+with+len+in+column+but+min_itemsize&pws=0&gl=us&gws_rd=cr
HDFStore.append(string, DataFrame) fails when string column contents are longer than those already there
Pandas pytable: how to specify min_itemsize of the elements of a MultiIndex
