Pandas is messing with a high resolution integer on read_csv - python-3.x

EDIT: This was Excel's fault changing the data type, not Pandas.
When I read a CSV using pd.read_csv(file), a column of very long ints gets converted to a low-resolution float. These ints are datetimes in microseconds.
Example: a CSV column of some values:
15555071095204000
15555071695202000
15555072295218000
15555072895216000
15555073495207000
15555074095206000
15555074695212000
15555075295202000
15555075895210000
15555076495216000
15555077095230000
15555077695206000
15555078295212000
15555078895218000
15555079495209000
15555080095208000
15555080530515000
15555086531880000
15555092531889000
15555098531886000
15555104531886000
15555110531890000
15555116531876000
15555122531873000
15555128531884000
15555134531884000
15555140531887000
15555146531874000
pd.read_csv produces: 1.55551e+16
How do I get it to report the exact int?
I've tried using float_precision='high'.

It's possible that this is caused by the way Pandas handles missing values: your column is imported as float so that missing values can be coded as NaN.
A simple solution would be to force the column to import as str, then impute or remove missing values, and then convert to int:
import pandas as pd
df = pd.read_csv(file, dtype={'col1': str})  # edit to use the appropriate column reference
# If you just want to remove rows with missing values, something like:
df = df[df.col1.notna()]
# Then convert to integer
df.col1 = df.col1.astype('int64')
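For the values in the question, a quick round-trip check along those lines (a sketch; io.StringIO stands in for the real file, and the column name col1 is assumed):
import io
import pandas as pd

# two of the values from the question, in a hypothetical single-column file
data = io.StringIO("col1\n15555071095204000\n15555071695202000\n")
df = pd.read_csv(data, dtype={'col1': str})   # read as strings first
df['col1'] = df['col1'].astype('int64')       # then convert to 64-bit integers
print(df['col1'].iloc[0])                     # 15555071095204000, exact, no float rounding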
With a Minimal, Complete and Verifiable Example we can pinpoint the problem and update the code to accurately solve it.

Related

ValueError: could not convert string to float: 'Pregnancies'

def loadCsv(filename):
    lines = csv.reader(open('diabetes.csv'))
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset
Hello, I'm trying to implement Naive Bayes, but it's giving me this error even though I've manually changed the type of each column to float.
It still gives me the error.
Above is the function that does the conversion.
The ValueError is because the code is trying to cast (convert) the items in the CSV header row, which are strings, to floats. You could just skip the first row of the CSV file, for example:
for i in range(1, len(dataset)):  # starting at 1 skips the header row
    dataset[i] = [float(x) for x in dataset[i]]
Note: that would leave the first item in dataset as the headers (str).
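Alternatively, you could consume the header before building the list, so it never enters dataset at all (a sketch using the same file name):
import csv

with open('diabetes.csv') as f:
    lines = csv.reader(f)
    next(lines)  # discard the header row
    dataset = [[float(x) for x in row] for row in lines]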
Personally, I'd use pandas, whose read_csv() function loads the data directly into a dataframe.
For example:
import pandas as pd
dataset = pd.read_csv('diabetes.csv')
This will give you a dataframe though, not a list of lists. If you really want a list of lists, you could use dataset.values.tolist().
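For example, combining the two steps (a sketch, assuming you want plain Python lists rather than a dataframe):
import pandas as pd

df = pd.read_csv('diabetes.csv')   # the header row becomes the column names
dataset = df.values.tolist()       # plain list of lists, header excluded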

Using loops to call multiple pandas dataframe columns

I am new to python3 and trying to do chisquared tests on columns in a pandas dataframe. My columns are in pairs: observed_count_column_1, expected count_column_1, observed_count_column_2, expected_count_column_2 and so on. I would like to make a loop to get all column pairs done at once.
I can do this successfully if I specify column index integers or column names manually.
This works
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv(r'count.csv')
chisquare(df.iloc[:,[0]], df.iloc[:,[1]])
This, trying with a loop, does not:
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv(r'count.csv')
for n in [0, 2, 4, 6, 8, 10]:
    chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
The loop code does not seem to run at all; I get no error, but no output either.
I was wondering why this is happening and how I can actually approach this.
Thank you,
Dan
Consider building a data frame of chi-square results from a list of tuples, then assigning column names as indicators for observed and expected frequencies (subsetting even/odd columns by indexed notation):
# CREATE DATA FRAME FROM LIST OF TUPLES
# THEN ASSIGN COLUMN NAMES
chi_square_df = (pd.DataFrame([chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
                               for n in range(0, 11, 2)],
                              columns=['chi_sq_stat', 'p_value'])
                   .assign(obs_freq=df.columns[::2],
                           exp_freq=df.columns[1::2])
                )
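If the number of column pairs varies between files, you could range over df.shape[1] instead of hard-coding the indices (a minimal sketch, assuming observed and expected columns strictly alternate as described):
from scipy.stats import chisquare
import pandas as pd

df = pd.read_csv(r'count.csv')

# walk the observed/expected pairs without hard-coding the last index
for n in range(0, df.shape[1], 2):
    stat, p = chisquare(df.iloc[:, n], df.iloc[:, n + 1])
    print(df.columns[n], stat, p)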
The chisquare() function returns two values, so you can try this:
for n in range(0, 11, 2):
    chisq, p = chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
    print('Chisq: {}, p-value: {}'.format(chisq, p))
You can find what it returns in the docs here https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
Thank you for the suggestions. Using the information from Parfait's comment that loops don't print their results on their own, I managed to find a solution, although not as elegant as their own solution above.
for n in range(0, 11, 2):
    print(chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]]))
This gives the expected results.
Dan

Changing column datatype from Timestamp to datetime64

I have a database I'm reading from Excel as a pandas dataframe, and the dates come in as Timestamp dtype, but I need them to be np.datetime64 so that I can make calculations.
I am aware that pd.to_datetime() and the astype('datetime64[ns]') method do work. However, I am unable to update my dataframe to yield this datatype, for whatever reason, using the code mentioned above.
I have also tried creating an accessory dataframe from the original one, containing just the dates whose type I wish to update, converting it to np.datetime64, and plugging it back into the original dataframe:
dfi = df['dates']
dfi = pd.to_datetime(dfi)
df['dates'] = dfi
But still it doesn't work. I have also tried updating values one by one:
arr_i = df.index
for i in range(len(arr_i)):
    df.at[arr_i[i], 'dates'].to_datetime64()
Edit
The root problem seems to be that the dtype of the column gets updated to np.datetime64, but somehow, when getting single values from within it, they still come back as Timestamp.
Does anyone have a suggestion of a workaround that is fairly fast?
Pandas tries to standardize all forms of datetimes by storing them as NumPy datetime64[ns] values when you assign them to a DataFrame. But when you try to access individual datetime64 values, they are returned as Timestamps.
There is a way to prevent this automatic conversion from happening, however: wrap the list of values in a Series of dtype object:
import numpy as np
import pandas as pd
# create some dates, merely for example
dates = pd.date_range('2000-1-1', periods=10)
# convert the dates to a *list* of datetime64s
arr = list(dates.to_numpy())
# wrap the values you wish to protect in a Series of dtype object.
ser = pd.Series(arr, dtype='object')
# assignment with `df['datetime64s'] = ser` would also work
df = pd.DataFrame({'timestamps': dates,
                   'datetime64s': ser})
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 10 entries, 0 to 9
# Data columns (total 2 columns):
# timestamps 10 non-null datetime64[ns]
# datetime64s 10 non-null object
# dtypes: datetime64[ns](1), object(1)
# memory usage: 240.0+ bytes
print(type(df['timestamps'][0]))
# <class 'pandas._libs.tslibs.timestamps.Timestamp'>
print(type(df['datetime64s'][0]))
# <class 'numpy.datetime64'>
But beware! Although with a little work you can circumvent Pandas' automatic conversion mechanism, it may not be wise to do this. First, converting a NumPy array to a list is usually a sign you are doing something wrong, since it is bad for performance. Second, using object arrays is a bad sign, since operations on object arrays are generally much, much slower than equivalent operations on arrays of native NumPy dtypes.
You may be looking at an XY problem: it may be more fruitful to find a way to (1) work with Pandas Timestamps instead of trying to force Pandas to return NumPy datetime64s, or (2) work with datetime64 array-likes (e.g. Series or NumPy arrays) instead of handling values individually (which causes the coercion to Timestamps).
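For instance, option (2) can be as simple as pulling a whole column out as a NumPy array, which yields datetime64 values without the object-dtype workaround (a sketch; .to_numpy() needs a reasonably recent pandas, otherwise .values behaves the same way here):
import pandas as pd

dates = pd.date_range('2000-1-1', periods=10)
df = pd.DataFrame({'timestamps': dates})

arr = df['timestamps'].to_numpy()   # a datetime64[ns] array, no per-element Timestamps
print(arr.dtype)       # datetime64[ns]
print(type(arr[0]))    # <class 'numpy.datetime64'>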

Python3 - Return CSV with row-level errors for missing data

New to Python. I'm importing a CSV, then if any data is missing I need to return a CSV with an additional column to indicate which rows are missing data. Colleague suggested that I import CSV into a dataframe, then create a new dataframe with a "Comments" column, fill it with a comment on the intended rows, and append it to the original dataframe. I'm stuck at the step of filling my new dataframe, "dferr", with the correct number of rows that would match up to "dfinput".
I have Googled "pandas csv return error column where data is missing" but haven't found anything related to creating a new CSV that marks bad rows. I don't even know if the proposed way is the best way to go about this.
import pandas as pd

dfinput = None
try:
    dfinput = pd.read_csv(r"C:\file.csv")
except:
    print("Uh oh!")

if dfinput is None:
    print("Ack!")
    quit(10)

dfinput.reset_index(level=None, drop=False, inplace=True, col_level=0,
                    col_fill='')

dferr = pd.DataFrame(columns=['comment'])
print("Empty DataFrame", dferr, sep='\n')
Expected results: "dferr" would have an index column with number of rows equal to "dfinput", and comments on the correct rows where "dfinput" has missing values.
Actual results: "dferr" is empty.
My understanding of 'missing data' here would be null values. It seems that for every row, you want the names of null fields.
df = pd.DataFrame([[1, 2, 3],
                   [4, None, 6],
                   [None, 8, None]],
                  columns=['foo', 'bar', 'baz'])

# Create a dataframe of True/False, True where a criterion is met
# (in this case, a null value)
nulls = df.isnull()

# Iterate through every row of *nulls*,
# and extract the column names where the value is True by boolean indexing
colnames = nulls.columns
null_labels = nulls.apply(lambda s: colnames[s], axis=1)

# Now you have a pd.Series where every entry is an array
# (technically, a pd.Index object).
# Pandas arrays have a vectorized .str.join method:
df['nullcols'] = null_labels.str.join(', ')
The .apply() method in pandas can sometimes be a bottleneck in your code; there are ways to avoid using this, but here it seemed to be the simplest solution I could think of.
EDIT: Here's an alternate one-liner (instead of using .apply) that might cut down computation time slightly:
import numpy as np
df['nullcols'] = [colnames[x] for x in nulls.values]
This might be even faster (a bit more work is required):
np.where(df.isnull(), df.columns, '')
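To get back to the original goal of writing a CSV that flags the bad rows, that last expression can be joined per row into a comment column (a sketch; the output file name is made up):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3],
                   [4, None, 6],
                   [None, 8, None]],
                  columns=['foo', 'bar', 'baz'])

# mark each null cell with its column name, then join the marks row by row
col_marks = np.array([c + ' missing; ' for c in df.columns])
flags = np.where(df.isnull(), col_marks, '')
df['comment'] = [''.join(row).rstrip('; ') for row in flags]
df.to_csv('file_with_errors.csv', index=False)   # hypothetical output path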

Round pandas timestamp series to seconds - then save to csv without ms/ns resolution

I have a dataframe, df with index: pd.DatetimeIndex. The individual timestamps are changed from 2017-12-04 08:42:12.173645000 to 2017-12-04 08:42:12 using the excellent pandas rounding command:
df.index = df.index.round("S")
When stored to csv, this format is kept (which is exactly what I want). I also need a date-only column, and this is now easily created:
df = df.assign(DateTimeDay = df.index.round("D"))
When stored to a csv file using df.to_csv(), this does write out the entire timestamp (2017-12-04 00:00:00), except when it is the ONLY column to be saved. So I add the following command before saving:
df["DateTimeDay"] = df["DateTimeDay"].dt.date
...and the csv-file looks nice again (2017-12-04)
Problem description
Now over to the question: I have two other columns with timestamps in the same format as above (but different, AND with a very few NaNs). I want to also round these to seconds (keeping NaNs as NaNs, of course), then make sure that when written to csv they are not padded with zeros below the second resolution. Whatever I try, I am simply not able to do this.
Additional information:
print(df.dtypes)
print(df.index.dtype)
...all results in datetime64[ns]. If I convert them to an index:
df["TimeCol2"] = pd.DatetimeIndex(df["TimeCol2"]).round("s")
df["TimeCol3"] = pd.DatetimeIndex(df["TimeCol3"]).round("s")
...it works, but the csv-file still pads them with unwanted and unnecessary zeros.
Optimal solution: No conversion of the columns (like above) or use of element-wise apply unless they are quick (100+ million rows). My dream command would be like this:
df["TimeCol2"] = df["TimeCol2"].round("s") # Raises TypeError: an integer is required (got type str)
You can specify the date format for datetime dtypes when calling to_csv:
In[170]:
df = pd.DataFrame({'date':[pd.to_datetime('2017-12-04 07:05:06.767')]})
df
Out[170]:
date
0 2017-12-04 07:05:06.767
In[171]:
df.to_csv(date_format='%Y-%m-%d %H:%M:%S')
Out[171]: ',date\n0,2017-12-04 07:05:06\n'
If you want to round the values, you need to round prior to writing to csv:
In[173]:
df1 = df['date'].dt.round('s')
df1.to_csv(date_format='%Y-%m-%d %H:%M:%S')
Out[173]: '0,2017-12-04 07:05:07\n'
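Applying the same idea to the question's columns before writing (a sketch; the column names TimeCol2 and TimeCol3 are taken from the post, and .dt.round leaves missing entries as NaT):
for col in ['TimeCol2', 'TimeCol3']:
    df[col] = df[col].dt.round('s')   # NaT entries pass through unchanged

df.to_csv('rounded.csv', date_format='%Y-%m-%d %H:%M:%S')   # hypothetical file name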
