python - numpy array gets imported as str using read_csv() - python-3.x

I am importing a df like
dbscan_parameter_search = pd.read_csv('./src/temp/05_dbscan_parameter_search.csv',
                                      index_col=0)
type(dbscan_parameter_search['clusters'][0])
which results in str.
How can I keep the datatype as numpy array? I tried
dbscan_parameter_search = pd.read_csv('./src/temp/05_dbscan_parameter_search.csv',
                                      dtype={'clusters': np.int32},
                                      index_col=0)
which results in ValueError: invalid literal for int() with base 10: '[ 0 1 2 ... 139634 139634 139636]'.
Any hint? Thanks

Thanks to hpaulj, the problem was the export rather than the import: the array was literally saved as its string representation, ellipsis dots and all. For a quick workaround see here:
python - how to to_csv() with a column of arrays
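A minimal sketch of that kind of workaround, assuming a hypothetical 'clusters' column holding integer arrays: serialize each array as JSON on export and parse it back on import, since plain str(array) writes something like '[0 1 2 ...]' that cannot be recovered reliably.

```python
import io
import json

import numpy as np
import pandas as pd

# Build a frame whose 'clusters' column holds numpy arrays
# (the column name is taken from the question; the data is made up).
df = pd.DataFrame({'clusters': [np.array([0, 1, 2]), np.array([3, 4])]})

# Serialize each array as JSON text before exporting; str(array)
# would produce '[0 1 2]', which cannot be parsed back reliably.
df['clusters'] = df['clusters'].apply(lambda a: json.dumps(a.tolist()))
buf = io.StringIO()          # stands in for the CSV file on disk
df.to_csv(buf)
buf.seek(0)

# On import, parse the JSON strings back into numpy arrays.
restored = pd.read_csv(buf, index_col=0)
restored['clusters'] = restored['clusters'].apply(
    lambda s: np.array(json.loads(s)))
print(type(restored['clusters'][0]))  # <class 'numpy.ndarray'>
```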

Related

Convert Matlab Datenum into Python datetime

I have a DF that looks like this (it is matlab data):
datesAvail date
0 737272 737272
1 737273 737273
2 737274 737274
3 737275 737275
4 737278 737278
5 737279 737279
6 737280 737280
7 737281 737281
Reading on the internet, I wanted to convert the MATLAB datenum into a Python date using the following solution found here:
python_datetime = datetime.fromordinal(int(matlab_datenum)) + timedelta(days=matlab_datenum%1) - timedelta(days = 366)
where matlab_datenum is, in my case, DF['date'] or DF['datesAvail'].
I get the error TypeError: cannot convert the series to <class 'int'>.
Note that the data type is int:
Out[102]:
datesAvail int64
date int64
dtype: object
I am not sure where I am going wrong. Any help is much appreciated.
I am not sure what you are expecting as an output, but I assume it is a list?
The error is telling you exactly what is wrong: you are trying to convert a Series with int(). The only arguments int() can accept are strings, bytes-like objects, or numbers.
When you call DF['date'] you get a Series, so it needs to be converted into a number (or string, or bytes) first, which means a loop over the whole Series. I would change it to a list first with DF['date'].tolist().
If you want the output as a list, you can use a list comprehension as shown here (sorry, this is long):
python_datetime_list = [datetime.fromordinal(int(i)) + timedelta(days=i%1) - timedelta(days = 366) for i in DF['date'].tolist()]
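If a vectorized alternative is acceptable, the same conversion can be done without a Python loop: MATLAB datenum 719529 corresponds to 1970-01-01 (the Unix epoch), so subtracting that constant and reading the remainder as days matches the fromordinal() formula. A sketch using the sample values from the question:

```python
import pandas as pd

df = pd.DataFrame({'date': [737272, 737273, 737274]})

# MATLAB datenum 719529 is 1970-01-01 (the Unix epoch), so the
# difference, interpreted as days, is the Unix-epoch-based date.
df['pydate'] = pd.to_datetime(df['date'] - 719529, unit='D')
print(df['pydate'].iloc[0])  # 2018-07-31 00:00:00
```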

pandas.Dataframe() mixed data types and strange .fillna() behaviour

I have a dataframe which has two dtypes: object (I was expecting string) and datetime (expected datetime). I don't understand this behavior or why it affects my fillna().
Calling .fillna() with inplace=True wipes the data denoted as int64, despite it being changed with .astype(str).
Calling .fillna() without it does nothing.
I know pandas/numpy dtypes are different from the Python native types, but is this correct behavior, or am I getting something terribly wrong?
sample:
import random
import numpy as np
import pandas as pd

sample = pd.DataFrame({'A': [random.choice(['aabb', np.nan, 'bbcc', 'ccdd']) for x in range(15)],
                       'B': [random.choice(['2019-11-30', '2020-06-30', '2018-12-31', '2019-03-31']) for x in range(15)]})
sample.loc[:, 'B'] = pd.to_datetime(sample['B'])
for col in sample.select_dtypes(include='object').columns.tolist():
    sample.loc[:, col].astype(str).apply(lambda x: str(x).strip().lower()).fillna('NULL')
for col in sample.columns:
    print(sample[col].value_counts().head(15))
    print('\n')
Here neither 'NULL' nor 'nan' appears. I added .replace('nan','NULL'), but still nothing. Can you give me a clue what to look for, please? Many thanks.
The problem here is that the missing values were converted to strings, so fillna cannot work. The solution is to use the pandas functions Series.str.strip and Series.str.lower, which handle missing values nicely:
for col in sample.select_dtypes(include='object').columns:
    sample[col] = sample[col].str.strip().str.lower().fillna('NULL')
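The difference is easy to see on a tiny series: .astype(str) turns NaN into the literal string 'nan', while the .str accessor propagates the missing value so a later fillna() still finds it. A small demonstration:

```python
import numpy as np
import pandas as pd

s = pd.Series([' AaBB ', np.nan])

# astype(str) converts NaN into the literal string 'nan',
# so the later fillna() finds nothing left to fill.
print(s.astype(str).fillna('NULL').tolist())   # [' AaBB ', 'nan']

# The .str accessor propagates NaN, so fillna() still applies.
print(s.str.strip().str.lower().fillna('NULL').tolist())  # ['aabb', 'NULL']
```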

NaN does not exist in the data, but it still gives me the error that 'float' is not iterable

I have to split a column of a pandas DataFrame on '|', but it gives me the error that float is not iterable. There is no NaN in the data.
I have split the data and saved it in a list, and traversing that list gives me the same error.
You can try to troubleshoot with try/except statements. Here is an example that shows you which elements of your list are causing a problem.
L = ['Crime', 'Drama', None, 'Fiction']
L2 = []
for i in L:
    try:
        for j in i:
            L2.append(j)
    except TypeError:
        print('object is not iterable: {0}'.format(i))
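At the pandas level, the same diagnosis and fix can be sketched directly on the column (the 'genres' column name and data here are hypothetical): non-string rows can be located with isinstance, and Series.str.split skips missing values instead of raising.

```python
import numpy as np
import pandas as pd

# Hypothetical column; a float (NaN) hiding among the strings is the
# usual cause of "argument of type 'float' is not iterable".
df = pd.DataFrame({'genres': ['Crime|Drama', np.nan, 'Fiction']})

# Locate the offending rows: anything that is not a string.
bad = df[~df['genres'].apply(lambda x: isinstance(x, str))]
print(bad.index.tolist())  # [1]

# The .str accessor leaves missing values as NaN instead of raising.
df['genres_split'] = df['genres'].str.split('|')
print(df['genres_split'].tolist())
```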

string indices must be integers - Python 3.6

I am learning Python and doing some practice on the Titanic statistics data. The file can be found here. While running this simple code I always get the error message 'string indices must be integers'.
I want to find the total number of unique data entries in my .csv file according to 'PassengerId'. When I checked my test_data variable it has 'PassengerId' in it, but I still get the error. How can I solve it?
import pandas as pd

titanic_df = pd.read_csv("file.csv")
unique_number_df = set()
for test_data in titanic_df:
    unique_number_df.add(test_data['PassengerId'])
print(len(unique_number_df))
Iterating over a DataFrame yields its column names, which are strings, so test_data['PassengerId'] tries to index into a string. Iterate over the column instead:
titanic_df = pd.read_csv("titanic_data.csv")
unique_number_df = set()
for test_data in titanic_df["PassengerId"]:  # here you should pass the column name
    unique_number_df.add(test_data)
print(len(unique_number_df))
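As an aside, pandas can count distinct values directly with Series.nunique(), so the loop and set are not needed. A sketch with a small in-memory stand-in for the Titanic file:

```python
import io

import pandas as pd

# Minimal stand-in for the Titanic CSV (hypothetical rows).
csv = io.StringIO("PassengerId,Name\n1,A\n2,B\n2,B\n3,C\n")
titanic_df = pd.read_csv(csv)

# nunique() counts distinct values directly; no loop or set needed.
print(titanic_df['PassengerId'].nunique())  # 3
```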

python, loading a string from file

I'm trying to load a .txt file into my python project using numpy:
import numpy as np
import sys
g = np.loadtxt(sys.argv[1])
this command has worked for me when .txt file was a 0/1 matrix, but not
working now as it is a string matrix (4*7 table of words like "crew")
error says "cant convert string to float".. any help?
Take a look at the dtype parameter. (here)
dtype : data-type, optional
Data-type of the resulting array; default: float. If this is a structured data-type, the resulting array will be 1-dimensional, and each row will be interpreted as an element of the array. In this case, the number of columns used must match the number of fields in the data-type.
The default is float, which results in the error you are pointing out in your question.
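If you want to stay in numpy, a minimal sketch: pass dtype=str so loadtxt keeps each token as a string (the in-memory table below stands in for the .txt file).

```python
import io

import numpy as np

# A small word table standing in for the .txt file.
data = io.StringIO("crew deck mast\nsail rope helm\n")

# dtype=str makes loadtxt keep each token as a string instead of
# attempting the default float conversion that raised the error.
g = np.loadtxt(data, dtype=str)
print(g.shape)   # (2, 3)
print(g[0, 0])   # crew
```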
One option is using pandas:
import numpy as np
import pandas as pd
arr = pd.read_table(filename, sep=" ", header=None).values
(Assuming the separator is whitespace and there is no header row. Specify otherwise.)
