I am working in a Jupyter notebook with Python 3. I am trying to load an .asc file containing 2 columns using lin_data1 = np.genfromtxt(outdir+"/test_22/mp_harmonic_im_r8.00.ph.asc"), but I am getting the following error.
ValueError Traceback (most recent call last)
<ipython-input-152-5d1a4cbeab20> in <module>
1 # format of the path: SIMULATION-NAME/output-NNNN/PARFILE-NAME
2
----> 3 lin_data1 = np.genfromtxt(outdir+"/test_22/mp_harmonic_im_r8.00.ph.asc")
~/.local/lib/python3.8/site-packages/numpy/lib/npyio.py in genfromtxt(fname, dtype, comments, delimiter, skip_header, skip_footer, converters, missing_values, filling_values, usecols, names, excludelist, deletechars, replace_space, autostrip, case_sensitive, defaultfmt, unpack, usemask, loose, invalid_raise, max_rows, encoding)
2078 # Raise an exception ?
2079 if invalid_raise:
-> 2080 raise ValueError(errmsg)
2081 # Issue a warning ?
2082 else:
ValueError: Some errors were detected !
Line #2 (got 2 columns instead of 3)
Line #3 (got 2 columns instead of 3)
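The error pattern above (later lines having fewer columns than the first line read) means genfromtxt inferred 3 columns from the first non-comment line and then rejected the 2-column rows. A minimal sketch of two possible workarounds, using hypothetical inline data rather than the real .asc file:

```python
import io
import numpy as np

# Hypothetical data: the first line has 3 columns, so genfromtxt
# expects 3 columns everywhere and raises on the 2-column lines.
text = "0.0 1.0 2.0\n0.1 1.1\n0.2 1.2\n"

# Option 1: skip the offending first line entirely.
data = np.genfromtxt(io.StringIO(text), skip_header=1)
print(data.shape)  # (2, 2)

# Option 2: read only the first two columns of every line.
first_two = np.genfromtxt(io.StringIO(text), usecols=(0, 1))
print(first_two.shape)  # (3, 2)
```

Which option is right depends on whether the extra-column line is a header to discard or data to keep.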
An example row of the dataframe:

Findings:           Lung bases: No pulmonary nodules or evidence of pneumonia
Impression:         No findings on the current CT to account for the patient's clinical complaint of abdominal pain.
File_name_Location: /home/text_file/p123456.txt
I have a pandas dataframe with 3 columns (from chest X-ray reports): "Findings", "Impression" and "File_name_Location" (the last holding the path of the text file). I also have separate directories (folders) of chest X-ray images that I have to crawl through to find the image whose file name matches the text file name in "File_name_Location" (there are more image files in the directories than rows in my text dataframe), and to put the matching image path in the same row of the dataframe.
I need code to solve this.
An example image file path is as below:
/home/files/f1/images/i123456.jpg
There are folders f1 to f25, each containing hundreds of .jpg files.
Update: Corralien's code raised an exception:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/miniconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3803, in Index.get_loc(self, key, method, tolerance)
3802 try:
-> 3803 return self._engine.get_loc(casted_key)
3804 except KeyError as err:
File ~/miniconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()
File ~/miniconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'File_name_Location'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[79], line 9
6 file = f"{img.stem[1:]}.txt"
7 images[file] = str(img)
----> 9 df['Image_name_Location']=df['File_name_Location'].str.split('/').str[-1].map(images)
File ~/miniconda3/lib/python3.9/site-packages/pandas/core/frame.py:3805, in DataFrame.__getitem__(self, key)
3803 if self.columns.nlevels > 1:
3804 return self._getitem_multilevel(key)
-> 3805 indexer = self.columns.get_loc(key)
3806 if is_integer(indexer):
3807 indexer = [indexer]
File ~/miniconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key, method, tolerance)
3803 return self._engine.get_loc(casted_key)
3804 except KeyError as err:
-> 3805 raise KeyError(key) from err
3806 except TypeError:
3807 # If we have a listlike key, _check_indexing_error will raise
3808 # InvalidIndexError. Otherwise we fall through and re-raise
3809 # the TypeError.
3810 self._check_indexing_error(key)
KeyError: 'File_name_Location'
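A KeyError like the one above just means the dataframe has no column literally named 'File_name_Location'. One way to check the actual spelling before running the mapping line, as a sketch with a made-up dataframe (the column name 'file_Name' here is hypothetical):

```python
import pandas as pd

# Hypothetical dataframe whose column is spelled differently from the
# name used in the failing line.
df = pd.DataFrame({"file_Name": ["/home/text_file/p123456.txt"]})

print(df.columns.tolist())  # shows the real spelling: ['file_Name']

# Rename to the expected name before running the mapping code.
df = df.rename(columns={"file_Name": "File_name_Location"})
print("File_name_Location" in df.columns)  # True
```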
IIUC, there is a relation between text and image files: p123456.txt -> f??/images/i123456.jpg.
You can use the following code:
import pathlib

# create an index of your images with the above relation
images = {}
for img in pathlib.Path('/home/files').glob('f*/images/*.jpg'):
    file = f"p{img.stem[1:]}.txt"
    images[file] = str(img)

df['Image_name_Location'] = df['File_name_Location'].str.split('/').str[-1].map(images)
Output:
>>> df
File_name_Location Image_name_Location
0 /home/text_file/p123456.txt /home/files/f1/images/i123456.jpg
1 /home/text_file/p987654.txt /home/files/f22/images/i987654.jpg
I have a pandas dataframe with strings. I am trying to use Python's set operation with numba to get the unique characters in the column of strings in the dataframe. Since numba does not recognize pandas dataframes, I need to convert the string column to a numpy array. However, once converted, the column shows dtype object. Is there a way to convert the pandas dataframe column of strings to a normal array (not an object array)?
Please find my code below for your understanding.
z = train.head(2).sentence.values  # train is a pandas DataFrame
z
Output:
array(["Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",
"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)"],
dtype=object)
Python Numba code:
from numba import njit

@njit
def set_(z):
    x = set(z.sum())
    return x

set_(z)
Output:
---------------------------------------------------------------------------
TypingError Traceback (most recent call last)
<ipython-input-51-9d5bc17d106b> in <module>()
----> 1 set_(z)
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/dispatcher.py in _compile_for_args(self, *args, **kws)
342 raise e
343 else:
--> 344 reraise(type(e), e, None)
345 except errors.UnsupportedError as e:
346 # Something unsupported is present in the user code, add help info
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/six.py in reraise(tp, value, tb)
656 value = tp()
657 if value.__traceback__ is not tb:
--> 658 raise value.with_traceback(tb)
659 raise value
660
TypingError: Failed at nopython (nopython frontend)
Internal error at <numba.typeinfer.ArgConstraint object at 0x7fbe66c01a58>:
--%<----------------------------------------------------------------------------
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/errors.py", line 491, in new_error_context
yield
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/typeinfer.py", line 194, in __call__
assert ty.is_precise()
AssertionError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/typeinfer.py", line 138, in propagate
constraint(typeinfer)
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/typeinfer.py", line 195, in __call__
typeinfer.add_type(self.dst, ty, loc=self.loc)
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/errors.py", line 499, in new_error_context
six.reraise(type(newerr), newerr, tb)
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/six.py", line 659, in reraise
raise value
numba.errors.InternalError:
[1] During: typing of argument at <ipython-input-50-566e4e12481d> (3)
--%<----------------------------------------------------------------------------
File "<ipython-input-50-566e4e12481d>", line 3:
def set_(z):
x = set(z.sum())
^
This error may have been caused by the following argument(s):
- argument 0: Unsupported array dtype: object
This is not usually a problem with Numba itself but instead often caused by
the use of unsupported features or an issue in resolving types.
To see Python/NumPy features supported by the latest release of Numba visit:
http://numba.pydata.org/numba-doc/dev/reference/pysupported.html
and
http://numba.pydata.org/numba-doc/dev/reference/numpysupported.html
For more information about typing errors and how to debug them visit:
http://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile
If you think your code should work with Numba, please report the error message
and traceback, along with a minimal reproducer at:
https://github.com/numba/numba/issues/new
Would anyone be able to help me in this regard?
Thanks and best regards,
Michael
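One way to get a non-object array from a pandas string column, under the assumption that a fixed-width unicode dtype is acceptable, is astype(str); a minimal sketch with made-up data (note that even then, older numba releases may not support unicode arrays, so the unique-character set may be easier to compute without numba):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real train dataframe.
train = pd.DataFrame({"sentence": ["hello world", "abc"]})

z = train["sentence"].values  # dtype=object, which numba rejects
u = z.astype(str)             # fixed-width unicode, dtype '<U11'
print(z.dtype, u.dtype)

# The unique characters can also be computed in plain Python:
unique_chars = set("".join(train["sentence"]))
```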
I've managed to solve many problems using Stack Overflow, but this is the first time I have a question that I can't find anywhere else and can't solve on my own...
I'm working in a Jupyter notebook with a pandas dataframe containing text reviews and scores for Amazon products. Below is my code:
import pandas as pd
data = pd.read_csv("AmazonSampleForStudentOffice.csv")
reviews = data[['reviewText', 'score', 'len_text']]
reviews.head(5)
This is the result:
reviewText score len_text
0 Wow! Do I consider myself lucky! I got this CX... 5 274
1 The Optima 45 Electric Stapler has a sleek mod... 5 108
2 This tape does just what it's supposed to.And ... 5 18
3 It is rare that I look for a more expensive pr... 5 104
4 I know of no printer that makes such great pri... 5 34
and slicing the dataframe works fine:
reviews[0:2]
reviewText score len_text
0 Wow! Do I consider myself lucky! I got this CX... 5 274
1 The Optima 45 Electric Stapler has a sleek mod... 5 108
However, if I want to select a single row, jupyter throws a Key error on the selected index:
reviews[0]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
c:\users\robin\appdata\local\programs\python\python38-32\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2896 try:
-> 2897 return self._engine.get_loc(key)
2898 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-7-a635d1333a53> in <module>
----> 1 reviews[0]
c:\users\robin\appdata\local\programs\python\python38-32\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2993 if self.columns.nlevels > 1:
2994 return self._getitem_multilevel(key)
-> 2995 indexer = self.columns.get_loc(key)
2996 if is_integer(indexer):
2997 indexer = [indexer]
c:\users\robin\appdata\local\programs\python\python38-32\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2897 return self._engine.get_loc(key)
2898 except KeyError:
-> 2899 return self._engine.get_loc(self._maybe_cast_indexer(key))
2900 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2901 if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
Does anyone know what could be causing this problem? I find it very strange that slicing works fine, but selecting a single index throws an error...
As you can see, I tried different methods to select certain rows from the dataframe and they all work fine. I've also tried to reinstall pandas and jupyter notebook, but it still throws the error...
Thanks in advance!
The indexing operator alone, as in reviews[...], only works to select rows by a boolean expression or a slice - e.g. reviews[:2] (your 0 is redundant) - or to select columns, as in reviews['score']. If you want to index by position, you need the .iloc attribute, as in reviews.iloc[0, :], which gives you the first row only, but all the columns.
If you want to learn about pandas indexing, focus on the .loc and .iloc attributes, which both work in 2 dimensions. The indexing operator alone can only be used to select in 1 dimension and with quite some restrictions.
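A small sketch of the three access styles described above, using a made-up frame shaped like the reviews example:

```python
import pandas as pd

# Made-up frame mirroring the reviews example.
reviews = pd.DataFrame({"reviewText": ["Wow!", "Sleek."],
                        "score": [5, 5],
                        "len_text": [274, 108]})

print(reviews[:2])        # row slice: works
print(reviews["score"])   # column by label: works
print(reviews.iloc[0])    # first row by position: works
# reviews[0]              # KeyError: 0 is looked up as a column label
```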
I am using pandas for the first time and wanted to do some data analysis on the MovieLens dataset. After running my code on the 1M dataset, I wanted to experiment with MovieLens 20M.
I am only reading one file, i.e. ratings.csv.
However, I faced multiple problems with the 20M dataset, and after spending much time I realized that this was because the dtypes of the columns being read were not as expected.
The following code (where the path is the path of the ratings.csv file)
import pandas as pd
import numpy as np

df = pd.read_csv('../data/ml-20m/ratings.csv', sep=',',
                 names=['userId', 'movieId', 'rating', 'timestamp'],
                 engine='python',
                 dtype={'userId': np.int32, 'movieId': np.int32,
                        'rating': np.float64, 'timestamp': np.int64},
                 skipinitialspace=True, error_bad_lines=False)
is giving me the following error:
Traceback (most recent call last):
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1663, in _cast_types
    values = astype_nansafe(values, cast_type, copy=True)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/core/dtypes/cast.py", line 709, in astype_nansafe
    return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
  File "pandas/_libs/lib.pyx", line 456, in pandas._libs.lib.astype_intsafe
  File "pandas/_libs/src/util.pxd", line 142, in util.set_value_at_unsafe
ValueError: invalid literal for int() with base 10: 'movieId'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    df = pd.read_csv('../data/ml-20m/ratings.csv', sep=',', names=['userId','movieId','rating','timestamp'], engine='python', dtype={'userId':np.int32, 'movieId':np.int32, 'rating':np.float64, 'timestamp':np.int64}, skipinitialspace=True, error_bad_lines=False)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 2272, in read
    data = self._convert_data(data)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 2338, in _convert_data
    clean_conv, clean_dtypes)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1574, in _convert_to_ndarrays
    cvals = self._cast_types(cvals, cast_type, c)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1666, in _cast_types
    "type %s" % (column, cast_type))
ValueError: Unable to convert column movieId to type
Basically I want to skip all those lines whose values don't conform to the dtype dictionary
{'userId': np.int32, 'movieId': np.int32, 'rating': np.float64, 'timestamp': np.int64}
If I don't give the dtype argument to read_csv, then all four columns turn out to be of type "object", which is not what I want.
I searched on Google and found no one else facing this problem. Can you help me?
I am using Python 3.
The problem is that you define column names, but the csv has a header, so the first row of the DataFrame is the same as the column names, and all rows are converted to strings:
df = pd.read_csv('ratings.csv',
names= ['userId','movieId','rating','timestamp'])
print (df.head())
userId movieId rating timestamp
0 user_id movie_id rating timestamp
1 1 1193 5 978300760
2 1 661 3 978302109
3 1 914 3 978301968
4 1 3408 4 978300275
The solution is to use the parameter skiprows=1 or header=0, so that the names parameter renames the columns:
df = pd.read_csv('ratings.csv',
dtype= {'userId':np.int32,
'movieId':np.int32,
'rating':np.float64,
'timestamp':np.int64},
header=0, #skiprows=1
names= ['userId','movieId','rating','timestamp'])
print (df.head())
userId movieId rating timestamp
0 1 1193 5.0 978300760
1 1 661 3.0 978302109
2 1 914 3.0 978301968
3 1 3408 4.0 978300275
4 1 2355 5.0 978824291
If you don't want to rename the column names:
df = pd.read_csv('ratings.csv',
dtype= {'userId':np.int32,
'movieId':np.int32,
'rating':np.float64,
'timestamp':np.int64})
print (df.head())
user_id movie_id rating timestamp
0 1 1193 5.0 978300760
1 1 661 3.0 978302109
2 1 914 3.0 978301968
3 1 3408 4.0 978300275
4 1 2355 5.0 978824291
So, I started on a new toy project and decided I'd use Python 3 for the first time...
In [1]: import plistlib
In [2]: with open("/Volumes/Thunderbay/CURRENT/Music/iTunes/iTunes Library.xml") as itl:
library = plistlib.load(itl)
...:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-6459a022cb71> in <module>()
1 with open("/Volumes/Thunderbay/CURRENT/Music/iTunes/iTunes Library.xml") as itl:
----> 2 library = plistlib.load(itl)
3
/usr/local/Cellar/python3/3.4.3_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/plistlib.py in load(fp, fmt, use_builtin_types, dict_type)
984 fp.seek(0)
985 for info in _FORMATS.values():
--> 986 if info['detect'](header):
987 P = info['parser']
988 break
/usr/local/Cellar/python3/3.4.3_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/plistlib.py in _is_fmt_xml(header)
556
557 for pfx in prefixes:
--> 558 if header.startswith(pfx):
559 return True
560
TypeError: startswith first arg must be str or a tuple of str, not bytes
hmm ok, let's give it a hint:
In [3]: with open("/Volumes/Thunderbay/CURRENT/Music/iTunes/iTunes Library.xml") as itl:
library = plistlib.load(itl, fmt=plistlib.FMT_XML)
...:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-ef5f06b44ec2> in <module>()
1 with open("/Volumes/Thunderbay/CURRENT/Music/iTunes/iTunes Library.xml") as itl:
----> 2 library = plistlib.load(itl, fmt=plistlib.FMT_XML)
3
/usr/local/Cellar/python3/3.4.3_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/plistlib.py in load(fp, fmt, use_builtin_types, dict_type)
995
996 p = P(use_builtin_types=use_builtin_types, dict_type=dict_type)
--> 997 return p.parse(fp)
998
999
/usr/local/Cellar/python3/3.4.3_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/plistlib.py in parse(self, fileobj)
323 self.parser.EndElementHandler = self.handle_end_element
324 self.parser.CharacterDataHandler = self.handle_data
--> 325 self.parser.ParseFile(fileobj)
326 return self.root
327
TypeError: read() did not return a bytes object (type=str)
plistlib is in the standard library, but from the problems above I have the feeling it has not actually been converted to Python 3?
Anyway, my actual question: is it possible to open an XML plist file with plistlib in Python 3.4.3?
Surely I'm missing something obvious here... I just noticed the Py2 version of plistlib (which works!) has a different interface, so someone has actually modified the code of the library for inclusion with Py3...
Thanks to J Presper Eckert for giving me a clue about what to look for...
I then found this article:
http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html#the-binary-option
which suggests the answer is simply to open the file in binary mode, tried it and it works!
with open("/Volumes/Thunderbay/CURRENT/Music/iTunes/iTunes Library.xml", 'rb') as itl:
library = plistlib.load(itl)
I got the same error message using Python 3.4.3 against a .plist file (Mac property list, a configuration file in XML).
Try this (it worked for me):
1. Copy the plist/xml file to a new file.
2. Change the new file's extension to '.txt'.
3. Open newfile.txt in TextEdit (Mac), Notepad++ (Windows) or similar.
4. Save as a .txt file with UTF-8 encoding and plain text only.
Now, when you read newfile.txt, you won't see "startswith first arg must be str or a tuple of str, not bytes".
Cheers
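The manual steps above could also be scripted; a sketch under the assumption that the plist really is UTF-8 XML (the file names here are hypothetical, and a tiny inline XML snippet stands in for the real library):

```python
from pathlib import Path

# Hypothetical stand-in for the real plist (just a tiny XML snippet here).
src = Path("library.plist")
src.write_text("<?xml version='1.0' encoding='UTF-8'?>\n<plist/>",
               encoding="utf-8")

dst = Path("newfile.txt")

# Read the raw bytes and rewrite them as UTF-8 plain text, mirroring
# the copy / rename / re-save steps above.
dst.write_text(src.read_bytes().decode("utf-8"), encoding="utf-8")
print(dst.read_text(encoding="utf-8")[:5])  # <?xml
```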