NLP and Pandas data extraction - python-3.x

Findings: Lung bases: No pulmonary nodules or evidence of pneumonia
Impression: No findings on the current CT to account for the patient's clinical complaint of abdominal pain.
File_name_Location: /home/text_file/p123456.txt
I have a pandas dataframe with 3 columns taken from chest X-ray reports: "findings", "impression" and "file_Name" (which includes the directory information). I also have separate directories (folders) of chest X-ray images. I need to crawl through them to find the image file that matches each "file_Name" and put its path in the same row of the dataframe. The image directories contain more files than the dataframe has rows, so each image file name has to be matched against the text file name.
I need code to solve this.
An example image file path:
/home/files/f1/images/i123456.jpg
There are folders from f1 to f25, each containing hundreds of .jpg files.
Update: Corralien's code raised an exception:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/miniconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3803, in Index.get_loc(self, key, method, tolerance)
3802 try:
-> 3803 return self._engine.get_loc(casted_key)
3804 except KeyError as err:
File ~/miniconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()
File ~/miniconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'File_name_Location'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[79], line 9
6 file = f"{img.stem[1:]}.txt"
7 images[file] = str(img)
----> 9 df['Image_name_Location']=df['File_name_Location'].str.split('/').str[-1].map(images)
File ~/miniconda3/lib/python3.9/site-packages/pandas/core/frame.py:3805, in DataFrame.__getitem__(self, key)
3803 if self.columns.nlevels > 1:
3804 return self._getitem_multilevel(key)
-> 3805 indexer = self.columns.get_loc(key)
3806 if is_integer(indexer):
3807 indexer = [indexer]
File ~/miniconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key, method, tolerance)
3803 return self._engine.get_loc(casted_key)
3804 except KeyError as err:
-> 3805 raise KeyError(key) from err
3806 except TypeError:
3807 # If we have a listlike key, _check_indexing_error will raise
3808 # InvalidIndexError. Otherwise we fall through and re-raise
3809 # the TypeError.
3810 self._check_indexing_error(key)
KeyError: 'File_name_Location'

IIUC, there is a relation between text and image files: p123456.txt -> f??/images/i123456.jpg.
You can use the following code:
import pathlib

# create an index of your images with the above relation
images = {}
for img in pathlib.Path('/home/files').glob('f*/images/*.jpg'):
    file = f"p{img.stem[1:]}.txt"
    images[file] = str(img)

# look up each text file name (last path component) in the index
df['Image_name_Location'] = df['File_name_Location'].str.split('/').str[-1].map(images)
Output:
>>> df
            File_name_Location                 Image_name_Location
0  /home/text_file/p123456.txt   /home/files/f1/images/i123456.jpg
1   home/text_file/p987654.txt  /home/files/f22/images/i987654.jpg
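The traceback in the question shows that the dataframe has no column called exactly 'File_name_Location', so it may help to check df.columns first. The sketch below is only an illustration under stated assumptions: it builds a throwaway directory tree with made-up file names, wraps the indexing step in a hypothetical helper (build_image_index), and strips whitespace from the column headers, which is one possible cause of such a KeyError, not a confirmed diagnosis.

```python
import pathlib
import tempfile

import pandas as pd


def build_image_index(root):
    """Map 'p<ID>.txt' to the path of 'i<ID>.jpg' found under root/f*/images/."""
    index = {}
    for img in pathlib.Path(root).glob("f*/images/*.jpg"):
        index[f"p{img.stem[1:]}.txt"] = str(img)
    return index


# Demo on a temporary directory tree (all file names here are made up).
with tempfile.TemporaryDirectory() as root:
    for folder, name in [("f1", "i123456.jpg"), ("f22", "i987654.jpg"), ("f3", "i000001.jpg")]:
        image_dir = pathlib.Path(root, folder, "images")
        image_dir.mkdir(parents=True)
        (image_dir / name).touch()

    # The header deliberately carries stray spaces to reproduce a KeyError-prone frame.
    df = pd.DataFrame({" File_name_Location ": [
        "/home/text_file/p123456.txt",
        "/home/text_file/p987654.txt",
        "/home/text_file/p555555.txt",  # no matching image on purpose
    ]})
    print(list(df.columns))              # inspect the real column names first
    df.columns = df.columns.str.strip()  # assumption: whitespace was the problem

    images = build_image_index(root)
    text_names = df["File_name_Location"].str.split("/").str[-1]
    df["Image_name_Location"] = text_names.map(images)

print(df)
```

Rows without a matching image get NaN in the new column, and images without a matching text file are simply never looked up.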

Related

Odd pandas date slicing behavior (doesn't slice day)

I could be missing something here but I believe that there is something odd going on with pandas datetime slicing. Here is a reproducible example:
import pandas as pd
import pandas_datareader as pdr
testdf = pdr.DataReader('SPY', 'yahoo')
testdf.index = pd.to_datetime(testdf.index)
testdf['2020-11']
Here we can see that slicing to find the month's data returns the expected output.
However, now let's try to find the row corresponding to Nov 9 2020.
testdf['2020-11-09']
And we get the following traceback.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
C:\Anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2894 try:
-> 2895 return self._engine.get_loc(casted_key)
2896 except KeyError as err:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: '2020-11-09'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-78-a42a45b5c3a4> in <module>
----> 1 testdf['2020-11-09']
C:\Anaconda\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2900 if self.columns.nlevels > 1:
2901 return self._getitem_multilevel(key)
-> 2902 indexer = self.columns.get_loc(key)
2903 if is_integer(indexer):
2904 indexer = [indexer]
C:\Anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2895 return self._engine.get_loc(casted_key)
2896 except KeyError as err:
-> 2897 raise KeyError(key) from err
2898
2899 if tolerance is not None:
KeyError: '2020-11-09'
Here we can see that the key is in fact in the index:
testdf['2020-11'].index
DatetimeIndex(['2020-11-02', '2020-11-03', '2020-11-04', '2020-11-05',
'2020-11-06', '2020-11-09'],
dtype='datetime64[ns]', name='Date', freq=None)
Is this a bug or am I a bug?
testdf['2020-11-09'] selects column-wise, i.e. it looks for '2020-11-09' among the columns. Did you mean:
testdf.loc['2020-11-09']
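To make the difference concrete, here is a small sketch on a toy dataframe (the values are placeholders, not real SPY prices). It uses .loc for both the month and the day lookup, since row selection through plain [] with a date string is not something to rely on across pandas versions.

```python
import pandas as pd

# Toy frame with a DatetimeIndex; the numbers are placeholders.
dates = pd.to_datetime(["2020-11-02", "2020-11-03", "2020-11-09", "2020-12-01"])
prices = pd.DataFrame({"Close": [1.0, 2.0, 3.0, 4.0]}, index=dates)

# Plain [] with a single string is treated as a column lookup, so it fails here.
try:
    prices["2020-11-09"]
    column_lookup_failed = False
except KeyError:
    column_lookup_failed = True
print(column_lookup_failed)

# .loc looks the label up in the row index instead.
one_day = prices.loc["2020-11-09"]   # the single matching row
november = prices.loc["2020-11"]     # partial string: every row in November 2020
print(one_day)
print(november)
```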

genfromtxt is not loading 2D asc file

I am working in a Jupyter notebook using Python 3. I am trying to load an asc file containing 2 columns with lin_data1 = np.genfromtxt(outdir+"/test_22/mp_harmonic_im_r8.00.ph.asc"), but I am getting the following error.
'ValueError Traceback (most recent call last)
<ipython-input-152-5d1a4cbeab20> in <module>
1 # format of the path: SIMULATION-NAME/output-NNNN/PARFILE-NAME
2
----> 3 lin_data1 = np.genfromtxt(outdir+"/test_22/mp_harmonic_im_r8.00.ph.asc")
~/.local/lib/python3.8/site-packages/numpy/lib/npyio.py in genfromtxt(fname, dtype, comments, delimiter, skip_header, skip_footer, converters, missing_values, filling_values, usecols, names, excludelist, deletechars, replace_space, autostrip, case_sensitive, defaultfmt, unpack, usemask, loose, invalid_raise, max_rows, encoding)
2078 # Raise an exception ?
2079 if invalid_raise:
-> 2080 raise ValueError(errmsg)
2081 # Issue a warning ?
2082 else:
ValueError: Some errors were detected !
Line #2 (got 2 columns instead of 3)
Line #3 (got 2 columns instead of 3
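The message says that genfromtxt counted 3 columns on the first line and then found only 2 on the following lines. One possible explanation, and it is only an assumption because the file itself is not shown, is a first line with three whitespace-separated tokens (for example a header) above two-column data. The sketch below reproduces that situation with made-up contents and skips the first line with skip_header.

```python
import io

import numpy as np

# Hypothetical file contents: a three-token first line, then two-column data.
text = "time amplitude phase\n0.0 1.5\n0.1 1.7\n0.2 1.9\n"

# Default call: the column count is taken from the first line, so the rest mismatches.
try:
    np.genfromtxt(io.StringIO(text))
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Skipping the first line lets the two-column rows define the shape.
data = np.genfromtxt(io.StringIO(text), skip_header=1)
print(data.shape)
```

If the real file differs, printing its first few lines before choosing skip_header, comments or delimiter is the safer first step.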

Python Numba - Convert DataFrame series object to numpy array

I have a pandas dataframe with a column of strings, and I am trying to use a set operation with Numba to get the unique characters in that column. Since Numba does not recognize pandas dataframes, I need to convert the string column to a NumPy array. However, once converted, the array has dtype object. Is there a way to convert the pandas dataframe column of strings to a normal array (not an object array)?
Please find the code for your understanding.
z = train.head(2).sentence.values #Train is a pandas DataFrame
z
Output:
array(["Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",
"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)"],
dtype=object)
Python Numba code:
from numba import njit

@njit
def set_(z):
    x = set(z.sum())
    return x

set_(z)
Output:
---------------------------------------------------------------------------
TypingError Traceback (most recent call last)
<ipython-input-51-9d5bc17d106b> in <module>()
----> 1 set_(z)
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/dispatcher.py in _compile_for_args(self, *args, **kws)
342 raise e
343 else:
--> 344 reraise(type(e), e, None)
345 except errors.UnsupportedError as e:
346 # Something unsupported is present in the user code, add help info
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/six.py in reraise(tp, value, tb)
656 value = tp()
657 if value.__traceback__ is not tb:
--> 658 raise value.with_traceback(tb)
659 raise value
660
TypingError: Failed at nopython (nopython frontend)
Internal error at <numba.typeinfer.ArgConstraint object at 0x7fbe66c01a58>:
--%<----------------------------------------------------------------------------
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/errors.py", line 491, in new_error_context
yield
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/typeinfer.py", line 194, in __call__
assert ty.is_precise()
AssertionError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/typeinfer.py", line 138, in propagate
constraint(typeinfer)
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/typeinfer.py", line 195, in __call__
typeinfer.add_type(self.dst, ty, loc=self.loc)
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/errors.py", line 499, in new_error_context
six.reraise(type(newerr), newerr, tb)
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/six.py", line 659, in reraise
raise value
numba.errors.InternalError:
[1] During: typing of argument at <ipython-input-50-566e4e12481d> (3)
--%<----------------------------------------------------------------------------
File "<ipython-input-50-566e4e12481d>", line 3:
def set_(z):
x = set(z.sum())
^
This error may have been caused by the following argument(s):
- argument 0: Unsupported array dtype: object
This is not usually a problem with Numba itself but instead often caused by
the use of unsupported features or an issue in resolving types.
To see Python/NumPy features supported by the latest release of Numba visit:
http://numba.pydata.org/numba-doc/dev/reference/pysupported.html
and
http://numba.pydata.org/numba-doc/dev/reference/numpysupported.html
For more information about typing errors and how to debug them visit:
http://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile
If you think your code should work with Numba, please report the error message
and traceback, along with a minimal reproducer at:
https://github.com/numba/numba/issues/new
Would anyone be able to help me in this regard?
Thanks & Best Regards
Michael
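On the narrow question of getting a non-object array: NumPy can cast an object array of Python strings to a fixed-width unicode dtype with astype(str). The sketch below shows that cast on a toy stand-in for the train dataframe and computes the unique characters in plain Python. Whether a given Numba version can then work with such an array inside a jitted function is not verified here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `train` dataframe from the question.
train = pd.DataFrame({"sentence": ["Explanation", "D'aww!"]})

z_object = train["sentence"].to_numpy(dtype=object)  # array of Python str objects
z_unicode = z_object.astype(str)                     # fixed-width unicode array

print(z_object.dtype, z_unicode.dtype)

# Unique characters across all strings, computed without Numba.
unique_chars = set("".join(z_unicode))
print(sorted(unique_chars))
```

Note that a fixed-width unicode array pads every entry to the length of the longest string, so memory use can grow a lot when sentence lengths vary widely.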

Python Key error when selecting single row from pandas dataframe in jupyter notebook

I've managed to solve many problems using StackOverflow, but this is the first time I got a question I can't find anywhere else and can't solve on my own...
I'm working in jupyter notebook with a pandas dataframe, containing text reviews and scores for amazon products. Below is my code:
import pandas as pd
data = pd.read_csv("AmazonSampleForStudentOffice.csv")
reviews = data[['reviewText', 'score', 'len_text']]
reviews.head(5)
This is the result:
reviewText score len_text
0 Wow! Do I consider myself lucky! I got this CX... 5 274
1 The Optima 45 Electric Stapler has a sleek mod... 5 108
2 This tape does just what it's supposed to.And ... 5 18
3 It is rare that I look for a more expensive pr... 5 104
4 I know of no printer that makes such great pri... 5 34
and slicing the dataframe works fine:
reviews[0:2]
reviewText score len_text
0 Wow! Do I consider myself lucky! I got this CX... 5 274
1 The Optima 45 Electric Stapler has a sleek mod... 5 108
However, if I want to select a single row, jupyter throws a Key error on the selected index:
reviews[0]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
c:\users\robin\appdata\local\programs\python\python38-32\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2896 try:
-> 2897 return self._engine.get_loc(key)
2898 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-7-a635d1333a53> in <module>
----> 1 reviews[0]
c:\users\robin\appdata\local\programs\python\python38-32\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2993 if self.columns.nlevels > 1:
2994 return self._getitem_multilevel(key)
-> 2995 indexer = self.columns.get_loc(key)
2996 if is_integer(indexer):
2997 indexer = [indexer]
c:\users\robin\appdata\local\programs\python\python38-32\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2897 return self._engine.get_loc(key)
2898 except KeyError:
-> 2899 return self._engine.get_loc(self._maybe_cast_indexer(key))
2900 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2901 if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
Does anyone know what could be causing this problem? I find it very strange that slicing works fine, but selecting a single index throws an error...
As you can see, I tried different methods to select certain rows from the dataframe and they all work fine. I've also tried to reinstall pandas and jupyter notebook, but it still throws the error...
Thanks in advance!
The indexing operator alone, as in reviews[], only selects rows when it is given a slice such as reviews[:2] (the leading 0 in your slice is redundant) or a boolean mask. Given a single label, it selects a column, as in reviews['score']. Since 0 is not a column name, reviews[0] raises a KeyError. If you want to index by position, you need the .iloc attribute, as in reviews.iloc[0, :], which gives you the first row only, but all the columns.
If you want to learn about pandas indexing, focus on the .loc and .iloc attributes, which both work in 2 dimensions. The indexing operator alone can only be used to select in 1 dimension and with quite some restrictions.
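The rules above can be sketched on a toy version of the reviews dataframe (the rows are placeholders, not the original data):

```python
import pandas as pd

# Toy stand-in for the reviews dataframe.
reviews = pd.DataFrame({
    "reviewText": ["first review", "second review", "third review"],
    "score": [5, 4, 3],
    "len_text": [274, 108, 18],
})

first_two = reviews[0:2]                 # slice: rows by position
scores = reviews["score"]                # single label: a column
good = reviews[reviews["score"] >= 4]    # boolean mask: matching rows

first_row = reviews.iloc[0]              # first row by position, as a Series
same_row = reviews.loc[0]                # row whose index label is 0
cell = reviews.iloc[0, 1]                # row 0, column 1 ("score")

# reviews[0] would look for a column named 0 and raise KeyError.
try:
    reviews[0]
    raised = False
except KeyError:
    raised = True
print(raised)
```

Here .iloc[0] and .loc[0] return the same row only because the default index labels happen to match the positions. After filtering or sorting they can differ.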

The `start` argument could not be matched to a location related to the index of the data

I don't know why my 'start' pred won't work. I added some edits to pd.to_datetime but they didn't work.
This is my code:
pred = results.get_prediction(start=pd.to_datetime('2018-06-01'), dynamic=False)
pred_ci = pred.conf_int()
ax = y['2015':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(14, 4))
ax.fill_between(pred_ci.index,
pred_ci.iloc[:, 0],
pred_ci.iloc[:, 1], color='k', alpha=.2)
ax.set_xlabel('Date')
ax.set_ylabel('Retail_sold')
plt.legend()
plt.show()
The error log always points to my time format. Before starting the analysis I had to resample my data from daily to monthly, but I don't know why my data can't be read using pd.to_datetime.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
pandas/_libs/index.pyx in pandas._libs.index.DatetimeEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 1546300800000000000
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2896 try:
-> 2897 return self._engine.get_loc(key)
2898 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.DatetimeEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.DatetimeEngine.get_loc()
KeyError: Timestamp('2019-01-01 00:00:00')
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
pandas/_libs/index.pyx in pandas._libs.index.DatetimeEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 1546300800000000000
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
11 frames
pandas/_libs/index.pyx in pandas._libs.index.DatetimeEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.DatetimeEngine.get_loc()
KeyError: Timestamp('2019-01-01 00:00:00')
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
pandas/_libs/index.pyx in pandas._libs.index.DatetimeEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.DatetimeEngine.get_loc()
KeyError: Timestamp('2019-01-01 00:00:00')
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/statsmodels/tsa/base/tsa_model.py in _get_prediction_index(self, start, end, index, silent)
522 start, start_index, start_oos = self._get_index_label_loc(start)
523 except KeyError:
--> 524 raise KeyError('The `start` argument could not be matched to a'
525 ' location related to the index of the data.')
526 if end is None:
KeyError: 'The `start` argument could not be matched to a location related to the index of the data.'
I used Google Colab and Python 3.7.
Does anyone have the solution of my problem?
The underlying problem here is that your data doesn't have an index with an associated frequency, because your data skips days (for example going from 2016/2/5 to 2016/2/14).
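As a pandas-only illustration of this point (statsmodels is deliberately left out, and the series below is synthetic), the sketch shows how a gap leaves a DatetimeIndex without a frequency and how resampling to month start produces a regular index in which a date like 2018-06-01 can be located. It assumes the model would then be fitted on the regular series. That step is not shown.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a gap, loosely mimicking data that skips days.
days = pd.date_range("2018-01-01", "2018-08-31", freq="D")
keep = (days < pd.Timestamp("2018-02-06")) | (days > pd.Timestamp("2018-02-13"))
daily = pd.Series(np.arange(keep.sum(), dtype=float), index=days[keep])
print(daily.index.freq)  # no frequency is attached once days are missing

# Resampling to month start ("MS") yields a regular index with a frequency.
monthly = daily.resample("MS").mean()
print(monthly.index.freqstr)

# A start date can only be matched if it is an actual label of that index.
start = pd.Timestamp("2018-06-01")
start_is_in_index = start in monthly.index
start_position = monthly.index.get_loc(start)
print(start_is_in_index, start_position)
```

Checking y.index.freq and whether the chosen start date is really in y.index is a quick way to see which of the two conditions fails before calling get_prediction.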
In a similar case, my problem was that I was passing the data as a plain list and needed to convert it to a pandas Series:
data_to_predict = pd.Series(input_data, index=myIndex)
You should set the dataset's index to the time column, like this:
df["Time"] = pd.to_datetime(df['Time'], infer_datetime_format=True)
df = df.set_index(["Time"])
You can try the following, which passes integer positions instead of dates:
predictions = results.predict(start=train_data.shape[0], end=train_data.shape[0] + test_data.shape[0] - 1, dynamic=False)

Resources