Applying a function to columns of a pandas DataFrame is generating an error - python-3.x

Let's say I have the pandas DataFrame below:
>>> Data
Col1 Col2
53 08.02.2020 2020-02-14
55 01.02.2020 2020-02-13
335 30.01.2020 2020-02-14
365 14.02.2020 2020-02-16
446 11.02.2020 2020-02-15
476 03.02.2020 2020-02-18
504 08.02.2020 2020-02-10
557 01.02.2020 2020-02-15
668 10.02.2020 2020-02-15
756 07.02.2020 2020-02-08
Next, I have the function below:
is_ten_char = lambda x: x.str.len().eq(10)
But applying this function to the columns to check the number of characters generates an error:
Data[is_ten_char(Data.Col1) & is_ten_char(Data.Col2)]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <lambda>
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/generic.py", line 5270, in __getattr__
return object.__getattribute__(self, name)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/accessor.py", line 187, in __get__
accessor_obj = self._accessor(obj)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/strings.py", line 2041, in __init__
self._inferred_dtype = self._validate(data)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/strings.py", line 2098, in _validate
raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!
Any pointers on what is going wrong here would be highly helpful.

Col1 is clearly not a datetime format as shown.
Col2 probably is a datetime format, so to compare it as a string, do the following:
is_ten_char = lambda x: x.str.len().eq(10)
Data[is_ten_char(Data.Col1) & is_ten_char(Data.Col2.dt.strftime('%Y-%m-%d'))]
However, this does not convert Col2 to a string:
print(Data['Col2'][53])  # Timestamp('2020-02-14 00:00:00')
If you want Col2 converted to a string:
Data.Col2 = Data.Col2.dt.strftime('%Y-%m-%d')
Then use the original code.
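For completeness, a dtype-aware variant of the check, as a sketch (the ten_chars helper is illustrative, not from the original post; the two rows are taken from the sample data above):
import pandas as pd

# Rebuild a small version of the question's frame.
Data = pd.DataFrame({'Col1': ['08.02.2020', '01.02.2020'],
                     'Col2': pd.to_datetime(['2020-02-14', '2020-02-13'])},
                    index=[53, 55])

def ten_chars(col):
    # .str only works on object/string columns; render datetimes first.
    if pd.api.types.is_datetime64_any_dtype(col):
        col = col.dt.strftime('%Y-%m-%d')
    return col.str.len().eq(10)

print(Data[ten_chars(Data.Col1) & ten_chars(Data.Col2)])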

Related

Python Numba - Convert DataFrame series object to numpy array

I have a pandas DataFrame with strings. I am trying to use a set operation with Python Numba to get the unique characters in the column that contains strings. Since Numba does not recognize pandas DataFrames, I need to convert the string column to a NumPy array. However, once converted, the column shows the dtype as object. Is there a way to convert the pandas DataFrame (column of strings) to a normal array (not an object array)?
Please find the code below for your understanding.
z = train.head(2).sentence.values  # train is a pandas DataFrame
z
Output:
array(["Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",
"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)"],
dtype=object)
Python Numba code:
from numba import njit

@njit
def set_(z):
    x = set(z.sum())
    return x

set_(z)
Output:
---------------------------------------------------------------------------
TypingError Traceback (most recent call last)
<ipython-input-51-9d5bc17d106b> in <module>()
----> 1 set_(z)
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/dispatcher.py in _compile_for_args(self, *args, **kws)
342 raise e
343 else:
--> 344 reraise(type(e), e, None)
345 except errors.UnsupportedError as e:
346 # Something unsupported is present in the user code, add help info
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/six.py in reraise(tp, value, tb)
656 value = tp()
657 if value.__traceback__ is not tb:
--> 658 raise value.with_traceback(tb)
659 raise value
660
TypingError: Failed at nopython (nopython frontend)
Internal error at <numba.typeinfer.ArgConstraint object at 0x7fbe66c01a58>:
--%<----------------------------------------------------------------------------
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/errors.py", line 491, in new_error_context
yield
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/typeinfer.py", line 194, in __call__
assert ty.is_precise()
AssertionError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/typeinfer.py", line 138, in propagate
constraint(typeinfer)
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/typeinfer.py", line 195, in __call__
typeinfer.add_type(self.dst, ty, loc=self.loc)
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/errors.py", line 499, in new_error_context
six.reraise(type(newerr), newerr, tb)
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numba/six.py", line 659, in reraise
raise value
numba.errors.InternalError:
[1] During: typing of argument at <ipython-input-50-566e4e12481d> (3)
--%<----------------------------------------------------------------------------
File "<ipython-input-50-566e4e12481d>", line 3:
def set_(z):
x = set(z.sum())
^
This error may have been caused by the following argument(s):
- argument 0: Unsupported array dtype: object
This is not usually a problem with Numba itself but instead often caused by
the use of unsupported features or an issue in resolving types.
To see Python/NumPy features supported by the latest release of Numba visit:
http://numba.pydata.org/numba-doc/dev/reference/pysupported.html
and
http://numba.pydata.org/numba-doc/dev/reference/numpysupported.html
For more information about typing errors and how to debug them visit:
http://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile
If you think your code should work with Numba, please report the error message
and traceback, along with a minimal reproducer at:
https://github.com/numba/numba/issues/new
Would anyone be able to help me in this regard?
Thanks & Best Regards
Michael
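As a side note on the conversion itself (separate from whether a given Numba release supports string arrays in nopython mode, which varies by version): a pandas string column can be turned into a fixed-width unicode array rather than an object array with astype. A sketch, with a made-up stand-in for train:
import pandas as pd

# Hypothetical stand-in for the `train` DataFrame from the question.
train = pd.DataFrame({'sentence': ['hello world', 'foo bar']})

z = train.sentence.values      # dtype=object
u = z.astype('U')              # fixed-width unicode dtype, e.g. '<U11'
print(u.dtype)

# If the end goal is just the set of unique characters, plain Python
# does it without Numba at all:
print(set(''.join(train.sentence)))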

How can I insert a string value into a column of floats in pandas 0.24.2?

I have a column of over a million floats. I need to be able to replace certain values with strings when that value falls above or below certain thresholds.
import pandas as pd
import numpy as np
df = pd.DataFrame({'foo': np.random.random(10),
                   'bar': np.random.random(10)})
df
Out[115]:
foo bar
0 0.181262 0.890826
1 0.321260 0.053619
2 0.832247 0.044459
3 0.937769 0.855299
4 0.752133 0.008980
5 0.751948 0.680084
6 0.559528 0.785047
7 0.615597 0.265483
8 0.129505 0.509945
9 0.727209 0.786113
df.at[5, 'foo'] = 'somestring'
Traceback (most recent call last):
File "<ipython-input-116-bf0f6f9e84ac>", line 1, in <module>
df.at[5, 'foo'] = 'somestring'
File "/Users/nate/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 2287, in __setitem__
self.obj._set_value(*key, takeable=self._takeable)
File "/Users/nate/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2815, in _set_value
engine.set_value(series._values, index, value)
File "pandas/_libs/index.pyx", line 95, in pandas._libs.index.IndexEngine.set_value
File "pandas/_libs/index.pyx", line 106, in pandas._libs.index.IndexEngine.set_value
ValueError: could not convert string to float: 'somestring'
I will eventually need to write something like:
for idx, row in df.iterrows():
    if row[0] > some_value:
        df.at[idx, 'foo'] = 'over_some_value'
    else:
I have tried using iloc, but I suspect it would be too slow, and I would like to be able to use at to keep my code uniform.
In order to assign a value of a different type into the column, you may need to convert the frame to object dtype first.
A warning here: converting to object dtype is risky, since the columns lose their numeric dtype and vectorized numeric operations become slower or may stop working.
df=df.astype(object)
df.at[5, 'foo'] = 'somestring'
df
foo bar
0 0.163246 0.803071
1 0.946447 0.48324
2 0.777733 0.461704
3 0.996791 0.521338
4 0.320627 0.374384
5 somestring 0.987591
6 0.388765 0.726807
7 0.362077 0.76936
8 0.738139 0.0539076
9 0.208691 0.812568
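Since the question worries about speed, it may be worth noting a vectorized alternative to the iterrows loop (a sketch, not part of the original answer; some_value is the question's placeholder threshold):
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': np.random.random(10),
                   'bar': np.random.random(10)})
some_value = 0.5   # hypothetical threshold

# Compute the boolean mask on the numeric column first, convert the
# column to object dtype once, then assign to all matching rows at once.
mask = df['foo'] > some_value
df['foo'] = df['foo'].astype(object)
df.loc[mask, 'foo'] = 'over_some_value'
print(df)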

read_csv giving error for MovieLens 20M dataset

I am using pandas for the first time and wanted to do some data analysis on the MovieLens dataset. After running my code on the 1M dataset, I wanted to experiment with MovieLens 20M.
I am only reading one file, i.e. ratings.csv.
However, I faced multiple problems with the 20M dataset, and after spending much time I realized that the dtypes of the columns being read were not as expected.
The following code (where the path points to the ratings.csv file)
import pandas as pd
import numpy as np
df = pd.read_csv('../data/ml-20m/ratings.csv', sep=',',
                 names=['userId','movieId','rating','timestamp'],
                 engine='python',
                 dtype={'userId': np.int32, 'movieId': np.int32,
                        'rating': np.float64, 'timestamp': np.int64},
                 skipinitialspace=True, error_bad_lines=False)
is giving me the following error:
Traceback (most recent call last):
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1663, in _cast_types
    values = astype_nansafe(values, cast_type, copy=True)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/core/dtypes/cast.py", line 709, in astype_nansafe
    return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
  File "pandas/_libs/lib.pyx", line 456, in pandas._libs.lib.astype_intsafe
  File "pandas/_libs/src/util.pxd", line 142, in util.set_value_at_unsafe
ValueError: invalid literal for int() with base 10: 'movieId'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    df = pd.read_csv('../data/ml-20m/ratings.csv', sep=',',
                     names=['userId','movieId','rating','timestamp'],
                     engine='python',
                     dtype={'userId':np.int32, 'movieId':np.int32,
                            'rating':np.float64, 'timestamp':np.int64},
                     skipinitialspace=True, error_bad_lines=False)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 2272, in read
    data = self._convert_data(data)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 2338, in _convert_data
    clean_conv, clean_dtypes)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1574, in _convert_to_ndarrays
    cvals = self._cast_types(cvals, cast_type, c)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1666, in _cast_types
    "type %s" % (column, cast_type))
ValueError: Unable to convert column movieId to type
Basically I want to skip all those lines whose datatype doesn't conform to the dictionary
{'userId':np.int32, 'movieId':np.int32, 'rating':np.float64,
'timestamp':np.int64}
If I don't give the dtype argument to read_csv, then all four columns turn out to be of type "object", which is not what I want.
I searched on Google and found no one facing this problem. Can you help me?
I am using Python 3.
The problem is that you define column names, but the CSV already has a header row, so the first row of the DataFrame ends up holding the original column names, and all rows are therefore parsed as strings:
df = pd.read_csv('ratings.csv',
                 names=['userId','movieId','rating','timestamp'])
print(df.head())
userId movieId rating timestamp
0 user_id movie_id rating timestamp
1 1 1193 5 978300760
2 1 661 3 978302109
3 1 914 3 978301968
4 1 3408 4 978300275
The solution is to use the parameter skiprows=1 or header=0, together with the names parameter, to rename the columns:
df = pd.read_csv('ratings.csv',
                 dtype={'userId': np.int32,
                        'movieId': np.int32,
                        'rating': np.float64,
                        'timestamp': np.int64},
                 header=0,  # or skiprows=1
                 names=['userId','movieId','rating','timestamp'])
print(df.head())
userId movieId rating timestamp
0 1 1193 5.0 978300760
1 1 661 3.0 978302109
2 1 914 3.0 978301968
3 1 3408 4.0 978300275
4 1 2355 5.0 978824291
If you don't want to rename the column names:
df = pd.read_csv('ratings.csv',
                 dtype={'userId': np.int32,
                        'movieId': np.int32,
                        'rating': np.float64,
                        'timestamp': np.int64})
print(df.head())
user_id movie_id rating timestamp
0 1 1193 5.0 978300760
1 1 661 3.0 978302109
2 1 914 3.0 978301968
3 1 3408 4.0 978300275
4 1 2355 5.0 978824291
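On the stated goal of skipping lines whose values don't conform to the dtype dictionary: error_bad_lines only skips rows with the wrong number of fields, not rows with a wrong type. One option is to read without dtypes, coerce, and drop the failures; a sketch with a made-up inline file standing in for ratings.csv:
import io
import numpy as np
import pandas as pd

# Inline stand-in for ratings.csv, including one malformed row.
csv = io.StringIO(
    "user_id,movie_id,rating,timestamp\n"
    "1,1193,5,978300760\n"
    "1,oops,3,978302109\n"      # movie_id is not an integer
    "1,914,3,978301968\n"
)

df = pd.read_csv(csv, header=0,
                 names=['userId', 'movieId', 'rating', 'timestamp'])

# Coerce every column; values that fail to parse become NaN, then those
# rows are dropped, which effectively skips non-conforming lines.
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')
df = df.dropna().astype({'userId': np.int32, 'movieId': np.int32,
                         'rating': np.float64, 'timestamp': np.int64})
print(df)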

ValueError in scipy ttest_ind

I have the following CSV file:
SRA ID ERR169499 ERR169498 ERR169497
Label 1 0 1
TaxID PRJEB3251_ERR169499 PRJEB3251_ERR169499 PRJEB3251_ERR169499
333046 0.05 0.99 99.61
1049 0.03 2.34 34.33
337090 0.01 9.78 23.22
99007 22.33 2.90 0.00
I have 92 columns for case, for which the label is 0, and 95 columns for control, for which the label is 1. I have to perform a two-sample independent t-test and a rank-sum test. So far I have:
df = pd.read_csv('final_out_transposed.csv', header=[1,2], index_col=[0])
case = df.xs('0', axis=1, level=0).dropna()
ctrl = df.xs('1', axis=1, level=0).dropna()
(tt_val, p_ttest) = ttest_ind(case, ctrl, equal_var=False)
For which I am getting the error: ValueError: operands could not be broadcast together with shapes (92,) (95,).
The traceback is:
File "<ipython-input-152-d58634e75106>", line 1, in <module>
runfile('C:/IBD Bioproject/New folder/temp_3251.py', wdir='C:/IBD
Bioproject/New folder')
File "C:\Users\ksingh1\AppData\Local\Continuum\Anaconda3\lib\site-
packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
execfile(filename, namespace)
File "C:\Users\ksingh1\AppData\Local\Continuum\Anaconda3\lib\site-
packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/IBD Bioproject/New folder/temp_3251.py", line 106, in <module>
tt_val, p_ttest = ttest_ind(case, ctrl, equal_var=False)
File "C:\Users\ksingh1\AppData\Local\Continuum\Anaconda3\lib\site-
packages\scipy\stats\stats.py", line 4068, in ttest_ind
df, denom = _unequal_var_ttest_denom(v1, n1, v2, n2)
File "C:\Users\ksingh1\AppData\Local\Continuum\Anaconda3\lib\site-
packages\scipy\stats\stats.py", line 3872, in _unequal_var_ttest_denom
df = (vn1 + vn2)**2 / (vn1**2 / (n1 - 1) + vn2**2 / (n2 - 1))
ValueError: operands could not be broadcast together with shapes (92,) (95,)
I read a few posts but it's still unclear; I also went through the NumPy broadcasting documentation.
Thanks in advance
Apparently the objects created by the xs method of the Pandas DataFrame look like two-dimensional arrays. These must be flattened to look like one-dimensional arrays when passed to ttest_ind.
Try this:
ttest_ind(case.values.ravel(), ctrl.values.ravel(), equal_var=False)
The values attribute of the Pandas objects gives a NumPy array, and the ravel() method flattens the array to one dimension.
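A minimal sketch with synthetic data that reproduces the shape mismatch and shows the flattened call (the 92/95 column counts mirror the question; the numbers themselves are made up):
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
case = rng.random((4, 92))   # 4 TaxID rows x 92 case columns
ctrl = rng.random((4, 95))   # 4 TaxID rows x 95 control columns

# ttest_ind(case, ctrl) would compare column-by-column along axis 0 and
# fail, since 92 and 95 columns cannot be broadcast together.
t_val, p_val = ttest_ind(case.ravel(), ctrl.ravel(), equal_var=False)
print(t_val, p_val)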

TypeError: 'numpy.float64' object cannot be interpreted as an integer

I am trying to run the detect_ts function from the pyculiarity package, but I get this error when passing a two-column DataFrame in Python.
>>> import pandas as pd
>>> from pyculiarity import detect_ts
>>> data=pd.read_csv('C:\\Users\\nikhil.chauhan\\Desktop\\Bosch_Frame\\dataset1.csv',usecols=['time','value'])
>>> data.head()
time value
0 0 32.0
1 250 40.5
2 500 40.5
3 750 34.5
4 1000 34.5
>>> results = detect_ts(data,max_anoms=0.05,alpha=0.001,direction = 'both')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Windows\System32\pyculiar-0.0.5\pyculiarity\detect_ts.py", line 177, in detect_ts
verbose=verbose)
File "C:\Windows\System32\pyculiar-0.0.5\pyculiarity\detect_anoms.py", line 69, in detect_anoms
decomp = stl(data.value, np=num_obs_per_period)
File "C:\Windows\System32\pyculiar-0.0.5\pyculiarity\stl.py", line 35, in stl
res = sm.tsa.seasonal_decompose(data.values, model='additive', freq=np)
File "C:\Anaconda3\lib\site-packages\statsmodels\tsa\seasonal.py", line 88, in seasonal_decompose
trend = convolution_filter(x, filt)
File "C:\Anaconda3\lib\site-packages\statsmodels\tsa\filters\filtertools.py", line 303, in convolution_filter
result = _pad_nans(result, trim_head, trim_tail)
File "C:\Anaconda3\lib\site-packages\statsmodels\tsa\filters\filtertools.py", line 28, in _pad_nans
return np.r_[[np.nan] * head, x, [np.nan] * tail]
TypeError: 'numpy.float64' object cannot be interpreted as an integer
The problem is that in
return np.r_[[np.nan] * head, x, [np.nan] * tail]
the repetition counts head and tail are numpy.float64 values, while Python's list repetition ([np.nan] * head) requires an integer count; that is what "'numpy.float64' object cannot be interpreted as an integer" refers to (np.r_ itself is fine with floats).
Hence the counts need to be converted to integer type first, for example:
return np.r_[[np.nan] * int(head), x, [np.nan] * int(tail)]
Note that casting the NaN instead, as in [int(np.nan)] * head, won't work, as NaN cannot be type-cast to an integer:
ValueError: cannot convert float NaN to integer
Since head and tail are computed by statsmodels from the frequency that pyculiarity passes down, the likely root cause is that the period/frequency value ends up as a float; but no definitive fix can be suggested unless we know more about what you are trying to do here. Try providing a bit more detail about your code and you are sure to get help from us.
:-)
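For anyone hitting the same message, a tiny sketch of the underlying mechanics outside statsmodels (the value of head is made up; the error behavior is what recent NumPy versions exhibit):
import numpy as np

head = np.float64(2.0)   # hypothetical count, as computed inside statsmodels

# List repetition requires an integer count, so this raises
# TypeError: 'numpy.float64' object cannot be interpreted as an integer
try:
    padding = [np.nan] * head
except TypeError as exc:
    print(exc)

# Casting the count (not the NaN) to int makes the repetition valid:
padding = [np.nan] * int(head)
print(padding)   # [nan, nan]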
