I have the following CSV file:
SRA ID  ERR169499            ERR169498            ERR169497
Label   1                    0                    1
TaxID   PRJEB3251_ERR169499  PRJEB3251_ERR169499  PRJEB3251_ERR169499
333046  0.05                 0.99                 99.61
1049    0.03                 2.34                 34.33
337090  0.01                 9.78                 23.22
99007   22.33                2.90                 0.00
I have 92 columns for cases, for which the label is 0, and 95 columns for controls, for which the label is 1. I have to perform a two-sample independent t-test and a rank-sum test. So far I have:
import pandas as pd
from scipy.stats import ttest_ind

df = pd.read_csv('final_out_transposed.csv', header=[1,2], index_col=[0])
case = df.xs('0', axis=1, level=0).dropna()
ctrl = df.xs('1', axis=1, level=0).dropna()
(tt_val, p_ttest) = ttest_ind(case, ctrl, equal_var=False)
For this I am getting the error:
The traceback is:
File "<ipython-input-152-d58634e75106>", line 1, in <module>
runfile('C:/IBD Bioproject/New folder/temp_3251.py', wdir='C:/IBD
Bioproject/New folder')
File "C:\Users\ksingh1\AppData\Local\Continuum\Anaconda3\lib\site-
packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
execfile(filename, namespace)
File "C:\Users\ksingh1\AppData\Local\Continuum\Anaconda3\lib\site-
packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/IBD Bioproject/New folder/temp_3251.py", line 106, in <module>
tt_val, p_ttest = ttest_ind(case, ctrl, equal_var=False)
File "C:\Users\ksingh1\AppData\Local\Continuum\Anaconda3\lib\site-
packages\scipy\stats\stats.py", line 4068, in ttest_ind
df, denom = _unequal_var_ttest_denom(v1, n1, v2, n2)
File "C:\Users\ksingh1\AppData\Local\Continuum\Anaconda3\lib\site-
packages\scipy\stats\stats.py", line 3872, in _unequal_var_ttest_denom
df = (vn1 + vn2)**2 / (vn1**2 / (n1 - 1) + vn2**2 / (n2 - 1))
ValueError: operands could not be broadcast together with shapes (92,) (95,)
I read a few posts, but it's still unclear; I also went through the numpy broadcasting documentation.
Thanks in advance.
Apparently the objects created by the xs method of the Pandas DataFrame look like two-dimensional arrays. These must be flattened to look like one-dimensional arrays when passed to ttest_ind.
Try this:
ttest_ind(case.values.ravel(), ctrl.values.ravel(), equal_var=False)
The values attribute of the Pandas objects gives a numpy array, and the ravel() method flattens the array to one-dimension.
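For example, here is a runnable sketch with random stand-in data of the question's shapes (on the real DataFrames you would pass case.values.ravel() and ctrl.values.ravel()); it also includes the rank-sum test the question asks about:

import numpy as np
from scipy.stats import ranksums, ttest_ind

rng = np.random.default_rng(0)
case = rng.random((4, 92))  # stand-in: 4 taxa x 92 case samples
ctrl = rng.random((4, 95))  # stand-in: 4 taxa x 95 control samples

# ravel() pools each group into one 1-D sample, so the two inputs no
# longer need matching column counts.
tt_val, p_ttest = ttest_ind(case.ravel(), ctrl.ravel(), equal_var=False)
z_val, p_rank = ranksums(case.ravel(), ctrl.ravel())
print(p_ttest, p_rank)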
Related
I'm struggling to get seaborn to work for me. I've looked at several tutorials, and nothing seems to be working. The first column contains float values and the second column contains class labels; I want each class label to have its own violin plot. There is class imbalance. How do I split the second column into separate violin plots, with the floats from the neighboring column assigned to those groups?
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
data = <read files and append [float, str] to list>
data = pd.DataFrame(data=data, columns=['Slope', 'Class'])
Slope Class
0 0.1204778488753736 A
1 -0.05555463166744862 B
3 0.12801810567575655 A
4 -0.05620886965822473 B
... ... ...
1126 0.10525394490595655 D
1127 0.10174103119847132 E
1128 -0.03527457290513986 C
1129 -0.035187264760902316 D
1130 -0.03407274349346173 E
sns.violinplot(data=data)
It displays a plot, but not what I want (screenshot omitted).
I'm running out of ideas for what to try. Most attempts end with this error:
Traceback (most recent call last):
File "c:\Path\Slope Graphs.py", line 43, in <module>
main(slopes1, slopes2)
File "c:\Path\Slope Graphs.py", line 26, in main
sns.violinplot(x='Class', y='Slope', data=data)
File "C:\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\seaborn\categorical.py", line 2384, in violinplot
plotter = _ViolinPlotter(x, y, hue, data, order, hue_order,
File "C:\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\seaborn\categorical.py", line 554, in __init__
self.estimate_densities(bw, cut, scale, scale_hue, gridsize)
File "C:\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\seaborn\categorical.py", line 620, in estimate_densities
kde, bw_used = self.fit_kde(kde_data, bw)
File "C:\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\seaborn\categorical.py", line 705, in fit_kde
kde = stats.gaussian_kde(x, bw)
File "C:\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\scipy\stats\kde.py", line 209, in __init__
self.set_bandwidth(bw_method=bw_method)
File "C:\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\scipy\stats\kde.py", line 565, in set_bandwidth
self._compute_covariance()
File "C:\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\scipy\stats\kde.py", line 574, in _compute_covariance
self._data_covariance = atleast_2d(cov(self.dataset, rowvar=1,
File "<__array_function__ internals>", line 5, in cov
File "C:\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\numpy\lib\function_base.py", line 2422, in cov
avg, w_sum = average(X, axis=1, weights=w, returned=True)
File "<__array_function__ internals>", line 5, in average
File "C:\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\numpy\lib\function_base.py", line 420, in average
scl = wgt.sum(axis=axis, dtype=result_dtype)
File "C:\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\numpy\core\_methods.py", line 38, in _sum
return umr_sum(a, axis, dtype, out, keepdims, initial, where)
TypeError: No loop matching the specified signature and casting was found for ufunc add
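For reference, the grouped call from the traceback does produce one violin per class when the Slope column is numeric. A minimal sketch on made-up data (if the floats were appended as strings, pd.to_numeric would be needed first, which may be what the TypeError above is pointing at):

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data; the key point is that Slope holds floats, not strings.
data = pd.DataFrame({
    'Slope': [0.12, -0.06, 0.13, -0.06, 0.11, -0.04],
    'Class': ['A', 'B', 'A', 'B', 'C', 'C'],
})
# data['Slope'] = pd.to_numeric(data['Slope'])  # needed if values are strings
sns.violinplot(x='Class', y='Slope', data=data)  # one violin per class
plt.show()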
Good day. I have 42 GB of data in a sequence of 2455 CSV files.
I am trying to import the data sequentially, using a loop, into a pd.DataFrame for analysis.
I tried it with 3 files and it works well.
from glob import glob
import pandas as pd
# Import data into DF
filenames = glob(r'Z:\PersonalFolders\AllData\*.csv')  # raw string so the backslashes are literal
df_trial = [pd.read_csv(f) for f in filenames]
df_trial
I am getting the following error; I've copy-pasted the traceback below. Please help.
Traceback (most recent call last):
File "<ipython-input-23-0438182db491>", line 1, in <module>
df_trial = [pd.read_csv(f) for f in filenames]
File "<ipython-input-23-0438182db491>", line 1, in <listcomp>
df_trial = [pd.read_csv(f) for f in filenames]
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 454, in _read
data = parser.read(nrows)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1148, in read
df = DataFrame(col_dict, columns=columns, index=index)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\frame.py", line 435, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 254, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 74, in arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1670, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1726, in form_blocks
float_blocks = _multi_blockify(items_dict["FloatBlock"])
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1820, in _multi_blockify
values, placement = _stack_arrays(list(tup_block), dtype)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1848, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
MemoryError: Unable to allocate 107. MiB for an array with shape (124, 113012) and data type float64
There are a number of things you can do.
First, only process one dataframe at a time:
filenames = glob(r'Z:\PersonalFolders\AllData\*.csv')
for f in filenames:
    df = pd.read_csv(f)
    process(df)  # whatever analysis each file needs, one file at a time
Second, if that's not possible you can try to reduce the amount of memory used when loading the dataframes by a variety of means (smaller dtypes for numeric columns, omitting numeric columns, and more). See https://pythonspeed.com/articles/pandas-load-less-data/ for some starting points on these techniques.
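For example, a sketch of those techniques (the file name, column names, and dtypes here are hypothetical; adapt them to your data):

import pandas as pd

df = pd.read_csv(
    'example.csv',
    usecols=['timestamp', 'sensor_a', 'sensor_b'],         # skip columns you don't need
    dtype={'sensor_a': 'float32', 'sensor_b': 'float32'},  # half the memory of float64
)
print(df.memory_usage(deep=True))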
Thanks to all.
I was able to process all 42 GB of data using the nrows argument:
filenames = glob(r'Z:\PersonalFolders\AllData\*.csv')
df_2019 = []
for filename in filenames:
    # read only the first 1000 rows of each file to keep memory in check
    df = pd.read_csv(filename, index_col=None, header=0, nrows=1000)
    df_2019.append(df)
frame = pd.concat(df_2019, axis=0, ignore_index=True)
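One caveat: nrows=1000 keeps only the first 1000 rows of each file. If every row is needed, a chunked read is one way to keep memory bounded; a sketch (process() is a hypothetical per-chunk step, as in the answer above):

from glob import glob
import pandas as pd

filenames = glob(r'Z:\PersonalFolders\AllData\*.csv')
for filename in filenames:
    # read each file in fixed-size pieces instead of all at once
    for chunk in pd.read_csv(filename, chunksize=100_000):
        process(chunk)  # e.g. aggregate or append to an on-disk store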
I have a numpy array with the following dimensions:
(1611216, 2)
I tried reshaping it to (804, 2004) using:
df = np.reshape(df, (804, 2004))
but it gives an error:
Traceback (most recent call last):
File "Z:/Seismic/Geophysical/99_Personal/Abhishake/RMS_Machine_learning/RMS_data_analysis.py", line 19, in <module>
df = np.reshape(df, (804, 2004))
File "C:\python36\lib\site-packages\numpy\core\fromnumeric.py", line 232, in reshape
return _wrapfunc(a, 'reshape', newshape, order=order)
File "C:\python36\lib\site-packages\numpy\core\fromnumeric.py", line 57, in _wrapfunc
return getattr(obj, method)(*args, **kwds)
ValueError: cannot reshape array of size 3222432 into shape (804,2004)
You cannot reshape a (1611216, 2) numpy array into (804, 2004).
A reshape must preserve the total number of elements: 1611216 × 2 = 3,222,432, but 804 × 2004 = 1,611,216, so the two sizes don't match. You have to come up with another set of dimensions whose product is 3,222,432, and the right choice depends on how you want to use your array.
Hint: (1608, 2004) is a valid reshape, since 1608 × 2004 = 3,222,432.
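For example, a small sketch with a stand-in array of the same size:

import numpy as np

df = np.zeros((1611216, 2))        # stand-in for the real data
reshaped = df.reshape(1608, 2004)  # 1608 * 2004 == 3,222,432 elements, so this works

# Letting numpy infer one dimension avoids the arithmetic:
reshaped = df.reshape(-1, 2004)    # also gives shape (1608, 2004)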
I am trying to run the detect_ts function from the pyculiarity package, but I get this error when passing a two-column DataFrame in Python.
>>> import pandas as pd
>>> from pyculiarity import detect_ts
>>> data=pd.read_csv('C:\\Users\\nikhil.chauhan\\Desktop\\Bosch_Frame\\dataset1.csv',usecols=['time','value'])
>>> data.head()
time value
0 0 32.0
1 250 40.5
2 500 40.5
3 750 34.5
4 1000 34.5
>>> results = detect_ts(data,max_anoms=0.05,alpha=0.001,direction = 'both')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Windows\System32\pyculiar-0.0.5\pyculiarity\detect_ts.py", line 177, in detect_ts
verbose=verbose)
File "C:\Windows\System32\pyculiar-0.0.5\pyculiarity\detect_anoms.py", line 69, in detect_anoms
decomp = stl(data.value, np=num_obs_per_period)
File "C:\Windows\System32\pyculiar-0.0.5\pyculiarity\stl.py", line 35, in stl
res = sm.tsa.seasonal_decompose(data.values, model='additive', freq=np)
File "C:\Anaconda3\lib\site-packages\statsmodels\tsa\seasonal.py", line 88, in seasonal_decompose
trend = convolution_filter(x, filt)
File "C:\Anaconda3\lib\site-packages\statsmodels\tsa\filters\filtertools.py", line 303, in convolution_filter
result = _pad_nans(result, trim_head, trim_tail)
File "C:\Anaconda3\lib\site-packages\statsmodels\tsa\filters\filtertools.py", line 28, in _pad_nans
return np.r_[[np.nan] * head, x, [np.nan] * tail]
TypeError: 'numpy.float64' object cannot be interpreted as an integer
The problem is in the line from the traceback: [np.nan] * head repeats a Python list head times, and list repetition only accepts an integer. Here head (and tail) arrive as numpy.float64 values, which is exactly what the TypeError complains about. Hence they need to be converted to integer type first.
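A minimal, runnable demonstration of that TypeError (the value of head is made up):

import numpy as np

head = np.float64(2.0)  # a float where an int is required
try:
    [np.nan] * head     # list repetition rejects a numpy.float64 multiplier
except TypeError as e:
    print(e)            # 'numpy.float64' object cannot be interpreted as an integer
print([np.nan] * int(head))  # works once the multiplier is an integer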
But we have another problem here. Note that casting np.nan itself, as in
return np.r_[[int(np.nan)] * head, x, [int(np.nan)] * tail]
won't work either, because NaN cannot be cast to an integer type:
ValueError: cannot convert float NaN to integer
The values that actually have to become integers are head and tail, and those are computed inside the library from your inputs. Thus, no proper solution can be suggested unless we know what you are trying to do here. Try providing a bit more detail about your code and you are sure to get help from us.
:-)
I have two Series objects, which from my perspective look exactly the same, except they contain different data. I have attempted to convert them to DataFrames and to put them both in the same DataFrame as separate columns. For some reason I cannot fathom, one of the Series will be converted happily to a DataFrame and the other one refuses to be converted when placed in a container (list or dict). I get a reindexing error, but there are no duplicates in the index of either Series.
import pickle
import pandas as pd
s1 = pickle.load(open('s1.p', 'rb'))
s2 = pickle.load(open('s2.p', 'rb'))
print(s1.head(10))
print(s2.head(10))
pd.DataFrame(s1) # <--- works fine
pd.DataFrame(s2) # <--- works fine
pd.DataFrame([s1]) # <--- works fine
# pd.DataFrame([s2]) # <--- doesn't work
# pd.DataFrame([s1, s2]) # <--- doesn't work
pd.DataFrame({s1.name: s1}) # <--- works fine
pd.DataFrame({s2.name: s2}) # <--- works fine
pd.DataFrame({s1.name: s1, s2.name: s1}) # <--- works fine
# pd.DataFrame({s1.name: s1, s2.name: s2}) # <--- doesn't work
Here is the output. Although you can't see it here, there is overlap between the index values; they are just in a different order. I want the indexes to be matched up when I combine the two Series into the same DataFrame.
id
801120 42.01
801138 50.18
801139 50.01
802101 53.77
802110 56.52
802112 47.37
802113 46.52
802114 46.58
802115 42.59
802117 40.85
Name: age, dtype: float64
id
A32067 0.39083
A32195 0.28506
A01685 0.36432
A11124 0.55649
A32020 0.41524
A32021 0.43788
A32098 0.49206
A00699 0.37515
A32158 0.58793
A14139 0.47413
Name: lh_vtx_000001, dtype: float64
Traceback when the final line is uncommented:
Traceback (most recent call last):
File "/Users/sm2286/Documents/Vertex/test.py", line 18, in <module>
pd.DataFrame({s1.name: s1, s2.name: s2}) # <--- doesn't work
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 224, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 360, in _init_dict
return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 5236, in _arrays_to_mgr
arrays = _homogenize(arrays, index, dtype)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 5534, in _homogenize
v = v.reindex(index, copy=False)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/series.py", line 2287, in reindex
return super(Series, self).reindex(index=index, **kwargs)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py", line 2229, in reindex
fill_value, copy).__finalize__(self)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py", line 2247, in _reindex_axes
copy=copy, allow_dups=False)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py", line 2341, in _reindex_with_indexers
copy=copy)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 3586, in reindex_indexer
self.axes[axis]._can_reindex(indexer)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py", line 2293, in _can_reindex
raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
Traceback when line 13 is uncommented:
Traceback (most recent call last):
File "/Users/sm2286/Documents/Vertex/test.py", line 13, in <module>
pd.DataFrame([s2]) # <--- doesn't work
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 263, in __init__
arrays, columns = _to_arrays(data, columns, dtype=dtype)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 5359, in _to_arrays
dtype=dtype)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 5453, in _list_of_series_to_arrays
indexer = indexer_cache[id(index)] = index.get_indexer(columns)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py", line 2082, in get_indexer
raise InvalidIndexError('Reindexing only valid with uniquely'
pandas.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
After more investigation, the difference between the Series turned out to be that the latter contained missing values. Removing them fixed the issue.
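A minimal sketch of how missing values can produce exactly these errors (the data below is synthetic; the key assumption is that the missing values sit in the index, where repeated NaN labels make the index non-unique):

import numpy as np
import pandas as pd

s1 = pd.Series([42.01, 50.18], index=['801120', '801138'], name='age')
s2 = pd.Series([0.39083, 0.28506, 0.36432],
               index=['A32067', np.nan, np.nan], name='lh_vtx_000001')

print(s2.index.is_unique)  # False: the repeated NaN labels count as duplicates
# pd.DataFrame({s1.name: s1, s2.name: s2})  # would raise "cannot reindex from a duplicate axis"

s2 = s2[s2.index.notna()]  # drop entries whose index label is missing
print(pd.DataFrame({s1.name: s1, s2.name: s2}))  # now works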