I am using pandas for the first time and wanted to do some data analysis on the MovieLens dataset. After running my code on the 1M dataset, I wanted to experiment with MovieLens 20M.
I am only reading one file, i.e. ratings.csv.
However, I faced multiple problems with the 20M dataset, and after spending much time I realized that they happen because the dtypes of the columns being read are not as expected.
The following code (where the path points to the ratings.csv file)
import pandas as pd
import numpy as np

df = pd.read_csv('../data/ml-20m/ratings.csv', sep=',',
                 names=['userId', 'movieId', 'rating', 'timestamp'],
                 engine='python',
                 dtype={'userId': np.int32, 'movieId': np.int32,
                        'rating': np.float64, 'timestamp': np.int64},
                 skipinitialspace=True, error_bad_lines=False)
gives me the following error:
Traceback (most recent call last):
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1663, in _cast_types
    values = astype_nansafe(values, cast_type, copy=True)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/core/dtypes/cast.py", line 709, in astype_nansafe
    return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
  File "pandas/_libs/lib.pyx", line 456, in pandas._libs.lib.astype_intsafe
  File "pandas/_libs/src/util.pxd", line 142, in util.set_value_at_unsafe
ValueError: invalid literal for int() with base 10: 'movieId'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "test.py", line 4, in <module>
    df = pd.read_csv('../data/ml-20m/ratings.csv', sep=',', names=['userId','movieId','rating','timestamp'], engine='python', dtype={'userId':np.int32, 'movieId':np.int32, 'rating':np.float64, 'timestamp':np.int64}, skipinitialspace=True, error_bad_lines=False)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 2272, in read
    data = self._convert_data(data)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 2338, in _convert_data
    clean_conv, clean_dtypes)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1574, in _convert_to_ndarrays
    cvals = self._cast_types(cvals, cast_type, c)
  File "/home/sahildeep/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1666, in _cast_types
    "type %s" % (column, cast_type))
ValueError: Unable to convert column movieId to type <class 'numpy.int32'>
Basically I want to skip all lines whose values don't conform to the dtype dictionary
{'userId': np.int32, 'movieId': np.int32, 'rating': np.float64, 'timestamp': np.int64}
If I don't pass the dtype argument to read_csv, then all four columns come out as type "object", which is not what I want.
I searched on Google and found no one facing this problem. Can you help me?
I am using Python 3.
The problem is that you pass column names via names, but the CSV already has a header row; that header becomes the first data row of the DataFrame, so every column is parsed as strings:
df = pd.read_csv('ratings.csv',
                 names=['userId', 'movieId', 'rating', 'timestamp'])
print (df.head())
userId movieId rating timestamp
0 user_id movie_id rating timestamp
1 1 1193 5 978300760
2 1 661 3 978302109
3 1 914 3 978301968
4 1 3408 4 978300275
The solution is to use the parameter skiprows=1, or header=0 together with the names parameter to rename the columns:
df = pd.read_csv('ratings.csv',
                 dtype={'userId': np.int32,
                        'movieId': np.int32,
                        'rating': np.float64,
                        'timestamp': np.int64},
                 header=0,  # or skiprows=1
                 names=['userId', 'movieId', 'rating', 'timestamp'])
print (df.head())
userId movieId rating timestamp
0 1 1193 5.0 978300760
1 1 661 3.0 978302109
2 1 914 3.0 978301968
3 1 3408 4.0 978300275
4 1 2355 5.0 978824291
If you don't want to rename the column names:
df = pd.read_csv('ratings.csv',
                 dtype={'userId': np.int32,
                        'movieId': np.int32,
                        'rating': np.float64,
                        'timestamp': np.int64})
print (df.head())
user_id movie_id rating timestamp
0 1 1193 5.0 978300760
1 1 661 3.0 978302109
2 1 914 3.0 978301968
3 1 3408 4.0 978300275
4 1 2355 5.0 978824291
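As a side note, if some data rows really did contain malformed values (the question asks about skipping such lines), one option is to coerce after reading instead of casting during the parse. A minimal sketch, not tested on the full 20M file:
import numpy as np
import pandas as pd

df = pd.read_csv('ratings.csv', header=0,
                 names=['userId', 'movieId', 'rating', 'timestamp'])

# Coerce every column to numeric; unparseable values become NaN,
# and dropna then removes the offending rows.
df = df.apply(pd.to_numeric, errors='coerce').dropna()

# Now the exact dtypes from the question can be applied safely.
df = df.astype({'userId': np.int32, 'movieId': np.int32,
                'rating': np.float64, 'timestamp': np.int64})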
Could anyone help me debug this error? I appreciate it!
Col_1 Col_2 Col_3
0 8 0 0
1 0 1 0
2 0 0 1
3 8 0 0
import pandas as pd
import numpy as np
import sklearn
from sklearn import linear_model
from sklearn.utils import shuffle
data = pd.read_csv("simple_data_2.csv")
print(data.shape)
data.dropna()
data = data[["Col_1","Col_2","Col_3"]]
predict ="Col_1"
gene1 = "Col_2"
gene2 = "Col_3"
data = data.dropna()
data = data.reset_index(drop=True)
print(data)
x = np.array(data[gene1,gene2])
y = np.array(data[predict])
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x,y, test_size=0.1)
linear = linear_model.LinearRegression()
linear.fit(x_train,y_train)
acc = linear.score(x_test, y_test)
print(acc)
Traceback (most recent call last):
  File "/Users/Chris/opt/anaconda3/envs/tf/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: ('Col_2', 'Col_3')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/Chris/PycharmProjects/TensorEnv/debug.py", line 16, in <module>
    x = np.array(data[gene1,gene2])
  File "/Users/Chris/opt/anaconda3/envs/tf/lib/python3.7/site-packages/pandas/core/frame.py", line 3458, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/Chris/opt/anaconda3/envs/tf/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: ('Col_2', 'Col_3')
The error is in how you select columns with gene1 and gene2: data[gene1, gene2] looks up a single column named by the tuple ('Col_2', 'Col_3'), which is why you get the KeyError. Selecting two columns takes a list, data[[gene1, gene2]].
But I wouldn't even bother with that indirection; just use the actual column names when feeding data into the algorithm, or rename the columns, otherwise you will confuse yourself.
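A minimal sketch of that fix, reusing the question's own names (list selection instead of tuple selection):
import numpy as np

# A list selects two columns; data[gene1, gene2] would look up a single
# column named by the tuple ('Col_2', 'Col_3') and raise the KeyError.
x = np.array(data[[gene1, gene2]])
y = np.array(data[predict])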
Good day. I have 42 GB of data in a sequence of 2455 CSV files.
I am trying to import the data sequentially, using a loop, into a pd.DataFrame for analysis.
I have tried it with 3 files and it works well.
from glob import glob
import pandas as pd
# Import data into DF
filenames = glob('Z:\PersonalFolders\AllData\*.csv')
df_trial = [pd.read_csv(f) for f in filenames]
df_trial
I am getting the following error; I've copy-pasted the traceback below. Please help.
df_trial = [pd.read_csv(f) for f in filenames]
Traceback (most recent call last):
File "<ipython-input-23-0438182db491>", line 1, in <module>
df_trial = [pd.read_csv(f) for f in filenames]
File "<ipython-input-23-0438182db491>", line 1, in <listcomp>
df_trial = [pd.read_csv(f) for f in filenames]
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 454, in _read
data = parser.read(nrows)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1148, in read
df = DataFrame(col_dict, columns=columns, index=index)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\frame.py", line 435, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 254, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 74, in arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1670, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1726, in form_blocks
float_blocks = _multi_blockify(items_dict["FloatBlock"])
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1820, in _multi_blockify
values, placement = _stack_arrays(list(tup_block), dtype)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1848, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
MemoryError: Unable to allocate 107. MiB for an array with shape (124, 113012) and data type float64
There are a number of things you can do.
First, only process one dataframe at a time:
filenames = glob('Z:\PersonalFolders\AllData\*.csv')
for f in filenames:
    df = pd.read_csv(f)
    process(df)
Second, if that's not possible, you can try to reduce the amount of memory used when loading the dataframes by a variety of means: smaller dtypes for numeric columns, omitting columns you don't need, and more. See https://pythonspeed.com/articles/pandas-load-less-data/ for some starting points on these techniques; a sketch of the dtype idea follows.
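For example, a rough sketch of the dtype approach (the column names and dtypes here are hypothetical and would need to match the real files; f is a filename as in the loop above):
import pandas as pd

# float32 uses half the memory of float64, and usecols skips columns
# the analysis never touches.
df = pd.read_csv(
    f,
    usecols=['timestamp', 'sensor', 'value'],       # hypothetical columns
    dtype={'sensor': 'int32', 'value': 'float32'},  # hypothetical dtypes
)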
Thanks to all.
I was able to process all 42 GB of data using the nrows argument:
filenames = glob('Z:\PersonalFolders\AllData\*.csv')
df_2019 = []
for filename in filenames:
    df = pd.read_csv(filename, index_col=None, header=0, nrows=1000)
    df_2019.append(df)
frame = pd.concat(df_2019, axis=0, ignore_index=True)
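One caveat: nrows=1000 reads only the first 1000 rows of each file, so this samples the data rather than loading every row. If all rows are needed, chunksize streams each file in fixed-size pieces instead; a sketch, with process() as a placeholder like in the answer above:
import pandas as pd

for filename in filenames:
    # Each chunk is an ordinary DataFrame of at most 100,000 rows.
    for chunk in pd.read_csv(filename, chunksize=100_000):
        process(chunk)  # placeholder for whatever analysis is needed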
I have the following dataframe
0 1 2 3 4 5 6
0 i love eating spicy hand pulled noodles
1 i also like to game alot
I'd like to apply a function to create a new dataframe, but instead of the above words, the df will be populated with each word's part-of-speech tag.
I'm using nltk.pos_tag, and I did this: df.apply(nltk.pos_tag).
My expected output should look like this:
0 1 2 3 4 5 6
0 NN NN VB JJ NN VB NN
1 NN DT NN NN VB DT
However, I get IndexError: ('string index out of range', 'occurred at index 6')
Also, I understand that nltk.pos_tag returns tuples in the format ('word', 'pos_tag'), so some further manipulation may be required to keep only the tag. Any suggestions on how to go about doing this efficiently?
Traceback:
Traceback (most recent call last):
File "PartsOfSpeech.py", line 71, in <module>
FilteredTrees = pos.run_pos(data.lower())
File "PartsOfSpeech.py", line 59, in run_pos
df = df.apply(pos_tag)
File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/pandas/core/frame.py", line 6487, in apply
return op.get_result()
File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/pandas/core/apply.py", line 151, in get_result
return self.apply_standard()
File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/pandas/core/apply.py", line 257, in apply_standard
self.apply_series_generator()
File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/pandas/core/apply.py", line 286, in apply_series_generator
results[i] = self.f(v)
File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/__init__.py", line 162, in pos_tag
return _pos_tag(tokens, tagset, tagger, lang)
File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/__init__.py", line 119, in _pos_tag
tagged_tokens = tagger.tag(tokens)
File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 157, in tag
context = self.START + [self.normalize(w) for w in tokens] + self.END
File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 157, in <listcomp>
context = self.START + [self.normalize(w) for w in tokens] + self.END
File "/anaconda3/envs/customer_sentiment/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 242, in normalize
elif word[0].isdigit():
You can use applymap.
df.fillna('').applymap(lambda x: nltk.pos_tag([x])[0][1] if x!='' else '')
0 1 2 3 4 5 6
0 NN NN VBG NN NN VBD NNS
1 NN RB IN TO NN NN
Note: if your dataframe is large, it will be more efficient to tag entire sentences and then convert the tags into a dataframe; the cell-by-cell approach above will be slow on a big dataset.
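A sketch of that sentence-level idea, assuming df is the word-per-column frame from the question:
import pandas as pd
import nltk

# Tag each row's words with a single pos_tag call instead of one call
# per cell, then rebuild a frame from the tags alone.
rows = df.fillna('').values.tolist()
tags = [[tag for _, tag in nltk.pos_tag([w for w in row if w != ''])]
        for row in rows]
tagged = pd.DataFrame(tags)  # shorter rows are padded with None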
I have a column of over a million floats. I need to be able to replace certain values with strings when that value falls above or below certain thresholds.
import pandas as pd
import numpy as np
df = pd.DataFrame({'foo': np.random.random(10),
'bar': np.random.random(10)})
df
Out[115]:
foo bar
0 0.181262 0.890826
1 0.321260 0.053619
2 0.832247 0.044459
3 0.937769 0.855299
4 0.752133 0.008980
5 0.751948 0.680084
6 0.559528 0.785047
7 0.615597 0.265483
8 0.129505 0.509945
9 0.727209 0.786113
df.at[5, 'foo'] = 'somestring'
Traceback (most recent call last):
File "<ipython-input-116-bf0f6f9e84ac>", line 1, in <module>
df.at[5, 'foo'] = 'somestring'
File "/Users/nate/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 2287, in __setitem__
self.obj._set_value(*key, takeable=self._takeable)
File "/Users/nate/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2815, in _set_value
engine.set_value(series._values, index, value)
File "pandas/_libs/index.pyx", line 95, in pandas._libs.index.IndexEngine.set_value
File "pandas/_libs/index.pyx", line 106, in pandas._libs.index.IndexEngine.set_value
ValueError: could not convert string to float: 'somestring'
I will eventually need to write something like:
for idx, row in df.iterrows():
    if row[0] > some_value:
        df.at[idx, 'foo'] = 'over_some_value'
    else:
I have tried using iloc, but I suspect it would be too slow, and I would like to be able to use at to keep my code uniform.
In order to assign a value of a different type into the column, you may need to convert the frame to object dtype first.
A warning here: converting to object is dangerous, since you lose the efficient numeric storage and vectorized operations on those columns:
df=df.astype(object)
df.at[5, 'foo'] = 'somestring'
df
foo bar
0 0.163246 0.803071
1 0.946447 0.48324
2 0.777733 0.461704
3 0.996791 0.521338
4 0.320627 0.374384
5 somestring 0.987591
6 0.388765 0.726807
7 0.362077 0.76936
8 0.738139 0.0539076
9 0.208691 0.812568
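For a column of a million floats it may also be worth avoiding iterrows entirely. A vectorized sketch using Series.mask, where some_value stands in for the question's threshold and df is the frame built above:
some_value = 0.5  # hypothetical threshold

# mask() replaces values where the condition holds; mixing strings with
# floats turns the column into object dtype automatically.
df['foo'] = df['foo'].mask(df['foo'] > some_value, 'over_some_value')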
I have following csv file:
SRA ID ERR169499 ERR169498 ERR169497
Label 1 0 1
TaxID PRJEB3251_ERR169499 PRJEB3251_ERR169499 PRJEB3251_ERR169499
333046 0.05 0.99 99.61
1049 0.03 2.34 34.33
337090 0.01 9.78 23.22
99007 22.33 2.90 0.00
I have 92 columns for case, for which the label is 0, and 95 columns for control, for which the label is 1. I have to perform a two-sample independent t-test and a rank-sum test. So far I have:
df = pd.read_csv('final_out_transposed.csv', header=[1,2], index_col=[0])
case = df.xs('0', axis=1, level=0).dropna()
ctrl = df.xs('1', axis=1, level=0).dropna()
(tt_val, p_ttest) = ttest_ind(case, ctrl, equal_var=False)
For which I am getting the error: ValueError: operands could not be broadcast together with shapes (92,) (95,).
The traceback is:
File "<ipython-input-152-d58634e75106>", line 1, in <module>
runfile('C:/IBD Bioproject/New folder/temp_3251.py', wdir='C:/IBD
Bioproject/New folder')
File "C:\Users\ksingh1\AppData\Local\Continuum\Anaconda3\lib\site-
packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
execfile(filename, namespace)
File "C:\Users\ksingh1\AppData\Local\Continuum\Anaconda3\lib\site-
packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/IBD Bioproject/New folder/temp_3251.py", line 106, in <module>
tt_val, p_ttest = ttest_ind(case, ctrl, equal_var=False)
File "C:\Users\ksingh1\AppData\Local\Continuum\Anaconda3\lib\site-
packages\scipy\stats\stats.py", line 4068, in ttest_ind
df, denom = _unequal_var_ttest_denom(v1, n1, v2, n2)
File "C:\Users\ksingh1\AppData\Local\Continuum\Anaconda3\lib\site-
packages\scipy\stats\stats.py", line 3872, in _unequal_var_ttest_denom
df = (vn1 + vn2)**2 / (vn1**2 / (n1 - 1) + vn2**2 / (n2 - 1))
ValueError: operands could not be broadcast together with shapes (92,) (95,)
I read a few posts but it's still unclear; I also went through the numpy broadcasting documentation.
Thanks in advance.
Apparently the objects created by the xs method of the Pandas DataFrame look like two-dimensional arrays. These must be flattened to look like one-dimensional arrays when passed to ttest_ind.
Try this:
ttest_ind(case.values.ravel(), ctrl.values.ravel(), equal_var=False)
The values attribute of the Pandas objects gives a numpy array, and the ravel() method flattens the array to one-dimension.
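A tiny illustration of the flattening:
import numpy as np

a = np.array([[0.05, 0.99], [0.03, 2.34]])
print(a.shape)          # (2, 2) -- two-dimensional
print(a.ravel().shape)  # (4,)   -- flattened to one dimension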