Python MemoryError when loading 2455 CSV files (42GB) as pandas dataframe - python-3.x

Good day, I have 42 GB of data in a sequence of 2,455 CSV files.
I am trying to import the data sequentially in a loop into a pd.DataFrame for analysis.
I have tried it with 3 files and it works well.
from glob import glob
import pandas as pd
# Import data into DF
filenames = glob('Z:\PersonalFolders\AllData\*.csv')
df_trial = [pd.read_csv(f) for f in filenames]
df_trial
I am getting the following error; I have copy-pasted the traceback below. Please help.
df_trial = [pd.read_csv(f) for f in filenames]
Traceback (most recent call last):
File "<ipython-input-23-0438182db491>", line 1, in <module>
df_trial = [pd.read_csv(f) for f in filenames]
File "<ipython-input-23-0438182db491>", line 1, in <listcomp>
df_trial = [pd.read_csv(f) for f in filenames]
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 454, in _read
data = parser.read(nrows)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1148, in read
df = DataFrame(col_dict, columns=columns, index=index)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\frame.py", line 435, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 254, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 74, in arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1670, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1726, in form_blocks
float_blocks = _multi_blockify(items_dict["FloatBlock"])
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1820, in _multi_blockify
values, placement = _stack_arrays(list(tup_block), dtype)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1848, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
MemoryError: Unable to allocate 107. MiB for an array with shape (124, 113012) and data type float64

There are a number of things you can do.
First, only process one dataframe at a time:
filenames = glob('Z:\PersonalFolders\AllData\*.csv')
for f in filenames:
    df = pd.read_csv(f)
    process(df)
Second, if that's not possible, you can reduce the amount of memory used when loading the dataframes by a variety of means (smaller dtypes for numeric columns, omitting unneeded columns, and more). See https://pythonspeed.com/articles/pandas-load-less-data/ for some starting points on these techniques.
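For instance, here is a hedged sketch of that second option inside the loop above; the column names and dtypes are placeholders that you would match to your actual files:
# Load only the columns you actually need, and downcast the default
# float64 to float32 to roughly halve the memory per numeric column.
df = pd.read_csv(
    f,
    usecols=['timestamp', 'sensor_a', 'sensor_b'],
    dtype={'sensor_a': 'float32', 'sensor_b': 'float32'},
)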

Thanks to all.
I was able to work through all 2,455 files by capping each read with the nrows argument:
filenames = glob('Z:\PersonalFolders\AllData\*.csv')
df_2019 = []
for filename in filenames:
    df = pd.read_csv(filename, index_col=None, header=0, nrows=1000)
    df_2019.append(df)
frame = pd.concat(df_2019, axis=0, ignore_index=True)

Related

KeyError: 'longitude' when reading from csv file [duplicate]

I have successfully read a CSV file using pandas, but when I try to print a particular column from the data frame I get a KeyError. I am sharing the code and the error below.
import pandas as pd
reviews_new = pd.read_csv("D:\\aviva.csv")
reviews_new['review']
reviews_new['review']
Traceback (most recent call last):
File "<ipython-input-43-ed485b439a1c>", line 1, in <module>
reviews_new['review']
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.py", line 1997, in __getitem__
return self._getitem_column(key)
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.py", line 2004, in _getitem_column
return self._get_item_cache(key)
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\generic.py", line 1350, in _get_item_cache
values = self._data.get(item)
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\internals.py", line 3290, in get
loc = self.items.get_loc(item)
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\indexes\base.py", line 1947, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)
File "pandas\index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)
File "pandas\hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)
File "pandas\hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)
KeyError: 'review'
Can someone help me with this?
First, it is best to investigate what the real column names are; converting them to a list makes stray whitespace or similar issues easier to see:
print (reviews_new.columns.tolist())
I think there can be two problems:
1. Whitespace in the column names (and maybe in the data too).
One solution is to strip the whitespace from the column names:
reviews_new.columns = reviews_new.columns.str.strip()
Or add the skipinitialspace parameter to read_csv:
reviews_new = pd.read_csv("D:\\aviva.csv", skipinitialspace=True)
2. A separator other than the default comma (,).
The solution is to pass the sep parameter:
#sep is ;
reviews_new = pd.read_csv("D:\\aviva.csv", sep=';')
#sep is whitespace
reviews_new = pd.read_csv("D:\\aviva.csv", sep='\s+')
reviews_new = pd.read_csv("D:\\aviva.csv", delim_whitespace=True)
EDIT:
Your column names contain leading whitespace, so you need the first solution:
print (reviews_new.columns.tolist())
['Name', ' Date', ' review']
          ^        ^
import pandas as pd
df=pd.read_csv("file.txt", skipinitialspace=True)
df.head()
df['review']

Select rows in a heavy CSV

I am looking for a way to select rows that contain a given word, so I use this script:
import pandas
import datetime
df = pandas.read_csv(
    r"C:StockEtablissement_utf8(1)\StockEtablissement_utf8.csv",
    sep=",",
)
communes = ["PERPIGNAN"]
print()
df = df[~df["libelleCommuneEtablissement"].isin(communes)]
print()
My script works well with a normal CSV,
but with a heavy CSV (4 GB) the script says:
Traceback (most recent call last):
File "C:lafinessedufiness.py", line 5, in <module>
df = pandas.read_csv(r'C:StockEtablissement_utf8(1)\StockEtablissement_utf8.csv',
File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers\readers.py", line 581, in _read
return parser.read(nrows)
File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers\readers.py", line 1250, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 225, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas\_libs\parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas\_libs\parsers.pyx", line 883, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 1026, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas\_libs\parsers.pyx", line 1072, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1172, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1731, in pandas._libs.parsers._try_int64
MemoryError: Unable to allocate 128. KiB for an array with shape (16384,) and data type int64
Do you know how I can fix this error, please?
The pd.read_csv() function has an option to read the file in chunks, rather than loading it all at once. Use iterator=True and specify a reasonable chunk size (rows per chunk).
import pandas as pd

path = r'C:StockEtablissement_utf8(1)\StockEtablissement_utf8.csv'
it = pd.read_csv(path, sep=',', iterator=True, chunksize=10_000)

communes = ['PERPIGNAN']
filtered_chunks = []
for chunk_df in it:
    # @communes refers to the local Python list inside query()
    chunk_df = chunk_df.query('libelleCommuneEtablissement not in @communes')
    filtered_chunks.append(chunk_df)
df = pd.concat(filtered_chunks)
As you can see, you don't have enough memory available for Pandas to load that file entirely into memory.
One reason is that based on Python38-32 in the traceback, you're running a 32-bit version of Python, where 4 gigabytes (or is it 3 gigabytes?) is the limit for memory allocations anyway. If your system is 64-bit, you should switch to the 64-bit version of Python, so that's one obstacle less.
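If you're unsure which build you're running, a quick check:
import sys
# sys.maxsize exceeds 2**32 only on 64-bit builds of Python.
print("64-bit" if sys.maxsize > 2**32 else "32-bit")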
If that doesn't help, you'll simply need more memory. You could configure Windows's virtual memory, or buy more actual memory and install it in your system.
If those don't help, then you'll have to come up with a better approach than to load the big CSV entirely into memory.
For one, if you really only care about rows containing the string PERPIGNAN (no matter the column; you can filter precisely again in your code), you could run grep PERPIGNAN data.csv > data_perpignan.csv and work with that (assuming you have grep; the sketch below does the same filtering in Python).
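A sketch of that short Python pre-filter (the file names here are just examples):
with open("data.csv") as src, open("data_perpignan.csv", "w") as dst:
    dst.write(next(src))  # keep the header row
    for line in src:
        if "PERPIGNAN" in line:
            dst.write(line)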
Since read_csv() accepts any iterable of lines, you can also just do something like
def lines_from_file_including_strings(file, strings):
    for i, line in enumerate(file):
        if i == 0 or any(string in line for string in strings):
            yield line

communes = ["PERPIGNAN", "PARIS"]
with open("StockEtablissement_utf8.csv") as f:
    df = pd.read_csv(lines_from_file_including_strings(f, communes), sep=",")
for an initial filter.

How do I load and read several CSVs and then merge them into one file?

I am seriously struggling to read multiple CSVs from a directory where they are all listed.
All the files I am trying to read start with around 15 lines of report info. I am trying to eliminate these with skiprows=15, although this only works sporadically.
Using e.g. glob, I keep getting either "pandas.errors.EmptyDataError: No columns to parse from file", "Error reading line XXXX..... saw 23", "Python int too large to convert to C long", or "Error: field larger than field limit (131072)". I am trying, on a per-month basis, to merge hundreds of files loaded hourly into a share, and then combine each month's loads into Dec, Jan, Feb, and so on. Things worked fine for every month except Dec, which is why I am really after a bullet-proof solution able to read any CSV file.
I have this:
import glob
import pandas as pd
import sys
import csv
maxInt = sys.maxsize
while True:
# decrease the maxInt value by factor 10
# as long as the OverflowError occurs.
try:
csv.field_size_limit(maxInt)
break
except OverflowError:
maxInt = int(maxInt/10)
df = pd.concat([pd.read_csv(f, encoding="ISO-8859-1", sep=",", skiprows=7, engine="python") for f in glob.glob('C:\\...._dec*.csv')], ignore_index=True)
df.to_csv("C:\\...\\dec_logons_total_per_some_date.csv", sep=';', index=False)
I get this:
Traceback (most recent call last):
File "C:/.../PycharmProjects/FFA_AllFiles/load_multiple_csvs.py", line 51, in <module>
df = pd.read_csv(file_, index_col=None, error_bad_lines=False, skiprows=15, header=0, low_memory=False)
File "C:\Users\...\PycharmProjects\F...\venv\lib\site-packages\pandas\io\parsers.py", line 688, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\...\PycharmProjects\FFA_AllFiles\venv\lib\site-packages\pandas\io\parsers.py", line 454, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "C:\Users\...\PycharmProjects\FFA_AllFiles\venv\lib\site-packages\pandas\io\parsers.py", line 948, in __init__
self._make_engine(self.engine)
File "C:\Users\...\PycharmProjects\FFA_AllFiles\venv\lib\site-packages\pandas\io\parsers.py", line 1180, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "C:\Users\...\PycharmProjects\FFA_AllFiles\venv\lib\site-packages\pandas\io\parsers.py", line 2010, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas\_libs\parsers.pyx", line 540, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
I have tried many variations, including different skiprows values, with no luck.
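One defensive pattern, sketched here on the assumption that the preamble really is 15 lines and reusing the elided glob pattern from the question, is to wrap each read in try/except so that a single malformed file cannot abort the whole merge:
import glob
import pandas as pd
from pandas.errors import EmptyDataError, ParserError

frames = []
for f in glob.glob('C:\\...._dec*.csv'):
    try:
        frames.append(pd.read_csv(f, encoding="ISO-8859-1", sep=",",
                                  skiprows=15, engine="python"))
    except (EmptyDataError, ParserError) as exc:
        # Record and skip files that are empty or whose preamble length
        # differs, instead of letting one bad file crash the run.
        print(f"skipping {f}: {exc}")
df = pd.concat(frames, ignore_index=True)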

ignore missing files in loop - data did not show up

I have thousands of files covering the year range below. Some of the dates are missing, so I want to skip over them. But when I try the method below and then call data_in, the variable doesn't exist. Any help would be truly appreciated. I am new to Python. Thank you.
path = r'file path here'
DataYears = ['2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020']
Years = np.float64(DataYears)
NumOfYr = Years.size
DataMonths = ['01','02','03','04','05','06','07','08','09','10','11','12']
daysofmonth = [31,28,31,30,31,30,31,31,30,31,30,31]
for yy in range(NumOfYr):
    for mm in range(12):
        try:
            data_in = pd.read_csv(path+DataYears[yy]+DataMonths[mm]+'/*.dat', skiprows=4, header=None, engine='python')
            print('Reached data_in')  # EDIT
            a = data_in[0]  # EDIT
        except IOError:
            pass
            #print("File not accessible")
EDIT: Error added
Traceback (most recent call last):
File "Directory/Documents/test.py", line 23, in <module>
data_in = pd.read_csv(path+'.'+DataYears[yy]+DataMonths[mm]+'/*.cod', skiprows=4, header=None, engine='python')
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
self._make_engine(self.engine)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1126, in _make_engine
self._engine = klass(self.f, **self.options)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2269, in __init__
memory_map=self.memory_map,
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/common.py", line 431, in get_handle
f = open(path_or_buf, mode, errors="replace", newline="")
FileNotFoundError: [Errno 2] No such file or directory: 'Directory/Documents/201201/*.dat'
You can adapt the code below to get a list of your date folders:
import glob
# Gives you a list of your folders with the different dates
folder_names = glob.glob("Directory/Documents/*/")
print(folder_names)
Then, with the list of folders, you can iterate through their contents. If you just want a list of all the .dat files, you can do something like:
import glob
# Gives you a list of your folders with the different dates
file_names = glob.glob("Directory/Documents/*/*.dat")
print(file_names)
The code above searches the contents of your directories so you bypass your problem with missing dates. The prints are there so you can see the results of glob.glob().
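Combining the glob above with the read_csv call from the question (the skiprows=4 and other parameters are taken from there), a sketch of the full loop might look like this:
import glob
import pandas as pd

frames = []
for file_name in glob.glob("Directory/Documents/*/*.dat"):
    # Every name glob returns is a file that actually exists, so the
    # missing dates never come up and no try/except is needed for them.
    frames.append(pd.read_csv(file_name, skiprows=4, header=None, engine='python'))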

Iterating over text files in a folder and creating dataframe with 1 file / row in Python

I have a corpus of 14K text files that I want to read in to a dataframe. I want each file to be a unique row in said dataframe. Here's what I have so far:
import glob
import os
import pandas as pd
os.chdir("/Users/Wintermute/Desktop/senior_thesis/topic_models/corpus/")
content = pd.DataFrame()
i = 0
for file in glob.glob("*.txt"):
    with open(file, 'r') as f:
        i += 1
        print(i)
        content[i,] = f.readlines()
df = pd.DataFrame(content)
df.to_csv("corpus_article_by_line.csv")
When I run the program it behaves as I would expect for the first five text files, but then I get: ValueError: Length of values does not match length of index.
Full error message:
Traceback (most recent call last):
File "/Users/Wintermute/PycharmProjects/cs4/test.py", line 13, in <module>
content[i,] = f.readlines()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 2419, in __setitem__
self._set_item(key, value)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 2485, in _set_item
value = self._sanitize_column(key, value)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 2656, in _sanitize_column
value = _sanitize_index(value, self.index, copy=False)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py", line 2800, in _sanitize_index
raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index
Perhaps you have a non-uniform number of lines in your text files?
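If so, one way to guarantee one row per file regardless of line counts is to read each file as a single string and build the frame from a list of records; a sketch (the column names here are just examples):
import glob
import pandas as pd

records = []
for file_name in glob.glob("*.txt"):
    with open(file_name, 'r') as f:
        # f.read() keeps the whole file as one string, so every file
        # contributes exactly one row no matter how many lines it has.
        records.append({"file": file_name, "text": f.read()})

df = pd.DataFrame(records)
df.to_csv("corpus_article_by_line.csv", index=False)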
