Select rows in a heavy CSV - python-3.x

I am looking for a way to select the rows that contain a given word, so I use this script:
import pandas
import datetime
df = pandas.read_csv(
    r"C:StockEtablissement_utf8(1)\StockEtablissement_utf8.csv",
    sep=",",
)
communes = ["PERPIGNAN"]
print()
df = df[~df["libelleCommuneEtablissement"].isin(communes)]
print()
My script works well with a normal CSV,
but with a heavy CSV (4 GB) the script says:
Traceback (most recent call last):
File "C:lafinessedufiness.py", line 5, in <module>
df = pandas.read_csv(r'C:StockEtablissement_utf8(1)\StockEtablissement_utf8.csv',
File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers\readers.py", line 581, in _read
return parser.read(nrows)
File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers\readers.py", line 1250, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 225, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas\_libs\parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas\_libs\parsers.pyx", line 883, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 1026, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas\_libs\parsers.pyx", line 1072, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1172, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1731, in pandas._libs.parsers._try_int64
MemoryError: Unable to allocate 128. KiB for an array with shape (16384,) and data type int64
Do you know how I can fix this error, please?

The pd.read_csv() function has an option to read the file in chunks, rather than loading it all at once. Use iterator=True and specify a reasonable chunk size (rows per chunk).
import pandas as pd

path = r'C:StockEtablissement_utf8(1)\StockEtablissement_utf8.csv'
it = pd.read_csv(path, sep=',', iterator=True, chunksize=10_000)
communes = ['PERPIGNAN']
filtered_chunks = []
for chunk_df in it:
    # query() refers to Python variables with the @ prefix
    chunk_df = chunk_df.query('libelleCommuneEtablissement not in @communes')
    filtered_chunks.append(chunk_df)
df = pd.concat(filtered_chunks)
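If even the filtered result risks being too large to keep in memory, a variation of the same idea is to append each filtered chunk to a new CSV instead of concatenating them (a sketch; the output filename below is just a placeholder):

import pandas as pd

path = r'C:StockEtablissement_utf8(1)\StockEtablissement_utf8.csv'
out_path = 'StockEtablissement_filtered.csv'  # hypothetical output file
communes = ['PERPIGNAN']

for i, chunk_df in enumerate(pd.read_csv(path, sep=',', chunksize=10_000)):
    # drop the rows whose commune is in the list, as in the snippet above
    chunk_df = chunk_df[~chunk_df['libelleCommuneEtablissement'].isin(communes)]
    # write the header once, then append without it
    chunk_df.to_csv(out_path, mode='w' if i == 0 else 'a',
                    header=(i == 0), index=False)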

As you can see, you don't have enough memory available for Pandas to load that file entirely into memory.
One reason is that based on Python38-32 in the traceback, you're running a 32-bit version of Python, where 4 gigabytes (or is it 3 gigabytes?) is the limit for memory allocations anyway. If your system is 64-bit, you should switch to the 64-bit version of Python, so that's one obstacle less.
If that doesn't help, you'll just also need more memory. You could configure Windows's virtual memory, or buy more actual memory and install it in your system.
If those don't help, then you'll have to come up with a better approach than to load the big CSV entirely into memory.
For one, if you really only care about rows with the string PERPIGNAN (no matter the column; you can really filter it again in your code), you could do grep PERPIGNAN data.csv > data_perpignan.csv and work with that (assuming you have grep; you can do the same filtering with a short Python script).
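For illustration, such a short Python prefilter could look roughly like this (a sketch that keeps the header row plus any line containing PERPIGNAN):

# Keep the header line plus every line that mentions PERPIGNAN,
# writing them to a smaller file that pandas can then load normally.
with open("StockEtablissement_utf8.csv", encoding="utf-8") as src, \
        open("data_perpignan.csv", "w", encoding="utf-8") as dst:
    dst.write(next(src))  # header row
    for line in src:
        if "PERPIGNAN" in line:
            dst.write(line)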
Since read_csv() accepts any iterable of lines, you can also just do something like
def lines_from_file_including_strings(file, strings):
    for i, line in enumerate(file):
        if i == 0 or any(string in line for string in strings):
            yield line

communes = ["PERPIGNAN", "PARIS"]
with open("StockEtablissement_utf8.csv") as f:
    df = pd.read_csv(lines_from_file_including_strings(f, communes), sep=",")
for an initial filter.

Related

How do I load and read several CSVs and then merge them afterwards into one file?

I am seriously struggling with reading multiple CSVs from a directory where I have them all listed.
All the files I am trying to read and load start with around 15 lines of report info about that file. I am trying to eliminate this with skiprows=15, although this only delivers random luck.
Using e.g. glob, I keep getting either "pandas.errors.EmptyDataError: No columns to parse from file", "Error reading line XXXX..... saw 23", "Python int too large to convert to C long", or "Error: field larger than field limit (131072)". I am trying, on a per-month basis, to merge hundreds of files loaded hourly into a share, and then merge each month's loads into Dec, Jan, Feb... and so on. Things worked fine for all months except Dec, which is why I am really after some bullet-proof solution able to read any CSV file.
I have this:
import glob
import pandas as pd
import sys
import csv
maxInt = sys.maxsize
while True:
    # decrease the maxInt value by factor 10
    # as long as the OverflowError occurs.
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)
df = pd.concat([pd.read_csv(f, encoding="ISO-8859-1", sep=",", skiprows=7, engine="python") for f in glob.glob('C:\\...._dec*.csv')], ignore_index=True)
df.to_csv("C:\\...\\dec_logons_total_per_some_date.csv", sep=';', index=False)
I get this:
Traceback (most recent call last):
File "C:/.../PycharmProjects/FFA_AllFiles/load_multiple_csvs.py", line 51, in <module>
df = pd.read_csv(file_, index_col=None, error_bad_lines=False, skiprows=15, header=0, low_memory=False)
File "C:\Users\...\PycharmProjects\F...\venv\lib\site-packages\pandas\io\parsers.py", line 688, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\...\PycharmProjects\FFA_AllFiles\venv\lib\site-packages\pandas\io\parsers.py", line 454, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "C:\Users\...\PycharmProjects\FFA_AllFiles\venv\lib\site-packages\pandas\io\parsers.py", line 948, in __init__
self._make_engine(self.engine)
File "C:\Users\...\PycharmProjects\FFA_AllFiles\venv\lib\site-packages\pandas\io\parsers.py", line 1180, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "C:\Users\...\PycharmProjects\FFA_AllFiles\venv\lib\site-packages\pandas\io\parsers.py", line 2010, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas\_libs\parsers.pyx", line 540, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
I tried a lot of attempts, including skiprows=whatever.
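One way to narrow down which files trigger which error (a rough sketch, assuming it is acceptable to skip and log unreadable files rather than abort the whole merge) is to read the files one at a time inside a try/except instead of a single list comprehension:

import glob
import pandas as pd

frames = []
for f in glob.glob('C:\\...._dec*.csv'):
    try:
        frames.append(pd.read_csv(f, encoding="ISO-8859-1", sep=",",
                                   skiprows=7, engine="python"))
    except Exception as exc:  # e.g. EmptyDataError, ParserError
        # log the offending file instead of letting one bad file
        # abort the whole merge
        print(f"could not read {f}: {exc}")

df = pd.concat(frames, ignore_index=True)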

Not able to parse CSV file with pandas

I am writing a Python script in which I generate two different CSV files and then read these files using pandas. I am able to read file1 with pandas but I get an error while reading file2, which is in the same format (same column names) as file1 with different/same values. Please find below the error that I am getting and the sample code that I am using.
Error:
Traceback (most recent call last):
File "MSReport.py", line 168, in <module>
fail = pd.read_csv('/home/cisapp/msLogFailure.csv', sep=',')
File "/home/cisapp/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/cisapp/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 448, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/home/cisapp/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 880, in __init__
self._make_engine(self.engine)
File "/home/cisapp/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/home/cisapp/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1891, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 532, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
Code:
df = pd.read_csv(BASE_LOCATION+'/msLog_Success.csv', engine='python')
f_output = df.groupby('MSISDN').last()
#print(df)
print(f_output)
fail = pd.read_csv(BASE_LOCATION+'/msLogFailure.csv', engine='python')
fail = fail['MSISDN']
fail = fail.tolist()
for i in fail:
    succ = f_output[f_output.MSISDN != i]
In the above sample code there is no error while reading the first file with df = pd.read_csv(BASE_LOCATION+'/msLog_Success.csv', engine='python'), but while reading the second file with fail = pd.read_csv(BASE_LOCATION+'/msLogFailure.csv', engine='python') I am facing the error mentioned above. Please help me resolve it.
Note: I am running the code using Python 3.
I faced the same problem and resolved it, so you can check using the ideas below.
Check the delimiter and specify it, as in the examples below:
pd.read_csv(BASE_LOCATION+'/msLog_Success.csv', encoding='utf-16', sep='\t')
pd.read_csv(BASE_LOCATION+'/msLog_Success.csv', delim_whitespace=True)
You can also add an 'r' prefix before the file path.
Otherwise, share an image of the file.
Your sample of the msLogFailure file looks OK - 6 column names and 6 data fields.
I looked for posts concerning just this error message and I found advice to:
read the input file into a string variable,
read_csv from this string, e.g. pd.read_csv(io.StringIO(txt), ...).
Maybe this will help.
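A minimal sketch of that idea (using the path from the traceback; the txt variable name is just illustrative):

import io
import pandas as pd

# read the whole file into a string first, then parse that string
with open('/home/cisapp/msLogFailure.csv') as f:
    txt = f.read()

fail = pd.read_csv(io.StringIO(txt), sep=',')
print(fail.head())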

Python MEMORY ERROR when loading 2455 CSV files (42GB) as pandas dataframe

Good day, I have 42 GB of data in a sequence of 2455 CSV files.
I am trying to import the data sequentially using a loop into a pd.DataFrame for analysis.
I have tried it with 3 files and it works well.
from glob import glob
import pandas as pd
# Import data into DF
filenames = glob('Z:\PersonalFolders\AllData\*.csv')
df_trial = [pd.read_csv(f) for f in filenames]
df_trial
I am getting the following error (I copy-pasted the traceback here). Please help.
df_trial = [pd.read_csv(f) for f in filenames]
Traceback (most recent call last):
File "<ipython-input-23-0438182db491>", line 1, in <module>
df_trial = [pd.read_csv(f) for f in filenames]
File "<ipython-input-23-0438182db491>", line 1, in <listcomp>
df_trial = [pd.read_csv(f) for f in filenames]
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 454, in _read
data = parser.read(nrows)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1148, in read
df = DataFrame(col_dict, columns=columns, index=index)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\frame.py", line 435, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 254, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 74, in arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1670, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1726, in form_blocks
float_blocks = _multi_blockify(items_dict["FloatBlock"])
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1820, in _multi_blockify
values, placement = _stack_arrays(list(tup_block), dtype)
File "C:\Users\WorkStation\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1848, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
MemoryError: Unable to allocate 107. MiB for an array with shape (124, 113012) and data type float64
There are a number of things you can do.
First, only process one dataframe at a time:
filenames = glob('Z:\PersonalFolders\AllData\*.csv')
for f in filenames:
    df = pd.read_csv(f)
    process(df)
Second, if that's not possible you can try to reduce the amount of memory used when loading the dataframes by a variety of means (smaller dtypes for numeric columns, omitting numeric columns, and more). See https://pythonspeed.com/articles/pandas-load-less-data/ for some starting points on these techniques.
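For illustration, that kind of trimming could look roughly like this (the path, column names, and dtypes below are placeholders to substitute with ones matching your files):

import pandas as pd

# Hypothetical example: keep only the columns you actually need and
# downcast wide numeric columns to smaller dtypes.
df = pd.read_csv(
    'one_of_the_files.csv',              # placeholder path
    usecols=['timestamp', 'value'],      # placeholder column names
    dtype={'value': 'float32'},          # float32 uses half the memory of float64
)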
Thanks to all.
I was able to process all 42 GB of data using the nrows argument:
filenames = glob('Z:\PersonalFolders\AllData\*.csv')
df_2019 = []
for filename in filenames:
    df = pd.read_csv(filename, index_col=None, header=0, nrows=1000)
    df_2019.append(df)
frame = pd.concat(df_2019, axis=0, ignore_index=True)

How can I read a CSV file in pandas which is separated with ";"?

I started working with pandas in Python 3.4 a couple of days ago. I chose to work on the Book-Crossing data set.
The book information table is like this:
The Book rating table is like this:
I want to grab the "ISBN" and "Book-Title" columns from the book information table and merge them with the book-rating table where both match on "ISBN", and after that write the results to another CSV file.
I used the code below:
udata = pd.read_csv('1', names = ('User_ID', 'ISBN', 'Book-Rating'), encoding="ISO-8859-1", sep=';', usecols=[0,1,2])
uitem = pd.read_csv('2', names = ('ISBN', 'Book-Title'), encoding="ISO-8859-1", sep=';', usecols=[0,1])
ratings = pd.merge(udata, uitem, on='ISBN')
ratings.to_csv('ratings.csv', index=False)
Unfortunately it doesn't work and it gives an error:
Traceback (most recent call last):
File "C:\Users\masoud\Desktop\Dataset\data2\a.py", line 2, in <module>
udata = pd.read_csv('2.csv', names = ('User_ID', 'ISBN', 'Book-Rating'),encoding="ISO-8859-1", sep=';', usecols=[0,1,2])
File "C:\WinPython-64bit-3.4.3.6\python-3.4.3.amd64\lib\site-packages\pandas\io\parsers.py", line 491, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\WinPython-64bit-3.4.3.6\python-3.4.3.amd64\lib\site-packages\pandas\io\parsers.py", line 278, in _read
return parser.read()
File "C:\WinPython-64bit-3.4.3.6\python-3.4.3.amd64\lib\site-packages\pandas\io\parsers.py", line 740, in read
ret = self._engine.read(nrows)
File "C:\WinPython-64bit-3.4.3.6\python-3.4.3.amd64\lib\site-packages\pandas\io\parsers.py", line 1187, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 758, in pandas.parser.TextReader.read (pandas\parser.c:7919)
File "pandas\parser.pyx", line 780, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:8175)
File "pandas\parser.pyx", line 833, in pandas.parser.TextReader._read_rows (pandas\parser.c:8868)
File "pandas\parser.pyx", line 820, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8736)
File "pandas\parser.pyx", line 1732, in pandas.parser.raise_parser_error (pandas\parser.c:22105)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 8 fields in line 6452, saw 9
I was wondering if anybody could fix the error?
In the first and second rows, change sep to ';':
sep=';'

Contradicting Errors?

So I'm trying to edit a CSV file by writing to a temporary file and eventually replacing the original with the temp file. I'm going to have to edit the CSV file multiple times, so I need to be able to reference it. I've never used NamedTemporaryFile before and I'm running into a lot of difficulties. The most persistent problem I'm having is writing over the edited lines.
This part goes through and writes over the rows, unless specific values are in a specific column, in which case it just passes over them.
I have this:
office = 3
temp = tempfile.NamedTemporaryFile(delete=False)
with open(inFile, "rb") as oi, temp:
    r = csv.reader(oi)
    w = csv.writer(temp)
    for row in r:
        if row[office] == "R00" or row[office] == "ALC" or row[office] == "RMS":
            pass
        else:
            w.writerow(row)
and I get this error:
Traceback (most recent call last):
File "H:\jcatoe\Practice Python\pract.py", line 86, in <module>
cleanOfficeCol()
File "H:\jcatoe\Practice Python\pract.py", line 63, in cleanOfficeCol
for row in r:
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
So I searched for that error and the general consensus was that "rb" needs to be "rt", so I tried that and got this error:
Traceback (most recent call last):
File "H:\jcatoe\Practice Python\pract.py", line 86, in <module>
cleanOfficeCol()
File "H:\jcatoe\Practice Python\pract.py", line 67, in cleanOfficeCol
w.writerow(row)
File "C:\Users\jcatoe\AppData\Local\Programs\Python\Python35-32\lib\tempfile.py", line 483, in func_wrapper
return func(*args, **kwargs)
TypeError: a bytes-like object is required, not 'str'
I'm confused because the errors seem to be saying to do the opposite thing.
If you read the tempfile docs you'll see that by default it's opening the file in 'w+b' mode. If you take a closer look at your errors, you'll see that you're getting one on read, and one on write. What you need to be doing is making sure that you're opening your input and output file in the same mode.
You can do it like this:
import csv
import tempfile

office = 3
# newline='' is what the csv module recommends for text-mode files
with open(inFile, 'r', newline='') as oi, \
        tempfile.NamedTemporaryFile(delete=False, mode='w', newline='') as temp:
    reader = csv.reader(oi)
    writer = csv.writer(temp)
    for row in reader:
        if row[office] == "R00" or row[office] == "ALC" or row[office] == "RMS":
            pass
        else:
            writer.writerow(row)
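Since the goal is to eventually replace the original with the temp file, one way to finish after the with block (a sketch; it assumes overwriting inFile in place is what you want) is:

import os

# both files are closed once the 'with' block exits, so the temporary
# file can safely take the place of the original
os.replace(temp.name, inFile)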
