how to skip corrupted gzips with pyspark? - apache-spark

I need to read a lot of gzip files from HDFS, like this:
sc.textFile('*.gz')
Some of these gzips are corrupted, which raises
java.io.IOException: gzip stream CRC failure
and stops the whole job.
I read the debate here, where someone has the same need but gets no clear solution. Since it is not appropriate to implement this inside Spark itself (according to the link), is there any way to just brutally skip corrupted files? There seem to be hints for Scala users, but I have no idea how to deal with it in Python.
Or can I only detect the corrupted files first and delete them?
What if I have a large number of gzips and, after a day of running, find out that the last one of them is corrupted? The whole day is wasted. And having corrupted gzips is quite common.

You could manually list all of the files and then read over them in a map UDF. The UDF could then have try/except blocks to handle corrupted files.
The code would look something like this:
import gzip
from pyspark.sql import Row

def readGzips(fileLoc):
    try:
        # ... code to read the file ...
        return record
    except:
        return Row(failed=fileLoc)

from os import listdir
from os.path import isfile, join

fileList = [f for f in listdir(mypath) if isfile(join(mypath, f))]
pFileList = sc.parallelize(fileList)
dataRdd = pFileList.map(readGzips).filter(lambda x: 'failed' not in x.asDict())
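For reference, a minimal runnable sketch of the read step might look like the following. This is an assumption, not part of the original answer: it treats each entry of fileList as a full path to a locally readable gzip of UTF-8 text (reading straight off HDFS would need an HDFS client rather than gzip.open), and it returns the decompressed lines in the Row on success.
import gzip
from pyspark.sql import Row

def readGzips(fileLoc):
    try:
        # Decompress the whole file; a corrupted stream raises an
        # exception (e.g. on a CRC mismatch), which is caught below.
        with gzip.open(fileLoc, 'rt', encoding='utf-8') as f:
            lines = f.read().splitlines()
        return Row(path=fileLoc, lines=lines)
    except Exception:
        # Mark the file as failed instead of killing the whole job.
        return Row(failed=fileLoc)

dataRdd = pFileList.map(readGzips).filter(lambda x: 'failed' not in x.asDict())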

Related

How can I work on a large dataset without having to use Pyspark?

I'm trying to work on a dataset with 510,000 rows and 636 columns. I loaded it into a dataframe using the dask dataframe method, but the entries can't be displayed. When I try to get the shape, it results in delays. Is there a way for me to analyze the whole dataset without using big data technologies like Pyspark?
from dask import dataframe
import requests
import zipfile
import os
import pandas as pd

if os.path.exists('pisa2012.zip') == False:
    r = requests.get('https://s3.amazonaws.com/udacity-hosted-downloads/ud507/pisa2012.csv.zip', allow_redirects=True)
    open('pisa2012.zip', 'wb').write(r.content)

if os.path.exists('pisa2012.csv') == False:
    with zipfile.ZipFile('pisa2012.zip', 'r') as zip_ref:
        zip_ref.extractall('./')

df_pisa = dataframe.read_csv('pisa2012.csv')
df_pisa.shape  # Output: (Delayed('int-e9d8366d-1b9e-4f8e-a83a-1d4cac510621'), 636)
Firstly, Spark, Dask and Vaex are all "big data" technologies.
"it results in delays"
If you read the documentation, you will see that Dask is lazy and only performs operations on demand; you have to explicitly ask for the result. The reason is that just getting the shape requires reading all the data, but the data will not be held in memory - that is the whole point, and it is the feature that lets you work with bigger-than-memory data (otherwise, just use pandas).
This works:
df_pisa.shape.compute()
But, better, figure out what you actually want to do with the data; I assume you are not just after the shape. You can put multiple operations/delayed objects into a single dask.compute() to do them at once and avoid repeating expensive tasks like reading/parsing the file.
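As an illustration of that last point, a small hedged sketch (the 'CNT' column is a hypothetical example, not from the original question):
import dask

# Build several lazy results, then evaluate them in one pass over the file.
shape = df_pisa.shape
counts = df_pisa['CNT'].value_counts()  # 'CNT' is a hypothetical column name

n_rows, col_counts = dask.compute(shape[0], counts)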

to_csv takes too much time to complete for xml object

I have an XML object which needs to be written to a file. This takes more than 1 hour to complete for 10,000 records. I tried converting it first with df_merge['xml'] = df_merge['xml'].astype(str), but the time taken is similar, i.e. more than 1 hour, except that astype(str) adds extra time. So whatever the scenario, to_csv takes more than 1 hour to complete.
So, can I please know how to write a large XML object to a file quickly?
The size of 10,000 XMLs is around 600 MB.
df_merge.to_csv(settings.OUTPUT_XML, encoding='utf-8', index=False,
                columns=['xml'])
Later I tried to use np.savetxt which also takes similar time.
import numpy as np
np.savetxt('output_xml.txt', df_merge['xml'], encoding='utf-8', fmt="%s")
You might consider using serialization. A good library for that is joblib; pickle is another common serialization tool.
A good Stack Overflow post outlining the differences and when to use each is here.
In your case, you might be able to serialize your object in much less time, using the example code below:
# Import joblib's dump function
from joblib import dump
# For speed, keep compression = 0
dump(df_merge, 'df_merge.joblib')
# For smaller file size, you can increase compression, though it will slow your write time
# dump(df_merge, 'df_merge.joblib', compress=9)
You can then use joblib to load the file, like so:
# Import joblib's load function
from joblib import load
# Note: if you used compress=n when dumping, loading will take longer
df_merge = load('df_merge.joblib')

Is it possible to use pandas and/or pyreadstat to read a large SPSS file in chunks, or does an alternative exist?

I have an SPSS database that I need to open, but it is huge, and if opened naively as in the code below, it saturates RAM and eventually crashes.
import pandas as pd
def main():
    data = pd.read_spss('database.sav')
    print(data)

if __name__ == '__main__':
    main()
The equivalent pandas function to read a SAS database allows for the chunksize and iterator keywords, mapping the file without reading it all into RAM in one shot, but for SPSS this option appears to be missing. Is there another python module that I could use for this task that would allow for mapping of the database without reading it into RAM in its entirety?
You can use pyreadstat's generator read_file_in_chunks. Use the parameter chunksize to regulate how many rows are read on each iteration.
import pyreadstat

fpath = 'database.sav'
# Pass the reader that matches the file type; for an SPSS .sav file that is pyreadstat.read_sav.
reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sav, fpath, chunksize=10000)
for df, meta in reader:
    print(df)  # df will contain 10K rows
    # do some cool calculations here for the chunk
Pandas read_spss uses pyreadstat under the hood, but exposes only a subset of options.

How to reduce time taken to convert dask dataframe to pandas dataframe

I have a function that reads large csv files using a dask dataframe and then converts to a pandas dataframe, which takes quite a lot of time. The code is:
import os
from glob import glob
import dask.dataframe as dd

def t_createdd(Path):
    dataframe = dd.read_csv(Path, sep=chr(1), encoding="utf-16")
    return dataframe

# Get the latest file
Array_EXT = "Export_GTT_Tea2Array_*.csv"
array_csv_files = sorted([file
                          for path, subdir, files in os.walk(PATH)
                          for file in glob(os.path.join(path, Array_EXT))])
latest_Tea2Array = array_csv_files[(len(array_csv_files) - (58 + 25)):
                                   (len(array_csv_files) - 58)]
Tea2Array_latest = t_createdd(latest_Tea2Array)

# keep only the required columns
Tea2Array = Tea2Array_latest[['Parameter_Id', 'Reading_Id', 'X', 'Value']]

P1MI3 = Tea2Array.loc[Tea2Array['parameter_id'] == 168566]
P1MI3 = P1MI3.compute()
P1MJC_main = Tea2Array.loc[Tea2Array['parameter_id'] == 168577]
P1MJC_old = P1MJC_main.compute()
P1MI3 = P1MI3.compute() and P1MJC_old = P1MJC_main.compute() take around 10 and 11 minutes respectively to execute. Is there any way to reduce the time?
I would encourage you to consider, with reference to the Dask documentation, why you would expect the process to be any faster than using Pandas alone.
Consider:
file access may happen from several threads, but you only have one disc interface as a bottleneck, and it likely performs much better reading sequentially than trying to read several files in parallel
reading CSVs is CPU-heavy and needs the Python GIL, so the multiple threads will not actually be running in parallel
when you compute, you materialise the whole dataframe. It is true that you appear to be selecting a single row in each case, but Dask has no way to know in which file/part it is.
you call compute twice, but could have combined them: Dask works hard to evict data from memory which is not currently needed by any computation, so you do double the work. By calling compute on both outputs, you would halve the time.
Further remarks:
obviously you would do much better if you knew which partition contained what
you can get around the GIL using processes, e.g., Dask's distributed scheduler
if you only need certain columns, do not bother to load everything and then sub-select; include those columns right in the read_csv call, saving a lot of time and memory (true for pandas or Dask). A sketch combining this with a single compute follows the snippet below.
To compute both lazy things at once:
dask.compute(P1MI3, P1MJC_main)
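A hedged sketch of those two suggestions combined (usecols is the standard pandas/Dask keyword for selecting columns at read time; the question mixes 'Parameter_Id' and 'parameter_id', so use whichever name the file actually contains):
import dask
import dask.dataframe as dd

# Only read the columns that are actually needed.
Tea2Array = dd.read_csv(latest_Tea2Array, sep=chr(1), encoding="utf-16",
                        usecols=['Parameter_Id', 'Reading_Id', 'X', 'Value'])

P1MI3_lazy = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168566]
P1MJC_lazy = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168577]

# One pass over the data materialises both pandas dataframes.
P1MI3, P1MJC_old = dask.compute(P1MI3_lazy, P1MJC_lazy)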

Streaming parquet file python and only downsampling

I have data in parquet format which is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample, and save to a dataframe? Ultimately, I would like to have the data in dataframe format to work with.
Am I wrong to attempt to do this without using a spark framework?
I have tried using pyarrow and fastparquet but I get memory errors on trying to read the entire file in.
Any tips or suggestions would be greatly appreciated!
Spark is certainly a viable choice for this task.
We're planning to add streaming read logic in pyarrow this year (2019, see https://issues.apache.org/jira/browse/ARROW-3771 and related issues). In the meantime, I would recommend reading one row group at a time to mitigate the memory use issues. You can do this with pyarrow.parquet.ParquetFile and its read_row_group method.
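A minimal sketch of that row-group approach, assuming a random 1% sample per row group is the kind of down-sampling wanted (the fraction and the file name are placeholders, not from the question):
import pandas as pd
import pyarrow.parquet as pq

pf = pq.ParquetFile('data.parquet')
samples = []
for i in range(pf.num_row_groups):
    # Only one row group is held in memory at a time.
    chunk = pf.read_row_group(i).to_pandas()
    samples.append(chunk.sample(frac=0.01, random_state=0))

df = pd.concat(samples, ignore_index=True)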
This is not an answer; I'm posting here because this is the only relevant post I can find on Stack Overflow. I'm trying to use the read_row_group function, but Python just exits with code 139. There are no other error messages, and I'm not sure how to fix this.
from pyarrow.parquet import ParquetFile
path = "sample.parquet"
f = ParquetFile(source = path)
print(f.num_row_groups) # it will print number of groups
# if I read the entire file:
df = f.read() # this works
# try to read row group
row_df = f.read_row_group(0)
# I get
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
Python version 3.6.3
pyarrow version 0.11.1
