Streaming a parquet file in Python and only down-sampling - python-3.x

I have data in parquet format which is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample, and save to a dataframe? Ultimately, I would like to have the data in dataframe format to work with.
Am I wrong to attempt this without using Spark?
I have tried pyarrow and fastparquet, but I get memory errors when trying to read the entire file in.
Any tips or suggestions would be greatly appreciated!

Spark is certainly a viable choice for this task.
We're planning to add streaming read logic in pyarrow this year (2019, see https://issues.apache.org/jira/browse/ARROW-3771 and related issues). In the meantime, I would recommend reading one row group at a time to mitigate the memory issues. You can do this with pyarrow.parquet.ParquetFile and its read_row_group method.
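In rough outline, that looks something like the sketch below (the file name and the 10% sampling fraction are placeholders, not from the question; adjust the down-sampling step to whatever you need):
import pandas as pd
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")  # placeholder path

pieces = []
for i in range(pf.num_row_groups):
    # read a single row group, so only that slice is in memory at once
    chunk = pf.read_row_group(i).to_pandas()
    # down-sample the chunk (here: a 10% random sample) before keeping it
    pieces.append(chunk.sample(frac=0.1))

df = pd.concat(pieces, ignore_index=True)
The memory high-water mark is then roughly one row group plus the samples kept so far, rather than the whole 6 GB file.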

This is not an answer; I'm posting here because this is the only relevant post I could find on Stack Overflow. I'm trying to use the read_row_group function, but Python just exits with code 139. There are no other error messages, and I'm not sure how to fix this.
from pyarrow.parquet import ParquetFile
path = "sample.parquet"
f = ParquetFile(source = path)
print(f.num_row_groups) # it will print number of groups
# if I read the entire file:
df = f.read() # this works
# try to read row group
row_df = f.read_row_group(0)
# I get
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
Python version 3.6.3
pyarrow version 0.11.1

Related

Is it possible to use pandas and/or pyreadstat to read a large SPSS file in chunks, or does an alternative exist?

I have an SPSS database that I need to open, but it is huge, and if opened naively as in the code below, it saturates RAM and eventually crashes.
import pandas as pd

def main():
    data = pd.read_spss('database.sav')
    print(data)

if __name__ == '__main__':
    main()
The equivalent pandas function to read a SAS database allows for the chunksize and iterator keywords, mapping the file without reading it all into RAM in one shot, but for SPSS this option appears to be missing. Is there another python module that I could use for this task that would allow for mapping of the database without reading it into RAM in its entirety?
You can use pyreadstat with the generator read_file_in_chunks. Use the chunksize parameter to control how many rows are read on each iteration.
import pyreadstat

fpath = 'database.sav'
reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sav, fpath, chunksize=10000)
for df, meta in reader:
    print(df)  # df will contain 10K rows
    # do some cool calculations here for the chunk
Pandas read_spss uses pyreadstat under the hood, but exposes only a subset of options.
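If the end goal is a single, smaller dataframe, one way (a sketch; the 10% sample per chunk is illustrative, not part of the original answer) is to down-sample each chunk as it arrives and concatenate the pieces:
import pandas as pd
import pyreadstat

pieces = []
reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sav, 'database.sav', chunksize=10000)
for df, meta in reader:
    # keep only a fraction of each 10,000-row chunk so the full file
    # never has to sit in memory at once
    pieces.append(df.sample(frac=0.1))

subsample = pd.concat(pieces, ignore_index=True)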

dask.dataframe.read_parquet takes too long

I tried to read parquet from s3 like this:
import dask.dataframe as dd

s3_path = "s3://my_bucket/my_table"
times = dd.read_parquet(
    s3_path,
    storage_options={
        "client_kwargs": {
            "endpoint_url": bucket_endpoint_url,
        },
        "profile_name": bucket_profile,
    },
    engine='pyarrow',
)
It takes a very long time just to create the dask dataframe; no computation has been performed on it yet. I traced the code and it looks like the time is being spent in pyarrow.parquet.validate_schema().
My parquet table has a lot of files in it (~2000), and it takes 543 seconds on my laptop just to create the dataframe, because it tries to check the schema of each parquet file. Is there a way to disable schema validation?
Thanks,
Currently, if there is no metadata file and you're using the PyArrow backend, Dask probably sends a request to read metadata from each of the individual partitions on S3. This is quite slow.
Dask's dataframe parquet reader is currently being rewritten to help address this. Until then you might consider using fastparquet with the ignore_divisions keyword (or something like that), or checking back in a month or two.
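As a stopgap, switching engines looks roughly like this (a sketch that reuses the hypothetical path and credential variables from the question; fastparquet can read a single _metadata summary file, when the dataset has one, instead of opening every part file):
import dask.dataframe as dd

times = dd.read_parquet(
    "s3://my_bucket/my_table",  # same hypothetical path as in the question
    engine='fastparquet',
    storage_options={
        "client_kwargs": {"endpoint_url": bucket_endpoint_url},  # defined as in the question
        "profile_name": bucket_profile,
    },
)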

computer crashes when doing calculations on big datasets on python using Dask

I am unable to do calculations on large datasets using python-Dask. My computer crashes.
I have a computer with 4 GB of RAM running Debian Linux. I am trying to load some files from a Kaggle competition (the Elo Merchant competition). When I try to load the data and get the shape of the dask dataframe, the computer crashes.
I am running the code only on my laptop. I chose Dask because it can handle large datasets. I would also like to know whether Dask can spill computations to disk if the data does not fit in memory. If so, do I need to enable anything, or does Dask do it automatically? If I need to do it manually, how do I do it? A pointer to a tutorial on this would also be great.
I have a 250 GB solid-state drive as my hard disk, so there would be space for a large dataset to fit on disk.
Please help me in this regard. My code is below.
Thank you
Michael
import dask.dataframe as dd
from dask.distributed import Client
from sklearn.externals.joblib import parallel_backend

client = Client(processes=False)

merchant = dd.read_csv('/home/michael/Elo_Merchant/merchants.csv')
new_merchant_transactions = dd.read_csv('/home/michael/Elo_Merchant/new_merchant_transactions.csv')
historical_transactions = dd.read_csv('/home/michael/Elo_Merchant/historical_transactions.csv')
train = dd.read_csv('/home/michael/Elo_Merchant/train.csv')
test = dd.read_csv('/home/michael/Elo_Merchant/test.csv')

merchant.head()
merchant.compute().shape

merchant_headers = merchant.columns.values.tolist()
for c in range(len(merchant_headers)):
    print(merchant_headers[c])
    print('--------------------')
    print("{}".format(merchant[merchant_headers[c]].value_counts().compute()) + '\n')
    print("Number of NaN values {}".format(merchant[merchant_headers[c]].isnull().sum().compute()) + '\n')

historical_transactions.head()
historical_transactions.compute().shape  # after computing for a few minutes the computer restarts
I expect the code to run, give me the shape of the dask dataframe, and then run the rest of the code (which I have not shown here since it is not relevant).
I found a way to get it.
Here it is:
print("(%s,%s)" % (historical_transactions.index.count().compute(),len(historical_transactions.columns)))
The first output value is the number of rows and the second is the number of columns.
Thanks
Michael
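A small aside, not part of the original answer: dask dataframes also support plain len(), which counts rows partition by partition without pulling the whole table into memory, so the same shape can be printed like this:
n_rows = len(historical_transactions)           # computed per partition, then summed
n_cols = len(historical_transactions.columns)   # column names are known without computing
print((n_rows, n_cols))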

how to skip corrupted gzips with pyspark?

I need to read a lot of gzip files from HDFS, like this:
sc.textFile('*.gz')
but some of these gzips are corrupted and raise
java.io.IOException: gzip stream CRC failure
which stops the whole job.
I read the discussion here, where someone has the same need but got no clear solution. Since it's not appropriate to implement this inside Spark itself (according to the link), is there any way to simply skip the corrupted files? There seem to be hints for Scala users, but I have no idea how to deal with it in Python.
Or can I only detect the corrupted files first and delete them?
What if I have a large number of gzips and, after a day of running, find out that the last one of them is corrupted? The whole day is wasted. And corrupted gzips are quite common.
You could manually list all of the files and then read them in a map UDF. The UDF could then have try/except blocks to handle corrupted files.
The code would look something like this:
import gzip
from os import listdir
from os.path import isfile, join

from pyspark.sql import Row

def readGzips(fileLoc):
    try:
        # one possible way to read the file; adapt the parsing to your data
        with gzip.open(fileLoc, 'rt') as f:
            content = f.read()
        return Row(filename=fileLoc, content=content)
    except Exception:
        # corrupted gzip: mark the file as failed instead of crashing the job
        return Row(failed=fileLoc)

fileList = [join(mypath, f) for f in listdir(mypath) if isfile(join(mypath, f))]
pFileList = sc.parallelize(fileList)
dataRdd = pFileList.map(readGzips).filter(lambda x: 'failed' not in x.asDict())
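If you need a DataFrame afterwards, the surviving rows can be converted in one step (a sketch that assumes an active SparkSession named spark):
df = spark.createDataFrame(dataRdd)  # only rows that decompressed successfully remain after the filter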

Big data read subsamples R

I'm most grateful for your time to read this.
I have a huge 30 GB file of 6 million records and 3000 (mostly categorical) columns in csv format. I want to bootstrap subsamples for multinomial regression, but it's proving difficult even with 64 GB of RAM in my machine and twice that in swap: the process becomes super slow and halts.
I'm thinking about generating subsample indices in R and feeding them into a system command using sed or awk, but I don't know how to do this. If someone knows of a clean way to do this using just R commands, I would be really grateful.
One problem is that I need to pick complete observations for the subsamples, that is, I need all the rows of a particular multinomial observation, and they are not the same length from observation to observation. I plan to use glmnet and then some fancy transforms to get an approximation to the multinomial case. One other point is that I don't know how to choose the sample size to fit within memory limits.
Appreciate your thoughts greatly.
R.version
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 2
minor 15.1
year 2012
month 06
day 22
svn rev 59600
language R
version.string R version 2.15.1 (2012-06-22)
nickname Roasted Marshmallows
Yoda
As themel has pointed out, R is very, very slow at reading csv files.
If you have sqlite, it really is the best approach, as it appears the data mining is not a one-off but will be done over multiple sessions, in multiple ways.
Let's look at the options we have:
Reading the csv into R (slow)
Doing this in R is about 20 times slower than a tool written in C (on my machine).
This is very slow
read.csv(file='filename.csv', header=TRUE, sep=",")
Convert to a Stata dta file beforehand and load from there
Not that great, but it should work (I have never tried it on a 30 GB file, so I cannot say for sure)
Writing a program to convert csv into dta format (if you know what you are doing)
Using the resource from http://www.stata.com/help.cgi?dta
and code from https://svn.r-project.org/R-packages/trunk/foreign/src/stataread.c to read and write
and http://sourceforge.net/projects/libcsv/
(It has been done in the past. However I have not used it so I do not know how well it performs)
Then using the foreign package (http://cran.r-project.org/web/packages/foreign/index.html), a simple
library(foreign)
whatever <- read.dta("file.dta")
would load your data
Using MySQL directly to import csv data (hard to use, but not that bad if you know SQL)
From a SQL console:
LOAD DATA LOCAL INFILE 'file.csv' INTO TABLE my_table
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
IGNORE 1 LINES; -- keep this clause only if the csv file contains a header row
Or
mysql -e "LOAD DATA INFILE 'ls.dat' INTO TABLE mytable1" mydatabase
Then play from the R console, using RMySQL, the R interface to the MySQL database
http://cran.r-project.org/web/packages/RMySQL/index.html
install.packages('RMySQL')
Then play around like
library(RMySQL)
mydb <- dbConnect(MySQL(), user=username, password=userpass, dbname=databasename, host=host)
dbListTables(mydb)
record <- dbSendQuery(mydb, "select * from whatever")
mydata <- fetch(record, n = -1)  # pull the result set into a data frame
dbClearResult(record)
dbDisconnect(mydb)
Using R to do all the sqlite/PostgreSQL/MySQL backend SQL stuff to import csv (recommended)
Download from https://code.google.com/p/sqldf/ if you do not have the package
or svn checkout http://sqldf.googlecode.com/svn/trunk/ sqldf-read-only
From the R console,
install.packages("sqldf")
# shows built in data frames
data()
# load sqldf into workspace
library(sqldf)
MyCsvFile <- file("file.csv")
Mydataframe <- sqldf("select * from MyCsvFile", dbname = "MyDatabase", file.format = list(header = TRUE, row.names = FALSE))
And off you go!
Personally, I would recommend the library(sqldf) option :-)
I think it's an exceedingly terrible idea to use CSV as your data format for such file sizes. Why not transform it into SQLite (or an "actual" database) and extract your subsets with SQL queries (using DBI/RSQLite)?
You only need to import once, and there is no need to load the entire thing into memory, because you can import CSV files directly into sqlite.
If in general you want to work with datasets that are larger than your memory, you might also want to have a look at bigmemory.
