Computer crashes when doing calculations on big datasets in Python using Dask - python-3.x

I am unable to do calculations on large datasets using Python and Dask; my computer crashes.
I have a computer with 4 GB of RAM running Debian Linux. I am trying to load some files from a Kaggle competition (the Elo Merchant competition). When I try to load the data and get the shape of the Dask dataframe, the computer crashes.
I am running the code only on my laptop. I chose Dask because it can handle large datasets. I would also like to know whether Dask is able to move data to my hard disk if it does not fit in memory. If so, do I need to activate that, or does Dask do it automatically? If I need to do it manually, how do I do it? A pointer to a tutorial on this would also be great.
I have a 250 GB solid-state drive as my hard disk, so there is space for a large dataset to spill to disk.
Please help me in this regard. My code is below.
Thank you
Michael
import dask.dataframe as dd
from dask.distributed import Client
from sklearn.externals.joblib import parallel_backend
client = Client(processes=False)
merchant = dd.read_csv('/home/michael/Elo_Merchant/merchants.csv')
new_merchant_transactions = dd.read_csv('/home/michael/Elo_Merchant/new_merchant_transactions.csv')
historical_transactions = dd.read_csv('/home/michael/Elo_Merchant/historical_transactions.csv')
train = dd.read_csv('/home/michael/Elo_Merchant/train.csv')
test = dd.read_csv('/home/michael/Elo_Merchant/test.csv')
merchant.head()
merchant.compute().shape
merchant_headers = merchant.columns.values.tolist()
for c in range(len(merchant_headers)):
    print(merchant_headers[c])
    print('--------------------')
    print("{}".format(merchant[merchant_headers[c]].value_counts().compute()) + '\n')
    print("Number of NaN values {}".format(merchant[merchant_headers[c]].isnull().sum().compute()) + '\n')
historical_transactions.head()
historical_transactions.compute().shape  # after computing for a few minutes, the computer restarts
I expect the code to run, give me the shape of the Dask dataframe, and then run the rest of the script (not shown here since it is not relevant).

I found a way to get it.
Here it is:
print("(%s,%s)" % (historical_transactions.index.count().compute(),len(historical_transactions.columns)))
The first output value is the number of rows and the second is the number of columns.
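On the spill-to-disk part of the question: with the distributed scheduler, each worker has a memory budget and can spill data it is holding to local disk as it approaches that budget, instead of crashing. A minimal sketch, assuming a local cluster; the worker count, memory limit and spill directory below are illustrative placeholders, not values from the original setup:
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Illustrative local cluster: two worker processes, each limited to ~1.5 GB.
# Workers start spilling to local_directory as they approach the limit.
cluster = LocalCluster(n_workers=2,
                       threads_per_worker=1,
                       memory_limit='1.5GB',
                       local_directory='/tmp/dask-spill')  # hypothetical path on the SSD
client = Client(cluster)

# Load lazily as before, and avoid calling .compute() on a whole dataframe,
# which pulls everything into RAM at once; aggregate or count first instead.
historical_transactions = dd.read_csv('/home/michael/Elo_Merchant/historical_transactions.csv')
print(historical_transactions.index.count().compute(), len(historical_transactions.columns))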
Thanks
Michael

Related

NetCDF uses twice the memory when reading part of data. Why? How to rectify it?

I have two fairly large datasets (~47 GB) stored in netCDF files. The datasets have three dimensions: time, s, and s1. The first dataset is of shape (3000, 2088, 1000) and the second is of shape (1566, 160000, 25). Both datasets are equal in size; the only difference is their shape. Since my RAM is only 32 GB, I am accessing the data in blocks.
For the first dataset, when I read the first ~12 GB chunk of data, the code uses almost twice that amount of memory, whereas for the second it uses only about as much memory as the chunk itself (~12 GB). Why is this happening, and how do I stop the code from using more than what is necessary?
Not using extra memory is very important for my code, because my algorithm's efficiency hinges on every step using just enough memory and no more. Because of this behaviour, my system also starts swapping heavily. I am on a Linux system, if that is useful, and I use Python 3.7.3 with netCDF 4.6.2.
This is how I am accessing the datasets:
from netCDF4 import Dataset
dat = Dataset('dataset1.nc')
dat1 = Dataset('dataset2.nc')
chunk1 = dat.variables['data'][0:750] #~12GB worth of data uses ~24GB RAM memory
chunk2 = dat1.variables['data'][0:392] #~12GB worth of data uses ~12GB RAM memory
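One thing worth checking, as an assumption rather than a confirmed diagnosis: netCDF4-python applies automatic masking and scaling on read, and a slice that comes back as a masked array, or is widened to a larger dtype than the on-disk one, can briefly require roughly two copies' worth of memory. A short sketch of how to inspect the returned slice and switch the automatic conversion off:
from netCDF4 import Dataset
import numpy as np

dat = Dataset('dataset1.nc')
var = dat.variables['data']

# Return the raw on-disk values without automatic masking/scaling.
var.set_auto_maskandscale(False)

chunk1 = var[0:750]
print(np.ma.isMaskedArray(chunk1))   # True would mean an extra mask array is carried
print(var.dtype, chunk1.dtype)       # compare on-disk vs. in-memory dtype
print(chunk1.nbytes / 1e9)           # size of the slice's data buffer in GB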

How to reduce the time taken to convert a dask dataframe to a pandas dataframe

I have a function that reads large CSV files into a Dask dataframe and then converts it to a pandas dataframe, which takes quite a lot of time. The code is:
def t_createdd(Path):
    dataframe = dd.read_csv(Path, sep=chr(1), encoding="utf-16")
    return dataframe

# Get the latest files
Array_EXT = "Export_GTT_Tea2Array_*.csv"
array_csv_files = sorted([file
                          for path, subdir, files in os.walk(PATH)
                          for file in glob(os.path.join(path, Array_EXT))])
latest_Tea2Array = array_csv_files[(len(array_csv_files) - (58 + 25)):
                                   (len(array_csv_files) - 58)]
Tea2Array_latest = t_createdd(latest_Tea2Array)

# keep only the required columns
Tea2Array = Tea2Array_latest[['Parameter_Id', 'Reading_Id', 'X', 'Value']]

P1MI3 = Tea2Array.loc[Tea2Array['parameter_id'] == 168566]
P1MI3 = P1MI3.compute()
P1MJC_main = Tea2Array.loc[Tea2Array['parameter_id'] == 168577]
P1MJC_old = P1MJC_main.compute()
P1MI3 = P1MI3.compute() and P1MJC_old = P1MJC_main.compute() take around 10 and 11 minutes respectively to execute. Is there any way to reduce the time?
I would encourage you to consider, with reference to the Dask documentation, why you would expect the process to be any faster than using Pandas alone.
Consider:
file access may be from several threads, but you only have one disk interface, which is a bottleneck and likely performs much better reading sequentially than trying to read several files in parallel
reading CSVs is CPU-heavy and needs the Python GIL, so the multiple threads will not actually be running in parallel
when you compute, you materialise the whole dataframe. It is true that you appear to be selecting a single row in each case, but Dask has no way to know in which file/part it is.
you call compute twice, but could have combined them: Dask works hard to evict data from memory that is not currently needed by any computation, so computing separately does double the work. By calling compute on both outputs at once, you would roughly halve the time.
Further remarks:
obviously you would do much better if you knew which partition contained what
you can get around the GIL using processes, e.g., Dask's distributed scheduler
if you only need certain columns, do not bother to load everything and then subselect; include those columns right in the read_csv call, saving a lot of time and memory (true for pandas or Dask). A sketch combining these points follows below.
To compute both lazy things at once:
dask.compute(P1MI3, P1MJC_main)
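Putting those remarks together, here is a minimal sketch of what the read-and-filter could look like. The separator, encoding, column names and file list are taken from the question; the usecols argument, the process-based client, and the single dask.compute call are the suggested changes, not the original code:
import dask
import dask.dataframe as dd
from dask.distributed import Client

# A process-based client side-steps the GIL for CPU-heavy CSV parsing.
client = Client(processes=True)

# Load only the needed columns; usecols is forwarded to pandas.read_csv.
# (The question mixes 'Parameter_Id' and 'parameter_id'; use whichever
# actually matches the files.)
Tea2Array = dd.read_csv(latest_Tea2Array,          # file list from the question
                        sep=chr(1),
                        encoding="utf-16",
                        usecols=['Parameter_Id', 'Reading_Id', 'X', 'Value'])

P1MI3_lazy = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168566]
P1MJC_lazy = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168577]

# One pass over the data for both results instead of two separate computes.
P1MI3, P1MJC_old = dask.compute(P1MI3_lazy, P1MJC_lazy)
How much this helps still depends on how the parameter values are spread across the files, as noted above.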

Reading a 20 GB CSV file in Python

I am trying to read a 20 GB file in Python from a remote path. The code below reads the file in chunks, but if the connection to the remote path is lost for any reason, I have to restart the entire reading process. Is there a way I can continue from the last row I read and keep appending to the list I am trying to create? Here is my code:
import pandas as pd
from tqdm import tqdm

chunksize = 100000
df_list = []  # list to hold the batch dataframes
for df_chunk in tqdm(pd.read_csv(pathtofile, chunksize=chunksize, engine='python')):
    df_list.append(df_chunk)
train_df = pd.concat(df_list)
Do you have much more than 20 GB of RAM? Because you're reading the entire file into RAM and representing it as Python objects. That df_list.append(df_chunk) is the culprit.
What you need to do is:
read it by smaller pieces (you already do);
process it piece by piece;
discard the old piece after processing. Python's garbage collection will do it for you unless you keep a reference to the spent chunk, as you currently do in df_list.
Note that you can keep the intermediate / summary data in RAM the whole time. Just don't keep the entire input in RAM the whole time.
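A minimal sketch of that pattern, reusing pathtofile from the question and assuming the end goal is some per-chunk summary rather than the full dataframe (the describe() call is just a placeholder for whatever processing you actually need):
import pandas as pd
from tqdm import tqdm

chunksize = 100000
summaries = []  # small per-chunk results only, never the raw chunks

for df_chunk in tqdm(pd.read_csv(pathtofile, chunksize=chunksize, engine='python')):
    summary = df_chunk.describe()   # placeholder for the real per-chunk work
    summaries.append(summary)
    # the chunk itself is dropped here and garbage-collected on the next iteration

# combine the small summaries at the end; the 20 GB of raw rows never
# accumulate in RAM
result = pd.concat(summaries)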
Or get 64GB / 128GB RAM, whichever is faster for you. Sometimes just throwing more resources at a problem is faster.

Understanding Dask's Task Stream

I'm running Dask locally using the distributed scheduler on my machine with 8 cores. On initialization I see the client summary (screenshot omitted), which looks correct, but I'm confused by the task stream in the diagnostics dashboard (screenshot omitted):
I was expecting 8 rows corresponding to the 8 workers/cores. Is that incorrect?
Thanks
AJ
I've added the code I'm running:
import dask.dataframe as dd
from dask.distributed import Client, progress
client = Client()
progress(client)
# load datasets
trd = (dd.read_csv('trade_201811*.csv', compression='gzip',
                   blocksize=None, dtype={'Notional': 'float64'})
         .assign(timestamp=lambda x: dd.to_datetime(x.timestamp.str.replace('D', 'T')))
         .set_index('timestamp', sorted=True))
Each line corresponds to a single thread. Some more sophisticated Dask operations will start up additional threads; this happens particularly when tasks launch other tasks, which is common especially in machine learning workloads.
My guess is that you're using one of the following approaches:
dask.distributed.get_client or dask.distributed.worker_client
Scikit-Learn's Joblib
Dask-ML
If so, the behavior that you're seeing is normal. The task stream plot will look a little odd, yes, but hopefully it is still interpretable.
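As a side note (not from the original answer), one way to make the expected row count explicit is to start the client with a fixed worker/thread layout; the numbers below are illustrative:
from dask.distributed import Client

# 8 single-threaded worker processes: with one thread per worker, the task
# stream should show one row per worker (unless tasks launch further tasks).
client = Client(n_workers=8, threads_per_worker=1)
print(client)  # summarises workers, threads and memory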

Big data read subsamples R

I'm most grateful for your time to read this.
I have an enormous 30 GB CSV file of 6 million records and 3000 (mostly categorical) columns. I want to bootstrap subsamples for multinomial regression, but it's proving difficult even with the 64 GB of RAM in my machine and twice that in swap; the process becomes extremely slow and halts.
I'm thinking about generating subsample indices in R and feeding them into a system command using sed or awk, but I don't know how to do this. If someone knows of a clean way to do this using just R commands, I would be really grateful.
One problem is that I need to pick complete observations for the subsamples, that is, I need all the rows of a particular multinomial observation, and these are not the same length from observation to observation. I plan to use glmnet and then some fancy transforms to get an approximation to the multinomial case. Another point is that I don't know how to choose a sample size that fits within the memory limits.
Appreciate your thoughts greatly.
R.version
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 2
minor 15.1
year 2012
month 06
day 22
svn rev 59600
language R
version.string R version 2.15.1 (2012-06-22)
nickname Roasted Marshmallows
Yoda
As themel has pointed out, R is very slow at reading CSV files.
If you have SQLite, it really is the best approach, as it appears the data mining is not a one-time job but will happen over multiple sessions, in multiple ways.
Let's look at the options we have.
Reading the CSV into R (slow)
Doing this in R is about 20 times slower than a tool written in C (on my machine).
This is very slow:
read.csv(file='filename.csv', header=TRUE, sep=",")
Convert to a Stata .dta file beforehand and load from there
Not that great, but it should work (I have never tried it on a 30 GB file, so I cannot say for sure).
This means writing a program to convert the CSV into .dta format (if you know what you are doing),
using the format description from http://www.stata.com/help.cgi?dta
and code from https://svn.r-project.org/R-packages/trunk/foreign/src/stataread.c to read and write,
and http://sourceforge.net/projects/libcsv/
(it has been done in the past, but I have not used it, so I do not know how well it performs).
Then, using the foreign package (http://cran.r-project.org/web/packages/foreign/index.html), a simple
library(foreign)
whatever <- read.dta("file.dta")
would load your data
Using MySQL directly to import the CSV data (hard to use, but not that bad if you know SQL)
From a SQL console:
LOAD DATA LOCAL INFILE 'file.csv' INTO TABLE my_table
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
IGNORE 1 LINES  -- if the csv file contains a header row
Or
mysql -e "LOAD DATA INFILE 'ls.dat' INTO TABLE mytable1" mydatabase
Then work from the R console, using RMySQL, the R interface to the MySQL database:
http://cran.r-project.org/web/packages/RMySQL/index.html
install.packages('RMySQL')
Then play around like this:
mydb = dbConnect(MySQL(), user=username, password=userpass, dbname=databasename, host=host)
dbListTables(mydb)
record <- dbSendQuery(mydb, "select * from whatever")
dbClearResult(record)
dbDisconnect(mydb)
Using R to do all the SQLite/PostgreSQL/MySQL backend SQL work to import the CSV (recommended)
Download from https://code.google.com/p/sqldf/ if you do not have the package,
or svn checkout http://sqldf.googlecode.com/svn/trunk/ sqldf-read-only
From the R console,
install.packages("sqldf")
# shows built in data frames
data()
# load sqldf into workspace
library(sqldf)
MyCsvFile <- file("file.csv")
Mydataframe <- sqldf("select * from MyCsvFile", dbname = "MyDatabase", file.format = list(header = TRUE, row.names = FALSE))
And off you go!
Personally, I would recommend the library(sqldf) option :-)
I think it's a terrible idea to use CSV as your data format for such file sizes. Why not transform it into an SQLite (or other "actual") database and extract your subsets with SQL queries (using DBI/RSQLite2)?
You need to import only once, and there is no need to load the entire thing into memory, because you can import CSV files directly into SQLite.
If in general you want to work with datasets that are larger than your memory, you might also want to have a look at bigmemory.
