I'm most grateful for your time to read this.
I have an enormous 30 GB CSV file of 6 million records and 3000 (mostly categorical) columns. I want to bootstrap subsamples for multinomial regression, but it's proving difficult: even with 64 GB of RAM in my machine and twice that in swap, the process becomes extremely slow and halts.
I'm thinking about generating subsample indices in R and feeding them into a system command using sed or awk, but I don't know how to do this. If someone knew of a clean way to do this using just R commands, I would be really grateful.
One problem is that I need to pick complete observations for each subsample, that is, I need all the rows belonging to a particular multinomial observation - observations are not all the same length. I plan to use glmnet and then some fancy transforms to get an approximation to the multinomial case. One other point is that I don't know how to choose a sample size that fits within my memory limits.
Appreciate your thoughts greatly.
R.version
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 2
minor 15.1
year 2012
month 06
day 22
svn rev 59600
language R
version.string R version 2.15.1 (2012-06-22)
nickname Roasted Marshmallows
Yoda
As themel has pointed out, R is very slow at reading CSV files.
If you have SQLite, it really is the best approach, since it appears the data mining is not a one-time job but will happen over multiple sessions, in multiple ways.
Let's look at the options we have.
Reading the CSV into R (slow)
Doing this in R is about 20 times slower than a tool written in C (on my machine).
This is very slow:
read.csv(file='filename.csv', header=TRUE, sep=",")
Convert to a Stata dta file beforehand and load from there
Not that great, but it should work (I have never tried it on a 30 GB file, so I cannot say for sure).
Writing a program to convert CSV into dta format (if you know what you are doing)
Using the resource from http://www.stata.com/help.cgi?dta
and code from https://svn.r-project.org/R-packages/trunk/foreign/src/stataread.c to read and write
and http://sourceforge.net/projects/libcsv/
(It has been done in the past. However I have not used it so I do not know how well it performs)
Then using the foreign package (http://cran.r-project.org/web/packages/foreign/index.html), a simple
library(foreign)
whatever <- read.dta("file.dta")
would load your data
Using MySQL directly to import the csv data (hard to use, however it is not that bad if you know SQL)
From a SQL console:
LOAD DATA LOCAL INFILE 'file.csv' INTO TABLE my_table
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- only if the csv file contains a header row
Or
mysql -e "LOAD DATA INFILE 'ls.dat' INTO TABLE mytable1" mydatabase
Then play from the R console, using RMySQL R interface to the MySQL database
http://cran.r-project.org/web/packages/RMySQL/index.html
install.packages('RMySQL')
Then play around like
mydb = dbConnect(MySQL(), user=username, password=userpass, dbname=databasename, host=host)
dbListTables(mydb)
record <- dbSendQuery(mydb, "select * from whatever")
mydata <- fetch(record, n = -1)  # fetch all rows of the result into a data frame
dbClearResult(record)
dbDisconnect(mydb)
Using R to do all the SQLite/PostgreSQL/MySQL backend SQL stuff to import the csv (recommended)
Download, from https://code.google.com/p/sqldf/ if you do not have the package
or svn checkout http://sqldf.googlecode.com/svn/trunk/ sqldf-read-only
From the R console,
install.packages("sqldf")
# shows built in data frames
data()
# load sqldf into workspace
library(sqldf)
MyCsvFile <- file("file.csv")
Mydataframe <- sqldf("select * from MyCsvFile", dbname = "MyDatabase", file.format = list(header = TRUE, row.names = FALSE))
And off you go!
Personally, I would recommend the library(sqldf) option :-)
I think it's an exceedingly bad idea to use CSV as your data format at such file sizes - why not transform it into an SQLite database (or an "actual" database) and extract your subsets with SQL queries (using DBI/RSQLite)?
You need to import only once, and there is no need to load the entire thing into memory because you can directly import CSV files into sqlite.
If in general you want to work with datasets that are larger than your memory, you might also want to have a look at bigmemory.
Related
So if I have a csv file as follows:
User Gender
A M
B F
C F
Then I want to write another csv file with rows shuffled like so (as an example):
User Gender
C F
A M
B F
My problem is that I don't know how to randomly select rows and ensure that I get every row from the original csv file. For reference, my csv file is around 3 GB. If I load my entire dataset into a dataframe and use the random package to shuffle it, my PC crashes due to RAM use.
Probably the easiest (and fastest) is to use shuf in bash!
shuf words.txt > shuffled_words.txt
(I know you asked for a Python solution, but I am going to assume this is still a better answer)
To programmatically do it from Python:
import sh
sh.shuf("words.txt", out="shuffled_words.txt")
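If you would rather not pull in the sh dependency, the standard library's subprocess module can call the same tool; a minimal sketch (file names follow the example above):
import subprocess

with open("shuffled_words.txt", "w") as out:
    # stream shuf's output straight into the new file
    subprocess.run(["shuf", "words.txt"], stdout=out, check=True)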
Create an array of file positions of line starts, by reading the file once with random access or as a memory-mapped file. The array has one extra entry holding the file length, so line i occupies the byte range [array[i], array[i+1]).
Shuffle the indices 0 .. number of lines - 1.
Now you can use random access positioning (seek) to read a line buffer.
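A minimal sketch of this idea in Python (file names are illustrative; only the offset array has to fit in memory, and the CSV header is kept as the first line):
import random

# Build the array of line-start byte offsets by scanning the file once.
offsets = []
pos = 0
with open("data.csv", "rb") as f:
    for line in f:
        offsets.append(pos)
        pos += len(line)

header_pos, body = offsets[0], offsets[1:]
random.shuffle(body)  # shuffle the data rows, keep the header first

with open("data.csv", "rb") as src, open("shuffled.csv", "wb") as dst:
    for p in [header_pos] + body:
        src.seek(p)           # jump to the start of that line
        dst.write(src.readline())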
You can use the chunksize argument of read_csv to read the csv in chunks:
import pandas
df_chunks = pandas.read_csv("your_csv_name.csv", chunksize=10)
Then you can shuffle each chunk on its own, so it takes less memory:
new_chunks = []
for chunk in df_chunks:
    new_chunks.append(chunk.sample(frac=1))  # shuffle the rows within this chunk
Then you can concatenate them and save the result into another csv:
new_df = pandas.concat(new_chunks)
new_df.to_csv("your_new_csv_name.csv", index=False)
If you have memory issues while you build new_chunks, don't forget to release chunks you no longer need, as you don't want them left in RAM for no reason; you can do that with
chunk = None
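If even the concatenated frame is too large to hold, a variant of the same idea appends each shuffled chunk straight to the output file. A sketch (the chunk size is an illustrative value; like the loop above, this only shuffles rows within each chunk, not across the whole file):
import pandas

first = True
for chunk in pandas.read_csv("your_csv_name.csv", chunksize=10000):
    shuffled = chunk.sample(frac=1)  # shuffle rows within this chunk only
    shuffled.to_csv("your_new_csv_name.csv",
                    mode="w" if first else "a",  # overwrite on the first chunk, append afterwards
                    header=first, index=False)
    first = False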
I am unable to do calculations on large datasets using python-Dask. My computer crashes.
I have a computer with 4 GB of RAM running Linux Debian. I am trying to load some files from a Kaggle competition (the Elo Merchant competition), and when I try to load the data and get the shape of the Dask dataframe, the computer crashes.
I am running the code only on my laptop. I chose Dask because it can handle large datasets. I would also like to know whether Dask is able to move computations to my hard disk if the data does not fit in memory. If so, do I need to activate such a thing, or does Dask do it automatically? If I need to do it manually, how do I do it? If there is a tutorial on this, that would be great as well.
I have a 250 GB solid-state drive as my hard disk, so there would be space for a large dataset to spill to disk.
Please help me on this regard. My code is as below.
Thank you
Michael
import dask.dataframe as dd
from dask.distributed import Client
from sklearn.externals.joblib import parallel_backend
client = Client(processes=False)
merchant = dd.read_csv('/home/michael/Elo_Merchant/merchants.csv')
new_merchant_transactions = dd.read_csv('/home/michael/Elo_Merchant/new_merchant_transactions.csv')
historical_transactions = dd.read_csv('/home/michael/Elo_Merchant/historical_transactions.csv')
train = dd.read_csv('/home/michael/Elo_Merchant/train.csv')
test = dd.read_csv('/home/michael/Elo_Merchant/test.csv')
merchant.head()
merchant.compute().shape
merchant_headers = merchant.columns.values.tolist()
for c in range(len(merchant_headers)):
    print(merchant_headers[c])
    print('--------------------')
    print("{}".format(merchant[merchant_headers[c]].value_counts().compute()) + '\n')
    print("Number of NaN values {}".format(merchant[merchant_headers[c]].isnull().sum().compute()) + '\n')
historical_transactions.head()
historical_transactions.compute().shape #after computing for a few minutes computer restarts.
I expect the code to run and give me the shape of the Dask dataframe, and then run the rest of the code (which I have not shown here since it is not relevant).
I found a way to get it.
Here it is:
print("(%s,%s)" % (historical_transactions.index.count().compute(),len(historical_transactions.columns)))
The first output value is the number of rows and the second is the number of columns.
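On the other question, whether Dask can move work to disk: with the distributed scheduler, each worker spills intermediate data to local disk once it approaches its configured memory limit, so you don't have to manage it manually. A minimal sketch (the worker count and memory limit are just illustrative values):
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Workers spill to local disk (the SSD here) as they approach memory_limit.
cluster = LocalCluster(n_workers=2, memory_limit="1GB")  # illustrative values
client = Client(cluster)

historical_transactions = dd.read_csv('/home/michael/Elo_Merchant/historical_transactions.csv')
print(len(historical_transactions), len(historical_transactions.columns))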
Thanks
Michael
I can't find anything specific to this, and I can't seem to get any combination of dask or pool to do what I need without an error.
What I need is to read a dozen or more txt files (in four folders, hence the recursive glob) with a specific naming convention and then merge them all together. All the files have the same column names, but each file has a different length.
Here is how I can do it now and have it working but want to run in parallel:
path1 = my specific filepath
file_list = glob.glob(os.path.join(path1, "*\\XT*.txt"), recursive=True)
df_each = (pd.read_csv(f, sep = '|') for f in file_list)
df = pd.concat(df_each, ignore_index = True)
Then there are a few little things that need to be cleaned up and changed which I have done like this:
df.replace({"#": ""}, regex=True, inplace=True)
df.columns = df.columns.str.replace("#", "")
The end goal for all the files is a summary of the sums of each column, grouped in a specific way, which is done like this:
df_calc = df.groupby(['Name1', 'Name2']).sum()
Right now it takes about 30 minutes to run, and I'm looking to run it in parallel to cut that time down. Thanks!
You mention in a comment that your CPU utilization is low, not near 100%. This means that you are being limited by disk throughput or memory bandwidth. So assigning more CPU cores to work on this task will only slow it down. Instead, you should focus on reducing the IO and the memory consumption.
Using the usecols option of pd.read_csv() is a great start. Also, try passing engine='c' and an explicit dtype to avoid Pandas having to guess the dtype each time.
You might also benefit from an SSD.
You should also consider storing your data in a more efficient format. For example the format produced by np.save() and friends. This could speed up loading by 100x.
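A sketch of those suggestions applied to the question's code (the Value1 column and the dtypes are assumptions for illustration, since only Name1 and Name2 appear in the question; writing to Parquet assumes pyarrow is installed):
import glob
import os
import pandas as pd

path1 = "my_specific_filepath"  # placeholder, as in the question
file_list = glob.glob(os.path.join(path1, "**", "XT*.txt"), recursive=True)

usecols = ["Name1", "Name2", "Value1"]                 # hypothetical subset of columns
dtypes = {"Name1": "category", "Name2": "category",
          "Value1": "float64"}                         # assumed dtypes, so pandas doesn't have to guess

# Read only the needed columns with explicit dtypes, then combine.
df_each = (pd.read_csv(f, sep="|", engine="c", usecols=usecols, dtype=dtypes)
           for f in file_list)
df = pd.concat(df_each, ignore_index=True)

df.to_parquet("combined.parquet")  # optional: a binary columnar copy that reloads much faster than CSV
df_calc = df.groupby(["Name1", "Name2"]).sum()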
I have a large chunk of data (about 10M rows) in Amazon Redshift that I want to pull into a Pandas dataframe and store in a pickle file. However, it throws an "Out of Memory" exception for obvious reasons, because of the size of the data. I have tried a lot of other things, like sqlalchemy, but haven't been able to crack the problem. Can anyone suggest a better way or code to get through it?
My current (simple) code snippet goes as below:
import psycopg2
import pandas as pd
import numpy as np
cnxn = psycopg2.connect(dbname=<mydatabase>, host='my_redshift_Server_Name', port='5439', user=<username>, password=<pwd>)
sql = "Select * from mydatabase.mytable"
df = pd.read_sql(sql, cnxn)
pd.to_pickle(df, 'Base_Data.pkl')
print(df.head(50))
cnxn.close()
1) Find the row count of the table and the maximum chunk of the table that you can pull, by adding order by [column] limit [number] offset 0 and increasing the limit to a reasonably large number.
2) Add a loop that produces the SQL with the limit you found and an increasing offset, i.e. if you can pull 10k rows your statements would be:
... limit 10000 offset 0;
... limit 10000 offset 10000;
... limit 10000 offset 20000;
until you reach the table row count
3) In the same loop, append every newly obtained set of rows to your dataframe (see the sketch after this list).
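A minimal sketch of that loop, reusing the psycopg2 connection from the question (the chunk size and the 'id' ordering column are illustrative):
import pandas as pd

chunk_size = 10000  # the largest chunk you found you can pull comfortably
offset = 0
chunks = []
while True:
    sql = ("select * from mydatabase.mytable "
           "order by id "                          # 'id' is a hypothetical ordering column
           f"limit {chunk_size} offset {offset}")
    chunk = pd.read_sql(sql, cnxn)                 # cnxn from the question's snippet
    if chunk.empty:
        break
    chunks.append(chunk)
    offset += chunk_size

df = pd.concat(chunks, ignore_index=True)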
P.S. This will work assuming you don't run into any memory/disk issues on the client end, which I can't guarantee, since you already have such an issue on a cluster that is likely higher-grade hardware. To avoid that problem, you could simply write a new file on every iteration instead of appending.
Also, the whole approach is probably not right. You'd be better off unloading the table to S3, which is pretty quick because the data is copied from every node independently, and then doing whatever is needed against the flat file on S3 to transform it into the final format you need.
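For reference, a rough sketch of such an unload, run over the same psycopg2 connection (the bucket name and IAM role ARN are placeholders):
cur = cnxn.cursor()  # cnxn from the question's snippet
cur.execute("""
    UNLOAD ('select * from mydatabase.mytable')
    TO 's3://my-bucket/mytable_part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    DELIMITER ',' HEADER GZIP ALLOWOVERWRITE
""")
cnxn.commit()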
If you're using pickle to just transfer the data somewhere else, I'd repeat the suggestion from AlexYes's answer - just use S3.
But if you want to be able to work with the data locally, you have to limit yourself to the algorithms that do not require all data to work.
In this case, I would suggest something like HDF5 or Parquet for data storage and Dask for data processing since it doesn't require all the data to reside in memory - it can work in chunks and in parallel. You can migrate your data from Redshift using this code:
from dask import dataframe as dd
d = dd.read_sql_table(my_table, my_db_url, index_col=my_table_index_col)
d.to_hdf('Base_Data.hd5', key='data')
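Reading it back later stays lazy as well; a small follow-up sketch:
# load the stored data lazily and only compute what you ask for
d = dd.read_hdf('Base_Data.hd5', key='data')
print(d.head())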
I have been experimenting with different ways to filter a typed dataset, and it turns out the performance can be quite different.
The dataset was created from 1.6 GB of data with 33 columns and 4,226,047 rows. It is created by loading csv data and mapping it to a case class.
val df = spark.read.csv(csvFile).as[FireIncident]
A filter on UnitId = 'B02' should return 47980 rows. I tested three ways as below:
1) Use typed column (~ 500 ms on local host)
df.where($"UnitID" === "B02").count()
2) Use temp table and sql query (~ same as option 1)
df.createOrReplaceTempView("FireIncidentsSF")
spark.sql("SELECT * FROM FireIncidentsSF WHERE UnitID='B02'").count()
3) Use a strongly typed class field (14,987 ms, i.e. 30 times as slow)
df.filter(_.UnitID.orNull == "B02").count()
I tested it again with the Python API on the same data set; the timing was 17,046 ms, comparable to the performance of option 3 of the Scala API.
df.filter(df['UnitID'] == 'B02').count()
Could someone shed some light on how 3) and the python API are executed differently from the first two options?
It's because of what happens in option 3.
In the first two, Spark doesn't need to deserialize the whole Java/Scala object - it just looks at the one column and moves on.
In the third, since you're using a lambda function, Spark can't tell that you only want the one field, so it deserializes all 33 fields for every row just so you can check that one field.
I'm not sure why the fourth (the Python version) is so slow. It seems like it should work the same way as the first.
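A related way to see the column-expression vs. opaque-function distinction from the Python side is a rough sketch like this (the file path is illustrative): a column expression stays inside Catalyst, while an opaque Python function forces the values out to a Python worker row by row.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("fire_incidents.csv", header=True)  # illustrative path

# Column expression: Spark only has to look at the UnitID column.
df.where(col("UnitID") == "B02").count()

# Python UDF: each UnitID value is serialized to a Python worker and back.
is_b02 = udf(lambda u: u == "B02", BooleanType())
df.where(is_b02(col("UnitID"))).count()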
When running Python, your driver code runs in a separate Python process and talks to Spark's JVM through Py4J rather than being compiled to JVM bytecode. When using the Scala API, your code runs natively on the JVM, so you cut out that extra Python-to-JVM layer entirely.