I have two fairly large datasets (~47 GB each) stored in netCDF files. Each dataset has three dimensions: time, s, and s1. The first is of shape (3000, 2088, 1000) and the second is of shape (1566, 160000, 25), so they hold the same number of elements; only their shapes differ. Since my RAM is only 32 GB, I am accessing the data in blocks.
For the first dataset, reading a ~12 GB chunk consumes almost twice that much memory (~24 GB), whereas for the second, reading a chunk uses only about as much memory as the chunk itself (~12 GB). Why is this happening, and how do I stop the code from using more than is necessary?
Not using more memory than necessary is very important for my code, because my algorithm's efficiency hinges on every step using just enough memory and no more. This behaviour also makes my system swap like crazy. I am on Linux, if that information is useful, and I use Python 3.7.3 with netCDF 4.6.2.
This is how I am accessing the datasets:
from netCDF4 import Dataset
dat = Dataset('dataset1.nc')
dat1 = Dataset('dataset2.nc')
chunk1 = dat.variables['data'][0:750] #~12GB worth of data uses ~24GB RAM memory
chunk2 = dat1.variables['data'][0:392] #~12GB worth of data uses ~12GB RAM memory
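For context, the block-wise access looks roughly like this; a minimal sketch, assuming the variable name 'data' from above and a block of 750 time steps (~12 GB for the first dataset's shape):
from netCDF4 import Dataset

dat = Dataset('dataset1.nc')
var = dat.variables['data']           # shape (3000, 2088, 1000)
block = 750                           # ~12 GB per block for this shape
for start in range(0, var.shape[0], block):
    chunk = var[start:start + block]  # netCDF4 returns a (possibly masked) numpy array
    # ... process chunk ...
    del chunk                         # drop the reference before the next read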
Reading large text/CSV files in Julia takes a long time compared to Python. Here are the times to read a file that is 486.6 MB in size and has 153,895 rows and 644 columns.
Python 3.3 example
import pandas as pd
import time
start=time.time()
myData=pd.read_csv("C:\\myFile.txt",sep="|",header=None,low_memory=False)
print(time.time()-start)
Output: 19.90
R 3.0.2 example
system.time(myData<-read.delim("C:/myFile.txt",sep="|",header=F,
stringsAsFactors=F,na.strings=""))
Output:
User System Elapsed
181.13 1.07 182.32
Julia 0.2.0 (Julia Studio 0.4.4) example # 1
using DataFrames
timing = @time myData = readtable("C:/myFile.txt",separator='|',header=false)
Output:
elapsed time: 80.35 seconds (10319624244 bytes allocated)
Julia 0.2.0 (Julia Studio 0.4.4) example # 2
timing = @time myData = readdlm("C:/myFile.txt",'|',header=false)
Output:
elapsed time: 65.96 seconds (9087413564 bytes allocated)
Julia is faster than R, but quite slow compared to Python. What can I do differently to speed up reading a large text file?
A separate issue: the in-memory size is 18x the on-disk file size in Julia, but only 2.5x for Python. In Matlab, which I have found to be the most memory-efficient for large files, it is 2x the on-disk size. Any particular reason for the large in-memory size in Julia?
The best answer is probably that I'm not as good a programmer as Wes.
In general, the code in DataFrames is much less well-optimized than the code in Pandas. I'm confident that we can catch up, but it will take some time as there's a lot of basic functionality that we need to implement first. Since there's so much that needs to be built in Julia, I tend to focus on doing things in three parts: (1) build any version, (2) build a correct version, (3) build a fast, correct version. For the work I do, Julia often doesn't offer any versions of essential functionality, so my work gets focused on (1) and (2). As more of the tools I need get built, it'll be easier to focus on performance.
As for memory usage, I think the answer is that we use a set of data structures when parsing tabular data that's much less efficient than those used by Pandas. If I knew the internals of Pandas better, I could list off places where we're less efficient, but for now I'll just speculate that one obvious failing is that we're reading the whole dataset into memory rather than grabbing chunks from disk. This certainly can be avoided and there are issues open for doing so. It's just a matter of time.
On that note, the readtable code is fairly easy to read. The most certain way to get readtable to be faster is to whip out the Julia profiler and start fixing the performance flaws it uncovers.
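For reference, a minimal profiling sketch with the built-in Profile stdlib (using readtable as in the question; on modern DataFrames versions it has been superseded by CSV.jl):
using Profile, DataFrames

@profile readtable("myFile.txt", separator = '|', header = false)
Profile.print()   # prints a tree showing where the time was spent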
There is a relatively new Julia package called CSV.jl by Jacob Quinn that provides a much faster CSV parser, in many cases on par with pandas: https://github.com/JuliaData/CSV.jl
Note that the "n bytes allocated" output from @time is the total size of all allocated objects, ignoring how many of them might have been freed. This number is often much higher than the final size of live objects in memory. I don't know if this is what your memory size estimate is based on, but I wanted to point this out.
I've found a few things that can partially help this situation.
Using the readdlm() function in Julia seems to work considerably faster (e.g. 3x on a recent trial) than readtable(). Of course, if you want the DataFrame object type, you'll then need to convert to it (see the sketch below), which may eat up most or all of the speed improvement.
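A rough sketch of that readdlm-then-convert route on current Julia versions (column names are auto-generated, and the conversion may copy the data):
using DelimitedFiles, DataFrames

raw = readdlm("myFile.txt", '|')   # plain matrix, fast but loosely typed
df  = DataFrame(raw, :auto)        # wrap in a DataFrame with names x1, x2, ...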
Specifying dimensions of your file can make a BIG difference, both in speed and in memory allocations. I ran this trial reading in a file that is 258.7 MB on disk:
julia> @time Data = readdlm("MyFile.txt", '\t', Float32, skipstart = 1);
19.072266 seconds (221.60 M allocations: 6.573 GB, 3.34% gc time)
julia> @time Data = readdlm("MyFile.txt", '\t', Float32, skipstart = 1, dims = (File_Lengths[1], 62));
10.309866 seconds (87 allocations: 528.331 MB, 0.03% gc time)
The type specification for your object matters a lot. For instance, if your data has strings in it, then the array you read in will be of type Any, which is expensive memory-wise. If memory is really an issue, you may want to consider preprocessing your data by first converting the strings to integers, doing your computations, and then converting back. Also, if you don't need a ton of precision, using the Float32 type instead of Float64 can save a LOT of space. You can specify this when reading the file in, e.g.:
Data = readdlm("file.csv", ',', Float32)
Regarding memory usage, I've found in particular that the PooledDataArray type (from the DataArrays package) can be helpful in cutting down memory usage if your data has a lot of repeated values. The time to convert to this type is relatively large, so this isn't a time saver per se, but at least helps reduce the memory usage somewhat. E.g. when loading a data set with 19 million rows and 36 columns, 8 of which represented categorical variables for statistical analysis, this reduced the memory allocation of the object from 5x its size on disk to 4x its size. If there are even more repeated values, the memory reduction can be even more significant (I've had situations where the PooledDataArray cuts memory allocation in half).
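Illustrative sketch only: PooledDataArray came from the old DataArrays package; on current Julia the closest equivalents are PooledArrays.PooledArray or a CategoricalArray. The idea is the same, store each distinct value once plus small integer references:
using PooledArrays

col    = repeat(["low", "medium", "high"], 1_000_000)   # lots of repeated values
pooled = PooledArray(col)     # distinct values stored once + integer refs
Base.summarysize(pooled)      # typically much smaller than Base.summarysize(col)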
It can also sometimes help to run the garbage collector, gc() (GC.gc() on Julia 0.7 and later), after loading and formatting data to clear out any unneeded RAM, though generally Julia will do this automatically pretty well.
Still though, despite all of this, I'll be looking forward to further developments on Julia to enable faster loading and more efficient memory usage for large data sets.
Let us first create the kind of file you are talking about, to provide reproducibility:
open("myFile.txt", "w") do io
foreach(i -> println(io, join(i+1:i+644, '|')), 1:153895)
end
Now I read this file in using Julia 1.4.2 and CSV.jl 0.7.1.
Single threaded:
julia> @time CSV.File("myFile.txt", delim='|', header=false);
4.747160 seconds (1.55 M allocations: 1.281 GiB, 4.29% gc time)
julia> @time CSV.File("myFile.txt", delim='|', header=false);
2.780213 seconds (13.72 k allocations: 1.206 GiB, 5.80% gc time)
and using e.g. 4 threads:
julia> @time CSV.File("myFile.txt", delim='|', header=false);
4.546945 seconds (6.02 M allocations: 1.499 GiB, 5.05% gc time)
julia> @time CSV.File("myFile.txt", delim='|', header=false);
0.812742 seconds (47.28 k allocations: 1.208 GiB)
In R it is:
> system.time(myData<-read.delim("myFile.txt",sep="|",header=F,
+ stringsAsFactors=F,na.strings=""))
user system elapsed
28.615 0.436 29.048
In Python (Pandas) it is:
>>> import pandas as pd
>>> import time
>>> start=time.time()
>>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
>>> print(time.time()-start)
25.95710587501526
Now if we test fread from R's data.table package (which is fast) we get:
> system.time(fread("myFile.txt", sep="|", header=F,
stringsAsFactors=F, na.strings="", nThread=1))
user system elapsed
1.043 0.036 1.082
> system.time(fread("myFile.txt", sep="|", header=F,
stringsAsFactors=F, na.strings="", nThread=4))
user system elapsed
1.361 0.028 0.416
So in this case the summary is:
despite the compilation cost of CSV.File in Julia on the first run, it is significantly faster than base R or Python (Pandas)
it is comparable in speed to fread in R (slightly slower in this case, but other benchmarks show cases where it is faster)
EDIT: following a request, I have added a benchmark for a small file (10 columns, 100,000 rows), Julia vs. Pandas.
Data preparation step:
open("myFile.txt", "w") do io
foreach(i -> println(io, join(i+1:i+10, '|')), 1:100_000)
end
CSV.jl, single threaded:
julia> @time CSV.File("myFile.txt", delim='|', header=false);
1.898649 seconds (1.54 M allocations: 93.848 MiB, 1.48% gc time)
julia> @time CSV.File("myFile.txt", delim='|', header=false);
0.029965 seconds (248 allocations: 17.037 MiB)
Pandas:
>>> import pandas as pd
>>> import time
>>> start=time.time()
>>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
>>> print(time.time()-start)
0.07587623596191406
Conclusions:
the compilation cost is a one-time cost that has to be paid, and it is roughly constant (it does not depend on the size of the file you want to read in)
for small files CSV.jl is faster than Pandas (if we exclude the compilation cost)
Now, if you would like to avoid paying the compilation cost in every fresh Julia session, that is doable with https://github.com/JuliaLang/PackageCompiler.jl.
From my experience, if you are doing data-science work where you read in, say, thousands of CSV files, I have no problem waiting 2 seconds for compilation if it later saves me hours; it takes more than 2 seconds just to write the code that reads in the files.
Of course, if you write a script that does little work and terminates after it is done, that is a different use case, as compilation time would then be the majority of the computational cost. In that case, PackageCompiler.jl is the strategy I use; a sketch follows.
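A minimal sketch of baking the packages into a custom system image (the output file name is arbitrary, and the exact API can differ between PackageCompiler versions):
using PackageCompiler

# Build a system image with CSV and DataFrames precompiled into it.
create_sysimage([:CSV, :DataFrames]; sysimage_path = "csv_sysimage.so")
# Afterwards start Julia with:  julia --sysimage csv_sysimage.so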
In my experience, the best way to deal with larger text files is not to load them into Julia all at once, but rather to stream them. This method has some additional fixed cost, but generally runs extremely quickly. Some pseudo-code is this:
function streamdat()
    mycsv = open("/path/to/text.csv", "r")   # <-- open your text file
    total = 0.0                              # <-- accumulate a sum here
    while !eof(mycsv)                        # <-- loop through each line of the file
        row = readline(mycsv)
        fields = split(row, "|")             # <-- split each line by |
        total += parse(Float64, fields[1])   # <-- e.g. sum the first column
    end
    close(mycsv)
    return total
end

streamdat()
The code above is just a simple sum, but this logic can be expanded to more complex problems.
using CSV
@time df=CSV.read("C:/Users/hafez/personal/r/tutorial for students/Book2.csv")
Recently I tried this in Julia 1.4.2 and got a different result; at first I did not understand it, so I posted the same thing on the Julia discussion forum. There I learned that a first timing like this mostly measures compile time, not the read itself; you can find the benchmark in that discussion.
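To separate compilation from the actual read, a common check is simply to time the call twice in the same session; a sketch (file path taken from the snippet above; newer CSV.jl versions require a sink argument such as DataFrame):
using CSV, DataFrames

@time df = CSV.read("C:/Users/hafez/personal/r/tutorial for students/Book2.csv", DataFrame)   # compilation + parsing
@time df = CSV.read("C:/Users/hafez/personal/r/tutorial for students/Book2.csv", DataFrame)   # parsing only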
I have a Spark DataFrame where all fields are integer type. I need to count how many individual cells are greater than 0.
I am running locally and have a DataFrame with 17,000 rows and 450 columns.
I have tried two methods, both yielding slow results:
Version 1:
(for (c <- df.columns) yield df.where(s"$c > 0").count).sum
Version 2:
df.columns.map(c => df.filter(df(c) > 0).count)
This calculation takes 80 seconds of wall-clock time. With Python Pandas it takes a fraction of a second. I am aware that for small data sets and local operation Python may perform better, but this seems extreme.
Trying to make a Spark-to-Spark comparison, I find that running MLlib's PCA algorithm on the same data (converted to a RowMatrix) takes less than 2 seconds!
Is there a more efficient implementation I should be using?
If not, how is the seemingly much more complex PCA calculation so much faster?
What to do
import org.apache.spark.sql.functions.{col, count, when}
df.select(df.columns map (c => count(when(col(c) > 0, 1)) as c): _*)
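If a single grand total of positive cells is wanted rather than per-column counts, the one-row result above can be collapsed on the driver; a rough sketch (the variable names are mine):
import org.apache.spark.sql.functions.{col, count, when}

// One row with one Long count per column, then sum those counts locally.
val perColumn = df.select(df.columns.map(c => count(when(col(c) > 0, 1)).as(c)): _*)
val total: Long = perColumn.first().toSeq.map(_.asInstanceOf[Long]).sum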
Why
Both of your attempts create a number of jobs proportional to the number of columns. Computing the execution plan and scheduling a job are expensive on their own and add significant overhead depending on the amount of data.
Furthermore, the data might be loaded from disk and/or parsed each time a job is executed, unless the data is fully cached with a significant memory safety margin which ensures that the cached data will not be evicted.
This means that, in the worst-case scenario, the nested-loop-like structure you use can scale roughly quadratically with the number of columns.
The code shown above handles all columns at the same time, requiring only a single data scan.
The problem with your approach is that the file is scanned once for every column (unless you have cached it in memory). The fastest way, with a single FileScan, should be:
import org.apache.spark.sql.functions.{explode, array}

val cnt: Long = df
  .select(
    explode(
      array(df.columns.head, df.columns.tail: _*)
    ).as("cell")
  )
  .where($"cell" > 0).count
Still, I think it will be slower than Pandas, as Spark has a certain overhead due to its parallelization engine.
tl;dr
My issue here is that I'm stuck at calculating how many rows to anticipate on each part of a full outer merge when using Pandas DataFrames as part of a combinatorics graph.
Questions (repeated below).
1. The ideal solution would be to not require the merge at all and to query the panel objects directly. Given that there isn't a query method on the panel, is there a cleaner solution which would solve this problem without hitting the memory ceiling?
2. If the answer to 1 is no, how can I calculate the size of the required merge table for each combination of sets without carrying out the merge? This might be a sub-optimal approach, but in this instance it would be acceptable for the purpose of the application.
3. Is Python the right language for this, or should I be looking at a more statistical language such as R, or writing it at a lower level (C, Cython)? Databases are out of the question.
The problem
Recently I rewrote the py-upset graphing library to make it more time-efficient when calculating combinations across DataFrames. I'm not looking for a review of this code; it works perfectly well in most instances, and I'm happy with the approach. What I'm looking for now is the answer to a very specific problem, uncovered when working with large datasets.
The approach I took with the rewrite was to build an in-memory merge of all provided dataframes on a full outer join, as seen on lines 480-502 of pyupset.resources:
for index, key in enumerate(keys):
    frame = self._frames[key]
    frame.columns = [
        '{0}_{1}'.format(column, key)
        if column not in self._unique_keys
        else
        column
        for column in self._frames[key].columns
    ]

    if index == 0:
        self._merge = frame
    else:
        suffixes = (
            '_{0}'.format(keys[index-1]),
            '_{0}'.format(keys[index]),
        )
        self._merge = self._merge.merge(
            frame,
            on=self._unique_keys,
            how='outer',
            copy=False,
            suffixes=suffixes
        )
For small to medium dataframes, using joins works incredibly well. In fact, recent performance tests have shown that it will handle 5 or 6 datasets containing tens of thousands of lines each in less than a minute, which is more than ample for the application structure I require.
The problem now moves from time based to memory based.
Given datasets of potentially 100s of thousands of records, the library very quickly runs out of memory even on a large server.
To put this in perspective, my test machine for this application is an 8-core VMWare box with 128GiB RAM running Centos7.
Given the following dataset sizes, when adding the 5th dataframe, memory usage spirals exponentially. This was pretty much anticipated but underlines the heart of the problem I am facing.
Rows | Dataframe
------------------------
13963 | dataframe_one
48346 | dataframe_two
52356 | dataframe_three
337292 | dataframe_four
49936 | dataframe_five
24542 | dataframe_six
258093 | dataframe_seven
16337 | dataframe_eight
These are not "small" dataframes in terms of the number of rows, although the column count for each is limited to one unique key + 4 non-unique columns. The structure of each column in pandas is:
column | type | unique
--------------------------
X | object | Y
id | int64 | N
A | float64 | N
B | float64 | N
C | float64 | N
This merge can cause problems as memory is eaten up. Occasionally it aborts with a MemoryError (great, I can catch and handle those), other times the kernel takes over and simply kills the application before the system becomes unstable, and occasionally, the system just hangs and becomes unresponsive / unstable until finally the kernel kills the application and frees the memory.
Sample output (memory sizes approximate):
[INFO] Creating merge table
[INFO] Merging table dataframe_one
[INFO] Data index length = 13963 # approx memory <500MiB
[INFO] Merging table dataframe_two
[INFO] Data index length = 98165 # approx memory <1.8GiB
[INFO] Merging table dataframe_three
[INFO] Data index length = 1296665 # approx memory <3.0GiB
[INFO] Merging table dataframe_four
[INFO] Data index length = 244776542 # approx memory ~13GiB
[INFO] Merging table dataframe_five
Killed # > 128GiB
When the merge table has been produced, it is queried in set combinations to produce graphs similar to https://github.com/mproffitt/py-upset/blob/feature/ISSUE-7-Severe-Performance-Degradation/tests/generated/extra_additional_pickle.png
The approach I am trying to build for solving the memory issue is to look at the sets being offered for merge, pre-determine how much memory the merge will require, then if that combination requires too much, split it into smaller combinations, calculate each of those separately, then put the final dataframe back together (divide and conquer).
My issue here is that I'm stuck at calculating how many rows to anticipate on each part of the merge.
Questions (repeated from above)
1. The ideal solution would be to not require the merge at all and to query the panel objects directly. Given that there isn't a query method on the panel, is there a cleaner solution which would solve this problem without hitting the memory ceiling?
2. If the answer to 1 is no, how can I calculate the size of the required merge table for each combination of sets without carrying out the merge? This might be a sub-optimal approach, but in this instance it would be acceptable for the purpose of the application.
3. Is Python the right language for this, or should I be looking at a more statistical language such as R, or writing it at a lower level (C, Cython)?
Apologies for the lengthy question. I'm happy to provide more information if required or possible.
Can anybody shed some light on what might be the reason for this?
Thank you.
Question 1.
Dask shows a lot of promise in being able to calculate the merge table "out of memory" by using hdf5 files as a temporary store.
By using multi-processing to create the merges, dask also offers a performance increase over pandas. Unfortunately this is not carried through to the query method so performance gains made on the merge are lost on querying.
It is still not a completely viable solution as dask may still run out of memory on large, complex merges.
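For illustration, a minimal sketch of that dask route under the caveats above (frame names, the key column 'X', and the HDF5 path pattern are placeholders):
import dask.dataframe as dd

left   = dd.from_pandas(left_frame,  npartitions=8)
right  = dd.from_pandas(right_frame, npartitions=8)
merged = dd.merge(left, right, on='X', how='outer')   # lazy; evaluated in chunks
merged.to_hdf('merge-*.hdf5', '/merged')              # spill partitions to HDF5 files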
Question 2.
Pre-calculating the size of the merge is entirely possible using the following method:
1. Group each dataframe by the unique key and calculate the size of each group.
2. Create a set of key names for each dataframe.
3. Create an intersection of the sets from 2.
4. Create the set difference for set 1 and for set 2.
5. To accommodate np.nan stored in the unique key, select all NaN values. If one frame contains NaN and the other doesn't, write the other as 1.
6. For keys in the intersection, multiply the count from each groupby('...').size().
7. Add the counts from the set differences.
8. Add the count of np.nan values.
In Python this could be written as:
def merge_size(left_frame, right_frame, group_by):
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)

    intersection = right_keys & left_keys
    left_sub_right = left_keys - intersection
    right_sub_left = right_keys - intersection

    left_nan = len(left_frame.query('{0} != {0}'.format(group_by)))
    right_nan = len(right_frame.query('{0} != {0}'.format(group_by)))
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_groups[group_name] for group_name in left_sub_right]
    sizes += [right_groups[group_name] for group_name in right_sub_left]
    sizes += [left_nan * right_nan]

    return sum(sizes)
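For instance, a quick sanity check of merge_size against an actual outer merge, using two toy frames invented here with key column 'X':
import numpy as np
import pandas as pd

left  = pd.DataFrame({'X': [1.0, 1.0, 2.0, np.nan], 'A': range(4)})
right = pd.DataFrame({'X': [1.0, 2.0, 2.0, 3.0],    'B': range(4)})

predicted = merge_size(left, right, 'X')
actual    = len(left.merge(right, on='X', how='outer'))
print(predicted, actual)   # both should come out to 6 here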
Question 3
This method is fairly calculation-heavy and would be better written in Cython for performance gains.
I have a dataframe that contains ~4bn records. Many of the columns are 64-bit ints, but they could be truncated to 32-bit or 16-bit ints without data loss. When I try converting the data types using the following function:
from pyspark.sql.types import ShortType

def switchType(df, colName):
    df = df.withColumn(colName + "SmallInt", df[colName].cast(ShortType()))
    df = df.drop(colName)
    return df.withColumnRenamed(colName + 'SmallInt', colName)

positionsDf = switchType(positionsDf, "FundId")
# repeat for 4 more cols...
print(positionsDf.cache().count())
This shows as taking 54.7 MB in RAM. When I don't do this, it shows as 56.7 MB in RAM.
So, is it worth trying to truncate ints at all?
I am using Spark 2.0.1 in standalone mode.
If you plan to write the data in a format that stores numbers in binary (Parquet, Avro), it may save some space. For calculations there will probably be no difference in speed.
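As a rough illustration (the path and write mode are arbitrary), this is where the narrower types can pay off, since a binary columnar format stores ShortType columns more compactly than LongType ones:
# Sketch only: write the converted frame out as Parquet.
positionsDf.write.mode("overwrite").parquet("/tmp/positions.parquet")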
OK, for the benefit of anyone else who stumbles across this: if I understand it correctly, it depends on your JVM implementation (so it is machine/OS specific), but in my case it makes little difference. I'm running Java 1.8.0_102 on 64-bit RHEL 7.
I tried it with a larger dataframe (3tn+ records). The dataframe contains 7 columns of type short/long, and 2 of type double:
As longs - 59.6Gb
As shorts - 57.1Gb
The tasks I used to create this cached dataframe also showed no real difference in execution time.
What is nice to note is that the storage size does seem to scale linearly with the number of records. So that is good.