Is there a way to use multi-threading with CSV.read in Julia? A good approach for reading big files is presented in Parallelism for reading a large file in Julia.
But since I frequently have to change my datasets, those approaches may not be applicable.
using CSV, DataFrames

file = "C:\\Users\\User\\Desktop\\Datasets\\X_train_sat4.csv"
@time df = CSV.read(file, DataFrame)
Output:
69.469112 seconds (6.29 M allocations: 9.767 GiB, 0.76% gc time)
29723 rows × 2456 columns
I have followed the steps recommended in Speed up loading and compilation time, but those improve only the first load time.
Thanks in advance!
CSV.jl supports multi-threading. You can select how many tasks to use with the ntasks keyword argument; you cannot use more tasks than the number of threads your Julia process was started with.
Here is a sample timing on reading a file having 10^8 rows and 10 columns:
julia> @time CSV.read("test.csv", DataFrame, ntasks=1); # one thread
75.190387 seconds (23.24 M allocations: 9.253 GiB, 1.24% gc time)
julia> @time CSV.read("test.csv", DataFrame, ntasks=2); # two threads
43.078513 seconds (4.34 M allocations: 8.044 GiB, 2.30% gc time)
By default, the number of tasks used is set to Threads.nthreads().
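For completeness, here is a minimal sketch of the whole workflow, assuming Julia was started with several threads (e.g. julia --threads 4) and a recent CSV.jl where the keyword is ntasks, as in the timings above; the file path is the one from the question.

# Start Julia with multiple threads, e.g.:  julia --threads 4
using CSV, DataFrames

file = "C:\\Users\\User\\Desktop\\Datasets\\X_train_sat4.csv"

Threads.nthreads()                       # threads available to this session

# ntasks defaults to Threads.nthreads(); set it explicitly if you want fewer
@time df = CSV.read(file, DataFrame; ntasks = Threads.nthreads())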
Related
I am using PySpark / Spark SQL to perform very simple tasks. The data sizes are small: the largest source is 215 MB, and 90% of the sources are under 15 MB. We do filtering, crunching and aggregations, and for 90% of the data the result is also under 5 MB; only two results are 120 MB and 260 MB.
The main hot-spot is the coalesce(1) operation, since we are required to produce a single output file. I can understand that generating and writing the 120 MB and 260 MB gzipped files takes time, but generating and writing a file of less than 5 MB should be fast. When I monitor the job, I can see that a lot of time is spent in coalesce and in saving the data file. I am clueless why it should take 60-70 seconds to generate and write a 2-3 MB file.
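For reference, a minimal sketch of the kind of write path described above, assuming an existing SparkSession named spark; the input path, aggregation and output format are placeholders, not taken from the actual job.

from pyspark.sql import functions as F

# Placeholder input and aggregation; the real job does similar filtering,
# crunching and aggregation before writing.
agg = (spark.read.csv("/data/source", header=True)
            .filter(F.col("value").isNotNull())
            .groupBy("key")
            .agg(F.sum("value").alias("total")))

# The single-file requirement forces everything through one task here.
(agg.coalesce(1)
    .write.mode("overwrite")
    .option("compression", "gzip")
    .csv("/data/output/single_file"))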
Configuration:
I have achieved some performance gain with fat executors of 3 vcores each. I am using a 1-master, 3-worker cluster with 4-core nodes.
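A rough sketch of what that fat-executor setup could look like when building the session; the memory value and app name are assumptions, since they are not given above.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("aggregation_job")                 # hypothetical app name
         .config("spark.executor.instances", "3")    # one fat executor per worker
         .config("spark.executor.cores", "3")        # 3 vcores per executor
         .config("spark.executor.memory", "8g")      # assumed value
         .getOrCreate())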
Regards
Manish Zope
I am reading from a partitioned table that has close to 4 billion records.
The files I am reading from are my source, and I have no control over them to alter the records.
While reading the files through dataframes, for each partition I am creating 2000 files of less than 2 KB each. This is because the shuffle partition count is set to 2000 to increase the execution speed.
Approach followed to resolve this issue (a rough sketch in code follows this description):
I looped over the HDFS path of the table after its load had completed and built a list of partition paths such as [/dv/hdfs/..../table_name/partition_value=01, /dv/hdfs/..../table_name/partition_value=02, ...].
For each such path, I calculated the disk usage and the cluster block size and derived the appropriate number of partitions as
no_of_partitions = ceil(disk_usage / block_size),
and then wrote the data to another location with the same partition value, such as [/dv/hdfs/..../table2_name/partition_value=01].
Although this reduces the small files to an average size of 82 MB (up from 2 KB), it takes about 2.5 minutes per partition. With 256 such partitions, the run takes more than 10 hours to finish.
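Roughly, that per-partition compaction loop looks like this; it assumes a SparkSession named spark, and the concrete paths, the source format (parquet here) and the way disk usage is obtained are illustrative assumptions, not taken from the actual job.

import math
import subprocess

BLOCK_SIZE = 128 * 1024 * 1024            # assumed cluster block size in bytes

# Illustrative partition paths; the real list is built from the table's HDFS dir.
partition_paths = [
    "/dv/hdfs/warehouse/table_name/partition_value=01",
    "/dv/hdfs/warehouse/table_name/partition_value=02",
]

for path in partition_paths:
    # disk usage of this partition in bytes, via `hdfs dfs -du -s <path>`
    du_out = subprocess.check_output(["hdfs", "dfs", "-du", "-s", path])
    disk_usage = int(du_out.split()[0])
    n_parts = max(1, math.ceil(disk_usage / BLOCK_SIZE))

    out_path = path.replace("table_name", "table2_name")
    (spark.read.parquet(path)             # source format assumed to be parquet
          .repartition(n_parts)
          .write.mode("overwrite")
          .parquet(out_path))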
Kindly suggest any other method by which this could be achieved in less than 2 hours.
Although you have 2000 shuffle partitions, you can and should control the output files.
Generating small files in Spark is in itself a performance degradation for subsequent read operations.
To control the small-files issue you can do the following:
While writing the dataframe to HDFS, repartition it on the partition column and control the number of output files per partition:
df.repartition(partition_col).write.option("maxRecordsPerFile", 100000).partitionBy(partition_col).parquet(path)
This will generate files with at most 100000 records each in every partition, solving your small-files issue and improving the overall read and write performance of your job.
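If you want to tie maxRecordsPerFile to a target file size, a rough back-of-the-envelope calculation like the one below can help; the average record size here is a hypothetical number you would measure on a sample of your data.

# Pick maxRecordsPerFile from a target file size and a measured record size.
target_file_bytes = 128 * 1024 * 1024     # e.g. aim for ~one HDFS block per file
avg_record_bytes = 350                    # hypothetical; measure on a sample
max_records_per_file = target_file_bytes // avg_record_bytes

df.repartition(partition_col) \
  .write.option("maxRecordsPerFile", int(max_records_per_file)) \
  .partitionBy(partition_col) \
  .parquet(path)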
Hope it helps.
I have a PySpark dataframe with shape (1e10, 14) and I'd like to filter it with about 50 compound OR statements, i.e.:
sql_string = """
(col1='val1' and col2=5) or
(col1='val2' and col2=7) or
(col1='val3' and col2=5) or
...
"""
df_f = df.filter(sql_string)
df_f.limit(1000).show()
If the number of these single OR statements is < 10, Spark Jobs for the show method are created instantaneously.
However, with about 15 ORs, it already takes about 30 seconds to create the Spark Jobs.
And at around 20 ORs, the time to create any Spark Jobs becomes unmanageable (hours or more).
Starting with about 15 ORs, GC allocation messages are displayed every few seconds, e.g.:
2020-05-04T09:55:50.762+0000: [GC (Allocation Failure) [PSYoungGen: 7015644K->1788K(7016448K)] 7266861K->253045K(21054976K), 0.0063209 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
So it seems like something funky is going on. It feels similar to the issue that arises when one loops over Spark dataframes.
The driver has 32GB RAM (10G used) and 4 cores (1 core 100% used, others near 0%).
I/O is pretty much zero.
Though there is 100% usage on one core, the cluster thinks it's inactive, since it shuts down after the inactivity time that I've set.
Here is a link to the execution plan: https://pastebin.com/7MEv5Sq2.
In this scenario, you are filtering the dataframe based on multiple hard-coded values using compound OR statements, so the Spark Catalyst optimizer has to check each filter one by one, loading the complete dataframe after every OR statement is evaluated.
So, when we cache the dataframe, it is already in memory, and execution is faster because the cached dataframe is available to all the executors.
For large dataframes, try persisting to memory and disk; that should give you the performance boost you seek. If it doesn't, you can improve the query by filtering the dataframe on col1 first and then filtering the already reduced dataframe on col2. This requires a little logic-based restructuring to minimize the iterations over the large data.
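A minimal sketch of that idea, assuming an existing dataframe df with the columns from the question; the value lists stand in for the ~50 real pairs, and note that splitting the pair-wise condition into two isin filters broadens it slightly, so adapt it to your actual logic.

from pyspark import StorageLevel
from pyspark.sql import functions as F

# Hypothetical stand-ins for the ~50 (col1, col2) combinations.
col1_vals = ["val1", "val2", "val3"]
col2_vals = [5, 7]

df_cached = df.persist(StorageLevel.MEMORY_AND_DISK)

# Stage 1: narrow down on col1 first.
stage1 = df_cached.filter(F.col("col1").isin(col1_vals))

# Stage 2: filter the already reduced dataframe on col2.
df_f = stage1.filter(F.col("col2").isin(col2_vals))
df_f.limit(1000).show()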
Hope it helps.
Coming originally from a Cloudera stack background, I tended to work with parquet and kudu.
Nonetheless, and even though I am not sure what you are asking exactly, here are some observations:
50 filters take time; no filter is at the other end of the spectrum and obviously takes near-zero processing time in that context. OR processing is more expensive.
Predicate push-down is evident from the physical plan and is now the Spark default for ORC processing (see the quick check after these observations).
The ORC engine does the push-down work, hence the far lower executor activity you observe.
limit cannot be pushed down to the database or to parquet / ORC.
The GC messages can be ignored.
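As a quick check of that push-down behaviour, something like the following can be run, assuming the SparkSession is available as spark; spark.sql.orc.filterPushdown is the relevant setting (on by default in recent Spark versions), and the column names are the ones from the question.

# Confirm ORC predicate push-down and inspect the plan for PushedFilters.
spark.conf.set("spark.sql.orc.filterPushdown", "true")
df.filter("(col1 = 'val1' and col2 = 5) or (col1 = 'val2' and col2 = 7)").explain()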
My overall take: with a dataframe of 1e10 rows, nothing unusual here in my view.
The first posted answer is not correct.
I'm getting an error in a spark job that's surprising me:
Total size of serialized results of 102 tasks (1029.6 MB) is
bigger than spark.driver.maxResultSize (1024.0 MB)
My job is like this:
def add(a,b): return a+b
sums = rdd.mapPartitions(func).reduce(add)
rdd has ~500 partitions, and func takes the rows in a partition and returns a large array (a numpy array of 1.3M doubles, or ~10 MB).
I'd like to sum all these results and return their sum.
Spark seems to be holding the total result of mapPartitions(func) in memory (about 5 GB) instead of processing it incrementally, which would require only about 30 MB.
Instead of increasing spark.driver.maxResultSize, is there a way to perform the reduce more incrementally?
Update: Actually I'm kind of surprised that more than two results are ever held in memory.
When using reduce, Spark applies the final reduction on the driver. If func returns a single object, this is effectively equivalent to:
reduce(add, rdd.collect())
You may use treeReduce:
import math
# Keep maximum possible depth
rdd.treeReduce(add, depth=math.log2(rdd.getNumPartitions()))
or toLocalIterator:
sum(rdd.toLocalIterator())
The former will recursively merge partitions on the workers at the cost of increased network exchange. You can use the depth parameter to tune the performance.
The latter collects only a single partition at a time, but it might require re-evaluation of the rdd, and a significant part of the job will be performed by the driver.
Depending on the exact logic used in func, you can also improve work distribution by splitting the matrix into blocks and performing the addition block by block, for example using BlockMatrix:
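A small toy sketch of block-wise addition with BlockMatrix (pyspark.mllib.linalg.distributed), assuming a SparkContext named sc; the block contents and sizes are made up, and mapping the output of func into such blocks is left out.

from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import BlockMatrix

# Two toy distributed matrices stored as 2x2 blocks; add() combines them
# block by block on the executors instead of on the driver.
blocks1 = sc.parallelize([((0, 0), Matrices.dense(2, 2, [1, 2, 3, 4])),
                          ((0, 1), Matrices.dense(2, 2, [5, 6, 7, 8]))])
blocks2 = sc.parallelize([((0, 0), Matrices.dense(2, 2, [1, 1, 1, 1])),
                          ((0, 1), Matrices.dense(2, 2, [2, 2, 2, 2]))])

m1 = BlockMatrix(blocks1, 2, 2)
m2 = BlockMatrix(blocks2, 2, 2)
summed = m1.add(m2)                      # block-wise, distributed addition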
I have a bunch of compressed text files, each line of which contains a JSON object. Simplified, my workflow looks like this:
string_json = sc.textFile('/folder/with/gzip/textfiles/')
json_objects = string_json.map(make_a_json)
DataRDD = json_objects.map(extract_data_from_json)
DataDF = sqlContext.createDataFrame(DataRDD, schema).collect()
# followed by some transformations to the dataframe
Now, the code works fine. The problem arises as soon as the number of files cannot be evenly divided among the executors.
As far as I understand it, that is because Spark does not extract the files and then distribute the rows to the executors; rather, each executor gets one file to work with.
E.g. if I have 5 files and 4 executors, the first 4 files are processed in parallel and then the 5th file.
Because the 5th is not processed in parallel with the other 4 and cannot be divided between the 4 executors, it takes the same amount of time as the first 4 together.
This happens at every stage of the program.
Is there a way to transform this kind of compartmentalized RDD into an RDD or DataFrame that is not?
I'm using Python 3.5 and Spark 2.0.1.
Spark operations are divided into tasks, or units of work that can be done in parallel. There are a few things to know about sc.textFile:
If you're loading multiple files, you're going to get 1 task per file, at minimum.
If you're loading gzipped files, you're going to get 1 task per file, at maximum.
Based on these two premises, your use case is going to see one task per file. You're absolutely right about how the tasks / cores ratio affects wall-clock time: having 5 tasks running on 4 cores will take roughly the same time as 8 tasks on 4 cores (though not quite, because stragglers exist and the first core to finish will take on the 5th task).
A rule of thumb is that you should have roughly 2-5 tasks per core in your Spark cluster to see good performance. But if you only have 5 gzipped text files, you're not going to see this. You could try to repartition your RDD (which uses a somewhat expensive shuffle operation) if you're doing a lot of work downstream:
repartitioned_string_json = string_json.repartition(100)  # RDD.repartition always shuffles; it takes no shuffle argument
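Put together with the workflow from the question, a sketch might look like this; the parse functions and schema are the ones defined in the question, and 100 is an arbitrary partition count to tune for your cluster.

string_json = sc.textFile('/folder/with/gzip/textfiles/')

# Spread the lines of the (non-splittable) gzip files across more partitions
# before the expensive per-row work.
string_json = string_json.repartition(100)

json_objects = string_json.map(make_a_json)
DataRDD = json_objects.map(extract_data_from_json)
DataDF = sqlContext.createDataFrame(DataRDD, schema)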