How to access this kind of data in Spark - apache-spark

The data is stored in the following forms:
data/file1_features.mat
data/file1_labels.txt
data/file2_features.mat
data/file2_labels.txt
...
data/file100_features.mat
data/file100_labels.txt
Each data/file*_features.mat stores the features of some samples, and each row is a sample. Each data/file*_labels.txt stores the labels of those samples, one number per row (e.g., 1, 2, 3, ...). Across all 100 files there are about 80 million samples in total.
How can I access this data set in Spark?
I have checked the spark-2.0.0-preview/examples/src/main/python/mllib/random_forest_classification_example.py. It has the following lines:
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
(trainingData, testData) = data.randomSplit([0.7, 0.3])
I ran this example in ./bin/pyspark, and it shows that the data object is a PythonRDD:
PythonRDD[32] at RDD at PythonRDD.scala:48
The data/mllib/sample_libsvm_data.txt is just one file. In my case, there are many files. Is there any RDD in Spark that handles this case conveniently? Do I need to merge all 100 files into one big file and process it as in the example? I want to use the Spark engine to scale the data set (mean-std normalization or min-max normalization).

Simply point Spark at the directory:
dir = "<path_to_data>/data"
sc.textFile(dir)
Spark automatically picks up all of the files inside that directory.
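If you also need to know which file each record came from (for example, to pair each file*_labels.txt with its corresponding features later), sc.wholeTextFiles returns (path, content) pairs. A minimal sketch:
pairs = sc.wholeTextFiles("<path_to_data>/data")   # RDD of (file path, file content) pairs
print(pairs.keys().count())                        # number of files picked up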

If you want to load only a specific file type for processing, you can use a wildcard (glob) pattern to load the matching files into an RDD.
dir = "data/*.txt"
sc.textFile(dir)
Spark will load all files ending with the .txt extension.
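To get from the 100 files to a scaled data set, here is a minimal sketch. It assumes the labels stay in their plain-text files and that the .mat feature files have been exported to plain text first (sc.textFile cannot parse .mat files directly; reading them with scipy.io.loadmat inside sc.binaryFiles is one alternative). The comma-separated feature format is an assumption, and the mean-std normalization uses MLlib's StandardScaler:
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.linalg import Vectors

# One glob picks up all 100 label files; each RDD element is one line
labels = sc.textFile("data/file*_labels.txt").map(lambda line: float(line.strip()))

# Assumed: features exported as comma-separated text, one sample per row
features = (sc.textFile("data/file*_features.txt")
              .map(lambda line: Vectors.dense([float(x) for x in line.split(",")])))

# Mean-std normalization across the whole distributed data set
scaler = StandardScaler(withMean=True, withStd=True).fit(features)
scaled = scaler.transform(features)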

Related

Number of files saved by parquet writer in pyspark

How many files does a pyspark parquet write generate? I have read that the output is one file per in-memory partition. However, this does not seem to always be true.
I am running a cluster with 6 executors and 6 GB of executor memory per executor. All the rest (pyspark memory, overhead, off-heap) is set to 2 GB.
I am using the following data:
dummy_data = spark.createDataFrame(pd.DataFrame({'a':np.random.choice([1,2,3,4,5,6,7,8,9,10],100000)}))
The following code where I repartition without specifying a column to repartition by, always produces the number of files equal to the number of memory partitions:
df_dummy = dummy_data.repartition(200)
df_dummy.rdd.getNumPartitions()
df_dummy.write.format("parquet").save("gs://monsoon-credittech.appspot.com/spark_datasets/test_writes/df_dummy_repart_wo_id")
#files generated 200
However, the following code, where I do specify the column to repartition the data by, produces some random number of files:
df_dummy = dummy_data.repartition(200,'a')
df_dummy.rdd.getNumPartitions()
df_dummy.write.format("parquet").save("gs://monsoon-credittech.appspot.com/spark_datasets/test_writes/df_dummy_repart_w_id")
#files generated 11
Can you help me understand the number of output files that get generated by the pyspark parquet writer?
This is an answer that does not explain everything you're noticing, but probably contains useful enough information that it would be a pity not to share it.
The reason you're seeing a different number of output files is the ordering of your data after those two repartition operations.
dummy_data.repartition(200) repartitions your individual rows using round robin partitioning
the result is that your data has a random ordering, because your input data has random ordering
dummy_data.repartition(200,'a') uses hash partitioning according to the column a's values
the result is that your data is chopped up in a very specific order: hashing the column values will put values where a == 1 always in the same partition
since the number of distinct values of a is smaller than the number of partitions, each non-empty partition will contain only 1 distinct a value.
Now, there is a pattern in the amount of output part-files you receive:
In the case of dummy_data.repartition(200), you simply get the same number of part-files as partitions. 200 in your example.
In the other case, you get 11 part-files. If you have a look at the content of those part-files, you will see that there is 1 empty file + 10 filled files. 1 for each distinct value of your original dataset. So this leads to the conclusion that while writing your files, something is being smart and merging those minuscule and identical files. I'm not sure whether this is Spark, or the PARQUET_OUTPUT_COMMITTER_CLASS, or something else.
Conclusion
In general, you get the same amount of part-files as the amount of partitions.
In your specific case, when you're repartitioning by the column (which is the only value in the Row), your parquet part-files will contain a bunch of the same values. It seems that something (I don't know what) is being smart and merging files with the same values.
In your case, you got 11 part-files because there is 1 empty file and 10 files, one for each distinct value in your dataframe. Try changing np.random.choice([1,2,3,4,5,6,7,8,9,10],100000) to np.random.choice([1,2,3,4,5,6,7,8],100000) and you will see that you get 9 part-files (8 + 1).
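A quick way to see this distribution for yourself is to tag each row with its partition id, as a small sketch against the dummy_data dataframe from the question (spark_partition_id is a built-in pyspark function):
from pyspark.sql.functions import spark_partition_id

(dummy_data.repartition(200, 'a')
           .withColumn('pid', spark_partition_id())
           .groupBy('pid')
           .count()
           .show())
# Expect at most 10 non-empty partition ids, one per distinct value of 'a'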
Most likely, the reason you see 11 files being written after you do a .repartition(200,'a') is because your first partition (with partition id = 0) becomes empty. Spark allows the task working on that empty partition to proceed with the write, but will suppress writing all other empty parquet files for all other partitions. This behavior can be tracked down to the changes made for JIRA SPARK-21435 "Empty files should be skipped while write to file", and corresponding code in FileFormatWriter.scala:
:
val dataWriter =
  if (sparkPartitionId != 0 && !iterator.hasNext) {
    // In case of empty job, leave first partition to save meta for file format like parquet.
    new EmptyDirectoryDataWriter(description, taskAttemptContext, committer)
  } else if (description.partitionColumns.isEmpty && description.bucketSpec.isEmpty) {
:
So, if you repartition your dataset such that partition 0 becomes non-empty, you would not see any empty files written.
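As a small sanity check (a sketch, not from the original post), you can also look at the per-partition row counts directly to confirm that partition 0 is the one that ends up empty:
sizes = (dummy_data.repartition(200, 'a')
                   .rdd
                   .mapPartitions(lambda it: [sum(1 for _ in it)])
                   .collect())
print("partition 0 size:", sizes[0])   # expected to be 0 here
print("empty partitions:", sum(1 for s in sizes if s == 0), "out of", len(sizes))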

Spark goes java heap space out of memory with a small collect

I've got a problem with Spark, its driver and an OoM issue.
Currently I have a dataframe which is built from several joined sources (actually different tables in parquet format), and there are thousands of tuples. They have a date which represents the creation date of the record, and the distinct dates are only a few.
I do the following:
from pyspark.sql.functions import year, month
# ...
selectionRows = inputDataframe.select(year('registration_date').alias('year'), month('registration_date').alias('month')).distinct()
selectionRows.show() # correctly shows 8 tuples
selectionRows = selectionRows.collect() # goes heap space OoM
print(selectionRows)
Reading the memory consumption statistics shows that the driver does not exceed ~60%. I thought that the driver should load only the distinct subset, not the entire dataframe.
Am I missing something? Is it possible to collect those few rows in a smarter way? I need them as a pushdown predicate to load a secondary dataframe.
Thank you very much!
EDIT / SOLUTION
After reading the comments and elaborating my personal needs, I cached the dataframe at every "join/elaborate" step, so that in a timeline I do the following:
Join with loaded table
Queue required transformations
Apply the cache transformation
Print the count to keep track of cardinality (mainly for tracking / debugging purposes) and thus apply all transformations + cache
Unpersist the cache of the previous sibling step, if available (tick/tock paradigm)
This reduced some complex ETL jobs down to 20% of the original time (previously, each count was re-applying the transformations of every prior step).
Lesson learned :)
After reading the comments, I elaborated the solution for my use case.
As mentioned in the question, I join several tables one with each other in a "target dataframe", and at each iteration I do some transformations, like so:
# n-th table work
target = target.join(other, how='left')
target = target.filter(...)
target = target.withColumn('a', 'b')
target = target.select(...)
print(f'Target after table "other": {target.count()}')
The problem of slowness / OoM was that Spark was forced to re-run all the transformations from the start for each table because of the count at the end, making it slower and slower at each table / iteration.
The solution I found is to cache the dataframe at each iteration, like so:
cache: DataFrame = None
# ...
# n-th table work
target = target.join(other, how='left')
target = target.filter(...)
target = target.withColumn('a', 'b')
target = target.select(...)
target = target.cache()
target_count = target.count() # actually do the cache
if cache:
    cache.unpersist() # free the memory from the old cache version
cache = target
print(f'Target after table "other": {target_count}')

How to reduce time taken by to convert dask dataframe to pandas dataframe

I have a function that reads large csv files using a dask dataframe and then converts it to a pandas dataframe, which takes quite a lot of time. The code is:
def t_createdd(Path):
    dataframe = dd.read_csv(Path, sep = chr(1), encoding = "utf-16")
    return dataframe

# Get the latest file
Array_EXT = "Export_GTT_Tea2Array_*.csv"
array_csv_files = sorted([file
                          for path, subdir, files in os.walk(PATH)
                          for file in glob(os.path.join(path, Array_EXT))])
latest_Tea2Array = array_csv_files[(len(array_csv_files)-(58+25)):
                                   (len(array_csv_files)-58)]
Tea2Array_latest = t_createdd(latest_Tea2Array)
#keep only the required columns
Tea2Array = Tea2Array_latest[['Parameter_Id','Reading_Id','X','Value']]
P1MI3 = Tea2Array.loc[Tea2Array['parameter_id']==168566]
P1MI3=P1MI3.compute()
P1MJC_main = Tea2Array.loc[Tea2Array['parameter_id']==168577]
P1MJC_old=P1MJC_main.compute()
P1MI3=P1MI3.compute() and P1MJC_old=P1MJC_main.compute() take around 10 and 11 minutes respectively to execute. Is there any way to reduce the time?
I would encourage you to consider, with reference to the Dask documentation, why you would expect the process to be any faster than using Pandas alone.
Consider:
file access may happen from several threads, but you only have one disk interface, which is a bottleneck and likely performs much better reading sequentially than trying to read several files in parallel
reading CSVs is CPU-heavy and needs the Python GIL, so the multiple threads will not actually be running in parallel
when you compute, you materialise the whole dataframe. It is true that you appear to be selecting a single row in each case, but Dask has no way to know in which file/part it is.
you call compute twice, but could have combined them: Dask works hard to evict data from memory that is not currently needed by any computation, so by computing each result separately you do double the work. Calling compute once on both outputs would roughly halve the time.
Further remarks:
obviously you would do much better if you knew which partition contained what
you can get around the GIL using processes, e.g., Dask's distributed scheduler
if you only need certain columns, do not bother to load everything and then subselect; include those columns right in the read_csv call (via usecols), saving a lot of time and memory (true for pandas or Dask). A sketch combining these suggestions follows below.
To compute both lazy things at once:
dask.compute(P1MI3, P1MJC_main)
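Putting the last two points together (usecols plus a single compute call), a minimal sketch; the separator, encoding, file list and column names are taken from the question, and Parameter_Id is assumed to be the intended spelling of the column used in the filters:
import dask
import dask.dataframe as dd

df = dd.read_csv(latest_Tea2Array, sep=chr(1), encoding="utf-16",
                 usecols=['Parameter_Id', 'Reading_Id', 'X', 'Value'])

p1mi3 = df.loc[df['Parameter_Id'] == 168566]
p1mjc = df.loc[df['Parameter_Id'] == 168577]

# One compute call materialises both results, so the CSV files are only read once
P1MI3, P1MJC_old = dask.compute(p1mi3, p1mjc)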

Spark shuffle writes growing out of control

I'm using Spark 1.6.1 to process some archives from CommonCrawl. They come as gzipped text files, and I've read that Spark has to load such compressed files into an RDD of a single partition. However, I'm running it on a cluster of ten nodes with 4 CPUs each, so I need to repartition in order to process the data in parallel. These repartition steps are taking what seems like an unacceptably long time, and when I look at the shuffle write for any repartition step in the web UI, it grows to over 40 GB, even though one .gz archive is only around 100 MB. Here is the relevant portion of the code I'm running:
final WhitelistFilter<String> filter =
    new WhitelistFilter<String>(WHITELIST_THRESHOLD, termWeights, regex);

// Each URL points to a gzip-compressed text file in the commoncrawl s3 bucket,
// with ~40,000 pages per archive
int counter = 0;
for (String s : ccURL) {
    // Obtain the WET file from the URL, and filter down to pages of English text
    JavaPairRDD<LongWritable, Text> raw =
        sc.newAPIHadoopFile("sample.wet.gz",
            org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class,
            LongWritable.class, Text.class, s3Conf)
          .repartition(sc.defaultParallelism() * 3);
    JavaRDD<String> pages = JavaPairRDD.fromJavaRDD(
        raw.filter(new WETFilter())
           .map(new WETTransformerWithURL())
           .filter(filter)).keys();
    pages.saveAsTextFile("pages-" + (counter++) + ".txt");
}
The various functions called in the filter and map steps are just basic text processing - the most complicated thing is assigning a score based on term frequencies and filtering out anything below a threshold. If I remove the call to repartition(), the entire thing will finish quickly, but without any parallelism. What about the repartitioning could be causing it to be so incredibly slow, and also make the block manager write tens of gigabytes to the disk?

How does the Apache Spark scheduler split files into tasks?

At Spark Summit 2014, Aaron gave the talk A Deeper Understanding of Spark Internals; slide 17 shows a stage being split into 4 tasks, as below:
Here I want to know three things about how a stage is split into tasks:
In the example above, it seems that the number of tasks is based on the number of files. Am I right?
If I'm right in point 1, then if there were just 3 files under the directory names, would it create just 3 tasks?
If I'm right in point 2, what if there is just one, but very large, file? Does this stage become just 1 task? And what if the data is coming from a streaming data source?
Thanks a lot; I'm confused about how a stage gets split into tasks.
You can configure the # of partitions (splits) for the entire process as the second parameter to a job, e.g. for parallelize if we want 3 partitions:
a = sc.parallelize(myCollection, 3)
Spark will divide the work into relatively even sizes (*). Large files will be broken down accordingly; you can see the actual number of partitions with:
rdd.partitions.size
So no, you will not end up with a single worker chugging away for a long time on a single file.
(*) If you have very small files then that may change this processing. But in any case large files will follow this pattern.
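The same idea in pyspark terms, as a small sketch (the file path is hypothetical): the second argument to sc.textFile is a minimum number of partitions, and a single large file is split into at least that many.
rdd = sc.textFile("hdfs:///some/large/file.txt", 8)   # hypothetical path
print(rdd.getNumPartitions())   # at least 8; one large file still yields many tasks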
The split occurs in two stages:
Firstly, HDFS splits the logical file into 64 MB or 128 MB physical blocks when the file is loaded.
Secondly, Spark will schedule a map task to process each physical block.
There is a fairly complex internal scheduling process, as there are three copies of each physical block stored on three different servers, and for large logical files it may not be possible to run all the tasks at once. The way this is handled is one of the major differences between Hadoop distributions.
When all the map tasks have run, the collectors, shuffle and reduce tasks can then be run.
Stage: a new stage is created when a wide transformation occurs.
Task: tasks are created based on the partitions in a worker.
Attaching the link for more explanation: How DAG works under the covers in RDD?
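A tiny pyspark sketch of that distinction (illustrative only): a narrow transformation such as map stays in the same stage, while a wide transformation such as reduceByKey introduces a shuffle boundary and hence a new stage, with one task per partition in each stage.
rdd = sc.parallelize(range(100), 4)               # 4 partitions -> 4 tasks per stage
mapped = rdd.map(lambda x: (x % 10, 1))           # narrow: stays in the same stage
counts = mapped.reduceByKey(lambda a, b: a + b)   # wide: shuffle, so a new stage
print(counts.getNumPartitions())
counts.collect()   # the Spark UI shows 2 stages for this job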
Question 1: In the example above, it seems that the number of tasks is based on the number of files. Am I right?
Answer: It is not based on the number of files; it is based on your Hadoop blocks (0.gz, 1.gz are blocks of data saved or stored in HDFS).
Question 2:
If I'm right in point 1, then if there were just 3 files under the directory names, would it create just 3 tasks?
Answer: By default the block size in Hadoop is 64 MB, and each block of data is treated as a partition in Spark.
Note: number of partitions = number of tasks, which is why you would get 3 tasks here.
Question 3:
What if there is just one but very large file? Does it just split this stage into 1 task? And what if the data is coming from a streaming data source?
Answer: No, the very large file will be partitioned, and as answered for question 2, the number of tasks is determined by the number of partitions.
