Hi, there is a topic about writing text data into multiple output directories in one Spark job using MultipleTextOutputFormat:
Write to multiple outputs by key Spark - one Spark job
I would like to ask if there is a similar way to write Avro data to multiple directories.
What I want is to write the data in an Avro file to different directories based on a timestamp field (records whose timestamps fall on the same day go to the same directory).
The AvroMultipleOutputs class simplifies writing Avro output data to multiple outputs.
Case one: writing to additional outputs other than the job default output. Each additional output, or named output, may be configured with its own Schema and OutputFormat.
Case two: writing data to different files provided by the user.
AvroMultipleOutputs supports counters; by default they are disabled. The counters group is the AvroMultipleOutputs class name. The names of the counters are the same as the output name. These count the number of records written to each output name.
Also have a look at
MultipleOutputer
MultipleOutputsFormatTest (see the code example with a unit test case here... For some reason MultipleOutputs does not work with Avro, but the near-identical AvroMultipleOutputs does. These obviously related classes have no common ancestor, so they are combined under the MultipleOutputer type class, which at least allows for future extension.)
Here is what we implemented for our use case in Java: write to different files, with a prefix depending on the content of the Avro record, using AvroMultipleOutputs.
Here is a wrapper on top of OutputFormat to produce multiple outputs using AvroMultipleOutputs, similar to what @Ram mentioned: https://github.com/architch/MultipleAvroOutputsFormat/blob/master/MultipleAvroOutputsFormat.java
It can be used to write Avro records to multiple paths in Spark in the following way:
Job job = Job.getInstance(hadoopConf);
AvroJob.setOutputKeySchema(job, schema);

// Register one named output per record type, each with its own schema
AvroMultipleOutputs.addNamedOutput(job, "type1", AvroKeyOutputFormat.class, schema);
AvroMultipleOutputs.addNamedOutput(job, "type2", AvroKeyOutputFormat.class, schema);

// Key each record by (named output, AvroKey) so the wrapper can route it
rdd.mapToPair(event -> {
    if (event.isType1())
        return new Tuple2<>(new Tuple2<>("type1", new AvroKey<>(event.getRecord())), NullWritable.get());
    else
        return new Tuple2<>(new Tuple2<>("type2", new AvroKey<>(event.getRecord())), NullWritable.get());
})
.saveAsNewAPIHadoopFile(
    outputBasePath,
    GenericData.Record.class,
    NullWritable.class,
    MultipleAvroOutputsFormat.class,
    job.getConfiguration()
);
Here getRecord returns a GenericRecord.
The output would be like this at outputBasePath:
17359 May 28 15:23 type1-r-00000.avro
28029 May 28 15:24 type1-r-00001.avro
16473 May 28 15:24 type1-r-00003.avro
17124 May 28 15:23 type2-r-00000.avro
30962 May 28 15:24 type2-r-00001.avro
16229 May 28 15:24 type2-r-00003.avro
This can also be used to write to different directories altogether by providing the baseOutputPath directly, as mentioned here: write to multiple directories
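As a side note (not part of the answers above): on Spark 2.4+ with the spark-avro module on the classpath, the original goal of one directory per day can also be reached from the DataFrame API via partitionBy. A minimal PySpark sketch, where the paths and the "ts" timestamp column are assumptions:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.getOrCreate()

# Hypothetical input path; the records are assumed to carry a "ts" timestamp column
df = spark.read.format("avro").load("s3://input-bucket/events/")

# partitionBy writes one subdirectory per distinct day, e.g. .../day=2019-05-28/
(df.withColumn("day", to_date(col("ts")))
   .write
   .partitionBy("day")
   .format("avro")
   .save("s3://output-bucket/events-by-day/"))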
Related
I have a directory in S3 containing millions of small files. They are small (<10 MB) gzipped files, and I know this is inefficient for Spark. I am running a simple batch job to convert these files to Parquet format. I've tried two different ways:
spark.read.csv("s3://input_bucket_name/data/")
as well as
spark.read.csv("file1", "file2"..."file8million")
where each file given in the list is located in the same bucket and subfolder.
I notice that when I feed in the whole directory, there isn't as much delay at the beginning while the driver indexes the files (around 20 minutes before the batch starts). In the UI for the single directory, there is 1 task after these 20 minutes, which looks like the conversion itself.
However, with individual filenames, the indexing time increases to 2+ hours, and the conversion job doesn't show up in the UI until then. For the list of files, there are 2 tasks: (1) the first lists leaf files for the 8M paths, and then (2) a job that looks like the conversion itself.
I'm trying to understand why this is the case. Is there anything different about the underlying read API that would lead to this behaviour?
Spark assumes every path passed in is a directory,
so when given a list of paths, it has to do a list call on each,
which for S3 means: 8M LIST calls against the S3 servers,
which are rate limited to about 3K/second, ignoring details like thread count on the client, HTTP connections, etc.
And with LIST billed at $0.005 per 1,000 calls, 8M requests come to $40.
Oh, and as each LIST returns nothing, the client falls back to a HEAD, which adds another S3 API call per path, doubling execution time and (at GET/HEAD rates of $0.0004 per 1,000) adding another $3.20 to the query cost.
In contrast,
listing a directory with 8M entries kicks off a single LIST request for the first 1,000 entries,
then 7,999 follow-up requests;
s3a releases do async prefetch of the next page of results (faster, especially if the incremental list iterators are used): one thread fetches, one processes. The whole listing will cost you about 4 cents.
The big directory listing is the more efficient and cost-effective strategy, even ignoring EC2 server costs.
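To make the arithmetic concrete, a small back-of-the-envelope sketch in Python (the LIST price is from the answer; the HEAD rate is an assumption based on standard S3 GET/HEAD pricing, so check your region):
# Back-of-the-envelope cost of listing 8M objects
LIST = 0.005 / 1000   # $ per LIST request
HEAD = 0.0004 / 1000  # $ per HEAD request (billed like GET; assumed rate)

n = 8_000_000
per_file_paths = n * (LIST + HEAD)    # one LIST + one HEAD fallback per path
one_directory = (n / 1000) * LIST     # ~8,000 paged LIST calls of 1,000 keys each

print(f"8M individual paths: ${per_file_paths:,.2f}")   # ~$43.20
print(f"one directory listing: ${one_directory:.2f}")   # ~$0.04 (the '4 cents')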
I have a list of files in Parquet format:
-- s3://my-bucket/files/14/09/12/file.pq
-- s3://my-bucket/files/14/09/11/file.pq
# 14 = day, 09 = month, 12/11 = hour.
If I pass the absolute path to my Spark context, it can read the file without any issue:
spark.read.parquet('s3://my-bucket/files/14/09/12/file.pq')
But if I pass in
spark.read.parquet('s3://my-bucket/files/14')
then I get the following error:
AnalysisException: 'Unable to infer schema for Parquet. It must be specified manually.;'
This is, I believe, due to the fact that the partitions are unnamed. I have no control over the source, so unfortunately I can't change it.
My hacky workaround is to list all the files, take the unique set of lowest-level paths, and pass that into Spark.
Is there a better workaround?
The easier way is to put * wildcards to match all the directories within the path:
df = spark.read.parquet('s3://my-bucket/files/*/*/*/')
If you want to retrieve the day, month, and hour, follow my answer here.
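Since that link is elided above, here is one hedged way to recover the components from the file path with input_file_name (the regex and the new column names are assumptions based on the path layout shown in the question):
from pyspark.sql.functions import input_file_name, regexp_extract

df = spark.read.parquet('s3://my-bucket/files/*/*/*/')

# Path layout assumed: .../files/<day>/<month>/<hour>/file.pq
path = input_file_name()
df = (df
      .withColumn("day",   regexp_extract(path, r"files/(\d+)/\d+/\d+/", 1))
      .withColumn("month", regexp_extract(path, r"files/\d+/(\d+)/\d+/", 1))
      .withColumn("hour",  regexp_extract(path, r"files/\d+/\d+/(\d+)/", 1)))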
The data is stored in the following form:
data/file1_features.mat
data/file1_labels.txt
data/file2_features.mat
data/file2_labels.txt
...
data/file100_features.mat
data/file100_labels.txt
Each data/file*_features.mat stores the features of some samples, and each row is a sample. Each data/file*_labels.txt stores the labels of those samples, and each row is a number (e.g., 1, 2, 3, ...). Across the whole 100 files, there are about 80 million samples in total.
How can I access this data set in Spark?
I have checked the spark-2.0.0-preview/examples/src/main/python/mllib/random_forest_classification_example.py. It has the following lines:
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
(trainingData, testData) = data.randomSplit([0.7, 0.3])
When I run this example in ./bin/pyspark, it shows that the data object is a PythonRDD:
PythonRDD[32] at RDD at PythonRDD.scala:48
The data/mllib/sample_libsvm_data.txt is just one file. In my case, there are many files. Is there any RDD in Spark that handles this case conveniently? Do I need to merge all 100 files into one big file and process it as in the example? I want to use the Spark engine to scale the data set (mean-std normalization or min-max normalization).
Simply point Spark at the directory:
dir = "<path_to_data>/data"
sc.textFile(dir)
Spark automatically picks up all of the files inside that directory.
If you want to load a specific file type for processing, you can use a glob pattern (not a full regular expression) to load matching files into an RDD:
dir = "data/*.txt"
sc.textFile(dir)
Spark will load all files ending with the .txt extension.
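That covers the label files, but .mat is a binary MATLAB format that sc.textFile cannot parse. A hedged sketch of one option (not part of the answer above) using binaryFiles plus scipy, and MLlib's StandardScaler for the mean-std normalization the question mentions; the in-file variable name "features" is an assumption:
import io
import scipy.io
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.linalg import Vectors

def rows_from_mat(path_and_bytes):
    # binaryFiles yields (path, raw bytes) pairs; parse the bytes with scipy
    _path, raw = path_and_bytes
    mat = scipy.io.loadmat(io.BytesIO(raw))
    return [Vectors.dense(row) for row in mat["features"]]  # "features" is assumed

# One record per sample row, across all 100 .mat files
features = sc.binaryFiles("data/*_features.mat").flatMap(rows_from_mat)

# Mean-std normalization with MLlib's StandardScaler
scaler = StandardScaler(withMean=True, withStd=True).fit(features)
scaled = scaler.transform(features)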
At Spark Summit 2014, Aaron gave the talk A Deeper Understanding of Spark Internals. On page 17 of his slides, a stage is shown split into 4 tasks, as below:
Here I want to know three things about how a stage is split into tasks:
In the example above, it seems that the number of tasks is based on the number of files. Am I right?
If I'm right in point 1, and there were just 3 files under the directory names, would it create just 3 tasks?
If I'm right in point 2, what if there is just one very large file? Would this stage have just 1 task? And what if the data comes from a streaming source?
Thanks a lot; I'm confused about how a stage is split into tasks.
You can configure the number of partitions (splits) for the entire process as a second parameter to many operations, e.g. for parallelize if we want 3 partitions:
a = sc.parallelize(myCollection, 3)
Spark will divide the work into relatively even sizes (*). Large files will be broken down accordingly; you can see the actual number of partitions with:
a.getNumPartitions()
So no, you will not end up with a single worker chugging away for a long time on a single file.
(*) If you have very small files, that may change this behavior. But in any case, large files will follow this pattern.
The split occurs in two stages:
Firstly, HDFS splits the logical file into 64 MB or 128 MB physical blocks when the file is loaded.
Secondly, Spark schedules a map task to process each physical block.
There is a fairly complex internal scheduling process, as there are three copies of each physical block stored on three different servers, and, for large logical files, it may not be possible to run all the tasks at once. The way this is handled is one of the major differences between Hadoop distributions.
When all the map tasks have run, the collectors, shuffle, and reduce tasks can then be run.
Stage: a new stage is created when a wide transformation occurs.
Task: tasks are created based on the partitions in a worker.
Attaching the link for more explanation: How DAG works under the covers in RDD?
Question 1: In the example above, it seems that the number of tasks is based on the number of files. Am I right?
Answer: It's not based on the number of files; it's based on the HDFS blocks (0.gz and 1.gz are blocks of data saved or stored in HDFS).
Question 2:
If I'm right in point 1, and there were just 3 files under the directory names, would it create just 3 tasks?
Answer: By default, the block size in Hadoop is 64 MB, and each block of data is treated as a partition in Spark.
Note: number of partitions = number of tasks; that is why 3 files (each within one block) would create 3 tasks.
Question 3:
What if there is just one very large file? Would this stage have just 1 task? And what if the data comes from a streaming source?
Answer: No, a very large file will be partitioned, and, as answered in question 2, the number of tasks created is based on the number of partitions.
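A quick way to verify the partitions-equal-tasks rule yourself (the path here is hypothetical): the partition count of a textFile RDD reflects the underlying HDFS blocks of the input, not the file count:
# Hypothetical HDFS directory, as in the slide's example
rdd = sc.textFile("hdfs:///data/names")

# At least one partition per HDFS block of the input files;
# each partition becomes one task in the stage
print(rdd.getNumPartitions())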
I wrote a Spark program that mimics the functionality of an existing MapReduce job. The MR job takes about 50 minutes every day, but the Spark job took only 9 minutes! That's great!
When I looked at the output directory, I noticed that it created 1,020 part files. The MR job uses only 20 reducers, so it creates only 20 files. We need to cut down on the number of output files; otherwise our namespace would be full in no time.
I am trying to figure out how I can reduce the number of output files in Spark. It seems that 1,020 tasks are getting triggered and each one creates a part file. Is this correct? Do I have to change the level of parallelism to cut down the number of tasks, thereby reducing the number of output files? If so, how do I set it? I am afraid cutting down the number of tasks will slow down this process, but I can test that!
Cutting down the number of reduce tasks will slow down the process for sure. However, it should still be considerably faster than Hadoop MapReduce for your use case.
In my opinion, the best method to limit the number of output files is using the coalesce(numPartitions) transformation. Below is an example:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext ctx = new JavaSparkContext(/* your configuration */);
JavaRDD<String> myData = ctx.textFile("path/to/my/file.txt");

// Consider we have 1020 partitions and thus 1020 map tasks
JavaRDD<String> mappedData = myData.map( your map function );

// Consider we need 20 output files
JavaRDD<String> newData = mappedData.coalesce(20);
newData.saveAsTextFile("output path");
In this example, the map function would be executed by 1,020 tasks, which would not be altered in any way. However, after the partitions have been coalesced, there would be only 20 partitions to work with. In that case, 20 output files would be saved at the end of the program.
As mentioned earlier, take into account that this method will be slower than writing 1,020 output files, since the data needs to be moved into fewer partitions (from 1,020 down to 20).
Note: please take a look at the repartition transformation in the following link too.
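For contrast, a short PySpark sketch of the same idea (the paths and my_map_function are hypothetical): coalesce merges existing partitions without a full shuffle, while repartition always shuffles but balances partition sizes:
rdd = sc.textFile("path/to/my/file.txt")   # e.g. 1020 partitions
mapped = rdd.map(my_map_function)          # my_map_function is hypothetical

# coalesce narrows to 20 partitions, avoiding a full shuffle
mapped.coalesce(20).saveAsTextFile("output/coalesced")        # 20 part files

# repartition(20) also yields 20 files, at the price of a full
# shuffle, but with more evenly sized partitions
mapped.repartition(20).saveAsTextFile("output/repartitioned")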