Combine Spark output into a single file - apache-spark

I'm wondering if there's a way to combine the final result into a single file when using Spark? Here's the code I have:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("logs").setMaster("local[*]")
sc = SparkContext(conf=conf)

logs_1 = sc.textFile('logs/logs_1.tsv')
logs_2 = sc.textFile('logs/logs_2.tsv')

# extract the URL column (third tab-separated field) from each log
urls_1 = logs_1.map(lambda line: line.split("\t")[2])
urls_2 = logs_2.map(lambda line: line.split("\t")[2])

all_urls = urls_1.intersection(urls_2)
all_urls = all_urls.filter(lambda url: url != "localhost")

all_urls.collect()
all_urls.saveAsTextFile('logs.csv')
The collect() method doesn't seem to be working (or I've misunderstood its purpose). Essentially, I need the 'saveAsTextFile' to output to a single file, instead of a folder with parts.

Please find below some suggestions:
collect() and saveAsTextFile() are both actions: collect() brings the results back to the driver, while saveAsTextFile() writes them out from the executors. Calling both here is redundant.
In your case you just need to store the data with saveAsTextFile(); there is no need to call collect().
collect() returns a list of the items (and in your case you are not using the returned value).
As Glennie and Akash suggested, just use coalesce(1) to force a single partition. Unlike repartition(1), coalesce(1) avoids a full shuffle, so it is more efficient.
In the given code you are using Spark's RDD API; I would suggest using DataFrames/Datasets instead (see the sketch after the links below).
Please refer to the following links for further details on RDDs and DataFrames:
Difference between DataFrame, Dataset, and RDD in Spark
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
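Putting the coalesce(1) and DataFrame suggestions together, here is a minimal PySpark sketch of the pipeline; it assumes Spark 2.x (SparkSession), the same tab-separated inputs, and an illustrative output directory name. Note that Spark still writes a directory; coalesce(1) just ensures it contains a single part file.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("logs").master("local[*]").getOrCreate()

# read the tab-separated logs; _c2 is the third (URL) column
logs_1 = spark.read.option("sep", "\t").csv("logs/logs_1.tsv")
logs_2 = spark.read.option("sep", "\t").csv("logs/logs_2.tsv")

urls_1 = logs_1.select(col("_c2").alias("url"))
urls_2 = logs_2.select(col("_c2").alias("url"))

all_urls = urls_1.intersect(urls_2).filter(col("url") != "localhost")

# coalesce(1) forces a single partition, so a single part file is written
all_urls.coalesce(1).write.mode("overwrite").csv("logs_output")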

Well, before you save, you can repartition once, like below:
all_urls.repartition(1).saveAsTextFile(resultPath)
Then you would get just one part file inside the output directory.

You can also store the result in Parquet format, which is a good fit for HDFS. Note that .write is a DataFrame API, so this assumes all_urls is a DataFrame rather than an RDD:
all_urls.write.parquet("dir_name")
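If all_urls is still an RDD of strings, as in the question, a minimal sketch of the conversion could look like the following; the SparkSession, column name, and output directory are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logs").master("local[*]").getOrCreate()

# wrap each URL string in a tuple so it becomes a one-column row
urls_df = spark.createDataFrame(all_urls.map(lambda u: (u,)), ["url"])

urls_df.write.mode("overwrite").parquet("dir_name")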

Related

How to use Pyspark's csv reader on every element of Pyspark RDD? (without "reference SparkContext from a broadcast variable")

I want to use Pyspark to read in hundreds of csv files and create a single dataframe that is (roughly) the concatenation of all the csvs. Since each csv can fit in memory, but not more than one or two at a time, this seems like a good fit for Pyspark. My strategy is not working, and I think it is because I try to make a Pyspark dataframe inside the kernel function of my map, which results in an error:
# initiate spark session and other variables
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local").appName("Test").config(
    "spark.driver.bindAddress", "127.0.0.1").getOrCreate()

file_path_list = [path1, path2]  # list of string path variables

# make an rdd object so i can use .map:
rdd = sc.sparkContext.parallelize(file_path_list)

# make a kernel function for my future .map() application
def kernel_f(path):
    df = sc.read.options(delimiter=",", header=True).csv(path)
    return df

# apply .map
rdd2 = rdd.map(kernel_f)

# see first dataframe (so excited)
rdd2.take(2)[0].show(3)
this throws an error:
PicklingError: Could not serialize object: RuntimeError: It appears
that you are attempting to reference SparkContext from a broadcast
variable, action, or transformation. SparkContext can only be used on
the driver, not in code that it run on workers. For more information,
see SPARK-5063.
My next step (supposing no error had appeared) was to use a reduce step to concatenate all the members (dataframes with the same schema) of that rdd2.
It seems related to this post but I don't understand the answer.
Questions:
I think this means that since my kernel_f calls sc. methods, it is against the rules. Is that right?
I (think I) could use the plain-old Python (not Pyspark) map function to apply kernel_f to my file_path_list, then use plain-old functools.reduce to concatenate all of these into a single Pyspark dataframe, but that doesn't seem to leverage Pyspark much. Does this seem like a good route?
Can you teach me a good, ideally a "tied-for-best" way to do this?
I don't have a definitive answer, just comments that might help. First off, I think the easiest way to do this is to read the CSVs with a wildcard, as shown here (see the sketch after these comments).
A Spark cluster is composed of the scheduler and the workers. You use the SparkSession to pass work to the scheduler. It seems workers are not allowed to send work to the scheduler, presumably because that would be an anti-pattern in a lot of use cases.
The design pattern is also weird here because you are not actually passing a DataFrame back. Spark operations are lazy, unlike Pandas, so that read does not happen immediately. I feel like if it worked, it would pass a DAG back, not data.
The plain Python/Pandas route doesn't sound good either, because you want the loading of files to stay lazy. Since you can't use Spark to read on a worker, you'd have to use Pandas/plain Python, which evaluate eagerly, so you would be even more likely to run out of memory.
Speaking of memory, Spark lets you work on data larger than the available memory by spilling to disk, but there are limits to how much larger it can be. You will inevitably run into errors if you fall short of the required memory by a considerable margin.
I think you should use the wildcard as shown above.
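As a rough illustration of the wildcard approach, a minimal PySpark sketch could look like this; the glob pattern is an assumption, and spark.read.csv also accepts an explicit list of paths.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

# one driver-side read planned over every file matching the glob
df = spark.read.options(delimiter=",", header=True).csv("data/*.csv")

# alternatively, pass the explicit list of paths collected earlier
# df = spark.read.options(delimiter=",", header=True).csv(file_path_list)

df.show(3)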

how to apply filter function to org.apache.spark.rdd.RDD[Array[Byte]]

After using spark to load thrift files:
val rdd1 = sc.newAPIHadoopFile[LongWritable, BinaryWritable[Array[Byte]],
  MultiInputFormat[Array[Byte]]]("s3://spark-tst/test/input/").map(_._2.get())
This gives an RDD of type org.apache.spark.rdd.RDD[Array[Byte]]. Now I need to apply a filter function to it, but I get "org.apache.spark.SparkException: Task not serializable". If I introduce an intermediate collection with "val s = rdd1.collect.toList", the filter can be applied to s; but "collect" is not suitable when there is a large number of files. Another problem is that after the filter, the data also needs to be written back to S3 in its original thrift format.
Appreciate any help / suggestions.

use spark to scan multiple cassandra tables using spark-cassandra-connector

I have a problem with using Spark to manipulate/iterate/scan multiple Cassandra tables. Our project uses Spark and spark-cassandra-connector to connect to Cassandra and scan multiple tables, trying to match related values across the tables and, if they match, take an extra action such as inserting into another table. The use case is like below:
sc.cassandraTable(KEYSPACE, "table1").foreach(row => {
  val company_url = row.getString("company_url")
  sc.cassandraTable(KEYSPACE, "table2").foreach(row2 => {
    val url = row2.getString("url")
    val value = row2.getString("value")
    if (company_url == url) {
      sc.saveToCassandra(KEYSPACE, "target", SomeColumns(url, value))
    }
  })
})
The problems are:
Since a Spark RDD is not serializable, the nested scan will fail because sc.cassandraTable returns an RDD. The only workaround I know of is to use sc.broadcast(sometable.collect()). But if sometable is huge, the collect will consume all the memory. Also, if several tables in the use case are broadcast like this, it will drain the memory.
Instead of broadcast, can RDD.persist handle the case? In my case, I use sc.cassandraTable to read all the tables into RDDs and persist them back to disk, then retrieve the data for processing. If that works, how can I guarantee that the RDD is read back in chunks?
Other than Spark, is there any other tool (like Hadoop, etc.) that can handle this case gracefully?
It looks like you are actually trying to do a series of inner joins. See the joinWithCassandraTable method.
This allows you to use the elements of one RDD to do a direct query on a Cassandra table. Depending on the fraction of data you are reading from Cassandra, this may be your best bet. If the fraction is too large, though, you are better off reading the two tables separately and then using the RDD.join method to line up rows.
If all else fails, you can always use the CassandraConnector object manually to access the Java driver directly and make raw requests with that from a distributed context.
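The question's code is Scala, and joinWithCassandraTable belongs to the connector's Scala RDD API; as a hedged illustration of the "read the two tables separately and join" route, here is a rough PySpark sketch using the connector's DataFrame source. The keyspace name is a placeholder, and the column names are taken from the question.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cassandra-join").getOrCreate()

def cassandra_table(keyspace, table):
    # assumes spark-cassandra-connector is on the classpath
    return (spark.read.format("org.apache.spark.sql.cassandra")
            .options(keyspace=keyspace, table=table).load())

t1 = cassandra_table("my_keyspace", "table1")
t2 = cassandra_table("my_keyspace", "table2")

# match rows across the two tables on the URL
matched = t1.join(t2, t1.company_url == t2.url).select(t2.url, t2.value)

# append the matched rows to the target table
(matched.write.format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="target")
    .mode("append").save())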

Multiples computations in one iteration with Spark

How can I iterate over a big collection of files producing different results in just one step with Spark? For example:
val tweets : RDD[Tweet] = ...
val topWords : RDD[String] = getTopWords(tweets)
val topHashtags : RDD[String] = getTopHashtags(tweets)
topWords.collect().foreach(println)
topHashtags.collect().foreach(println)
It looks like Spark is going to iterate twice over the tweets dataset. Is there any way to prevent this? Is Spark smart enough to make this kind of optimization?
Thanks in advance,
Spark will keep data it has already loaded in memory for as long as it can, but that is not something you should rely on, so your best bet is to call tweets.cache so that after the initial load both computations work off of an in-memory store. The only other solution would be to combine your two functions into one and return a tuple of (resultType1, resultType2).
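To make the caching idea concrete, here is a hedged PySpark sketch (the question's code is Scala, but the RDD API is the same); the input path and the two "top" computations are illustrative stand-ins for getTopWords and getTopHashtags.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("tweets").setMaster("local[*]")
sc = SparkContext(conf=conf)

# cache the source RDD so the second job reuses it instead of re-reading
tweets = sc.textFile("tweets.txt").cache()

top_words = (tweets.flatMap(lambda line: line.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b)
                   .takeOrdered(10, key=lambda kv: -kv[1]))

top_hashtags = (tweets.flatMap(lambda line: line.split())
                      .filter(lambda w: w.startswith("#"))
                      .map(lambda w: (w, 1))
                      .reduceByKey(lambda a, b: a + b)
                      .takeOrdered(10, key=lambda kv: -kv[1]))

print(top_words)
print(top_hashtags)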

spark save and read parquet on HDFS

I am writing this code
val inputData = spark.read.parquet(inputFile)
spark.conf.set("spark.sql.shuffle.partitions",6)
val outputData = inputData.sort($"colname")
outputData.write.parquet(outputFile) //write on HDFS
When I read the content of "outputFile" back from HDFS, I don't find the same number of partitions and the data is not sorted. Is this normal?
I am using Spark 2.0
This is an unfortunate deficiency of Spark. While write.parquet saves files as part-00000.parquet, part-00001.parquet, ... , it saves no partition information, and does not guarantee that part-00000 on disk is read back as the first partition.
We have added functionality for our project to a) read back partitions in the same order (this involves doing some somewhat-unsafe partition casting and sorting based on the contained filename), and b) serialize partitioners to disk and read them back.
As far as I know, there is nothing you can do in stock Spark at the moment to solve this problem. I look forward to seeing a resolution in future versions of Spark!
Edit: My experience is in Spark 1.5.x and 1.6.x. If there is a way to do this in native Spark with 2.0, please let me know!
You should make use of repartition() instead. This would write the parquet file the way you want it:
outputData.repartition(6).write.parquet("outputFile")
Then it would be the same when you try to read it back.
Parquet preserves the order of rows. You should use take() instead of show() to check the contents. take(n) returns the first n rows; it works by first reading the first partition to get an idea of the partition size and then fetching the rest of the data in batches.
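For reference, a PySpark rendering of this flow might look like the following; the paths and column name are illustrative (the question's own code is Scala), and take(n) is used instead of show() to check the result.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sort-and-write").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 6)

input_data = spark.read.parquet("inputFile")
output_data = input_data.sort("colname")

# control the number of part files written out
output_data.repartition(6).write.mode("overwrite").parquet("outputFile")

# check the contents with take(n) rather than show()
read_back = spark.read.parquet("outputFile")
print(read_back.take(5))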
