Multiple computations in one iteration with Spark - apache-spark

How can I iterate over a big collection of files producing different results in just one step with Spark? For example:
val tweets : RDD[Tweet] = ...
val topWords : RDD[String] = getTopWords(tweets)
val topHashtags : RDD[String] = getTopHashtags(tweets)
topWords.collect().foreach(println)
topHashtags.collect().foreach(println)
It looks like Spark is going to iterate twice over the tweets dataset. Is there any way to prevent this? Is Spark smart enough to make this kind of optimization?
Thanks in advance,

Spark will keep data in memory as long as it can, but that's not something you should rely on, so your best bet is to call tweets.cache() so that after the initial load it will be working off an in-memory store. The only other solution would be to combine your two functions into one and return a tuple of (resultType1, resultType2).
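For illustration, a minimal sketch of the caching approach (the Tweet type, the two helper functions, and sc as an existing SparkContext are assumptions, not taken from the question):

import org.apache.spark.rdd.RDD

case class Tweet(text: String, hashtags: Seq[String])   // illustrative type

def getTopWords(tweets: RDD[Tweet]): RDD[String] =
  tweets.flatMap(_.text.toLowerCase.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .sortBy(_._2, ascending = false)
    .map(_._1)

def getTopHashtags(tweets: RDD[Tweet]): RDD[String] =
  tweets.flatMap(_.hashtags)
    .map(tag => (tag, 1))
    .reduceByKey(_ + _)
    .sortBy(_._2, ascending = false)
    .map(_._1)

val tweets: RDD[Tweet] = sc.parallelize(Seq(
  Tweet("spark is fast", Seq("#spark")),
  Tweet("spark scales out", Seq("#spark", "#bigdata"))
))
tweets.cache()   // materialised on the first action below, reused by the second

getTopWords(tweets).collect().foreach(println)
getTopHashtags(tweets).collect().foreach(println)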

Related

Optimal (low-latency) spark settings for small datasets

I'm aware that Spark is designed for large datasets, for which it's great. But under certain circumstances I don't need this scalability, e.g. for unit tests or for data exploration on small datasets. Under these conditions Spark performs relatively badly compared to an implementation in pure Scala/Python/MATLAB/R etc.
Note that I don't want to drop spark entirely, I want to keep the framework for larger workloads without re-implementing everything.
How can I reduce Spark's overhead as much as possible on small datasets (say 10-1000s of records)? I've tried using only 1 partition in local mode (setting spark.sql.shuffle.partitions=1 and spark.default.parallelism=1). Even with these settings, simple queries on 100 records take on the order of 1-2 seconds.
Note that I'm not trying to reduce the time for SparkSession instantiation, just the execution time given SparkSession exists.
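For reference, the local session was configured roughly like this (a sketch; the app name is just a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("small-data-exploration")            // placeholder name
  .config("spark.sql.shuffle.partitions", "1")
  .config("spark.default.parallelism", "1")
  .getOrCreate()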
Operations in Spark have the same signatures as the Scala collections.
You could implement something like:
import org.apache.spark.rdd.RDD

val useSpark = false
val rdd: RDD[String] = ???            // your RDD here
val list: List[String] = Nil          // or your local collection
def mapping: String => Int = s => s.length

if (useSpark) {
  rdd.map(mapping)
} else {
  list.map(mapping)
}
I think this code could be abstracted even more.
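For example, one possible direction (an illustrative sketch, not a Spark API) is to keep the logic on plain Scala collections and lift it onto the RDD only when needed:

import org.apache.spark.rdd.RDD

// Express the logic once on plain Scala collections...
def process(words: Seq[String]): Seq[Int] = words.map(_.length)

// ...and lift it onto the RDD only when Spark is wanted.
def run(useSpark: Boolean, list: List[String], rdd: RDD[String]): Seq[Int] =
  if (useSpark)
    rdd.mapPartitions(part => process(part.toSeq).iterator).collect().toSeq
  else
    process(list)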

What is the fastest way to get a large number of time ranges using Apache Spark?

I have about 100 GB of time series data in Hadoop. I'd like to use Spark to grab all data from 1000 different time ranges.
I have tried this using Apache Hive by creating an extremely long SQL statement that has about 1000 'OR BETWEEN X AND Y OR BETWEEN Q AND R' statements.
I have also tried using Spark. In this technique I've created a dataframe that has the time ranges in question and loaded that into spark with:
spark_session.createDataFrame()
and
df.registerTempTable()
With this, I'm doing a join between the newly created time-range dataframe and the larger set of timestamped data.
This query is taking an extremely long time and I'm wondering if there's a more efficient way to do this.
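Roughly, the approach looks like this (sketched in Scala; the table and column names are made up):

case class TimeRange(startTs: Long, endTs: Long)

val rangesDf = spark.createDataFrame(Seq(TimeRange(1000L, 2000L), TimeRange(5000L, 6000L)))
rangesDf.registerTempTable("ranges")

spark.table("timeseries").registerTempTable("events")   // the large timestamped dataset

val matched = spark.sql(
  """SELECT e.*
    |FROM events e JOIN ranges r
    |ON e.ts BETWEEN r.startTs AND r.endTs""".stripMargin)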
If the data is not partitioned or ordered in any special way, you (or Spark) will need to scan all of it no matter what.
I would define a predicate given the set of time ranges:
import scala.collection.immutable.Range

val ranges: List[Range] = ??? // load your ranges here

def matches(timestamp: Int): Boolean = {
  // This is not efficient; a better data structure than a List
  // should be used, but this is just an example
  ranges.exists(_.contains(timestamp))
}

val data: RDD[(Int, String)] = ??? // load the data in an RDD of (timestamp, value) pairs
val filtered = data.filter(x => matches(x._1))
You can do the same with DataFrames/Datasets and UDFs.
This works well if the set of ranges is available in the driver. If it instead comes from a table, like the 100 GB data, first collect it back to the driver, provided it is not too big.
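For illustration, the same idea with the DataFrame API might look like this (the path and the timestamp column name are assumptions):

import org.apache.spark.sql.functions.udf

val ranges: Seq[(Long, Long)] = Seq((1500000000L, 1500003600L), (1500007200L, 1500010800L))
val inAnyRange = udf((ts: Long) => ranges.exists { case (lo, hi) => ts >= lo && ts <= hi })

val df = spark.read.parquet("/path/to/timestamped/data")   // assumed location and format
val filtered = df.filter(inAnyRange(df("timestamp")))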
Your Spark job goes through the 100 GB dataset to select the relevant data.
I don't think there is a big difference between using SQL or the DataFrame API, as under the hood the full scan happens anyway.
I would consider restructuring your data so that it is optimised for your specific queries.
In your case, partitioning by time can give quite a significant improvement (for example, a Hive table with partitioning).
If you search using the same field that has been used for partitioning, the Spark job will only look into the relevant partitions.
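For illustration, such a restructuring might look like this (paths and column names are assumptions): write the data once, partitioned by a derived date column, so that later range queries can prune partitions.

import org.apache.spark.sql.functions.to_date

val raw = spark.read.parquet("/path/to/timestamped/data")   // assumed location and format

// Write once, partitioned by a derived date column.
raw.withColumn("event_date", to_date(raw("timestamp")))
  .write
  .partitionBy("event_date")
  .parquet("/path/to/partitioned/data")

// Later queries that filter on event_date read only the matching partitions.
val slice = spark.read.parquet("/path/to/partitioned/data")
  .where("event_date BETWEEN '2017-01-01' AND '2017-01-07'")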

Parallel writing from same DataFrame in Spark

Let's say I have a DataFrame in Spark and I need to write the results of it to two databases, where one stores the original data frame but the other stores a slightly modified version (e.g. drops some columns). Since both operations can take a few moments, is it possible/advisable to run these operations in parallel or will that cause problems because Spark is working on the same object in parallel?
import java.util.concurrent.Executors
import scala.concurrent._

implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))

def write1(): Unit = {
  // your save statement for the first dataframe
}

def write2(): Unit = {
  // your save statement for the second dataframe
}

def writeAllTables(): Unit = {
  Future { write1() }
  Future { write2() }
}
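One caveat (a sketch; the names are illustrative): writeAllTables() returns immediately, so in a batch job you may want to wait for both futures before stopping the SparkSession, for example:

import scala.concurrent.duration.Duration

def writeAllTablesAndWait(): Unit = {
  val all = Future.sequence(Seq(Future(write1()), Future(write2())))
  Await.result(all, Duration.Inf)
}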
Let me ask you: do you really need to do it? If you are not sure, then you most probably don't.
So, let's assume a scenario similar to the one you explained:
import org.apache.spark.sql.functions.{col, concat, lit}

val df1 = spark.read.csv("someFile.csv") // Original DataFrame
val df2 = df1.withColumn("newColumn", concat(col("oldColumn"), lit(" is blah!"))) // Modified DataFrame; FYI, this df2 is a different object
df1.write.save("db_loc1") // Write to DB1, already parallelised & uses Spark resources optimally
df2.write.save("db_loc2") // Write to DB2, already parallelised & uses Spark resources optimally
Spark scheduler divides the first DataFrame df1 into partitions and writes them in parallel in db_loc1.
It picks up the second DataFrame df2 and again breaks it into partitions and writes these partitions in parallel in db_loc2.
By default, the degree of parallelism per write is determined so as to use the available cluster resources optimally.
Small writes might not be repartitioned, as mostly the write time is low and repartitioning would only add overhead. In the extraordinary case where you have a lot of small writes, it might make a good case for trying to parallelise them. But the best way to do so is to redesign your code to run one Spark job per DataFrame instead of trying to parallelise the DataFrame.write() calls in the same driver program.
Large writes will probably use all available resources in parallel during the single DataFrame write itself. Hence, if Spark allowed issuing another write operation for a different DataFrame at the same time, it would only delay both operations, as they would now be racing each other for resources. Not to mention, there may be some performance slowdown due to the increased overhead from the sheer increase in the number of tasks that Spark now needs to manage and track.
Also, you can read this answer to learn more about this.

Out of memory issue when comparing two large datasets using Spark Scala

I am importing 10 million records daily from MySQL to Hive using a Spark Scala program and comparing yesterday's and today's datasets.
val yesterdayDf=sqlContext.sql("select * from t_yesterdayProducts");
val todayDf=sqlContext.sql("select * from t_todayProducts");
val diffDf=todayDf.except(yesterdayDf);
I am using a 3-node cluster and the program works fine for 4 million records.
For more than 4 million records we are facing an out of memory issue, as RAM is not sufficient.
I would like to know the best way to compare two large datasets.
Have you tried finding out how many partitions you have?
yesterdayDf.rdd.partitions.size will give you that information for the yesterdayDf dataframe, and you can do the same for the other dataframes too.
You can also use
yesterdayDf.repartition(1000) // (a large number) to see if the OOM problem goes away.
The reason for this issue is hard to say. But the issue could be that for some reason the workers are pulling in too much data. Try to trim the data frames before doing the except. Following my question in the comments, you said that you have key columns, so take only those, like this:
val yesterdayDfKey = yesterdayDf.select("key-column")
val todayDfKey = todayDf.select("key-column")
val diffDf=todayDfKey.except(yesterdayDfKey);
With that you will get a data frame with just the keys. Then you can filter the full data with a join, as in this post.
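For example, a minimal sketch of that join-based filter (using the key-column name from the snippet above):

// Keep only today's full rows whose key appears in the diff of keys.
val changedRows = todayDf.join(diffDf, Seq("key-column"))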
You also need to make sure your yarn.nodemanager.resource.memory-mb is larger than your --executor-memory.
You can also try joining the two data frames on the keys with a left_anti join and then checking the count of records.
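A minimal sketch of that variant (again assuming the key-column name from above):

// Rows in today's data with no matching key in yesterday's data.
val newOrChanged = todayDf.join(yesterdayDf, Seq("key-column"), "left_anti")
println(newOrChanged.count())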

Why does Spark DataFrame run out of memory when same process on RDD completes fine?

I'm working with a fairly large amount of data (a few TBs). When I use a subset of the data, I find that Spark dataframes are great to work with. However, when I try calculations on my full dataset the same code returns me a dreaded "java.lang.OutOfMemoryError: GC overhead limit exceeded". What surprised me is that the process completes fine doing the same thing with an RDD. I thought dataframes were supposed to have better optimization. Is this a mistake in my approach or a limitation of dataframes?
For example, here is a simple task using dataframes that completes fine for a subset of my data and chokes on the full sample:
val records = sqlContext.read.avro(datafile)
val uniqueIDs = records.select("device_id").dropDuplicates(Array("device_id"))
val uniqueIDsCount = uniqueIDs.count().toDouble
val sampleIDs = uniqueIDs.sample(withReplacement = false, 100000/uniqueIDsCount)
sampleIDs.write.format("com.databricks.spark.csv").option("delimiter", "|").save(outputfile)
In this case it even chokes on the count.
However, when I try the same thing using RDDs in the following way it calculates fine (and pretty quickly at that).
val rawinput = sc.hadoopFile[AvroWrapper[Observation], NullWritable,
  AvroInputFormat[Observation]](rawinputfile).map(x => x._1.datum)
val tfdistinct = rawinput.map(x => x.getDeviceId).distinct
val distinctCount = tfdistinct.count().toDouble
tfdistinct.sample(false, 100000/distinctCount.toDouble).saveAsTextFile(outputfile)
I'd love to keep using dataframes in the future, am I approaching this wrong?
