Using DataFrame.foreachPartition, processing partitions as data frames - apache-spark

I have a dataframe that's partitioned by col0; there are many rows in the DF per value of col0. I have a database from which I want to fetch batches of data using the values of col0 in each partition, but I can't for the life of me figure out how to use foreachPartition, since it returns an Iterator[Row].
Here's pseudocode for what I'm wanting to do:
var df = spark.read.parquet(...).repartition(numPartitions, "col0")
df.foreachPartition((part_df : DataFrame) => {
  val values = part_df.select("col0").distinct
  val sql = s"select * from table0 where col0 in (${values})" // or some smarter method :)
  val db_df = spark.read.jdbc(..., table = sql)
  part_df.join(db_df, "col0") // and/or whatever else
})
Any ideas?

I wasn't able to find an elegant solution to this, but I did find an inelegant one.
When you write out to a filesystem, Spark writes a separate file for each partition. You can then use the filesystem API to list the files, and read in and operate on each one individually as a separate dataframe.
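A rough sketch of that workaround, assuming the repartitioned df from the question; the staging path and the per-file processing are illustrative:

import org.apache.hadoop.fs.{FileSystem, Path}

// Write the repartitioned DataFrame; Spark emits one part-file per partition.
df.write.parquet("/tmp/staged_by_col0")   // staging path is illustrative

// List the part-files and treat each one as its own DataFrame.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("/tmp/staged_by_col0"))
  .filter(_.getPath.getName.startsWith("part-"))
  .foreach { status =>
    val partDf = spark.read.parquet(status.getPath.toString)
    // e.g. collect partDf's distinct col0 values, build the JDBC query, join, save...
  }

Note that with repartition(numPartitions, "col0") a single part-file can still contain several col0 values, so the per-file query should use all the distinct values it finds.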

Related

Spark scala partition dataframe for large cross joins

I have two dataframes that need to be cross joined on a 20-node cluster. However, because of their size, a simple cross join is failing. I am looking to partition the data and perform the cross join, and am looking for an efficient way to do it.
Simple Algorithm
Manually split file f1 into three and read the pieces into dataframes: df1A, df1B, df1C. Manually split file f2 into four and read the pieces into dataframes: df2A, df2B, df2C, df2D. Cross join df1A X df2A, df1A X df2B, ..., df1A X df2D, ..., df1C X df2D. Save each cross join to a file and manually put all the files together. This way Spark can perform each cross join in parallel and things should complete fairly quickly.
Question
Is there a more efficient way of accomplishing this by reading both files into two dataframes, then partitioning each dataframe into 3 and 4 "pieces", and cross joining each partition of one dataframe with every partition of the other?
A DataFrame can be partitioned either by range or by hash.
import spark.implicits._ // needed for the $"k" column syntax

val df1 = spark.read.csv("file1.txt")
val df2 = spark.read.csv("file2.txt")
val partitionedByRange1 = df1.repartitionByRange(3, $"k")
val partitionedByRange2 = df2.repartitionByRange(4, $"k")
val result = partitionedByRange1.crossJoin(partitionedByRange2)
NOTE: set the property spark.sql.crossJoin.enabled=true
Alternatively, you can convert each DataFrame into an RDD and use the cartesian operation on those RDDs. You should then be able to save the resulting RDD to a file, as sketched below. Hope that helps.
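A minimal sketch of that RDD route, assuming the df1 and df2 from above; the output formatting and path are illustrative:

// Cartesian product at the RDD level instead of a SQL cross join.
val crossRdd = df1.rdd.cartesian(df2.rdd)   // RDD[(Row, Row)]

crossRdd
  .map { case (left, right) => left.mkString(",") + "|" + right.mkString(",") }
  .saveAsTextFile("output/cross")           // output path is illustrative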

spark: No of records in DataFrame is different in different runs

I am running a Spark job that reads data from Teradata. The query looks like:
select * from db_name.table_name sample 5000000;
I'm trying to pull a sample of 5 million rows of data. When I print the number of rows in the resulting DataFrame, it gives a different result each time I run it: sometimes 4999937 and sometimes 5000124. Is there any particular reason for this behaviour?
EDIT #1:
The code I'm using:
val query = "(select * from db_name.table_name sample 5000000) as data"
var teradataConfig = Map(
  "url" -> "jdbc:teradata://HOSTNAME/DATABASE=db_name,DBS_PORT=1025,MAYBENULL=ON",
  "TMODE" -> "TERA",
  "user" -> "username",
  "password" -> "password",
  "driver" -> "com.teradata.jdbc.TeraDriver",
  "dbtable" -> query)
var df = spark.read.format("jdbc").options(teradataConfig).load()
df.count
Each Spark action re-runs the JDBC query, and Teradata's SAMPLE clause returns a different random sample on each execution, which is why the count changes between runs. Try caching the resulting dataframe and then performing the count action on it:
df.cache()
println(s"Record count: ${df.count()}")
From here on, when you reuse df to create a new dataframe or apply any other transformation, you won't get mismatched counts, since the data is already in the cache.
Make sure you have given the job enough memory to hold the cached dataframe.

How to create multiple RDD rows from a single file record in Apache Spark

I'm struggling with the following logic using Apache Spark. My input file has pipe-delimited rows in the following format:
14586|9297,0.000128664|9298,0.0683921
14587|4673,0.00730174
14588|9233,1.15112e-07|9234,4.80094e-05|9235,1.91492e-05|9236,0.00776722
The first column is a key. There may be one or more columns after that. Each subsequent column has a secondary key and a value, like this: 4673,0.00730174
While reading this file I want the resulting RDD to have only 3 columns, flattening the other columns while retaining the main key, like this:
14586|9297,0.000128664
14586|9298,0.0683921
14587|4673,0.00730174
14588|9233,1.15112e-07
14588|9234,4.80094e-05
14588|9235,1.91492e-05
14588|9236,0.00776722
How can I do that in Scala?
Is this the thing you're looking for?
val sc: SparkContext = ...
val rdd = sc.parallelize(Seq(
  "14586|9297,0.000128664|9298,0.0683921",
  "14587|4673,0.00730174",
  "14588|9233,1.15112e-07|9234,4.80094e-05|9235,1.91492e-05|9236,0.00776722"
)).flatMap { line =>
  val splits = line.split('|')
  val key = splits.head
  val pairs = splits.tail
  pairs.map { pair =>
    s"$key|$pair"
  }
}

rdd.collect().foreach(println)
Output:
14586|9297,0.000128664
14586|9298,0.0683921
14587|4673,0.00730174
14588|9233,1.15112e-07
14588|9234,4.80094e-05
14588|9235,1.91492e-05
14588|9236,0.00776722
Have you considered using flatMap? It allows you to create zero to n rows from a single row of input. Just parse each line and emit one reconstructed row per secondary key/value pair, keeping the primary row key.

How to control Spark JavaRDD<MyTable> to take specific n rows?

I have my own data structure called MyTable, which is a kind of columnar data store format table. Now I want to use Spark to create MyTable in a distributed environment, as my datasets are in HDFS. I have used Spark before and I am familiar with it.
I am not able to figure out how to control the JavaRDD to take n rows. Here n could be 80k, 90k rows, etc. As you can see below, the JavaRDD will always create a MyTable with one row; how do I create a MyTable with n rows?
JavaRDD<MyTable> rdd_records = sc.textFile("/path/to/hdfs").map(
    new Function<String, MyTable>() {
        public MyTable call(String line) throws Exception {
            String[] fields = line.split(",");
            // create a Record from the fields above
            Record record = ...;
            MyTable table = new MyTable();
            return table.append(record);
        }
    });
If I knew how to tell the RDD to take a certain number of rows, then I could use that to create MyTable in a distributed way.
When you load data using sc.textFile, Spark automatically splits the data on newlines and distributes it across partitions. So what you need is custom partitioning based on your parameters (the 80k-row requirement); you can use partitionBy on a pair RDD for that. After that, you should use mapPartitions instead of map to build your data structures from the rows of each partition, as sketched below.
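Here is a rough sketch of that idea in Scala (a Java version would follow the same shape). MyTable and Record are the asker's own types, so a plain collection of parsed fields stands in for them, and the batch size is the 80k figure from the question:

val rowsPerTable = 80000

val tables = sc.textFile("/path/to/hdfs")
  .mapPartitions { lines =>
    // Within each partition, group the lines into chunks of at most
    // rowsPerTable rows; each chunk would become one MyTable.
    lines.grouped(rowsPerTable).map { chunk =>
      chunk.map(_.split(","))   // stand-in for building a Record per line
    }
  }

Note that this only batches within a partition; to guarantee exactly n rows per table across the whole dataset you would still need the custom partitioning step mentioned above.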
One piece of advice: this seems like a case for DataFrames. If you are on 1.3 or later, take a look; they already handle converting tuples to a schema in a distributed way.

Apache Spark: Splitting Pair RDD into multiple RDDs by key to save values

I am using Spark 1.0.1 to process a large amount of data. Each row contains an ID number, some with duplicate IDs. I want to save all the rows with the same ID number in the same location, but I am having trouble doing it efficiently. I create an RDD[(String, String)] of (ID number, data row) pairs:
val mapRdd = rdd.map{ x=> (x.split("\\t+")(1), x)}
A way that works, but is not performant, is to collect the ID numbers, filter the RDD for each ID, and save the RDD of values with the same ID as a text file.
val ids = rdd.keys.distinct.collect
ids.foreach({ id =>
val dataRows = mapRdd.filter(_._1 == id).values
dataRows.saveAsTextFile(id)
})
I also tried groupByKey and reduceByKey so that each tuple in the RDD contains a unique ID number as the key and, as the value, a string of that ID's data rows joined by newlines. I want to iterate through the RDD only once, using foreach to save the data, but I can't turn the values into an RDD inside it:
groupedRdd.foreach({ tup =>
val data = sc.parallelize(List(tup._2)) //nested RDD does not work
data.saveAsTextFile(tup._1)
})
Essentially, I want to split an RDD into multiple RDDs by an ID number and save the values for that ID number into their own location.
I think this problem is similar to
Write to multiple outputs by key Spark - one Spark job
Please refer to the answer there.
import org.apache.hadoop.io.NullWritable
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String]
}

object Split {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Split" + args(1))
    val sc = new SparkContext(conf)
    sc.textFile("input/path")
      .map(a => (k, v)) // Your own implementation
      .partitionBy(new HashPartitioner(num))
      .saveAsHadoopFile("output/path", classOf[String], classOf[String],
        classOf[RDDMultipleTextOutputFormat])
    sc.stop()
  }
}
I just saw a similar answer above, but we don't actually need custom partitioning. MultipleTextOutputFormat will create a file for each key, and it is fine if multiple records with the same key fall into the same partition.
new HashPartitioner(num), where num is the number of partitions you want. If you have a large number of distinct keys, you can set it to a large value; that way each partition does not have to open too many HDFS file handles.
You can directly call saveAsTextFile on the grouped RDD; it will save the data based on partitions. That is, if you have 4 distinct IDs and you set the grouped RDD's number of partitions to 4, Spark stores each partition's data in one file (so you can have one file per ID). You can even see the data as iterables of each ID in the filesystem.
This will save the data per user ID:
rdd.map { x => (x.split("\\t+")(1), x) }
  .groupByKey(numPartitions)
  .saveAsObjectFile("file")
If you need to retrieve the data again by user ID, you can do something like:
val userIdLookupTable = sc.objectFile("file").cache() // could use persist() if the data is too big for memory
val data = userIdLookupTable.lookup(id) // note this returns a sequence; in this case you can just take the first one
Note that there is no particular reason to save to a file in this case; I just did it since the OP asked for it. That said, saving to a file does allow you to load the RDD at any time after the initial grouping has been done.
One last thing: lookup is faster than the filter approach for accessing IDs, but if you're willing to work off a pull request to Spark, you can check out this answer for a faster approach.
