Spark DataFrame Lazy Evaluation when select function is called - apache-spark

Knowing that Spark only does the real work when an action is called (e.g. a show on a DataFrame), I have a doubt about how far this laziness goes.
Imagine the following scenario, where a DataFrame is built with 3 extra columns:
val df = otherDF
  .withColumn("aaa", lit("AAA"))
  .withColumn("bbb", lit("BBB"))
  .withColumn("ccc", lit("CCC"))
After this, I will select only one column and show (trigger an action):
df
  .select("aaa")
  .show
Will Spark only compute the "aaa" column and ignore the other ones if they are not needed? Or will it also evaluate and process the "bbb" and "ccc" columns, with select merely filtering the output subset?
The real scenario here is that I want to create a "master" DataFrame with many columns and complex transformations, but then some sub-processes will select only a subset of the master DataFrame's columns and, if needed, add some more specific columns.
I want to guarantee that a sub-process that only needs 10% of the columns is not affected by the evaluation and processing of the complete master DataFrame (if this is possible).
Thanks in advance

I prepared this sample code:
val input = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/shared_uploads/*#gmail.com/city_temperature.csv")
val df = input
  .withColumn("aaa", lit("AAA"))
  .withColumn("bbb", lit("BBB"))
  .withColumn("ccc", lit("CCC"))
  .withColumn("generated_value", monotonically_increasing_id)
import org.apache.spark.sql.execution.debug._
df.select("aaa", "generated_value").debugCodegen()
I am reading a CSV, then adding some columns, and at the end selecting only a few of them. I added monotonically_increasing_id to also include a column that is not a literal value but is generated dynamically.
.debugCodegen() shows us what code was generated, so let's take a look at the first version, where I am also selecting generated_value:
/* 029 */ private void project_doConsume_0(InternalRow inputadapter_row_0) throws java.io.IOException {
/* 030 */ final long project_value_1 = partitionMask + project_count_0;
/* 031 */ project_count_0++;
/* 032 */
/* 033 */ project_mutableStateArray_0[0].reset();
/* 034 */
/* 035 */ project_mutableStateArray_0[0].write(0, ((UTF8String) references[0] /* literal */));
/* 036 */
/* 037 */ project_mutableStateArray_0[0].write(1, project_value_1);
/* 038 */ append((project_mutableStateArray_0[0].getRow()));
/* 039 */
/* 040 */ }
Here you can see that the code needed to calculate the id was generated and will be executed; it is this part:
final long project_value_1 = partitionMask + project_count_0;
Now the same code, but let's remove the second column from the select. The first lines of code are the same as in the previous example.
import org.apache.spark.sql.execution.debug._
df.select("aaa").debugCodegen()
project_doConsume is different:
/* 024 */ private void project_doConsume_0(InternalRow inputadapter_row_0) throws java.io.IOException {
/* 025 */ project_mutableStateArray_0[0].reset();
/* 026 */
/* 027 */ project_mutableStateArray_0[0].write(0, ((UTF8String) references[0] /* literal */));
/* 028 */ append((project_mutableStateArray_0[0].getRow()));
/* 029 */
/* 030 */ }
The code needed for monotonically_increasing_id was not generated, which means that Spark is able to push down the projection and generate only the columns that are needed.
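As a lighter-weight cross-check (not part of the original answer), you can also compare the query plans; the optimized and physical plans for the narrower select should reference only the columns that are actually needed:
// prints the parsed, analyzed, optimized and physical plans
df.select("aaa").explain(true)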

Related

Apache Beam - Write BigQuery TableRow to Cassandra

I'm trying to read data from BigQuery (using TableRow) and write the output to Cassandra. How can I do that?
Here's what I've tried. This works:
/* Read BQ */
PCollection<CxCpmMapProfile> data = p.apply(BigQueryIO.read(new SerializableFunction<SchemaAndRecord, CxCpmMapProfile>() {
    public CxCpmMapProfile apply(SchemaAndRecord record) {
        GenericRecord r = record.getRecord();
        return new CxCpmMapProfile((String) r.get("channel_no").toString(), (String) r.get("channel_name").toString());
    }
}).fromQuery("SELECT channel_no, channel_name FROM `dataset_name.table_name`").usingStandardSql().withoutValidation());
/* Write to Cassandra */
data.apply(CassandraIO.<CxCpmMapProfile>write()
    .withHosts(Arrays.asList("<IP addr1>", "<IP addr2>"))
    .withPort(9042)
    .withUsername("cassandra_user").withPassword("cassandra_password").withKeyspace("cassandra_keyspace")
    .withEntity(CxCpmMapProfile.class));
But when I changed the Read BQ part to use TableRow like this:
/* Read from BQ using readTableRow */
PCollection<TableRow> data = p.apply(BigQueryIO.readTableRows()
    .fromQuery("SELECT channel_no, channel_name FROM `dataset_name.table_name`")
    .usingStandardSql().withoutValidation());
the Write to Cassandra part failed with the following error:
The method apply(PTransform<? super PCollection<TableRow>,OutputT>) in the type PCollection<TableRow> is not applicable for the arguments (CassandraIO.Write<CxCpmMacProfile>)
The error is due to the input PCollection containing TableRow elements, while the CassandraIO write is expecting CxCpmMacProfile elements. You need to read the elements from BigQuery as CxCpmMacProfile elements. The BigQueryIO documentation has an example of reading rows from a table and parsing them into a custom type, done through the read(SerializableFunction) method.

Spark dataset: Casting Columns of dataset

This is my dataset:
Dataset<Row> myResult = pot.select(col("number")
, col("document")
, explode(col("mask")).as("mask"));
I now need to create a new dataset from the existing myResult, something like below:
Dataset<Row> myResultNew = myResult.select(col("number")
    , col("name")
    , col("age")
    , col("class")
    , col("mask"));
name, age and class are created from the column document of Dataset myResult.
I guess I can call functions on the column document and then perform any operation on it.
myResult.select(extract(col("document")));
private String extract(final Column document) {
    // TODO ADD A NEW COLUMN name, age, class TO THE NEW DATASET.
    // PARSE DOCUMENT AND GET THEM.
    XMLParser doc = (XMLParser) document; // this doesn't work???????
}
My question is: document is of type Column and I need to convert it into a different object type and parse it to extract name, age and class. How can I do that? document is an XML string, and I need to parse it to get the other 3 columns, so I can't avoid converting it to XML.
Converting the extract method into a UDF would be a solution that is as close as possible to what you are asking for. A UDF can take the value of one or more columns and execute any logic with this input.
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;

[...]

UserDefinedFunction extract = udf(
    (String document) -> {
        List<String> result = new ArrayList<>();
        XMLParser doc = XMLParser.parse(document);
        String name = ...  // read name from xml document
        String age = ...   // read age from xml document
        String clazz = ... // read class from xml document
        result.add(name);
        result.add(age);
        result.add(clazz);
        return result;
    }, DataTypes.createArrayType(DataTypes.StringType)
);
A restriction of UDFs is that they can only return one column. Therefore the function returns a String array that has to be unpacked afterwards.
Dataset<Row> myResultNew = myResult
    .withColumn("extract", extract.apply(col("document"))) //1
    .withColumn("name", col("extract").getItem(0))          //2
    .withColumn("age", col("extract").getItem(1))           //2
    .withColumn("class", col("extract").getItem(2))         //2
    .drop("document", "extract");                           //3
1. call the UDF and use the column that contains the xml document as parameter of the apply function
2. create the result columns out of the returned array from step 1
3. drop the intermediate columns
Note: the UDF is executed once per row in the dataset. If the creation of the XML parser is expensive, this might slow down the execution of the Spark job, as one parser is instantiated per row. Due to the parallel nature of Spark, it is not possible to reuse the parser for the next row. If this is an issue, another (at least in the Java world slightly more complex) option would be to use mapPartitions. There one would not need one parser per row but only one parser per partition of the dataset.
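As a rough illustration of the mapPartitions idea, here is a Scala sketch (in the answer's Java context you would use Dataset#mapPartitions with a MapPartitionsFunction and an explicit Encoder instead). XMLParser and its accessors are placeholders carried over from the question, not a real library, and a SparkSession named spark is assumed:
// Scala sketch only; one parser instance is created per partition and reused
// for every row in that partition.
import spark.implicits._

case class Enriched(number: String, name: String, age: String, clazz: String, mask: String)

val myResultNew = myResult.mapPartitions { rows =>
  val parser = new XMLParser()                        // created once per partition
  rows.map { row =>
    val doc = parser.parse(row.getAs[String]("document"))
    Enriched(
      row.getAs[String]("number"),
      doc.name,                                       // placeholder accessors
      doc.age,
      doc.clazz,
      row.getAs[String]("mask"))
  }
}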
A completely different approach would be to use spark-xml.

Add a constant struct column to a Spark DataFrame

I want to load a struct from a database collection, and attach it as a constant column to every row in a target DataFrame.
I can load the column I need as a DataFrame with one row, then do a crossJoin to paste it onto each row of the target:
val parentCollectionDF = /* ... load a single row from the database */
val constantCol = broadcast(parentCollectionDF.select("my_column"))
val result = childCollectionDF.crossJoin(constantCol)
It works but feels wasteful: the data is constant for each row of the child collection, but the crossJoin copies it to each row.
If I could hardcode the values, I could use something like childCollection.withColumn("my_column", struct(lit(val1) as "field1", lit(val2) as "field2" /* etc. */)). But I don't know them ahead of time; I need to load the struct from the parent collection.
What I'm looking for is something like:
childCollection.withColumn("my_column",
  lit(parentCollectionDF.select("my_column").take(1).getStruct(0)))
... but I can see from the code for literals that only basic types can be used as an argument to lit(). No good to pass a GenericRowWithSchema or a case class here.
Is there a less clumsy way to do this? (Spark 2.1.1, Scala)
[edit: Not the same as this question, which explains how to add a struct with literal (hardcoded) constants. My struct needs to be loaded dynamically.]

What is an efficient way to partition by column but maintain a fixed partition count?

What is the best way to partition the data by a field into predefined partition count?
I am currently partitioning the data by specifying partitionCount=600. The count 600 was found to give the best query performance for my dataset/cluster setup.
val rawJson = sqlContext.read.json(filename).coalesce(600)
rawJson.write.parquet(filenameParquet)
Now I want to partition this data by the column 'eventName' but still keep the count at 600. The data currently has around 2000 unique eventNames, and the number of rows per eventName is not uniform. Around 10 eventNames account for more than 50% of the data, causing data skew. Hence, if I do the partitioning like below, it's not very performant. The write takes 5x more time than without it.
val rawJson = sqlContext.read.json(filename)
rawJson.write.partitionBy("eventName").parquet(filenameParquet)
What is a good way to partition the data for these scenarios? Is there a way to partition by eventName but spread this into 600 partitions?
My schema looks like this:
{
  "eventName": "name1",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data": {
    "type": "EventData",
    "dataDetails": {
      "name": "detailed1",
      "id": "1234",
      ...
      ...
    }
  }
}
Thanks!
This is a common problem with skewed data and there are several approaches you can take.
List bucketing works if the skew remains stable over time, which may or may not be the case, especially if new values of the partitioning variable are introduced. I have not researched how easy it is to adjust list bucketing over time and, as your comment states, you can't use that anyway because it is a Spark 2.0 feature.
If you are on 1.6.x, the key observation is that you can create your own function that maps each event name into one of 600 unique values. You can do this as a UDF or as a case expression. Then you simply create a column using that function and partition by that column using repartition(600, 'myPartitionCol) as opposed to coalesce(600).
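For illustration only, here is a minimal sketch of that mechanic against the question's rawJson DataFrame; the hash-based bucketing and the myPartitionCol name are assumptions, and a skew-aware mapping (such as the RandomRangeMap shown below) would spread the heavy event names over several buckets instead:
// Sketch only (Spark 1.6.x DataFrame API); the deterministic hash bucketing is
// illustrative and does not by itself break up skewed event names.
import org.apache.spark.sql.functions.{col, udf}

val numPartitions = 600

// map each eventName into one of 600 bucket ids
val bucketOf = udf((eventName: String) =>
  ((eventName.hashCode % numPartitions) + numPartitions) % numPartitions)

rawJson
  .withColumn("myPartitionCol", bucketOf(col("eventName")))
  .repartition(numPartitions, col("myPartitionCol"))
  .write
  .partitionBy("eventName")
  .parquet(filenameParquet)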
Because we deal with very skewed data at Swoop, I've found the following workhorse data structure to be quite useful for building partitioning-related tools.
/** Given a key, returns a random number in the range [x, y) where
  * x and y are the numbers in the tuple associated with a key.
  */
class RandomRangeMap[A](private val m: Map[A, (Int, Int)]) extends Serializable {
  private val r = new java.util.Random() // Scala Random is not serializable in 2.10

  def apply(key: A): Int = {
    val (start, end) = m(key)
    start + r.nextInt(end - start)
  }

  override def toString = s"RandomRangeMap($r, $m)"
}
For example, here is how we build a partitioner for a slightly different case: one where the data is skewed and the number of keys is small so we have to increase the number of partitions for the skewed keys while sticking with 1 as the minimum number of partitions per key:
import org.apache.spark.Partitioner

/** Partitions data such that each unique key ends in P(key) partitions.
  * Must be instantiated with a sequence of unique keys and their Ps.
  * Partition sizes can be highly-skewed by the data, which is where the
  * multiples come in.
  *
  * @param keyMap maps key values to their partition multiples
  */
class ByKeyPartitionerWithMultiples(val keyMap: Map[Any, Int]) extends Partitioner {
  private val rrm = new RandomRangeMap(
    keyMap.keys
      .zip(
        keyMap.values
          .scanLeft(0)(_ + _)
          .zip(keyMap.values)
          .map {
            case (start, count) => (start, start + count)
          }
      )
      .toMap
  )

  override val numPartitions =
    keyMap.values.sum

  override def getPartition(key: Any): Int =
    rrm(key)
}

object ByKeyPartitionerWithMultiples {
  /** Builds a UDF with a ByKeyPartitionerWithMultiples in a closure.
    *
    * @param keyMap maps key values to their partition multiples
    */
  def udf(keyMap: Map[String, Int]) = {
    val partitioner = new ByKeyPartitionerWithMultiples(keyMap.asInstanceOf[Map[Any, Int]])
    (key: String) => partitioner.getPartition(key)
  }
}
In your case, you have to merge several event names into a single partition, which would require changes, but I hope the code above gives you an idea of how to approach the problem.
One final observation is that if the distribution of event names varies a lot in your data over time, you can perform a statistics-gathering pass over some part of the data to compute a mapping table. You don't have to do this all the time, just when it is needed. To determine that, you can look at the number of rows and/or the size of the output files in each partition. In other words, the entire process can be automated as part of your Spark jobs.
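As a hedged sketch of what such a statistics pass could look like (the 600-partition budget, the proportional allocation rule and counting over the full dataset are illustrative assumptions, not part of the answer):
// Sketch only: derive per-key partition multiples from observed row counts,
// then build the partitioning function defined above.
import org.apache.spark.sql.functions.udf

val totalBudget = 600

// count rows per eventName (could also be done on a sample)
val counts: Map[String, Long] = rawJson
  .groupBy("eventName")
  .count()
  .collect()
  .map(r => r.getString(0) -> r.getLong(1))
  .toMap

val totalRows = counts.values.sum.toDouble

// at least one partition per key, heavy keys proportionally more
val keyMap: Map[String, Int] = counts.map { case (event, n) =>
  event -> math.max(1, math.round(n / totalRows * totalBudget).toInt)
}

// wrap the plain function returned by ByKeyPartitionerWithMultiples.udf as a Spark UDF
val partitionOf = udf(ByKeyPartitionerWithMultiples.udf(keyMap))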

Cassandra DataStax driver: how to page through columns

I have wide rows with timestamp columns. If I use the DataStax Java driver, I can page row results by using LIMIT or FETCH_SIZE; however, I could not find any specifics on how I can page through the columns of a specific row.
I found this post: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/CQL-3-and-wide-rows-td7594577.html
which explains how I could get ranges of columns based on the column name (timestamp) values.
However, what I need to do is get ALL columns; I just don't want to load them all into memory, but rather "stream" the results and process a chunk of columns (preferably of a controllable size) at a time until all columns of the row are processed.
Does the DataStax driver support streaming of this kind? And if so, what is the syntax for using it?
Additional clarification:
Essentially, what I'm looking for is an equivalent of Hector's ColumnSliceIterator, with which I could iterate over all columns (up to Integer.MAX_VALUE of them) of a specific row in batches of, say, 100 columns at a time, as follows:
SliceQuery sliceQuery = HFactory.createSliceQuery(keySpace, ...);
sliceQuery.setColumnFamily(MY_COLUMN_FAMILY);
sliceQuery.setKey(myRowKey);

// columns to be returned. The null value indicates all columns
sliceQuery.setRange(
    null                // start column
    , null              // end column
    , false             // reversed order
    , Integer.MAX_VALUE // number of columns to return
);

ColumnSliceIterator iter = new ColumnSliceIterator(
    sliceQuery          // previously created slice query needs to be passed as parameter
    , null              // starting column name
    , null              // ending column name
    , false             // reverse
    , 100               // column count <-- the batch size
);

while (iter.hasNext()) {
    String myColumnValue = iter.next().getValue();
}
How do I do the exact same thing using the DataStax driver?
thanks!
Marina
The ResultSet object that you get is actually set up to do this sort of paginating for you by default. Calling one() repeatedly or iterating using iterator() will allow you to access all of the data without pulling it all into memory at once. More details are available in the API.
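For illustration, here is a minimal Scala sketch of that pattern with the 3.x Java driver (the contact point, keyspace, table and column names are made up, and myRowKey is the placeholder key from the question); since CQL3 exposes the old wide-row columns as rows, paging the rows effectively pages the columns of the original wide row:
// Sketch only; all names are placeholders.
import com.datastax.driver.core.{Cluster, SimpleStatement}
import scala.collection.JavaConverters._

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")

// the fetch size controls how many rows the driver pulls per page under the hood
val stmt = new SimpleStatement(
  "SELECT event_time, value FROM my_wide_table WHERE row_key = ?", myRowKey)
stmt.setFetchSize(100)

val rs = session.execute(stmt)

// iterating transparently fetches the next page once the current one is exhausted
for (row <- rs.iterator().asScala) {
  val myColumnValue = row.getString("value")
  // process myColumnValue
}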
