I'm trying to read data from BigQuery (using TableRow) and write the output to Cassandra. How to do that?
Here's what I've tried. This works:
/* Read BQ */
PCollection<CxCpmMapProfile> data = p.apply(BigQueryIO.read(new SerializableFunction<SchemaAndRecord, CxCpmMapProfile>() {
public CxCpmMapProfile apply(SchemaAndRecord record) {
GenericRecord r = record.getRecord();
return new CxCpmMapProfile((String) r.get("channel_no").toString(), (String) r.get("channel_name").toString());
}
}).fromQuery("SELECT channel_no, channel_name FROM `dataset_name.table_name`").usingStandardSql().withoutValidation());
/* Write to Cassandra */
data.apply(CassandraIO.<CxCpmMapProfile>write()
.withHosts(Arrays.asList("<IP addr1>", "<IP addr2>"))
.withPort(9042)
.withUsername("cassandra_user").withPassword("cassandra_password").withKeyspace("cassandra_keyspace")
.withEntity(CxCpmMapProfile.class));
But when I changed Read BQ part using TableRow like this:
/* Read from BQ using readTableRow */
PCollection<TableRow> data = p.apply(BigQueryIO.readTableRows()
.fromQuery("SELECT channel_no, channel_name FROM `dataset_name.table_name`")
.usingStandardSql().withoutValidation());
In Write to Cassandra I got the following error
The method apply(PTransform<? super PCollection<TableRow>,OutputT>) in the type PCollection<TableRow> is not applicable for the arguments (CassandraIO.Write<CxCpmMacProfile>)
The error is due to the input PCollection containing TableRow elements, while the CassandraIO read is expecting CxCpmMacProfile elements. You need to read the elements from BigQuery as CxCpmMacProfile elements. The BigQueryIO documentation has an example of reading rows from a table and parsing them into a custom type, done through the read(SerializableFunction) method.
I want to transform my input data (XML files) and produce 3 different outputs.
Each output will be in parquet format and will have a different schema/number of columns.
Currently in my solution, the data is stored in RDD[Row], where each Row belongs to one of three types and has a different number of fields. What I'm doing now is caching the RDD, then filtering it (using the field telling me about the record type) and saving the data using the following method:
var resultDF_1 = sqlContext.createDataFrame(filtered_data_1, schema_1)
resultDF_1.write.parquet(output_path_1)
...
// the same for filtered_data_2 and filtered_data_3
Is there any way to do it better, for example do not cache entire data in memory?
In MapReduce we have MultipleOutputs class and we can do it this way:
MultipleOutputs.addNamedOutput(job, "data_type_1", DataType1OutputFormat.class, Void.class, Group.class);
MultipleOutputs.addNamedOutput(job, "data_type_2", DataType2OutputFormat.class, Void.class, Group.class);
MultipleOutputs.addNamedOutput(job, "data_type_3", DataType3OutputFormat.class, Void.class, Group.class);
...
MultipleOutputs<Void, Group> mos = new MultipleOutputs<>(context);
mos.write("data_type_1", null, myRecordGroup1, filePath1);
mos.write("data_type_2", null, myRecordGroup2, filePath2);
...
We had exactly this problem, to re-iterate: we read 1000s of datasets into one RDD, all of different schemas (we used a nested Map[String, Any]) and wanted to write those 1000s of datasets to different Parquet partitions in their respective schemas. All in a single embarrassingly parallel Spark Stage.
Our initial approach indeed did the hacky thing of caching, but this meant (a) 1000 passes of the cached data (b) hitting a lot of memory issues!
For a long time now I've wanted to bypass the Spark's provided .parquet methods and go to lower level underlying libraries, and wrap that in a nice functional signature. Finally recently we did exactly this!
The code is too much to copy and paste all of it here, so I will just paste the main crux of the code to explain how it works. We intend on making this code Open Source in the next year or two.
val successFiles: List[String] = successFilePaths(tableKeyToSchema, tableKeyToOutputKey, tableKeyToOutputKeyNprs)
// MUST happen first
info("Deleting success files")
successFiles.foreach(S3Utils.deleteObject(bucket, _))
if (saveMode == SaveMode.Overwrite) {
info("Deleting past files as in Overwrite mode")
parDeleteDirContents(bucket, allDirectories(tableKeyToOutputKey, tableKeyToOutputKeyNprs, partitions, continuallyRunStartTime))
} else {
info("Not deleting past files as in Append mode")
}
rdd.mapPartitionsWithIndex {
case (index, records) =>
records.toList.groupBy(_._1).mapValues(_.map(_._2)).foreach {
case (regularKey: RegularKey, data: List[NotProcessableRecord Either UntypedStruct]) =>
val (nprs: List[NotProcessableRecord], successes: List[UntypedStruct]) =
Foldable[List].partitionEither(data)(identity)
val filename = s"part-by-partition-index-$index.snappy.parquet"
Parquet.writeUntypedStruct(
data = successes,
schema = toMessageType(tableKeyToSchema(regularKey.tableKey)),
fsMode = fs,
path = s3 / bucket / tableKeyToOutputKey(regularKey.tableKey) / regularKey.partition.pathSuffix /?
continuallyRunStartTime.map(hourMinutePathSuffix) / filename
)
Parquet.writeNPRs(
nprs = nprs,
fsMode = fs,
path = s3 / bucket / tableKeyToOutputKeyNprs(regularKey.tableKey) / regularKey.partition.pathSuffix /?
continuallyRunStartTime.map(hourMinutePathSuffix) / filename
)
} pipe Iterator.single
}.count() // Just some action to force execution
info("Writing _SUCCESS files")
successFiles.foreach(S3Utils.uploadFileContent(bucket, "", _))
Of course this code cannot be copy and pasted as many methods and values are not provided. The key points are:
We hand crank the deleting of _SUCCESS files and previous files when overwriting
Each spark partition will result in one-or-many output files (many when multiple data schemas are in the same partition)
We hand crank the writing of _SUCCESS files
Notes:
UntypedStruct is our nested representation of arbitrary schema. It's a little bit like Row in Spark but much better, as it's based on Map[String, Any].
NotProcessableRecord are essentially just dead letters
Parquet.writeUntypedStruct is the crux of the logic of writing a parquet file, so we'll explain this in more detail. Firstly
val toMessageType: StructType => MessageType = new org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter().convert
Should be self explanatory. Next fsMode contains within it the com.amazonaws.auth.AWSCredentials, then inside writeUntypedStruct we use that to construct org.apache.hadoop.conf.Configuration setting fs.s3a.access.key and fs.s3a.secret.key.
writeUntypedStruct basically just calls out to:
def writeRaw(
data: List[UntypedStruct],
schema: MessageType,
config: Configuration,
path: Path,
compression: CompressionCodecName = CompressionCodecName.SNAPPY
): Unit =
Using.resource(
ExampleParquetWriter.builder(path)
.withType(schema)
.withConf(config)
.withCompressionCodec(compression)
.withValidation(true)
.build()
)(writer => data.foreach(data => writer.write(transpose(data, new SimpleGroup(schema)))))
where SimpleGroup comes from org.apache.parquet.example.data.simple, and ExampleParquetWriter extends ParquetWriter<Group>. The method transpose is a very tedious self writing recursion through the UntypedStruct populating a Group (some ugly Java mutable low level thing).
Credit must go to https://github.com/davidainslie for figuring out how these underlying libraries work, and labouring out the code, which like I said, we intend on making Open Source soon!
AFAIK, there is no way to split one RDD into multiple RDD per se. This is just how the way Spark's DAG works: only child RDDs pulling data from parent RDDs.
We can, however, have multiple child RDDs reading from the same parent RDD. To avoid recomputing the parent RDD, there is no other way but to cache it. I assume that you want to avoid caching because you're afraid of insufficient memory. We can avoid Out Of Memory (OOM) issue by persisting the RDD to MEMORY_AND_DISK so that large RDD will spill to disk if and when needed.
Let's begin with your original data:
val allDataRDD = sc.parallelize(Seq(Row(1,1,1),Row(2,2,2),Row(3,3,3)))
We can persist this in memory first, but allow it to spill over to disk in case of insufficient memory:
allDataRDD.persist(StorageLevel.MEMORY_AND_DISK)
We then create the 3 RDD outputs:
filtered_data_1 = allDataRDD.filter(_.get(1)==1) // //
filtered_data_2 = allDataRDD.filter(_.get(2)==1) // use your own filter funcs here
filtered_data_3 = allDataRDD.filter(_.get(3)==1) // //
We then write the outputs:
var resultDF_1 = sqlContext.createDataFrame(filtered_data_1, schema_1)
resultDF_1.write.parquet(output_path_1)
var resultDF_2 = sqlContext.createDataFrame(filtered_data_2, schema_2)
resultDF_2.write.parquet(output_path_2)
var resultDF_3 = sqlContext.createDataFrame(filtered_data_3, schema_3)
resultDF_3.write.parquet(output_path_3)
If you truly really want to avoid multiple passes, there is a workaround using a custom partitioner. You can repartition your data into 3 partitions and each partition will have its own task and hence its own output file/part. The caveat is that parallelism will be heavily reduced to 3 threads/tasks, and there's also the risk of >2GB of data stored in a single partition (Spark has a 2GB limit per partition). I am not providing detailed code for this method because I don't think it can write parquet files with different schema.
I have many (relatively small) AVRO files with different schema, each in one location like this:
Object Name: A
/path/to/A
A_1.avro
A_2.avro
...
A_N.avro
Object Name: B
/path/to/B
B_1.avro
B_2.avro
...
B_N.avro
Object Name: C
/path/to/C
C_1.avro
C_2.avro
...
C_N.avro
...
and my goal is to read them in parallel via Spark and store each row as a blob in one column of the output. As a result my output data will have a consistent schema, something like the following columns:
ID, objectName, RecordDate, Data
Where the 'Data' field contains a string JSON of the original record.
My initial thought was to put the spark read statements in a loop, create the fields shown above for each dataframe, and then apply a union operation to get my final dataframe, like this:
all_df = []
for obj_name in all_object_names:
file_path = get_file_path(object_name)
df = spark.read.format(DATABRIKS_FORMAT).load(file_path)
all_df.append(df)
df_o = all_df.drop()
for df in all_df:
df_o = df_o.union(df)
# write df_o to the output
However I'm not sure if the read operations are going to be parallelized.
I also came across the sc.textFile() function to read all the AVRO files in one shot as string, but couldn't make it work.
So I have 2 questions:
Would the multiple read statements in a loop be parallelized by
Spark? Or is there a more efficient way to achieve this?
Can sc.textFile() be used to read the AVRO files as a string JSON in one column?
I'd appreciate your thoughts and suggestions.
I have Java String array which contains 45 string which is basically column names
String[] fieldNames = {"colname1","colname2",...};
Currently I am storing above array of String in a Spark driver in a static field. My job is running slow so trying to refactor code. I am using above String array while creating a DataFrame
DataFrame dfWithColNames = sourceFrame.toDF(fieldNames);
I want to do the above using broadcast variable to that it don't ship huge string array to every executor. I believe we can do something like the following to create broadcast
String[] brArray = sc.broadcast(fieldNames,String[].class);//gives compilation error
DataFrame df = sourceFrame.toDF(???);//how do I use above broadcast can I use it as is by passing brArray
I am new to Spark.
This is a bit old question, however, I hope my solution would help somebody.
In order to broadcast any object (could be a single POJO or a collection) with Spark 2+ you first need to have the following method that creates a classTag for you:
private static <T> ClassTag<T> classTag(Class<T> clazz) {
return scala.reflect.ClassManifestFactory.fromClass(clazz);
}
next you use a JavaSparkContext from a SparkSession to broadcast your object as previously:
sparkSession.sparkContext().broadcast(
yourObject,
classTag(YourObject.class)
)
In case of a collection, say, java.util.List, you use the following:
sparkSession.sparkContext().broadcast(
yourObject,
classTag(List.class)
)
The return variable of sc.broadcast is of type Broadcast<String[]> and not String[]. When you want to access the value, you simply call value() on the variable. From your example it would be like:
Broadcast<String[]> broadcastedFieldNames = sc.broadcast(fieldNames)
DataFrame df = sourceFrame.toDF(broadcastedFieldNames.value())
Note, that if you are writing this in Java, you probably want to wrap the SparkContext within the JavaSparkContext. It makes everything easier and you can then avoid having to pass a ClassTag to the broadcast function.
You can read more on broadcasting variables on http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
ArrayList<String> dataToBroadcast = new ArrayList();
dataToBroadcast .add("string1");
...
dataToBroadcast .add("stringn");
//Creating the broadcast variable
//No need to write classTag code by hand use akka.japi.Util which is available
Broadcast<ArrayList<String>> strngBrdCast = spark.sparkContext().broadcast(
dataToBroadcast,
akka.japi.Util.classTag(ArrayList.class));
//Here is the catch.When you are iterating over a Dataset,
//Spark will actally run it in distributed mode. So if you try to accees
//Your object directly (e.g. dataToBroadcast) it would be null .
//Cause you didn't ask spark to explicitly send tha outside variable to each
//machine where you are running this for each parallelly.
//So you need to use Broadcast variable.(Most common use of Broadcast)
someSparkDataSetWhere.foreach((row) -> {
ArrayList<String> stringlist = strngBrdCast.value();
...
...
})
Is there a way I can update multiple rows in cassandra database using column family template like supply a list of keys.
currently I am using updater columnFamilyTemplate to loop through a list of a keys and do an update for each row. I have seen queries like multigetSliceQuery but I don't know their equivalence in doing updates.
There is no utility method in ColumnFamilyTemplate that allow you to just pass a list of keys with a list of mutation in one call.
You can implement your own using mutators.
This is the basic code on how to do it in hector
Set<String> keys = MY_KEYS;
Map<String, String> pairsOfNameValues = MY_MUTATION_BY_NAME_AND_VALUE;
Set<HColumn<String, String>> colums = new HashSet<HColumn<String,String>>();
for (Entry<String, String> pair : pairsOfNameValues.entrySet()) {
colums.add(HFactory.createStringColumn(pair.getKey(), pair.getValue()));
}
Mutator<String> mutator = template.createMutator();
String column_family_name = template.getColumnFamily();
for (String key : keys) {
for (HColumn<String, String> column : colums) {
mutator.addInsertion(key, BASIC_COLUMN_FAMILY, column);
}
}
mutator.execute();
Well it should look like that. This is an example for insertion, be sure to use the following methods for batch mutations:
mutator.addInsertion
mutator.addDeletion
mutator.addCounter
mutator.addCounterDeletion
since this ones will execute right away without waiting for the mutator.execute():
mutator.incrementCounter
mutator.deleteCounter
mutator.insert
mutator.delete
As a last note: A mutator allows you to batch mutations on multiple rows on multiple column families at once ... which is why I generally prefer to use them instead of CF templates. I have a lot of denormalization for functionalities that use the "push-on-write" pattern of NoSQL.
You can use a batch mutation to insert as much as you want (within thrift_max_message_length_in_mb). See http://hector-client.github.com/hector//source/content/API/core/1.0-1/me/prettyprint/cassandra/model/MutatorImpl.html.