I'm trying to read data from BigQuery (using TableRow) and write the output to Cassandra. How can I do that?
Here's what I've tried. This works:
/* Read BQ */
PCollection<CxCpmMapProfile> data = p.apply(BigQueryIO.read(new SerializableFunction<SchemaAndRecord, CxCpmMapProfile>() {
    public CxCpmMapProfile apply(SchemaAndRecord record) {
        GenericRecord r = record.getRecord();
        return new CxCpmMapProfile((String) r.get("channel_no").toString(), (String) r.get("channel_name").toString());
    }
}).fromQuery("SELECT channel_no, channel_name FROM `dataset_name.table_name`").usingStandardSql().withoutValidation());
/* Write to Cassandra */
data.apply(CassandraIO.<CxCpmMapProfile>write()
    .withHosts(Arrays.asList("<IP addr1>", "<IP addr2>"))
    .withPort(9042)
    .withUsername("cassandra_user").withPassword("cassandra_password").withKeyspace("cassandra_keyspace")
    .withEntity(CxCpmMapProfile.class));
But when I change the BigQuery read to use TableRow, like this:
/* Read from BQ using readTableRows */
PCollection<TableRow> data = p.apply(BigQueryIO.readTableRows()
    .fromQuery("SELECT channel_no, channel_name FROM `dataset_name.table_name`")
    .usingStandardSql().withoutValidation());
In the Cassandra write step I get the following error:
The method apply(PTransform<? super PCollection<TableRow>,OutputT>) in the type PCollection<TableRow> is not applicable for the arguments (CassandraIO.Write<CxCpmMapProfile>)
The error occurs because the input PCollection contains TableRow elements, while the CassandraIO write expects CxCpmMapProfile elements. You need to read the elements from BigQuery as CxCpmMapProfile elements. The BigQueryIO documentation has an example of reading rows from a table and parsing them into a custom type through the read(SerializableFunction) method.
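If the pipeline has to keep readTableRows(), one alternative is to convert each row into the entity type before the Cassandra write. The sketch below shows only that conversion function, under stated assumptions: ChannelProfile is a hypothetical stand-in for CxCpmMapProfile, and a plain Map<String, Object> stands in for TableRow (which, as far as I know, is itself a Map<String, Object>) so that the example runs without Beam on the classpath. In a real pipeline the function would go into a mapping step such as MapElements between the read and the write.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the asker's CxCpmMapProfile entity class. */
final class ChannelProfile {
    public final String channelNo;
    public final String channelName;

    ChannelProfile(String channelNo, String channelName) {
        this.channelNo = channelNo;
        this.channelName = channelName;
    }
}

final class RowToProfile {
    /** Returns the column value as a String, or null when the column is missing or null. */
    private static String asString(Map<String, Object> row, String column) {
        Object value = row.get(column);
        return value == null ? null : value.toString();
    }

    /** Converts one row (column name -> value) into the entity type the write step expects. */
    public static ChannelProfile fromRow(Map<String, Object> row) {
        return new ChannelProfile(asString(row, "channel_no"), asString(row, "channel_name"));
    }

    public static void main(String[] args) {
        // Assumption: the row carries the two columns selected by the query in the question.
        Map<String, Object> row = new HashMap<>();
        row.put("channel_no", 7);
        row.put("channel_name", "news");
        ChannelProfile profile = fromRow(row);
        System.out.println(profile.channelNo + " " + profile.channelName);
    }
}
```

Once the elements have the entity type, the PCollection's element type matches what CassandraIO.<CxCpmMapProfile>write() accepts, which is exactly what the compile error is complaining about.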
I'm creating an akka-stream using Alpakka and the Slick module, but I'm stuck on a type mismatch problem.
One branch is about getting the total number of invoices in their table:
def getTotal(implicit session: SlickSession) = {
  import session.profile.api._
  val query = TableQuery[Tables.Invoice].length.result
  Slick.source(query)
}
But the last line doesn't compile, because Alpakka expects a StreamingDBIO and I'm providing a FixedSqlAction[Int,slick.dbio.NoStream,slick.dbio.Effect.Read].
How can I move from the non-streaming result to the streaming one?
Taking the length of a table results in a single value, not a stream. So the simplest way to get a Source to feed a stream is:
def getTotal(implicit session: SlickSession): Source[Int, NotUsed] =
  Source.lazyFuture { () =>
    import session.profile.api._
    // Don't actually run the query until the stream has materialized and
    // demand has reached the source
    val query = TableQuery[Tables.Invoice].length.result
    session.db.run(query)
  }
Alpakka's Slick connector is oriented more towards streaming the results of queries that return many rows (including managing pagination etc.). For a single result, converting the Future that vanilla Slick gives you into a stream is sufficient.
If you want to start executing the query as soon as you call getTotal (note that this happens whether or not the downstream ever runs or demands data from the source), you can have
def getTotal(implicit session: SlickSession): Source[Int, NotUsed] = {
  import session.profile.api._
  val query = TableQuery[Tables.Invoice].length.result
  Source.future(session.db.run(query))
}
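Would something like this work for you?
import scala.concurrent.Await
import scala.concurrent.duration._

def getTotal()(implicit session: SlickSession): Int = {
  import session.profile.api._
  // Doc Expressions (Scalar values)
  // https://scala-slick.org/doc/3.2.0/queries.html
  val query = TableQuery[Tables.Invoice].length.result
  // Blocks the calling thread until the query completes or 60 seconds pass
  val res = Await.result(session.db.run(query), 60.seconds)
  println(s"Result: $res")
  res
}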
Using Spark, I am trying to read a bunch of XML files from a path. One of the files is a dummy file that is not XML.
I would like Spark to tell me, in some way, that this particular file is not valid.
Adding the "badRecordsPath" option writes the bad data to the specified location for JSON files, but the same does not work for XML. Is there some other way?
df = (spark.read.format('json')
      .option('badRecordsPath', '/tmp/data/failed')
      .load('/tmp/data/dummy.json'))
As far as I know, this is unfortunately not yet available in Spark's XML package in a declarative way, in the way you are expecting.
For JSON it works because the parser is wrapped in a FailureSafeParser, as below, in DataFrameReader:
/**
 * Loads a `Dataset[String]` storing JSON objects (<a href="http://jsonlines.org/">JSON Lines
 * text format or newline-delimited JSON</a>) and returns the result as a `DataFrame`.
 *
 * Unless the schema is specified using `schema` function, this function goes through the
 * input once to determine the input schema.
 *
 * @param jsonDataset input Dataset with one JSON object per record
 * @since 2.2.0
 */
def json(jsonDataset: Dataset[String]): DataFrame = {
  val parsedOptions = new JSONOptions(
    extraOptions.toMap,
    sparkSession.sessionState.conf.sessionLocalTimeZone,
    sparkSession.sessionState.conf.columnNameOfCorruptRecord)
  val schema = userSpecifiedSchema.getOrElse {
    TextInputJsonDataSource.inferFromDataset(jsonDataset, parsedOptions)
  }
  ExprUtils.verifyColumnNameOfCorruptRecord(schema, parsedOptions.columnNameOfCorruptRecord)
  val actualSchema =
    StructType(schema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord))
  val createParser = CreateJacksonParser.string _
  val parsed = jsonDataset.rdd.mapPartitions { iter =>
    val rawParser = new JacksonParser(actualSchema, parsedOptions, allowArrayAsStructs = true)
    val parser = new FailureSafeParser[String](
      input => rawParser.parse(input, createParser, UTF8String.fromString),
      parsedOptions.parseMode,
      schema,
      parsedOptions.columnNameOfCorruptRecord)
    iter.flatMap(parser.parse)
  }
  sparkSession.internalCreateDataFrame(parsed, schema, isStreaming = jsonDataset.isStreaming)
}
You can implement the feature programmatically:
Read all the files in the folder using sc.textFile.
For each file, parse the entries with an XML parser.
If a file is valid, redirect it to another path.
If it is invalid, write it to the bad-records path.
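The steps above can be sketched in plain Java. This is only an illustration under stated assumptions: XmlTriage, isWellFormed and triage are hypothetical names, the (file name, content) pairs are assumed to be in memory already (in Spark they might come from something like sc.wholeTextFiles), and "valid" here means well-formed XML only, not valid against a schema.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class XmlTriage {
    /** Returns true when the content parses as well-formed XML. */
    public static boolean isWellFormed(String content) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException | IOException e) {
            // A parse failure means the content is not well-formed XML.
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("XML parser could not be created", e);
        }
    }

    /** Splits (file name -> content) pairs into "valid" and "invalid" lists of file names. */
    public static Map<String, List<String>> triage(Map<String, String> files) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        result.put("valid", new ArrayList<>());
        result.put("invalid", new ArrayList<>());
        for (Map.Entry<String, String> file : files.entrySet()) {
            String bucket = isWellFormed(file.getValue()) ? "valid" : "invalid";
            result.get(bucket).add(file.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("good.xml", "<rows><row id=\"1\"/></rows>");
        files.put("dummy.txt", "this is not xml");
        System.out.println(triage(files));
    }
}
```

The "invalid" list is what would be written to the bad-records path, and the "valid" list is what would be handed on to the XML reader.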
I'm using Google's official Spark-BigQuery connector (com.google.cloud.bigdataoss:bigquery-connector:hadoop2-0.13.6) to retrieve data from BigQuery on a huge time-partitioned table (field myDateField).
So I'm currently doing this (example adapted from the docs) to retrieve recent data (less than a month old):
val config = sparkSession.sparkContext.hadoopConfiguration
config.set(BigQueryConfiguration.GCS_BUCKET_KEY, "mybucket")
val fullyQualifiedInputTableId = "project:dataset.table"
BigQueryConfiguration.configureBigQueryInput(config, fullyQualifiedInputTableId)
val bigQueryRDD: RDD[(LongWritable, JsonObject)] = sparkSession.sparkContext.newAPIHadoopRDD(
  config,
  classOf[GsonBigQueryInputFormat],
  classOf[LongWritable],
  classOf[JsonObject]
)
val convertedRDD: RDD[MyClass] = bigQueryRDD.map { case (_, jsonObject) =>
  convertJsonObjectToMyClass(jsonObject)
}
val recentData: RDD[MyClass] = convertedRDD.filter { case MyClass(_, myDateField) =>
  myDateField >= "2018-08-10"
}
println(recentData.count())
Questions
I'm wondering whether the connector queries all the data from the BigQuery table, like:
SELECT *
FROM `project.dataset.table`
Or whether it does something clever (and, more importantly, less expensive) that uses the partitioning, like:
SELECT *
FROM `project.dataset.table`
WHERE myDateField >= TIMESTAMP("2018-08-10")
Moreover, in general, how can I control the cost of a query and be sure that irrelevant data (here, data before "2018-08-10" for example) is not retrieved for nothing?
If BigQuery does retrieve all the data, can I provide a specific query? BigQueryConfiguration.INPUT_QUERY_KEY (mapred.bq.input.query) is deprecated, but I don't see any replacement, and the docs are not very clear on that.
I am looking to store protobuf messages in HBase/HDFS using Spark Streaming, and I have the two questions below.
What is the efficient way of storing a huge number of protobuf messages and the efficient way of retrieving them to do some analytics? For example, should they be stored as Strings/byte[] in HBase, or should they be stored in Parquet files in HDFS, etc.?
How should the hierarchical structure of a protobuf message be stored? I mean, should the nested elements be flattened out before storage, or is there any mechanism to store them as is? If the nested elements are collections or maps, should they be exploded and stored as multiple rows?
The sample structure of a protobuf message is shown below:
> +--MsgNode-1
> +--Attribute1 - String
> +--Attribute2 - Int
> +--MsgNode-2
> +--Attribute1 - String
> +--Attribute2 - Double
> +--MsgNode-3 - List of MsgNode-3's
> +--Attribute1 - Int
I am planning to use Spark streaming to collect the protobuf messages as bytes and store them in Hbase/HDFS.
Question 1 :
What is the efficient way of storing huge number of protobuf messages
and the efficient way of retrieving them to do some analytics? For
example, should they be stored as Strings/byte[] in Hbase or Should
they be stored in parquet files in HDFS etc.
I would recommend:
- Store the protobuf messages as Parquet-Avro files (splitting them into meaningful messages with an Avro schema).
This can be achieved using the DataFrame API in Spark 1.5 and above (partitionBy with SaveMode.Append).
See this: a-powerful-big-data-trio
If you store them as a string or byte array, you can't do data analytics directly (querying the raw data is not possible).
If you are using Cloudera, Impala (which supports parquet-avro) can be used to query the raw data.
Question 2:
How should the hierarchical structure of a protobuf messages be
stored? I mean, should the nested elements be flattened out before
storage, or is there any mechanism to store them as is? If the nested
elements are collections or maps should they be exploded and stored as
multiple rows?
If you store the data in a raw format from Spark Streaming, how will you query it when the business wants to know what kind of data was received? (This requirement is very common.)
First, you have to understand your data (i.e. the relationships between the different messages within the protobuf, so that you can decide between a single row and multiple rows). Then develop a protobuf parser to parse the message structure. Based on your data, convert it to an Avro generic record and save it as a Parquet file.
TIP:
Protobuf parsers can be developed in different ways based on your requirements.
One generic way is like the example below.
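import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

import com.google.protobuf.Descriptors.FieldDescriptor;
import com.google.protobuf.GeneratedMessage;

// Converts the set fields of a protobuf message into a sorted map of field name -> string value(s).
public SortedMap<String, Object> convertProtoMessageToMap(GeneratedMessage src) {
    final SortedMap<String, Object> finalMap = new TreeMap<String, Object>();
    final Map<FieldDescriptor, Object> fields = src.getAllFields();
    for (final Map.Entry<FieldDescriptor, Object> fieldPair : fields.entrySet()) {
        final FieldDescriptor desc = fieldPair.getKey();
        if (desc.isRepeated()) {
            // Repeated field: store the string form of every element as a list.
            final List<?> fieldList = (List<?>) fieldPair.getValue();
            if (fieldList.size() != 0) {
                final List<String> arrayListOfElements = new ArrayList<String>();
                for (final Object o : fieldList) {
                    arrayListOfElements.add(o.toString());
                }
                finalMap.put(desc.getName(), arrayListOfElements);
            }
        } else {
            // Singular field: store its string form directly.
            finalMap.put(desc.getName(), fieldPair.getValue().toString());
        }
    }
    return finalMap;
}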
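On the "single row or multiple rows" decision: if a repeated nested message should become multiple rows, the parsed record can be exploded before it is written. A minimal sketch, assuming the message has already been parsed into a Map whose repeated field holds a List. The class name ExplodeRepeated and the field names are illustrative only, and plain maps stand in for protobuf or Avro records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ExplodeRepeated {
    /**
     * Returns one row per element of the list stored under listField,
     * copying all other fields of the record into every row.
     * A record whose listField is missing, empty, or not a list
     * comes back as a single unchanged row.
     */
    public static List<Map<String, Object>> explode(Map<String, Object> record, String listField) {
        List<Map<String, Object>> rows = new ArrayList<>();
        Object value = record.get(listField);
        if (!(value instanceof List) || ((List<?>) value).isEmpty()) {
            rows.add(new LinkedHashMap<>(record));
            return rows;
        }
        for (Object element : (List<?>) value) {
            Map<String, Object> row = new LinkedHashMap<>(record);
            // Replace the whole list with one of its elements.
            row.put(listField, element);
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumption: field names loosely follow the sample structure in the question.
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("Attribute1", "abc");
        record.put("MsgNode-3", Arrays.asList(1, 2, 3));
        for (Map<String, Object> row : explode(record, "MsgNode-3")) {
            System.out.println(row);
        }
    }
}
```

Whether to explode or to keep the list in a single row depends on how the data will be queried. Exploding repeats the parent fields in every row, which makes row-level filtering simple at the cost of duplicated values.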
I want to read the schema of a keyspace in Cassandra.
I know that in cassandra-cli we can execute the following command to get the schema:
show schema keyspace1;
But I want to read the schema from a remote machine using Java.
How can I solve this?
I solved this by using the Thrift client:
KsDef keyspaceDefinition = _client.describe_keyspace(_keyspace);
List<CfDef> columnDefinition = keyspaceDefinition.getCf_defs();
Here the keyspace definition contains the whole schema, so from that KsDef we can read whatever we want. In my case I want the metadata, so I read the column metadata from the column family definitions above, as shown below.
for (int i = 0; i < columnDefinition.size(); i++) {
    List<ColumnDef> columnMetadata = columnDefinition.get(i).getColumn_metadata();
    for (int j = 0; j < columnMetadata.size(); j++) {
        columnfamilyNames.add(columnDefinition.get(i).getName());
        columnNames.add(new String((columnMetadata.get(j).getName())));
        validationClasses.add(columnMetadata.get(j).getValidation_class());
        //ar.add(coldef.get(i).getName()+"\t"+bb_to_str(colmeta.get(j).getName())+"\t"+colmeta.get(j).getValidationClass());
    }
}
Here columnfamilyNames, columnNames and validationClasses are ArrayLists.