I created a streaming Dataflow pipeline in Python and just want to confirm that the code below does what I expect. This is what I intend to do:
Consume from Pub/Sub continuously
Batch load into BigQuery every 1 minute instead of using streaming inserts, to bring down the cost
This is the code snippet in Python:
options = PipelineOptions(
subnetwork=SUBNETWORK,
service_account_email=SERVICE_ACCOUNT_EMAIL,
use_public_ips=False,
streaming=True,
project=project,
region=REGION,
staging_location=STAGING_LOCATION,
temp_location=TEMP_LOCATION,
job_name=f"pub-sub-to-big-query-xxx-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
)
p = beam.Pipeline(DataflowRunner(), options=options)
pubsub = (
p
| "Read Topic" >> ReadFromPubSub(topic=INPUT_TOPIC)
| "To Dict" >> Map(json.loads)
| "Write To BigQuery" >> WriteToBigQuery(table=TABLE, schema=schema, method='FILE_LOADS',
triggering_frequency=60, max_files_per_bundle=1,
create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
write_disposition=BigQueryDisposition.WRITE_APPEND))
May I know if the above code does what I intend: stream from Pub/Sub and, every 60 seconds, batch load into BigQuery? I purposely set max_files_per_bundle to 1 to prevent more than one shard from being created, so that only one file is loaded every minute, but I am not sure I am doing it right. There is a withNumFileShards option in the Java version, but I could not find the equivalent in Python. I referred to the documentation below:
https://beam.apache.org/releases/pydoc/2.31.0/apache_beam.io.gcp.bigquery.html#apache_beam.io.gcp.bigquery.WriteToBigQuery
https://cloud.google.com/blog/products/data-analytics/how-to-efficiently-process-both-real-time-and-aggregate-data-with-dataflow
I am also curious whether I should use windowing to achieve what I intend:
options = PipelineOptions(
subnetwork=SUBNETWORK,
service_account_email=SERVICE_ACCOUNT_EMAIL,
use_public_ips=False,
streaming=True,
project=project,
region=REGION,
staging_location=STAGING_LOCATION,
temp_location=TEMP_LOCATION,
job_name=f"pub-sub-to-big-query-xxx-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
)
p = beam.Pipeline(DataflowRunner(), options=options)
pubsub = (
p
| "Read Topic" >> ReadFromPubSub(topic=INPUT_TOPIC)
| "To Dict" >> Map(json.loads)
| 'Window' >> beam.WindowInto(window.FixedWindows(60), trigger=AfterProcessingTime(60),
accumulation_mode=AccumulationMode.DISCARDING)
| "Write To BigQuery" >> WriteToBigQuery(table=TABLE, schema=schema, method='FILE_LOADS',
triggering_frequency=60, max_files_per_bundle=1,
create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
write_disposition=BigQueryDisposition.WRITE_APPEND))
Is the first method good enough without the windowing in the second? I am using the first method now, but I am not sure whether, every minute, it performs multiple loads from multiple files or actually merges all the Pub/Sub messages and does a single bulk load.
Thank you!
Not the Python solution, but I resorted to the Java version in the end:
public static PTransform<PCollection<String>, PCollection<TableRow>> jsonToTableRow() {
return new JsonToTableRow();
}
private static class JsonToTableRow
extends PTransform<PCollection<String>, PCollection<TableRow>> {
@Override
public PCollection<TableRow> expand(PCollection<String> stringPCollection) {
return stringPCollection.apply("JsonToTableRow", MapElements.via(
new SimpleFunction<String, TableRow>() {
@Override
public TableRow apply(String json) {
try {
InputStream inputStream =
new ByteArrayInputStream(json.getBytes(StandardCharsets.UTF_8));
return TableRowJsonCoder.of().decode(inputStream, Context.OUTER);
} catch (IOException e) {
throw new RuntimeException("Unable to parse input", e);
}
}
}));
}
}
public static void main(String[] args) {
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
options.setStreaming(true);
options.setDiskSizeGb(10);
Pipeline pipeline = Pipeline.create(options);
pipeline.apply("Read from PubSub", PubsubIO.readStrings().fromTopic(options.getInputTopic()))
.apply(jsonToTableRow())
.apply("WriteToBigQuery", BigQueryIO.writeTableRows().to(options.getOutputTable())
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withTriggeringFrequency(Duration.standardMinutes(1))
.withNumFileShards(1)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
pipeline.run();
I'm using a LongAccumulator to count the number of records that I save in Cassandra.
object Main extends App {
val conf = args(0)
val ssc = StreamingContext.getStreamingContext(conf)
Runner.apply(conf).startJob(ssc)
StreamingContext.startStreamingContext(ssc)
StreamingContext.stopStreamingContext(ssc)
}
class Runner (conf: Conf) {
override def startJob(ssc: StreamingContext): Unit = {
accTotal = ssc.sparkContext.longAccumulator("total")
val inputKafka = createDirectStream(ssc, kafkaParams, topicsSet)
val rddAvro = inputKafka.map{x => x.value()}
saveToCassandra(rddAvro)
println("XXX:" + accTotal.value) //-->0
}
def saveToCassandra(upserts: DStream[Data]) = {
val rddCassandraUpsert = upserts.map {
record =>
accTotal.add(1)
println("ACC: " + accTotal.value) // --> 1, 2, 3, 4 ... OK. The Spark web UI shows it too.
DataExt(record.data,
record.data1)}
rddCassandraUpsert.saveToCassandra(keyspace, table)
}
}
I can see that the code executes correctly and the data is saved in Cassandra, but when I finally print the accumulator its value is 0; if I print it inside the map function, I can see the right values. Why?
I'm using Spark 2.0.2 and executing from IntelliJ in local mode. I have checked the Spark web UI and I can see the accumulator being updated.
The problem is probably here:
object Main extends App {
...
Spark doesn't support applications extending App; doing so can result in non-deterministic behavior:
Note that applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.
You should always use a standard application with a main method:
object Main {
def main(args: Array[String]) {
...
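In this case, a minimal sketch of the question's Main rewritten that way might look like the following (Runner and the StreamingContext helper methods are assumed to be the question's own classes):
object Main {
  def main(args: Array[String]): Unit = {
    // Same flow as before, just without the delayed initialization of scala.App
    val conf = args(0)
    val ssc = StreamingContext.getStreamingContext(conf)
    Runner.apply(conf).startJob(ssc)
    StreamingContext.startStreamingContext(ssc)
    StreamingContext.stopStreamingContext(ssc)
  }
}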
I'm running an EMR cluster with S3 client-side encryption enabled, using a custom key provider. But now I need to write data to multiple S3 destinations using different encryption schemes:
CSE custom key provider
CSE-KMS
Is it possible to configure EMR to use both encryption types by defining some kind of mapping between s3 bucket and encryption type?
Alternatively, since I use Spark Structured Streaming to process and write data to S3, I'm wondering whether it's possible to disable encryption on EMRFS but then enable CSE for each stream separately?
The idea is to support any file system scheme and configure each one individually. For example:
# custom encryption key provider
fs.s3x.cse.enabled = true
fs.s3x.cse.materialsDescription.enabled = true
fs.s3x.cse.encryptionMaterialsProvider = my.company.fs.encryption.CustomKeyProvider
#no encryption
fs.s3u.cse.enabled = false
#AWS KMS
fs.s3k.cse.enabled = true
fs.s3k.cse.encryptionMaterialsProvider = com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider
fs.s3k.cse.kms.keyId = some-kms-id
And then use it in Spark like this:
StreamingQuery writeStream = session
.readStream()
.schema(RecordSchema.fromClass(TestRecord.class))
.option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
.option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
.csv("s3x://aws-s3-bucket/input")
.as(Encoders.bean(TestRecord.class))
.writeStream()
.outputMode(OutputMode.Append())
.format("parquet")
.option("path", "s3k://aws-s3-bucket/output")
.option("checkpointLocation", "s3u://aws-s3-bucket/checkpointing")
.start();
To handle this I've implemented a custom Hadoop file system (extending org.apache.hadoop.fs.FileSystem) that delegates calls to the real file system but with a modified configuration.
// Create delegate FS
this.config.set("fs.s3n.impl", "com.amazon.ws.emr.hadoop.fs.EmrFileSystem");
this.config.set("fs.s3n.impl.disable.cache", Boolean.toString(true));
this.delegatingFs = FileSystem.get(s3nURI(originalUri, SCHEME_S3N), substituteS3Config(conf));
The configuration passed to the delegating file system should take all of the original settings and replace any occurrence of fs.s3*. with fs.s3n.:
private Configuration substituteS3Config(final Configuration conf) {
if (conf == null) return null;
final String fsSchemaPrefix = "fs." + getScheme() + ".";
final String fsS3SchemaPrefix = "fs.s3.";
final String fsSchemaImpl = "fs." + getScheme() + ".impl";
Configuration substitutedConfig = new Configuration(conf);
for (Map.Entry<String, String> configEntry : conf) {
String propName = configEntry.getKey();
if (!fsSchemaImpl.equals(propName)
&& propName.startsWith(fsSchemaPrefix)) {
final String newPropName = propName.replace(fsSchemaPrefix, fsS3SchemaPrefix);
LOG.info("Substituting property '{}' with '{}'", propName, newPropName);
substitutedConfig.set(newPropName, configEntry.getValue());
}
}
return substitutedConfig;
}
Besides that, make sure the delegating FS receives URIs and paths with the scheme it supports and returns paths with the custom scheme:
@Override
public FileStatus getFileStatus(final Path f) throws IOException {
FileStatus status = this.delegatingFs.getFileStatus(s3Path(f));
if (status != null) {
status.setPath(customS3Path(status.getPath()));
}
return status;
}
private Path s3Path(final Path p) {
if (p.toUri() != null && getScheme().equals(p.toUri().getScheme())) {
return new Path(s3nURI(p.toUri(), SCHEME_S3N));
}
return p;
}
private Path customS3Path(final Path p) {
if (p.toUri() != null && !getScheme().equals(p.toUri().getScheme())) {
return new Path(s3nURI(p.toUri(), getScheme()));
}
return p;
}
private URI s3nURI(final URI originalUri, final String newScheme) {
try {
return new URI(
newScheme,
originalUri.getUserInfo(),
originalUri.getHost(),
originalUri.getPort(),
originalUri.getPath(),
originalUri.getQuery(),
originalUri.getFragment());
} catch (URISyntaxException e) {
LOG.warn("Unable to convert URI {} to {} scheme", originalUri, newScheme);
}
return originalUri;
}
The final step is to register the custom file systems with Hadoop (spark-defaults classification):
spark.hadoop.fs.s3x.impl = my.company.fs.DynamicS3FileSystem
spark.hadoop.fs.s3u.impl = my.company.fs.DynamicS3FileSystem
spark.hadoop.fs.s3k.impl = my.company.fs.DynamicS3FileSystem
When you use EMRFS, you can specify per-bucket configs in the format:
fs.s3.bucket.<bucket name>.<some.configuration>
So, for example, to turn off CSE except for a bucket s3://foobar, you can set:
{
  "Classification": "emrfs-site",
  "Properties": {
    "fs.s3.cse.enabled": "false",
    "fs.s3.bucket.foobar.cse.enabled": "true",
    [your other configs as usual]
  }
}
Please note that it must be fs.s3 and not fs.{arbitrary-scheme} like fs.s3n.
I can't speak for Amazon EMR, but on Hadoop's S3A connector you can set the encryption policy on a bucket-by-bucket basis. However, S3A doesn't support client-side encryption, because it breaks fundamental assumptions about file lengths (the amount of data you can read MUST equal the length reported in a directory listing/getFileStatus call).
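For reference, per-bucket overrides in S3A use a fs.s3a.bucket.<bucket-name>. prefix; a hypothetical example for server-side encryption (the exact property names depend on your Hadoop release) would look something like:
# default for all buckets
fs.s3a.server-side-encryption-algorithm = AES256
# override for one bucket
fs.s3a.bucket.foobar.server-side-encryption-algorithm = SSE-KMS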
I expect Amazon to do something similar. You may be able to create a custom Hadoop Configuration object with the different settings and use that to retrieve the filesystem instance used to save things. Tricky in Spark, though.
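A rough sketch of that idea in Scala, reusing the EMRFS CSE property names from the question ("spark" is assumed to be an existing SparkSession, and the bucket path is a placeholder); whether EMRFS honours a per-instance Configuration like this is an assumption you would need to verify:
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Copy the job's Hadoop configuration so the shared one is not mutated
val kmsConf = new Configuration(spark.sparkContext.hadoopConfiguration)
kmsConf.set("fs.s3.cse.enabled", "true")
kmsConf.set("fs.s3.cse.encryptionMaterialsProvider",
  "com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider")
kmsConf.set("fs.s3.cse.kms.keyId", "some-kms-id")
// Bypass the FileSystem cache so we don't get an instance built with other settings
kmsConf.setBoolean("fs.s3.impl.disable.cache", true)

val fs = FileSystem.get(new URI("s3://aws-s3-bucket/output"), kmsConf)
As noted, wiring this into a Structured Streaming sink is the tricky part, since the sink resolves its own file system from the output path and the job-wide configuration.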
I have parquet data partitioned by date & hour, folder structure:
events_v3
-- event_date=2015-01-01
-- event_hour=2015-01-1
-- part10000.parquet.gz
-- event_date=2015-01-02
-- event_hour=5
-- part10000.parquet.gz
I have created a table raw_events via Spark, but when I query it, it scans all the directories for footers, which slows down the initial query, even if I am querying only one day's worth of data.
query:
select * from raw_events where event_date='2016-01-01'
A similar problem: http://mail-archives.apache.org/mod_mbox/spark-user/201508.mbox/%3CCAAswR-7Qbd2tdLSsO76zyw9tvs-Njw2YVd36bRfCG3DKZrH0tw#mail.gmail.com%3E (but it's old)
Log:
App > 16/09/15 03:14:03 main INFO HadoopFsRelation: Listing leaf files and directories in parallel under: s3a://bucket/events_v3/
and then it spawns 350 tasks, since there are 350 days' worth of data.
I have disabled schema merging and have also specified the schema to read, so it should be able to go straight to the partition I am looking at. Why does it list all the leaf files?
Listing the leaf files with 2 executors takes 10 minutes, while the actual query execution takes only 20 seconds.
code sample:
val sparkSession = org.apache.spark.sql.SparkSession.builder.getOrCreate()
val df = sparkSession.read.option("mergeSchema","false").format("parquet").load("s3a://bucket/events_v3")
df.createOrReplaceTempView("temp_events")
sparkSession.sql(
"""
|select verb,count(*) from temp_events where event_date = "2016-01-01" group by verb
""".stripMargin).show()
As soon as Spark is given a directory to read from, it issues a call to listLeafFiles (org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala). This in turn calls fs.listStatus, which makes an API call to get the list of files and directories. The method is then called again for each directory, recursively, until no directories are left. By design this works well on HDFS, but poorly on S3, since listing files is an RPC call. S3, on the other hand, supports getting all files by prefix, which is exactly what we need.
So, for example, if we had the above directory structure with one year's worth of data, a directory per hour, and 10 subdirectories in each, we would make 365 * 24 * 10 = 87k API calls; this can be reduced to 138 API calls, given that there are only 137,000 files and each S3 API call returns up to 1,000 keys.
Code:
org/apache/hadoop/fs/s3a/S3AFileSystem.java
public FileStatus[] listStatusRecursively(Path f) throws FileNotFoundException,
IOException {
String key = pathToKey(f);
if (LOG.isDebugEnabled()) {
LOG.debug("List status for path: " + f);
}
final List<FileStatus> result = new ArrayList<FileStatus>();
final FileStatus fileStatus = getFileStatus(f);
if (fileStatus.isDirectory()) {
if (!key.isEmpty()) {
key = key + "/";
}
ListObjectsRequest request = new ListObjectsRequest();
request.setBucketName(bucket);
request.setPrefix(key);
request.setMaxKeys(maxKeys);
if (LOG.isDebugEnabled()) {
LOG.debug("listStatus: doing listObjects for directory " + key);
}
ObjectListing objects = s3.listObjects(request);
statistics.incrementReadOps(1);
while (true) {
for (S3ObjectSummary summary : objects.getObjectSummaries()) {
Path keyPath = keyToPath(summary.getKey()).makeQualified(uri, workingDir);
// Skip over keys that are ourselves and old S3N _$folder$ files
if (keyPath.equals(f) || summary.getKey().endsWith(S3N_FOLDER_SUFFIX)) {
if (LOG.isDebugEnabled()) {
LOG.debug("Ignoring: " + keyPath);
}
continue;
}
if (objectRepresentsDirectory(summary.getKey(), summary.getSize())) {
result.add(new S3AFileStatus(true, true, keyPath));
if (LOG.isDebugEnabled()) {
LOG.debug("Adding: fd: " + keyPath);
}
} else {
result.add(new S3AFileStatus(summary.getSize(),
dateToLong(summary.getLastModified()), keyPath,
getDefaultBlockSize(f.makeQualified(uri, workingDir))));
if (LOG.isDebugEnabled()) {
LOG.debug("Adding: fi: " + keyPath);
}
}
}
for (String prefix : objects.getCommonPrefixes()) {
Path keyPath = keyToPath(prefix).makeQualified(uri, workingDir);
if (keyPath.equals(f)) {
continue;
}
result.add(new S3AFileStatus(true, false, keyPath));
if (LOG.isDebugEnabled()) {
LOG.debug("Adding: rd: " + keyPath);
}
}
if (objects.isTruncated()) {
if (LOG.isDebugEnabled()) {
LOG.debug("listStatus: list truncated - getting next batch");
}
objects = s3.listNextBatchOfObjects(objects);
statistics.incrementReadOps(1);
} else {
break;
}
}
} else {
if (LOG.isDebugEnabled()) {
LOG.debug("Adding: rd (not a dir): " + f);
}
result.add(fileStatus);
}
return result.toArray(new FileStatus[result.size()]);
}
/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala
def listLeafFiles(fs: FileSystem, status: FileStatus, filter: PathFilter): Array[FileStatus] = {
logTrace(s"Listing ${status.getPath}")
val name = status.getPath.getName.toLowerCase
if (shouldFilterOut(name)) {
Array.empty[FileStatus]
}
else {
val statuses = {
val stats = if(fs.isInstanceOf[S3AFileSystem]){
logWarning("Using Monkey patched version of list status")
println("Using Monkey patched version of list status")
val a = fs.asInstanceOf[S3AFileSystem].listStatusRecursively(status.getPath)
a
// Array.empty[FileStatus]
}
else{
val (dirs, files) = fs.listStatus(status.getPath).partition(_.isDirectory)
files ++ dirs.flatMap(dir => listLeafFiles(fs, dir, filter))
}
if (filter != null) stats.filter(f => filter.accept(f.getPath)) else stats
}
// statuses do not have any dirs.
statuses.filterNot(status => shouldFilterOut(status.getPath.getName)).map {
case f: LocatedFileStatus => f
// NOTE:
//
// - Although S3/S3A/S3N file system can be quite slow for remote file metadata
// operations, calling `getFileBlockLocations` does no harm here since these file system
// implementations don't actually issue RPC for this method.
//
// - Here we are calling `getFileBlockLocations` in a sequential manner, but it should not
// be a big deal since we always use to `listLeafFilesInParallel` when the number of
// paths exceeds threshold.
case f => createLocatedFileStatus(f, fs.getFileBlockLocations(f, 0, f.getLen))
}
}
}
To clarify Gaurav's answer, that code snippet is from Hadoop branch-2 and probably isn't going to surface until Hadoop 2.9 (see HADOOP-13208); someone also needs to update Spark to use that feature (which won't harm code using HDFS, it just won't show any speedup there).
One thing to consider is: what makes a good file layout for Object Stores.
Don't have deep directory trees with only a few files per directory
Do have shallow trees with many files
Consider using the first few characters of a file name for the most rapidly changing value (such as day/hour), rather than the last. Why? Some object stores appear to use the leading characters for their hashing, not the trailing ones; if you give your names more uniqueness, they get spread out over more servers, with better bandwidth and less risk of throttling.
If you are using the Hadoop 2.7 libraries, switch to s3a:// over s3n://. It's already faster, and getting better every week, at least in the ASF source tree.
Finally, Apache Hadoop, Apache Spark and related projects are all open source. Contributions are welcome. That's not just the code, it's documentation, testing, and, for this performance stuff, testing against your actual datasets. Even giving us details about what causes problems (and your dataset layouts) is interesting.
I would like to write a system Groovy script which inspects the queued jobs in Jenkins and extracts the build parameters (and the build cause, as a bonus) supplied when the job was scheduled. Ideas?
Specifically:
def q = Jenkins.instance.queue
q.items.each { println it.task.name }
retrieves the queued items. I can't for the life of me figure out where the build parameters live.
The closest I am getting is this:
def q = Jenkins.instance.queue
q.items.each {
println("${it.task.name}:")
it.task.properties.each { key, val ->
println(" ${key}=${val}")
}
}
This gets me this:
4.1.next-build-launcher:
com.sonyericsson.jenkins.plugins.bfa.model.ScannerJobProperty$ScannerJobPropertyDescriptor#b299407=com.sonyericsson.jenkins.plugins.bfa.model.ScannerJobProperty#5e04bfd7
com.chikli.hudson.plugin.naginator.NaginatorOptOutProperty$DescriptorImpl#40d04eaa=com.chikli.hudson.plugin.naginator.NaginatorOptOutProperty#16b308db
hudson.model.ParametersDefinitionProperty$DescriptorImpl#b744c43=hudson.model.ParametersDefinitionProperty#440a6d81
...
The params property of the queue element itself contains a string with the parameters in a property file format -- key=value with multiple parameters separated by newlines.
def q = Jenkins.instance.queue
q.items.each {
println("${it.task.name}:")
println("Parameters: ${it.params}")
}
yields:
dbacher params:
Parameters:
MyParameter=Hello world
BoolParameter=true
I'm no Groovy expert, but when exploring the Jenkins scripting interface, I've found the following functions to be very helpful:
def showProps(inst, prefix="Properties:") {
println prefix
for (prop in inst.properties) {
def pc = ""
if (prop.value != null) {
pc = prop.value.class
}
println(" $prop.key : $prop.value ($pc)")
}
}
def showMethods(inst, prefix="Methods:") {
println prefix
inst.metaClass.methods.name.unique().each {
println " $it"
}
}
The showProps function reveals that the queue element has another property named causes that you'll need to do some more decoding on:
causes : [hudson.model.Cause$UserIdCause#56af8f1c] (class java.util.Collections$UnmodifiableRandomAccessList)
I have an application for geolocation; I retrieve the current geoposition, but displaying it in the application is VERY slow...
The constructor:
public TaskGeo()
{
InitializeComponent();
_geolocator = new Geolocator();
_geolocator.DesiredAccuracy = PositionAccuracy.High;
_geolocator.MovementThreshold = 100;
_geolocator.PositionChanged += _geolocator_PositionChanged;
_geolocator.StatusChanged += _geolocator_StatusChanged;
if (_geolocator.LocationStatus == PositionStatus.Disabled)
this.DisplayNeedGPS();
}
The code for the display in the app:
void _geolocator_PositionChanged(Geolocator sender, PositionChangedEventArgs args)
{
// saving and display of the position
App.RootFrame.Dispatcher.BeginInvoke(() =>
{
this._CurrentPosition = args.Position;
this.lblLon.Text = "Lon: " + this._CurrentPosition.Coordinate.Longitude;
this.lblLat.Text = "Lat: " + this._CurrentPosition.Coordinate.Latitude;
this.LocationChanged(this._CurrentPosition.Coordinate.Latitude, this._CurrentPosition.Coordinate.Longitude);
});
}
And the code for the query:
private void LocationChanged(double lat, double lon)
{
ReverseGeocodeQuery rgq = new ReverseGeocodeQuery();
rgq.GeoCoordinate = new GeoCoordinate(lat, lon);
rgq.QueryCompleted += rgq_QueryCompleted;
rgq.QueryAsync();
}
How can I improve the code to display the position faster? Thanks in advance!
Getting this sort of information is basically pretty slow. To quote the great Louis C.K.: "It is going to space, give it a second." Because you've specified PositionAccuracy.High, the location must be found using GPS, which is comparatively slow, rather than one of the faster fallback methods such as local Wi-Fi or cell towers.
You could reduce your accuracy demands overall, or initially request a lower accuracy and then refine it once the information from the GPS is available. The second option is better. If you look at a map application, it typically does this by showing roughly where you are and then improving the position once a GPS lock is acquired.