EMR with multiple encryption key providers - apache-spark

I'm running an EMR cluster with S3 client-side encryption enabled using a custom key provider. But now I need to write data to multiple S3 destinations that use different encryption schemes:
CSE custom key provider
CSE-KMS
Is it possible to configure EMR to use both encryption types by defining some kind of mapping between s3 bucket and encryption type?
Alternatively, since I use Spark Structured Streaming to process and write data to S3, I'm wondering whether it's possible to disable encryption on EMRFS but then enable CSE for each stream separately?

The idea is to support any file system scheme and configure each one individually. For example:
# custom encryption key provider
fs.s3x.cse.enabled = true
fs.s3x.cse.materialsDescription.enabled = true
fs.s3x.cse.encryptionMaterialsProvider = my.company.fs.encryption.CustomKeyProvider
#no encryption
fs.s3u.cse.enabled = false
#AWS KMS
fs.s3k.cse.enabled = true
fs.s3k.cse.encryptionMaterialsProvider = com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider
fs.s3k.cse.kms.keyId = some-kms-id
And then use it in Spark like this:
StreamingQuery writeStream = session
.readStream()
.schema(RecordSchema.fromClass(TestRecord.class))
.option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
.option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
.csv("s3x://aws-s3-bucket/input")
.as(Encoders.bean(TestRecord.class))
.writeStream()
.outputMode(OutputMode.Append())
.format("parquet")
.option("path", “s3k://aws-s3-bucket/output”)
.option("checkpointLocation", “s3u://aws-s3-bucket/checkpointing”)
.start();
To handle this I've implemented a custom Hadoop file system (extending org.apache.hadoop.fs.FileSystem) that delegates calls to the real file system but with modified configuration.
// Create delegate FS
this.config.set("fs.s3n.impl", “com.amazon.ws.emr.hadoop.fs.EmrFileSystem”);
this.config.set("fs.s3n.impl.disable.cache", Boolean.toString(true));
this.delegatingFs = FileSystem.get(s3nURI(originalUri, SCHEME_S3N), substituteS3Config(conf));
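For context, here is a hedged sketch of how that delegate creation might sit inside the custom file system's initialize() override. Everything except the three lines above is an assumption (field names, getScheme() handling), not taken from the original code:

// Inside my.company.fs.DynamicS3FileSystem (sketch; the remaining FileSystem
// methods delegate to delegatingFs, as shown further down in this answer).
private static final String SCHEME_S3N = "s3n";

private Configuration config;
private FileSystem delegatingFs;
private URI uri;
private String scheme;

@Override
public void initialize(final URI originalUri, final Configuration conf) throws IOException {
    super.initialize(originalUri, conf);
    this.uri = originalUri;
    this.scheme = originalUri.getScheme(); // s3x, s3u or s3k, depending on how it was registered
    this.config = conf;
    // Create delegate FS (the snippet above)
    this.config.set("fs.s3n.impl", "com.amazon.ws.emr.hadoop.fs.EmrFileSystem");
    this.config.set("fs.s3n.impl.disable.cache", Boolean.toString(true));
    this.delegatingFs = FileSystem.get(s3nURI(originalUri, SCHEME_S3N), substituteS3Config(conf));
}

@Override
public String getScheme() {
    return this.scheme;
}

@Override
public URI getUri() {
    return this.uri;
}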
The configuration passed to the delegating file system should take all of the original settings and replace any occurrence of the custom scheme prefix (fs.s3x. and the like) with fs.s3., which is what the code below does:
private Configuration substituteS3Config(final Configuration conf) {
if (conf == null) return null;
final String fsSchemaPrefix = "fs." + getScheme() + ".";
final String fsS3SchemaPrefix = "fs.s3.";
final String fsSchemaImpl = "fs." + getScheme() + ".impl";
Configuration substitutedConfig = new Configuration(conf);
for (Map.Entry<String, String> configEntry : conf) {
String propName = configEntry.getKey();
if (!fsSchemaImpl.equals(propName)
&& propName.startsWith(fsSchemaPrefix)) {
final String newPropName = propName.replace(fsSchemaPrefix, fsS3SchemaPrefix);
LOG.info("Substituting property '{}' with '{}'", propName, newPropName);
substitutedConfig.set(newPropName, configEntry.getValue());
}
}
return substitutedConfig;
}
Besides that, make sure the delegating FS receives URIs and paths with the scheme it supports (s3n here) and returns paths with the custom scheme:
@Override
public FileStatus getFileStatus(final Path f) throws IOException {
FileStatus status = this.delegatingFs.getFileStatus(s3Path(f));
if (status != null) {
status.setPath(customS3Path(status.getPath()));
}
return status;
}
private Path s3Path(final Path p) {
if (p.toUri() != null && getScheme().equals(p.toUri().getScheme())) {
return new Path(s3nURI(p.toUri(), SCHEME_S3N));
}
return p;
}
private Path customS3Path(final Path p) {
if (p.toUri() != null && !getScheme().equals(p.toUri().getScheme())) {
return new Path(s3nURI(p.toUri(), getScheme()));
}
return p;
}
private URI s3nURI(final URI originalUri, final String newScheme) {
try {
return new URI(
newScheme,
originalUri.getUserInfo(),
originalUri.getHost(),
originalUri.getPort(),
originalUri.getPath(),
originalUri.getQuery(),
originalUri.getFragment());
} catch (URISyntaxException e) {
LOG.warn("Unable to convert URI {} to {} scheme", originalUri, newScheme);
}
return originalUri;
}
The final step is to register the custom file system with Hadoop (spark-defaults classification):
spark.hadoop.fs.s3x.impl = my.company.fs.DynamicS3FileSystem
spark.hadoop.fs.s3u.impl = my.company.fs.DynamicS3FileSystem
spark.hadoop.fs.s3k.impl = my.company.fs.DynamicS3FileSystem
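If you apply this through an EMR configuration object, the same registration could look roughly like this (the JSON shape follows the usual EMR classification format):

{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.hadoop.fs.s3x.impl": "my.company.fs.DynamicS3FileSystem",
    "spark.hadoop.fs.s3u.impl": "my.company.fs.DynamicS3FileSystem",
    "spark.hadoop.fs.s3k.impl": "my.company.fs.DynamicS3FileSystem"
  }
}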

When you use EMRFS, you can specify per-bucket configs in the format:
fs.s3.bucket.<bucket name>.<some.configuration>
So, for example, to turn off CSE except for a bucket s3://foobar, you can set:
"Classification": "emrfs-site",
"Properties": {
"fs.s3.cse.enabled": "false",
"fs.s3.bucket.foobar.cse.enabled": "true",
[your other configs as usual]
}
Please note that it must be fs.s3 and not fs.{arbitrary-scheme} like fs.s3n.
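Applied to the original question, a sketch of an emrfs-site classification that keeps the custom key provider as the default and switches individual buckets over (the bucket names here are purely illustrative) might look like:

{
  "Classification": "emrfs-site",
  "Properties": {
    "fs.s3.cse.enabled": "true",
    "fs.s3.cse.encryptionMaterialsProvider": "my.company.fs.encryption.CustomKeyProvider",
    "fs.s3.bucket.kms-bucket.cse.encryptionMaterialsProvider": "com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider",
    "fs.s3.bucket.kms-bucket.cse.kms.keyId": "some-kms-id",
    "fs.s3.bucket.plain-bucket.cse.enabled": "false"
  }
}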

I can't speak for Amazon EMR, but on Hadoop's S3A connector you can set the encryption policy on a bucket-by-bucket basis. However, S3A doesn't support client-side encryption, because it breaks fundamental assumptions about file lengths (the amount of data you can read MUST equal the length reported in a directory listing/getFileStatus call).
I expect Amazon to do something similar. You may be able to create a custom Hadoop Configuration object with the different settings and use that to retrieve the filesystem instance used to save things. Tricky in Spark, though.
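For reference, the S3A per-bucket pattern looks like this (property names as of Hadoop 2.8+; the bucket name and key are placeholders, and only server-side encryption applies, for the reason above):

fs.s3a.bucket.encrypted-bucket.server-side-encryption-algorithm = SSE-KMS
fs.s3a.bucket.encrypted-bucket.server-side-encryption.key = <your-kms-key-id-or-arn>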

Related

How to Upload files with Spring GraphQlClient?

I'm looking at the docs, but they do not mention file upload.
The GraphQl Upload service I need to consume is this one.
Would it be possible with GraphQlClient?
I believe it is possible indeed.
You should build your GraphQlClient or reuse a previous one. Then, you should write the document (or refer to an existing one using documentName), specifying the variables required. Finally, you assign each variable a value, which can be any Object (in your case, a file; in my example, the Integer 1), using the variable method of the RequestSpec.
graphQlClient
.mutate()
.header("Authorization","Basic XXXXX")
.build()
.document("""
query artistaPorId($unId:ID){
artistaPorId(id:$unId){
apellido
}
}
""")
.variable("unId",1)
.retrieve("artistaPorId")
.toEntity(Artista.class)
.subscribe( // handle onNext, onError, etc
);
My schema.graphqls:
type Query {
artistaPorId(id:ID): Artista
}
type Artista {
id: ID
apellido: String
estilo: String
}
I have figured it out with the help of this answer.
This uses Spring WebClient; I just need to figure out how to extract the data from the response (it's a GraphQL response).
var variables = new LinkedHashMap<String, Object>();
variables.put("cartao", cartao);
variables.put("mime", MediaType.APPLICATION_PDF_VALUE);
variables.put("file", null);
var query = StreamUtils.copyToString(new ClassPathResource("/graphql-documents/UploadPrescriptionMutation.graphql").getInputStream(),
Charset.defaultCharset());
var params = new LinkedHashMap<String, Object>();
params.put("operationName", "upload");
params.put("query", query);
params.put("variables", variables);
MultipartBodyBuilder builder = new MultipartBodyBuilder();
builder.part("operations", objectMapper.writeValueAsString(params));
builder.part("map", "{\"uploaded_file\": [\"variables.file\"]}");
builder.part("uploaded_file", new FileSystemResource(file));
return webClient.post()
.headers(h -> h.setBearerAuth(token))
.contentType(MediaType.MULTIPART_FORM_DATA)
.body(BodyInserters.fromMultipartData(builder.build()))
.retrieve()
.bodyToMono(Prescription_AddPrescriptionDto.class)
.block();
GraphQL file:
mutation upload($file: Prescription_Upload!, $cartao: String!, $mime: String!) {
Prescription_uploadPrescription(
file: $file uploadInfo: {
cardNumber: $cartao
source: UNKNOWN
mime: $mime })
{
id
uploadedAt
deletedAt
expiresAt
source
file
name
cardNumber
identity
status
storage
originalName
extension
}
}
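Regarding extracting the data: a GraphQL HTTP response wraps the payload in a data object (plus an optional errors array), so one option is to deserialize into a small wrapper instead of the DTO directly. A hedged sketch (the wrapper class is an assumption; the field name comes from the mutation above):

import java.util.List;
import java.util.Map;
import com.fasterxml.jackson.annotation.JsonProperty;

// Matches the standard GraphQL envelope:
// { "data": { "Prescription_uploadPrescription": { ... } }, "errors": [ ... ] }
public class UploadPrescriptionResponse {

    public Data data;
    public List<Map<String, Object>> errors;

    public static class Data {
        @JsonProperty("Prescription_uploadPrescription")
        public Prescription_AddPrescriptionDto prescription;
    }
}

You would then swap .bodyToMono(Prescription_AddPrescriptionDto.class) for .bodyToMono(UploadPrescriptionResponse.class) and map the result to response.data.prescription (checking errors first).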

Pub/Sub to BigQuery (Batch) using Dataflow (Python)

I created a streaming Dataflow pipeline in Python and just want to confirm that my code below does what I expect. This is what I intend to do:
Consume from Pub/Sub continuously
Batch load into BigQuery every 1 minute instead of streaming to bring down the cost
This is the code snippet in Python
options = PipelineOptions(
subnetwork=SUBNETWORK,
service_account_email=SERVICE_ACCOUNT_EMAIL,
use_public_ips=False,
streaming=True,
project=project,
region=REGION,
staging_location=STAGING_LOCATION,
temp_location=TEMP_LOCATION,
job_name=f"pub-sub-to-big-query-xxx-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
)
p = beam.Pipeline(DataflowRunner(), options=options)
pubsub = (
p
| "Read Topic" >> ReadFromPubSub(topic=INPUT_TOPIC)
| "To Dict" >> Map(json.loads)
| "Write To BigQuery" >> WriteToBigQuery(table=TABLE, schema=schema, method='FILE_LOADS',
triggering_frequency=60, max_files_per_bundle=1,
create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
write_disposition=BigQueryDisposition.WRITE_APPEND))
May I know if the above code does what I intend: stream from Pub/Sub and, every 60 seconds, batch-load into BigQuery? I purposely set max_files_per_bundle to 1 to prevent more than one shard from being created, so that only one file is loaded every minute, but I'm not sure I'm doing it right. There is a withNumFileShards option in the Java SDK, but I could not find the equivalent in Python. I referred to the documentation below:
https://beam.apache.org/releases/pydoc/2.31.0/apache_beam.io.gcp.bigquery.html#apache_beam.io.gcp.bigquery.WriteToBigQuery
https://cloud.google.com/blog/products/data-analytics/how-to-efficiently-process-both-real-time-and-aggregate-data-with-dataflow
Just curious if I should use windowing to achieve what I intend to do?
options = PipelineOptions(
subnetwork=SUBNETWORK,
service_account_email=SERVICE_ACCOUNT_EMAIL,
use_public_ips=False,
streaming=True,
project=project,
region=REGION,
staging_location=STAGING_LOCATION,
temp_location=TEMP_LOCATION,
job_name=f"pub-sub-to-big-query-xxx-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
)
p = beam.Pipeline(DataflowRunner(), options=options)
pubsub = (
p
| "Read Topic" >> ReadFromPubSub(topic=INPUT_TOPIC)
| "To Dict" >> Map(json.loads)
| 'Window' >> beam.WindowInto(window.FixedWindows(60), trigger=AfterProcessingTime(60),
accumulation_mode=AccumulationMode.DISCARDING)
| "Write To BigQuery" >> WriteToBigQuery(table=TABLE, schema=schema, method='FILE_LOADS',
triggering_frequency=60, max_files_per_bundle=1,
create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
write_disposition=BigQueryDisposition.WRITE_APPEND))
Is the first method good enough without the windowing in the second one? I am using the first method now, but I am not sure whether, every minute, it performs multiple loads from multiple files or actually merges all the Pub/Sub messages and does a single bulk load.
Thank you!
Not a Python solution, but I resorted to the Java version in the end:
public static PTransform<PCollection<String>, PCollection<TableRow>> jsonToTableRow() {
return new JsonToTableRow();
}
private static class JsonToTableRow
extends PTransform<PCollection<String>, PCollection<TableRow>> {
@Override
public PCollection<TableRow> expand(PCollection<String> stringPCollection) {
return stringPCollection.apply("JsonToTableRow", MapElements.via(
new SimpleFunction<String, TableRow>() {
@Override
public TableRow apply(String json) {
try {
InputStream inputStream =
new ByteArrayInputStream(json.getBytes(StandardCharsets.UTF_8));
return TableRowJsonCoder.of().decode(inputStream, Context.OUTER);
} catch (IOException e) {
throw new RuntimeException("Unable to parse input", e);
}
}
}));
}
}
public static void main(String[] args) {
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
options.setStreaming(true);
options.setDiskSizeGb(10);
Pipeline pipeline = Pipeline.create(options);
pipeline.apply("Read from PubSub", PubsubIO.readStrings().fromTopic(options.getInputTopic()))
.apply(jsonToTableRow())
.apply("WriteToBigQuery", BigQueryIO.writeTableRows().to(options.getOutputTable())
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withTriggeringFrequency(Duration.standardMinutes(1))
.withNumFileShards(1)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
pipeline.run();
}
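The snippet above references an Options interface that isn't shown; a minimal sketch of what it might look like (assumed, not from the original answer) is:

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.Description;

// Hypothetical pipeline options backing the main() above; the streaming and
// disk-size setters come from the Dataflow options it extends.
public interface Options extends DataflowPipelineOptions {

    @Description("Pub/Sub topic to read from, e.g. projects/<project>/topics/<topic>")
    String getInputTopic();
    void setInputTopic(String value);

    @Description("BigQuery output table, e.g. <project>:<dataset>.<table>")
    String getOutputTable();
    void setOutputTable(String value);
}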

Deserialize json one record at a time

I am working with large JSON files and memory is a concern. I would like to read one object into memory at a time from the file. Is this possible?
The ServiceStack.Text docs say there is an API using a reader/stream, but I can't see how to get that working. The files are too large to deserialize in one go. Is it possible to handle this scenario with ServiceStack?
Thanks
No, you'll want to use a streaming JSON parser like System.Text.Json's Utf8JsonReader. This is the example from the System.Text.Json introductory page:
byte[] data = Encoding.UTF8.GetBytes(json);
Utf8JsonReader reader = new Utf8JsonReader(data, isFinalBlock: true, state: default);
while (reader.Read())
{
Console.Write(reader.TokenType);
switch (reader.TokenType)
{
case JsonTokenType.PropertyName:
case JsonTokenType.String:
{
string text = reader.GetString();
Console.Write(" ");
Console.Write(text);
break;
}
case JsonTokenType.Number:
{
int value = reader.GetInt32();
Console.Write(" ");
Console.Write(value);
break;
}
// Other token types elided for brevity
}
Console.WriteLine();
}

JOOQ keep setting schema to default PUBLIC

My build.gradle (using the nu.studer.jooq plugin):
jooq {
MyProject(sourceSets.main) {
generator {
database {
name = 'org.jooq.meta.extensions.ddl.DDLDatabase'
properties {
property {
key = 'scripts'
value = 'src/main/resources/database.sql'
}
}
inputSchema = ''
outputSchema = 'something'
// schemata {
// schema {
// inputSchema = "" // I've tried this too
// outputSchema = 'something'
// }
// }
forcedTypes {
forcedType {
name = 'varchar'
expression = '.*'
types = 'JSONB?'
}
forcedType {
name = 'varchar'
expression = '.*'
types = 'INET'
}
}
}
generate {
relations = true
springAnnotations = true
deprecated = false
fluentSetters = true
// ...
}
target {
packageName = 'com.springforum'
}
}
}
}
In the build process it can generate the schema just fine, but it keeps using the PUBLIC schema for the output even though I've set outputSchema (I've tried both an empty string and a non-empty string).
Update: The problem only happens if inputSchema is empty; I tried another SQL script that declares a schema and it works as intended.
This is a known issue that originates from the fact that behind the scenes DDLDatabase uses an H2 in-memory database to emulate running your SQL script, and then reverse engineers that H2 database. By default, in H2 (and a few other databases), everything goes in the PUBLIC schema. The issue is here: #7650
jOOQ 3.11 workaround
Currently (as of jOOQ 3.11), I suggest you either specify the schema in your DDL script explicitly, or use inputSchema = "PUBLIC", knowing the above.
jOOQ 3.12 solution
In jOOQ 3.12, this was fixed through #7759. It will be possible to specify the behaviour of unqualified schema objects:
<!-- The default schema for unqualified objects:
- public: all unqualified objects are located in the PUBLIC (upper case) schema
- none: all unqualified objects are located in the default schema (default)
This configuration can be overridden with the schema mapping feature -->
<property>
<key>unqualifiedSchema</key>
<value>none</value>
</property>
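If you are on jOOQ 3.12+ and using the nu.studer.jooq Gradle DSL from the question, the same flag can presumably be passed as another database property, alongside the existing scripts entry (hedged sketch):

properties {
    property {
        key = 'scripts'
        value = 'src/main/resources/database.sql'
    }
    property {
        key = 'unqualifiedSchema'
        value = 'none'
    }
}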

Spark lists all leaf node even in partitioned data

I have parquet data partitioned by date & hour, folder structure:
events_v3
  -- event_date=2015-01-01
    -- event_hour=2015-01-1
      -- part10000.parquet.gz
  -- event_date=2015-01-02
    -- event_hour=5
      -- part10000.parquet.gz
I have created a table raw_events via Spark, but when I try to query it, it scans all the directories for footers and that slows down the initial query, even if I am querying only one day's worth of data.
query:
select * from raw_events where event_date='2016-01-01'
Similar problem: http://mail-archives.apache.org/mod_mbox/spark-user/201508.mbox/%3CCAAswR-7Qbd2tdLSsO76zyw9tvs-Njw2YVd36bRfCG3DKZrH0tw#mail.gmail.com%3E (but it's old)
Log:
App > 16/09/15 03:14:03 main INFO HadoopFsRelation: Listing leaf files and directories in parallel under: s3a://bucket/events_v3/
and then it spawns 350 tasks since there are 350 days worth of data.
I have disabled schema merging and have also specified the schema to read, so it should be able to go straight to the partition I am looking at; why does it list all the leaf files?
Listing leaf files with 2 executors takes 10 minutes, and the actual query execution takes 20 seconds.
code sample:
val sparkSession = org.apache.spark.sql.SparkSession.builder.getOrCreate()
val df = sparkSession.read.option("mergeSchema","false").format("parquet").load("s3a://bucket/events_v3")
df.createOrReplaceTempView("temp_events")
sparkSession.sql(
"""
|select verb,count(*) from temp_events where event_date = "2016-01-01" group by verb
""".stripMargin).show()
As soon as Spark is given a directory to read from, it issues a call to listLeafFiles (org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala). This in turn calls fs.listStatus, which makes an API call to get the list of files and directories. For each directory this method is called again, recursively, until no directories are left. By design this works well on HDFS, but it works badly on S3, since each listing is an RPC call. S3, on the other hand, supports fetching all files by prefix, which is exactly what we need.
So, for example, if we had the above directory structure with one year's worth of data, a directory per hour and 10 subdirectories each, we would have 365 * 24 * 10 = 87k API calls; this can be reduced to 138 API calls given that there are only 137,000 files, since each S3 API call returns up to 1000 files.
Code:
org/apache/hadoop/fs/s3a/S3AFileSystem.java
public FileStatus[] listStatusRecursively(Path f) throws FileNotFoundException,
IOException {
String key = pathToKey(f);
if (LOG.isDebugEnabled()) {
LOG.debug("List status for path: " + f);
}
final List<FileStatus> result = new ArrayList<FileStatus>();
final FileStatus fileStatus = getFileStatus(f);
if (fileStatus.isDirectory()) {
if (!key.isEmpty()) {
key = key + "/";
}
ListObjectsRequest request = new ListObjectsRequest();
request.setBucketName(bucket);
request.setPrefix(key);
request.setMaxKeys(maxKeys);
if (LOG.isDebugEnabled()) {
LOG.debug("listStatus: doing listObjects for directory " + key);
}
ObjectListing objects = s3.listObjects(request);
statistics.incrementReadOps(1);
while (true) {
for (S3ObjectSummary summary : objects.getObjectSummaries()) {
Path keyPath = keyToPath(summary.getKey()).makeQualified(uri, workingDir);
// Skip over keys that are ourselves and old S3N _$folder$ files
if (keyPath.equals(f) || summary.getKey().endsWith(S3N_FOLDER_SUFFIX)) {
if (LOG.isDebugEnabled()) {
LOG.debug("Ignoring: " + keyPath);
}
continue;
}
if (objectRepresentsDirectory(summary.getKey(), summary.getSize())) {
result.add(new S3AFileStatus(true, true, keyPath));
if (LOG.isDebugEnabled()) {
LOG.debug("Adding: fd: " + keyPath);
}
} else {
result.add(new S3AFileStatus(summary.getSize(),
dateToLong(summary.getLastModified()), keyPath,
getDefaultBlockSize(f.makeQualified(uri, workingDir))));
if (LOG.isDebugEnabled()) {
LOG.debug("Adding: fi: " + keyPath);
}
}
}
for (String prefix : objects.getCommonPrefixes()) {
Path keyPath = keyToPath(prefix).makeQualified(uri, workingDir);
if (keyPath.equals(f)) {
continue;
}
result.add(new S3AFileStatus(true, false, keyPath));
if (LOG.isDebugEnabled()) {
LOG.debug("Adding: rd: " + keyPath);
}
}
if (objects.isTruncated()) {
if (LOG.isDebugEnabled()) {
LOG.debug("listStatus: list truncated - getting next batch");
}
objects = s3.listNextBatchOfObjects(objects);
statistics.incrementReadOps(1);
} else {
break;
}
}
} else {
if (LOG.isDebugEnabled()) {
LOG.debug("Adding: rd (not a dir): " + f);
}
result.add(fileStatus);
}
return result.toArray(new FileStatus[result.size()]);
}
/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala
def listLeafFiles(fs: FileSystem, status: FileStatus, filter: PathFilter): Array[FileStatus] = {
logTrace(s"Listing ${status.getPath}")
val name = status.getPath.getName.toLowerCase
if (shouldFilterOut(name)) {
Array.empty[FileStatus]
}
else {
val statuses = {
val stats = if(fs.isInstanceOf[S3AFileSystem]){
logWarning("Using Monkey patched version of list status")
println("Using Monkey patched version of list status")
val a = fs.asInstanceOf[S3AFileSystem].listStatusRecursively(status.getPath)
a
// Array.empty[FileStatus]
}
else{
val (dirs, files) = fs.listStatus(status.getPath).partition(_.isDirectory)
files ++ dirs.flatMap(dir => listLeafFiles(fs, dir, filter))
}
if (filter != null) stats.filter(f => filter.accept(f.getPath)) else stats
}
// statuses do not have any dirs.
statuses.filterNot(status => shouldFilterOut(status.getPath.getName)).map {
case f: LocatedFileStatus => f
// NOTE:
//
// - Although S3/S3A/S3N file system can be quite slow for remote file metadata
// operations, calling `getFileBlockLocations` does no harm here since these file system
// implementations don't actually issue RPC for this method.
//
// - Here we are calling `getFileBlockLocations` in a sequential manner, but it should not
// be a big deal since we always use to `listLeafFilesInParallel` when the number of
// paths exceeds threshold.
case f => createLocatedFileStatus(f, fs.getFileBlockLocations(f, 0, f.getLen))
}
}
}
To clarify Gaurav's answer, that code snippet is from Hadoop branch-2; it is probably not going to surface until Hadoop 2.9 (see HADOOP-13208), and someone needs to update Spark to use that feature (which won't harm code using HDFS, it just won't show any speedup there).
One thing to consider is: what makes a good file layout for Object Stores.
Don't have deep directory trees with only a few files per directory
Do have shallow trees with many files
Consider using the first few characters of a file for the most changing value (such as day/hour), rather than the last. Why? Some object stores appear to use the leading characters for their hashing, not the trailing ones ... if you give your names more uniqueness then they get spread out over more servers, with better bandwidth/less risk of throttling.
If you are using the Hadoop 2.7 libraries, switch to s3a:// over s3n://. It's already faster, and getting better every week, at least in the ASF source tree.
Finally, Apache Hadoop, Apache Spark and related projects are all open source. Contributions are welcome. That's not just the code, it's documentation, testing, and, for this performance stuff, testing against your actual datasets. Even giving us details about what causes problems (and your dataset layouts) is interesting.
