Operation APPEND failed with HTTP500? - azure

package org.apache.spark.examples.kafkaToflink;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import com.microsoft.azure.datalake.store.ADLException;
import com.microsoft.azure.datalake.store.ADLFileOutputStream;
import com.microsoft.azure.datalake.store.ADLStoreClient;
import com.microsoft.azure.datalake.store.IfExists;
import com.microsoft.azure.datalake.store.oauth2.AccessTokenProvider;
import com.microsoft.azure.datalake.store.oauth2.ClientCredsTokenProvider;
public class App {
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "192.168.1.72:9092");
properties.setProperty("group.id", "test");
DataStream<String> stream = env.addSource(
new FlinkKafkaConsumer010<String>("tenant", new SimpleStringSchema(), properties), "Kafka_Source");
stream.addSink(new ADLSink()).name("Custom_Sink").setParallelism(128);
env.execute("App");
}
}
class ADLSink<String> extends RichSinkFunction<String> {
private java.lang.String clientId = "***********";
private java.lang.String authTokenEndpoint = "***************";
private java.lang.String clientKey = "*****************";
private java.lang.String accountFQDN = "****************";
private java.lang.String filename = "/Bitfinex/ETHBTC/ORDERBOOK/ORDERBOOK.json";
@Override
public void invoke(String value) {
AccessTokenProvider provider = new ClientCredsTokenProvider(authTokenEndpoint, clientId, clientKey);
ADLStoreClient client = ADLStoreClient.createClient(accountFQDN, provider);
try {
client.setPermission(filename, "744");
ADLFileOutputStream stream = client.getAppendStream(filename);
System.out.println(value);
stream.write(value.toString().getBytes());
stream.close();
} catch (ADLException e) {
System.out.println(e.requestId);
} catch (Exception e) {
System.out.println(e.getMessage());
System.out.println(e.getCause());
}
}
}
I am continuously trying to append to a file in Azure Data Lake Store using a while loop, but sometimes it gives this error, Operation APPEND failed with HTTP500, right at the start or sometimes after 10 minutes. I am using Java.

Anubhav, Azure Data Lake streams are single-writer streams - i.e., you cannot write to the same stream from multiple threads, unless you do some form of synchronization between these threads. This is because each write specifies the offset it is writing to, and with multiple threads, the offsets are not consistent.
You seem to be writing from multiple threads (the .setParallelism(128) call in your code).
In your case, you have two choices:
Write to a different file in each thread (see the sketch below). I do not know your use case, but we have found that in a lot of cases that is the natural use of different threads - to write to different files.
If it is important to have all the threads write to the same file, then you will need to refactor the sink a little bit so that all the instances have reference to the same ADLFileOutputStream, and you will need to make sure the calls to write() and close() are synchronized.
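For illustration, a minimal sketch of the first option, assuming the same client-credential setup as in the question (the class name and file-name pattern are just examples): each parallel sink instance derives its own file from its subtask index, so no two writers ever share a stream.
import java.nio.charset.StandardCharsets;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import com.microsoft.azure.datalake.store.ADLFileOutputStream;
import com.microsoft.azure.datalake.store.ADLStoreClient;
import com.microsoft.azure.datalake.store.oauth2.AccessTokenProvider;
import com.microsoft.azure.datalake.store.oauth2.ClientCredsTokenProvider;
class PerSubtaskADLSink extends RichSinkFunction<String> {
    private final String clientId = "***********";
    private final String authTokenEndpoint = "***************";
    private final String clientKey = "*****************";
    private final String accountFQDN = "****************";
    private transient ADLStoreClient client;
    private transient String taskFilename;
    @Override
    public void open(Configuration parameters) throws Exception {
        // create the client once per sink instance instead of once per record
        AccessTokenProvider provider = new ClientCredsTokenProvider(authTokenEndpoint, clientId, clientKey);
        client = ADLStoreClient.createClient(accountFQDN, provider);
        // one file per parallel subtask, e.g. ORDERBOOK-0.json, ORDERBOOK-1.json, ...
        int subtask = getRuntimeContext().getIndexOfThisSubtask();
        taskFilename = "/Bitfinex/ETHBTC/ORDERBOOK/ORDERBOOK-" + subtask + ".json";
    }
    @Override
    public void invoke(String value) throws Exception {
        // only this subtask ever appends to its own file, so there is no writer conflict
        try (ADLFileOutputStream out = client.getAppendStream(taskFilename)) {
            out.write(value.getBytes(StandardCharsets.UTF_8));
        }
    }
}
If a single output file is really required, the second option applies instead: reduce the parallelism or funnel all records through one writer, and synchronize the write() and close() calls on the shared stream.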
Now, there is one more issue here - the error you got should have been an HTTP 4xx error (indicating a lease conflict, since ADLFileOutputStream acquires a lease), rather than HTTP 500, which says there was a server-side problem. To troubleshoot that, I will need to know your account name and time of access. That info is not safe to share on StackOverflow, so please open a support ticket for that and reference this SO question, so the issue eventually gets routed to me.

Related

Spring reactive file integration

I am trying to use spring-integration-file to poll a directory and create a reactive stream from files placed in this directory. This is working for the most part, but when I place a file but have no subscriber in place I get an exception. To demonstrate the problem I have written a small demo application:
import org.reactivestreams.Publisher;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.http.MediaType;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.dsl.Pollers;
import org.springframework.integration.file.dsl.Files;
import org.springframework.messaging.Message;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;
import java.io.File;
@SpringBootApplication
@RestController
public class DemoApplication {
public static void main(String[] args) {
SpringApplication.run(DemoApplication.class, args);
}
@Bean
public Publisher<Message<File>> reactiveSource() {
return IntegrationFlows
.from(Files.inboundAdapter(new File("."))
.patternFilter("*.csv"),
e -> e.poller(Pollers.fixedDelay(1000)))
.channel("processFileChannel")
.toReactivePublisher();
}
@GetMapping(value = "/files", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<String> files() {
return Flux.from(reactiveSource())
.map(message -> message.getPayload().getAbsolutePath());
}
}
So if I now do a curl to localhost:8080/files and then place a csv file in the root directory of the project, everything is fine; I see the path of the file as the response to my curl. But when I don't do a curl and then place a file in the root directory, I get the following exception:
java.lang.IllegalStateException: The [bean 'reactiveSource.channel#0'; defined in: 'com.example.demo.DemoApplication';
from source: 'bean method reactiveSource'] doesn't have subscribers to accept messages
at org.springframework.util.Assert.state(Assert.java:97)
at org.springframework.integration.channel.FluxMessageChannel.doSend(FluxMessageChannel.java:61)
at org.springframework.integration.channel.AbstractMessageChannel.send(AbstractMessageChannel.java:570)
... 38 more
I thought one of the attributes of reactive streams was that when there was no subscriber the stream would not start due to the stream being lazy. But apparently this is not the case. Could someone explain what I would need to do to have the stream not start if there is no subscriber?
If you use one of the latest versions, then you can use a FluxMessageChannel instead of that DirectChannel for the "processFileChannel". This way the SourcePollingChannelAdapter becomes reactive, and the source is indeed not polled until a subscription happens on that FluxMessageChannel.
You then create a Flux in your files() API from this FluxMessageChannel - no need for the .toReactivePublisher().
See more in docs: https://docs.spring.io/spring-integration/docs/current/reference/html/reactive-streams.html#source-polling-channel-adapter
The point is that .toReactivePublisher() just exposes the integration flow as a Publisher exactly at that point. Everything before that point runs in the regular, imperative way and works independently of the downstream logic.
UPDATE
Something like this:
@Bean
FluxMessageChannel filesChannel() {
return new FluxMessageChannel();
}
@Bean
public IntegrationFlow reactiveSource() {
return IntegrationFlows
.from(Files.inboundAdapter(new File("."))
.patternFilter("*.csv"),
e -> e.poller(Pollers.fixedDelay(1000)))
.channel(filesChannel())
.get();
}
@GetMapping(value = "/files", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<String> files() {
return Flux.from(filesChannel())
.map(message -> ((File) message.getPayload()).getAbsolutePath());
}

Cannot get AppInsights working under Spring Boot

I followed the https://learn.microsoft.com/en-us/azure/application-insights/app-insights-java-get-started, but still without success.
I have applicationinsights-web dependency in place via maven
I added ApplicationInsights.xml to main/resources with hardcoded instrumentation key and even with <SDKLogger /> inside
I added the scan path: @ComponentScan({...., "com.microsoft.applicationinsights.web.spring"})
Results:
I see no logs about looking up the configuration file, even if I introduce a syntax error in it or remove it completely
in debug I see that RequestNameHandlerInterceptorAdapter is instantiated via com.microsoft.applicationinsights.web.spring.internal.InterceptorRegistry, and during calls the preHandle method is called, but ThreadContext.getRequestTelemetryContext() always returns null and nothing more happens
It looks like I am missing something obvious, but I have no idea what. Which parts/classes are responsible for loading the configuration file?
I was a little bit confused by the documentation. As mentioned by yonisha, the filter does the whole magic. The following configuration class takes care of creating and adding the filter in a Spring Boot application.
import com.microsoft.applicationinsights.web.internal.WebRequestTrackingFilter;
import org.springframework.boot.web.servlet.FilterRegistrationBean;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import javax.servlet.Filter;
@Configuration
@ComponentScan("com.microsoft.applicationinsights.web.spring")
public class ApplicationInsightsConfiguration {
@Bean
public FilterRegistrationBean someFilterRegistration() {
FilterRegistrationBean registration = new FilterRegistrationBean();
registration.setFilter(appInsightsWebRequestTrackingFilter());
registration.addUrlPatterns("/*");
registration.setName("webRequestTrackingFilter");
registration.setOrder(1);
return registration;
}
@Bean(name = "appInsightsWebRequestTrackingFilter")
public Filter appInsightsWebRequestTrackingFilter() {
return new WebRequestTrackingFilter();
}
}
Important: It will work nicely if you set the server.context-path property to some value. If not, AI initialization will fail with the error:
AI: ERROR 03-04-2017 14:11, 20: WebApp name is not found, unable to register WebApp
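For example, one line in application.properties such as server.context-path=/myapp (the value here is only an illustration) avoids this error, since the WebApp name appears to be derived from the context path.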
In order to keep the servlet context path empty, I had to implement wrappers for the filter and two other classes to override it, but it was a very dirty fix... It would be great if the name could be passed as a parameter to the filter, but that is not yet possible (https://github.com/Microsoft/ApplicationInsights-Java/issues/359).
In Spring Boot, we need to configure the WebRequestTrackingFilter by extending WebSecurityConfigurerAdapter and overriding configure(HttpSecurity httpSecurity):
@Bean
public WebRequestTrackingFilter applicationInsightsFilterBean() throws Exception {
WebRequestTrackingFilter webRequestTrackingFilter = new WebRequestTrackingFilter();
return webRequestTrackingFilter;
}
@Override
public void configure(HttpSecurity httpSecurity) throws Exception {
//other stuff...
httpSecurity.addFilterBefore(applicationInsightsFilterBean(), UsernamePasswordAuthenticationFilter.class);
}
You also need the following in place (a combined sketch follows below):
the applicationinsights-web dependency via Maven
ApplicationInsights.xml added to main/resources
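For reference, a minimal sketch of how those fragments fit together in one configuration class, assuming Spring Security is on the classpath; the class and package names here are illustrative, not from the original answer.
import com.microsoft.applicationinsights.web.internal.WebRequestTrackingFilter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
import org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter;
import org.springframework.security.web.authentication.UsernamePasswordAuthenticationFilter;
@Configuration
@EnableWebSecurity
public class AppInsightsSecurityConfig extends WebSecurityConfigurerAdapter {
    // expose the Application Insights tracking filter as a bean
    @Bean
    public WebRequestTrackingFilter applicationInsightsFilterBean() {
        return new WebRequestTrackingFilter();
    }
    @Override
    protected void configure(HttpSecurity httpSecurity) throws Exception {
        // other security configuration goes here...
        // register the AI filter ahead of the authentication filter
        httpSecurity.addFilterBefore(applicationInsightsFilterBean(), UsernamePasswordAuthenticationFilter.class);
    }
}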
Here is a newer guide for Spring Boot Application Insights integration that worked well for me just now:
https://github.com/AzureCAT-GSI/DevCamp/tree/master/HOL/java/06-appinsights
The idea is basically what Tomasz has above with some minor differences.
package devCamp.WebApp.configurations;
import javax.servlet.Filter;
import org.springframework.boot.context.embedded.FilterRegistrationBean;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import com.microsoft.applicationinsights.TelemetryConfiguration;
import com.microsoft.applicationinsights.web.internal.WebRequestTrackingFilter;
@Configuration
public class AppInsightsConfig {
@Bean
public String telemetryConfig() {
String telemetryKey = System.getenv("APPLICATION_INSIGHTS_IKEY");
if (telemetryKey != null) {
TelemetryConfiguration.getActive().setInstrumentationKey(telemetryKey);
}
return telemetryKey;
}
@Bean
public FilterRegistrationBean aiFilterRegistration() {
FilterRegistrationBean registration = new FilterRegistrationBean();
registration.setFilter(new WebRequestTrackingFilter());
registration.addUrlPatterns("/**");
registration.setOrder(1);
return registration;
}
@Bean(name = "WebRequestTrackingFilter")
public Filter WebRequestTrackingFilter() {
return new WebRequestTrackingFilter();
}
}
The guide at the link above has a full set of instructions and includes client side js and a java log appender example as well. Hope this helps.
All of the above methods work! However, you can try the whole new seamless experience using the Application Insights Spring Boot Starter.
https://github.com/Microsoft/ApplicationInsights-Java/blob/master/azure-application-insights-spring-boot-starter/README.md
This is currently in BETA.

Is a MapReduce job using multithreading?

I am curious about whether a MapReduce job uses multithreading on a single machine. For example, I have 10 servers in the Hadoop cluster; by default, if the input file is large enough, there will be 10 mappers. Is the single mapper using multiple threads on a single machine?
Is the single mapper using multiple threads on a single machine?
YES. A MapReduce job can use a multithreaded mapper (multiple threads, or a thread pool, running the map method).
I have used it for better CPU utilization in map-only HBase jobs...
MultithreadedMapper is a good fit if your operation is highly CPU intensive and could increase the speed.
Your mapper class should extend org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper instead of the regular org.apache.hadoop.mapreduce.Mapper.
MultithreadedMapper has a different implementation of the run() method, like below:
run(org.apache.hadoop.mapreduce.Mapper.Context context)
Run the application's maps using a thread pool.
You can set the number of threads within a mapper by calling MultithreadedMapper.setNumberOfThreads(n), or you can set the property mapred.map.multithreadedrunner.threads = n when loading from a property file,
and use the setter method above (on a per-job basis) to control jobs which are less CPU intensive.
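For illustration, the property-based route is roughly a one-liner on the Job's configuration; note that mapred.map.multithreadedrunner.threads is the older key, which Hadoop 2.x deprecates in favor of mapreduce.mapper.multithreadedmapper.threads.
// roughly equivalent to MultithreadedMapper.setNumberOfThreads(job, 10)
job.getConfiguration().setInt("mapred.map.multithreadedrunner.threads", 10);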
You can see the effect of doing this in the MapReduce counters, especially the CPU-related counters.
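For example, after job.waitForCompletion(true) the CPU time can be read programmatically (a hedged snippet; TaskCounter is org.apache.hadoop.mapreduce.TaskCounter in Hadoop 2.x):
// total CPU milliseconds spent by all tasks of the job
long cpuMillis = job.getCounters().findCounter(TaskCounter.CPU_MILLISECONDS).getValue();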
Example Code snippet of MultithreadedMapper implementation:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import java.io.IOException;
import java.util.regex.Pattern;
public class MultithreadedWordCount {
// class should be thread safe
public static class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
public static enum PREPOST { SETUP, CLEANUP }
@Override
protected void setup(Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws java.io.IOException, java.lang.InterruptedException {
// will be called several times
context.getCounter(PREPOST.SETUP).increment(1);
}
@Override
protected void map(LongWritable key, Text value,
Context context) throws IOException, InterruptedException {
String[] words = value.toString().toLowerCase().split("[\\p{Blank}[\\p{Punct}]]+");
for (String word : words) {
context.write(new Text(word), new LongWritable(1));
}
}
@Override
protected void cleanup(Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws java.io.IOException, InterruptedException {
// will be called several times
context.getCounter(PREPOST.CLEANUP).increment(1);
}
}
public static class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
@Override
protected void reduce(Text key, Iterable<LongWritable> values, Context context
) throws IOException, InterruptedException {
long sum = 0;
for (LongWritable value: values) {
sum += value.get();
}
context.write(key, new LongWritable(sum));
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Job job = new Job();
job.setJarByClass(MultithreadedWordCount.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
MultithreadedMapper.setMapperClass(job, MultithreadedWordCount.WordCountMapper.class);
MultithreadedMapper.setNumberOfThreads(job, 10);
job.setMapperClass(MultithreadedMapper.class);
job.setCombinerClass(MultithreadedWordCount.WordCountReducer.class);
job.setReducerClass(MultithreadedWordCount.WordCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
/* begin defaults */
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
/* end defaults */
job.waitForCompletion(true);
}
}
Please refer to https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/mapreduce/Mapper.html:
Applications may override the run(Context) method to exert greater control on map processing e.g. multi-threaded Mappers etc.
Moreover, there is also a MultithreadedMapper. I have never used this, though.

Spring Batch thread-safe Map job repository

The Spring Batch docs say of the Map-backed job repository:
Note that the in-memory repository is volatile and so does not allow restart between JVM instances. It also cannot guarantee that two job instances with the same parameters are launched simultaneously, and is not suitable for use in a multi-threaded Job, or a locally partitioned Step. So use the database version of the repository wherever you need those features.
I would like to use a Map job repository, and I do not care about restarting, prevention of concurrent job executions, etc. but I do care about being able to use multi-threading and local partitioning.
My batch application has some partitioned steps, and at first glance it seems to run just fine with a Map-backed job repository.
What is the reason it is said to be not possible with MapJobRepositoryFactoryBean? Looking at the implementation of the Map DAOs, they use ConcurrentHashMap. Is this not thread-safe?
I would advise you to follow the documentation rather than relying on implementation details. Even if the maps are individually thread-safe, there might be race conditions in changes that involve more than one of these maps.
You can use an in-memory database very easily. Example:
@Grapes([
@Grab('org.springframework:spring-jdbc:4.0.5.RELEASE'),
@Grab('com.h2database:h2:1.3.175'),
@Grab('org.springframework.batch:spring-batch-core:3.0.6.RELEASE'),
// must be passed with -cp, for whatever reason the GroovyClassLoader
// is not used for com.thoughtworks.xstream.io.json.JettisonMappedXmlDriver
//@Grab('org.codehaus.jettison:jettison:1.2'),
])
import org.h2.jdbcx.JdbcDataSource
import org.springframework.batch.core.Job
import org.springframework.batch.core.JobParameters
import org.springframework.batch.core.Step
import org.springframework.batch.core.StepContribution
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory
import org.springframework.batch.core.launch.JobLauncher
import org.springframework.batch.core.scope.context.ChunkContext
import org.springframework.batch.core.step.tasklet.Tasklet
import org.springframework.batch.repeat.RepeatStatus
import org.springframework.beans.factory.annotation.Autowired
import org.springframework.context.annotation.AnnotationConfigApplicationContext
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import org.springframework.core.io.ResourceLoader
import org.springframework.jdbc.datasource.init.DatabasePopulatorUtils
import org.springframework.jdbc.datasource.init.ResourceDatabasePopulator
import javax.annotation.PostConstruct
import javax.sql.DataSource
@Configuration
@EnableBatchProcessing
class AppConfig {
@Autowired
private JobBuilderFactory jobs
@Autowired
private StepBuilderFactory steps
@Bean
public Job job() {
return jobs.get("myJob").start(step1()).build()
}
@Bean
Step step1() {
this.steps.get('step1')
.tasklet(new MyTasklet())
.build()
}
@Bean
DataSource dataSource() {
new JdbcDataSource().with {
url = 'jdbc:h2:mem:temp_db;DB_CLOSE_DELAY=-1'
user = 'sa'
password = 'sa'
it
}
}
@Bean
BatchSchemaPopulator batchSchemaPopulator() {
new BatchSchemaPopulator()
}
}
class BatchSchemaPopulator {
@Autowired
ResourceLoader resourceLoader
@Autowired
DataSource dataSource
@PostConstruct
void init() {
def populator = new ResourceDatabasePopulator()
populator.addScript(
resourceLoader.getResource(
'classpath:/org/springframework/batch/core/schema-h2.sql'))
DatabasePopulatorUtils.execute populator, dataSource
}
}
class MyTasklet implements Tasklet {
@Override
RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
println 'TEST!'
return RepeatStatus.FINISHED
}
}
def ctx = new AnnotationConfigApplicationContext(AppConfig)
def launcher = ctx.getBean(JobLauncher)
def jobExecution = launcher.run(ctx.getBean(Job), new JobParameters([:]))
println "Status is: ${jobExecution.status}"

How to implement referential integrity and cascading actions in liferay service xml file?

I have a requirement to implement onDelete Cascade functionality with Liferay Service Builder. How can I achieve this in Liferay?
First of all:
onDelete Cascade is NOT a Liferay Service Builder functionality. It is a functionality provided by your database.
Next:
Liferay works on the premise that all data processing and evaluation should be done in the code, and NOT in the database.
Having said that:
Something similar to onDelete Cascade would be to implement a ModelListener. A ModelListener listens for changes to a model. (I know, misleading name ;) ) In this model listener you would implement onAfterRemove, and in onAfterRemove goes your code for deleting the related records.
Here is a small sample I have written. The code listens for changes to the Group object and tries to delete the referenced ObjectGeodata object, if one is present.
package de.osc.geodata.modellistener;
import com.liferay.portal.ModelListenerException;
import com.liferay.portal.kernel.exception.PortalException;
import com.liferay.portal.kernel.exception.SystemException;
import com.liferay.portal.kernel.log.Log;
import com.liferay.portal.kernel.log.LogFactoryUtil;
import com.liferay.portal.kernel.search.IndexerRegistryUtil;
import com.liferay.portal.kernel.search.SearchException;
import com.liferay.portal.model.BaseModelListener;
import com.liferay.portal.model.Group;
import com.liferay.portal.service.ServiceContext;
import de.osc.geodata.model.ObjectGeodata;
import de.osc.geodata.service.ObjectGeodataLocalServiceUtil;
public class GroupModelListener extends BaseModelListener<Group> {
private static final Log log = LogFactoryUtil.getLog(GroupModelListener.class);
@Override
public void onAfterRemove(Group model) throws ModelListenerException {
try {
ObjectGeodata objectGeodata = ObjectGeodataLocalServiceUtil.getObjectGeodata(model.getClassNameId(), model.getClassPK());
if (objectGeodata != null) {
ObjectGeodataLocalServiceUtil.deleteObjectGeodata(objectGeodata);
}
} catch (SystemException e) {
log.warn("No GeodataObject found in Index.", e);
}
super.onAfterRemove(model);
}
}
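Note that the listener also has to be registered so the portal actually invokes it. In a Liferay 6.x hook this is typically done with a value.object.listener entry in the hook's portal.properties, keyed by the fully qualified model class, for example:
value.object.listener.com.liferay.portal.model.Group=de.osc.geodata.modellistener.GroupModelListener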
