Performing a distributed search through spark-solr - apache-spark

I'm using spark-solr to perform Solr queries. However, my searches don't work as they should, because the requests generated by Spark prevent the searches from being distributed. I discovered this by looking at the Solr logs, where I saw that a distrib=false parameter is added to the requests. When I execute the queries manually (without Spark) with distrib=true, the results are fine.
I tried to set the parameters sent by Spark by changing the "solr.params" value in the options dictionary (I'm using pyspark):
options = {
    "collection": "collection_name",
    "zkhost": "server:port",
    "solr.params": "distrib=true"
}
spark.read.format("solr").options(**options).load().show()
This change did not have any effect: I still see in the logs that a distrib=false parameter is being sent. Other parameters passed through the "solr.params" key (such as fq=something) do have an effect on the results. But it looks like spark insists on sending distrib=false no matter what I do.
How do I force a distributed search through spark-solr?

The easy solution is to configure the request handler to run distributed queries using an invariant. The invariant forces the distrib parameter to be true even if spark-solr tries to change it at query time. To introduce the invariant, add the following lines under the definition of your request handler in solrconfig.xml:
<lst name="invariants">
    <str name="distrib">true</str>
</lst>
While the invariant fixes the problem, I think it's a rather radical solution, because it silently overrides the value of a parameter. Once the invariant is in place you can no longer set distrib to false: even if your request explicitly does so, the value of distrib will still be true. This is too risky in my opinion, which is why I'm suggesting another solution that is harder to implement but doesn't suffer from that flaw.
The solution is to implement a search component that forces distrib=true only when it receives a forceDistrib=true flag as a parameter.
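import java.io.IOException;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
public class ForceDistribComponent extends SearchComponent {
    private static final String FORCE_DISTRIB_PARAM = "forceDistrib";
    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        ModifiableSolrParams params = new ModifiableSolrParams(rb.req.getParams());
        // Do nothing unless the request explicitly asks for a forced distributed search.
        if (!params.getBool(FORCE_DISTRIB_PARAM, false)) return;
        params.set(CommonParams.DISTRIB, true);
        params.set(FORCE_DISTRIB_PARAM, false);
        rb.req.setParams(params);
    }
    @Override
    public void process(ResponseBuilder rb) throws IOException { /* nothing to do */ }
    @Override
    public String getDescription() { return "Forces distrib=true when forceDistrib=true"; }
}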
After building the component, configure Solr to use it by registering the component in solrconfig.xml and setting your request handler to use it.
To register the component, add the following entry to solrconfig.xml:
<searchComponent name="forceDistrib" class="ForceDistribComponent"/>
Configuring the request handler to use the forceDistrib component is done by adding it to the list of components under the request handler entry. It must be the first component in the list:
<arr name="components">
    <str>forceDistrib</str>
    <str>query</str>
    ...
</arr>
This solution, while more involved than simply introducing an invariant, is much safer.

Related

How to get the query string when CQLStatement, QueryState and QueryOptions are given

Cassandra has the org.apache.cassandra.cql3.QueryHandler interface, which provides APIs to handle external queries from clients.
Below is the API that handles prepared statements:
public ResultMessage processPrepared(CQLStatement statement, QueryState state, QueryOptions options) throws RequestExecutionException, RequestValidationException;
I want to log the query string and the values passed to it, given a CQLStatement, QueryState and QueryOptions. How can I get them?
I believe someone who has worked on the Cassandra code can help me out with this.
This would be very difficult in 2.1. Newer versions, which needed this for logging, just recreate the query string as well as possible. You can see how in the ReadCommand implementations: there's a name() or toCQLString() used in things like slow query logging. You could backport this and the two implementations of appendCQLWhereClause to do something similar, and then build one for modification statements.
In getPrepared() you can get the rawCQLStatement from the ParsedStatement.Prepared and stash it in a thread local.
Alternatively, you may want to consider using a custom implementation of tracing (example), or using triggers and building a mutation logger.
Do the following:
1. Create a class that implements the QueryHandler interface and make Cassandra aware of it.
2. In that class, maintain a list of the queries (add to it when the prepare method is called) and look up the current query in that list when getPrepared is called; you can find it using the MD5Digest id.
3. When processPrepared is called, replace each ? in the query string with the values from the QueryOptions, obtained via options.getValues().
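The bookkeeping described in these steps can be sketched in plain Java, independent of Cassandra's classes. Everything here is hypothetical: PreparedQueryLog, remember, describe and bind are illustrative names, the statement id is a plain String standing in for the MD5Digest, and the bound values are assumed to be already converted to CQL literals (in Cassandra, options.getValues() returns raw buffers that would first have to be decoded according to the column types).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper for logging prepared queries; not part of Cassandra.
final class PreparedQueryLog {
    // Statement id (as text) -> original CQL, filled in when a statement is prepared.
    private final Map<String, String> queriesById = new ConcurrentHashMap<>();

    // Call this from the prepare step.
    void remember(String statementId, String cql) {
        queriesById.put(statementId, cql);
    }

    // Call this from the processPrepared step; returns null for an unknown id.
    String describe(String statementId, List<String> values) {
        String cql = queriesById.get(statementId);
        return cql == null ? null : bind(cql, values);
    }

    // Replaces each '?' outside a quoted string with the next value, in order.
    // Markers left over once the values run out are kept as '?'.
    static String bind(String cql, List<String> values) {
        StringBuilder out = new StringBuilder();
        int next = 0;
        boolean inString = false;
        for (int i = 0; i < cql.length(); i++) {
            char c = cql.charAt(i);
            if (c == '\'') {
                inString = !inString; // a doubled quote ('') toggles twice, so state is preserved
            }
            if (c == '?' && !inString && next < values.size()) {
                out.append(values.get(next++));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        PreparedQueryLog log = new PreparedQueryLog();
        log.remember("a1b2", "SELECT * FROM users WHERE id = ? AND name = ?");
        System.out.println(log.describe("a1b2", Arrays.asList("42", "'alice'")));
    }
}
```

The result is meant for logging only; it is a best-effort rendering and should not be sent back to the database as a query.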
HTH

Using Spring Cache in read-only mode: how to make the Spring Redis cache read-only

I use Spring Cache with Redis in two apps: one reads and writes, the other only reads. How can I configure this?
I tried the following, but it does not work!
@Cacheable(value = "books", key = "#isbn", condition = "false")
Can anyone help?
You have misunderstood the purpose of the @Cacheable annotation's "condition" attribute. Per the documentation...
If true, the method is cached - if not, it behaves as if the method is
not cached, that is executed every single time no matter what values
are in the cache or what arguments are used.
The condition attribute just determines whether the cache (e.g. Redis) is consulted first, before executing the (potentially expensive) method. If condition evaluates to false, then the method will always be executed and the result subsequently cached.
In the read-only app, I am assuming you want the cache consulted first, if the value is not in the cache, then execute the method, however, DO NOT cache the result. Is this correct?
If so, then you only need specify the unless attribute instead of the condition attribute like so...
@Cacheable(value="books", key="#isbn", unless="true")
void someBookMutatingOperation(String isbn, ...) { .. }
If, however, you want to avoid the cacheable method invocation in the read-only (version of the) app altogether and just consult the cache regardless of whether a value actually exists in the cache or not, then your problem is quite a bit more complex/difficult.
Spring's Cache Abstraction operates on the premise that if a value is not in the cache then it will return null to indicate a cache miss, which is then followed by a subsequent method invocation. Only when a cache returns a value for the specified key(s) will the method invocation be avoided.
Without a custom extension (perhaps using (additional) AOP interceptors) there is no way to avoid the OOTB behavior.
I will not elaborate on this latter technique unless your use case requires it.
Hope this helps.
@John Blum
Thanks! Happy new year.
Your answer inspired me, and I have read part of the Spring Cache source code: the CacheInterceptor class and the CacheAspectSupport class.
private Object execute(CacheOperationInvoker invoker, CacheOperationContexts contexts) {
    // Process any early evictions
    processCacheEvicts(contexts.get(CacheEvictOperation.class), true, ExpressionEvaluator.NO_RESULT);

    // Check if we have a cached item matching the conditions
    Cache.ValueWrapper cacheHit = findCachedItem(contexts.get(CacheableOperation.class));

    // Collect puts from any @Cacheable miss, if no cached item is found
    List<CachePutRequest> cachePutRequests = new LinkedList<CachePutRequest>();
    if (cacheHit == null) {
        collectPutRequests(contexts.get(CacheableOperation.class), ExpressionEvaluator.NO_RESULT, cachePutRequests);
    }

    Cache.ValueWrapper result = null;

    // If there are no put requests, just use the cache hit
    if (cachePutRequests.isEmpty() && !hasCachePut(contexts)) {
        result = cacheHit;
    }

    // Invoke the method if don't have a cache hit
    if (result == null) {
        result = new SimpleValueWrapper(invokeOperation(invoker));
    }

    // Collect any explicit @CachePuts
    collectPutRequests(contexts.get(CachePutOperation.class), result.get(), cachePutRequests);

    // Process any collected put requests, either from @CachePut or a @Cacheable miss
    for (CachePutRequest cachePutRequest : cachePutRequests) {
        cachePutRequest.apply(result.get());
    }

    // Process any late evictions
    processCacheEvicts(contexts.get(CacheEvictOperation.class), false, result.get());

    return result.get();
}
I think I should prevent the cachePutRequest from executing: when there is no cache hit, invoke the body of the @Cacheable method but don't cache the result. Using unless would prevent the method invocation. Is this correct?
@Tonney Bing
First of all, my apologies for misguiding you in my previous answer...
If condition evaluates to false, then the method will always be
executed and the result subsequently cached.
The last part is NOT true. In fact, the condition attribute does prevent the @Cacheable method result from being cached. But neither the condition nor the unless attribute prevents the @Cacheable service method from being invoked.
Also, my code example above was not correct. The unless attribute needs to be set to true to prevent caching of the #Cacheable method result.
After re-reading this section in the Spring Reference Guide, I came to realize my mistake and wrote an example test class to verify Spring's "conditional" caching behavior.
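The corrected semantics can be illustrated with a small plain-Java model. This is only a sketch of the behavior described here, not Spring's implementation: ConditionalCacheModel and its call method are hypothetical names, and the condition and unless expressions are reduced to already-evaluated booleans.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical, simplified model of @Cacheable with condition/unless; not Spring code.
final class ConditionalCacheModel {
    final Map<String, Object> cache = new HashMap<>();
    int invocations = 0;

    Object call(String key, boolean condition, boolean unless, Supplier<Object> method) {
        // The cache is only consulted when the condition holds.
        if (condition && cache.containsKey(key)) {
            return cache.get(key); // cache hit: the method is not invoked
        }
        // On a miss (or when the condition is false) the method always runs.
        invocations++;
        Object result = method.get();
        // The result is stored only if the condition holds and unless does not veto it.
        if (condition && !unless) {
            cache.put(key, result);
        }
        return result;
    }

    public static void main(String[] args) {
        ConditionalCacheModel model = new ConditionalCacheModel();
        model.call("isbn-1", true, true, () -> "book");   // unless = true: invoked, not cached
        model.call("isbn-1", false, false, () -> "book"); // condition = false: invoked, not cached
        model.call("isbn-1", true, false, () -> "book");  // invoked and cached
        model.call("isbn-1", true, false, () -> "book");  // served from the cache
        System.out.println("invocations=" + model.invocations + ", cached=" + model.cache.keySet());
    }
}
```

In this model neither flag ever skips the method on a cache miss, which is why suppressing the invocation entirely needs something outside the two attributes.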
So...
With respect to your business use case, as I understand it from your original question and your response to my previous answer, you have a @Cacheable service method whose invocation must be suppressed in the read-only app, regardless of whether the value is in the cache or not! In other words, the value should always be retrieved from the cache and the @Cacheable service method should NOT be invoked in read-only mode.
Now to avoid polluting your application code with Spring infrastructure component references, and specifically, with a Spring CacheManager, this is a good example of a "cross-cutting concern" (since multiple, mutating-based application service operations may exist) and therefore, can be handled appropriately using AOP.
I have coded such an example satisfying your requirements here.
This is a self-contained test class. The key characteristics of this test class include...
The use of external configuration (by way of the app.mode.read-only System property) to determine if the app is in read-only mode.
The use of AOP and a custom Aspect to control whether the subsequent invocation of the Join Point (i.e. the @Cacheable service method) is allowed (no, in a read-only context). In addition, I appropriately set the order in which the Advice (namely, the @Cacheable based advice along with the handleReadOnlyMode advice in the UseCacheExclusivelyInReadOnlyModeAspect Aspect) should fire based on precedence.
Take note of the @Cacheable annotation on the service method...
@Cacheable(value = "Factorials", unless = "T(java.lang.System).getProperty('app.mode.read-only', 'false')")
public Long factorial(long number) { .. }
You can see the intended behavior with the System.err output statements in the test class.
Hope this helps!

Problems de-serializing System.Security.Claims.Claim

I'm implementing an OAuth server and need to store refresh tokens. To do this I have (for the moment) chosen to serialize the tokens into JSON.
While I can see that the JSON includes everything that would be needed to rehydrate, when I de-serialize with token.FromJson() the embedded claims are not being reconstructed correctly.
So far I've considered inheriting from JsonConverter to create a claims converter but don't see a way of adjusting the global JsConfig to utilise it :(
Can anyone point me in a good direction?
So...
Walking away from the code and returning did the trick!
Instead of using a JsonConverter, you need to use the generic version of JsConfig when changing/overriding ServiceStack's behaviour for a specific class. For example, just put the following in your service's start-up code:
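JsConfig<Claim>.SerializeFn = claim => string.Format("{0}|{1}", claim.Type, claim.Value);
JsConfig<Claim>.DeSerializeFn = claimDetails =>
{
    // Split on the first '|' only, so a '|' inside the claim value is preserved
    // (this assumes the claim type itself never contains '|').
    var values = claimDetails.Split(new[] { '|' }, 2);
    return new Claim(values[0], values[1]);
};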

Design: Spring Integration jdbc best practice

After using Spring Integration in a project, my observation is that a JDBC adapter or gateway should be used only at the start or end of a flow. If we use them in the middle of a flow, it becomes too verbose and complex.
For example:
<jdbc:outbound-gateway
    query="select * from foo where
        c1=:headers[c1] AND
        c2=:headers[c2] AND
        c3=:headers[c3] AND
        c4=:headers[c4]"
    row-mapper="fooMapper" data-source="myDataSource" max-rows-per-poll="100000" />
<int:service-activator ref="serviceActivator" method="processFoo" />
In the above <jdbc:outbound-gateway>, we need to pass all the placeholders (c1, c2, c3, c4) in the header of Message. We need to look back and forth in java code and xml file for any change in where condition or when there are too many where clauses.
It is also error prone. For example, if we misspelled :headers[c1] to :headers[d1] then it will not throw any exception and replace :headers[d1] with null.
If query does not return any row then it will throw exception by default. So, we have to use requires-reply="false" to change default behaviour.
If we want to proceed when query does not return any value then we have to add advice to gateway, as shown below:
<jdbc:outbound-gateway ... >
    <jdbc:request-handler-advice-chain>
        <bean class="com.service.NullReplyAdvice" />
    </jdbc:request-handler-advice-chain>
</jdbc:outbound-gateway>
Please correct me if there are flaws in understanding of the concept.
We need to look back and forth in java code and xml file for any change in where condition or when there are too many where clauses.
That is true even for raw Java code around JDBC: if you change the model, you of course have to change the SELECT too, because it is just a String. That's why so much work has gone into making queries type-safe: ORM, QueryDSL, Spring Data, etc.
if we misspelled :headers[c1] to :headers[d1] then it will not throw any exception and replace :headers[d1] with null.
That's because the headers object is just a Map, and it's true that you get null if there is no such key in the map. To overcome the typo issue you can use a POJO payload with getters, or a custom header that is, again, a POJO with getters. In that case you end up with an exception saying there is no such property on the object. You'll still see the issue only at runtime, not at compile time. And again, the same holds for a Hashtable: only at runtime.
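The difference between the two lookups can be shown in plain Java, without Spring Integration. This is a rough sketch: HeaderLookupDemo, Criteria, fromMap and fromBean are hypothetical names, and the reflective getter lookup only stands in for the property access that the expression evaluation would perform.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo: a misspelled key in a Map yields null, a misspelled bean property fails loudly.
final class HeaderLookupDemo {
    // Hypothetical POJO carrying the query criteria.
    public static final class Criteria {
        private final String c1;
        public Criteria(String c1) { this.c1 = c1; }
        public String getC1() { return c1; }
    }

    // Map lookup: an unknown key silently returns null.
    static Object fromMap(Map<String, Object> headers, String key) {
        return headers.get(key);
    }

    // Getter lookup: an unknown property raises an exception at runtime.
    static Object fromBean(Object bean, String property) {
        String getter = "get" + Character.toUpperCase(property.charAt(0)) + property.substring(1);
        try {
            return bean.getClass().getMethod(getter).invoke(bean);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("No such property: " + property, e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> headers = new HashMap<>();
        headers.put("c1", "value");
        System.out.println(fromMap(headers, "d1")); // the typo goes unnoticed
        try {
            fromBean(new Criteria("value"), "d1");  // the typo is reported
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Either way the mistake only surfaces when the flow runs, which matches the point made about compile time.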
So, we have to use requires-reply="false" to change default behaviour.
You should decide this at design time: whether or not the component is allowed to return nothing.
The last idea is good. Would you mind sharing your NullReplyAdvice?
Actually, I achieve the same with a <filter> before the JDBC gateway, which uses a count(*) query to determine whether there is something to fetch. From there I can route my flow to different logic instead of the direct flow taken when the SELECT returns rows.
UPDATE
When you want to use a model object to keep business-specific values within the Message, it's enough to put this object in a header:
public class Foo {
    private String foo1;
    private String foo2;

    public String getFoo1() {
        return foo1;
    }

    public String getFoo2() {
        return foo2;
    }
}
...
MessageBuilder.withPayload(payload).setHeader("foo", foo).build();
...
<jdbc:outbound-gateway
    query="select * from foo where
        c1=:headers[foo].foo1 AND
        c2=:headers[foo].foo2"/>

How to perform a search on several entities with Symfony 2

I need to perform a search on several entities with the same string then order the results.
I've heard/read a little about FOSElasticaBundle. Would this bundle be able to do it? It seems (to me) to have almost too many features for this purpose, and I'm not sure it could run on a shared server (hostgator).
The other solution I can think of at the moment is doing the search "manually" (by using join and union), but I'm wondering where I should put such a function: in an existing controller, a new one, a new bundle or somewhere else?
I'm also worried that this manual solution could come at a cost, especially on some non-indexable fields.
You would use custom entity repositories. Check out the docs. Basically, a custom repository extends the default findAll, findOneBy, etc.
You would have a function like so:
use Doctrine\ORM\EntityRepository;

class MyEntityRepository extends EntityRepository {
    public function findByCustomRule() {
        // This builder is mapped to your entity (it automatically adds the select)
        $queryBuilder = $this->createQueryBuilder('someAlias');
        $queryBuilder->orderBy('...');

        // This builder can be mapped to any entity (you add the select yourself)
        $queryBuilder = $this->getEntityManager()->createQueryBuilder();
        $queryBuilder->select('...');

        // Execute the query and return the result
        $result = $queryBuilder->getQuery()->getResult();
        return $result;
    }
}
This class is declared in the Doctrine mapping and lives inside the Entity folder. Check out the docs and you should get the basic idea.

Resources