Nutch 2.3 + Gora + HBase: Strange Values Parsed

I have started playing around with Nutch 2.3 + Gora + HBase, but some of the results look strange in HBase. Is there a configuration or a simple step I might be missing?
status: value=\x00\x00\x00\x01
fetchTime: value=\x00\x00\x01L\x93\x92\x0F\x5C
prevFetchTime: value=\x00\x00\x01L\x91]\xF5\x1C
Expected values should be:
status: 4
fetchTime: 1426888912463 (i.e. in timestamp format)
prevFetchTime: 1424296904936 (i.e. in timestamp format)
Why am I getting these raw byte values instead of the expected ones?
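Those escaped strings look like big-endian binary integers as the HBase shell prints them, not corrupted data. A minimal sketch (pure Python, no HBase client needed) that decodes the bytes shown above; note the decoded numbers reflect whatever record those bytes came from, so they need not match the example values quoted above:

```python
# Decode the raw byte strings the HBase shell prints. Numeric fields
# are stored as big-endian binary, so the escaped output is the
# serialized form of the number, not a parsing error.
raw_status = b"\x00\x00\x00\x01"
raw_fetch_time = b"\x00\x00\x01\x4c\x93\x92\x0f\x5c"  # shell shows "\x00\x00\x01L\x93\x92\x0F\x5C"

status = int.from_bytes(raw_status, byteorder="big")
fetch_time_ms = int.from_bytes(raw_fetch_time, byteorder="big")

print(status)         # → 1
print(fetch_time_ms)  # → 1428404965212 (a millisecond epoch timestamp)
```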

Related

ADF copy activity failed while extracting data from DB2 - Issue found for a few records having special characters

I am doing a full extract from a table ABC. In the copy activity, I have given the query
select * from ABC
but I am facing an issue for a few rows (they contain special characters - Japanese and Korean):
Error code 2200
Failure type User configuration issue
Details Failure happened on 'Source' side. ErrorCode=DB2DriverRunFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error thrown from driver. Sql code: '-343',Source=Microsoft.DataTransfer.ClientLibrary.Db2Connector,''Type=Microsoft.HostIntegration.DrdaClient.DrdaException,Message=HISMPCB0001 In BasePrimitiveConverter an exception has occurred. Exception description: Output buffer is smaller than required size 12 SQLSTATE=HY000 SQLCODE=-343,Source=Microsoft.HostIntegration.Drda.Requester,'
The character which is causing the issue is '轎ᆃ '
The error description states that a BasePrimitiveConverter exception occurred, and that the output buffer is smaller than the required size. So, please try converting the column to a type that can hold these characters, such as GRAPHIC in DB2. Refer to the following link to understand more.
https://bytes.com/topic/db2/answers/488983-storing-some-japanese-data
Referring to these links, I understand that this error is likely due to the datatype of the source column, or the encoding used. Try the different encoding options available in your source dataset. Here is a different source with a similar problem of not being able to retrieve special characters:
https://learn.microsoft.com/en-us/answers/questions/467456/failure-happened-source-side-in-copy-activity-for.html
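To see why a buffer can end up "smaller than required size", it helps to remember that these characters take more bytes than ASCII, so a buffer sized by character count overflows. A small illustration (plain Python, not ADF or DB2 code) using the two problem characters quoted above:

```python
# Illustration: multi-byte characters need more storage than their
# character count suggests, which is how a fixed-size output buffer
# sized for single-byte data comes up short.
s = "轎ᆃ"
print(len(s))                      # 2 characters
print(len(s.encode("utf-8")))      # 6 bytes in UTF-8 (3 bytes each)
print(len(s.encode("utf-16-le")))  # 4 bytes in UTF-16
```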

Parsing a Postgres date in Logstash using the date filter plugin

I am using Logstash to read Postgres data with the jdbc input plugin and push it to Elasticsearch.
The data is coming through properly and things seem to be working fine, except for one small problem:
My logs table has a field requesttimestamp with a datatype of timestamp. Since there is historical data, and to ensure that the timelines are based on the data and not the time of the run, I am trying to set the value of @timestamp from requesttimestamp.
the filter configuration is as follows:
filter {
  date {
    match => ["requesttimestamp", "yyyy-MM-dd HH:mm:ss.SSS"]
  }
}
but while running, it tags the event with _dateparsefailure and uses the system time as @timestamp.
I also tried the ISO8601 format, but nothing seems to work.
I am sure it is something simple, but I am not able to put my finger on it.
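One thing worth checking (an assumption, not stated in the post) is whether the jdbc input already hands over requesttimestamp as a timestamp object rather than a string, since the date filter only parses string fields. As a sanity check on the pattern itself, the Joda format "yyyy-MM-dd HH:mm:ss.SSS" corresponds to the strptime format below; if a sample value from the database does not parse here, the date filter's match pattern will not parse it either (illustrative Python, not Logstash code; the sample value is hypothetical):

```python
# Verify that a sample requesttimestamp value actually matches the
# pattern given to the date filter: "yyyy-MM-dd HH:mm:ss.SSS".
from datetime import datetime

sample = "2015-03-20 14:01:52.463"  # hypothetical row value
parsed = datetime.strptime(sample, "%Y-%m-%d %H:%M:%S.%f")
print(parsed.year)  # → 2015
```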

Number of results not as it should be on boolean Solr query

I have a Solr instance running with about 200 entries in its database. I want to search for strings combined with OR, but fail to get a working query.
When running a simple query like q=fieldname:"string", I get 13 results. When running another query like q=fieldname:"otherstring", I get 18 results. In the end I would expect 27 results, because together there are 31 results and 4 of them are the same ones, as they contain both strings.
The problem comes when I want to search for both strings at once: it returns all kinds of results, but not the expected 27. I found a site describing how it should work and tried a couple of different things:
q=fieldname:"string otherstring" gives me 10
q=fieldname:"otherstring string" gives me 0
q=fieldname:"string otherstring"~1 gives me 10
q=fieldname:"otherstring string"~1 gives me 1
q=fieldname:"(string otherstring)" gives me 37 but some are not related at all
q=(+fieldname:"string" +fieldname:"otherstring") gives the same as above
I could go on, as I tried more of these combinations. Can anyone help me build a query with the correct number of results, or explain what I am doing wrong?
If you want to perform an OR query, use OR explicitly:
q=fieldname:"string" OR fieldname:"otherstring"
The other versions will give varying results depending on the value of q.op and the query parser in use.
q=fieldname:("string" OR "otherstring")
should be semantically identical.
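A minimal sketch of issuing the explicit-OR query against Solr's /select handler over HTTP (the host and core name "mycore" are placeholders; the field name is taken from the question). With the counts above, numFound should come back as 13 + 18 - 4 = 27:

```python
# Build a URL for the explicit-OR Solr query. urlencode takes care of
# escaping the quotes and spaces in the q parameter.
from urllib.parse import urlencode

params = urlencode({
    "q": 'fieldname:"string" OR fieldname:"otherstring"',
    "wt": "json",
})
url = "http://localhost:8983/solr/mycore/select?" + params
print(url)
```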

Querystring binding error when using ServiceStack version 4.0.34 and OrmLiteCacheClient

We're getting an "unable to bind to request" error when calling a service with the following querystring:
/SomeService?playerid=59326&fromdate=4-1-2014&todate=12-11-2014
We have been using this querystring format for a while now.
The problem is either a change in 4.0.34 or something in the OrmLiteCacheClient, which we had turned off for a while and only just recently turned back on.
If I change the dates to the following format, it seems to work:
/SomeService?playerid=59326&fromdate=2014-4-1&todate=2014-12-31
We can roll with the changed querystring date format for now, but wanted to report the error.
When supplying a date only (i.e. one that doesn't include a time), it should be unambiguously defined using the YYYY-MM-DD format.
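The ambiguity is easy to see: "4-1-2014" reads as April 1 under a month-first convention and January 4 under a day-first one, while the ISO 8601 form parses only one way. A small illustration in Python (not ServiceStack code):

```python
# The same dashed string yields two different dates depending on
# which convention the parser assumes; the ISO form does not.
from datetime import datetime

iso = datetime.strptime("2014-04-01", "%Y-%m-%d")
us = datetime.strptime("4-1-2014", "%m-%d-%Y")  # month first
eu = datetime.strptime("4-1-2014", "%d-%m-%Y")  # day first
print(iso.date())            # → 2014-04-01
print(us.date(), eu.date())  # → 2014-04-01 2014-01-04
```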

Using indexed types for ElasticSearch in Titan

I currently have a VM running Titan over a local Cassandra backend and would like the ability to use ElasticSearch to index strings using CONTAINS matches and regular expressions. Here's what I have so far:
After titan.sh is run, a Groovy script is used to load in the data from separate vertex and edge files. The first stage of this script loads the graph from Titan and sets up the ES properties:
config.setProperty("storage.backend","cassandra")
config.setProperty("storage.hostname","127.0.0.1")
config.setProperty("storage.index.elastic.backend","elasticsearch")
config.setProperty("storage.index.elastic.directory","db/es")
config.setProperty("storage.index.elastic.client-only","false")
config.setProperty("storage.index.elastic.local-mode","true")
The second part of the script sets up the indexed types:
g.makeKey("property").dataType(String.class).indexed("elastic",Edge.class).make();
The third part loads the data from the CSV files; this has been tested and works fine.
My problem is, I don't seem to be able to use the ElasticSearch functions when I do a Gremlin query. For example:
g.E.has("property",CONTAINS,"test")
returns 0 results, even though I know this field contains the string "test" for that property at least once. Weirder still, when I change CONTAINS to something that isn't recognised by ElasticSearch, I get a "no such property" error. I can also perform exact string matches and any numerical comparisons, including greater than or less than; however, I suspect the default indexing method is being used instead of ElasticSearch in these cases.
Due to the lack of errors when I try to run a more advanced ES query, I am at a loss on what is causing the problem here. Is there anything I may have missed?
Thanks,
Adam
I'm not quite sure what's going wrong in your code. From your description everything looks fine. Can you try the following script (just paste it into your Gremlin REPL)?
config = new BaseConfiguration()
config.setProperty("storage.backend","inmemory")
config.setProperty("storage.index.elastic.backend","elasticsearch")
config.setProperty("storage.index.elastic.directory","/tmp/es-so")
config.setProperty("storage.index.elastic.client-only","false")
config.setProperty("storage.index.elastic.local-mode","true")
g = TitanFactory.open(config)
g.makeKey("name").dataType(String.class).make()
g.makeKey("property").dataType(String.class).indexed("elastic",Edge.class).make()
g.makeLabel("knows").make()
g.commit()
alice = g.addVertex(["name":"alice"])
bob = g.addVertex(["name":"bob"])
alice.addEdge("knows", bob, ["property":"foo test bar"])
g.commit()
// test queries
g.E.has("property",CONTAINS,"test")
g.query().has("property",CONTAINS,"test").edges()
The last 2 lines should return something like e[1t-4-1w][4-knows-8]. If that works and you still can't figure out what's wrong in your code, it would be good if you can share your full code (e.g. in Github or in a Gist).
Cheers,
Daniel