gora-mongodb.mapping.XML properties File - nutch

I'm new to Nutch (2.2.1) and trying to run it on Cygwin/Windows 7 with the latest version of Gora (0.5) so I can persist data to a MongoDB (2.6) datastore. I changed the nutch-site.xml file to include my MongoDB properties, but I'm a little confused about the gora-mongodb.mapping.XML properties file that's needed. Just wondering, do I need to:
1) create a Java class within the Nutch/Gora project, which I then specify in the class-name property of the gora-mongodb.mapping file, or will Gora create this for me? The documentation doesn't appear to be very clear.
2) I created a sample file in my apache-nutch-2.2.1\runtime\local\conf folder and added the name of my MongoDB collection. When I run Nutch I get the following error:
$ ./nutch crawl urls -dir testCrawl -depth 3 -topN 5
cygpath: can't convert empty path
Exception in thread "main" org.apache.gora.util.GoraException: java.lang.IllegalStateException: A collection is not specified
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:136)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
Caused by: java.lang.IllegalStateException: A collection is not specified
at org.apache.gora.mongodb.store.MongoMappingBuilder.build(MongoMappingBuilder.java:77)
at org.apache.gora.mongodb.store.MongoStore.initialize(MongoStore.java:168)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
... 8 more
Any help or clarification around this file would be appreciated.

You need two files in nutch/conf:
gora.properties: where you declare that you are going to use the MongoDB backend.
gora-mongodb-mapping.xml (notice the dash, not the dot you wrote): where you create the mapping between the names in the Gora entities and the fields in the datastore.
I really don't think the version you are using is prepared to work with Gora 0.5, but give it a shot: copy gora-mongodb-mapping.xml from Nutch-2.3-SNAPSHOT into nutch/conf/.
If it does not work, try using Nutch-2.3-SNAPSHOT instead of 2.2.1.
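In case it helps, here is a minimal sketch of what the two files contain. The property names follow the Gora 0.5 MongoDB store, and the host, database and collection names are placeholders, so check the copies shipped with Nutch-2.3-SNAPSHOT for the exact values and attribute names.
gora.properties:
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch
gora-mongodb-mapping.xml (the class here is Nutch's existing org.apache.nutch.storage.WebPage, so you do not write it yourself; the document attribute is what names the MongoDB collection, which is what the "A collection is not specified" error is complaining about):
<gora-otd>
  <class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String" document="webpage">
    <!-- one <field> entry per WebPage field, for example: -->
    <field name="baseUrl" docfield="baseUrl" type="string"/>
  </class>
</gora-otd>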

Related

I want to add the raw content which is stored in the segment folder (Nutch version 1.17)

While running the command below:
bin/nutch solrindex http://localhost:8983/solr/nutch/ testingnewline/crawldb -linkdb testingnewline/linkdb -dir testingnewline/segments/ -deleteGone -addBinaryContent
It is throwing the exception below.
Error: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/nutch: ERROR: [doc=https://www.saintlukeskc.org/] Error adding field 'binaryContent'
May I know what changes I need to make? Do I need to change the schema.xml? Please help me.
The Solr schema must contain the field "binaryContent"; see Nutch's default Solr schema.xml, which contains all the fields eventually added by Nutch or one of its plugins.
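If you would rather patch your existing schema by hand than replace it wholesale, an entry along these lines should unblock the indexer; the field type below is an assumption (solr.BinaryField), so copy the authoritative definition from Nutch's schema.xml if it differs:
<!-- assumed type; Nutch's stock schema.xml has the authoritative definition -->
<fieldType name="binary" class="solr.BinaryField"/>
<field name="binaryContent" type="binary" indexed="false" stored="true"/>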

Solr Question about Loading Changes to Schema

I'm new to Solr and received the following error when adding a document through pysolr:
pysolr.SolrError: Solr responded with an error (HTTP 400): [Reason: ERROR: [doc=bc4aa768-6f35-4888-80e0-1578d9971b3c] Error adding field 'periodical_nlm'='2984692R' msg=For input string: "2984692R"]
I ended up finding out that the first periodical_nlm value added was 404536.0, so I assumed it was a type issue. In Python I then cast every periodical_nlm explicitly to string before adding 2984692R. However, the error persisted.
I Googled a bit and found that I should probably explicitly tell Solr that I want that field to be a string. I've not gotten very "hands on" with the schema yet, so I just had some questions:
(1) There appear to be two schema files: managed-schema in the directory for the core and managed-schema in the conf folder of the core. I'm assuming that the initialized schema which is in use is the one in the conf folder?
(2) Which do I update in order for things to proceed smoothly? I attempted adding the following to the schema file in the core directory but the error persisted:
field name="periodical_nlm" type="string" indexed="true" stored="true" required="false" multiValued="false" />
Do I need to rerun some initialization process or add something to the conf file separately?
Thank you so much and please let me know if you need more info. I'm running on a Windows 10 Home x64 platform (not sure if that's important if there are any command-line things I need to run...).
As long as you reload the core after changing the managed-schema file under conf, you should be fine. Be aware that you should do this before indexing content - so you might need to clean out the index by deleting everything, then changing the schema and re-indexing your content. Changing the schema does not change content that has already been indexed.
Otherwise your assumption is correct. Schemaless mode determines the field type from the format of the first value submitted (not from the actual type, which usually isn't included in any way; all values are just strings when submitted, so Solr guesses the type by applying a hierarchy of pattern matching). That is useful for prototyping, but when you move to production you should always define the schema explicitly to avoid issues like the one you've seen here.
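For completeness, here is roughly what the reload (and, if needed, the wipe before re-indexing) looks like from the command line; my_core is a placeholder for your core name and the URL assumes a default local install:
# reload the core so the edited conf/managed-schema is picked up
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=my_core"
# optionally delete everything already indexed before re-indexing
curl "http://localhost:8983/solr/my_core/update?commit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"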

Hdfs file access in spark

I am developing an application where I read a file from Hadoop, process it, and store the data back to Hadoop.
I am confused about what the proper HDFS file path format should be. When reading an HDFS file from the Spark shell like
val file=sc.textFile("hdfs:///datastore/events.txt")
it works fine and I am able to read it.
But when I submit the jar containing the same code to YARN, it gives the error
org.apache.hadoop.HadoopIllegalArgumentException: Uri without authority: hdfs:/datastore/events.txt
When I add the name node address, as in hdfs://namenodeserver/datastore/events.txt, everything works.
I am a bit confused about this behaviour and need some guidance.
Note: I am using an AWS EMR setup and all the configurations are default.
If you want to use sc.textFile("hdfs://...") you need to give the full (absolute) path; in your example that would be "nn1home:8020/..".
If you want to make it simple, then just use sc.textFile("hdfs:/input/war-and-peace.txt")
That's only one /
I think it will work.
Problem solved. As I debugged further, the fs.defaultFS property from core-site.xml was not being used when I just passed the path as hdfs:///path/to/file, even though all the Hadoop config properties were loaded (I logged the sparkContext.hadoopConfiguration object).
As a workaround I manually read the property with sparkContext.hadoopConfiguration().get("fs.defaultFS") and prepended it to the path.
I don't know whether this is the correct way of doing it.
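For reference, a minimal sketch of that workaround as it would look in the Scala shell (the path is just the one from the question):
// read fs.defaultFS from the Hadoop config Spark already loaded, then prepend it
val defaultFs = sc.hadoopConfiguration.get("fs.defaultFS")   // e.g. hdfs://namenodeserver:8020
val file = sc.textFile(s"$defaultFs/datastore/events.txt")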

Sphinx4 figuring out correct models

I am trying to use the Sphinx4 library for speech recognition, but I cannot seem to figure out the correct combination of acoustic model-dictionary-language model. I have tried out various combinations and I get a different error every time.
I am trying to follow the tutorial on http://cmusphinx.sourceforge.net/wiki/tutorialsphinx4. I do not have a config.xml as I would if I was using ConfigurationManager instead of Configuration, because there is no perceivable way of passing the location of the config file to the Configuration itself (ConfigMgr takes it as an argument to the constructor); and that might be my problem right there. I just do not know how to point to one, and since the tutorial says "It is possible to configure low-level components of the application through XML file although you should do that ONLY IF you understand what is going on.", I assume having a config.xml file is not compulsory.
Combining the latest dictionary (7b - obtained from Sourceforge) with the latest acoustic model (cmusphinx-en-us-5.2.tar.gz - from SF again) and the language model (cmusphinx-5.0-en-us.lm.gz - from SF again) results in a NullPointerException in startRecognition. The issue is similar to the problem here: sphinx-4 NullPointerException at startRecognition, but the link given in the answer no longer works. I obtained 0.7a from SF (since that is the dict the link seems to point at), but when I use that one I get Error loading word: ;;; even earlier in the execution. I tried downloading the latest models and dict from the GitHub repo; that results in java.lang.IndexOutOfBoundsException: Index: 16128, Size: 16128.
Any help is much appreciated!
You need to use the latest code from GitHub:
http://github.com/cmusphinx/sphinx4
as described in the tutorial:
http://cmusphinx.sourceforge.net/wiki/tutorialsphinx4
The correct models (en-us) are already included; you should not replace anything. You should not configure any XML files; use the samples as provided in the sources.
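For reference, the bundled-model setup from that tutorial looks roughly like this; the resource paths are the ones shipped in the sphinx4-data jar, so treat them as an assumption if your snapshot differs:
import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;

public class BundledModelDemo {
    public static void main(String[] args) throws Exception {
        // models come from the sphinx4-data jar on the classpath, not from separate downloads
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
        recognizer.startRecognition(true);              // true = discard previously cached audio
        SpeechResult result = recognizer.getResult();   // blocks until an utterance is decoded
        recognizer.stopRecognition();
        System.out.println("You said: " + result.getHypothesis());
    }
}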

Apache Pig: Load a file that shows fine using hadoop fs -text

I have files that are named part-r-000[0-9][0-9] and that contain tab separated fields. I can view them using hadoop fs -text part-r-00000 but can't get them loaded using pig.
What I've tried:
x = load 'part-r-00000';
dump x;
x = load 'part-r-00000' using TextLoader();
dump x;
but that only gives me garbage. How can I view the file using pig?
What might be of relevance is that my hdfs is still using CDH-2 at the moment.
Furthermore, if I download the file locally and run file part-r-00000, it says part-r-00000: data, and I don't know how to unzip it locally.
According to HDFS Documentation, hadoop fs -text <file> can be used on "zip and TextRecordInputStream" data, so your data may be in one of these formats.
If the file was compressed, normally Hadoop would add the extension when outputting to HDFS, but if this was missing, you could try testing by unzipping/ungzipping/unbzip2ing/etc locally. It appears Pig should do this decompressing automatically, but may require the file extension be present (e.g. part-r-00000.zip) -- more info.
I'm not too sure about the TextRecordInputStream... it sounds like it would just be Pig's default method, but I could be wrong. I didn't see any mention of LOADing this data via Pig when I did a quick Google.
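One quick way to tell which case you are in is to look at the file's first bytes; a Hadoop sequence file starts with the letters SEQ:
hadoop fs -cat part-r-00000 | head -c 10
# for a sequence file this prints SEQ followed by a version byte and the start of the key class name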
Update:
Since you've discovered it is a sequence file, here's how you can load it using PiggyBank:
-- using Cloudera directory structure:
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar
--REGISTER /home/hadoop/lib/pig/piggybank.jar
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
-- Sample job: grab counts of tweets by day
A = LOAD 'mydir/part-r-000{00..99}' -- not sure if pig likes the {00..99} syntax, but worth a shot
USING SequenceFileLoader AS (key:long, val:long, etc.);
If you want to manipulate (read/write) sequence files with Pig, then you can also give Twitter's Elephant-Bird a try.
You can find examples there of how to read/write them.
If you use custom Writables in your sequence file, then you can implement a custom converter by extending AbstractWritableConverter.
Note that Elephant-Bird needs Thrift installed on your machine.
Before building it, make sure that it is using the correct Thrift version you have, and also provide the correct path to the Thrift executable in its pom.xml:
<plugin>
  <groupId>org.apache.thrift.tools</groupId>
  <artifactId>maven-thrift-plugin</artifactId>
  <version>0.1.10</version>
  <configuration>
    <thriftExecutable>/path_to_thrift/thrift</thriftExecutable>
  </configuration>
</plugin>
