Apache Nutch 1.17 indexer-rabbit not working

I am trying to push crawled documents to RabbitMQ. I have followed all the docs available:
IndexWriters Mapping
RabbitMQ README
However, I can't manage to run indexer-rabbit. Looking at the logs, there isn't even a mention of indexer-rabbit. I am just trying to make it work before configuring it further. I tried connecting to RabbitMQ with a small custom program, and everything works.
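For reference, the connectivity check was essentially along these lines (a minimal sketch using the com.rabbitmq:amqp-client Java library; the broker URI and queue name are the ones from the configuration below):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class RabbitCheck {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        // Same broker as rabbitmq.publisher.server.uri in nutch-site.xml
        factory.setUri("amqp://guest:guest@172.17.0.2:5672/");
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            // Declare the queue the index writer is expected to publish to
            channel.queueDeclare("nutch.queue", true, false, false, null);
            channel.basicPublish("", "nutch.queue", null, "ping".getBytes());
            System.out.println("Connected and published a test message.");
        }
    }
}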
I have included the indexer plugin in nutch-site.xml as well:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-rabbit|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>rabbitmq.publisher.server.uri</name>
  <value>amqp://guest:guest@172.17.0.2:5672/</value>
</property>
<property>
  <name>publisher.queue.type</name>
  <value>RabbitMQ</value>
</property>
Also, the index-writer mappings (in conf/index-writers.xml) are the defaults and seem right for testing:
<writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
<parameters>
<param name="type" value="http"/>
<param name="url" value="http://localhost:8983/solr/nutch"/>
<param name="collection" value=""/>
<param name="weight.field" value=""/>
<param name="commitSize" value="1000"/>
<param name="auth" value="false"/>
<param name="username" value="username"/>
<param name="password" value="password"/>
</parameters>
<mapping>
<copy>
<!-- <field source="content" dest="search"/> -->
<!-- <field source="title" dest="title,search"/> -->
</copy>
<rename>
<field source="metatag.description" dest="description"/>
<field source="metatag.keywords" dest="keywords"/>
</rename>
<remove>
<field source="segment"/>
</remove>
</mapping>
</writer>
<writer id="indexer_rabbit_1" class="org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter">
<parameters>
<param name="server.uri" value="amqp://guest:guest#172.17.0.2:5672/"/>
<param name="binding" value="false"/>
<param name="binding.arguments" value=""/>
<param name="exchange.name" value=""/>
<param name="exchange.options" value="type=direct,durable=true"/>
<param name="queue.name" value="nutch.queue"/>
<param name="queue.options" value="durable=true,exclusive=false,auto-delete=false"/>
<param name="routingkey" value=""/>
<param name="commit.mode" value="multiple"/>
<param name="commit.size" value="250"/>
<param name="headers.static" value=""/>
<param name="headers.dynamic" value=""/>
</parameters>
<mapping>
<copy>
<field source="title" dest="title,search"/>
</copy>
<rename>
<field source="metatag.description" dest="description"/>
<field source="metatag.keywords" dest="keywords"/>
</rename>
<remove>
<field source="content"/>
<field source="segment"/>
<field source="boost"/>
</remove>
</mapping>
</writer>
Does anybody have any idea what I am missing here?

Turned out it was my own mistake, and just a slight one: I hadn't added the index flag to the crawl command. The previous command looked like this:
./bin/crawl -s ./urls --hostdbupdate --hostdbgenerate --size-fetchlist 20 ./crawl 3
There is no -i (index) flag in this command, so the indexing step was being skipped. The new command should be:
./bin/crawl -i -s ./urls --hostdbupdate --hostdbgenerate --size-fetchlist 20 ./crawl 3
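With indexing enabled, it is easy to confirm that documents actually reach the broker; assuming rabbitmqctl is available on the RabbitMQ host, a quick check is:

rabbitmqctl list_queues name messages

which lists each queue with its message count, so nutch.queue should start filling up once the index step runs.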

Related

WSO2 RabbitMQ - channels created and not closed - memory leak problem

I am using WSO2 ESB with RabbitMQ, and I have one proxy service and one sequence.
The proxy service works as a consumer for a RabbitMQ queue (via the rabbitmq transport); consumed messages are sent to an HTTP endpoint.
The sequence works as a producer that adds actions to the queue by calling an endpoint.
I also have an API that inserts into the queue on each call. Everything works correctly, but every time we call the API, many channels are created on the broker and never closed, causing a "memory leak" problem on the RabbitMQ server machine.
We tried creating both "direct" and "fanout" exchanges, but this did not resolve the memory leak.
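As a side note, the channel build-up is easy to confirm from the broker side; assuming shell access to the RabbitMQ host, something like:

rabbitmqctl list_channels connection number

lists every open channel, so you can watch the count grow while calling the API.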
Below is the sequence code:
<sequence name="add-insertqueue-tostore" trace="disable" xmlns="http://ws.apache.org/ns/synapse">
  <property name="FORCE_SC_ACCEPTED" scope="axis2" type="STRING" value="true"/>
  <property name="OUT_ONLY" scope="default" type="STRING" value="true"/>
  <call>
    <endpoint key="gov:endpoints/rabbit/insert-toqueue.xml"/>
  </call>
  <log level="full">
    <property name="Sequence" value="AddToQueue"/>
    <property name="step" value="Message inserted"/>
  </log>
  <property name="FORCE_SC_ACCEPTED" scope="axis2" type="STRING" value="false"/>
  <property name="OUT_ONLY" scope="default" type="STRING" value="false"/>
</sequence>
Below is the endpoint code:
<endpoint name="insert-toqueue" xmlns="http://ws.apache.org/ns/synapse">
  <address uri="rabbitmq:/AMQPProxy?rabbitmq.server.host.name=rabbit.server&amp;rabbitmq.server.port=5672&amp;rabbitmq.server.user.name=username&amp;rabbitmq.server.password=password&amp;rabbitmq.queue.name=queue&amp;rabbitmq.server.virtual.host=/virtual-host&amp;rabbitmq.exchange.name=exchange"/>
</endpoint>
Below is the consumer code:
<proxy name="rabbit-consumer" startOnLoad="true" trace="enable" transports="rabbitmq" xmlns="http://ws.apache.org/ns/synapse">
  <target>
    <inSequence>
      <property action="remove" name="SOAPAction" scope="transport"/>
      <property action="remove" name="WSAction" scope="transport"/>
      <property name="ContentType" scope="transport" type="STRING" value="application/json"/>
      <property name="messageType" scope="axis2" type="STRING" value="application/json"/>
      <property name="HTTP_METHOD" scope="axis2" type="STRING" value="POST"/>
      <property expression="json-eval($.name)" name="name" scope="default" type="STRING"/>
      <property expression="json-eval($.surname)" name="surname" scope="default" type="STRING"/>
      <log level="full"/>
      <call-template target="my-template">
        <with-param name="name" value="{get-property('name')}" xmlns:ns="http://org.apache.synapse/xsd"/>
        <with-param name="surname" value="{get-property('surname')}" xmlns:ns="http://org.apache.synapse/xsd"/>
      </call-template>
      <property name="OUT_ONLY" scope="default" type="STRING" value="true"/>
    </inSequence>
    <outSequence/>
    <faultSequence/>
  </target>
  <parameter name="rabbitmq.exchange.type">fanout</parameter>
  <parameter name="rabbitmq.exchange.name">exchange</parameter>
  <parameter name="rabbitmq.queue.name">queue</parameter>
  <parameter name="rabbitmq.connection.factory">AMQPConnectionFactory</parameter>
</proxy>
Do you know how to resolve this problem? Thanks a lot.
Did you try to reboot?
Normally it works.
Bye
When configuring the RabbitMQ sender, maybe you can try the "cached" connection factory; it reuses connections and channels instead of opening new ones for every message, which is what appears to be leaking here:
<transportSender name="rabbitmq" class="org.apache.axis2.transport.rabbitmq.RabbitMQSender">
  <parameter name="CachedRabbitMQConnectionFactory" locked="false">
    <parameter name="rabbitmq.server.host.name" locked="false">localhost</parameter>
    <parameter name="rabbitmq.server.port" locked="false">5672</parameter>
    <parameter name="rabbitmq.server.user.name" locked="false">user</parameter>
    <parameter name="rabbitmq.server.password" locked="false">abc123</parameter>
  </parameter>
</transportSender>
Doc link: https://docs.wso2.com/display/EI611/RabbitMQ+AMQP+Transport
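(For reference, this <transportSender> definition belongs in the product's axis2.xml; as far as I know that is <EI_HOME>/conf/axis2/axis2.xml on WSO2 EI 6.x and <ESB_HOME>/repository/conf/axis2/axis2.xml on older ESB releases. It replaces the stock rabbitmq sender entry.)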

log4net RollingFileAppender compress old files

Is there a way in log4net's RollingFileAppender to compress the old backup files via configuration? I have found such a property in log4php, but not in log4net:
<configuration xmlns="http://logging.apache.org/log4php/">
  <appender name="default" class="LoggerAppenderRollingFile">
    <layout class="LoggerLayoutSimple" />
    <param name="file" value="file.log" />
    <param name="maxFileSize" value="1MB" />
    <param name="maxBackupIndex" value="5" />
  </appender>
  <root>
    <appender_ref ref="default" />
  </root>
</configuration>
About two years ago, the answer to the same question was "not possible". I wonder if it is possible now.
Thank you.

How to change the log4net rolling filenames to log_YYYYMMDD_HHmmss.txt

I am maintaining some C# code and I want log4net to store old log files as:
log_YYYYMMDD_HHmmss.txt
e.g.:
log_20140617_193526.txt
I believe this is the relevant part of the config file, with my attempts at modifying it:
<appender name="HourlyAppender" type="log4net.Appender.RollingFileAppender">
<file type="log4net.Util.PatternString"
value="${ALLUSERSPROFILE}/Optex/RedwallServer/Log/log.txt" />
<appendToFile value="false" />
<datePattern value="yyyyMMdd_HHmmss.\tx\t" />
<rollingStyle value="Date" />
<layout type="log4net.Layout.PatternLayout">
<param name="Header" value="" />
<param name="Footer" value="" />
<param name="ConversionPattern" value="%d [%t] %-5p %c %m%n" />
</layout>
</appender>
It is producing a current log file of:
log.txt
And old log files are stored like:
log.txt20140617_193526.txt
Does anyone have any idea how I can change the prefix from "log.txt" to "log_"?
What I would really like is to figure this out myself, but I can't for the life of me find any decent documentation. I found this SDK page on RollingFileAppender, but it is not what I'm after:
http://logging.apache.org/log4net/release/sdk/log4net.Appender.RollingFileAppender.html
It seems you have to change log.txt to log_:
<file type="log4net.Util.PatternString"
      value="${ALLUSERSPROFILE}/Optex/RedwallServer/Log/log_" />

Log4net filename with date and two logging sources

I'm trying to configure a RollingFileAppender that has a filename format similar to the IIS log files - testYYMMDD.log - and that rolls over by UTC date.
Here is my configuration:
<appender name="rfDate" type="log4net.Appender.RollingFileAppender">
<param name="File" value="C:\Logs\test" />
<param name="StaticLogFileName" value="false" />
<param name="AppendToFile" value="true" />
<param name="RollingStyle" value="Date" />
<param name="DatePattern" value="yyMMdd.lo\g" />
<param name="MaxSizeRollBackups" value="7" />
<dateTimeStrategy type="log4net.Appender.RollingFileAppender+UniversalDateTime" />
<layout type="log4net.Layout.PatternLayout">
<param name="ConversionPattern" value="%utcdate [%t] %-6p %c - %m%n" />
</layout>
</appender>
It works fine if there is only one source writing to this file. However, I have two sources that I want to write to the same file.
When the first one writes, the log file is created successfully - test140409.log
When the second source writes, a new log file is created - test140409.log140409.log
Both logging sources continue writing to their respective files.
Is there any way to have the date IN the filename, with two logging sources writing to the one log file, without it creating two files?

Where to put log4j.xml in NetBeans 6.8?

In my case, I have configured my log4j.xml like this:
<appender name="FILE" class="org.apache.log4j.RollingFileAppender">
<errorHandler class="org.apache.log4j.helpers.OnlyOnceErrorHandler" />
<param name="File" value="F:/myLogger.log" />
<param name="Append" value="true" />
<param name="MaxFileSize" value="20000KB" />
<param name="MaxBackupIndex" value="400" />
<layout class="org.apache.log4j.PatternLayout">
<param name="ConversionPattern" value="%-5p: %d{dd MMM yyyy HH:mm:ss.SSS} %-5l - %m%n%n" />
</layout>
</appender>
<!-- Root Logger -->
<root>
<priority value="error" />
<appender-ref ref="FILE" />
</root>
and put the log4j.xml file in the source package, but the log file is not created in the specified folder. Let me know the exact path.
(It may be that I am using some external JAR which has its own log4j.xml.) So how do I give priority to the root application's log4j.xml file?
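If a JAR on the classpath ships its own log4j.xml, one way to make your own file win is to point log4j 1.x at it explicitly instead of relying on classpath lookup. A sketch (the F:/config path is just an example): either pass a JVM argument

-Dlog4j.configuration=file:///F:/config/log4j.xml

or configure it programmatically at startup:

import org.apache.log4j.xml.DOMConfigurator;

public class Main {
    public static void main(String[] args) {
        // Explicitly load this application's log4j.xml before any logging happens,
        // taking precedence over whatever was picked up from the classpath.
        DOMConfigurator.configure("F:/config/log4j.xml");
        // ... rest of the application
    }
}

In NetBeans, putting log4j.xml directly under the src folder (default package) should also work, since it is then copied to build/classes and ends up on the runtime classpath.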
