I’ve been working with a team over the last few months to create a search engine that could sort by date for the local student newspaper. Among the open-source search engines available, Nutch seems to be the easiest to set up. However, I couldn’t find any tutorials on sorting by date so I decided to write this one.
Nutch, in its default configuration, will sort pages by relevance using Lucene scores. You can get Nutch to sort by another field in its index using the sort parameter, but first, you need a field to sort by. (For an example of how sorting works, add &sort=url to the end of your Nutch query. Obviously sorting by url isn’t very useful, but you can see how the query works. Note that you can also add &reverse=true to sort the results in reverse order.)
Sorting by date in Nutch essentially involves two parts: first, using a plugin to get Nutch to index dates into a new field in your index, and second, getting your query page to add the additional parameter to search queries.
Indexing dates with a Nutch plugin
NOTE: I figured out most of the information here from the tutorial on writing a Nutch plugin available on the Nutch Wiki. If you need more detailed information on any of the steps here I recommend reading that tutorial.
All Nutch plugins implement an interface. These interfaces are called extension points. The Nutch Wiki has a list of available extension points. The goal of our plugin is to add an additional date field to the Lucene index, so we want to implement the IndexingFilter interface.
Nutch plugins consist of xml description files and your Java source code files. File structure is as follows:
/nutch-dir
/src
/plugin
/yourpluginname
build.xml
plugin.xml
/src
/org
/apache
/nutch
/yourpluginname
ImplementationClassName.java
Here’s an example build.xml file:
<?xml version="1.0"?> <project name="yourpluginname" default="jar"> <import file="../build-plugin.xml"/> </project>
And here’s an example plugin.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<plugin
id="yourpluginname"
name="longer/human-readable name of your plugin"
version="0.0.1"
provider-name="your_website">
<runtime>
<library name="yourpluginname.jar">
<export name="*"/>
</library>
</runtime>
<extension id="org.apache.nutch.urldateindexer.yourpluginname"
name="longer/human-readable name of your plugin"
point="org.apache.nutch.indexer.IndexingFilter">
<implementation id="ImplementationClassName"
class="org.apache.nutch.yourpluginname.ImplementationClassName"/>
</extension>
</plugin>
NOTE: I had some trouble using my own package name, but it might have been due to another problem. When in doubt I recommend sticking with org.apache.nutch.yourpluginname.
And here’s the code for the Java source file, ImplementationClassName.java. (You’ll want to change it to your own name to match what you put in the plugin.xml file.)
package org.apache.nutch.yourpluginname;
//imports
import java.util.Date;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.DateTools.Resolution;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;
public class UrlDateIndexer implements IndexingFilter {
private Configuration conf;
public UrlDateIndexer() {
}
//this is where your code to add a new field to the index goes
public Document filter(Document doc, Parse parse, Text url,
CrawlDatum datum, Inlinks inlinks)
throws IndexingException {
// Get url from the method inputs
String urlString = url.toString();
// Use regex to identify type of date and parse date from url format (code for the TdcDateParser class not shown)
TdcDateParser tdcDateParser = new TdcDateParser();
Date publishedDate = tdcDateParser.parseUrl(urlString);
// Store date in field
if (null != publishedDate){
//create a new field
Field publishedDateField = new Field("published_date",DateTools.dateToString(publishedDate, Resolution.DAY),
Field.Store.YES, Field.Index.UN_TOKENIZED);
//add the field to the index
doc.add(publishedDateField);
}
return doc;
}
public void setConf(Configuration conf) {
this.conf = conf;
}
public Configuration getConf() {
return this.conf;
}
}
In the case of the plugin I build, the goal was to parse the date from the url of a page. You’ll need to modify this method to get the date from wherever is appropriate for your search engine. The basic flow is to create a field, add that field to a document and then return the document.
There are a few things to note about the line that creates the new field:
Field publishedDateField = new Field("published_date",DateTools.dateToString(publishedDate, Resolution.DAY),
Field.Store.YES, Field.Index.UN_TOKENIZED);
dateToString method to store dates. Field.Store to YES is mandatory to be able to sort by a field.Now that you have finished creating your files, upload them to the nutch directory on your server if you haven’t already done so. Then, add a line like the following to the build.xml file in the /nutch-dir/src/plugin directory:
<ant dir="yourpluginname" target="deploy"/>
Then launch ant from the command line in that same directory (/nutch-dir/src/plugin). This should compile the classes and jar file for your plugin and place them in the /nutch-dir/build directory. It should also place a copy of the jar file and plugin.xml file in the /nutch-dir/plugins directory.
NOTE: I had to modify the deploy.dir property in the /nutch-dir/src/plugin/build-plugin.xml file to get ant to put the jar and plugin.xml file in the right plugins directory. Originally it was set to put them in build/plugins. I’m not sure which directory is formally correct, but the /nutch-dir/plugins directory is where Nutch was set to read from on my system. If this is the case for you, just remove /build from the deploy.dir line. My resulting deploy.dir line is as follows:
<property name="deploy.dir" location="${nutch.root}/plugins/${name}"/>
Next, you need to modify the /nutch-dir/conf/nutch-site.xml file to recognize your new plugin. Just add your plugin name to the plugin.includes property. For example:
<property> <name>plugin.includes</name> <value>yourpluginname|protocol-http|urlfilter-regex|parse-(text|html|js)|index -basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|reg ex|basic)</value> </property>
Launch a Nutch crawl of just a few pages to test your new plugin. To see if the plugin was activated by Nutch, open /nutch-dir/logs/hadoop.log and look for a line like the following:
2009-04-17 10:54:27,477 INFO plugin.PluginRepository - Indexer of dates from URLs (urldateindexer)
If the plugin didn’t show up, make sure your plugin jar and xml files are in the directory indicated in these log file lines:
2009-04-17 10:54:23,617 INFO plugin.PluginRepository - Plugins: looking in: /nutch-0.9/plugins 2009-04-17 10:54:24,062 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true] 2009-04-17 10:54:24,062 INFO plugin.PluginRepository - Registered Plugins:
Finally, test to see whether your plugin actually added fields to the index. You can do this using Luke. Download the yourcrawl/index directory from your crawl to your local machine and open it with Luke. Check that the name of your new field is listed in the fields list and that the field holds the expected values for your sample documents. If not, modify your source code, run ant in the /nutch-dir/src/plugin directory to rebuild the jar file and run your crawl again.
Sorting by date from your query page
You’re almost done! Now that your new field is in the index, you just need to modify your query page to sort by the new field. The default nutch query pages are /apache-tomcat-6.0.14/webapps/ROOT/search.jsp and (language)/search.html in that directory.
If you always want to sort by date, just find the line of code with input name="query" in it and add a parameter to sort by date after it. For example:
<input name="query" size="44" /> <input type="submit" value="Search" /> <input type="hidden" name="sort" value="published_date" />
Note that you may also want to add the following parameter to make Nutch put the most recent dates first.
<input type="hidden" name="reverse" value="true" />
If you want to do something more sophisticated, like only sort by date in certain situations, you can modify the code in search.jsp. Look for the following line, and modify it as needed:
String sort = request.getParameter("sort");
For instance, I added a checkbox to my site that let users chose whether to sort by date or not. Since this checkbox returned a value only if it was checked, I wrote code that used the presence of that parameter to determine whether to assign a value to sort.
if (request.getParameter("sort_by_date") != null){
sort = "published_date";
reverse = true;
}
Good luck with your own Nutch sorting projects! Feel free to leave a comment or contact me if you have any questions.
You were always so smart. Im proud of you!
Ryan -
I was wondering if you have a working demo of this up anywhere for your newspaper website? We are looking for a search option for our News website with a sort-by-date option and are considering Nutch or Solr.
Hi Scott,
We don’t have a working demo up on the newspaper site because we ended up using a custom Google Search Engine with the date value heavily weighted instead. This accomplished 99% of what we were hoping to achieve while minimizing many of the administrative and software maintenance headaches. I was advised on this project by professor C. Lee Giles. He might still have a running copy of the Nutch program around somewhere, or at least the source code. Shoot me an e-mail if you’re interested.
I’m a bit confused why the plugin was needed. Is it because nutch stores dates in the wrong way, or because you needed to get a specific date parsed from the page?
If so, why weren’t the default fields (date, publishedDate, updatedDate) good enough?
I am curious because I’m having date issues with Nutch right now. They don’t seem to be in a format that I can do range queries with in solr.
Hi Max,
Since the site I was indexing used Movable Type to generate the pages, the files are occasionally recreated. That meant the file’s modification date wasn’t necessarily the date of the content on that page. The date of the content itself was indicated by the url of the page, e.g. “/2009/02/01/pagename.html.”
I created the plugin so I could specifically index and sort by the date in the url.
Unfortunately this was a while ago and I don’t specifically recall why I didn’t use the default fields you referenced to store the date. It may be because 1) I was simply unaware of them (I couldn’t easily find a list of the fields online), 2) the fields were already populated with other data such as the file’s modification date, or 3) the fields weren’t configured precisely as I needed them. As you’ll note from the source code, the field had to be both stored and un-tokenized.
If you get it working with the default fields, let me know!
Hello
Changing the deploy.dir on my system causes the nutch-1.1.war file to not be deployed correctly in tomcat6, can you help understand why?
catalina.out
outputs
12/Ago/2010 18:11:25 org.apache.catalina.core.StandardContext start
SEVERE: Error listenerStart
12/Ago/2010 18:11:25 org.apache.catalina.core.StandardContext start
SEVERE: Context [/nutch-1.1] startup failed due to previous errors
What I did was to change
plugin.folders
from
plugins,build/plugins
to
plugins
IE. what I did to solve the problem
André, not sure off the top of my head. Can you try putting build/plugins back in and see if it starts working again? My guess is there are some files there that Nutch needs.
Also, can you provide more context about why you are making this change? Maybe we can figure out another way to accomplish the same thing.
Hii
when I append &sort=date to my url, it is working fine.
But I am not able to sort based on the size of the document by appending &sort=contentLength to my url. What are the set of values the sort parameter can take.
I have included both index-more and query-more plugin.
Please help. Thanks
Hi Hemanth,
Nutch can only sort by parameters it has indexed. I’m not sure if it indexes contentLength, since Nutch doesn’t index all page metadata properties by default. I’m not sure if there’s documentation about what properties are indexed by default — your best bet might be to just write code to force it to index contentLength if it’s not working.
the index-more plugin adds date, content-length, contentType, primaryType and subtype fields to the index.
I am sure they are being indexed, because I can actually display them along with the search results.
Anyway, Thank you Ryan. Your article above is really good. I will try it out.
[...] Writing a plugin to add dates by Ryan Pfister [...]
This is Very Interesting Mr. Pfister. Thank you for sharing this quality information with us all. Haha just kiddin’.` Wats poppin man? and wats nutch? anyway tell kira i sed hi!
Good day! I could have sworn I’ve been to this blog before but after reading through some of the post I realized it’s new to me.
Nonetheless, I’m definitely happy I found it and I’ll be bookmarking
and checking back often!