TomcatExpert

Crawler Session Manager Valve

posted by mthomas on May 18, 2011 07:25 AM

For organizations with large publicly searchable websites, such as ecommerce companies with large product catalogues or companies with active online communities, web crawlers or bots can trigger the creation of many thousands of sessions as they crawl these sites. Because bots normally crawl without returning cookies or session IDs, they can create a new session for each page crawled which, depending on the size of the site, may result in significant memory consumption. New in Apache Tomcat 7, the Crawler Session Manager Valve ensures that each crawler is associated with a single session, just like a normal user, regardless of whether or not it provides a session token with its requests.

A Relevant Example

One of the roles I play in the Apache Tomcat project is managing the issues.apache.org servers, which run the two Apache issue trackers we have: Bugzilla (two instances) and JIRA (one instance). Not surprisingly, JIRA runs on Tomcat. A few months ago, while looking at the JIRA management interface, I noticed that we were seeing around 100,000 concurrent sessions. Given that there are only 60,000 registered users and fewer than 5,000 active users in any month, this number appeared extremely inflated.

After a bit of investigation, the access logs revealed that many of the web crawlers (e.g., googlebot, bingbot) were creating a new session for every request as they crawled the JIRA site. For our JIRA instance, this meant that about 95% of the open sessions were left over from a bot making a single request. For instance, a bot requesting 100 pages would open 100 sessions. Each one of these sessions would hang around in memory for about 4 hours, chewing up a great deal of memory on the server.

The Fix

The goal of the Crawler Session Manager Valve is to ensure that when that same crawler requests those 100 pages, it results in only a single session. To do this, Tomcat matches the user agent HTTP request header of each incoming request against a regular expression that identifies known crawlers (by default .*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*), and it keeps a note of the IP address each matching request came from as well as the session ID last used by that address.
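To make the user agent check concrete, here is a minimal sketch in plain Java. It assumes the default pattern is .*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.* and that the whole header value must match. The class name, method name and sample header values are made up for illustration and are not part of Tomcat.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; this is not Tomcat's code.
class CrawlerAgentMatch {
    // Assumed default value of the valve's crawlerUserAgents setting.
    static final Pattern CRAWLER_AGENTS =
            Pattern.compile(".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*");

    // True when the entire user agent header value matches the pattern.
    static boolean isCrawler(String userAgent) {
        return userAgent != null && CRAWLER_AGENTS.matcher(userAgent).matches();
    }

    public static void main(String[] args) {
        // Sample header values, chosen only to exercise the pattern.
        System.out.println(isCrawler(
                "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")); // prints true
        System.out.println(isCrawler(
                "Mozilla/5.0 (Windows NT 6.1) Firefox/4.0")); // prints false
    }
}
```

Because the pattern is a plain substring-style match on "bot" or "Bot", any client whose user agent happens to contain those letters would also be treated as a crawler.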

When a crawler first accesses the site, a new session is created as part of that first request. When it requests a second page, the Crawler Session Manager Valve recognizes the crawler from its user agent header, matches it to the IP address and inserts the previous session ID into the request. Thus, the crawler only ever opens a single session.
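The flow described above can be modelled in a few lines of plain Java. This is a simplified sketch of the idea, not the valve's actual implementation: the class, its methods, the "S1"-style session IDs and the sample IP address are all hypothetical, and real session creation is replaced by a counter.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Simplified, hypothetical model of the behaviour; not Tomcat's code.
class CrawlerSessionSketch {
    private static final Pattern CRAWLER_AGENTS =
            Pattern.compile(".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*");

    // Last session ID seen for each crawler IP address.
    private final Map<String, String> sessionIdByIp = new HashMap<>();
    private int sessionsCreated = 0;

    // Handles one request and returns the ID of the session it ends up using.
    String handle(String clientIp, String userAgent, String requestedSessionId) {
        boolean crawler = userAgent != null
                && CRAWLER_AGENTS.matcher(userAgent).matches();
        String sessionId = requestedSessionId;
        if (sessionId == null && crawler) {
            // The crawler sent no session token: reuse the one noted for its IP.
            sessionId = sessionIdByIp.get(clientIp);
        }
        if (sessionId == null) {
            // Stand-in for the container creating a brand new session.
            sessionId = "S" + (++sessionsCreated);
        }
        if (crawler) {
            sessionIdByIp.put(clientIp, sessionId);
        }
        return sessionId;
    }

    int sessionsCreated() {
        return sessionsCreated;
    }

    public static void main(String[] args) {
        CrawlerSessionSketch sketch = new CrawlerSessionSketch();
        for (int page = 0; page < 100; page++) {
            // The bot never sends a session token with its requests.
            sketch.handle("192.0.2.10", "ExampleBot/1.0", null);
        }
        System.out.println(sketch.sessionsCreated()); // prints 1
    }
}
```

In this model an ordinary browser that refuses cookies would still get a new session on every request, since only clients matching the crawler pattern are looked up by IP address.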

Configuring the Crawler Session Manager Valve

Shipped with Tomcat 7, the Crawler Session Manager is not enabled by default. To turn on the valve, see the valve documentation at http://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve.

There are two main options for configuring this valve. The first is the crawlerUserAgents property, a regular expression that specifies which bots to look for by matching their user agent header. Additionally, you can configure the sessionInactiveInterval, which specifies how long (in seconds) Tomcat should hold on to the assigned session ID. It is not recommended to hold on to the session ID for more than a couple of hours, as these bots do tend to change their IP addresses regularly.
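To illustrate what an inactivity interval means for the remembered session IDs, here is a small sketch in plain Java. It is a model under stated assumptions rather than the valve's real bookkeeping: the class and method names are hypothetical, time is passed in explicitly as seconds, and a mapping is simply forgotten once the crawler's IP address has been idle for longer than the interval.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of an idle timeout on the IP-to-session mapping.
class ExpiringCrawlerMap {
    private final int sessionInactiveIntervalSeconds;
    private final Map<String, String> sessionIdByIp = new HashMap<>();
    private final Map<String, Long> lastSeenByIp = new HashMap<>();

    ExpiringCrawlerMap(int sessionInactiveIntervalSeconds) {
        this.sessionInactiveIntervalSeconds = sessionInactiveIntervalSeconds;
    }

    // Notes the session ID a crawler IP used at the given time.
    void remember(String ip, String sessionId, long nowSeconds) {
        sessionIdByIp.put(ip, sessionId);
        lastSeenByIp.put(ip, nowSeconds);
    }

    // Returns the remembered session ID, or null if there is none
    // or the IP has been idle for longer than the interval.
    String lookup(String ip, long nowSeconds) {
        Long lastSeen = lastSeenByIp.get(ip);
        if (lastSeen == null) {
            return null;
        }
        if (nowSeconds - lastSeen > sessionInactiveIntervalSeconds) {
            sessionIdByIp.remove(ip);
            lastSeenByIp.remove(ip);
            return null;
        }
        lastSeenByIp.put(ip, nowSeconds); // activity resets the idle clock
        return sessionIdByIp.get(ip);
    }

    public static void main(String[] args) {
        ExpiringCrawlerMap map = new ExpiringCrawlerMap(60);
        map.remember("192.0.2.10", "S1", 0);
        System.out.println(map.lookup("192.0.2.10", 30));  // prints S1
        System.out.println(map.lookup("192.0.2.10", 200)); // prints null
    }
}
```

A short interval keeps stale mappings from piling up when a bot moves to a new IP address, at the cost of a slow crawler occasionally being handed a fresh session.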

The Result

For the issues.apache.org site, enabling this valve on the JIRA site alone brought the average number of concurrent sessions down from about 100,000 to about 5,000. Additionally, there was a significant drop in resource usage on the server, and it is now relatively simple to see from the Current Sessions page which web crawlers are currently active on the site and how many hits they are generating.

Special note: Although JIRA is only certified to run on Tomcat 5 and Tomcat 6, we actually run it on the latest Tomcat 7 release. Running JIRA on Tomcat 7 has not caused any issues which, as an aside, is a testament to how well Tomcat 7 and the Servlet 3.0 specification have been engineered for backwards compatibility.

Mark Thomas is a Senior Software Engineer for the SpringSource Division of VMware, Inc. (NYSE: VMW). Mark has been using and developing Tomcat for over six years. He first got involved in the development of Tomcat when he needed better control over the SSL configuration than was available at the time. After fixing that first bug, he started working his way through the remaining Tomcat bugs and is still going. Along the way Mark has become a Tomcat committer and PMC member, volunteered to be the Tomcat 4 & 7 release manager, created the Tomcat security pages, become a member of the ASF and joined the Apache Security Committee. He also helps maintain the ASF's Bugzilla instances. Mark has a MEng in Electronic and Electrical Engineering from the University of Birmingham, United Kingdom.

Comments

How do I run tomcat without any sessions?

We have a webapp that is exclusively for web service calls. There will never need to be any data retained beyond a single request (e.g., each request is completely stateless). So we really have no need to create any sessions at all for this webapp.

Is there a way to run tomcat (7.0.16) so that it doesn't waste any resources at all creating sessions?

--

Robin D. Wilson

Do we have anything similar

Do we have anything similar for apache server??
