For organizations with large, publicly searchable websites, such as ecommerce companies with large product catalogues or companies with active online communities, web crawlers or bots can trigger the creation of many thousands of sessions as they crawl the site. Because these bots normally crawl without relying on cookies or session IDs, they can create a new session for each page crawled which, depending on the size of the site, may result in significant memory consumption. New in Apache Tomcat 7, the Crawler Session Manager Valve ensures that crawlers are associated with a single session - just like normal users - regardless of whether they provide a session token with their requests.
One of the roles I play in the Apache Tomcat project is managing the issues.apache.org servers, which run our Apache issue trackers: two instances of Bugzilla and one instance of JIRA. Not surprisingly, JIRA runs on Tomcat. A few months ago, while looking at the JIRA management interface, I noticed that we were seeing around 100,000 concurrent sessions. Given that there are only 60,000 registered users and fewer than 5,000 active users in any month, this number appeared extremely inflated.
After a bit of investigation, the access logs revealed that many of the web crawlers (googlebot, bingbot, etc.) were creating a new session for every request as they crawled the JIRA site. For our JIRA instance, this meant that about 95% of the open sessions were left over from a bot making a single request. For instance, a bot requesting 100 pages would open 100 sessions. Each of these sessions would then hang around in memory for about four hours, chewing up tremendous memory resources on the server.
The goal of the Crawler Session Manager Valve is to ensure that when that same crawler requests those 100 pages, it only results in a single session. To do this, Tomcat applies a regular expression to the User-Agent HTTP request header to identify known crawlers (by default it checks for .*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*), and it keeps a note of each IP address such a request came from, along with the session ID of that crawler's most recent request.
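The default pattern can be exercised directly with java.util.regex. The sketch below (class and method names are my own, not part of Tomcat) shows how that expression matches a Googlebot User-Agent string while leaving an ordinary browser untouched:

```java
import java.util.regex.Pattern;

public class CrawlerCheck {
    // The default crawlerUserAgents expression from the valve documentation
    private static final Pattern CRAWLERS =
            Pattern.compile(".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*");

    public static boolean isCrawler(String userAgent) {
        return userAgent != null && CRAWLERS.matcher(userAgent).matches();
    }

    public static void main(String[] args) {
        // "Googlebot" contains "bot", so it matches .*[bB]ot.*
        System.out.println(isCrawler(
                "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"));
        // A typical desktop browser User-Agent matches none of the alternatives
        System.out.println(isCrawler(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"));
    }
}
```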
When a crawler first accesses the site, a new session is created as part of that first request. When the crawler requests a second page, however, the Crawler Session Manager Valve recognizes it from its User-Agent header, matches it to the recorded IP address, and inserts the previous session ID into the request. Thus, the crawler only ever opens a single session.
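The bookkeeping behind this can be pictured as a simple map from client IP to session ID. The following is only a minimal sketch of the idea with hypothetical names, not the valve's actual implementation (which, among other things, also expires entries after sessionInactiveInterval):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the valve's IP-to-session bookkeeping
public class CrawlerSessionSketch {
    // Maps a crawler's client IP to the session ID it was first given
    private final Map<String, String> clientIpSessionId = new ConcurrentHashMap<>();

    /**
     * Called for each request from a recognized crawler. Returns the session
     * ID the request should use: the one previously recorded for this IP if
     * the crawler has been seen before, otherwise the freshly created one,
     * which is then recorded for subsequent requests.
     */
    public String sessionIdFor(String clientIp, String freshSessionId) {
        String existing = clientIpSessionId.putIfAbsent(clientIp, freshSessionId);
        return existing != null ? existing : freshSessionId;
    }
}
```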
Although it ships with Tomcat 7, the Crawler Session Manager Valve is not enabled by default. To turn on the valve, see the valve documentation at http://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve.
There are two main options for configuring this valve. The first is the crawlerUserAgents property, which allows you to specify which bots to look for by their User-Agent header. The second is sessionInactiveInterval, which specifies how long Tomcat should hold on to the assigned session ID. It is not recommended to hold onto the session ID for more than a couple of hours, as these bots do tend to change their IP addresses regularly.
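Enabling the valve amounts to adding one element to server.xml. The fragment below uses the attribute names from the valve documentation; the interval value shown (in seconds) is an illustrative choice, not the default:

```xml
<!-- Inside the <Host> element of conf/server.xml -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*"
       sessionInactiveInterval="3600" />
```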
For the issues.apache.org site, implementing this valve on the JIRA site alone brought the average number of concurrent sessions down from 100,000 to about 5,000. There was also a significant drop in resource usage on the server, and it is now relatively simple to monitor, from the Current Sessions page, which web crawlers are currently active on the site and how many hits they are generating.
Special note: Although JIRA is only certified to run on Tomcat 5 and Tomcat 6, we actually run it on the latest Tomcat 7 release. Running JIRA on Tomcat 7 has not caused any issues, which, as an aside, is a testament to how well Tomcat 7 and the Servlet 3.0 specification have been engineered for backwards compatibility.