Archive for the ‘apache httpd’ Category

A Java alternative to xsendfile for apache httpd (that works)

Wednesday, December 15th, 2010

X-Sendfile is a special, non-standard HTTP response header: when a backend application server returns it, the frontend webserver takes over and serves the file specified in the header. Quoting the mod_xsendfile module for apache on why this is useful (a short sketch of the mechanism follows the list):

  • Some applications require checking for special privileges.
  • Others have to lookup values first (e.g. from a DB) in order to correctly process a download request.
  • Or store values (download-counters come into mind).
  • etc.
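
For illustration, this is roughly what the backend side of the X-Sendfile approach looks like when the module is available: the application performs its checks, sets the header, writes no body, and the web server streams the file itself. A minimal sketch, assuming mod_xsendfile (or an equivalent) is installed and allowed to serve the path; the servlet name, path and content type are made-up examples:

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class XSendfileServe extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                                 throws ServletException, IOException {
        // ... application-level checks go here ...
        resp.setContentType("application/pdf");
        // The frontend web server intercepts this header and serves the
        // file itself; the servlet writes no response body.
        resp.setHeader("X-Sendfile", "/data/protected/report.pdf");
    }
}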

lighttpd and nginx already have this capability built in. In apache httpd though you need to install mod_xsendfile. If you cannot get it working (I couldn’t in a sensible timeframe), or if you are in an environment where you cannot install extra apache modules, then your only hope is serving the file via Java.

For access control I’ve seen (and also written) ProtectedFileServe servlets before, which check a condition and manually stream the file back to the caller. Serving the file like that can be error prone, and a better solution is to utilize what already exists in the web container, which in the case of Tomcat is the DefaultServlet.

The following example delegates a request of the form /serve?/resources/filename to the default servlet.

/**
 * Enforces an application authorization check before delegating to the
 * default file servlet which will serve the "protected" file
 * (found under /resources)
 *
 * Will require an apache httpd mod rewrite to convert normal requests:
 *   /resources/image1.png
 *   /resources/docs/doc1.pdf
 *
 * into:
 *   /serve?/resources/image1.png
 *   /serve?/resources/docs/doc1.pdf
 *
 */
public class ProtectedFileServe extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) 
                                 throws ServletException, IOException {
        final String query = req.getQueryString();
        if (query!=null && query.startsWith("/resources/") && isLoggedIn(req)) {
            req.getRequestDispatcher(query).forward(req, resp);
            return;
        }
        resp.sendError(HttpServletResponse.SC_UNAUTHORIZED);
    }
    
    /**
     * Determines whether the requested file should be served
     */
    private boolean isLoggedIn(HttpServletRequest request) {
        return ...;
    }
    
}
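
The isLoggedIn check is deliberately left application-specific. Purely as an illustration, a session-based version might look like the following; the "user" attribute name is made up, and it needs an extra import of javax.servlet.http.HttpSession:

    /**
     * Hypothetical example: consider the request authorized when a
     * "user" attribute was stored in the session at login time.
     */
    private boolean isLoggedIn(HttpServletRequest request) {
        final HttpSession session = request.getSession(false);
        return session != null && session.getAttribute("user") != null;
    }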

Map it in web.xml:

<servlet>
    <servlet-name>ProtectedFileServe</servlet-name>
    <servlet-class>com.example.ProtectedFileServe</servlet-class>
    <load-on-startup>1</load-on-startup>
</servlet>
<servlet-mapping>
    <servlet-name>ProtectedFileServe</servlet-name>
    <url-pattern>/serve</url-pattern>
</servlet-mapping>

Now a request to /serve?/resources/foo.jpg will serve the file /resources/foo.jpg via the default servlet only if the user is logged in.

An enhancement to the URL structure is to apply the following mod_rewrite rule in the apache configuration, which allows clean URLs such as /resources/foo.jpg to reach the servlet:

RewriteEngine on
RewriteCond %{REQUEST_URI} ^/resources/.*
RewriteRule (.*) /serve?$1 [PT,L]

Simple DoS protection with mod_security

Wednesday, July 22nd, 2009

ModSecurity™ is an open source, free web application firewall (WAF) Apache module. It provides protection from a range of attacks against web applications and allows for HTTP traffic monitoring and real-time analysis with little or no changes to existing infrastructure.

It can do many things for you, such as detecting XSS, SQL injection or file inclusion attacks.

A special use of mod_security is simple protection from DoS attacks. Suppose your apache or application logs reveal that some specific IP is requesting too many pages per second (e.g. 30 pages/sec from a single IP when your normal peak is 5 globally). In the best case scenario this results in a slight decrease in the performance of the site, which could be noticeable to other users. In the worst case scenario it could bring the whole site down (denial of service). The attack could of course be unintentional: a misconfigured crawler or a spam bot could be the source of the problem, but in any case you’d like to block such requests.

Here is a possible configuration for mod_security to prevent those simple DoS attacks with explanatory comments:

SecRuleEngine On

SecAuditEngine RelevantOnly
SecAuditLogType Serial
SecAuditLog logs/mod_security.log

# a folder where mod_security will store data variables
SecDataDir logs/mod_security-data

# ignore requests from localhost or some other IP
SecRule REMOTE_ADDR "^127\.0\.0\.1$" "phase:1,nolog,allow"

# for all non-static urls count requests per second per ip
# (increase var requests by one, expires in 1 second)
SecRule REQUEST_BASENAME "!(\.avi$|\.bmp$|\.css$|\.doc$|\.flv$|\.gif$|\
                            \.htm$|\.html$|\.ico$|\.jpg$|\.js$|\.mp3$|\
                            \.mpeg$|\.pdf$|\.png$|\.pps$|\.ppt$|\.swf$|\
                            \.txt$|\.wmv$|\.xls$|\.xml$|\.zip$)"\
                            "phase:1,nolog,pass,initcol:ip=%{REMOTE_ADDR},setvar:ip.requests=+1,expirevar:ip.requests=1"

# if this IP reached 5 requests within one second
# set var block to 1 (expires in 5 seconds) and increase var blocks by one (expires in an hour)
SecRule ip:requests "@eq 5" "phase:1,pass,nolog,setvar:ip.block=1,expirevar:ip.block=5,setvar:ip.blocks=+1,expirevar:ip.blocks=3600"

# if user was blocked 5 or more times (var blocks >= 5), log and return http 403
SecRule ip:blocks "@ge 5" "phase:1,deny,log,logdata:'req/sec: %{ip.requests}, blocks: %{ip.blocks}',status:403"

# if user is blocked (var block=1), log and return http 403
SecRule ip:block "@eq 1" "phase:1,deny,log,logdata:'req/sec: %{ip.requests}, blocks: %{ip.blocks}',status:403"

# 403 is some static page or message
ErrorDocument 403 "<center><h2>take it easy yo!"
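
To make the interplay of ip.requests, ip.block and ip.blocks easier to follow, here is a rough Java transcription of the same counting logic. It is purely illustrative (this is not how mod_security works internally) and every name in it is made up:

import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative only: a rough Java transcription of the per-IP counters in the
 * rules above, to clarify how ip.requests, ip.block and ip.blocks interact.
 */
public class DosCounters {

    private static class IpState {
        int requests;         // "ip.requests", expires 1 second after the last hit
        long requestsExpiry;
        long blockedUntil;    // "ip.block", set for 5 seconds
        int blocks;           // "ip.blocks", kept for an hour
        long blocksExpiry;
    }

    private final Map<String, IpState> perIp = new HashMap<>();

    /** Returns true when a (non-static) request from this IP should get a 403. */
    public synchronized boolean shouldBlock(String ip, long now) {
        IpState s = perIp.get(ip);
        if (s == null) {
            s = new IpState();
            perIp.put(ip, s);
        }

        // mimic expirevar: drop counters whose lifetime has passed
        if (now > s.requestsExpiry) s.requests = 0;
        if (now > s.blocksExpiry)   s.blocks = 0;

        // count this request; the counter only lives for 1 second
        s.requests++;
        s.requestsExpiry = now + 1_000;

        // 5 requests inside the window: block for 5 seconds and record the strike for an hour
        if (s.requests == 5) {
            s.blockedUntil = now + 5_000;
            s.blocks++;
            s.blocksExpiry = now + 3_600_000;
        }

        // deny repeat offenders (5 or more strikes) and anyone currently blocked
        return s.blocks >= 5 || now < s.blockedUntil;
    }
}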

If you experiment with this configuration in production, make sure you keep an eye on mod_security.log to verify that you are really blocking only the requests that you intend to.

Good luck!

mod_expires and Cache Killers

Sunday, May 3rd, 2009

Rule 3 of Steve Souders’ YSlow suggests that websites should add a far-future Expires header to their components. Such components are typically static files with extensions like .css, .js, .jpg, .png and .gif. This gives a huge boost in client side performance for users with a primed cache. In apache this is done via mod_expires and an example configuration would be:

ExpiresActive On
ExpiresByType image/x-icon "access plus 1 month"
ExpiresByType text/css "access plus 1 month"
ExpiresByType application/javascript "access plus 1 month"
ExpiresByType image/gif "access plus 1 month"
ExpiresByType image/jpeg "access plus 1 month"
ExpiresByType image/png "access plus 1 month"

All this works well until you need to update a cached static file. Users with a primed cache will either have to wait 1 month to get the new file, or explicitly invalidate their cache. Some people will even ask their users to do a hard refresh, but this obviously does not scale and is not very robust.

Since you cannot send a signal to browsers telling them to reload those files, all you can do is change the URL of the files (explicit invalidation). You could simply rename them all, but an easier way to achieve the same effect is to add a fake (unused, dummy) parameter at the end of the resource URL:

<img src="logo.jpg?2" />

The next logical step would be to automate this in the build system and have every production release carry a new cache killer token. It seems that many well-known sites do that already (a sketch of one way to wire such a token into a Java webapp follows the site examples below):

http://slashdot.org/

href="//s.fsdn.com/sd/idlecore-tidied.css?T_2_5_0_254a"
src="//s.fsdn.com/sd/all-minified.js?T_2_5_0_254a" 

http://stackoverflow.com/

href="/content/all.css?v=3184"
src="/content/js/master.js?v=3141"

http://digg.com/

@import "/css/189/global.css";
src="http://media.digg.com/js/loader/187/dialog|digg|shouts"

http://www.bbc.co.uk/

@import 'http://wwwimg.bbc.co.uk/home/release-29-7/style/homepage.min.css';
src="http://wwwimg.bbc.co.uk/home/release-29-7/script/glow.homepage.compressed.js"

http://www.guardian.co.uk/

href="http://static.guim.co.uk/static/73484/common/styles/wide/ie.css"
src="http://static.guim.co.uk/static/73484/common/scripts/gu.js"

What happens with images referenced from within css files? You could rewrite the css files automatically as part of your production build process with Ant.

<tstamp>
    <format property="cacheKill" pattern="yyyyMMddhhmm" locale="en,UK"/>
</tstamp>

<target name="rewrite-css">
    <replace dir="${build.web.dir}" value="css?${cacheKill}&quot;)"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>css&quot;)</replacetoken></replace>
    <replace dir="${build.web.dir}" value="png?${cacheKill}&quot;)"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>png&quot;)</replacetoken></replace>
    <replace dir="${build.web.dir}" value="gif?${cacheKill}&quot;)"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>gif&quot;)</replacetoken></replace>
    <replace dir="${build.web.dir}" value="jpg?${cacheKill}&quot;)"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>jpg&quot;)</replacetoken></replace>
    <replace dir="${build.web.dir}" value="css?${cacheKill}')"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>css')</replacetoken></replace>
    <replace dir="${build.web.dir}" value="png?${cacheKill}')"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>png')</replacetoken></replace>
    <replace dir="${build.web.dir}" value="gif?${cacheKill}')"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>gif')</replacetoken></replace>
    <replace dir="${build.web.dir}" value="jpg?${cacheKill}')"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>jpg')</replacetoken></replace>
    <replace dir="${build.web.dir}" value="css?${cacheKill})"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>css)</replacetoken></replace>
    <replace dir="${build.web.dir}" value="png?${cacheKill})"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>png)</replacetoken></replace>
    <replace dir="${build.web.dir}" value="gif?${cacheKill})"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>gif)</replacetoken></replace>
    <replace dir="${build.web.dir}" value="jpg?${cacheKill})"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>jpg)</replacetoken></replace>
</target>

This will take care of the following background image reference styles for css, png, gif and jpg files:

... background-image: url("images/ed-bg.gif");
... background-image: url('images/ed-bg.gif');
... background-image: url(images/ed-bg.gif);

and convert them to:

... background-image: url("images/ed-bg.gif?200905031126");
... background-image: url('images/ed-bg.gif?200905031126');
... background-image: url(images/ed-bg.gif?200905031126);

Good luck!

robots.txt control for host aliases via mod_rewrite

Saturday, February 21st, 2009

Suppose you have a website served under two different hostnames:

<VirtualHost *:80>
    ServerName www.example.com
    ServerAlias beta.example.com
    ....
</VirtualHost>

The content is the same, but you want to serve a different robots.txt file, possibly excluding the secondary host from indexing altogether.

It would be handy if we could simply say:

User-agent: *
Allow: http://www.example.com/

User-agent: *
Disallow: http://beta.example.com/

to allow all bots to crawl the primary host and disallow them from the secondary one, but this syntax is imaginary. Firstly there is no Allow keyword in the spec, and secondly the values must be relative paths, not full URLs.

The solution is to have 2 different robots.txt files:

robots-www.txt

User-agent: *
Disallow:

robots-beta.txt

User-agent: *
Disallow: /

and serve them via mod_rewrite like this:

<VirtualHost *:80>
    ServerName www.example.com
    ServerAlias beta.example.com
    ...
    RewriteEngine On
    RewriteCond %{HTTP_HOST} ^www\.example\.com$
    RewriteRule ^/robots\.txt$ /robots-www.txt [L]
    RewriteCond %{HTTP_HOST} ^beta\.example\.com$
    RewriteRule ^/robots\.txt$ /robots-beta.txt [L]
</VirtualHost>

Now http://www.example.com/robots.txt will silently serve robots-www.txt and http://beta.example.com/robots.txt will serve robots-beta.txt.

This is also handy during domain name migrations, while you wait for DNS changes to propagate around the globe, before you feel safe enough to completely shut down the secondary host and possibly set up 301 redirects to the primary.

Logging website subsections with Apache

Saturday, April 26th, 2008

This is a typical VirtualHost for example.com in Apache HTTP Server:

<VirtualHost xx.xxx.xx.xx:80>
    DocumentRoot /home/example.com/site
    ServerName example.com
    ErrorLog /home/example.com/logs/error.log
    CustomLog /home/example.com/logs/access.log "combined"
</VirtualHost>

This CustomLog directive will produce lines like the following (abridged here) in the logfile:

64.229.111.27 - - [08/Aug/2007:16:12:54 +0300] GET /foo/index.html HTTP/1.1
64.229.111.27 - - [08/Aug/2007:16:12:54 +0300] GET /bar/index.html HTTP/1.1
64.229.111.27 - - [08/Aug/2007:16:12:54 +0300] GET /bar/about.html HTTP/1.1

Suppose that you wanted to log requests for the /foo and /bar subsections in two separate files. That way you (or your client) can more easily analyze the web traffic of each subsite individually later on.

What you can do is log conditionally, using environment variables that you set with SetEnvIf based on the request:

SetEnvIf Request_URI "/foo.*" subsite_foo
SetEnvIf Request_URI "/bar.*" subsite_bar
CustomLog /home/example.com/logs/access.foo.log combined env=subsite_foo
CustomLog /home/example.com/logs/access.bar.log combined env=subsite_bar

This works nicely, but not when the URLs for your individual subsites look like this:

http://example.com/?site=foo&page=5

http://example.com/?site=bar&page=8

That’s because the Request_URI does not contain the query string (whatever comes after the ? character).

The solution comes with a little bit of RewriteCond magic, which lets us set an environment variable based on %{QUERY_STRING}:

RewriteEngine on
RewriteCond %{QUERY_STRING} .*site=foo.*
RewriteRule (.*) $1 [E=subsite_foo:1]
RewriteCond %{QUERY_STRING} .*site=bar.*
RewriteRule (.*) $1 [E=subsite_bar:1]
CustomLog /home/example.com/logs/access.foo.log combined env=subsite_foo
CustomLog /home/example.com/logs/access.bar.log combined env=subsite_bar

Happy logging…