Retaining browser scrollTop between page refreshes

August 18th, 2009

Sometimes when you develop web applications with CRUD pages and other back-end functionality, you need to retain the vertical scrollbar state between page loads. This gives the user a smoother experience, especially when your application doesn’t do any AJAX but is based on good old full HTML page request-responses.

We’ll be using jQuery and the cookie plugin:

$(function(){
  // if the current document height is at least 70% of the previous one
  if ($(document.body).height()>$.cookie('h2')*0.7)
    // retain previous scroll
    $('html,body').scrollTop($.cookie('h1'));
});
$(window).unload(function(){
  // store current scroll
  $.cookie('h1', $(window).scrollTop());
  // store current document height
  $.cookie('h2', $(document.body).height());
});

We could skip the h2 cookie and the if statement entirely, but the height check is an automatic way to prevent scrolling to the very bottom of a short page when coming from a long one. Such a use case is common when jumping from the “long list of items” page to the “edit an item” page.
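
For reference, the minimal version without the height check would simply be:

$(function(){
  // retain previous scroll
  $('html,body').scrollTop($.cookie('h1'));
});
$(window).unload(function(){
  // store current scroll
  $.cookie('h1', $(window).scrollTop());
});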

Good luck

clearing (or marking) MySQL slow query log

July 28th, 2009

The MySQL slow query log is a very useful feature which all applications in production should have enabled. It logs every query that takes longer to execute than the number of seconds specified by the long_query_time server variable.
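
If it’s not already enabled, turning it on takes a couple of lines in the [mysqld] section of my.cnf. A sketch (the path is just an example, and MySQL 5.1+ renamed the option to slow_query_log / slow_query_log_file):

[mysqld]
# log queries slower than 2 seconds
long_query_time = 2
log-slow-queries = /var/log/mysql/mysql-slow.log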

Sometimes you need to clear this log without restarting mysqld. You could simply delete the file (on Linux), but mysqld would keep writing to the now-unlinked file, so nothing would ever show up in a newly created log file.

The best way to purge the log in Linux without having to restart the server is:

cat /dev/null > /path/to/logfile

Another way is to erase or rename the logfile and then run the following statement in mysql:

flush logs;

references:
http://bugs.mysql.com/bug.php?id=3337
http://lists-archives.org/mysql/26837-purge-slow-query-log.html

When you want to do that on Windows, though, things are different.

I haven’t yet found a way to safely remove the log file while mysqld is running. What I’m left with is marking the current position in the slow query log, so I can look it up later and examine the log from that point onwards. This can be done by executing a slow query (slow enough to be logged), and the safest option looks like this:

select "prod-25", sleep(4);

I can now quickly navigate to “prod-25” in the slow query log, which, by the way, is a marker for the 25th production release of a system I’m tuning.

The deployment process could automatically execute such a marker query (probably via Ant) in order to keep all the historic slow-log data grouped by release for future analysis.
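
A sketch of what that could look like with Ant’s sql task (the connection details and the hardcoded release label are just placeholders):

<target name="mark-slow-log">
    <!-- connection details are placeholders; in practice the release label
         would come from a build property instead of being hardcoded -->
    <sql driver="com.mysql.jdbc.Driver"
         url="jdbc:mysql://localhost/mydb"
         userid="deploy"
         password="secret"
         classpath="lib/mysql-connector-java.jar">
        select "prod-25", sleep(4);
    </sql>
</target>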

Simple DoS protection with mod_security

July 22nd, 2009

ModSecurity™ is an open source, free web application firewall (WAF) Apache module. It provides protection from a range of attacks against web applications and allows for HTTP traffic monitoring and real-time analysis with little or no changes to existing infrastructure.

It can do many things for you, such as detecting XSS, SQL injection or file inclusion attacks.

A special use of mod_security is simple protection from DoS attacks. Suppose your Apache or application logs reveal that some specific IP is requesting too many pages per second (e.g. 30 pages/sec from a single IP when your normal global peak is 5). In the best case this results in a slight decrease in the site’s performance, noticeable to other users. In the worst case it brings the whole site down (denial of service). The attack could of course be unintentional; a misconfigured crawler or a spam bot could be the source of the problem, but in any case you’d like to block such requests.

Here is a possible configuration for mod_security to prevent those simple DoS attacks with explanatory comments:

SecRuleEngine On

SecAuditEngine RelevantOnly
SecAuditLogType Serial
SecAuditLog logs/mod_security.log

# a folder where mod_security will store data variables
SecDataDir logs/mod_security-data

# ignore requests from localhost or some other IP
SecRule REMOTE_ADDR "^127\.0\.0\.1$" "phase:1,nolog,allow"

# for all non static urls count requests per second per ip
# (increase var requests by one, expires in 1 second)
SecRule REQUEST_BASENAME "!(\.avi$|\.bmp$|\.css$|\.doc$|\.flv$|\.gif$|\
                            \.htm$|\.html$|\.ico$|\.jpg$|\.js$|\.mp3$|\
                            \.mpeg$|\.pdf$|\.png$|\.pps$|\.ppt$|\.swf$|\
                            \.txt$|\.wmv$|\.xls$|\.xml$|\.zip$)"\
                            "phase:1,nolog,pass,initcol:ip=%{REMOTE_ADDR},setvar:ip.requests=+1,expirevar:ip.requests=1"

# if there have been 5 requests within the last second from this IP
# set var block to 1 (expires in 5 seconds) and increase var blocks by one (expires in an hour)
SecRule ip:requests "@eq 5" "phase:1,pass,nolog,setvar:ip.block=1,expirevar:ip.block=5,setvar:ip.blocks=+1,expirevar:ip.blocks=3600"

# if the user has been blocked 5 or more times (var blocks >= 5), log and return HTTP 403
SecRule ip:blocks "@ge 5" "phase:1,deny,log,logdata:'req/sec: %{ip.requests}, blocks: %{ip.blocks}',status:403"

# if the user is currently blocked (var block = 1), log and return HTTP 403
SecRule ip:block "@eq 1" "phase:1,deny,log,logdata:'req/sec: %{ip.requests}, blocks: %{ip.blocks}',status:403"

# 403 is some static page or message
ErrorDocument 403 "<center><h2>take it easy yo!"

If you experiment with this configuration in production, make sure you keep an eye on mod_security.log to validate that you are only blocking the requests you intend to.
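
Something as simple as tailing the audit log while you test will do:

tail -f logs/mod_security.log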

Good luck!

fixing javax.mail.MessagingException: Could not connect to SMTP host

July 22nd, 2009

You’ve done everything right. You are using the JavaMail API with the correct settings and it still fails to connect to the SMTP host to dispatch the email. You are on a Windows machine and the exception looks like this:

javax.mail.MessagingException: Could not connect to SMTP host: your.smtp.host, port: 25;
  nested exception is:
	java.net.SocketException: Software caused connection abort: connect
	at com.sun.mail.smtp.SMTPTransport.openServer(SMTPTransport.java:1545)
	at com.sun.mail.smtp.SMTPTransport.protocolConnect(SMTPTransport.java:453)
	at javax.mail.Service.connect(Service.java:291)
	at javax.mail.Service.connect(Service.java:172)
	at javax.mail.Service.connect(Service.java:121)
	at javax.mail.Transport.send0(Transport.java:190)
	at javax.mail.Transport.send(Transport.java:120)
        ...
Caused by: java.net.SocketException: Software caused connection abort: connect
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
	at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
	at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
	at java.net.Socket.connect(Socket.java:519)
	at java.net.Socket.connect(Socket.java:469)
	at com.sun.mail.util.SocketFetcher.createSocket(SocketFetcher.java:267)
	at com.sun.mail.util.SocketFetcher.getSocket(SocketFetcher.java:227)
	at com.sun.mail.smtp.SMTPTransport.openServer(SMTPTransport.java:1511)
	... 40 more

One possibility is that something is blocking JavaMail from connecting to the local or remote SMTP host, and that something can be an anti-virus.
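
A quick sanity check, independent of JavaMail, is to open a plain socket to the SMTP port from the same machine; if this fails too, the problem is environmental (anti-virus, firewall, network) rather than your mail code. A minimal sketch, with placeholder host and port:

import java.net.InetSocketAddress;
import java.net.Socket;

public class SmtpConnectTest {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket();
        // placeholder host; use the same SMTP host and port as your JavaMail session
        socket.connect(new InetSocketAddress("your.smtp.host", 25), 5000);
        System.out.println("plain TCP connection to port 25 succeeded");
        socket.close();
    }
}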

via: http://forums.sun.com/thread.jspa?threadID=590866

P.S. Why a sysadmin would want a resident-shield anti-virus on a production box serving web content via Tomcat remains a mystery to me.

mod_expires and Cache Killers

May 3rd, 2009

Rule 3 of Steve Souders’ YSlow suggests that websites should add a far future Expires header to their components. Components worth caching are typically static files such as those with extensions .css, .js, .jpg, .png, .gif etc. This gives a huge boost in client-side performance for users with a primed cache. In Apache this is done via mod_expires, and an example configuration would be:

ExpiresActive On
ExpiresByType image/x-icon "access plus 1 month"
ExpiresByType text/css "access plus 1 month"
ExpiresByType application/javascript "access plus 1 month"
ExpiresByType image/gif "access plus 1 month"
ExpiresByType image/jpeg "access plus 1 month"
ExpiresByType image/png "access plus 1 month"

All this works well until you need to update a cached static file. Users with a primed cache will either have to wait 1 month to get the new file or explicitly invalidate their cache. Some people will even ask their users to do a hard refresh, but this obviously does not scale and is not very robust.

Since you cannot send an automatic signal to browsers to reload those files, all you can do is change the URLs of the files (explicit invalidation). You could simply rename all those files, but an easier way to achieve the same effect is adding a fake (unused, dummy) parameter at the end of the resource URL:

<img src="logo.jpg?2" />

The next logical step would be to automate this in the build system and have every production release carry new cache killer tokens. It seems that many well-known sites do that already:

http://slashdot.org/

href="//s.fsdn.com/sd/idlecore-tidied.css?T_2_5_0_254a"
src="//s.fsdn.com/sd/all-minified.js?T_2_5_0_254a" 

http://stackoverflow.com/

href="/content/all.css?v=3184"
src="/content/js/master.js?v=3141"

http://digg.com/

@import "/css/189/global.css";
src="http://media.digg.com/js/loader/187/dialog|digg|shouts"

http://www.bbc.co.uk/

@import 'http://wwwimg.bbc.co.uk/home/release-29-7/style/homepage.min.css';
src="http://wwwimg.bbc.co.uk/home/release-29-7/script/glow.homepage.compressed.js"

http://www.guardian.co.uk/

href="http://static.guim.co.uk/static/73484/common/styles/wide/ie.css"
src="http://static.guim.co.uk/static/73484/common/scripts/gu.js"
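
For your own templates, one way to wire such a token in at build time is a plain token replacement, here sketched with Ant’s copy/filterset and a hypothetical @cacheKill@ placeholder in the markup (src.web.dir is an assumed source folder, and ${cacheKill} is the timestamp property defined further down):

<!-- the markup would contain e.g. <img src="logo.jpg?@cacheKill@"/> -->
<target name="stamp-markup">
    <copy todir="${build.web.dir}" overwrite="true">
        <fileset dir="${src.web.dir}" includes="**/*.html,**/*.jsp"/>
        <filterset>
            <filter token="cacheKill" value="${cacheKill}"/>
        </filterset>
    </copy>
</target>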

What happens with images referenced from within css files? You could rewrite the css files automatically as part of your production build process with Ant.

<tstamp>
    <format property="cacheKill" pattern="yyyyMMddhhmm" locale="en,UK"/>
</tstamp>

<target name="rewrite-css">
    <replace dir="${build.web.dir}" value="css?${cacheKill}&quot;)"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>css&quot;)</replacetoken></replace>
    <replace dir="${build.web.dir}" value="png?${cacheKill}&quot;)"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>png&quot;)</replacetoken></replace>
    <replace dir="${build.web.dir}" value="gif?${cacheKill}&quot;)"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>gif&quot;)</replacetoken></replace>
    <replace dir="${build.web.dir}" value="jpg?${cacheKill}&quot;)"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>jpg&quot;)</replacetoken></replace>
    <replace dir="${build.web.dir}" value="css?${cacheKill}')"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>css')</replacetoken></replace>
    <replace dir="${build.web.dir}" value="png?${cacheKill}')"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>png')</replacetoken></replace>
    <replace dir="${build.web.dir}" value="gif?${cacheKill}')"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>gif')</replacetoken></replace>
    <replace dir="${build.web.dir}" value="jpg?${cacheKill}')"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>jpg')</replacetoken></replace>
    <replace dir="${build.web.dir}" value="css?${cacheKill})"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>css)</replacetoken></replace>
    <replace dir="${build.web.dir}" value="png?${cacheKill})"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>png)</replacetoken></replace>
    <replace dir="${build.web.dir}" value="gif?${cacheKill})"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>gif)</replacetoken></replace>
    <replace dir="${build.web.dir}" value="jpg?${cacheKill})"><include name="css/**/*.css"/><include name="scripts/**/*.css"/><replacetoken>jpg)</replacetoken></replace>
</target>

This will take care of the following background image reference styles for css, png, gif and jpg files:

... background-image: url("images/ed-bg.gif");
... background-image: url('images/ed-bg.gif');
... background-image: url(images/ed-bg.gif);

and convert them to:

... background-image: url("images/ed-bg.gif?200905031126");
... background-image: url('images/ed-bg.gif?200905031126');
... background-image: url(images/ed-bg.gif?200905031126);

Good luck!

A better SMTPAppender

May 2nd, 2009

SMTPAppender for log4j is an appender which sends emails via an SMTP server. It’s very useful for applications released in production, where you definitely need to know about all application errors logged. Of course every caring developer should look at the server logs every now and then, but if you’ve got hundreds of applications, that becomes a full-time job in itself.

Sometimes a fresh release of a high-traffic website may produce hundreds or thousands of ERROR level log events. Many times it’s something minor being logged deep inside your code, but until the bug is fixed and a new release is deployed, your inbox and the mail server may suffer heavily.

What follows is an extension of SMTPAppender which limits the number of emails sent in a specified period of time. It features sensible defaults which can, of course, be configured externally via the log4j configuration file.

package com.cherouvim;

import org.apache.log4j.Logger;
import org.apache.log4j.net.SMTPAppender;

public class LimitedSMTPAppender extends SMTPAppender {

    private int limit = 10;           // max at 10 mails ...
    private int cycleSeconds = 3600;  // ... per hour

    public void setLimit(int limit) {
        this.limit = limit;
    }

    public void setCycleSeconds(int cycleSeconds) {
        this.cycleSeconds = cycleSeconds;
    }

    private int lastVisited;
    private long lastCycle;

    protected boolean checkEntryConditions() {
        final long now = System.currentTimeMillis();
        final long thisCycle =  now - (now % (1000L*cycleSeconds));
        if (lastCycle!=thisCycle) {
            lastCycle = thisCycle;
            lastVisited = 0;
        }
        lastVisited++;
        return super.checkEntryConditions() && lastVisited<=limit;
    }

}

The configuration would look something like this:

log4j.appender.???=com.cherouvim.LimitedSMTPAppender
log4j.appender.???.limit=3
log4j.appender.???.cycleSeconds=60
log4j.appender.???.BufferSize=25
log4j.appender.???.SMTPHost=${mail.smtp.host}
log4j.appender.???.From=${mail-sender}
log4j.appender.???.To=${sysadmin.email}
log4j.appender.???.Subject=An error occurred
log4j.appender.???.layout=org.apache.log4j.PatternLayout
log4j.appender.???.layout.ConversionPattern=%d{ISO8601} %-5p (%F:%L) - %m%n
log4j.appender.???.threshold=ERROR

The above configuration limits mail dispatch to 3 emails per minute. Any further errors within that minute will not be emailed. The limit and cycleSeconds lines can be omitted, in which case the defaults apply.
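
The ??? stands for whatever name you give the appender. If, for example, you named it mail, you would attach it to the root logger with something like:

log4j.rootLogger=INFO, mail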

Happy logging!

robots.txt control for host aliases via mod_rewrite

February 21st, 2009

Suppose you have a website launched at two different hosts.

<VirtualHost *:80>
    ServerName www.example.com
    ServerAlias beta.example.com
    ....
</VirtualHost>

The content is the same, but you want to serve a different robots.txt file, possibly excluding the secondary host from indexing altogether.

It would be handy if we could simply say:

User-agent: *
Allow: http://www.example.com/

User-agent: *
Disallow: http://beta.example.com/

to allow all bots to crawl the primary host and disallow them from the secondary one, but this syntax is imaginary. Firstly, there is no Allow keyword in the original spec, and secondly, the paths must be relative.

The solution is to have 2 different robots.txt files:

robots-www.txt

User-agent: *
Disallow:

robots-beta.txt

User-agent: *
Disallow: /

and serve them via mod_rewrite like this:

<VirtualHost *:80>
    ServerName www.example.com
    ServerAlias beta.example.com
    ...
    RewriteEngine On
    RewriteCond %{HTTP_HOST} ^www\.example\.com$
    RewriteRule ^/robots.txt$ /robots-www.txt [L]
    RewriteCond %{HTTP_HOST} ^beta\.example\.com$
    RewriteRule ^/robots.txt$ /robots-beta.txt [L]
</VirtualHost>

Now http://www.example.com/robots.txt will silently serve robots-www.txt and http://beta.example.com/robots.txt will serve robots-beta.txt.

This is also handy during domain name migrations, where you wait for DNS changes to propagate around the globe before you feel safe completely shutting down the secondary host and possibly 301-redirecting it to the primary.
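
When that day comes, the 301 itself is a small mod_rewrite addition; a sketch for the same virtual host:

<VirtualHost *:80>
    ServerName www.example.com
    ServerAlias beta.example.com
    ...
    RewriteEngine On
    # permanently redirect anything arriving at the beta host to the primary
    RewriteCond %{HTTP_HOST} ^beta\.example\.com$
    RewriteRule ^(.*)$ http://www.example.com$1 [R=301,L]
</VirtualHost>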

fix hibernate+ehcache: miss for sql

February 21st, 2009

If you are using an entity as a named parameter in a Hibernate Query or Criteria which is cacheable by ehcache, then that entity needs to implement hashCode and equals using a business key. Otherwise the Query or Criteria may always “look different” to ehcache, resulting in a constant cache miss.

DEBUG (MemoryStore.java:138) - query.FooBarCache: query.FooBarMemoryStore miss for sql: /* criteria query */ select this_.id as y0_ from foobars this_ where this_.state=?; parameters: LIVE; max rows: 1; transformer: org.hibernate.transform.PassThroughResultTransformer@294633f0
DEBUG (Cache.java:808) - query.FooBar cache - Miss
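
A sketch of what that means in practice, assuming a hypothetical State entity whose business key is its code column (the one holding values like LIVE):

public class State {

    private Long id;      // surrogate key: not part of equals/hashCode
    private String code;  // business key, e.g. "LIVE"

    // getters and setters omitted

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof State)) return false;
        return code.equals(((State) o).code);
    }

    @Override
    public int hashCode() {
        return code.hashCode();
    }

}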

My favourite string for testing web applications

February 16th, 2009

Weird title, huh?

When creating templates, pages and action responses for your web application, you really need to take HTML escaping into consideration. Sometimes the use cases of the system are so many that you may omit HTML escaping for some piece of dynamic or user-entered text.

HTML escaping means converting <foo>bar & " to &lt;foo&gt;bar &amp; &quot; in the HTML source.

One of the reasons for HTML escaping is to avoid XSS attacks or simply to make your site valid.

Reasons for making your HTML output valid include:

  1. It’s the “right thing to do”
  2. Does not force the browser into error-recovery parsing
  3. Allows you to manually check (via CTRL+SHIFT+A on the Web Developer toolbar) for real HTML output errors
  4. Saves you from CSS rendering issues due to HTML anomalies
  5. Ensures your content is easily parsable by third-party agents (crawlers, scrapers, XSLT transformations etc.)

So my favourite string is <script'"<i>
You can alter your development database content using statements like these:

update articles set title=concat("<script'\"<i>", title);
update users set firstname=concat("<script'\"<i>", firstname), lastname=concat("<script'\"<i>", lastname);
update categories set title=concat("<script'\"<i>", title);
...

If your site breaks or renders strangely after this database content change, then you have an escaping problem somewhere. You can also check for HTML validity with CTRL+SHIFT+A on the Web Developer toolbar and quickly spot the areas where you missed HTML escaping.

You could even automate this whole process by having a tool (JTidy?) verify that all your pages and use cases produce valid HTML. Indirectly, you would then be testing for insecure (in XSS terms) parts of the application.

HTML escaping in JSTL
HTML escaping in freemarker
HTML escaping in velocity
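
In JSTL, for example, escaping comes for free as long as you output values through c:out, whose escapeXml attribute defaults to true (the article bean below is just an illustration):

<%@ taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>

<%-- escaped: <script'"<i> becomes &lt;script&#039;&#034;&lt;i&gt; --%>
<c:out value="${article.title}"/>

<%-- NOT escaped --%>
${article.title}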

Duplicate content test and URL canonicalization

February 15th, 2009

Days ago I uploaded the following script on my server:

<?php

  if ($_SERVER["QUERY_STRING"]=='foo&bar') {
    echo "index test one";
  }

  if ($_SERVER["QUERY_STRING"]=='bar&foo') {
    echo "bar and foo";
  }

  if ($_SERVER["QUERY_STRING"]=='bar&foo&test') {
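    // on purpose the same output as ?bar&foo: a different URL serving duplicate content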
    echo "bar and foo";
  }

?>

I then published 3 links on my site’s index page so Google could follow them:

http://cherouvim.com/foo.php?foo&bar

http://cherouvim.com/foo.php?bar&foo

http://cherouvim.com/foo.php?bar&foo&test

Days later I checked the results of the Google query site:cherouvim.com/foo: the first and third results were the same (duplicate content), yet Google had indexed them both. This is a common SEO problem in dynamic web sites where many different URLs can link to the same page (paginators, out-of-date URLs, archive pages etc.) or where you want to do URL referrer tracking.

Google has recently published a way of overcoming this problem: you can now specify which is the real (or primary) URL of the page. E.g.:

<link rel="canonical" href="/foo.php?foo&bar" />

So, as SEOmoz said, this definitely is The Most Important Advancement in SEO Practices Since Sitemaps.