Wednesday, July 1, 2009

A quick post about web application encoding schemes

Before we get into tool discussions, let's talk a little bit about character encoding schemes. You may remember from my last post that, as far as input into a web application goes, a developer must assume that all input is malicious and harden the defenses to reject known bad content. So you, as a developer, put together a pretty good regular expression and pass all of your input through it. As long as the input is human-readable character data, you should be OK, right? Wrong. Attackers can manipulate the character encoding scheme used by an application to cause behavior that the developers did not intend.

Let's look at the common character encodings:

URL Encoding

According to The Web Application Hacker's Handbook, URLs are permitted to contain only the printable characters in the US-ASCII character set. An encoding scheme for URLs was therefore created to safely transmit any problematic characters, including those in the extended ASCII character set. For example, the ? and & characters in a URL have special meanings related to request parameters. If you want to send these characters as data, you need to pass their encoded equivalents.

Here are some common characters in URL encoding:

%3d - =
%20 - space
%0a - new line
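
To make the table above concrete, here is a minimal sketch using Python's standard urllib.parse module. The helper name encode_param is my own invention for illustration, not something from the handbook. Note that Python emits uppercase hex digits (%3D rather than %3d); both forms decode to the same character.

```python
from urllib.parse import quote, unquote

def encode_param(value: str) -> str:
    # safe="" tells quote() to encode reserved characters such as
    # ?, &, = and / as well, instead of leaving them untouched.
    return quote(value, safe="")

# Data that contains characters with special meaning in a URL.
original = "a=b&c?d e\n"
encoded = encode_param(original)
print(encoded)

# Decoding restores the original data exactly.
print(repr(unquote(encoded)))
```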

Unicode Encoding

Unicode is a character encoding standard designed to support writing systems from all around the world, so it can represent unusual characters in web applications. 16-bit Unicode encoding and UTF-8 are two common Unicode encodings.

For example, when a UTF-8 encoded character is transmitted in a URL, each of its bytes is written in hexadecimal and preceded by a %.

%c2%a9 - copyright
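
As a quick check of the copyright example, this short sketch (again with Python's standard library, and with variable names of my own choosing) shows the two UTF-8 bytes behind the character and their percent-encoded form.

```python
from urllib.parse import quote, unquote

copyright_sign = "\u00a9"  # the copyright character

# UTF-8 represents this character as two bytes: 0xC2 0xA9.
utf8_bytes = copyright_sign.encode("utf-8")
print(utf8_bytes)

# quote() encodes text as UTF-8 by default and writes each byte as %XX.
percent_encoded = quote(copyright_sign, safe="")
print(percent_encoded)

# Decoding accepts lowercase hex digits as well.
print(unquote("%c2%a9") == copyright_sign)
```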

When attacking web applications, Unicode encoding can sometimes be used to bypass input validation mechanisms. If an input filter blocks certain expressions, but the component that processes the input immediately after the filter understands Unicode encoding, then it may be possible to slip an attack past the filter.
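
The sketch below illustrates the general idea with percent-encoded input. Everything in it is hypothetical: naive_filter, backend_component, safer_filter and the blocklist are stand-ins I made up, not code from a real application. The point is only that a filter looking at the raw text and a later component that decodes it can disagree about what the input says.

```python
from urllib.parse import unquote

# Hypothetical blocklist of "known bad" expressions.
BLOCKED = ("<script",)

def naive_filter(raw: str) -> bool:
    """Return True if the input looks safe. Checks the undecoded text only."""
    lowered = raw.lower()
    return not any(bad in lowered for bad in BLOCKED)

def backend_component(raw: str) -> str:
    """Stand-in for a later component that decodes percent-encoded UTF-8."""
    return unquote(raw)

def safer_filter(raw: str) -> bool:
    """Decode first, then check, so the filter sees what the backend sees."""
    return naive_filter(unquote(raw))

payload = "%3Cscript%3Ealert(1)%3C/script%3E"

print(naive_filter(payload))       # the raw text contains no "<script"
print(backend_component(payload))  # but the decoded text does
print(safer_filter(payload))       # checking after decoding catches it
```

Decoding before filtering is only one possible mitigation, and it assumes the filter and the backend decode the same number of times; a backend that decodes twice could still be fooled by double-encoded input.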

HTML Encoding

This scheme is used to display problematic characters in HTML pages. Some characters have special meanings that are used to define the structure of the document rather than content.

For example, to use these characters as part of the document content, you must HTML encode them:

&quot; - "
&apos; - '
&amp; - &

On top of this, any character can be HTML encoded using its ASCII code in decimal form:

&#34; - "
&#39; - '

HTML encoding matters mainly when checking for XSS vulnerabilities in web applications. If an application does not HTML encode the user-supplied data in its responses, then it could be vulnerable to XSS attacks.
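
Here is a minimal sketch of HTML encoding a response with Python's standard html module. The function render_comment is a made-up example, not part of any framework. Python writes the single quote as a hexadecimal entity rather than the decimal form, but a browser treats both the same way.

```python
import html

def render_comment(user_text: str) -> str:
    # quote=True also encodes " and ', which matters when the text
    # ends up inside an HTML attribute value.
    return "<p>" + html.escape(user_text, quote=True) + "</p>"

# Markup supplied by a user is displayed as text instead of being run.
print(render_comment('<script>alert("x")</script>'))

# html.unescape() understands both named and decimal entities.
print(html.unescape("&quot; &#34; &#39; &amp;"))
```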

Base64 Encoding

This encoding is used primarily to represent binary information as printable ASCII characters so that it can be transferred safely.
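
A short round trip with Python's standard base64 module shows the idea. The sample bytes are arbitrary values I picked because they are not printable on their own.

```python
import base64

# Arbitrary binary data, including bytes that are not printable ASCII.
raw = bytes([0x00, 0xFF, 0x10, 0x80])

# b64encode() returns bytes that contain only printable ASCII characters.
encoded = base64.b64encode(raw).decode("ascii")
print(encoded)

# Decoding gives back exactly the original bytes.
decoded = base64.b64decode(encoded)
print(decoded == raw)
```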

1 comment:

  1. Ha ha. It looks like this post was 'cleansed' to remove some of the encoding examples above. Maybe to prevent a XSS attack?
