Validate HTML input with Linq2XML

2008, Nov 17

Suppose you have a very simple page where users can add comments to an issue. Users can enter plain text, and they can also use the HTML tag <b> to render some text in bold. In the example code you can see a very simple implementation (Default.aspx). It uses an XML file for back-end storage (so you can run the example without a database), and in Default.aspx all the text entered by the user is stored in a CDATA section of the XML storage file. When the comments are rendered on the page we simply output all the content.

The result looks good but has some problems. First of all, users can enter any HTML tag, such as <i>. Moreover, if some hacker enters the text you were <script>alert('hacked');</script> into the textbox, every user who reads the page will execute that script. This is a simple example of a cross-site scripting attack.

Here is the content of the storage file after such a comment has been saved:

<?xml version="1.0" encoding="utf-8"?>
<Comments>
  <Comment>
    <Author>Hacker</Author>
    <CommentText><![CDATA[you where <script>alert('hacked');</script>]]></CommentText>
  </Comment>
</Comments>
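To make the risk concrete, here is a minimal sketch of how such a file might be read with Linq2XML. The names `CommentStore` and `ReadCommentTexts` are hypothetical and are not taken from the sample code. The point is only that the content of a CDATA section comes back verbatim, markup included, so writing it straight to the page sends the <script> tag to the browser.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the original sample code.
public static class CommentStore
{
    // Returns the raw text of every <CommentText> element in the storage XML.
    public static List<string> ReadCommentTexts(string storageXml)
    {
        XDocument doc = XDocument.Parse(storageXml);
        return doc.Root
            .Elements("Comment")
            // The cast yields the element's text; CDATA content is returned as-is.
            .Select(c => (string)c.Element("CommentText"))
            .ToList();
    }
}
```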

In Default2.aspx there is a simple solution to mitigate this problem: it tries to remove all nodes that are not <b>. But with this solution, if a user enters this is a <b>good comment</b> with <i>italic text</i>, we have a big problem: the part inside the <i></i> tag gets completely removed. Moreover, if some hacker finds a way to change your storage file, or can insert data into the database, you still have a problem.

A better solution is shown in Default3.aspx. Since I consider the data in the file untrusted input, because it is under the direct control of the user, I need to sanitize the comment before I render it to the user. The goal is to have a SanitizeComment function that takes an HTML fragment as input and returns a new fragment with all tags removed except those that are permitted.
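Before turning to that function, the content loss of the remove-only approach can be reproduced with a few lines. The source of Default2.aspx is not shown in this post, so what follows is only an assumed reconstruction of the idea; `RemoveOnlySketch` and `RemoveDisallowed` are made-up names.

```csharp
using System.Linq;
using System.Xml.Linq;

// Assumed reconstruction of the "remove everything that is not <b>" idea.
// Not the actual Default2.aspx code.
public static class RemoveOnlySketch
{
    public static string RemoveDisallowed(string commentText)
    {
        // Wrap the fragment in one root element; malformed input throws XmlException here.
        XElement root = XElement.Parse(
            "<span>" + commentText + "</span>", LoadOptions.PreserveWhitespace);

        // Remove() drops the element together with all of its text content.
        root.Descendants()
            .Where(e => e.Name != "b")
            .ToList()
            .ForEach(e => e.Remove());

        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

With the comment this is a <b>good comment</b> with <i>italic text</i> the words "italic text" disappear from the result, which is exactly the problem described above.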

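 1 private String SanitizeComment(String commentText)
 2 {
 3   try
 4   {
 5       XElement doc = XElement.Parse("<span>" + commentText + "</span>"); // wrap in a single root so the fragment parses
 6       doc.Descendants().Where(elem => elem.Name != "b") // every element except <b>
 7         .ToList().ForEach(elem => // ToList() snapshots the query before the tree is modified
 8          {
 9              elem.AddAfterSelf(new XText((String)elem)); // keep the element's text content
10              elem.Remove(); // drop the element itself
11          });
12 
13       String retvalue = doc.ToString();
14       return retvalue;
15   }
16   catch (System.Xml.XmlException)
17   {
18       return AntiXss.HtmlEncode(commentText); // not well-formed XML: encode everything
19   }
20 }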

This code uses Linq2XML. First of all, in line 5 I create an XElement from the concatenation of a <span> tag and the original content of the comment, so that the fragment has a single root element. Then in line 6 I select all the XML elements whose name is different from b (the only permitted tag in the output). For each of these unpermitted elements I simply add a new node of type XText, containing the element's text, right after the element, and then remove the original element.

If an XmlException occurs, it means that the input is not a well-formed XML fragment, so I fall back to the AntiXss HtmlEncode function, which encodes the whole comment and avoids any cross-site scripting risk.
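One thing the function does not touch is attributes: a comment such as <b onmouseover="alert(1)">text</b> is well-formed and uses only the permitted tag, so the attribute is still present in the output. The following is a sketch of a possible variant, not the code from Default3.aspx, that also strips attributes from <b>. It assumes .NET 4 or later and uses System.Net.WebUtility.HtmlEncode as a stand-in for AntiXss.HtmlEncode so that it compiles without the AntiXss library. The class name `CommentSanitizer` is made up for the example.

```csharp
using System.Linq;
using System.Net;
using System.Xml;
using System.Xml.Linq;

// Sketch of a variant of SanitizeComment; names are illustrative only.
public static class CommentSanitizer
{
    public static string Sanitize(string commentText)
    {
        try
        {
            // Wrap the fragment in a single root element so it can be parsed.
            XElement root = XElement.Parse(
                "<span>" + commentText + "</span>", LoadOptions.PreserveWhitespace);

            // ToList() snapshots the elements before the tree is modified.
            foreach (XElement elem in root.Descendants().ToList())
            {
                if (elem.Name == "b")
                {
                    // Permitted tag: keep it, but drop every attribute.
                    elem.RemoveAttributes();
                }
                else
                {
                    // Unpermitted tag: keep its text, drop the element.
                    elem.AddAfterSelf(new XText((string)elem));
                    elem.Remove();
                }
            }
            return root.ToString(SaveOptions.DisableFormatting);
        }
        catch (XmlException)
        {
            // Not well-formed XML: encode the whole comment instead.
            // Stand-in for AntiXss.HtmlEncode (an assumption of this sketch).
            return WebUtility.HtmlEncode(commentText);
        }
    }
}
```

As in the original function, the wrapping <span> stays in the returned string, so the caller renders the comment inside that element.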

Here is the result of the page when the storage file contains these comments:

<?xml version="1.0" encoding="utf-8"?>
<Comments>
  <Comment>
    <Author>Hacker</Author>
    <CommentText><![CDATA[you where <script>alert('hacked');</script>]]></CommentText>
  </Comment>
  <Comment>
    <Author></Author>
    <CommentText><![CDATA[this is a <b>good comment</b>]]></CommentText>
  </Comment>
  <Comment>
    <Author></Author>
    <CommentText><![CDATA[this is a <b>good comment</b> with <i>italic text</i> and <b>again Bold<i> italicbold</i></b>]]></CommentText>
  </Comment>
</Comments>

You can see that the file contains dangerous HTML code, but here is what is rendered by Default3.aspx, which calls the SanitizeComment function.

As you can see, alert('hacked') is shown as plain text without the <script> tag. Moreover, the text between <i> and </i> is kept rather than removed. The SanitizeComment function leaves only the <b> tags and removes all other unwanted tags.

sample code here.

alk.

Tags: Linq2XML Security Validation