HtmlEncode Works Too Well

Dec 13, 2008 at 5:24 PM
In the sample application in the file summary.aspx, a textarea is displayed with the comment:
You can enter HTML in the above box to display rich text.

If I enter the text:
gfdgfdg <a href="">this is great feedback</a> would <b>be</b> cool if this works!

The literal string gets displayed back in the list, not a properly rendered Html string. Behind the scenes, the text is being displayed using:

Obviously this is not working correctly. I would expect it to allow the bold <b> and anchor <a> tags while, perhaps, filtering out any unwanted attributes, such as JavaScript event handlers or the like. A configurable whitelist would be ideal (and perhaps there *is* one and I am just missing it?).

Dec 13, 2008 at 8:24 PM
It's doing exactly what it's supposed to. Even if you used the HtmlEncode from the .NET Framework, it would do exactly the same thing.

HtmlEncode does not turn the input into a rendered HTML string; it takes the untrusted input and escapes <, > and other key characters, so that the text displays in the HTML output exactly as it was entered.
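
To make the escaping concrete, here is a minimal sketch in Python rather than .NET. It uses the standard library's html.escape as a stand-in for HtmlEncode; it is not the AntiXSS API, and the exact set of characters each encoder escapes may differ.

```python
import html

# The text entered in the sample's textarea.
untrusted = 'gfdgfdg <a href="">this is great feedback</a> would <b>be</b> cool if this works!'

# Escape the markup characters so the browser shows the text literally
# instead of interpreting it as tags.
encoded = html.escape(untrusted)

print(encoded)
```

The browser then shows the tags as typed, which is the behaviour described in the first post.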

If you want to add filtering then really you need to take the encoded string and decode the bits you want; for example searching for &lt;b&gt; and converting it back into <b>, using a whitelisted approach.
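
A rough sketch of that encode-then-decode idea, again in Python as an illustration only. The ALLOWED_TAGS whitelist and the encode_with_whitelist helper are hypothetical names, not part of AntiXSS, and only bare tags without attributes are restored.

```python
import html
import re

# Hypothetical whitelist: simple formatting tags that take no attributes.
ALLOWED_TAGS = ("b", "i", "u")

# Matches an encoded opening or closing tag such as &lt;b&gt; or &lt;/b&gt;.
_ENCODED_TAG = re.compile(r"&lt;(/?)(%s)&gt;" % "|".join(ALLOWED_TAGS))


def encode_with_whitelist(untrusted: str) -> str:
    """Encode everything, then turn whitelisted bare tags back into markup."""
    encoded = html.escape(untrusted)
    return _ENCODED_TAG.sub(r"<\1\2>", encoded)
```

Anything that is not an exact whitelisted tag, including a whitelisted tag carrying attributes, stays encoded.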
Dec 15, 2008 at 2:16 PM
Oh, I completely agree that HtmlEncode is working correctly. However, I think the error here is with the sample. Perhaps, in the code behind, the sample authors intended to use a different method?

If you read the text below the <textarea> that is displayed to the user, it explicitly says the user can type in Html and it will be rendered to the screen. This is not happening. Instead, the entered text is being HtmlEncode()d and the user winds up seeing the "raw" Html they typed in.

As for unencoding just the tags you want to allow -- that, of course, is not exactly perfect, either. What if the user intentionally typed in <br /> as text in their message (like I just did)?

Honestly, I think the mistake here is in the sample. I did not think AntiXSS intended to include the functionality that is implied on the screen. When I saw it I was excited that they added a new feature. But, the truth is that they had not.

Dec 16, 2008 at 6:12 PM
You are right, the sample has the wrong comments. Anti-XSS only encodes the data; it does not filter it. I will update the sample to reflect this.
Jan 27, 2009 at 2:04 PM
Edited Jan 27, 2009 at 2:05 PM
Whatever happened to the GetSafeHtml method mentioned

It is frustrating that after all this time the only things you have improved are performance and the HttpModule, which doesn't address this problem either.  There are endless numbers of people requesting this feature.

Microsoft say they take XSS seriously but they don't understand the issues developers face.  The suggested solution of replacing the encoded information afterwards may be trivial for basic tags (e.g. b, u etc.), but imagine trying to write all the possible regular expressions to allow tags like img (where XSS attacks can be placed in the src attribute).  In PHP they have HtmlPurifier, but I have not seen anything for .NET.
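
To illustrate why attribute-bearing tags are harder, here is a small sketch of just one of the checks an img whitelist would need: validating the URL scheme of a src value. It is in Python for illustration; SAFE_SCHEMES and is_safe_image_src are hypothetical names, and this check alone is nowhere near a complete sanitizer.

```python
from urllib.parse import urlsplit

# Hypothetical policy: only absolute http(s) image URLs are accepted.
SAFE_SCHEMES = {"http", "https"}


def is_safe_image_src(src: str) -> bool:
    """Return True only if src parses as a URL with a whitelisted scheme."""
    try:
        scheme = urlsplit(src.strip()).scheme.lower()
    except ValueError:
        # Treat anything the parser rejects as unsafe.
        return False
    return scheme in SAFE_SCHEMES
```

Even with this in place, the tag itself still has to be parsed properly and every other attribute checked, which is why a regular expression per tag does not scale.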

Maybe this library should be renamed to avoid confusion, possibly SlightlyImprovedHtmlEncodeButNotThatUsefulLibrary, kinda rolls off the tongue.