Upcoming changes to AntiXSS

Coordinator
Jul 20, 2010 at 4:36 PM

So this sprint I’ve been playing around with AntiXSS. One of the most popular requests was “Can you support language X?” and now the answer is probably yes. I say probably, because we’re only covering the code tables that fit in a single UTF-16 code unit (the Basic Multilingual Plane) – if you wanted support for Byzantine Musical Notation (really, it exists, up at U+1D000) then you’re out of luck.

Now there’s a little problem in all of this – Unicode doesn’t have a concept of language; it has code tables. If you’re lucky your language will be self-contained in a single code table; for example, the characters for Armenian are contained in the code table between 0x0530 and 0x058F. If you’re unlucky they may be scattered all over the place. There’s no easy way to take a language and decide what code tables it needs, and languages can be a geo-political nightmare, so I’m afraid you’re going to have to learn your Unicode tables. The next version will contain five enums which between them list every supported Unicode code table. You can then combine the enum values to safe-list your desired characters via

UnicodeCharacterEncoder.MarkAsSafe(
    LowerCodeCharts lowerCodeCharts,
    LowerMidCodeCharts lowerMidCodeCharts,
    MidCodeCharts midCodeCharts,
    UpperMidCodeCharts upperMidCodeCharts,
    UpperCodeCharts upperCodeCharts)

I apologise in advance for the horrible method signature; however, marking code tables as safe should be a once-per-application initialisation call. Once you safe-list your language, its characters will be left alone and not converted into their &#xxxx; values, which gives a rather nifty speed boost (one of the major sticking points for using AntiXSS has been the performance hit). The default safe code tables are Basic Latin, C1 Controls and Latin-1 Supplement, Latin Extended A, Latin Extended B, Spacing Modifier Letters, IPA Extensions and Combining Diacritical Marks. Obviously the fun characters in Basic Latin, < > & and ", are always going to be encoded to their entity values.
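To make the safe-list idea concrete, here is a minimal, self-contained sketch of how combining flag values could drive the encoder. It is not the AntiXSS implementation: the SketchCodeCharts enum, its member names, and the SafeListSketch class are hypothetical stand-ins, only two code tables are modelled, and treating every printable Basic Latin character as safe is a simplifying assumption.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical stand-in for the five code chart enums; only two charts are
// modelled here and the real member names may differ.
[Flags]
public enum SketchCodeCharts
{
    None = 0,
    BasicLatin = 1 << 0, // U+0000 to U+007F
    Armenian = 1 << 1,   // U+0530 to U+058F, the range quoted in the post
}

public static class SafeListSketch
{
    // Current safe list; intended to be set once at application start-up.
    private static SketchCodeCharts safeCharts = SketchCodeCharts.BasicLatin;

    public static void MarkAsSafe(SketchCodeCharts charts)
    {
        safeCharts = charts;
    }

    private static bool IsSafe(char c)
    {
        // Simplification: every printable Basic Latin character counts as safe.
        if (c >= 0x0020 && c <= 0x007E)
            return safeCharts.HasFlag(SketchCodeCharts.BasicLatin);
        if (c >= 0x0530 && c <= 0x058F)
            return safeCharts.HasFlag(SketchCodeCharts.Armenian);
        return false;
    }

    public static string Encode(string input)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            switch (c)
            {
                // These four are always encoded, whatever the safe list says.
                case '<': sb.Append("&lt;"); continue;
                case '>': sb.Append("&gt;"); continue;
                case '&': sb.Append("&amp;"); continue;
                case '"': sb.Append("&quot;"); continue;
            }

            if (IsSafe(c))
            {
                sb.Append(c);
                continue;
            }

            // Anything not safe-listed becomes a decimal &#xxxx; reference.
            int codePoint = c;
            if (char.IsSurrogatePair(input, i))
            {
                codePoint = char.ConvertToUtf32(input, i);
                i++; // skip the low surrogate we just consumed
            }
            sb.Append("&#")
              .Append(codePoint.ToString(CultureInfo.InvariantCulture))
              .Append(';');
        }
        return sb.ToString();
    }
}
```

With the default list in this sketch, SafeListSketch.Encode("\u0540") returns &#1344;. After a single call to SafeListSketch.MarkAsSafe(SketchCodeCharts.BasicLatin | SketchCodeCharts.Armenian), the same character passes through untouched, while < > & and " are still encoded.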

The safe list applies to all four *ML methods: HtmlEncode, HtmlAttributeEncode, XmlEncode and XmlAttributeEncode. HtmlAttributeEncode will also escape the space character and the apostrophe to their Unicode character values, XmlEncode will change the apostrophe to &apos; and XmlAttributeEncode both escapes the space and changes the apostrophe.
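The differences between the four methods can be sketched as a shared core plus a small per-method override table. This is only an illustration of the behaviour described above, not the library's code: the MarkupEncodeSketch class is hypothetical, the safe-list check is left out for brevity, XmlAttributeEncode is assumed to use &apos; like XmlEncode, and since nothing above says what HtmlEncode does with the apostrophe, the sketch leaves it untouched as a placeholder.

```csharp
using System.Collections.Generic;
using System.Text;

public static class MarkupEncodeSketch
{
    // Replacements common to all four methods.
    private static readonly Dictionary<char, string> Shared = new Dictionary<char, string>
    {
        { '<', "&lt;" }, { '>', "&gt;" }, { '&', "&amp;" }, { '"', "&quot;" },
    };

    // Per-method extras, as described in the post.
    private static readonly Dictionary<char, string> NoExtras = new Dictionary<char, string>();

    private static readonly Dictionary<char, string> HtmlAttributeExtras = new Dictionary<char, string>
    {
        { ' ', "&#32;" }, { '\'', "&#39;" },
    };

    private static readonly Dictionary<char, string> XmlExtras = new Dictionary<char, string>
    {
        { '\'', "&apos;" },
    };

    // Assumption: the apostrophe becomes &apos; here, as in XmlEncode.
    private static readonly Dictionary<char, string> XmlAttributeExtras = new Dictionary<char, string>
    {
        { ' ', "&#32;" }, { '\'', "&apos;" },
    };

    public static string HtmlEncode(string s) => Encode(s, NoExtras);
    public static string HtmlAttributeEncode(string s) => Encode(s, HtmlAttributeExtras);
    public static string XmlEncode(string s) => Encode(s, XmlExtras);
    public static string XmlAttributeEncode(string s) => Encode(s, XmlAttributeExtras);

    private static string Encode(string input, Dictionary<char, string> extras)
    {
        var sb = new StringBuilder();
        foreach (char c in input)
        {
            string replacement;
            if (extras.TryGetValue(c, out replacement) || Shared.TryGetValue(c, out replacement))
                sb.Append(replacement);
            else
                sb.Append(c); // the safe-list check would go here in a fuller model
        }
        return sb.ToString();
    }
}
```

For the input it's a <b>, this sketch gives it's a &lt;b&gt; from HtmlEncode, it&#39;s&#32;a&#32;&lt;b&gt; from HtmlAttributeEncode, it&apos;s a &lt;b&gt; from XmlEncode, and it&apos;s&#32;a&#32;&lt;b&gt; from XmlAttributeEncode.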

One added bonus of the new safe lists is that they will fix a problem with .NET 4.0, AntiXSS and UpdatePanels. If you replace the .NET 4.0 encoder with AntiXSS, as Phil detailed, and then use UpdatePanels, things can get a little confused: the __EVENTVALIDATION field gets encoded, and not decoded, so / turns into &#47; and all heck breaks loose. Because the Basic Latin safe listing has been widened to leave alone more characters that are simply not dangerous, the / character will remain a / and your UpdatePanels will work as expected.