Why doesn't HtmlEncode support multiple languages?

Mar 2, 2011 at 11:16 PM

I am HTML-encoding double-byte characters, and I hate having to quadruple my string size. I noticed that Encoder.SafeListCodes allows Multilanguage to be excluded; however, it is only used by JavaScriptEncode and VisualBasicScriptEncode. Why only these two? Why isn't HtmlEncode using this?
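To make the size concern concrete, here is a minimal sketch that measures how long a decimal numeric character reference is for one Hangul syllable. It assumes the encoder emits references of the form `&#NNNNN;`; the class and method names are hypothetical and not part of AntiXSS.

```csharp
using System;
using System.Globalization;

public static class EntitySizeDemo
{
    // Length of a decimal numeric character reference, e.g. "&#48176;", for one UTF-16 code unit.
    public static int NumericEntityLength(char c)
    {
        string entity = "&#" + ((int)c).ToString(CultureInfo.InvariantCulture) + ";";
        return entity.Length;
    }

    public static void Main()
    {
        char syllable = '\uBC30'; // Hangul syllable, decimal code point 48176

        Console.WriteLine(NumericEntityLength(syllable));       // 8
        Console.WriteLine(256 * NumericEntityLength(syllable)); // 2048
    }
}
```

Under that assumption each syllable grows from one character to eight, so a 256-character string becomes 2048 characters. Measured in bytes against a two-byte-per-character encoding, that is the fourfold growth described above.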


Mar 2, 2011 at 11:18 PM

It does, as of v4. If you're seeing problems with DBCS characters and hi/low surrogates then please log a bug.

It's the other way around though - JavaScript and VBScript encoding ignore the safe lists, for now.

Mar 3, 2011 at 1:33 AM
Edited Mar 3, 2011 at 1:35 AM

Are you sure about this?

I am using the source code of changeset 61742, which is associated with the release AntiXSS Library V4.0.

The unit test below is failing:

        public void TestHtmlEncode()
        {
            // 256 Korean letters
            const string Unencoded = "배우이다해하면청순함과단아함이먼저떠오른다어디하나흠잡을데없는미모에성격까지깔끔하다그런데드레스속휴지라니배우로서치명적인굴욕이다그렇다면드레스굴욕의책임이해당기자와네티즌들에게만있을걸까차적인책임은이다해본인에게있다고본다지난일열린제회서울문화예술대상시상식에서이다해는아나운서김병찬과공동였다영화제등각종시상식에서여자게쏠린시선을의식했는지이다해는개나리를연상시키는노란색드레스를입고화려하게레드카펫을밟았다그녀가레드카펫에등장하자수많은카메라플래쉬가터졌다수많은사진기자들중모언론사기자가한장의특종사진을포";
            string encoded = Microsoft.Security.Application.Encoder.HtmlEncode(Unencoded);

            // AreEqual compares values; AreSame compares references, so it never matches boxed ints.
            Assert.AreEqual(256, Unencoded.Length); // always passes
            Assert.IsFalse(encoded.Contains("#")); // failing now
            Assert.AreEqual(256, encoded.Length); // failing now
        }

Mar 3, 2011 at 3:09 AM

There shouldn't be failing tests. That is one of the old tests whose file I, err, forgot to remove, but it isn't part of the test project any more. The tests we use are in unicode.cs.

Now remember you must whitelist the language sets you want to be left alone. We default to the Latin character set because, well, that's most of the customer base, I'm afraid. So in your application initialisation you need to call UnicodeCharacterEncoder.MarkAsSafe() with the code charts you want left unencoded.


You'll find the language definitions in CodeCharts.cs. Note that they're not really languages, but Unicode character pages. There are some rather large enums there; I believe (and excuse me if this is wrong, my language skills aren't great) you may need

LowerMidCodeCharts.HangulJamo

Note the enums are flags, so you can combine.

Mar 3, 2011 at 3:26 AM

Oh, if this isn't clear from the documentation, let me know. There will be another drop in a couple of months, so I can update the documentation with what you'd like to see.

Apr 11, 2011 at 9:59 PM

Quick question: our website would like to mark anything entered in Japanese as safe.

I tried to add it, but I am very confused. Where can I find the right combination of code charts for Japanese?

Do I need to match the five values?

Microsoft.Security.Application.UnicodeCharacterEncoder.MarkAsSafe(Microsoft.Security.Application.LowerCodeCharts.Thai, Microsoft.Security.Application.LowerMidCodeCharts.HangulJamo, Microsoft.Security.Application.MidCodeCharts.Arrows, Microsoft.Security.Application.UpperMidCodeCharts.Bamum, Microsoft.Security.Application.UpperCodeCharts.AlphabeticPresentationForms);


Please advise!


Apr 11, 2011 at 10:03 PM

The code tables are flags, so you can combine values within each table, thus:

LowerCodeCharts.BasicLatin | LowerCodeCharts.C1ControlsAndLatin1Supplement | LowerCodeCharts.LatinExtendedA | LowerCodeCharts.LatinExtendedB


Each enum also has a None value, which you can pass in when you don't want to mark any of the languages in that range as safe.
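As a rough illustration of how flag enums combine, the sketch below uses a hypothetical `DemoCharts` enum as a stand-in. Its members and values are made up for the example and do not reflect the library's actual definitions.

```csharp
using System;

// Hypothetical stand-in for one of the chart enums; names and values are illustrative only.
[Flags]
public enum DemoCharts : long
{
    None = 0,
    BasicLatin = 1L << 0,
    LatinExtendedA = 1L << 1,
    LatinExtendedB = 1L << 2
}

public static class FlagsDemo
{
    // Combine two charts into a single value with bitwise OR.
    public static DemoCharts Combined()
    {
        return DemoCharts.BasicLatin | DemoCharts.LatinExtendedA;
    }

    public static void Main()
    {
        DemoCharts safe = Combined();

        Console.WriteLine(safe.HasFlag(DemoCharts.LatinExtendedA)); // True
        Console.WriteLine(safe.HasFlag(DemoCharts.LatinExtendedB)); // False
        Console.WriteLine((safe | DemoCharts.None) == safe);        // True: None adds nothing
    }
}
```

Each member occupies its own bit, so OR-ing members together builds a set, and None (zero) leaves a set unchanged.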


Apr 11, 2011 at 10:52 PM

Hi bdorrans, thanks a lot for your quick reply!!!

Yes, that's what I thought, but do you know where I can find the list of possible values for each combination

for example for Japanese?

I found this reference


But this is still confusing. For example, earlier in this thread someone asked how to mark Korean characters as safe.

Your answer was LowerMidCodeCharts.HangulJamo

So would the right way to mark it as safe look like this?

Microsoft.Security.Application.UnicodeCharacterEncoder.MarkAsSafe(Microsoft.Security.Application.LowerCodeCharts.None, Microsoft.Security.Application.LowerMidCodeCharts.HangulJamo, Microsoft.Security.Application.MidCodeCharts.None, Microsoft.Security.Application.UpperMidCodeCharts.None, Microsoft.Security.Application.UpperCodeCharts.None);

What's the meaning of LowerCodeCharts, LowerMidCodeCharts, etc.? Are those languages grouped by specific characteristics?


thanks again!

Apr 11, 2011 at 11:26 PM

The reason they're split into lower, lower-mid, etc. is that a single enum can't hold enough options, nothing more. Unicode doesn't work in terms of languages, but in terms of characters. I wouldn't know how to map a language to a character set, to be honest, except for English, because, well, that's what I speak. I know there are multiple CJK characters in multiple Unicode sets depending on the dialect; for example, there's CJK Extension A. If you can figure out which ones you need, then you can combine them like I've shown.

What I would say is that you should probably leave the defaults in for LowerCodeCharts; there's a LowerCodeCharts.Default especially for that.

But there's more than one type of Japanese; for example, there's Katakana and Hiragana, as well as supplemental CJK pages. So you might try


    UpperMidCodeCharts.CjkRadicalsSupplement | UpperMidCodeCharts.CjkSymbolsAndPunctuation,

as a starting point, and ask your users to be more specific.
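Putting the pieces from this thread together, a start-up call might look roughly like the sketch below. It uses only the enum members named in the posts above; `EncoderSetup` and `MarkJapaneseChartsAsSafe` are hypothetical names, and any further chart members should be checked against CodeCharts.cs before use.

```csharp
using Microsoft.Security.Application;

public static class EncoderSetup
{
    // Hypothetical helper; call once during application start-up, before any encoding.
    public static void MarkJapaneseChartsAsSafe()
    {
        UnicodeCharacterEncoder.MarkAsSafe(
            LowerCodeCharts.Default,   // keep the library's default Latin set
            LowerMidCodeCharts.None,
            MidCodeCharts.None,
            // Starting point suggested in the reply above. OR in further members from
            // CodeCharts.cs (for example the Hiragana and Katakana charts) once their
            // exact names are confirmed there.
            UpperMidCodeCharts.CjkRadicalsSupplement | UpperMidCodeCharts.CjkSymbolsAndPunctuation,
            UpperCodeCharts.None);
    }
}
```

This needs a reference to the AntiXSS assembly and has not been checked against a specific release.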

Apr 21, 2011 at 8:56 PM

Thanks bdorrans, it worked perfectly. I have one more question, though it may not be completely related...

Our website allows users to share links: we scrape the site (using HtmlAgilityPack), retrieve the images and the description, and then save them to the database for later reference.

Do you think it is safe enough to use Sanitizer.GetSafeHtmlFragment to retrieve the description, which, based on what I've seen, shouldn't contain any HTML?

I guess my concern is how to deal with a malicious URL. Do you know if the sanitizer will prevent a virus from executing? Or is it not even possible to get a virus via HttpWebRequest?

I am unable to find any information online, and I'm not sure if you know anything about it. Basically, if the URL contains malicious content,

like http://www.yahoo500.com/.../render.jpg?v2020, which looks like a safe URL but installs malware, do you know if I would get it by scraping a page?


Thanks for your help

Apr 21, 2011 at 8:59 PM

GetSafeHtmlFragment leaves some HTML. Certainly, on the server side issuing an HttpWebRequest, you're only parsing the return, not executing it in any form. I'm not saying it's impossible; some weirdness with pointers or heap sprays might do something, but the risk is, in my opinion, minimal.

Dec 9, 2011 at 9:27 PM


What would be the best way to allow certain xml tags such as <MySafeTag>? How can we use MarkAsSafe() to accomplish that?

Of course, we certainly want to stop the usual <script ..



Dec 9, 2011 at 9:31 PM

fabian - you can't. MarkAsSafe() is for Unicode code tables, characters if you will. You could do a search and replace after the encode, replacing &lt;MySafeTag&gt; with the original tag, though. *mlEncode only works in terms of characters, not tags.
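The search-and-replace idea could be sketched as follows. To keep the example self-contained it uses `System.Net.WebUtility.HtmlEncode` as a stand-in encoder rather than the AntiXSS one, and `SafeTagRestorer` and `EncodeAllowingTag` are hypothetical names. It only restores an exact tag with no attributes.

```csharp
using System;
using System.Net;

public static class SafeTagRestorer
{
    // Hypothetical helper: encode everything, then restore one exact, attribute-free tag pair.
    public static string EncodeAllowingTag(string input, string tagName)
    {
        // Stand-in encoder; an AntiXSS encoder could be swapped in here instead.
        string encoded = WebUtility.HtmlEncode(input);

        return encoded
            .Replace("&lt;" + tagName + "&gt;", "<" + tagName + ">")
            .Replace("&lt;/" + tagName + "&gt;", "</" + tagName + ">");
    }

    public static void Main()
    {
        string input = "<MySafeTag>hi</MySafeTag><script>alert(1)</script>";

        // Prints: <MySafeTag>hi</MySafeTag>&lt;script&gt;alert(1)&lt;/script&gt;
        Console.WriteLine(EncodeAllowingTag(input, "MySafeTag"));
    }
}
```

Because the match is on the exact encoded text, a tag carrying attributes stays encoded, which is the conservative outcome.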

Dec 9, 2011 at 9:52 PM

Got it. But am I barking up the wrong tree - what about the XmlEncode method? Can I just call that instead and that should take care of sanitizing my xml?



Dec 9, 2011 at 9:53 PM

No - all the encoders are on a per character basis.

Dec 13, 2011 at 9:42 PM

Ok, so you really can't add individual characters to the whitelist per se, you have to add the entire table - correct? In that case, if I only want to consider a few additional characters as safe, I could just do a search and replace after the encode as mentioned. Am I on the right track?