Comparison with validating XHTML via XML Schema

Nov 10, 2009 at 2:17 PM

I gather that the XSS Library uses a character-based whitelist method.

We were looking at another approach to validating FCKEditor-generated XHTML content: a custom ASP.NET validator that validates the text input against an XML Schema representing a greatly cut-down subset of XHTML (basically allowing the elements div, p, ul, ol, li, strong and em with no attributes, plus a with @href).

This approach (done right) would seem to serve security, editorial and accessibility purposes. It wouldn't work with HTML rather than XHTML (unless there was a prior conversion). You would need to lock down your XHTML editor interface to only present allowable formatting options (FCKEditor is sufficiently configurable). In our test page, we are currently wrapping XHTML fragment input in a div with the http://www.w3.org/1999/xhtml namespace, although your XHTML editor might manage this itself.

So I think the schema-validation approach meets the "principle of inclusions" that XSS Library also follows. Using XML Schema we can be very sure of the input content structure.
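
As a rough illustration of the inclusion idea described above, here is a minimal Python sketch that wraps a fragment in a namespaced div and rejects any element or attribute not on an allowlist. It is not the ASP.NET validator discussed in this thread; the names `ALLOWED`, `wrap_fragment` and `is_allowed_structure` are hypothetical, and the check covers names only, not the nesting rules that a schema would also enforce.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Assumed allowlist mirroring the subset described above:
# element name -> set of permitted attribute names.
ALLOWED = {
    "div": set(), "p": set(), "ul": set(), "ol": set(), "li": set(),
    "strong": set(), "em": set(), "a": {"href"},
}


def wrap_fragment(fragment: str) -> str:
    """Wrap editor output in a namespaced div so it parses as one document."""
    return f'<div xmlns="{XHTML_NS}">{fragment}</div>'


def is_allowed_structure(fragment: str) -> bool:
    """Return True only if every element and attribute is on the allowlist."""
    try:
        root = ET.fromstring(wrap_fragment(fragment))
    except ET.ParseError:
        return False  # not well-formed XML, so it cannot be valid XHTML
    prefix = "{" + XHTML_NS + "}"
    for element in root.iter():
        if not element.tag.startswith(prefix):
            return False  # element from another (or no) namespace
        name = element.tag[len(prefix):]
        if name not in ALLOWED:
            return False
        if any(attr not in ALLOWED[name] for attr in element.attrib):
            return False
    return True
```

Anything that is not well-formed (plain HTML with unclosed tags, for instance) fails at the parsing step, which matches the point above that the approach needs XHTML rather than HTML.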

Have you considered making this available via the XSS Library, or do you see problems with it? Are the two methods (encoding, schema) in any way complementary?

Might the XSS Library offer such XHTML-subset validation in future?

Coordinator
Nov 13, 2009 at 5:06 AM

This is an interesting way to sanitize input, and I agree that it falls within the inclusions principle. XML Schema can be used to check HTML for valid structure. However, I would not suggest it as a security measure on its own, primarily because you are checking the structure but not its content. For example, you could have an element with a class or style attribute containing script. You are only as good as your allowed list of elements and attributes, and a tight list can make formatting very hard for users.

I think encoding and sanitization are two options for solving the same problem. Sanitization involves removing or replacing undesirable characters, whereas encoding involves transforming characters that might otherwise pose a security threat into a harmless representation. So you would choose either to sanitize the input (which is a tougher problem to solve, depending on the domain of characters you are accepting) or to encode it. You could do both, but that does not add any value.
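
To make the distinction concrete, here is a small Python sketch (not the XSS Library's own code) contrasting the two options on the same input. The accepted-character list in `DISALLOWED` is an assumption chosen only for illustration.

```python
import html
import re

# Assumption for illustration: only these characters are accepted.
DISALLOWED = re.compile(r"[^A-Za-z0-9 .,;:!?'()-]")


def sanitize(text: str) -> str:
    """Sanitization: remove characters that are not on the accepted list."""
    return DISALLOWED.sub("", text)


def encode(text: str) -> str:
    """Encoding: keep every character, but emit markup-significant ones as entities."""
    return html.escape(text, quote=True)


payload = "<script>alert('x')</script>"
print(sanitize(payload))  # scriptalert('x')script
print(encode(payload))    # &lt;script&gt;alert(&#x27;x&#x27;)&lt;/script&gt;
```

Sanitizing loses information (the removed characters cannot be recovered), while encoding is reversible and leaves the decision about what to display to the output stage.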

I would suggest one addition to your approach: validate against the schema as you said, and in addition validate the content inside each element to ensure you accept only valid characters. Keep a whitelist of characters that you accept and ensure that the text inside elements matches this whitelist. This way, you are validating both the structure and the content inside it. I would like to know your final implementation, for curiosity's sake.
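
A minimal Python sketch of that second check, assuming the input has already passed structural validation. The whitelist in `ALLOWED_TEXT` and the function name `text_is_whitelisted` are hypothetical; a real whitelist would depend on the characters your application needs to accept.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical whitelist: word characters, whitespace and basic punctuation.
ALLOWED_TEXT = re.compile(r"[\w\s.,;:!?'()/%@+-]*")


def text_is_whitelisted(xhtml: str) -> bool:
    """Return True if all character data in the document matches the whitelist."""
    try:
        root = ET.fromstring(xhtml)
    except ET.ParseError:
        return False  # not well-formed, so nothing to trust
    # itertext() yields each element's text and tail in document order.
    return all(ALLOWED_TEXT.fullmatch(chunk) for chunk in root.itertext())
```

Note that this covers element text only; attribute values such as a/@href are not visited by `itertext()` and would need their own check.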

Thoughts?

Anil Revuru (INFORMATION SECURITY TOOLS)

From: TavisR [mailto:notifications@codeplex.com]
Sent: Tuesday, November 10, 2009 7:22 AM
To: Anil Revuru (INFORMATION SECURITY TOOLS)
Subject: Comparison with validating XHTML via XML Schema [AntiXSS:74660]


Nov 13, 2009 at 8:55 AM

Thanks for your reply. Here is our prototype "simple text entry" schema based on W3C XHTML. I tried to create this with XHTML modules, but that was more complex than I could manage in the time. We do not want users to format the input with styles, but just use paragraphs, lists and hyperlinks.

Our test ASP.NET page validates the HTML entry of our FCKEditor page. Here are the various components.

ASP.NET page:

<%@ Page Title="Custom XML Schema validation of FCKEditor HTML input" Language="VB" MasterPageFile="~/plainer.master" AutoEventWireup="true" CodeFile="default.aspx.vb" Inherits="tests_validation_html_customvalidator_default" %>
<%@ Register TagPrefix="FCKeditorV2" Namespace="FredCK.FCKeditorV2" Assembly="FredCK.FCKeditorV2" %>

<asp:Content ID="Content1" ContentPlaceHolderID="plaContent" Runat="Server">
	<h1><a href="/tests/validation/html/customvalidator">Custom XML Schema validation of FCKEditor HTML input</a>
		<abbr title="major version">0</abbr>.
		<abbr title="minor version">4</abbr>.
		<abbr title="revision">0</abbr>.
		<abbr title="build">48</abbr>
		</h1>
	<p>The Additional Information/Please include details box is validated against a
		<a href="http://www.adamsmithcollege.ac.uk/xml/schema/w3c/1999/xhtml/simpletextentry.xsd">very cut-down XHTML-based schema</a>.
		The box uses the FCKEditor Basic toolbar set, which includes the <code style="font-size: large;"><b>B</b></code> and
		<code style="font-size: large;"><i>I</i></code> buttons; these elements are not allowed
		in the schema, so if they are used in the input text, it should fail the validation.</p>
	<p>The Additional Information/Reason for making this application is validated against a
		<a href="http://www.adamsmithcollege.ac.uk/xml/schema/w3c/1999/xhtml/simpletextentry.xsd">very cut-down XHTML-based schema</a>.
		The box uses a custom toolbar set called ascSimpleTextEntry which does <strong>not</strong> include the <code style="font-size: large;"><b>B</b></code>
		and <code style="font-size: large;"><i>I</i></code> buttons.</p>
	<p>The FCKEditor controls on this page use the ascconfig.js configuration file.</p>
	<div><asp:Label ID="lblOutput" Text="Validation test results and HTML code captured: <br /><br />" runat="server" /></div>
	<div id="formlong">
		<fieldset>
			<legend>Additional Information</legend>
			<label>Please include details of any skills, aptitudes, or personal qualities and explain how you might use them in this post.
				<FCKeditorV2:FCKeditor id="fckAdditionalInformation" CustomConfigurationsPath="/fckeditor/ascconfig.js"
					EnableXHTML="true" ToolbarSet="Basic" EnableSourceXHTML="true" ForcePasteAsPlainText="true" FormatSource="true"
					runat="server" />
				<asp:CustomValidator ID="cvlAdditionalInformation" ControlToValidate="fckAdditionalInformation"
           OnServerValidate="ServerValidation" ErrorMessage="Additional Information is not valid. " 
           Display="Dynamic" runat="server" />
			</label>
			<label>Reason for making this application:
				<FCKeditorV2:FCKeditor id="fckApplicationReason" CustomConfigurationsPath="/fckeditor/ascconfig.js"
					EnableXHTML="true" ToolbarSet="ascSimpleTextEntry" EnableSourceXHTML="true" ForcePasteAsPlainText="true" FormatSource="true"
					runat="server" />
				<asp:CustomValidator ID="cvlApplicationReason" ControlToValidate="fckApplicationReason"
					OnServerValidate="ServerValidation" ErrorMessage="Reason for making this application is not valid. "
					Display="Dynamic" runat="server" />
			</label>
		</fieldset>
		<asp:Button id="btnSubmit" Text="Validate" OnClick="ValidateBtn_OnClick" CausesValidation="true"
			runat="server"/>
		<asp:ValidationSummary ID="vlsInsert" runat="server" />
	</div>
</asp:Content>

Code behind (VB.NET):

Imports System
Imports System.IO
Imports System.Xml
Imports System.Xml.Schema
Imports System.Xml.XPath
Partial Class tests_validation_html_customvalidator_default
	Inherits System.Web.UI.Page
	Public strValidationMessages As String
	Function ValidateXhtml(ByVal strXhtmlFragment As String) As Boolean
		' Validate a string as a cut-down subset of XHTML.
		Dim booXhtmlValid As Boolean = True
		Dim trdXhtml As TextReader = New StringReader(strXhtmlFragment)
		' Try something from http://msdn.microsoft.com/en-us/library/ms162371.aspx
		Dim settings As XmlReaderSettings = New XmlReaderSettings()
		Dim eventHandler As ValidationEventHandler = New ValidationEventHandler(AddressOf ValidationCallBack)
		Dim document As XmlDocument = New XmlDocument()
		Dim navigator As XPathNavigator = document.CreateNavigator()
		Try
			settings.Schemas.Add("http://www.w3.org/1999/xhtml", "http://www.adamsmithcollege.ac.uk/xml/schema/w3c/1999/xhtml/simpletextentry.xsd")
			settings.ValidationType = ValidationType.Schema
			' Create the XmlReader object.
			Dim reader As XmlReader = XmlReader.Create(trdXhtml, settings)
			document.Load(reader)
			' Validate the document once, and set the return value of True (validates against schema) or False (fails to validate) accordingly.
			document.Validate(eventHandler)
			If document.SchemaInfo.Validity = XmlSchemaValidity.Invalid OrElse document.SchemaInfo.Validity = XmlSchemaValidity.NotKnown Then
				booXhtmlValid = False
			End If
			' Debugging information.
			strValidationMessages &= "document.OuterXml = " & Server.HtmlEncode(document.OuterXml) & "<br />"
			strValidationMessages &= "document.SchemaInfo.Validity = " & Server.HtmlEncode(document.SchemaInfo.Validity.ToString) & "<br />"
		Catch ex As Exception
			strValidationMessages &= ex.Message
			booXhtmlValid = False
		End Try
		Return booXhtmlValid
	End Function
	' Collect any validation errors for display.
	Sub ValidationCallBack(ByVal sender As Object, ByVal e As ValidationEventArgs)
		strValidationMessages &= e.Message
	End Sub
	Sub ValidateBtn_OnClick(ByVal sender As Object, ByVal e As EventArgs)
		lblOutput.Text &= strValidationMessages
		If Page.IsValid Then
			lblOutput.Text &= "Page is valid. <br /><br />"
		Else
			lblOutput.Text &= "Page is not valid! <br /><br />"
		End If
	End Sub
	Sub ServerValidation(ByVal source As Object, ByVal arguments As ServerValidateEventArgs)
		Dim strAdditionalInformation As String = "<div xmlns=""http://www.w3.org/1999/xhtml"">" & arguments.Value & "</div>"
		'lblOutput.Text &= "strAdditionalInformation = " & Server.HtmlEncode(strAdditionalInformation) & ". "
		arguments.IsValid = ValidateXhtml(strAdditionalInformation)
	End Sub
End Class

The FCKEditor configuration has this custom ascSimpleTextEntry toolbar set, which should show only buttons that insert elements valid against the "simple text entry" schema:

// Toolbar Sets
FCKConfig.ToolbarSets["ascSimpleTextEntry"] = [
['OrderedList','UnorderedList','-','Link','Unlink','-','About']
] ;

FCKEditor could be replaced by another XHTML-compliant editor (like its successor CKEditor); the principle is the same.

So, following your suggestion, one extra thing we would need to sanitize is the content of the a/@href attribute. To guard against the HTML text content ever being HTML-decoded in future, we could sanitize that as well (even though we intend to store the HTML input encoded, and therefore never decode it). Assuming these steps were taken, do you think the XSS security angles would be covered?

Nov 17, 2009 at 2:43 PM

I have updated the validating schema to restrict the a/@href attribute. Originally, the a element's href attribute (the hypertext link URI) was of type xs:anyURI. However, this allowed script input like:

<a href="javascript:alert('oh oh');">scripted</a>

So I have updated the XML schema so that the a/@href attribute uses a custom simple type (called restrictedURI), derived by restriction from xs:anyURI and constrained by a pattern so that values must start with "http://", as shown below:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="http://www.w3.org/1999/xhtml" targetNamespace="http://www.w3.org/1999/xhtml">
	<xs:annotation>
		<xs:documentation>
		  Greatly simplified from W3C modular schema.
		</xs:documentation>
	</xs:annotation>
	<!-- div -->
	<xs:complexType name="div.type">
		<xs:choice minOccurs="0" maxOccurs="unbounded">
			<xs:element ref="p"/>
			<xs:element ref="ol"/>
			<xs:element ref="ul"/>
		</xs:choice>
	</xs:complexType>
	<xs:element name="div" type="div.type"/>
	<!-- p -->
	<xs:element name="p" type="inlineContent"/>
	<!-- li -->
	<xs:element name="li" type="inlineContent"/>
	<!-- ol  -->
	<xs:complexType name="xhtml.ol.type">
		<xs:sequence>
			<xs:element ref="li" maxOccurs="unbounded"/>
		</xs:sequence>
	</xs:complexType>
	<xs:element name="ol" type="xhtml.ol.type"/>
	<!-- ul  -->
	<xs:complexType name="xhtml.ul.type">
		<xs:sequence>
			<xs:element ref="li" maxOccurs="unbounded"/>
		</xs:sequence>
	</xs:complexType>
	<xs:element name="ul" type="xhtml.ul.type"/>
	<!-- a -->
	<xs:element name="a">
		<xs:complexType>
			<xs:simpleContent>
				<xs:extension base="xs:string">
					<xs:attribute name="href" type="restrictedURI"/>
				</xs:extension>
			</xs:simpleContent>
		</xs:complexType>
	</xs:element>
	<!-- em -->
	<xs:complexType name="xhtml.em.type" mixed="true"/>
	<xs:element name="em" type="xhtml.em.type"/>
	<!-- strong -->
	<xs:complexType name="xhtml.strong.type" mixed="true"/>
	<xs:element name="strong" type="xhtml.strong.type"/>
	<!-- inline content -->
	<xs:complexType name="inlineContent" mixed="true">
		<xs:choice minOccurs="0" maxOccurs="unbounded">
			<xs:element ref="a"/>
			<xs:element ref="em"/>
			<xs:element ref="strong"/>
		</xs:choice>
	</xs:complexType>
	<!-- restricted URL -->
	<xs:simpleType name="restrictedURI">
		<xs:restriction base="xs:anyURI">
			<xs:pattern value="http://.*" />
		</xs:restriction>
	</xs:simpleType>
</xs:schema>
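
To see what the restrictedURI pattern accepts and rejects, the facet can be approximated outside the schema. The following Python sketch is an emulation under stated assumptions, not the .NET validator's behaviour: XSD patterns must match the whole value, which `re.fullmatch` imitates, and the function name `href_is_allowed` is hypothetical.

```python
import re

# The pattern facet copied from the restrictedURI simple type.
# XSD patterns are implicitly anchored at both ends; fullmatch emulates that.
RESTRICTED_URI = re.compile(r"http://.*")


def href_is_allowed(href: str) -> bool:
    """Approximate the restrictedURI check for a single href value."""
    # xs:anyURI collapses whitespace, so leading and trailing whitespace is
    # removed before the pattern is checked; strip() approximates that step.
    return RESTRICTED_URI.fullmatch(href.strip()) is not None


for candidate in ("http://example.com/page",
                  "javascript:alert('oh oh');",
                  "https://example.com/"):
    print(candidate, href_is_allowed(candidate))
```

As written, the pattern blocks the javascript: example, but it also rejects https, mailto and relative links, and it is case-sensitive, so the pattern may need widening if those are legitimate input.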