Most feeds embed HTML markup within feed elements. Some feeds even embed other types of markup, such as SVG or MathML. Since many feed aggregators use a web browser (or browser component) to display content, Universal Feed Parser sanitizes embedded markup to remove things that could pose security risks.
These elements are sanitized by default:
The unit tests for HTML sanitizing show many different examples of dangerous markup that Universal Feed Parser sanitizes by default.
The following HTML elements are allowed by default (all others are stripped): a, abbr, acronym, address, area, article, aside, audio, b, big, blockquote, br, button, canvas, caption, center, cite, code, col, colgroup, command, datagrid, datalist, dd, del, details, dfn, dialog, dir, div, dl, dt, em, event-source, fieldset, figure, footer, font, form, header, h1, h2, h3, h4, h5, h6, hr, i, img, input, ins, keygen, kbd, label, legend, li, m, map, menu, meter, multicol, nav, nextid, noscript, ol, output, optgroup, option, p, pre, progress, q, s, samp, section, select, small, sound, source, spacer, span, strike, strong, sub, sup, table, tbody, td, textarea, time, tfoot, th, thead, tr, tt, u, ul, var, video
The following HTML attributes are allowed by default (all others are stripped): abbr, accept, accept-charset, accesskey, action, align, alt, autoplay, autocomplete, autofocus, axis, background, balance, bgcolor, bgproperties, border, bordercolor, bordercolordark, bordercolorlight, bottompadding, cellpadding, cellspacing, ch, challenge, char, charoff, choff, charset, checked, cite, class, clear, color, cols, colspan, compact, contenteditable, coords, data, datafld, datapagesize, datasrc, datetime, default, delay, dir, disabled, draggable, dynsrc, enctype, end, face, for, form, frame, galleryimg, gutter, headers, height, hidefocus, hidden, high, href, hreflang, hspace, icon, id, inputmode, ismap, keytype, label, leftspacing, lang, list, longdesc, loop, loopcount, loopend, loopstart, low, lowsrc, max, maxlength, media, method, min, multiple, name, nohref, noshade, nowrap, open, optimum, pattern, ping, point-size, prompt, pqg, radiogroup, readonly, rel, repeat-max, repeat-min, replace, required, rev, rightspacing, rows, rowspan, rules, scope, selected, shape, size, span, src, start, step, summary, suppress, tabindex, target, template, title, toppadding, type, unselectable, usemap, urn, valign, value, variable, volume, vspace, vrml, width, wrap, xml:lang
The following SVG elements are allowed by default (all others are stripped): a, animate, animateColor, animateMotion, animateTransform, circle, defs, desc, ellipse, foreignObject, font-face, font-face-name, font-face-src, g, glyph, hkern, linearGradient, line, marker, metadata, missing-glyph, mpath, path, polygon, polyline, radialGradient, rect, set, stop, svg, switch, text, title, tspan, use
The following SVG attributes are allowed by default (all others are stripped): accent-height, accumulate, additive, alphabetic, arabic-form, ascent, attributeName, attributeType, baseProfile, bbox, begin, by, calcMode, cap-height, class, color, color-rendering, content, cx, cy, d, dx, dy, descent, display, dur, end, fill, fill-opacity, fill-rule, font-family, font-size, font-stretch, font-style, font-variant, font-weight, from, fx, fy, g1, g2, glyph-name, gradientUnits, hanging, height, horiz-adv-x, horiz-origin-x, id, ideographic, k, keyPoints, keySplines, keyTimes, lang, mathematical, marker-end, marker-mid, marker-start, markerHeight, markerUnits, markerWidth, max, min, name, offset, opacity, orient, origin, overline-position, overline-thickness, panose-1, path, pathLength, points, preserveAspectRatio, r, refX, refY, repeatCount, repeatDur, requiredExtensions, requiredFeatures, restart, rotate, rx, ry, slope, stemh, stemv, stop-color, stop-opacity, strikethrough-position, strikethrough-thickness, stroke, stroke-dasharray, stroke-dashoffset, stroke-linecap, stroke-linejoin, stroke-miterlimit, stroke-opacity, stroke-width, systemLanguage, target, text-anchor, to, transform, type, u1, u2, underline-position, underline-thickness, unicode, unicode-range, units-per-em, values, version, viewBox, visibility, width, widths, x, x-height, x1, x2, xlink:actuate, xlink:arcrole, xlink:href, xlink:role, xlink:show, xlink:title, xlink:type, xml:base, xml:lang, xml:space, xmlns, xmlns:xlink, y, y1, y2, zoomAndPan
The following MathML elements are allowed by default (all others are stripped): annotation, annotation-xml, maction, math, merror, mfenced, mfrac, mi, mmultiscripts, mn, mo, mover, mpadded, mphantom, mprescripts, mroot, mrow, mspace, msqrt, mstyle, msub, msubsup, msup, mtable, mtd, mtext, mtr, munder, munderover, none, semantics
The following MathML attributes are allowed by default (all others are stripped): actiontype, align, columnalign, close, columnlines, columnspacing, columnspan, depth, display, displaystyle, encoding, equalcolumns, equalrows, fence, fontstyle, fontweight, frame, height, linethickness, lspace, mathbackground, mathcolor, mathvariant, maxsize, minsize, open, other, rowalign, rowlines, rowspacing, rowspan, rspace, scriptlevel, selection, separator, separators, stretchy, width, xlink:href, xlink:show, xlink:type, xmlns, xmlns:xlink
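To make the "allowed by default, all others stripped" rule concrete, here is a minimal sketch of allowlist-based filtering built on Python's standard html.parser module. It is not Universal Feed Parser's own code: the class name AllowlistSanitizer, the sanitize helper, and the tiny element and attribute sets are hypothetical stand-ins for the full lists above, and dropping the text inside script and style is an assumption of this sketch.

```python
from html import escape
from html.parser import HTMLParser

# Tiny illustrative subsets; the real lists are the ones given above.
ALLOWED_ELEMENTS = {"b", "em", "i", "p", "span", "strong"}
ALLOWED_ATTRIBUTES = {"class", "lang", "title"}
# Assumption of this sketch: text inside these elements is dropped too.
DROP_CONTENT = {"script", "style"}


class AllowlistSanitizer(HTMLParser):
    """Hypothetical sketch: keep allowlisted tags and attributes, strip the rest."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_ELEMENTS:
            return  # unknown tag: drop the tag itself, keep its text
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in ALLOWED_ATTRIBUTES
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag in ALLOWED_ELEMENTS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(markup):
    parser = AllowlistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(sanitize('<p onclick="evil()">Hi <script>alert(1)</script><b class="x">there</b></p>'))
```

The point of the design is that nothing gets through unless it is named explicitly: the onclick attribute and the script element disappear without the code ever having to recognize them as dangerous.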
The following CSS properties are allowed by default in style attributes (all others are stripped): azimuth, background-color, border-bottom-color, border-collapse, border-color, border-left-color, border-right-color, border-top-color, clear, color, cursor, direction, display, elevation, float, font, font-family, font-size, font-style, font-variant, font-weight, height, letter-spacing, line-height, overflow, pause, pause-after, pause-before, pitch, pitch-range, richness, speak, speak-header, speak-numeral, speak-punctuation, speech-rate, stress, text-align, text-decoration, text-indent, unicode-bidi, vertical-align, voice-family, volume, white-space, width
Not all possible CSS values are allowed for these properties. The allowable values are restricted by a whitelist and a regular expression that allows color values and lengths. URIs are not allowed, to prevent platypus attacks. See the _HTMLSanitizer class for more details.
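As a rough illustration of that approach, the sketch below filters a style attribute with a property whitelist, a keyword whitelist, and a regular expression for colors and lengths. It is a simplified stand-in, not the logic of the _HTMLSanitizer class: the function name sanitize_style, the short property and keyword sets, and the exact regular expression are all assumptions made for this example.

```python
import re

# Illustrative subsets only; the real property list is the one given above.
ALLOWED_PROPERTIES = {"background-color", "color", "font-size", "text-align", "width"}
ALLOWED_KEYWORDS = {"auto", "blue", "bold", "center", "italic", "left", "none", "red", "right"}

# Hex colors such as #fff or #a0b1c2, and simple lengths such as 10px, 1.5em, 80%.
SAFE_VALUE = re.compile(
    r"^(#[0-9a-fA-F]{3}(?:[0-9a-fA-F]{3})?"
    r"|-?\d*\.?\d+(?:px|em|ex|pt|pc|cm|mm|in|%)?)$"
)


def sanitize_style(style):
    """Keep only declarations whose property and every value token are allowed."""
    kept = []
    for declaration in style.split(";"):
        prop, sep, value = declaration.partition(":")
        if not sep:
            continue  # not a "property: value" pair
        prop = prop.strip().lower()
        value = value.strip()
        if prop not in ALLOWED_PROPERTIES:
            continue
        tokens = value.split()
        if tokens and all(
            token.lower() in ALLOWED_KEYWORDS or SAFE_VALUE.match(token)
            for token in tokens
        ):
            kept.append(f"{prop}: {value}")
    return "; ".join(kept)


print(sanitize_style("color: red; background: url(javascript:alert(1)); width: 10px"))
print(repr(sanitize_style("color: expression(alert(1))")))
```

Because url(...) and expression(...) match neither the keyword list nor the pattern, they are dropped without any special-case code.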
I am often asked why Universal Feed Parser is so hard-assed about HTML and CSS sanitizing. To illustrate the problem, here is an incomplete list of potentially dangerous HTML tags and attributes:
- script, which can contain malicious script
- applet, embed, and object, which can automatically download and execute malicious code
- meta, which can contain malicious redirects
- onload, onunload, and all other on* attributes, which can contain malicious script
- style, link, and the style attribute, which can contain malicious script
style? Yes, style. CSS definitions can contain executable code.
Example: Embedding JavaScript in CSS
This sample is taken from http://nightly.feedparser.org/docs/examples/rss20.xml:
<description>Watch out for <span style="background: url(javascript:window.location='http://example.org/')"> nasty tricks</span></description>
This sample is more advanced, and does not contain the keyword javascript: that many naive HTML sanitizers scan for:
<description>Watch out for <span style="any: expression(window.location='http://example.org/')"> nasty tricks</span></description>
Internet Explorer for Windows will execute the JavaScript in both of these examples.
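To see why scanning for a keyword is not enough, consider a deliberately naive check of the kind described above. The function naive_is_safe is hypothetical and exists only for this illustration; it is not part of Universal Feed Parser.

```python
def naive_is_safe(markup):
    """A deliberately naive blacklist: reject only markup containing 'javascript:'."""
    return "javascript:" not in markup.lower()


first = "<span style=\"background: url(javascript:window.location='http://example.org/')\">"
second = "<span style=\"any: expression(window.location='http://example.org/')\">"

print(naive_is_safe(first))   # False: the keyword is present, so this one is caught
print(naive_is_safe(second))  # True: expression(...) passes the check unnoticed
```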
Now consider that in HTML, attribute values may be entity-encoded in several different ways.
Example: Embedding encoded JavaScript in CSS
To a browser, this:
<span style="any: expression(window.location='http://example.org/')">
is the same as this (without the line breaks):
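<span style="&#97;&#110;&#121;&#58;&#32;&#101;&#120;&#112;&#114;&#101;
&#115;&#115;&#105;&#111;&#110;&#40;&#119;&#105;&#110;&#100;&#111;&#119;
&#46;&#108;&#111;&#99;&#97;&#116;&#105;&#111;&#110;&#61;&#39;&#104;
&#116;&#116;&#112;&#58;&#47;&#47;&#101;&#120;&#97;&#109;&#112;&#108;
&#101;&#46;&#111;&#114;&#103;&#47;&#39;&#41;">
which is the same as this (without the line breaks):
<span style="&#x61;&#x6e;&#x79;&#x3a;&#x20;&#x65;&#x78;&#x70;&#x72;
&#x65;&#x73;&#x73;&#x69;&#x6f;&#x6e;&#x28;&#x77;&#x69;&#x6e;
&#x64;&#x6f;&#x77;&#x2e;&#x6c;&#x6f;&#x63;&#x61;&#x74;&#x69;
&#x6f;&#x6e;&#x3d;&#x27;&#x68;&#x74;&#x74;&#x70;&#x3a;&#x2f;
&#x2f;&#x65;&#x78;&#x61;&#x6d;&#x70;&#x6c;&#x65;&#x2e;&#x6f;
&#x72;&#x67;&#x2f;&#x27;&#x29;">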
And so on, plus several other variations, plus every combination of every variation.
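The equivalence is easy to check with Python's standard html module. The sketch below builds decimal and hexadecimal character-reference encodings of the same attribute value and decodes them again; the variable names are made up for this example, and html.unescape is only a stand-in for the decoding a browser performs on attribute values.

```python
import html

plain = "any: expression(window.location='http://example.org/')"

# Encode every character as a decimal or hexadecimal character reference.
decimal = "".join(f"&#{ord(ch)};" for ch in plain)
hexadecimal = "".join(f"&#x{ord(ch):x};" for ch in plain)

# Both encodings decode back to exactly the same attribute value.
print(html.unescape(decimal) == plain)
print(html.unescape(hexadecimal) == plain)

# A scan of the raw, still-encoded text finds nothing suspicious.
print("expression" in decimal)
print("expression" in hexadecimal)
```

The first two checks print True and the last two print False, so any filter that inspects attribute values has to decode them first, and even then it is back to guessing which decoded strings a browser will execute.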
The more I investigate, the more cases I find where Internet Explorer for Windows will treat seemingly innocuous markup as code and blithely execute it. This is why Universal Feed Parser uses a whitelist and not a blacklist. I am reasonably confident that none of the elements or attributes on the whitelist are security risks. I am not at all confident about elements or attributes that I have not explicitly investigated. And I have no confidence at all in my ability to detect strings within attribute values that Internet Explorer for Windows will treat as executable code.
Elsewhere
- How to consume RSS safely explains the platypus attack.