This is not an HTML beginner's book. The intended audience is web developers who want to gain a deeper understanding of how the HTML parser works, or the history and rationale behind certain behaviors. Some prior knowledge of HTML and the DOM is assumed. If you are going to implement your own HTML parser (awesome!), then this book will hopefully be helpful, but please implement from the HTML standard. If you contribute to a browser engine or to web standards (awesome!), then this book will hopefully be helpful. If nothing else, I hope it will at least be an interesting read.
Dictionary.com offers the following definition of parse in the context of computers:
to analyze (a string of characters) in order to associate groups of characters with the syntactic units of the underlying grammar.
The Wikipedia page for Parsing offers the following:
Parsing, syntax analysis or syntactic analysis is the process of analysing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech).
In the context of HTML, the HTML parser is responsible for the process of converting a stream of characters (the HTML markup) to a tree representation known as the Document Object Model (the DOM).
This book covers the history of HTML parsers, how to write syntactically correct HTML, how an HTML parser works, including error handling, what can be done with the parsed DOM representation, and how to serialize it back to a string. It also covers parsing of some HTML microsyntaxes (parsing of some attribute values), which are strictly speaking not part of the HTML parser, but a layer above. It further discusses implementations and conformance checkers.
Knowing exactly how the HTML parser works is not necessary to be a successful web developer. However, some things can be good to know, and having a deeper understanding makes it easier to reason about its behavior. It can also be good to know that you should usually pull in an HTML parser instead of writing a regular expression to "parse" HTML.
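As a rough illustration of the last point, the sketch below compares a naive regular expression with a parser-based approach for collecting link targets. It uses Python's standard-library html.parser module, which is a lenient tokenizing parser and, as far as I know, does not implement the tree construction algorithm of the HTML standard. The class name LinkCollector, the helper functions, and the sample markup are all made up for this example.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag and attribute names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_with_parser(markup):
    """Return href values found by feeding the markup to the parser."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


def links_with_regex(markup):
    """Naive pattern that assumes lowercase names and double quotes."""
    return re.findall(r'<a href="([^"]*)"', markup)


# Sample markup (invented): single quotes, uppercase names,
# an unquoted value, and a link that only exists inside a comment.
markup = (
    "<a href='/single'>one</a>"
    '<A HREF="/upper">two</A>'
    "<a class=x href=/unquoted>three</a>"
    '<!-- <a href="/commented">four</a> -->'
)

print(links_with_parser(markup))
print(links_with_regex(markup))
```

With this input, the regular expression misses all three real links and reports the one inside the comment, while the parser-based version reports the three real links and skips the comment. A more elaborate pattern could cover these cases, but each new case tends to reveal another one, which is the argument for using a parser.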
The following is a non-exhaustive list of things that would be good for most web developers to understand about the HTML parser.
Implied tags/omitted tags. Some tags are optional, and some tags are implied without being optional. This explains why, for example, it's not possible to nest an
<p>. This is discussed in the Implied tags section in Chapter 3. The HTML parser.
document.body being null. Before the <body> has been parsed, document.body is null. See Chapter 4. Scripting complications.
Scripting and styling. Knowing what the DOM will look like helps when working with the DOM from script or when writing selectors in CSS. This has some overlap with implied tags. For example, <tbody> is implied in <table> even if that tag is not present in the markup.
Writing correct HTML. Knowing how the parser works may give you more confidence in how to write HTML. For example, a relatively common error is to use "/>" syntax on a non-void HTML element (br is a void element, div is not void), although that is not supported (it will be treated as a regular start tag, ignoring the slash). See Chapter 2. The HTML syntax.
Security. For example, cross-site scripting (XSS) attacks sometimes target holes in sanitizers. Such attacks may be prevented by using an HTML parser-based sanitizer. See Chapter 6. Security implications.
Web compatibility. The HTML parser specification is known to be compatible with HTML as it is used on the web. When Opera implemented the specified HTML parser, it eliminated 20% of its web compatibility bugs (of any kind).
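The point above about "/>" on non-void elements can be sketched in code. The following is a simplified illustration, not the HTML standard's parsing algorithm: it uses Python's standard-library html.parser module and overrides its default handling of "/>" (which reports a start tag followed by an end tag) so that the slash is ignored, as described above. The class name OpenElementTracker and the helper still_open are hypothetical, the list of void elements is the one I understand the current HTML standard to define, and the sketch ignores SVG and MathML content, where the self-closing slash is honored.

```python
from html.parser import HTMLParser

# Void elements as I understand the HTML standard to list them.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


class OpenElementTracker(HTMLParser):
    """Hypothetical helper: tracks which elements are still open."""

    def __init__(self):
        super().__init__()
        self.open_elements = []

    def handle_starttag(self, tag, attrs):
        # Void elements never have content, so they are never left open.
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Ignore the trailing slash: <div/> is handled like <div>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Simplification: only close the current element on an exact match.
        if self.open_elements and self.open_elements[-1] == tag:
            self.open_elements.pop()


def still_open(markup):
    """Return the names of elements left open at the end of the markup."""
    tracker = OpenElementTracker()
    tracker.feed(markup)
    tracker.close()
    return tracker.open_elements


print(still_open("<div/>text<br/>more"))
print(still_open("<div></div><br>"))
```

In the first call the div is still open at the end of the input, because the slash did not close it, while the br leaves nothing open. A real HTML parser does considerably more work on end tags and mismatched nesting than this sketch does.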
Simon started contributing to the WHATWG in 2005, worked at Opera Software on Quality Assurance and web standards between 2007 and 2017, and currently works with web standards and web platform testing at Bocoup. He contributed to the design of some aspects of the HTML parser specification, such as how SVG in HTML works and finding a web-compatible way to tokenize script elements. He edited the specification for the picture element from 2014 onwards and is currently an editor of the WHATWG HTML standard and the WHATWG Quirks Mode standard. His Twitter handle is @zcorpan.
Thanks to Mathias Bynens for suggesting the platypus for the front cover (I asked on Twitter "If the HTML parser were an animal, what would it be?").
The platypus sketch on the front cover is from Wikipedia, by Hmich176, with the following licenses:
GNU Free Documentation License
Creative Commons Attribution-ShareAlike 3.0
The font used on the front cover is Archistico, by Archistico, and has the following license:
You can use the font for commercial purposes, but not sell it! Every once in a comment on the page would be nice.
This book contains quotes from the WHATWG HTML Standard which has the following copyright and license:
Copyright © 2018 WHATWG (Apple, Google, Mozilla, Microsoft). This work is licensed under a Creative Commons Attribution 4.0 International License.
Thanks to Ian Hickson and Henri Sivonen for letting me quote their emails, blog posts, etc. in this book.
Thanks to Ingvar Stepanyan for letting me use some of his Twitter quizzes in this book.
Thanks to Mike Smith for providing a raw log from a validator instance for the Most common errors section in Appendix B. Conformance checkers.
Thanks to Marcos Caceres, Sam Sneddon, Taylor Hunt, Mike Smith, Anne van Kesteren, Marie Staver, Ian Hickson, Mathias Bynens, Henri Sivonen, and Philip Jägenstedt for reviewing this book.
Thanks to Jens Oliver Meiert for contributing fixes for this book.
The source code for this book is available on GitHub. This book and the source code are licensed under CC-BY-4.0. Feel free to report issues, submit pull requests, fork, etc.! If you wish to make a translation or otherwise reuse the work, you are welcome to do so (as allowed by the license). Please report an issue first, both to avoid duplicate work and so I can help get you set up.
In the web version of this book, there is a feedback link in the bottom-right corner. You can select some text and click the feedback link to create a new issue about the selected text in the GitHub repository. The link has accesskey="1" so it can be activated with the keyboard. How to activate it depends on the browser and OS; see the documentation on MDN about accesskey.
If you use Twitter, you can provide feedback or ask questions there at @htmlparserbook. You can follow this account if you want to be notified about new commits.