SGML (Standard Generalized Markup Language) was developed as a markup language that both humans and machines could read. Its rules, however, varied from document to document: because they were specified through DTDs, each SGML file could follow different rules. Defining rules with DTDs offered flexibility, but every document then had to adhere strictly to the rules it declared. SGML still turns up in old codebases, but it is rarely used for new work today.
DTD (Document Type Definition) is used to specify the rules of SGML. Let’s write a simple SGML file and then see how it would be defined with a DTD.
NOTE: The example is highlighted as XML, but it is an SGML file. As you can see, the two have a very similar structure.
// book.sgml
<!DOCTYPE bookshelf SYSTEM "bookshelf.dtd">
<bookshelf>
  <book>
    <title>Clean Architecture</title>
    <author>Robert C. Martin</author>
    <price currency="USD">45.00</price>
    <pages>272</pages>
    <publisher>Prentice Hall</publisher>
  </book>
  <book>
    <title>The Pragmatic Programmer</title>
    <author>Andrew Hunt</author>
    <price currency="EUR">35.00</price>
    <publisher>Addison-Wesley</publisher>
  </book>
</bookshelf>
A quick read is enough to see that this describes a bookshelf containing a list of books, and the structure maps easily onto an object in a programming language. In that sense, SGML achieves its original purpose.
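To make the "convert this into an object" step concrete, here is a minimal sketch in Python. It assumes the sample happens to be well-formed XML too, so the standard library's XML parser can read it (the DOCTYPE line is left out), and the Book class and load_books function are names I made up for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

# The same data as book.sgml, without the DOCTYPE line.
BOOKSHELF = """
<bookshelf>
  <book>
    <title>Clean Architecture</title>
    <author>Robert C. Martin</author>
    <price currency="USD">45.00</price>
    <pages>272</pages>
    <publisher>Prentice Hall</publisher>
  </book>
  <book>
    <title>The Pragmatic Programmer</title>
    <author>Andrew Hunt</author>
    <price currency="EUR">35.00</price>
    <publisher>Addison-Wesley</publisher>
  </book>
</bookshelf>
"""


@dataclass
class Book:
    # Every value stays a string here: the markup itself carries no types.
    title: str
    author: str
    price: str
    currency: Optional[str]
    pages: Optional[str]
    publisher: str


def load_books(markup: str) -> List[Book]:
    """Turn each <book> element into a Book object."""
    root = ET.fromstring(markup)
    books = []
    for node in root.findall("book"):
        price = node.find("price")
        books.append(Book(
            title=node.findtext("title"),
            author=node.findtext("author"),
            price=price.text,
            currency=price.get("currency"),
            pages=node.findtext("pages"),  # None when <pages> is absent
            publisher=node.findtext("publisher"),
        ))
    return books


books = load_books(BOOKSHELF)
print(books[1].title, books[1].pages)
```

Note that the second book simply ends up with pages set to None, because the element is missing from the data.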
Can I add another field to this data type, and who decides the rules for that? The DTD does.
// bookshelf.dtd
<!ELEMENT bookshelf (book+)>
<!ELEMENT book (title, author, price, pages?, publisher)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT pages (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ATTLIST price currency CDATA #IMPLIED>
What does this mean?
- In this DTD, the bookshelf element must contain one or more book elements (book+). In other words, it is effectively a non-empty array.
- Each book element contains title, author, price, pages, and publisher elements, in that order, and the content of each is declared as PCDATA (Parsed Character Data).
- The title, author, price, and publisher elements are mandatory, but the pages element is optional (pages?).
- The price element has a currency attribute whose value is of type CDATA (Character Data). Because it is declared #IMPLIED, the attribute may be omitted.
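One way to read the content models in the DTD is as regular expressions over the names of an element's children. The sketch below is only an illustration of that idea, not how a real SGML parser is implemented; CONTENT_MODELS and matches_content_model are hypothetical names, and the standard library XML parser stands in for an SGML one.

```python
import re
import xml.etree.ElementTree as ET

# Content models from bookshelf.dtd, rewritten as regular expressions over
# the comma-terminated sequence of child element names.
CONTENT_MODELS = {
    "bookshelf": re.compile(r"(book,)+"),                           # (book+)
    "book": re.compile(r"title,author,price,(pages,)?publisher,"),  # pages?
}


def matches_content_model(element: ET.Element) -> bool:
    """Return True if the element's children appear in an order the DTD allows."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return True  # #PCDATA elements: nothing to check in this sketch
    children = "".join(child.tag + "," for child in element)
    return model.fullmatch(children) is not None


valid = ET.fromstring(
    "<book><title>T</title><author>A</author>"
    "<price>1</price><publisher>P</publisher></book>"
)
missing_author = ET.fromstring(
    "<book><title>T</title><price>1</price><publisher>P</publisher></book>"
)
empty_shelf = ET.fromstring("<bookshelf></bookshelf>")

print(matches_content_model(valid))           # True
print(matches_content_model(missing_author))  # False
print(matches_content_model(empty_shelf))     # False: book+ needs at least one book
```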
This way, a data type for the bookshelf was created. However, because a DTD can only declare element content as character data, constraints such as requiring the number of pages to be an integer or the price to be a float could not be expressed. These checks had to be done at the application layer, not the data layer.
NOTE: These limitations were eliminated with the advent of XML and the creation of XSD in later years.
Why Did SGML Fall Out of Use?
There are many reasons why SGML fell out of use. The main one is that SGML parsers could follow very different rules, which led to inconsistencies between systems. Imagine that your SGML parser accepts self-closing tags, but the client consuming your document does not. The file could then be misinterpreted or not parsed at all. Here are some of the points on which parsers could differ:
- Whether names are case-sensitive
- Whether closing tags are required
- Whether self-closing tags are supported
- Which character encodings are supported
- Whether attribute values must be enclosed in quotes
- Naming rules for attributes
Because each SGML parser could apply these rules differently, SGML eventually fell out of favor. So, what was introduced as a solution?
The Emergence of XML
XML (eXtensible Markup Language) was defined as a subset of SGML. By replacing SGML’s many optional behaviors with a single strict set of rules, it ensured that every parser reads the same document the same way.
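To get a feel for what strict means in practice, the sketch below feeds a few snippets to Python's built-in XML parser. The SAMPLES dictionary and the is_well_formed helper are my own illustrative names; the point is only that a conforming XML parser rejects all of the loose variants instead of guessing.

```python
import xml.etree.ElementTree as ET

# One well-formed snippet and three that break XML's rules.
SAMPLES = {
    "well-formed": '<price currency="USD">45.00</price>',
    "unquoted attribute": "<price currency=USD>45.00</price>",
    "mismatched case": "<Title>Clean Architecture</title>",
    "missing end tag": "<book><title>Clean Architecture</book>",
}


def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup without an error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


results = {name: is_well_formed(markup) for name, markup in SAMPLES.items()}
for name, ok in results.items():
    print(f"{name}: {'accepted' if ok else 'rejected'}")
```

Only the first snippet is accepted. An SGML parser, depending on its configuration, might have accepted several of the others.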
// book.xml
<?xml version="1.0" encoding="UTF-8"?>
<bookshelf
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="bookshelf.xsd"
>
  <book>
    <title>Clean Architecture</title>
    <author>Robert C. Martin</author>
    <price currency="USD">45.00</price>
    <pages>272</pages>
    <publisher>Prentice Hall</publisher>
  </book>
  <book>
    <title>The Pragmatic Programmer</title>
    <author>Andrew Hunt</author>
    <price currency="EUR">35.00</price>
    <publisher>Addison-Wesley</publisher>
  </book>
</bookshelf>
This XML file looks almost exactly like the SGML file. The only differences are at the top: the DOCTYPE declaration is gone, an XML declaration has been added, and the xsi:noNamespaceSchemaLocation attribute points to the XSD file the document is meant to conform to.
// bookshelf.xsd
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="bookshelf">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title" type="xs:string"/>
              <xs:element name="author" type="xs:string"/>
              <xs:element name="price">
                <xs:complexType>
                  <xs:simpleContent>
                    <xs:extension base="xs:float">
                      <xs:attribute name="currency" type="xs:string" use="required"/>
                    </xs:extension>
                  </xs:simpleContent>
                </xs:complexType>
              </xs:element>
              <xs:element name="pages" type="xs:integer" minOccurs="0"/>
              <xs:element name="publisher" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
In this XSD file, the rules are specified much as they were in the DTD, but more strictly. For example, pages is not mandatory, but if present it must be an integer. Similarly, price must be a float, and its currency attribute must be specified.
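Python's standard library has no XSD validator, so the sketch below checks those three rules by hand. Treat it as a rough stand-in: check_book is a hypothetical helper, and int() and float() only approximate the xs:integer and xs:float types. A real project would hand bookshelf.xsd to an XSD-aware library instead.

```python
import xml.etree.ElementTree as ET


def check_book(book: ET.Element) -> list:
    """Collect violations of the typed rules (hand-written, not real XSD validation)."""
    errors = []

    # pages is optional, but must be an integer when present.
    pages = book.findtext("pages")
    if pages is not None:
        try:
            int(pages)
        except ValueError:
            errors.append("pages must be an integer")

    # price must be a float and must carry a currency attribute.
    price = book.find("price")
    if price is None:
        errors.append("price is missing")
    else:
        try:
            float(price.text or "")
        except ValueError:
            errors.append("price must be a float")
        if price.get("currency") is None:
            errors.append("price needs a currency attribute")

    return errors


good = ET.fromstring(
    '<book><title>T</title><author>A</author>'
    '<price currency="USD">45.00</price><publisher>P</publisher></book>'
)
bad = ET.fromstring(
    "<book><title>T</title><author>A</author>"
    "<price>cheap</price><pages>many</pages><publisher>P</publisher></book>"
)

print(check_book(good))  # []
print(check_book(bad))
```

The second book would have passed the DTD, since both values are just character data there.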
XML schemas look more complex, but with XML’s arrival, inconsistencies between systems were eliminated, and stricter data validation features were introduced.
Is HTML an SGML Application?
HTML (HyperText Markup Language) was an application of SGML until HTML5. Until then, every document declared its document type and the DTD it conformed to. There were variants such as Transitional, Strict, and Frameset, each with its own DTD, and the DOCTYPE at the top of the page told the browser which DTD the page claimed to follow.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
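NOTE: These particular DOCTYPEs are the XHTML 1.0 variants; HTML 4.01 had the same three flavors with its own DTDs. You can open the URLs to read the DTDs themselves.
In HTML5, the DTD reference was dropped, and the DOCTYPE simply states that the file is HTML.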
<!DOCTYPE html>
So yes, HTML was an SGML application until HTML5. HTML5, however, defined its own rules: a simpler structure that could be adopted without breaking older pages. Its parsing rules are in some ways even more forgiving than SGML’s, so that browsers can render poorly formatted (malformed) HTML without the page breaking. Still, it would not be entirely accurate to say that HTML5 is simply more flexible than SGML. For example, placing a <div> element inside a <p> element is still incorrect in HTML5.
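The difference in attitude is easy to see with Python's standard library. In the sketch below, the lenient html.parser module walks through a malformed snippet without complaint, while the XML parser rejects the same text. Keep in mind that html.parser is only a tokenizer and does not implement the HTML5 tree-building rules a browser follows (for instance, a browser implicitly closes the open <p> when it meets the <div>). TagCollector and MALFORMED are illustrative names.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Unclosed <p> and <b>, plus a <div> inside a <p>.
MALFORMED = "<p>An unclosed paragraph <b>with bold text<div>and a div</div>"


class TagCollector(HTMLParser):
    """Record every start tag the lenient HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


collector = TagCollector()
collector.feed(MALFORMED)
collector.close()
print(collector.start_tags)  # ['p', 'b', 'div']

# The strict XML parser refuses the same input.
try:
    ET.fromstring(MALFORMED)
    xml_accepts = True
except ET.ParseError:
    xml_accepts = False
print(xml_accepts)  # False
```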
Of course, it wasn’t limited to this. HTML5 brought many new features, such as the <video> and <audio> elements. Before this, there were the browser wars: because specific standards were lacking, browsers like Netscape, Firefox, IE, and Opera could introduce their own features, causing pages to look different across browsers.
WHATWG (Web Hypertext Application Technology Working Group) put an end to this by creating HTML5. HTML5 does not have its own DTD. Instead, it became a living standard, meaning developers wouldn’t need to use a different DTD to use new features.
What is XHTML Then?
XHTML (Extensible HyperText Markup Language) is HTML reformulated to comply with XML. Its versions include XHTML 1.0 and XHTML 1.1, as well as XHTML 2.0, which was never finalized. It was developed before the arrival of HTML5.
Remember, the goal of XML was to impose strict rules on SGML’s variability. XHTML applied the same logic to HTML.
Conclusion
Chronologically, everything developed as follows:
Developers needed a language that both machines and humans could understand, and so SGML emerged. DTDs were developed to define SGML’s rules. HTML was built as an SGML application. Later, XML arrived to address SGML’s inconsistency, and XHTML was developed to bring the same strictness to HTML before HTML5. Finally, HTML5 was introduced as a more modern, living standard.