A client-side macro processor, particularly one that could include arbitrary Web documents, would almost certainly be a useful Web enhancement.
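The idea can be sketched in a few lines of Python. The #include(url) directive, the fetch interface, and the example URL below are all my own inventions for illustration, not any existing standard; a real client would issue HTTP requests where this sketch looks documents up in a dictionary:

```python
import re

def expand_macros(text, fetch, depth=0):
    """Recursively expand hypothetical #include(url) directives.

    `fetch` maps a URL to its document text; a real client-side
    processor would fetch over the network. Depth-limited so that
    circular includes cannot recurse forever.
    """
    if depth > 10:
        raise RecursionError("include nesting too deep")
    def repl(match):
        return expand_macros(fetch(match.group(1)), fetch, depth + 1)
    return re.sub(r"#include\((.*?)\)", repl, text)

docs = {"http://example.com/footer": "-- shared footer --"}
page = "Body text.\n#include(http://example.com/footer)\n"
print(expand_macros(page, docs.get))
```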
Although cumbersome at times, Emacs seems to be the hands down winner for automating document conversion. What other program lets you convert a document, examine the result, undo the operation with a single command, modify the LISP code driving the conversion, then try it again with nary a single wasted keystroke?
For document conversion purposes, the most useful change to EMACS would be an enhancement or replacement for regular expressions. In particular, it would be nice if regular expressions could count, to more easily interpret varying indentation levels.
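For illustration, here is a short Python sketch of the kind of counting I mean. The regular expression can find the leading whitespace, but ordinary code must measure it to recover the outline level; the two-space indent width is an assumption for the example:

```python
import re

def outline_depth(line, indent_width=2):
    """Classify a line's outline level by its leading spaces.

    A single regular expression cannot express "N levels deep" for
    arbitrary N; counting the length of the matched whitespace can.
    """
    leading = re.match(r"[ ]*", line).group(0)
    return len(leading) // indent_width

print(outline_depth("item"))        # 0
print(outline_depth("    sub-item"))  # 2
```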
I developed a prototype for such a system by taking Frolic (a Prolog-like language written in Common Lisp), and porting it to EMACS Lisp. After making a few enhancements, I was able to write grammar-like constructions to describe blocks of text. See prolog.el and match.el. Performance is absolutely horrible. This example describes one way to satisfy the match-contents-lines rule:
(*- (match-contents-lines _Start _End)
    (match-contents-line _Start _X)
    (match "\\s-*<BR>\n" _X _Y)
    (match-contents-lines _Y _End))
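For readers without a Prolog engine handy, here is a rough Python approximation of what the rule does: match one contents line, then a "<BR>" separator, then recur on the rest. It is greedy and non-backtracking, so it is only a sketch of the behavior, and the inner function is a stand-in for the real match-contents-line rule:

```python
import re

def match_contents_lines(text, pos=0):
    """Greedy approximation of the Prolog rule: repeatedly consume
    one contents line followed by a "<BR>" separator, returning the
    position where matching stopped."""
    def match_contents_line(text, pos):
        # Stand-in for the real grammar rule: any run of characters
        # that is not a newline or the start of a tag.
        m = re.compile(r"[^\n<]+").match(text, pos)
        return m.end() if m else None
    while True:
        end = match_contents_line(text, pos)
        if end is None:
            return pos
        sep = re.compile(r"\s*<BR>\n").match(text, end)
        if sep is None:
            return end
        pos = sep.end()

print(match_contents_lines("one <BR>\ntwo <BR>\nthree\n"))
```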
I plan to try this again sometime, only with an external Prolog engine (like SB-Prolog), modified to perform RPC interactions over a controlling TCP connection driven by EMACS Lisp networking code.
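The protocol I have in mind is nothing fancy: one query per line over the TCP connection, one answer line back. Here is a Python sketch with a stub standing in for the modified Prolog engine; the line-oriented protocol and the yes(...) reply format are my assumptions for illustration, not actual SB-Prolog behavior:

```python
import socket
import threading

def stub_prolog_server(sock):
    """Stands in for a modified external Prolog engine answering
    queries over TCP. Assumed protocol: one query per line, one
    answer per line."""
    f = sock.makefile("rw")
    for query in f:
        # A real engine would solve the query; the stub just echoes
        # it back inside a success wrapper.
        f.write(f"yes({query.strip()})\n")
        f.flush()

client, server = socket.socketpair()
threading.Thread(target=stub_prolog_server, args=(server,), daemon=True).start()

cf = client.makefile("rw")
cf.write("match_contents_lines(Start, End)\n")
cf.flush()
reply = cf.readline().strip()
print(reply)
```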
Simply establishing that URL addresses point somewhere would be a big help to many Web sites. For larger Web projects, with hundreds or thousands of Web pages, conformance with a standard format is essential if the full potential of tools like sed is to be realized.
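Even a crude checker pays for itself. Here is a Python sketch, assuming the site is represented as a map from page paths to the link targets each page contains; a real checker would parse the HTML itself and probe external URLs over HTTP:

```python
from urllib.parse import urlparse

def find_dangling(pages):
    """Report internal links whose targets do not exist in `pages`.

    `pages` maps each page path to the list of link targets found
    on that page. External links (those with a URL scheme) are
    skipped; checking them requires network access.
    """
    known = set(pages)
    return sorted(
        (page, target)
        for page, targets in pages.items()
        for target in targets
        if urlparse(target).scheme == "" and target not in known
    )

site = {
    "/index.html": ["/about.html", "/missing.html"],
    "/about.html": ["/index.html", "http://example.com/"],
}
print(find_dangling(site))  # [('/index.html', '/missing.html')]
```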
Until powerful, standard search tools are easily available, build your own search engine along the way. Don't wait until the end to add the search facilities, since a good search tool will aid construction just as much as browsing.
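The search engine need not be sophisticated to be useful during construction. A minimal inverted index, sketched in Python with invented example pages:

```python
import re
from collections import defaultdict

def build_index(pages):
    """Build an inverted index over a {path: text} map — the kind
    of thing worth maintaining as pages are written, rather than
    bolting on at the end."""
    index = defaultdict(set)
    for path, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(path)
    return index

def search(index, *words):
    """Return the pages containing every query word."""
    sets = [index.get(w.lower(), set()) for w in words]
    return sorted(set.intersection(*sets)) if sets else []

idx = build_index({
    "/emacs.html": "Emacs drives document conversion",
    "/html.html": "HTML describes presentation, not content",
})
print(search(idx, "document", "conversion"))  # ['/emacs.html']
```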
HTML's stated design objective is to be a language that describes content, but in practice it is used to describe presentation.
HTML is not a good content description language, for the following reasons:
How can one content language describe a court ruling full of legal references, a repair procedure (with diagrams) for a car engine, a mathematical paper, and a Spanish story with an English translation running parallel to it? No one content language can do this, unless it is extremely feature-loaded, which HTML is not.
It is particularly difficult to imagine how a language lacking some sort of extension facility, such as macro definition, can claim to represent content. Content is much more in the mind of the author than in some boilerplate template. What if the author is documenting a new computer language, and wants to define a format for sample code in which the new language's keywords are hyperlinked to their definitions? Unless the language is malleable enough to be adapted to each author's needs, it can hardly represent content well.
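As a concrete illustration of the kind of macro I mean, here is a Python sketch of what such an author might want to define once and reuse throughout the document. The keyword table and the anchor names are invented for the example:

```python
import re

# Hypothetical table mapping the new language's keywords to the
# anchors where they are defined.
KEYWORDS = {"defun": "#defun", "lambda": "#lambda"}

def mark_code(sample):
    """The macro an author might define if HTML were extensible:
    render a code sample with each keyword of the documented
    language hyperlinked to its definition."""
    def link(m):
        word = m.group(0)
        if word in KEYWORDS:
            return f'<A HREF="{KEYWORDS[word]}">{word}</A>'
        return word
    return "<PRE>" + re.sub(r"[A-Za-z-]+", link, sample) + "</PRE>"

print(mark_code("(defun square (x) (* x x))"))
```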
How is a bibliographic reference represented? How about a numbered list that spans several Web pages? Or for that matter, a numbered header that updates automatically as new sections are added or old ones removed? How is a hyperlink to a footnote represented differently from a link to a glossary item? For that matter, isn't the whole concept of a hyperlink inherently presentation-dependent?
HTML is not a good presentation language, for the following reasons:
HTML 2 is being standardized. HTML 3 is in the works. There is talk of an HTML 3+. There will undoubtedly be an HTML 4, an HTML 5, and so on. This is because the language is incomplete. It cannot describe a purely arbitrary page layout. Therefore, it will have to be continually revised as people find more and more things they want to do with it, but cannot.
There is no definitive description for how particular HTML code is to be rendered. Thus, I know that to create an indented paragraph using the Netscape browser, I enclose the text in an otherwise empty list. There is no assurance that this or any other technique will work with a different browser.
HTML is poorly suited for constructing large hypertext systems, mainly because it lacks any features to relate multiple Web pages into a coherent entity. Unless you want to deal with a tangled mess of 10,000 Web pages (all different), you must create a standardized structure of your own - and then enforce it.
A better solution might start by acknowledging the functional separation between content and presentation. The Internet community should standardize on a presentation language, and leave the choice of content representation to the author. The standard presentation language should be complete, be described by a standards-track protocol document that clearly defines its exact interpretation, and allow for easy growth and extensibility.
One such language already exists: Postscript. Defining a hypertext link could be achieved by adding a single primitive. A workable free software implementation of the language exists in the form of Ghostscript.
Almost every modern word processing system can generate Postscript. The amount of programming effort required to extend such systems to generate "Web Postscript" would presumably be negligible.
Netscape, the Encyclopedia's standard browser, can generate Postscript, as can most other Web browsers. Therefore, it would seem that HTML-to-Postscript conversion should also require only a trivial effort. Backwards compatibility would be achieved, and those who wish to continue using HTML could do so, requiring only the extra step of converting their HTML to Postscript before installing it on the Web.