UverseWiki – a universal text markup
  1. Uniform markup
  2. The UverseWiki’s way
  3. A word about existing wiki engines

Roughly a year ago I met (actually, re-met after 3 years) Freeman who, among other crazy ideas, had the idea of a global text markup language: used worldwide, international and extensible.
Well, this idea isn’t new and he certainly didn’t invent all of it. Neither did I. But together we’ve formed something I’m personally proud of – the ObjectWacko markup language.

ObjectWacko is based on a markup language initially created for WakkaWiki, which was later forked into WackoWiki; the latter was abandoned nearly 7 years ago and has now been taken over by another team.

Core principles of a human-writable markup are:

  1. Few to no hacks and tricks. I mean, look at any Wikipedia article with more or less complex formatting (or one using templates) – anything beyond a simple table with no col/rowspans, or any embedded language different from the one the page is written in, resorts to HTML.
  2. Human-readable semantics. Let’s look at MediaWiki again: it uses 2 apostrophes to denote italic text and 3 (three) apostrophes for bold text. To make text bold and italic at the same time you need 5 (five) apostrophes. How intuitive is that?
    DokuWiki is a bit better: it does use things like asterisks for bold (**bold**) and slashes for italic (//italic//) text, but when it comes to strikethrough text it uses… the <del> tag. The question arises: why not continue the intuitive sequence of tokens and use a double dash (--) to denote deleted text? Perhaps because the processor could confuse it with a typographic em-dash?
  3. Extensibility. Let’s take a step away from everyday blogging and forum posting and see how markups usually deal with formulae – they, again, use HTML and/or TeX or similar markup. I don’t want to say TeX is a no-go – it’s most welcome in math expressions, and I’m not suggesting we reinvent the wheel. But the way in which markup is extended (and formulae are an extension, because they’re not meant for regular users and most text markups are fine without them – look at BB-codes) is wrong. Let’s take a look at MediaWiki again: ...the open disk of center <math>(a, b)</math> and radius ''R''....
    So the math expression is placed inside <math> tags (HTML-style again, but let’s bear with it for now). But what if we need some C++ code? Or Delphi? <c>main();</c>, <delphi>program MyApp;</delphi>, <css>rule.class { font-weight: bold; }</css>… What about a language named B? <b>bold?!</b>.
    One could say this problem is contrived, but it hints at the root: traditional markup languages are not extended – «extensions» are part of the markup. In other words, there’s no uniform way to call an arbitrary extension. There’s no concept of extensions to begin with. (A sketch of uniform extension dispatch follows this list.)
  4. Recursive, context-dependent text processing. I daresay most current markup problems originate from the fact that there aren’t enough tokens on the keyboard to represent all the necessary formatting. For example, if we take the double dash (--) to indicate strikethrough text, how will we deal with the typographic em-dash? Okay, we can write a regular expression like PHP/(?<=\s)--(?=\s)/ → "—" and treat -- with spaces on both sides as an em-dash, otherwise as a formatting token. But what about a dash next to punctuation, as in the typographic dash "--" is...? Include " in the regexp? And if we use «*» to format bold text (em-dash "**--**" is used to...) – how do we tell the strikethrough token from the dash? Aww, so many questions…
    One way out is escaping. And it’s a good way. Let’s say we put a tilde (~) before any token and it turns into plain text (not processed). But this is extremely hard to implement, because modern formatters traditionally work by regexp-based replacements – and that’s probably why so few of them actually support token quoting: pulling this trick off in a regular-expression-based formatter causes a lot of headaches. Even the classic WackoWiki formatter, which supports the tilde, still fails on certain token combinations.
    The solution exists but is more complex: create a modelling text processor that works the way program-code compilers do – recursively and based on context rather than on tags/tokens in a flat character «stream». This makes it possible to do text replacements in text nodes (like in the JavaScript DOM) rather than everywhere – and that means there can be no «false» tokens, since text nodes contain only text. (A sketch of this approach also follows the list.)
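
To illustrate point 3, here is a minimal sketch of uniform extension dispatch – in Python rather than the project’s own language, with a call form (##(name) body ##) modelled on the notice block quoted later in this article. The names, handlers and the regexp are my assumptions, not UverseWiki’s actual code. The grammar knows only the call form; extension names live in a registry, so a C++ highlighter (or a handler for a language named B) is just another entry and never collides with the markup itself:

import re

# Hypothetical registry: extension name -> handler function.
EXTENSIONS = {}

def extension(name):
    def register(fn):
        EXTENSIONS[name] = fn
        return fn
    return register

@extension('math')
def render_math(body):
    return '<span class="math">%s</span>' % body   # hand off to TeX etc.

@extension('cpp')
def render_cpp(body):
    return '<pre class="cpp">%s</pre>' % body      # hand off to a highlighter

def expand(source):
    """Expand ##(name) body ## calls through the registry; the grammar
    itself never hard-codes a single extension name."""
    def call(match):
        handler = EXTENSIONS.get(match.group(1))
        body = match.group(2).strip()
        return handler(body) if handler else match.group(0)  # unknown: keep
    return re.sub(r'##\((\w+)\)\s*(.*?)\s*##', call, source, flags=re.S)

print(expand('the open disk of center ##(math) (a, b) ## and radius R'))
# -> the open disk of center <span class="math">(a, b)</span> and radius R

Everything exotic goes through one and the same door, while ordinary tokens like bold stay ordinary tokens.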
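
And to illustrate point 4, here is a minimal node-based formatter sketch – again in Python and again my own assumption rather than UverseWiki’s real implementation, with a deliberately tiny token table. The parser builds a tree recursively, the tilde escape is resolved in context, and the em-dash replacement runs over text nodes only, so it can never fire inside a token:

import re

class Node:                 # element node: a tag plus child nodes
    def __init__(self, tag, children):
        self.tag, self.children = tag, children

class Text:                 # text node: holds plain text, never tokens
    def __init__(self, s):
        self.s = s

TOKENS = {'**': 'b', '//': 'i', '--': 'del'}    # simplified token table

def spaced_dash(src, i):
    """True when the '--' at position i has whitespace on both sides."""
    return (i > 0 and src[i - 1].isspace()
            and i + 2 < len(src) and src[i + 2].isspace())

def parse(src, closing=None):
    """Recursively build the node tree; '~' escapes the following token."""
    children, buf, i = [], [], 0
    def flush():
        if buf:
            children.append(Text(''.join(buf)))
            buf.clear()
    while i < len(src):
        two = src[i:i + 2]
        if src[i] == '~' and src[i + 1:i + 3] in TOKENS:
            buf.append(src[i + 1:i + 3]); i += 3    # escaped token -> text
        elif closing and two == closing:
            flush(); return children, i + 2         # current element ends
        elif two == '--' and spaced_dash(src, i):
            buf.append('--'); i += 2                # spaced '--' is text
        elif two in TOKENS:
            flush()
            inner, used = parse(src[i + 2:], closing=two)
            children.append(Node(TOKENS[two], inner))
            i += 2 + used
        else:
            buf.append(src[i]); i += 1
    flush()
    return children, i

def typography(nodes):
    """Replacements touch text nodes only - no «false» tokens possible."""
    for n in nodes:
        if isinstance(n, Text):
            n.s = re.sub(r'(?<=\s)--(?=\s)', '\u2014', n.s)
        else:
            typography(n.children)

def render(nodes):
    return ''.join(n.s if isinstance(n, Text) else
                   '<%s>%s</%s>' % (n.tag, render(n.children), n.tag)
                   for n in nodes)

tree, _ = parse('a dash -- here, --struck text-- and ~--escaped')
typography(tree)
print(render(tree))
# -> a dash — here, <del>struck text</del> and --escaped

The point is not this particular token table – it’s that escaping and typographic replacements become trivial once the processor works on a tree instead of one flat character stream.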

Uniform markup

I’ve always felt that examples are better than words, so I’ve made up several little dialogues to illustrate how uniformity helps get things done.

An imaginary dialogue between a user and a J2ME programmer.

Silence.

An imaginary dialogue between two Wacko markup users.

Another imaginary dialogue between two forum members.

What I was trying to say here is that text in one form or another is used pretty much everywhere on the Internet – blogs, forums, e-mail messages, photo comments, image descriptions in web galleries, documentation, books, office documents, etc., etc. However, even though text is omnipresent, there’s no uniform syntax for formatting it – although in 90% of cases you need the very same formatting for a forum and a blog, an IM chat and program source comments, bug tracker notes and commit messages.

What is that formatting? First of all, it’s semantics: **bold**, //italic//, __underlined__ & --strikethrough--, and maybe also ^^superscript^^ and ++small++ text. That’s all.

But today most applications use totally different syntaxes, or no syntax at all, even though most of them could format text. Blogs traditionally use WYSIWYG editors that submit HTML source to the backend; forums historically use BB-codes, which are much like HTML but with a different kind of brackets; a few IM programs and chats use simplified markup – highlighting links (http://...) and *emphasised* text; some e-mail applications highlight quoted lines (> Quotation.).
Could they all make use of bold and italic text? Of course!

Wiki software improves this situation because it’s built on the wiki-editing approach, which at least means there’ll be no tag-looking tokens. Well, at least not for basic markup. Well, at least not for all basic markup.

As we can see, even among wiki solutions most markup languages differ – never mind features that one has and another doesn’t, even the things present everywhere (links, basic formatting) are written differently. Some examples: MediaWiki makes text bold with '''bold''' and links with [[Page name]]; DokuWiki uses **bold** and [[page]]; WackoWiki also uses **bold** but links with ((PageName)); BB-codes use [b]bold[/b] and [url]...[/url].

I’ve got an irrepressible imagination and I can’t help thinking of a better Internet – one where we write on a forum the same way as on our blog, format javadoc comments using wiki tokens instead of HTML, seamlessly include commit messages into bug tracker notes – that is, without further processing – and so on.
I don’t dare say it’ll all run on UverseWiki – mind you, it might be possible and I’d love to see it that way, but that’s not the main point. The main point is to bring order to the chaos of today’s text formatting means, and if that is achieved by a dozen different formatters working under the same RFC or formal specification – I’d love to see that evolution as well.

The UverseWiki’s way

So how do our markup, ObjectWacko, and its implementation, UverseWiki, deal with all the problems outlined above?

##(notice)
  == Attention users of v1.x branch ==
  Our software has been entirely reworked and
  support for the old data files is now very limited...
##

A word about existing wiki engines

We’re realists and we understand that it’s impossible for a large group of people (let alone the entire Internet) to switch to something new in a moment. But this isn’t necessary. UverseWiki is a framework that depends on nothing else and is designed to run stand-alone (and it does run stand-alone – by means of a simple console wrapper – to generate our documentation upon new commits to the doc repo).

How can an existing project, no matter how large, benefit from the new markup while maintaining maximum (or any other required level of) backward compatibility? As mentioned above, the framework is markup-independent, meaning there’s a common API no matter what the document’s markup is.

Let’s say an existing web wiki engine has a number of users running their wikis. Its team can implement a new markup module for UverseWiki that processes documents in the old syntax. A wiki owner might then get an option to switch markups – for example, new articles are created in the new (ObjectWacko) syntax while old ones are still processed by the compatibility markup module.

What’s important is that everything happens seamlessly for the host application (i.e. the wiki engine itself, apart from its text formatting routines) – no matter whether a document uses the new syntax, the old syntax or even some 3rd-party markup extension the site owner has installed. Users aren’t affected much either, because they don’t have to rewrite old documents in the new syntax unless they want to. (A sketch of what such dispatch might look like follows.)
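
As a sketch of how that could look – with hypothetical names, in Python rather than the framework’s own language – each document records the markup it was written in, and the host only ever talks to the common interface:

import re
from abc import ABC, abstractmethod

class MarkupModule(ABC):
    """The common API every markup module implements (hypothetical)."""
    @abstractmethod
    def to_html(self, source: str) -> str: ...

class ObjectWackoModule(MarkupModule):
    def to_html(self, source):
        # stand-in for the real ObjectWacko parser
        return '<p>%s</p>' % re.sub(r'\*\*(.+?)\*\*', r'<b>\1</b>', source)

class LegacyModule(MarkupModule):
    """Compatibility module that understands the engine's old syntax."""
    def to_html(self, source):
        return '<p>%s</p>' % source   # stand-in: old-syntax processing

MODULES = {'objectwacko': ObjectWackoModule(), 'legacy': LegacyModule()}

def render(document):
    """The host never inspects the text; it dispatches by markup id only."""
    return MODULES[document['markup']].to_html(document['source'])

old_page = {'markup': 'legacy', 'source': 'An article in the old syntax.'}
new_page = {'markup': 'objectwacko', 'source': 'A **new** article.'}
print(render(old_page))
print(render(new_page))

A 3rd-party markup extension is then just one more entry in the registry; neither the engine nor its users notice the difference.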

What’s more, a markup-independent and thus standardized wiki document model means that actions which don’t rely on a particular markup module are markup-independent as well. There’s no difference for the TOC action whether it outputs headings for a wiki document or for a BB-code document (taking [font] as headings), right? Then it will work even with a markup module created later than the action itself! (A toy illustration follows.)
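
Here is that toy illustration, with made-up node names rather than UverseWiki’s real document model: the TOC action walks whatever tree a markup module produced and cares only about 'heading' nodes:

from dataclasses import dataclass, field

@dataclass
class DocNode:
    kind: str                    # 'document', 'heading', 'paragraph', ...
    text: str = ''
    level: int = 0
    children: list = field(default_factory=list)

def toc(node, out=None):
    """Markup-independent action: collect headings from any document tree."""
    if out is None:
        out = []
    if node.kind == 'heading':
        out.append('  ' * (node.level - 1) + node.text)
    for child in node.children:
        toc(child, out)
    return out

# This tree could equally have come from a wiki parser or from a BB-code
# parser that maps [font]-style headings onto 'heading' nodes.
document = DocNode('document', children=[
    DocNode('heading', 'Uniform markup', 1),
    DocNode('paragraph', '...'),
    DocNode('heading', 'Examples', 2),
])
print('\n'.join(toc(document)))
# -> Uniform markup
#      Examples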

If you have any questions or suggestions, feel free to drop a line in the comments. We’re always happy to see new people interested in organizing texts!

~ Proger_XP.