UverseWiki – a universal text markup

Roughly a year ago I met (actually, re-met after 3 years) Freeman, who among other crazy ideas had the idea of a global text markup language: used worldwide, international and extensible.
Well, this idea isn’t new and he certainly didn’t invent all of it. Neither did I. But together we’ve formed something I’m personally proud of – the ObjectWacko markup language.

ObjectWacko is based on a markup language initially created for WakkaWiki, which was later forked into WackoWiki, which was abandoned nearly 7 years ago and has now been picked up by another team.

Core principles of a human-writable markup are:

Few to no hacks and tricks. I mean, look at any Wikipedia article with more or less complex formatting (or one using templates) – anything beyond a simple table with no col/rowspans, or any embedded text in a language other than the one the page is written in, resorts to HTML.
Human-readable semantics. Let’s look at MediaWiki again: it uses 2 apostrophes to denote italic text and 3 (three) apostrophes for bold text. To make text bold & italic at the same time you need 5 (five) apostrophes. How intuitive is that?
DokuWiki is a bit better: it does use things like asterisks for bold (**bold**) and slashes for italic (//italic//) text but when it comes to ~~strikethru text~~ it uses… the <del> tag. The question arises: why not continue the intuitive sequence of tokens and use a double dash (--) to denote deleted text? Perhaps because the processor could confuse it with a typographic em-dash?
Extensibility. Let’s take a step away from normal blogging and forum posting and see how markups usually deal with formulae – they, again, use HTML and/or TeX or a similar markup. I’m not saying that TeX is a no-go – it’s most welcome in math expressions and I’m not suggesting we reinvent the wheel. But the way in which markup is extended (and formulae are an extension because they’re not meant for regular users and most text markups are fine without them – look at BB-codes) is wrong. Let’s take a look at MediaWiki again: ...the open disk of center <math>(a, b)</math> and radius ''R''....
So a math expression is placed inside (HTML-style again but let’s bear with it for now) <math> tags. But what if we need some C++ code? Or Delphi? <c>main();</c>, <delphi>program MyApp;</delphi>, <css>rule.class { font-weight: bold; }</css>… What about a language named B? <b>bold?!</b>.
One can say that this problem is made up but it hints at the root: traditional markup languages are not extended, their «extensions» are part of the markup. In other words, there’s no uniform way to call an arbitrary extension. There’s no concept of extensions to begin with.
Recursive, context-dependent text processing. I daresay that most current markup problems originate from the fact that there aren’t enough tokens on the keyboard to represent all the necessary formatting. For example, if we take a double dash (--) to indicate strikethru text then how will we deal with the typographic em-dash? Okay, we can write a regular expression (PHP syntax) like /(?<=\s)--(?=\s)/ → "—" and treat a -- with spaces on both sides as an em-dash, otherwise it’s a formatted text block. But what to do with typographic dash "--" is...? Include " in the regexp? But if we use «*» to format bold text (em-dash "**--**" is used to...) – how do we tell the strikethru token from the dash? Aww, so many questions…
One way is to escape formatting. And it’s a good way. Let’s say we put a tilde (~) before any token and it turns into plain (unprocessed) text. But this is extremely hard to implement because the approach modern formatters traditionally use is regexp-based replacement – and that’s probably why so few of them actually support token quoting: pulling this trick off in a regular expression-based formatter is a real headache. Even the classic WackoWiki formatter, which supports the tilde, still fails on certain token combinations.
The solution exists but is more complex: create a modelling text processor that works the way program code compilers do – with recursion and based on context rather than on tags/tokens in a document character «stream». This makes it possible to do text replacements in text nodes only (like in the JavaScript DOM) rather than everywhere – and this means there can be no «false» tokens because text nodes contain only text.
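
To make the difference concrete, here is a rough sketch in Python (chosen purely for illustration; it is not UverseWiki code, and every name in it is made up). It applies the dash rule from above in two ways: blindly over the whole source, and only inside text nodes after a trivial parse that knows about %%...%% blocks and nothing else.

```python
import re

# The rule discussed above: "--" with whitespace on both sides is a dash.
DASH = re.compile(r"(?<=\s)--(?=\s)")

def flat_replace(source):
    """Regex-only approach: the rule fires everywhere, even inside %%...%%."""
    return DASH.sub("\u2014", source)

def parse(source):
    """Split the source into ("text", ...) and ("raw", ...) nodes.

    Only %%...%% blocks are recognised; a real parser would build a full
    tree with nested nodes instead of this flat list.
    """
    nodes = []
    for i, part in enumerate(source.split("%%")):
        if part:
            nodes.append(("raw" if i % 2 else "text", part))
    return nodes

def tree_replace(source):
    """Apply the dash rule to text nodes only, leaving raw nodes untouched."""
    out = []
    for kind, value in parse(source):
        out.append(DASH.sub("\u2014", value) if kind == "text" else value)
    return "".join(out)

sample = "Use a dash -- like this, but keep %%i -- 1%% as typed."
print(flat_replace(sample))  # the dash rule also mangles the raw block
print(tree_replace(sample))  # the raw block survives exactly as written
```

The second function never needs a smarter regular expression: the raw block simply isn’t a text node, so the rule never sees it.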

Uniform markup

I’ve always felt that examples are better than words so I’ve made up several little dialogues to illustrate how uniformity helps get things done.

An imaginary dialogue between a user and a J2ME programmer.

I have a Java application for my Nokia cellphone which I also need on my Samsung. I’ve tried copying it over but even if it says «Java» in the startup banner for some reason it crashes…
Let me see the code… Ah, that’s it: Samsung’s Java machine has this function named differently but this can be fixed by a simple Search & Replace… Hm, too bad it was written for Nokia, Samsung doesn’t have these standard classes so they’ll need to be rewritten… Oh, and Nokia’s permission mechanics completely differ from Samsung’s, and since your program uses GPRS this part will need to be rewritten as well.

Silence.

But overall it’s a working application, it just needs a few parts fixed and it’ll run on Samsung.
Erm… but what if I also need it on a colleague’s LG smartphone?
Let me check it again…

An imaginary dialogue between two Wacko markup users.

Hey, I’m thinking of upgrading my blog to a faster engine, they say they’ve recently released a database version of UverseBlog and I want to try it but don’t know where to start.
You could just copy over your posts (you have them in your blog’s public directory, right?) and import them when you’ve done the upgrade.
Right, I think you’re talking about those WIKI files… but I use a lot of crosslinking between posts and now that this new engine uses a database instead of files I’ll have to relink everything?
Of course not, you can still refer to posts by their title or whatever method you’ve used. Actually, I’ve read that there are no files and folders when you’re writing a post.
Seriously? Well, what about post descriptions, date I’ve written them and everything? You’re saying I’m importing just those WIKI files and do nothing else?
Sure, those files contain that stuff you need – synopsis, author’s name, post title and so on.
What about my comments then? Are they also in those files?
Nope, this is a bit more tricky – you need to go to your control panel and Export all comments on your blog. It’ll make an archive with WIKI files so you can import them just like posts.
Oh, seems like all I have to do is just import my old texts?
Indeed.
Very neat, thanks pal!

Another imaginary dialogue between two forum members.

I’m thinking about posting parts of my blog post about recently released CPU/GPU chips and their news here but I have a problem: I update that post on a daily basis and I can’t bring myself to repost it here each time. However, I’ve seen quotations from your blog and I’m curious whether you’ve simply copy-pasted them?
Nope, there’s a trick: my blog uses the same markup as this forum does so I simply embed one of my blog pages into a post and it fixes all links and pictures to new path automatically.
Seriously? And how do you do that?
Well, something like {{Include http://my-blog.net/AJAX_basics?wiki}}.
And that’s all?
Yup. Just don’t forget to fix permissions in your blog’s control panel so it’ll allow inclusion from this host.
I can include half of my posts this way! Thanks, mate!

What I was trying to say here is that text in this or that form is used pretty much everywhere on the Internet – blogs, forums, e-mail messages, photography comments, image descriptions in a web gallery, documentation, books, office documents, etc., etc. However, even though text is omnipresent there’s no uniform syntax for formatting it – although in 90% of cases you’ll need all the same formatting for forum and blog, IM chat and program source comments, bugtracker notes and commit messages.

What is that formatting? First of all it’s semantics: **bold**, //italic//, __underlined__ & --strikethru--, and maybe also ^^superscript^^ and ++small++ texts. That’s all.

But today most applications use totally different syntaxes or no syntax at all, although most of them could format text. Blogs traditionally use WYSIWYG editors that submit HTML source to the backend; forums historically use BB-codes which are more like HTML but with a different kind of brackets; a few IM programs and chats use simplified markup – highlighting links (http://...) and *emphasised* text; some e-mail applications highlight quoted lines (> Quotation.).
Could they all make use of bold and italic text? Of course!

Wiki software improves this situation because it’s based on the wiki-editing approach, which at least means there’ll be no tag-looking tokens. Well, at least not for basic markup. Well, at least not for all basic markup.

As we can see even among Wiki solutions there are differences in most of their markup languages – not talking about features that one has and another doesn’t, even those which are present everywhere (links, basic formatting) differ. Some examples:

MediaWiki has ''italic'' and '''bold''', [[Link|Caption]], == 1st level heading == and Inline <ref>references</ref>.
DokuWiki has //italic//, **bold**, __underline__ and <del>strikethru</del>, [[Link|Caption]] and ====== 1st level heading ======.
Creole has //italic// and **bold**, [[Link|Caption]] and = 1st level heading.
UverseWiki has //italic//, **bold**, __underline__, --strikethru--, ^^superscript^^ and ++small++, ((Link Caption)) or [[Link Caption]], == 1st level heading == and Inline ((*reference)).

I’ve got an irrepressible imagination and I can’t help thinking of a better Internet – where we can write on forum in the same manner as on our blog, format javadoc comments using wiki tokens instead of HTML, seamlessly include commit messages into bugtracker notes – that is, without further processing, and so on.
I don’t dare to say it’ll all work under UverseWiki – mind you, it might be possible and I’d love to see it that way but that’s not the main point. The main point is to bring order to the chaos of all present text formatting means, and if that is done using a dozen different formatters working under the same RFC or formal specification – I’d love to see this evolution as well.

The UverseWiki’s way

So how do our markup, ObjectWacko, and its implementation, UverseWiki, deal with all the problems outlined above?

Few to no hacks and tricks – we’re firmly set on letting no HTML slip through our fingers into the page markup. Thanks to markup extensibility a special syntax construction allows new features to be accessed by means of formatters without interfering with the original markup. Example: %%(php) echo "Hello, world!";%% – here «php» inside the parentheses is the name of the extension that receives the text between the pair of %%.
Human-readable semantics – we’ve had an intense e-mail exchange (≈500 messages for us both) and then a no less intense forum discussion (≈700 posts for us both) before we settled on all the tokens we’ve included into the markup language. For the most used things the most easily accessible symbols were selected, and they all follow the «Two Symbols’ Rule» – a piece of text put between two pairs of the same symbol is formatted in some specific way. Examples (this is not a complete syntax reference):
1. Two asterisks = **bold**
2. Two forward slashes = //italic//
3. Two underscores = __underline__
4. Two dashes = --strikethru--
5. Two circumflexes = ^^superscript^^
6. Two plus signs = ++small++
7. Two percent symbols = %%monospaced (unformatted) text%% – also serves as an extension wrapper: %%(html) <a href="http://uverse.i-forge.net">UverseWiki</a>%%
8. Two exclamation marks = !!Attention!! – also serves as a custom style token: !!(bugfix) Bug #0000144 fixed!!
9. Two question marks = ??comment (invisible text)?? – can also mean offtopic/dubious text that’s visible but displayed in a different style (customizable).
10. Two «at» symbols = Nihongo (@@jp_JP 日本語@@) means "Japanese". – embedding different languages into the same document.
11. Two square or round braces = ((http://google.com Google)), [[/Main Index page]].
  - Both types of braces are complete synonyms and the reason why round braces were added is trivial: for English writers it’s not a problem to enter either type of braces – even angular (< >) and curly ({ }) ones only require holding Shift. On Russian and probably most other keyboard layouts this, however, is extremely annoying because to enter any kind of bracket except round ones one must switch the keyboard layout to English, enter 2 symbols, switch it back, enter the necessary text, switch again to close the link and switch yet again to continue writing. Things get funnier if a link needs a custom caption that is traditionally separated from its URL by a pipe (|) – in this case, in addition to switching layouts to enter square braces, a Russian typist would also need to switch twice to enter the pipe symbol. In ObjectWacko the pipe is replaced with a double equals sign (==): See ((/Recent changes/2011-08-31 == the changelog)).
    When writing large amounts of text (such as documentation) this layout juggling is unnerving, you feel like you’re riding over hills, which distracts you from the actual writing. Sure, one might come up with autoreplace rules in some desktop application or first write all round braces and Search/Replace them with square ones once the article is finished – but why invent a workaround if it can be avoided?
12. Two angle braces to align paragraphs = >>right>>, <<left<<, <<justify>>, >>center<<
13. Inline quotations follow the Internet tradition and use «>» symbols at the beginning of a line – each successive «>» refers to an older citation: >> A quote from a post preceding the previous post.
14. Two curly braces = {{Image pic.jpg, 800×600, align=right}} – another way to call a markup extension (so-called actions, see below).
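
The «Two Symbols’ Rule» is regular enough that its core fits into a few lines. Below is a minimal sketch in Python, not our actual parser: the token-to-tag table is my own guess at a sensible HTML rendering, and the function ignores escaping, links, styles and everything else from the list above. It only shows how a doubled symbol opens a span, the same doubled symbol closes it, and the content is processed recursively.

```python
# Hypothetical mapping from a doubled symbol to an HTML tag; the tag choices
# are illustrative and not necessarily what UverseWiki outputs.
TOKENS = {"**": "b", "//": "i", "__": "u", "--": "s", "^^": "sup", "++": "small"}

def render(text):
    """Render doubled-symbol spans to HTML, recursing into each span."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        # Look for the matching closing pair only if this is a known token.
        close = text.find(pair, i + 2) if pair in TOKENS else -1
        if close > i + 2:
            tag = TOKENS[pair]
            inner = render(text[i + 2:close])  # nested formatting
            out.append(f"<{tag}>{inner}</{tag}>")
            i = close + 2
        else:
            # Not a token, or an unclosed/empty one: keep the character as is.
            out.append(text[i])
            i += 1
    return "".join(out)

print(render("**bold //both// text** and --gone--"))
```

Note that an opening pair with no closing pair simply stays plain text, so something like 2 ** 3 passes through unchanged.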

Extensibility – UverseWiki is designed to be markup-independent, meaning that the module implementing ObjectWacko formatting is a regular extension with no privileges that any other markup module cannot have. Both UverseWiki as a framework and ObjectWacko as one of its modules contain extension points for easy and break-free extensibility:
1. Page abstraction layer – no matter how odd or complex the underlying file system is (FTP, SVN repository, regular files & folders, database, etc.) it’s all the same for the writer because he can address pages in a uniform way and even move articles around his projects (forum, blog, wiki resource) without changing a single link – the system takes care of it for him.
2. Module API – allows writing new modules that join the existing UverseWiki document tree harmoniously.
3. Actions & formatters – ObjectWacko’s way of extending original markup:
  - actions are external routines called to generate some content – like Table Of Contents (TOC), Recent Changes block, an embedded Image or Video, etc. They’re inserted between a pair of double curly braces: {{TOC}}.
  - formatters (sometimes also called highlighters) are another kind of markup modules that take the user’s input and output it in a way they decide – independently of ObjectWacko. Formatters can be used to print source code, formulae, some specialized tables, program «vCards» (info cards), etc. Because their content isn’t processed by the original document’s markup module (wiki in our case) they don’t look weird (remember those smileys appearing inside some forums’ [CODE] blocks). Example: %%(ruby) p array.reject { |x| x < 10 }%%.
    - This concept is similar to MediaWiki’s templates except it’s much more powerful because it’s not just a set of variables to be replaced but a module written in regular PHP (or another supported language).
4. Block, paragraph & text styles – ObjectWacko makes it easy to highlight a particular phrase, a whole paragraph or an entire block (with any regular markup like headings and quotations) in a specific way by placing one or more style names (containing Latin or other language’s word characters, digits and underscores) between a pair of round brackets. In HTML a style name corresponds to CSS styles, so adding a new style is as simple as writing the necessary CSS rules and using the style in the article right away. Examples:
  1. Text: !!(plot spoiler) That kid's actually the main character in his childhood!!.
  2. Paragraph: .(NB) You can also do that by pressing !!(hotkey)F9!! or using your mouse.
  3. Block:

##(notice)
  == Attention users of v1.x branch ==
  Our software has been entirely reworked and
  support for the old data files is now very limited...
##
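
The formatters described above boil down to one idea: a uniform call syntax plus a registry of named routines. Here is a small sketch of that idea in Python. The registry, the decorator and the «upper» formatter are all hypothetical names made up for this example, and the regular expression is just a shortcut to keep it short (the real thing works on a parsed tree).

```python
import html
import re

# Hypothetical registry: formatter name -> function taking the raw text.
FORMATTERS = {}

def formatter(name):
    """Decorator that registers a formatter under the given name."""
    def register(func):
        FORMATTERS[name] = func
        return func
    return register

@formatter("upper")
def upper_formatter(text):
    # A toy extension; a real one might highlight PHP or render TeX.
    return text.upper()

def plain_formatter(text):
    """Fallback for %%...%% without a (known) name: monospaced, escaped."""
    return "<code>" + html.escape(text) + "</code>"

# %%(name) body%% or just %%body%%
BLOCK = re.compile(r"%%(?:\((\w+)\))?\s*(.*?)%%", re.S)

def expand(source):
    """Replace every %%...%% block with the output of its formatter."""
    def call(match):
        name, body = match.group(1), match.group(2)
        return FORMATTERS.get(name, plain_formatter)(body)
    return BLOCK.sub(call, source)

print(expand("Shout: %%(upper) hello%% and code: %%a < b%%"))
```

Adding a new extension means registering one more function; the markup itself does not change, which is exactly what <math>, <c> and <delphi> tags cannot offer.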

Recursive, context-dependent text processing – unlike most (or all) known web text formatters UverseWiki does not rely on regular expression-based Find & Replace: it fully parses the source document, then discards the source and works with the document tree (DOM) it built during the parse phase. This tree can be dumped (serialized in a binary form) for transmission or storage, rendered into different formats (plain text, HTML or something like FB2 or RTF) or operated upon (nodes replaced, partial tree rendering, etc.). The DOM is designed to be markup-independent so a document can contain other documents (of different markups), embedded source code, videos, Flash applications, TeX formulae or images – as long as those nodes know how to output themselves in the necessary representation the framework can provide a proper API to deal with them in a uniform way.
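
As a rough illustration of «nodes that know how to output themselves», here is a tiny document tree in Python. The class names and the render(fmt) method are assumptions of this sketch, not UverseWiki’s API; the point is only that one tree yields several output formats and that the renderer never looks at the original markup.

```python
import html
from dataclasses import dataclass, field

@dataclass
class Text:
    value: str

    def render(self, fmt):
        # Text nodes escape themselves for HTML and are literal otherwise.
        return html.escape(self.value) if fmt == "html" else self.value

@dataclass
class Bold:
    children: list = field(default_factory=list)

    def render(self, fmt):
        inner = "".join(child.render(fmt) for child in self.children)
        return f"<b>{inner}</b>" if fmt == "html" else f"*{inner}*"

@dataclass
class Code:
    language: str
    source: str

    def render(self, fmt):
        if fmt == "html":
            return f'<pre class="{self.language}">{html.escape(self.source)}</pre>'
        return self.source

@dataclass
class Document:
    children: list = field(default_factory=list)

    def render(self, fmt):
        return "".join(child.render(fmt) for child in self.children)

doc = Document([
    Text("1 < 2 is "),
    Bold([Text("true")]),
    Text(": "),
    Code("python", "print(1 < 2)"),
])
print(doc.render("html"))
print(doc.render("plain"))
```

Replacing a node (say, swapping the Code node for an image of its output) is an ordinary list operation on the tree and needs no re-parsing.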

A word about existing wiki engines

We’re realists and we understand that it’s impossible for a large group of people (let alone the entire Internet) to switch to something new in one moment. But this isn’t necessary. UverseWiki is a framework independent of everything, designed to run stand-alone (and it does run stand-alone – by means of a simple console wrapper – to generate our documentation upon new commits to the doc repo).

How can an existing project, no matter how large, benefit from the new markup while maintaining maximum (or any other required level of) backward compatibility? As mentioned above, the framework is markup-independent, meaning that there’s a common API no matter what the document’s markup is.

Let’s say an existing web wiki engine has a number of users running their wikis. The team can implement a new markup module for UverseWiki that will process documents with the old syntax. There might be an option for a wiki owner to switch markups – for example, new articles will be created using new (ObjectWacko) syntax and old ones will still be processed by the compatibility markup module.

What’s important is that everything happens seamlessly for the host application (i.e. the wiki engine itself apart from its text formatting routines) – no matter if a document uses the new syntax, the old syntax or even some 3rd-party markup extension that the site owner has installed. Users aren’t affected much either because they don’t have to rewrite old documents using the new syntax unless they want to.

What’s more, a markup-independent and, thus, standardized Wiki document model means that actions that don’t rely on a particular markup module are markup-independent as well. There’s no difference for the TOC action if it outputs headings for a wiki document or a BB-code document (taking [font] as headings), right? Then it will work even if the markup module the current document uses was created later than the action itself!
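
Here is that TOC argument as a toy in Python. Both «markup modules» and the Heading node are invented for this sketch and are far simpler than real modules would be; what matters is that toc() is written once against the common model and never learns which markup the headings came from.

```python
import re
from dataclasses import dataclass

@dataclass
class Heading:
    level: int
    title: str

def parse_wiki(source):
    """Hypothetical wiki module: '== Title ==' lines become headings."""
    nodes = []
    for line in source.splitlines():
        m = re.fullmatch(r"(={2,})\s*(.+?)\s*=+", line.strip())
        # Two equals signs mean a 1st level heading, three a 2nd level one...
        nodes.append(Heading(len(m.group(1)) - 1, m.group(2)) if m else line)
    return nodes

def parse_bbcode(source):
    """Hypothetical BB-code module: '[font]Title[/font]' lines are headings."""
    nodes = []
    for line in source.splitlines():
        m = re.fullmatch(r"\[font\](.+?)\[/font\]", line.strip())
        nodes.append(Heading(1, m.group(1)) if m else line)
    return nodes

def toc(nodes):
    """The action: it sees Heading nodes only, never the source markup."""
    return ["  " * (n.level - 1) + n.title
            for n in nodes if isinstance(n, Heading)]

wiki = "== Intro ==\nSome text.\n=== Details ==="
bb = "[font]Intro[/font]\nSome text."
print(toc(parse_wiki(wiki)))
print(toc(parse_bbcode(bb)))
```

A third markup module written next year would work with the same toc() as long as it produces Heading nodes.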

If you have any questions or suggestions feel free to drop a line in the comments. We’re always happy to see new people interested in bringing order to text!

~ Proger_XP.