Wild pipeline API thoughts

(Note: this is a long post, and most certainly the syntax highlighter will make it look funny on your aggregator. You might want to visit the web page to get a better grasp of it)

In 2006 it will be roughly 6 years since I started juggling with XML pipelines. As my few fellow readers might remember, I’m starting to hate XML languages with a passion, but once again this doesn’t mean I don’t like XML anymore and, even more, this doesn’t mean my love for Cocoon is fading. I’m still convinced that pipeline-based processing is the way to go: the road to complex yet maintainable results clearly goes through decomposing the problem in a set of easy step to be performed sequentially and incrementally.

I’m also still convinced that XML is here to stay, for a number of valid reasons, yet I think that the overall scenario has changed since the original Cocoon vision. XML is possibly the most important player out there, but didn’t manage to pursue its Borgish ambition to assimilate everything else: there is a growing party of people who are realizing how the idea that everything could (and should!) be represented as XML is pretentious at least, and stupid at most.

This leaves us, however, with two important concepts: we need pipelines, and we need to steer clear of XML when it doesn’t make sense. To achieve the first goal what we need is a generic, easy and intuitive pipeline API. And it should be a programmatic API, because we need pipelines everywhere, and we need them to be easy enough to grasp for the average programmer (think Facade on steroids): what bugs me with the currently available pipeline API is how they tend to be clumsy and counter-intuitive. Think SAX as the perfect example of why we need a more generic and easier pipeline API and machinery: in the SAX world if you want to pipe events from foo to bar you just do this:

[java]
foo.setContentHandler(bar);
[/java]

Things however get complicated when baz and boo enter the picture. Now you have to:

[java]
baz.setContentHandler(boo);
bar.setContentHandler(baz);
foo.setContentHandler(bar);
[/java]

Which, counter-intuitively, means building the pipeline starting from the last component and going all the way to the first one. In addition to that, those statements are usually interspersed on code that contains other statements such as creation and configuration of the various pieces. Moreover, from a functional point of view the above code could be rewritten as:

[java]
foo.setContentHandler(bar);
bar.setContentHandler(baz);
baz.setContentHandler(boo);
[/java]

Which, even if it looks better from the user point of view (the pipeline steps are now ordered) it has no relation with the underlying model. In fact, you could actually obfuscate stuff when considering switching jobs:

[java]
foo.setContentHandler(bar);
baz.setContentHandler(boo);
bar.setContentHandler(baz);
[/java]

The above lines will still work as expected, but I dare anyone to understand who sends events to whom in a real life scenario where those statements might be ten lines away from each other. To me, this just doesn’t sound right.

Now enter Cocoon and see how, in its declarative sitemap, it shines from the user’s point of view:

[xml]

[/xml]

This just sounds right: pipeline steps are listed sequentially, as they should be, and everyone now understands who’s first and who’s next. But, heck, this is a domain-specific language by all means, moreover written in XML. No way this stuff can be reused in different context, and the “strong typing” nature of the Cocoon pipeline (where everything starts with a Generator and ends with a Serializer, assuming not only that the whole world will talk XML but actually that the whole world will talk SAX) makes things even more difficult.

Finally, consider what the Unix genius have brought us:

[code]
$ grep index.html access.log | awk ‘{ print $1 }’ | sort | uniq | wc -l
[/code]

I think there’s no better comment for the above solution than this:

“A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away.” (Antoine de Saint-Exupery)

Expressing the pipeline concept with just one character (the | sign) is a clear indicator of what could be achieved when thinking about simplicity: the concept is powerful, yet the user space view of it is as simple as it can get (admittedly, a bit opaque but it doesn’t take much to get used).

So, what do the above snippets bring to us? In this quest for a simple pipeline API we learn that simplicity is key and that the principle of least surprise suggests that the pipeline declaration should happen at once and in an ordered way. Sticking to Java, this leaves us with something like

[java]
pipeline.setupPipeline(List components);
[/java]

or (uglier, but sometimes just effective enough):

[java]
pipeline.setupPipeline(PipelineComponent[] components);
[/java]

Actually I’d much rather see the setup happening during the construction phase, but for the sake of interface design I’ll leave the convenience method for now. This means that our Pipeline interface becomes something like:

[java]
interface Pipeline extends PipelineComponent {

void setupPipeline(List components);

void start();

}
[/java]

Easy and effective: whoever knows the pipeline concept should be able to grasp how this API works in minutes. We need to talk about the PipelineComponent interface though, and this is related to the next wild idea: pipeline machinery.

(warning: shaky ground ahead, this is the part which needs *much* more thinking)

In the OO world things aren’t quite as simple as in a CLI environment, where all you have are basic pipeline contracts such as “whatever byte streams comes from the left side is pumped to the right side”. We have objects here, and I don’t want this generic pipeline API to be strongly typed as in being able to work just with – say – SAX events or other XML gibberish. What I want is an API which is able to work with many formats in a way that’s transparent to the user: this means that the various pipeline stages should be able to express their contracts in terms of required input and output format. It’s up to the pipeline machinery providing adapters and bridges so that, say, a pipeline component working with SAX events might be able to cooperate with another component working with streams. This could be accomplished either through some kind of PipelineDescriptor, with annotations or just through different interfaces: whatever keeps things simple, makes me happy.

Another nice solution comes (again!) from a chat with Sylvain reminding me of the IAdaptable approach in Eclipse. This solution fits like hand in glove with a world of interchangeable and heterogeneous pipeline component stages, even though I have a few concerns thinking about the added complexity for pipeline component writers in implementing an Adaptable strategy: if the first and foremost objective of this API is simplicity, then writing components should be as easy as possible.

Anyway, the final outcome of all this would be something like this during the pipeline assembly phase:

[java]
PipelineComponent current;
PipelineComponent next;

if (next.accepts(current)) {

current.setNext(next);

} else {

PipelineComponent adapter = next.getAdapter(current.class);

current.setNext(adapter);
adapter.setNext(next);

}
[/java]

With this mechanism in place, in theory, the pipeline is much more versatile: sticking to the XML world it would be possible to build pipelines using whatever mix of SAX, DOM, StAX, AXIOM and YouNameWhat. Moreover, it would be easy enough to provide adapters to the stream world, tees and nested pipelines (there is a reason for Pipeline to extend PipelineComponent after all).

Of course I expect the pipeline machinery to do much more than just adaptation: caching, logging, monitoring and management are vital to the pipeline deployer. But the real point of this effort is, once again, simplicity. Once I’m able to do this:

[java]
// Get the default pipeline implementation
Pipeline pipeline = PipelineFactory.getPipeline();

// Set up the pipeline with an array of PipelineComponents
pipeline.setupPipeline({reader, transform1, transform2, streamAdapter });

// grab the InputStream from the latest component
InputStream result = streamAdapter.getInputStream();

// start processing
pipeline.start();

// enjoy results
is.read()…
[/java]

or this:

[java]
// This time we use SAX events straight away
pipeline.setupPipeline({reader, transform1, transform2 });

// connect to the pipeline result
transform2.setContentHandler(myContentHandler);

// start processing and handle events coming in
pipeline.start();
[/java]

or, why not:

[java]
pipeline.setupPipeline({reader, transform1, transform2 });

anotherPipeline.setupPipeline({something, pipeline, somethingElse});

anotherPipeline.start();
[/java]

Then I could do this from within Cocoon:

[javascript]
function handlePage() {

var pipeline = cocoon.newPipeline({ file(“something.xml”, xslt(“foo.xsl”), forms(), i18n() });

cocoon.sendPipelineAndWait(pipeline);
}
[/javascript]

but also, when Cocoon is not an option:

[xml]
< %@ taglib uri="http://jakarta.apache.org/taglibs/pipeline" prefix="pipeline" %>
[/xml]

Conclusion: if you managed to survive this far, well, congrats and thanks for sticking: it’s been a bumpy ride and there are certainly a ton of rough edges, but the more I think about it, the more I’m convinced that a simple, painless and easy to use Pipeline API could be an invaluable tool. I’d love to use the incredible experience of Cocoon in building solid pipelines to factor out a new and fresh approach that allows anyone to enjoy the power of pipeline-based processing: it’s not going to be easy, but the goal is indeed worth the effort. Now, finding the time to make it happen is a totally different question…

Comments

comments

1 thought on “Wild pipeline API thoughts”

  1. Dear Gianugo,

    This is exactly what I need. I’ve been a fan of Cocoon. However Cocoon can be used only as a framework and the projects, I am working on, are build already on other frameworks. Some time ago I considered to extract the pipeline code from Cocoon and use it as a library, but looking at the source I realized that it is not possible. As you wrote in the article, it’s a lot of effort to implement the idea. If you decide to start an open source project I am willing to help.

    Martin Rusnak

Comments are closed.