LucidWorks Big Data & Oozie Workflow With VizOozie
In this post we will discuss how to create a visualized workflow graph for Oozie. Oozie is a workflow management system for Hadoop jobs. Oozie workflow jobs are DAGs (directed acyclic graphs) of actions: http://oozie.apache.org
At LucidWorks we use Oozie in our LucidWorks Big Data product. The workflows which we provide with the platform are configured and run with Oozie. Developers create workflow.xml files, the workflow definition files for Oozie, and deploy them to Hadoop. A good explanation of how this works is provided here: http://www.infoq.com/articles/oozieexample
Some workflows get complicated pretty quickly and may include subworkflows, forks, joins and other actions that are hard to follow in XML. A visualization tool helps streamline workflow design and makes it easy to grasp quickly what a workflow does.
VizOozie is an open source tool that converts your static XML workflow definitions into dot files, which the Graphviz dot program can render as PDF or other formats: http://www.graphviz.org/
You will need a Unix-like environment, Python, and Graphviz dot installed to run this.
Check it out from GitHub and run:
python vizoozie/vizoozie.py example/workflow.xml example/workflow.dot
or use your own Oozie workflow XML file.
This will generate a dot file, which can easily be converted to PDF with dot:
dot -Tpdf example/workflow.dot -o example/workflow.pdf
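The two commands above can be chained from a small script. The following is only a sketch under stated assumptions: the function names build_commands and render are hypothetical, the script path vizoozie/vizoozie.py is taken from the command above and assumed to be relative to the current directory, and both python and dot are assumed to be on the PATH.

```python
import os
import subprocess


def build_commands(workflow_xml, script="vizoozie/vizoozie.py"):
    """Return the two command lines: XML -> dot file, then dot file -> PDF.

    The output names are derived from the input name by swapping the
    extension, mirroring the example paths used in this post.
    """
    base, _ = os.path.splitext(workflow_xml)
    dot_path = base + ".dot"
    pdf_path = base + ".pdf"
    return [
        ["python", script, workflow_xml, dot_path],
        ["dot", "-Tpdf", dot_path, "-o", pdf_path],
    ]


def render(workflow_xml):
    """Run both steps in order, stopping at the first command that fails."""
    for cmd in build_commands(workflow_xml):
        subprocess.run(cmd, check=True)
```

Calling render("example/workflow.xml") would then run the same two steps as the commands shown above.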
Standard workflow shapes are used for the start, end, process, join, fork and decision nodes. The fill colors of the action nodes are configurable in the vizoozie.properties file (e.g. the java action is drawn in blue).
The code is pretty simple: for each node type, it converts the XML to a dot string using xml.dom.minidom and writes it out. For example, given an XML snippet with the paths of a fork node:
<path start="complex-math" />
<path start="more-complex" />
<path start="geek-candy-process" />
the code for a fork node looks like this:
def processFork(self, doc):
    output = ''
    for node in doc.getElementsByTagName("fork"):
        name = self.getName(node)
        # Declare the fork node itself, drawn as an octagon
        output += '\n' + name.replace('-', '_') + " [shape=octagon];\n"
        for path in node.getElementsByTagName("path"):
            start = path.getAttribute("start")
            # One edge from the fork to each path it starts
            output += '\n' + name.replace('-', '_') + " -> " + start.replace('-', '_') + ";\n"
    return output
In this method, node names are normalized with name.replace('-', '_'), because dot does not accept hyphens in unquoted identifiers, and the fork node is given a specific shape (shape=octagon). Then it looks for the fork's start paths, such as <path start="complex-math" />, and emits one edge per path. For our example above, the edges in the output look like this:
post_process -> complex_math;
post_process -> more_complex;
post_process -> geek_candy_process;
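To try the fork conversion in isolation, here is a minimal standalone sketch of the same idea. It is not the VizOozie source: the function names dot_id, fork_to_dot and to_digraph are hypothetical, the surrounding workflow-app element is added only so the sample parses, and the fork name post-process is an assumption inferred from the output above. The digraph wrapper is added so that the result is a complete file that dot can read.

```python
from xml.dom import minidom

# Hypothetical workflow fragment. The fork name "post-process" is an
# assumption inferred from the output shown in this post.
SAMPLE = """<workflow-app name="demo">
  <fork name="post-process">
    <path start="complex-math" />
    <path start="more-complex" />
    <path start="geek-candy-process" />
  </fork>
</workflow-app>"""


def dot_id(name):
    """Replace hyphens, which dot does not allow in unquoted identifiers."""
    return name.replace("-", "_")


def fork_to_dot(doc):
    """Return dot statements for every <fork> element in the document."""
    lines = []
    for fork in doc.getElementsByTagName("fork"):
        name = dot_id(fork.getAttribute("name"))
        # The fork node itself, drawn as an octagon
        lines.append(name + " [shape=octagon];")
        # One edge from the fork to each path it starts
        for path in fork.getElementsByTagName("path"):
            lines.append(name + " -> " + dot_id(path.getAttribute("start")) + ";")
    return "\n".join(lines)


def to_digraph(body, graph_name="workflow"):
    """Wrap statements in a digraph block so dot can read the result."""
    return "digraph " + graph_name + " {\n" + body + "\n}\n"


dot_text = to_digraph(fork_to_dot(minidom.parseString(SAMPLE)))
print(dot_text)
```

Saving the printed text to a file and passing it to dot -Tpdf should render just the fork and its three outgoing edges.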
When processed by the dot program, this output is drawn as a fork node with three child nodes. I hope you find this explanation useful.