2012-10-11

Rethinking JVM & System configuration languages

I've been busy in Apache Whirr, with a complete service that installs HDP-1 on a set of cluster nodes (WHIRR-667); the source is all up on GitHub for people to play with. As a result, someone asked me why I'm not using SmartFrog to provision Hadoop clusters.
Having used it as a tool for a number of years, I'm aware of its flaws:

Specification language
  • Hard to track down where something gets defined
  • x-reference syntax a PITA to use and debug
  • Fuzzy distinction about LAZY eval vs. pre-deploy evaluation (LAZY is interpreted at deployment, but 'when' is ambiguous)
Implementation-wise
  •  RMI is the wrong approach: brittle, often undertested in real-world situations, & doesn't handle service restarts, as references break.
  •  Wire-format serialized Java objects; the Object->Text->Parse->Object serialization proved surprisingly problematic (not defining the text encoding didn't help)
  •  Security so fiddly that we would often turn it off.
  •  Doesn't work unless Java is installed and the network is up -so not so good for basic machine setup from inside the machine itself, only outside-in (which is partly what Whirr does).
  •  Java doesn't let you get at many of the OS-specific details (permissions, process specifics); you end up hacking execs to do this.
  •  The way you imported other templates (#import keyword) was C-era -multiple imports would take place, the order in which they were loaded mattered.
  •  Shows its age -doesn't use dependency injection and becomes hard to work with (NB: Whirr doesn't inject either)
In defence:
  •    it's not WS-*
  •    language better than XML (especially spring XML)
  •    good for writing distributed tests in
  •    Most XML languages insert variable/x-ref syntaxes in different ways (ant, maven, XSD, ...); SF has a formal reference syntax that doesn't change.
  • Being able to x-ref to dynamic data as well as static is powerful, albeit dangerous as the values can vary depending on where you resolve the values, as well as changing per run. And they stop you doing more static analysis of the specification.
  • Being able to refer to string & int constants in Java source is convenient too (classpath issues notwithstanding). For example, I could say:
    
serviceName: CONSTANT org.smartfrog.package.Myclass.SERVICE;

    The constant would then be grabbed from source. This may seem minor, but consider how often string constants are replicated in configuration files as well as source -and how a typo on either side creates obscure bugs. Eliminating that duplication reduces problems.
Looking at Whirr I can see how the two-level property file config design has limits (all extended services need to have their handlers declared in every config that uses them); templates of some form or other would correct this.

Ignoring the specific issue of VM setup (I need to write a long blog there criticising the entire concept of VM configuration as it is today, as it's like linking a C++ app by hand), I'd do things differently.
I think we need a post-properties, post-SF language: a strict superset of JSON, which it could be compiled down to, with property expansion in x-refs, the ability to declare which attributes to inject and which are mandatory, and some Prolog & Erlang-style list syntax to make list play easier. No dynamic values, because that prevents evaluation in advance.

"org.apache.whirr.hdp.Hdp1": org.apache.whirr.hadoop.Hadoop {
  "install":"install-hdp",
  "configure":"configure-hdp",
  "port": 50070,
  "user":"mapred",
  "logdir": "/var/log/${user}",
  //Extend the list of things to inject
  "org.smartfrog.inject": ["logdir" |super:"org.smartfrog.inject"]
}


The template being extended would be this:
"org.apache.whirr.hadoop.Hadoop": {
  "install":"install-hdp",
  "configure":"configure-hdp",
  "timeout": 60000,
  "port": 50070,
  "description": "hadoop",
  "org.smartfrog.class": "org.apache.whirr.service.hadoop.HadoopClusterAction",
  "org.smartfrog.inject": ["timeout", "port", "install", "configure", "user"],
  "org.smartfrog.require": ["install", "configure"]
}


This would compile down to an expanded piece of plain JSON; because everything expands out, you could use the language as a preprocessor anywhere JSON is consumed.
"org.apache.whirr.hdp.Hdp1":  {
  "install":"install-hdp",
  "configure":"configure-hdp",
  "timeout": 60000,
  "port": 50070,
  "description": "hadoop",
  "user":"mapred",
  "logdir": "/var/log/mapred",
  "org.smartfrog.inject": ["logdir", "timeout", "port", "install", "configure", "user"],
  "org.smartfrog.class": "org.apache.whirr.service.hadoop.HadoopClusterAction",
  "org.smartfrog.require": ["install", "configure"]
}
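
The expansion itself is small enough to sketch. What follows is a rough Python illustration, not a parser for the proposed syntax: the two templates are written directly as Python dicts, the parent link is a plain tuple, and SuperList is a made-up stand-in for the ["logdir" | super:"..."] list form. It also assumes that ${...} references are resolved only after the merge, so a child's value for "user" is the one that gets substituted.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class SuperList:
    """Hypothetical stand-in for the proposed ["x" | super:"key"] list syntax."""
    head: list
    key: str


# name -> (parent template name or None, attributes declared by that template)
TEMPLATES = {
    "org.apache.whirr.hadoop.Hadoop": (None, {
        "install": "install-hdp",
        "configure": "configure-hdp",
        "timeout": 60000,
        "port": 50070,
        "description": "hadoop",
        "org.smartfrog.class": "org.apache.whirr.service.hadoop.HadoopClusterAction",
        "org.smartfrog.inject": ["timeout", "port", "install", "configure", "user"],
        "org.smartfrog.require": ["install", "configure"],
    }),
    "org.apache.whirr.hdp.Hdp1": ("org.apache.whirr.hadoop.Hadoop", {
        "install": "install-hdp",
        "configure": "configure-hdp",
        "port": 50070,
        "user": "mapred",
        "logdir": "/var/log/${user}",
        "org.smartfrog.inject": SuperList(["logdir"], "org.smartfrog.inject"),
    }),
}

REF = re.compile(r"\$\{([^}]+)\}")


def merge(name, templates):
    """Overlay a template's attributes on its recursively merged parent."""
    parent, body = templates[name]
    base = merge(parent, templates) if parent else {}
    merged = dict(base)
    for key, value in body.items():
        if isinstance(value, SuperList):
            # New head items first, then whatever list the parent declared.
            value = value.head + list(base.get(value.key, []))
        merged[key] = value
    return merged


def expand(name, templates=TEMPLATES):
    """Merge the template chain, then resolve ${attr} references in strings."""
    merged = merge(name, templates)

    def resolve(value, seen=()):
        if not isinstance(value, str):
            return value

        def substitute(match):
            ref = match.group(1)
            if ref in seen:
                raise ValueError(f"circular reference: {ref}")
            if ref not in merged:
                raise KeyError(f"unresolved reference: {ref}")
            return str(resolve(merged[ref], seen + (ref,)))

        return REF.sub(substitute, value)

    return {key: resolve(value) for key, value in merged.items()}


# The result contains only plain JSON types, so it can be serialized directly.
print(json.dumps(expand("org.apache.whirr.hdp.Hdp1"), indent=2))
```

Because nothing is dynamic, the whole thing is a pure function of the templates: unresolved or circular references fail at expansion time rather than at deployment.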


  1. Importing is a troublespot -if you required fully qualified template references that mapped to specific package & file names, then you could just have a directory path tree (a la Python), possibly with zip file/JAR file bundling, and have the templates located there.
  2. I'm avoiding worrying about references; you'd need a syntax outside of strings to do this. It'd be a lot simpler than the SF one -fully qualified refs again, up/down the current tree, and to the super-template.
  3. No runtime references.
This syntax would be parseable in multiple languages; the expanded, pure JSON form would be the serialization format.
A Java interpreter could take that and execute it, doing attribute injection where requested, failing if a required value is missing. Behind the scenes you'd have things that do stuff. I'd also look very closely at using Java at all, not just because I'm enjoying living in a half-post-Java world (Groovy for tests, GUIs &c), but because it
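
The inject/require step could look something like the sketch below. It is in Python rather than Java purely for brevity, and everything in it is an assumption: the HadoopClusterAction class here is a hypothetical stand-in for whatever "org.smartfrog.class" names, and injection is modelled as plain attribute assignment.

```python
class HadoopClusterAction:
    """Hypothetical target object; not the real Whirr class."""

    def __init__(self):
        self.install = None
        self.configure = None
        self.timeout = None
        self.port = None
        self.user = None


def inject(config, target):
    """Check required attributes, then copy the injectable ones onto target."""
    required = config.get("org.smartfrog.require", [])
    missing = [key for key in required if key not in config]
    if missing:
        raise ValueError(f"missing required attributes: {missing}")
    for key in config.get("org.smartfrog.inject", []):
        # Only inject what was actually set; optional attributes keep defaults.
        if key in config:
            setattr(target, key, config[key])
    return target


expanded = {
    "install": "install-hdp",
    "configure": "configure-hdp",
    "timeout": 60000,
    "port": 50070,
    "org.smartfrog.inject": ["timeout", "port", "install", "configure", "user"],
    "org.smartfrog.require": ["install", "configure"],
}

action = inject(expanded, HadoopClusterAction())
print(action.install, action.port, action.user)
```

Since the input is the fully expanded JSON, the required-attribute check could equally run at compile time, before anything is launched.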

One other possibility here is that given it's JSON, embrace JavaScript more fully. What if you have not only the configuration params, but the option of adding .JS code in there too; you could have some fun there.

A cluster would be defined from this, here using the same role-name concept that Whirr uses, with something like
"1 hadoop-namenode+hadoop-jobtracker, 512 hadoop-tasktracker+hadoop-datanode"

In a JSON template language you'd split things up more & use lists. It's more verbose, yet tunable.
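
To show how the two forms relate, here is a small Python sketch that splits a Whirr-style role string into a list of groups. The function name and the "Services"/"Count" keys are illustrative choices, and turning the count into an integer is an assumption rather than anything Whirr mandates.

```python
def parse_role_spec(spec):
    """Split '<count> role+role, <count> role+role' into a list of groups."""
    groups = []
    for part in spec.split(","):
        # Each part looks like "512 hadoop-tasktracker+hadoop-datanode".
        count, roles = part.strip().split(None, 1)
        groups.append({
            "Services": [role.strip() for role in roles.split("+")],
            "Count": int(count),
        })
    return groups


spec = "1 hadoop-namenode+hadoop-jobtracker, 512 hadoop-tasktracker+hadoop-datanode"
for group in parse_role_spec(spec):
    print(group)
```

Once the roles are a real list, extending or overriding a single group no longer means rewriting the whole string.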
Your cluster templates would extend the basic ones, so a cluster targeting EC2 could extend "org.apache.whirr.hdp.Hdp1" and add the EC2 options of AMI location, AWS cluster (West Coast 2, obviously), as well as authentication details -or leave that to the end. (There are some thoughts on mixins arising here; let's not go there, but I can see the value.)

stevecluster:  ClusterSpec org.apache.whirr.hdp.Hdp1{
 "ec2-ami":"us-west2/ami5454",
 "templates" : {
    "manager": {
       "Services": ["hadoop-namenode", "hadoop-jobtracker"],
       "Count": "1"
     },
    "worker": {
       "Services": ["hadoop-tasktracker", "hadoop-datanode"],
       "Count": "255"
     }
    }
}

A template without the login facts would need to be given the final properties on startup, props that could be injected as system properties (launch-cluster --conf stevecluster.jsx -Dstevecluster.ec2-ami=us-west2/ami5454). Properties set this way would automatically override anything set in the template. That is, unless there is (somehow) support for a final attribute, which Hadoop uses to stop end users overwriting some of the admin-set config values with their own. Without going into per-key attributes, you could have a special key, final, which took a list of which of the peer attributes were final. Actually, thinking about it more, @final would be better. Which would be hard to turn into JSON…
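
A rough Python sketch of the list-valued final key follows. The launch-cluster command is imaginary, so the -D parsing is too, and dropping a blocked override while reporting it back to the caller (rather than failing outright) is just one possible policy.

```python
def parse_defines(args, cluster):
    """Collect -D<cluster>.<key>=<value> arguments into a dict of overrides."""
    prefix = f"-D{cluster}."
    overrides = {}
    for arg in args:
        if arg.startswith(prefix) and "=" in arg:
            key, value = arg[len(prefix):].split("=", 1)
            overrides[key] = value
    return overrides


def apply_overrides(config, overrides):
    """Apply overrides, skipping any key listed under the 'final' attribute."""
    final = set(config.get("final", []))
    result = dict(config)
    rejected = []
    for key, value in overrides.items():
        if key in final:
            rejected.append(key)  # admin-set value wins; caller can warn
        else:
            result[key] = value
    return result, rejected


args = ["--conf", "stevecluster.jsx",
        "-Dstevecluster.ec2-ami=us-west2/ami5454",
        "-Dstevecluster.user=root"]
config = {"user": "mapred", "port": 50070, "final": ["user"]}

merged, rejected = apply_overrides(config, parse_defines(args, "stevecluster"))
print(merged)
print("rejected:", rejected)
```

Values that arrive this way are always strings, so any typed attribute (a port, a count) would still need converting before injection.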

I could imagine using the same template language to generate compatible properties files today; this JSON-template stuff would just be a preprocess operation to generate a .properties file. That's making me think of XSLT, which is even scarier than mixins.
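
That preprocess step is mostly a flattening exercise. A minimal Python sketch, assuming nested objects become dotted key prefixes and lists become comma-separated values; it does no escaping of special characters, which a real .properties writer would need.

```python
def to_properties(config, prefix=""):
    """Flatten an expanded JSON-style dict into key=value property lines."""
    lines = []
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Subtrees become dotted prefixes on their children.
            lines.extend(to_properties(value, name))
        elif isinstance(value, list):
            lines.append(f"{name}={','.join(str(item) for item in value)}")
        else:
            lines.append(f"{name}={value}")
    return lines


expanded = {
    "user": "mapred",
    "port": 50070,
    "templates": {
        "manager": {"Services": ["hadoop-namenode", "hadoop-jobtracker"], "Count": 1},
    },
}

print("\n".join(to_properties(expanded)))
```

The conversion is lossy in one direction: once everything is a string, the reader has to know which values were numbers or lists.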

I have no plans to do anything like this.

I just think a template-extension to JSON would be very handy, that some aspects of the SmartFrog template language are very powerful & convenient, irrespective of how they are used.
If someone were to do this, the obvious place in Apache-land would be in commons-configuration, as then everything which read its config that way would get the resolved config. That framework is built around hierarchical property files -think log4j.properties- so it resolves everything to a string and then converts to numbers afterwards. Lists and subtrees are likely to be trouble here -albeit fantastic.
