I've been a busy bunny writing what has grown into a fairly large Spark patch: SPARK-1537, integration with the YARN timeline server. What starts as straightforward POST event, GET event list and GET event code grows once you start taking into account Kerberos, transient failures of the endpoints, handling unparseable events (fail? Or skip that one?) and compatibility across versions. Oh, and testing all of this: I've got tests which spin up the YARN ATS and the Spark History server in the same VM, then either generate an event sequence and verify it all works, or even replay some real application runs.
And in the process I have learned a lot of Scala and some of the bits of Spark.
What do I like?
- Type inference. And not the pretend inference of Java 5 or Groovy.
- The match/case mechanism. This maps nicely to the SML case mechanism, with the bonus of being able to add conditions as filters (a la Erlang).
- Traits. They took me a while to understand, until I realised that they were just C++ mixins with a structured inheritance/delegation model. And once so enlightened, using them became trivial. For example, in some of my test suites, the traits you mix in define which services are brought up for the test cases.
- Lists and maps as primary language structures. Too much source is frittered away in Java creating those data structures.
- Tuples. Again, why exclude them from a language?
- Getting back to functional programming. I've done it before, see.
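To make a few of those concrete, here is a minimal sketch of match/case with a guard, traits stacked as mixins, and tuples, lists and maps as everyday literals. Every name in it (`Event`, `Started`, `Fixture`, `WithTimelineServer` and so on) is made up for illustration; none of it is taken from the SPARK-1537 patch or its tests.

```scala
object ScalaLikes {
  // Hypothetical event types, only here to have something to match on
  sealed trait Event
  case class Started(appId: String, time: Long) extends Event
  case class Ended(appId: String, time: Long) extends Event

  // match/case, with a guard ("if time < 0") acting as a filter
  def describe(event: Event): String = event match {
    case Started(id, time) if time < 0 => s"$id: invalid start time"
    case Started(id, _)                => s"$id started"
    case Ended(id, _)                  => s"$id ended"
  }

  // Traits as mixins: each one adds a (pretend) service to the fixture
  trait Fixture { def services: List[String] = Nil }
  trait WithTimelineServer extends Fixture {
    override def services: List[String] = "timeline" :: super.services
  }
  trait WithHistoryServer extends Fixture {
    override def services: List[String] = "history" :: super.services
  }
  class IntegrationFixture extends Fixture
    with WithTimelineServer with WithHistoryServer

  // Tuples, lists and maps without ceremony; all types are inferred
  val (name, count) = ("events", 3)
  val byName = Map(name -> count)
}
```

The traits to the right are applied last, so `new ScalaLikes.IntegrationFixture().services` lists the history server first and the timeline server second.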
What am I less happy about?
- The Scala collections model. Too much, too complex.
- The fact that it isn't directly compatible with Java lists and maps. Contrast with Groovy.
- Scalatest. More the runner than the tests, but the ability to use arbitrary strings to name a test case means that I can't run (at least via Maven) a specific test case within a class/suite by name. Instead I've been reduced to commenting out the other tests, which is fundamentally wrong.
- I think it's gone overboard on various features...it has the, how do I say it, C++ feel.
- The ability to construct operators using all the symbols on the keyboard may lead to code less verbose than Java, but when you are learning the specific classes in question, it's pretty impenetrable. Again, I feel C++ at work.
- Having to look at some SBT builds. Never use "Simple" in a spec: it's as short-term as "Lightweight" or "New". I think I'll use "Complicated" in the next thing I build, to save time later.
When I go back to Java, what do I miss?
- Type inference. Even though its type inference is a kind that Milner wouldn't approve of, it's better than not having one.
- Semicolons being mostly optional. It's so easy to keep writing broken code.
- val vs var. I know, Java has "final", but it's so verbose we all ignore it.
- Variable expansion in strings. Especially debugging ones.
- Closures. I look forward to the leap to Java 8 coding there.
- Having to declare exceptions. Note that in Hadoop we tend to say "throws IOException", which is a slightly less blatant way of saying everything "throws Exception". We have to consider Java's explicit exception naming idea one not to repeat, on the grounds that it makes maintenance a nightmare and precludes different implementations of an interface from having (explicitly) different failure modes.
When I go back to Java, what don't I miss?
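A small sketch of several items from this list in one place: val versus var, variable expansion in strings, a closure, and a method that throws without any declaration. The names (`retries`, `attempts`, `status`, `read`) are hypothetical and exist only for this illustration.

```scala
import java.io.IOException

object JavaMisses {
  val retries = 3    // immutable: reassigning it is a compile error
  var attempts = 0   // mutable, and the keyword makes that obvious

  // A closure: it captures and updates `attempts` from the enclosing scope
  val attempt: () => Int = () => { attempts += 1; attempts }

  // Variable expansion in strings, handy for debug and log messages
  def status(): String = s"attempt ${attempt()} of $retries"

  // No "throws" clause needed, even though this always throws
  def read(path: String): String =
    throw new IOException(s"cannot read $path")
}
```

The first call to `JavaMisses.status()` returns "attempt 1 of 3" and the next "attempt 2 of 3", because the closure shares the one `attempts` variable.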
- A compiler that crawls. I don't know why it is so slow, but it is. I think the sheer complexity of the language is a likely cause.
- Chained over-terseness. Yes, I can do a t.map.fold.apply chain in Spark, but when you see a stack trace, having one action per line makes trying to work out what went wrong possible. It's why I code that way in Java, too. That said, I find myself writing more chained operations, even at the cost of stack-trace debuggability. Terseness is corrupting.
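The trade-off in that last point looks roughly like this. It is a sketch on a plain Scala List rather than a Spark RDD, with invented sample data, so treat it as an illustration of the style and not of Spark's API.

```scala
object Chaining {
  // Invented sample data: "kind:timestamp" strings
  val events = List("start:10", "end:25", "start:30")

  // Terse: one chained expression, so every step shares one source line
  val chained =
    events.map(_.split(":")).filter(_(0) == "start").map(_(1).toInt).sum

  // One action per line: a failure in any step points at its own line
  val fields = events.map(_.split(":"))
  val starts = fields.filter(_(0) == "start")
  val times = starts.map(_(1).toInt)
  val total = times.sum
}
```

Both forms compute the same value. The second costs three extra names, and in return a NumberFormatException from a bad timestamp lands on the `toInt` line and nowhere else.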
Am I going to embrace Scala as the one true programming language? No. I don't trust it enough yet, and I'd need broader use of the language to be confident I was writing things with the right architecture.
What about Scala as a data-engineering language? One stricter than Python, but nimble enough to use in notebooks like Zeppelin?
I think from a pure data-science perspective, I'd say "Work at the Python level". Python is the new SQL: something which, once learned, can be used broadly. Everyone should know basic Python. But for that engineering code, where you are hooking things up, mixing in existing Java libraries, Hadoop API calls and using things like Spark's RDDs and DataFrames, Scala works pretty well.