The inevitable “Task not serializable” SparkException

The good old:

org.apache.spark.SparkException: Task not serializable

usually surfaces at least once in a Spark developer’s career, or, in my case, whenever enough time has gone by since I last saw it that I’ve conveniently forgotten it exists, and that it is (usually) easily avoided.

Here is the scenario: I need a class that filters an RDD, for example:

import org.apache.spark.SparkContext

class AreWeSerializableYet(sc: SparkContext) {

  val rdd = sc.parallelize(1 to 10)
  val numberTwo = 2

  def doFilter() = {
    val filtered = rdd.filter(defEvens) // not serializable
    filtered.collect()
  }

  def defEvens(a: Int) = a % 2 == 0
}

Let’s instantiate the class in a test and call that method:

import org.apache.spark.sql.SparkSession
import org.scalatest.{FlatSpec, Matchers}

class ShortTest extends FlatSpec with Matchers {

  val sc = getSpark().sparkContext
  val expectedOutput = (2 to 10 by 2).toList

  "AreWeSerializableYet" should "blow up (or not) for demo purposes" in {
    val subject = new AreWeSerializableYet(sc)
    val result = subject.doFilter()
    result.toList should be(expectedOutput)
  }

  def getSpark(appName: String = "testerApp") = {
    SparkSession
      .builder()
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .master("local[*]")
      .appName(appName)
      .getOrCreate()
  }
}

We run our tests in the IDE and are disappointed to find this in the output:

org.apache.spark.SparkException: Task not serializable

Trogging further down the stack trace, we see the cause:

Caused by: java.io.NotSerializableException: com.ehur.AreWeSerializableYet
Serialization stack:
 - object not serializable (class: com.ehur.AreWeSerializableYet, value: com.ehur.AreWeSerializableYet@4ba380c7)
 - field (class: com.ehur.AreWeSerializableYet$$anonfun$1, name: $outer, type: class com.ehur.AreWeSerializableYet)
 - object (class com.ehur.AreWeSerializableYet$$anonfun$1, <function1>)

The problem? Spark is attempting to serialize the entire instance of the AreWeSerializableYet class: you can see in the trace above that the generated function, AreWeSerializableYet$$anonfun$1, keeps a reference to the class it belongs to in a field called $outer. But our class is not serializable, and it would remain so even if we slapped a “with Serializable” on it. For one thing, one of its instance variables is a SparkContext, and trying to serialize a SparkContext is just not done.

The solution? Try passing a function value to the filter instead of a method. A method reference gets eta-expanded into an anonymous function that captures its enclosing instance, whereas this function value, whose body touches no instance members, retains no handle on the class and can be serialized out to the cluster as is:

def doFilter() = {
  val filtered = rdd.filter(valEvens)  //yes serializable
  filtered.collect()
}

val valEvens = (a: Int) => a % 2 == 0

What if you need to reference an (immutable) instance variable of the class in your function? Recall that:

val numberTwo = 2

is defined in the class. This too will fail with the Task not serializable exception: because the function body references numberTwo, the closure captures the enclosing instance that numberTwo belongs to, and that instance is not a serializable object:

def doFilter() = {
  val filtered = rdd.filter(isDivisibleByTwo)   //not serializable
  filtered.collect()
}
val isDivisibleByTwo: (Int) => Boolean = _ % numberTwo == 0

One way to work around this is to wrap that instance variable in a serializable case class that in turn provides the function:

def doFilter() = {
  val serializableWrappedVar = SerializableThing(numberTwo)
  val filtered = rdd.filter(serializableWrappedVar.isDivisibleByMe)   //yes serializable
  filtered.collect()
}
case class SerializableThing(num: Int) {
  def isDivisibleByMe(a: Int) = a % num == 0
}
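
Another common trick, shown here as a minimal sketch rather than code from the original test, is to copy the field into a local val inside the method; the closure then captures only that local Int, never the enclosing instance:

def doFilter() = {
  val localTwo = numberTwo // local copy: the closure captures only this Int
  val filtered = rdd.filter(_ % localTwo == 0) // yes serializable
  filtered.collect()
}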

Yes, lots of hoops to jump through, but we need to filter those RDDs somehow…


A week in ML part 2: algorithms

The previous post established the need for a thorough understanding of the data. Assuming we know our data, we can now consider the algorithms.

Our chosen problem was a prediction question: clearly a supervised learning exercise and, since we expected the output to be continuous values, a regression problem. We tried a few off-the-shelf libraries that provided implementations of linear regression, and learned lots of little details about the process as we went.

During our subsequent retrospective meeting (because we retrospect on everything around here, like this article, where I am retrospecting on the retrospective…), it was pointed out that our approach to the problem had “no rigor”: there was no attempt to dig into the assumptions inherent in the regression model, or into how our chosen dataset might invalidate those assumptions. Treating the algorithm as a sort of wizard’s cauldron – tossing in our input data, waving the algorithmic wand, and expecting magical predictions to emerge without any understanding of why this might actually work – did not seem to be a good thing.

I would counter that the naïve approach we took was part of the point of the exercise. We hoped to accomplish some level of machine learning without having to first become either a mathematician or a wizard. And we did learn:

  • metrics:
    • we now know how to interpret regression metrics, and what reasonable values look like for our particular problem. Evaluating metrics is one of the key steps in the machine learning process: some metrics come with the implementation of the chosen algorithm, and some we might develop ourselves for our particular problem.
  • pipelines:
    • programmatic aids like Spark’s ML Pipelines allow us to automate the repetitive steps, from data preparation to parameter tweaking to metrics assessment. Thus, experimenting with a variety of algorithms and parameters becomes very manageable (see the sketch after this list).
  • tuning:
    • while our problem was a simple enough one, we could have improved our results with a few adjustments:
      • adjusting the inputs: providing more observations for training and more independent predictor variables, and giving ourselves more time to understand time series decomposition.
      • adjusting the parameters: regularization, fit intercept and so on.
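
Here is a minimal sketch of that kind of pipeline. The column names (x1, x2, label) and the training and test DataFrames are hypothetical stand-ins, not our actual project’s schema:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

// gather the raw predictor columns into the single vector column Spark ML expects
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2")) // hypothetical predictor columns
  .setOutputCol("features")

val lr = new LinearRegression().setLabelCol("label")

val pipeline = new Pipeline().setStages(Array(assembler, lr))

// sweep regularization and intercept settings instead of tweaking them by hand
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .addGrid(lr.fitIntercept, Array(true, false))
  .build()

val validator = new TrainValidationSplit()
  .setEstimator(pipeline)
  .setEvaluator(new RegressionEvaluator().setMetricName("rmse"))
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.8)

// fit picks the best parameter combination; the evaluator then scores it on held-out data
val model = validator.fit(training)
val rmse = new RegressionEvaluator().setMetricName("rmse").evaluate(model.transform(test))

One pipeline definition, and the whole prepare/train/tune/evaluate loop runs unattended.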

Given more time and more experience, we will in the future be able to work up to more complex questions, and know when to ask things like: is my data highly non-linear? Do I need to do some feature engineering? Is it time to bring out the big guns: deep learning/PCA? Should I be using an SVM with a custom kernel function? Maybe I should only run my algorithm by the light of the first full moon after the spring equinox?

But that’s not the place to start. I don’t know about mathematicians, but it takes years to become a good wizard…



A week in ML

A group of us at CJ engineering recently got the opportunity to set aside all other project and squad work for 1 week, to focus exclusively on a machine learning exercise. The objective of this effort was not so much to come up with a great prediction algorithm that would yield a fountain of money for CJ and for our clients: rather it was to get stuck in with a machine learning challenge, and to experience all the pitfalls and promises, the trials and triumphs that are part and parcel of practical machine learning. And if that fountain-of-money-yielding-algorithm did emerge from the exercise, so much the better…

The nuts and bolts of the problem statement, the steps followed, and the tools and technology used are left for a separate post: in this series of 3 posts I will share the key lessons I learned that I plan to take with me to my next ML endeavor:

1. know your data
2. know your algorithm(s)
3. know your architecture

This post will delve into the first of these:

know your data

Data – your new best friends

The ML success story of my dreams goes something like this: take a bunch of attributes, toss them all into some algorithm, click play, and – voilà – a robust prediction model materializes, our clients’ revenue goes through the roof, and I am awarded the Nobel prize for economics. While we await this moment of glory (which hasn’t happened yet, but there’s still time), what we can do in the meantime is get cosily familiar with our features, our input data. They are our new best friends: get to know their size and shape, take them to the pub for a beer, let them open up to you and share their insights.

There’s a plot in here somewhere…

If going to the pub turns out not to be an option, another way to get that insight from your features is to plot them against the target variable in the training data, and even to plot them against each other (although that sounds a bit subversive). Zeppelin notebooks make this easy, by the way. In general, being able to visualize your features in various ways in the context of your ML question is very effective in informing your approach to the problem.
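
In a Zeppelin %spark paragraph, for instance, one line gets you a chart. This is just a sketch: training, feature1 and label are hypothetical names, and z is the ZeppelinContext that Zeppelin injects into the notebook:

// z.show renders a DataFrame with Zeppelin's built-in table and chart options,
// so plotting a feature against the target is a single line
z.show(training.select("feature1", "label"))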

Less is more? No: more is more.

Make sure you have enough data: enough samples to split into meaningful training and test datasets, and enough to cover all your data’s variability. When working with time series data, make sure you have data that goes back sufficiently far in time: for example, if you know your model will be influenced by seasonality, make sure your data covers all the seasons. But, you ask, how much is enough – what’s the magic number? I don’t know; it will depend. In one experiment we tried to infer seasonality from less than 2 years of data; it seems to me that we ought to have had at least 3 full years of data to train with.
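
The mechanics of the split, at least, are trivial. A sketch, where df is a hypothetical DataFrame:

// 80/20 train/test split with a fixed seed so runs are reproducible;
// having enough rows behind the split is the hard part
val Array(training, test) = df.randomSplit(Array(0.8, 0.2), seed = 42L)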

Ask pertinent questions to avoid impertinent blowouts

In a linear regression model to predict future revenue for your clients, is it really appropriate to train your model on a dataset that includes all your clients’ data? Or should you be training a model per client? Beware of categorical feature blowout: a relatively moderate input dataset can translate into massive amounts of computation if you use a high-cardinality input as a categorical feature. Our program had the impertinence to run out of memory before our data had even made it into the algorithm. It’s also easy to overfit with this kind of data.
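
A cheap sanity check is to count the distinct levels before treating a column as categorical. A sketch, where df and clientId are hypothetical:

// every level of a categorical column becomes a dimension once it is encoded,
// so count the levels before handing the column to the algorithm
val cardinality = df.select("clientId").distinct().count()
if (cardinality > 1000) // threshold is arbitrary: tune it to your memory budget
  println(s"clientId has $cardinality levels: encoding it will blow up the feature space")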


Coming soon: know your algorithm(s).


SCNA 2016

Software Craftsmanship North America – a community of software professionals focused on the skills and craft required to produce enduring proficiency in software development – hosted its annual conference at the Town and Gown facility at USC on Friday, October 21st, 2016, followed by a “day of code” event on Saturday, October 22nd. Luckily for me, the engineering team at my company, CJ Affiliate, has always emphasized this aspect of software engineering excellence, and CJ was one of the main sponsors of the event, so I was able to attend the Friday conference.

In the rush to cover the many new tools and techniques that are released every few months, there aren’t so many conferences or meetups that dedicate time to the art and mastery of writing code. So it was refreshing to come to this gathering, where software craftsmanship was important to everyone there.

Some thoughts gleaned from the morning talks:

Michael Feathers made the charming suggestion that we “anthropomorphize” our code – look at it fondly and ask “what does this code need right now?” By appreciating and empathizing with what we create, we will naturally be moved to write better code. I might give that a try.

Audrey Trout presented a very illuminating talk on learning together. Did you know that anywhere from 20% to 70% of project time is taken up with learning? Learning effectively ought to be of prime concern to any project team.

Mined Minds revealed yet another take on expanding software skills to a diverse and unlikely community – that of former coal miners. The point was not just that these people get a chance to pull themselves out of poverty and unemployment – which they do, and that by itself is a laudable achievement. On top of that, in introducing these individuals to the software development profession, that profession is gaining a much-needed injection of diversity – diversity of thought, and diversity of backgrounds. If 80% of a profession is made up of relatively affluent white males, who chose it because they like playing video games, then that profession cannot escape being limited in its creativity, blinkered in its vision, and lacking in its understanding of the world and the people that live in it. Who would want to work in a limited, blinkered and lacking profession?

Carina Zona’s talk, the Consequences of an Insightful Algorithm, was a timely and revealing reminder that we writers of algorithms and interpreters of data are not infallible, not prescient, and definitely not precise. Ignoring this fact can lead to “Inadvertent Algorithmic Cruelty”, among other faux pas (examples: the mislabeling of images from Nazi concentration camps as children’s playground equipment, or Facebook’s “on this day” feature invoking the memory of some terrible personal tragedy, with no way to turn it off). Such outrageous algorithmic blunders might have elicited a ranting, righteous screed; but Carina’s delivery was all the more effective for being measured and empathetic.

This subject matter suggested to me that I should occasionally look beyond the lines of code and ponder other aspects of our profession. But, while I found that refreshing, some of my colleagues who had attended last year expressed disappointment in this year’s content. Last year’s conference had a more code-centric focus on craftsmanship, so some people naturally expected this year to be the same. I didn’t mind. In software conferences, just as in our daily software development endeavors, sometimes it is necessary to look up from the IDE, and look out at the world we are attempting to change with lines of code.

Posted in code, talks, women in tech Tagged with: ,

Apache Zeppelin 0.5.6 : spark 1.6.1 client : hadoop 2.6.0-cdh5.5.1 in remote cluster

Some trials, some errors, some success…

Downloaded the latest release, which at this time is 0.5.6, from here, and unzipped it into the incubator-zeppelin folder. Built using this command:

mvn clean package -DskipTests -Pspark-1.5 -Dspark.version=1.6.1 -Dhadoop.version=2.6.0-cdh5.5.1 -Phadoop-2.6 -Pyarn

caveat coder:

  • -Pspark-1.5 corresponds to the profile in the pom, not my actual Spark version
  • I specifically used a Spark build that used Scala 2.10: I ran into ClassNotFoundExceptions when trying to run Zeppelin with a Spark built on Scala 2.11.
  • Zeppelin seems to require Java 7 at present

Configured conf/zeppelin-env.sh:

export JAVA_HOME=/Users/lhurley/software/jdk1.7.0_79
export HADOOP_CONF_DIR=/Users/lhurley/software/hadoop/etc/hadoop
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.6.0-cdh5.5.1"
export SPARK_HOME=/Users/lhurley/software/spark
export PATH=$PATH:$SPARK_HOME/bin

Ran the zeppelin daemon:

./bin/zeppelin-daemon.sh start

caveat coder:

  1. Trying to run the daemon in a tmux shell failed with:

"Zeppelin process died  [FAILED]"

and this in the log file:

"nohup: can't detach from console: No such file or directory"

so: don’t run it in tmux…

  2. Running over VPN against the remote cluster I got this:

java.net.UnknownHostException: lhurley-mac: nodename nor servname provided, or not known.

I had to add my hostname to /etc/hosts:

127.0.0.1 localhost lhurley-mac
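
With the daemon up and the hostname resolving, a trivial %spark paragraph confirms the interpreter can reach the cluster. A sketch, using the sc that Zeppelin injects into the notebook:

// smoke test: if this prints 5, the Zeppelin -> Spark -> cluster wiring works
val probe = sc.parallelize(1 to 5)
println(probe.count())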

Here is my sample notebook based on a music recommender from Advanced Analytics with Spark.


TDD demo at SBHS Computer Science Academy

In March this year I roped in a few of my colleagues from CJ Affiliate to demo a little Test Driven Development to the students at Santa Barbara High School’s Computer Science Academy. The students were impressive – engaged, curious and brave enough to come up and pair with us in front of the class!


Here’s the long version of the demo – will update with an edited version soon…


I’m geeky?

(This originally appeared on the cj engineering blog on February 9, 2016)

She’s Geeky 2016 “Unconference” was held at the Computer History Museum in Mountain View on Saturday, January 30. I don’t typically label myself a geek, but I seized the opportunity to attend, and seized #1daughter to accompany me, on the “bring-your-daughter” companion ticket. Why, if I am not a self-described geek, did I wish to attend? Well…

  • mainly – for interaction and conversation with other women who work in technology
  • possibly – for connections that might steer more female candidates toward our hiring efforts
  • hopefully – to get inspiration for #1daughter…
  • and – bonus – it was the last chance to see the Babbage Difference Engine before it was relocated out of the museum!

Saturday, which was in fact the second day of the event, saw 200+ attendees forfeit their weekend morning lie-in in favor of this geeky gathering. Introductions and opening ceremonies completed, we swarmed around the agenda board with our suggested topics and, as happens with the Open Space method, a motley collection of hand-written pages metamorphosed into a coherent and colorful agenda for the day.

A sampling of the session topics: follow your passion, HTML5 and CSS, the classroom of the future, IoT recipes, diversity gaps, mentoring 360, hacker events, Arduino starter kits, and much more. There are too many to delve into, but a few are worth calling out here. For more details, check out the blog at shesgeeky.org.

Impostor Syndrome: an engaging discussion. It made me realize that women working in tech commonly encounter discrimination, disparagement of their efforts and abilities, and unfairness. While it was disheartening to hear this first-hand from some who had experienced it, I felt grateful and a bit proud that this is not something I encounter here at CJ.

How to access a computer when you cannot use your hands: My feeble writing will not do justice to the courage and fortitude of this speaker. A software engineer who suffers from ALS spoke to us by means of the tablet, camera and software wired and configured into her wheelchair. She told us how these technologies enabled her to deal with, first, the loss of her dominant hand; soon after, the loss of both hands; and, inevitably, the frightening loss of her own voice. Dasher, a zooming predictive text entry system paired with a camera and eye-tracking software, is what she uses to formulate speech. Dasher was invented some years ago at Cambridge University in England, but development on it stopped about 5 years ago. Frustrated with some of its shortcomings, this engineer cloned the git repo to work on her own enhancements, and she and her team are poised to launch the first new release in years. Interestingly, Dasher is hopeless for writing code – an on-screen keyboard is better suited to that.

She too brought her daughter. It was one of the more profound sessions of the event.

Arduinos for all: A practical and useful hands-on demo of Arduino. I learned that Arduinos are both cheaper and simpler than I had supposed. I will definitely invest in a kit in the near future. The next pedagogical experiment to try on my kids, perhaps?

Oh – and speaking of pedagogical experiments – how did #1daughter get on? Well, it seems the women in tech community did not terrify her after all. In fact she had this to say:

“Instead of a group of very intelligent people talking about very intelligent things in a very intelligent way, I got to meet a group of very intelligent people who were open, kind, and interesting.”

I’ll take it. It’s not every day a teenager will use those four adjectives to describe people like her mother.
