We run daily Spark jobs that generate various reports: retention, cohorts, user activity, lifetime value, etc. Recently we found ourselves in a situation where one person had to wear the operator's hat: watching this process and restarting tasks once they failed or (much worse) got stuck.
Why did this happen? There are many reasons. If you follow our tech blog, you probably know that we run scheduled Spark jobs on Mesos against data stored in S3 in either SequenceFile or Parquet format (here's a good write-up about our process). It's ironic that Spark started out as a proof-of-concept application for Mesos, yet for us the two never seemed to play well together. We tried enabling dynamic allocation, but ended up running the important jobs without it, since roughly half of those runs failed with weird errors about lost shuffle blocks.
The most painful issue was that sometimes a Spark job would get stuck in its last stage, and there was no way to revive it other than SSHing into the machine running the stuck executor and killing the process. This, too, was part of the "operator's" job. We suspected it was somehow related to Mesos, but we had no proof.
Another problematic part of our pipeline was Chronos. Here are some issues we encountered on a daily basis:
- Scheduled tasks were not executed. What??? Isn't it a scheduling framework? Right. But sometimes jobs depending on a parent task that had finished successfully didn't run. Not that they didn't run on time - they didn't run at all.
- The same job was executed twice. This was rare, but when it happened to a task that wrote to Redshift, it made for a really bad experience for anyone trying to query the database at the same time.
- There is an option to specify multiple parent tasks, but it simply didn't work.
And there were dozens of other less important, but still annoying, issues. Frankly, it looks like Chronos is no longer under intensive development - or maybe it has reached perfection and we just haven't noticed :-)
So, to kill two birds with one stone, we decided to move to YARN (as a replacement for Mesos) and to Airflow (as a replacement for Chronos) at once. Configuring YARN is not the simplest task, especially if your goal is a cluster that runs both Spark and Hadoop jobs (Hadoop is still required for indexing data into Druid). But after reading numerous articles and looking at how Amazon EMR is configured, we arrived at a working YARN cluster with HA that lets Spark and Hadoop fully utilize resources via dynamic allocation. Needless to say, no jobs get stuck anymore.
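For the curious, the Spark side of dynamic allocation boils down to a handful of settings. The sketch below is illustrative rather than our exact production config (the app name and executor counts are made up); the external shuffle service is what keeps shuffle blocks available when executors are released, which is the usual source of the "lost shuffle block" errors we kept hitting on Mesos.

```python
from pyspark import SparkConf, SparkContext

# Minimal sketch of dynamic allocation settings; values are illustrative.
# The master is typically passed at submit time, e.g. spark-submit --master yarn.
conf = (
    SparkConf()
    .setAppName("daily-report")                       # hypothetical job name
    .set("spark.dynamicAllocation.enabled", "true")   # let Spark grow/shrink the executor pool
    .set("spark.shuffle.service.enabled", "true")     # external shuffle service serves blocks of removed executors
    .set("spark.dynamicAllocation.minExecutors", "2")
    .set("spark.dynamicAllocation.maxExecutors", "50")
)

sc = SparkContext(conf=conf)
```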
As for Airflow, the most important advantages over Chronos for us are:
- Tasks in Airflow are defined programmatically, so it's easier to generate dynamic workflows when we want to rebuild some data for a given time frame (see the sketch after this list).
- Airflow is written in Python, so it's really easy to hack and adapt to your needs, or at least to understand why something doesn't work as expected.
- Multiple parents are allowed.
- It's more stable.
- It's in active development.
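To make the first and third points concrete, here is a minimal sketch of the kind of DAG we mean. The task names, scripts and schedule are made up for illustration (and the import path reflects the Airflow 1.x era); the point is that a loop generates the per-report tasks, and a task can depend on several parents at once.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical defaults; dates, retries and owner are illustrative only.
default_args = {
    "owner": "data-team",
    "start_date": datetime(2016, 1, 1),
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

dag = DAG("daily_reports", default_args=default_args, schedule_interval="@daily")

# Parent tasks that prepare the day's data (hypothetical scripts).
load_events = BashOperator(
    task_id="load_events",
    bash_command="spark-submit load_events.py {{ ds }}",
    dag=dag,
)
load_users = BashOperator(
    task_id="load_users",
    bash_command="spark-submit load_users.py {{ ds }}",
    dag=dag,
)

# Workflows are generated programmatically: one task per report type.
for report in ["retention", "cohorts", "activity", "ltv"]:
    report_task = BashOperator(
        task_id="report_%s" % report,
        bash_command="spark-submit report.py %s {{ ds }}" % report,
        dag=dag,
    )
    # Multiple parents are allowed - something Chronos promised but never delivered for us.
    report_task.set_upstream([load_events, load_users])
```

Because the DAG is just Python, rebuilding a time range is a matter of parameterizing the same loop instead of hand-editing job definitions.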
We are still testing this combination of YARN and Airflow, but so far it works much, much better.
Enjoy our production workflow screenshot as a complement to this post :)