At LakeTide, we constantly evaluate technologies that can deploy computation-heavy code to resource clusters, and DC/OS is one of the heavyweight contenders in the container orchestration space right now. Though it has a smaller community than Docker, Kubernetes, CoreOS and other ‘clustering’ technologies, it has a few tricks up its sleeve, and the Mesos scheduler is at the center of them.
Mesos began as a research project at UC Berkeley to create a fine-grained resource scheduler for datacenters. It abstracts the underlying hardware (CPUs, memory, GPUs, disk) and simply exposes it as a pool of resources. It contains primitives for writing distributed applications, such as message passing and task execution. In fact, Apache Spark was originally developed as a proof-of-concept for Mesos! What distinguishes Mesos from other schedulers is its two-level scheduling mechanism, called resource offers: Mesos decides which resources to offer to each framework, and the framework decides which offers to accept and which tasks to run on them. This allows existing frameworks like YARN, MPI, and others to share the same set of datacenter hardware while still making their own placement decisions, for example to achieve data locality.
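To make the resource-offer idea a bit more concrete, here is a toy simulation in Python. It is only a sketch of the concept, not how Mesos is implemented: the `Offer` and `Framework` classes and the `run_offer_cycle` function are names I made up for illustration, and the "master" here simply offers resources to frameworks in list order, whereas the real allocator uses a fairness policy to decide who gets the next offer.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    """Free resources on one agent, as advertised by the (toy) master."""
    agent: str
    cpus: float
    mem_mb: int


@dataclass
class Framework:
    """A toy framework scheduler holding a queue of identical tasks."""
    name: str
    task_cpus: float
    task_mem_mb: int
    pending: int
    launched: list = field(default_factory=list)

    def on_offer(self, offer: Offer) -> int:
        """Second level: the framework decides how many tasks to place on this offer."""
        fit = int(min(offer.cpus // self.task_cpus, offer.mem_mb // self.task_mem_mb))
        count = min(fit, self.pending)
        self.pending -= count
        self.launched.extend([offer.agent] * count)
        return count


def run_offer_cycle(agents: dict, frameworks: list) -> None:
    """First level: offer each agent's free resources to the frameworks in turn."""
    for agent, free in agents.items():
        for fw in frameworks:
            offer = Offer(agent, free["cpus"], free["mem_mb"])
            used = fw.on_offer(offer)
            # Whatever the framework accepted is no longer free for the next one.
            free["cpus"] -= used * fw.task_cpus
            free["mem_mb"] -= used * fw.task_mem_mb


# Hypothetical cluster: two agents shared by two frameworks.
agents = {
    "agent-1": {"cpus": 4.0, "mem_mb": 8192},
    "agent-2": {"cpus": 2.0, "mem_mb": 4096},
}
spark = Framework("spark", task_cpus=1.0, task_mem_mb=1024, pending=3)
mpi = Framework("mpi", task_cpus=2.0, task_mem_mb=2048, pending=2)
run_offer_cycle(agents, [spark, mpi])
print(spark.launched, mpi.launched, mpi.pending)
```

The point of the split is that the master never needs to understand what a Spark job or an MPI job looks like. It only hands out resources, and each framework applies its own placement logic to the offers it receives.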
DC/OS combines Mesos with Marathon to enable the ‘modern’ way of packaging and distributing code to a cluster of computers using Linux containers. This combo is similar in spirit to Google’s secretive Omega scheduler and Kubernetes orchestrator. Together with DNS, a load balancer, and service monitoring, DC/OS provides a pretty complete cluster management solution in a single package. The GUI provides a nice dashboard and lets you easily deploy and monitor jobs, but it is more for looks than for serious administration.
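For a feel of what "packaging and distributing" looks like in practice, Marathon takes a JSON app definition that says which container image to run and how many resources and instances it needs. The Python sketch below builds such a definition. The field names follow Marathon's app definition format as I understand it and may vary between versions, and the app id and image name are placeholders I picked for illustration; the helper `marathon_app` is my own, not part of any library.

```python
import json


def marathon_app(app_id: str, image: str, cpus: float, mem_mb: int, instances: int) -> dict:
    """Build a minimal Marathon-style app definition for a Docker container."""
    if instances < 0:
        raise ValueError("instances must not be negative")
    return {
        "id": app_id,            # path-like name of the app inside Marathon
        "cpus": cpus,            # CPU shares per instance
        "mem": mem_mb,           # memory per instance, in MB
        "instances": instances,  # how many copies Marathon should keep running
        "container": {
            "type": "DOCKER",
            "docker": {"image": image},
        },
    }


# Placeholder values: a hypothetical two-instance web front-end.
app = marathon_app("/demo/web", "nginx:1.25", cpus=0.5, mem_mb=256, instances=2)
print(json.dumps(app, indent=2))
```

The resulting JSON is the kind of document you would submit to Marathon (for example through its `/v2/apps` REST endpoint or the DC/OS tooling). Marathon then asks Mesos for matching resources and keeps the requested number of instances running.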

Having used DC/OS for product development at LakeTide for a few months now, my impression is that it is a bit more hands-on than other frameworks. Not quite as “turnkey”. Not quite as forgiving. That’s not necessarily a bad thing; in fact, I’d say it is that way by design. Perhaps the following diagram from Apache Mesos can help clarify:

In the image above, the first thing that sticks out to me as a former BSD kernel developer is the challenge of getting multiple resource schedulers (like YARN and MPI) to cooperate on the same hardware. You need to solve issues like priority inversion, synchronous I/O being interrupted at the wrong time, context switching latency, and locality of reference for data being stored and processed. Mesos was built to address these issues and more in the datacenter, where multiple clustering programs might want to use the same hardware, the same data, and run at the same time. It’s no coincidence that DC/OS stands for “datacenter operating system”.
The minimal DC/OS setup that would make sense in production is 7 nodes, but really, if your cluster is that small, it’s like killing a fly with a sledgehammer. If you want to do things like set up a containerized website with a load balancer, two web front-ends, and a MongoDB backend, then you might be better off with just a container orchestrator. If you are developing a cluster computing application (like Spark), or you are trying to homogenize your compute resources to allow multi-tenancy and resource sharing in your datacenter, then DC/OS is a proven, enterprise-ready framework with commercial backing from Mesosphere. Microsoft Azure provides DC/OS as a managed service, which can be a nice way to try it out before deploying it on your own resources.