Schedulers – some notes

Networking: CSFQ, WFQ, DRR, WF2Q, SFQ, Proportional Fair, I-CSDPS

OS: Linux CFS, proportional sharing, lottery

Datacenters/GP Cluster: Hadoop ecosystem (Presto-Cloudera-YARN-Spark) fair scheduler, capacity scheduler, Quincy, LSF, Condor (ignoring the MPI folks)
Criteria
Scalability. (response time, number of machines)
Flexibility (heterogeneous mix of jobs)
Usability/grokability
Isolation – Fault isolation, Resource Isolation
Utilization (achieve high cluster resource utilization, e.g., CPU utilization, memory utilization) – Balance the hosts – Meet the constraints of the host

Service or Batch Jobs?

** Who (or which process) dominates, and which resource gets priority?
** How to catch cheaters?
** How do you pre-empt?
** In case of multiple schedulers – make them aware of each other – shared state to avoid one scheduler/workload dominating
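
One way to make the "who dominates" question concrete is Dominant Resource Fairness (DRF, which the X-Flex entry in the references positions itself against). The sketch below is my own minimal version, not code from any of the systems listed: each user's dominant share is the largest fraction of any one resource they hold, and the next task goes to the user with the smallest dominant share. The names `dominant_share` and `drf_allocate`, and the rule of dropping a user as soon as one of their tasks no longer fits, are assumptions made for illustration.

```python
from typing import Dict


def dominant_share(usage: Dict[str, float], capacity: Dict[str, float]) -> float:
    """Largest fraction of any single resource that this user currently holds."""
    return max(usage[r] / capacity[r] for r in capacity)


def drf_allocate(capacity: Dict[str, float],
                 demands: Dict[str, Dict[str, float]],
                 max_tasks: int = 1000) -> Dict[str, int]:
    """Hand out one task at a time to the user with the smallest dominant share.

    capacity: total amount of each resource in the cluster.
    demands:  per-user, per-task resource requirement.
    Returns the number of tasks launched per user.
    """
    used = {r: 0.0 for r in capacity}
    usage = {u: {r: 0.0 for r in capacity} for u in demands}
    tasks = {u: 0 for u in demands}
    active = set(demands)

    while active and sum(tasks.values()) < max_tasks:
        # sorted() makes tie-breaking deterministic (alphabetical by user name).
        user = min(sorted(active),
                   key=lambda u: dominant_share(usage[u], capacity))
        need = demands[user]
        if any(used[r] + need[r] > capacity[r] for r in capacity):
            # Simplification: a user whose next task does not fit is dropped.
            active.discard(user)
            continue
        for r in capacity:
            used[r] += need[r]
            usage[user][r] += need[r]
        tasks[user] += 1
    return tasks


if __name__ == "__main__":
    cluster = {"cpu": 9, "mem": 18}
    per_task = {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}}
    print(drf_allocate(cluster, per_task))
```

With a 9-CPU, 18-GB cluster, user A (memory-heavy) and user B (CPU-heavy) end up with equal dominant shares of two thirds each, even though they hold very different amounts of CPU and memory.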

Types

Global Scheduler needs to have state
Policies + resource availability
How much do we know about job/tasks
Job requirements (throughput, response time?, availability)
Job plan (DAG of tasks or what?, I/O needs, user affinity)
Estimates of duration?, input size?, TX
Single vs Multiple scheduler agents + cluster state replicated into the nodes
Monolithic platform: LSF, Maui, Moab (HPC community)
Multi-step – Partition resources or dynamically allocate them (Mesos NSDI 2011) – can reject the offer
*** How long job has to wait
*** Fair sharing ()
Partition and resolve dependencies (Omega EuroSys 2013)
Issue – Upgrade/patching of scheduler or substrate
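
To illustrate the multi-step idea noted above (the resource manager offers, the framework decides, and it "can reject the offer"), here is a toy sketch. It is not the Mesos API: `Framework`, `consider` and `offer_round` are hypothetical names, and the model tracks CPUs only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Framework:
    """Hypothetical second-level scheduler that reacts to resource offers."""
    name: str
    cpus_per_task: float
    pending: int
    launched: int = 0

    def consider(self, offered_cpus: float) -> float:
        """Return the CPUs accepted from the offer; 0.0 means it is rejected."""
        if self.pending == 0 or offered_cpus < self.cpus_per_task:
            return 0.0
        n = min(self.pending, int(offered_cpus // self.cpus_per_task))
        self.pending -= n
        self.launched += n
        return n * self.cpus_per_task


def offer_round(free_cpus: float, frameworks: List[Framework]) -> float:
    """Offer the free CPUs to each framework in turn; return what is left."""
    for fw in frameworks:
        if free_cpus <= 0:
            break
        free_cpus -= fw.consider(free_cpus)
    return free_cpus


if __name__ == "__main__":
    big = Framework("big", cpus_per_task=4, pending=1)
    small = Framework("small", cpus_per_task=1, pending=2)
    left = offer_round(3.0, [big, small])
    print(big.launched, small.launched, left)
```

The point of the split is that the first level only needs to know what is free, while each framework keeps its own placement logic. The cost shows up in the questions listed above: a framework with large tasks may wait a long time for an offer that fits.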

References

BarrelFish http://www.barrelfish.org/peter-phd-multicore-resource-mgmt.pdf

Tetris – http://research.microsoft.com/en-us/UM/redmond/projects/tetris/index.html


Corona – https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920
(fair-share scheduling)
“This scheduler is able to provide better fairness guarantees because it has access to the full snapshot of the cluster and jobs when making scheduling decisions. It also provides better support for multi-tenant usage by providing the ability to group the scheduler pools into pool groups. A pool group can be assigned to a team that can then in turn manage the pools within its pool group. The pool group concept gives every team fine-grained control over their assigned resource allocation.”

Docker (Fleet/Citadel/Sampi/Mesos)
sampi – https://github.com/mshamber/sampi/blob/master/example/scheduler.go
citadel – https://github.com/citadel/citadel/blob/master/scheduler/resource_manager.go

Docker Swarm
https://github.com/docker/swarm/tree/master/scheduler/strategy (CPU|RAM vs random !)
https://github.com/docker/swarm/tree/master/scheduler/filter (what are the constraints – same storage, same area, some tags?)
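
The filter-then-strategy split in the two links above can be sketched roughly as follows. This is not Swarm's code, just my reading of the general pattern: filters drop nodes that cannot satisfy the constraints, then a strategy ranks the survivors by free resources or picks one at random. `Node`, the label keys and all function names are hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    name: str
    free_cpus: float
    free_mem_mb: int
    labels: Dict[str, str] = field(default_factory=dict)


def filter_nodes(nodes: List[Node], cpus: float, mem_mb: int,
                 constraints: Dict[str, str]) -> List[Node]:
    """Keep nodes with enough free resources whose labels match every constraint."""
    return [n for n in nodes
            if n.free_cpus >= cpus
            and n.free_mem_mb >= mem_mb
            and all(n.labels.get(k) == v for k, v in constraints.items())]


def pick_most_free(candidates: List[Node]) -> Node:
    """Resource-based strategy: most free memory first, free CPUs as tie-breaker."""
    return max(candidates, key=lambda n: (n.free_mem_mb, n.free_cpus))


def pick_random(candidates: List[Node]) -> Node:
    """Random strategy: ignore load entirely."""
    return random.choice(candidates)


def place(nodes: List[Node], cpus: float, mem_mb: int,
          constraints: Dict[str, str],
          strategy: Callable[[List[Node]], Node] = pick_most_free) -> Optional[Node]:
    """Run the filters, then let the strategy choose; None if nothing fits."""
    candidates = filter_nodes(nodes, cpus, mem_mb, constraints)
    return strategy(candidates) if candidates else None


if __name__ == "__main__":
    cluster = [
        Node("a", 2, 1024, {"storage": "ssd"}),
        Node("b", 4, 4096, {"storage": "ssd"}),
        Node("c", 8, 8192, {"storage": "hdd"}),
    ]
    chosen = place(cluster, cpus=1, mem_mb=512, constraints={"storage": "ssd"})
    print(chosen.name if chosen else "no node fits")
```

Keeping filters and strategies separate means a constraint such as "same storage" never has to know how nodes are ranked, and the ranking can be swapped without touching the constraints.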

Mesos (http://mesos.berkeley.edu/mesos_tech_report.pdf)
https://github.com/apache/mesos/blob/master/include/mesos/scheduler.hpp
containerization – https://github.com/apache/mesos/blob/master/src/slave/containerizer/linux_launcher.cpp
Filtering (nw |fs – ) https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp
**** MesosContainerizerProcess::isolate
(strings::contains(isolation, "cgroups") ||
strings::contains(isolation, "network/port_mapping") ||
strings::contains(isolation, "filesystem/shared") ||
strings::contains(isolation, "namespaces"))
**** process::reap
**** executorEnvironment (
MesosContainerizerProcess::_launch
MesosContainerizer::recover
Isolation –  https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp
(Fair scheduler dependent on the co-ordination)
* Mesos/Yarn resource managers have a master-slave architecture. Is it just me, or have they both adopted an SMPD MPI rank-x style of job control? Sort of a push (Mesos) and pull (Yarn) model.

Others
Slurm – http://slurm.schedmd.com/ seems to be deployed for v. large clusters. – http://slurm.schedmd.com/gang_scheduling.html – the interactivity is pretty cool
Torque – http://www.adaptivecomputing.com/products/open-source/torque/ (http://www.pbsworks.com/)

Profiling – http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36575.pdf

Yarn – Apache Hadoop YARN: Yet Another Resource Negotiator. In Proceedings of SoCC, 2013.
Condor – http://research.cs.wisc.edu/htcondor/
Quincy – http://research.microsoft.com/apps/pubs/default.aspx?id=81516
Vector Bin Packing – http://research.microsoft.com/apps/pubs/default.aspx?id=147927
Proactive-Inria – http://proactive.activeeon.com/
Omega – http://research.google.com/pubs/pub41684.html
Clustera – http://research.cs.wisc.edu/clustera/
Dryad – http://research.microsoft.com/en-us/projects/dryad/


Reservation-based scheduling – http://research.microsoft.com/pubs/204122/rayon.pdf
X-Flex – Alternative to DRF – http://viswa.engin.umich.edu/wp-content/uploads/sites/169/2014/08/X-FlexMW14.pdf
Stoica – http://www.cs.cmu.edu/~istoica/csfq/
WF2Q – http://spazioinwind.libero.it/andreozzi/thesis/resources/schedulers/WF2Q.pdf
OS
VTRR – https://www.usenix.org/legacy/event/usenix01/full_papers/nieh/nieh.pdf (O(1) in less than 100 lines of code? )
– Order the clients in the run queue from largest to smallest share
Lottery – https://www.usenix.org/legacy/publications/library/proceedings/osdi/full_papers/waldspurger.pdf (randomized – based on the draw of a ticket for the client)
Pegasus – http://www.utdallas.edu/~cxl137330/courses/spring14/AdvRTS/protected/slides/17.pdf
WFQ – Clients are ordered in a queue sorted from smallest to largest Virtual finish time
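
The WFQ ordering described above can be sketched in a few lines. This is a simplified version under one loud assumption: every packet is already queued at time zero, so the virtual-clock term drops out and a packet's virtual finish time is just the previous finish time of its flow plus length / weight. `wfq_order` is a made-up name, and a real WFQ implementation also has to track virtual time as flows come and go.

```python
import heapq
from typing import Dict, List, Tuple


def wfq_order(packets: List[Tuple[str, int]],
              weights: Dict[str, float]) -> List[Tuple[str, int, float]]:
    """Order packets by increasing virtual finish time.

    packets: (flow, length) pairs in arrival order, all backlogged at t = 0.
    weights: share of the link assigned to each flow.
    Returns (flow, length, virtual_finish_time) in service order.
    """
    last_finish = {flow: 0.0 for flow in weights}
    heap = []
    for seq, (flow, length) in enumerate(packets):
        # Assumption: no idle periods, so finish = previous finish + length / weight.
        finish = last_finish[flow] + length / weights[flow]
        last_finish[flow] = finish
        # seq breaks ties in arrival order.
        heapq.heappush(heap, (finish, seq, flow, length))

    order = []
    while heap:
        finish, _, flow, length = heapq.heappop(heap)
        order.append((flow, length, finish))
    return order


if __name__ == "__main__":
    queued = [("b", 100), ("b", 100), ("a", 100), ("a", 100)]
    for flow, length, finish in wfq_order(queued, {"a": 2, "b": 1}):
        print(flow, length, finish)
```

Flow "a" has twice the weight of "b", so its packets get smaller finish times and are interleaved ahead of the second "b" packet even though both "b" packets arrived first.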

Encyclopedia – http://www.cs.huji.ac.il/~feit/papers/SchedSurvey97TR.pdf
Time-stamped schedulers – comparison – http://www.cse.iitk.ac.in/users/bpssahoo/paper/19_A_Comparative_Analysis_of_Time_Stamped__.pdf

Virtualization
VMware – http://www.vmware.com/resources/techresources/10345 (gang or co-scheduling)
– related – http://www.cs.huji.ac.il/~feit/papers/FlexCosched03IPDPS.pdf
HyperV – http://blogs.msdn.com/b/virtual_pc_guy/archive/2011/02/14/hyper-v-cpu-scheduling-part-1.aspx
Xen – http://www.hpl.hp.com/personal/Lucy_Cherkasova/papers/per-3sched-xen.pdf

 
