Networking: CSFQ, WFQ, DRR, WF2Q, SFQ, Proportional Fair, I-CSDPS
OS: Linux CFS, proportional sharing, lottery scheduling
Datacenters/general-purpose clusters: Hadoop ecosystem (Presto, Cloudera, YARN, Spark) fair scheduler, capacity scheduler, Quincy, LSF, Condor (ignore the MPI folks)
Criteria
Scalability (response time, number of machines)
Flexibility (heterogeneous mix of jobs)
Usability/grokability
Isolation – Fault isolation, Resource Isolation
Utilization (achieve high cluster resource utilization, e.g. CPU and memory utilization) – balance the hosts – meet per-host constraints
Service or Batch Jobs?
** Who (which user/process) dominates, and which resource to give priority to? (see the dominant-share sketch after this list)
** How to catch cheaters?
** How do you preempt?
** In case of multiple schedulers – make them aware of each other – shared state to avoid one scheduler/workload dominating
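A minimal sketch of the "who dominates" question in DRF terms (DRF is the baseline the X-Flex link below argues against; all names here are mine, assuming a two-resource cluster): compute each user's dominant share and hand the next offer to the user with the smallest one.

package main

import "fmt"

// Alloc maps a resource name to an amount (hypothetical type).
type Alloc map[string]float64

// dominantShare is the max over resources of used/capacity --
// the quantity DRF tries to equalize across users.
func dominantShare(alloc, capacity Alloc) float64 {
	share := 0.0
	for r, used := range alloc {
		if s := used / capacity[r]; s > share {
			share = s
		}
	}
	return share
}

// pickNext answers "who gets priority": the user with the
// smallest dominant share.
func pickNext(users map[string]Alloc, capacity Alloc) string {
	best, bestShare := "", 2.0 // dominant shares live in [0,1]
	for u, alloc := range users {
		if s := dominantShare(alloc, capacity); s < bestShare {
			best, bestShare = u, s
		}
	}
	return best
}

func main() {
	capacity := Alloc{"cpu": 9, "mem": 18}
	users := map[string]Alloc{
		"A": {"cpu": 2, "mem": 8}, // dominant share: mem, 8/18 ≈ 0.44
		"B": {"cpu": 6, "mem": 2}, // dominant share: cpu, 6/9 ≈ 0.67
	}
	fmt.Println(pickNext(users, capacity)) // A gets the next offer
}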
Types
Global Scheduler needs to have state
Policies + resource availability
How much do we know about the job/tasks? (a strawman record follows this list)
Job requirements (throughput, response time, availability)
Job plan (DAG of tasks or something else?, I/O needs, user affinity)
Estimates of duration? Input size? TX?
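As a strawman, the "what do we know" bullets could be a record like this (field names are mine, not from any particular system):

package sched

import "time"

// JobInfo is a hypothetical record of what a global scheduler might
// know about a job up front; most fields map to the bullets above.
type JobInfo struct {
	Throughput   float64       // required tasks/sec, if stated
	ResponseTime time.Duration // latency target, if stated
	Availability float64       // e.g. 0.999
	TaskDAG      map[int][]int // task -> tasks it depends on
	IONeeds      int64         // expected bytes of I/O
	UserAffinity []string      // preferred hosts/racks
	EstDuration  time.Duration // an estimate only, often wrong
	InputSize    int64         // bytes
}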
Single vs Multiple scheduler agents + cluster state replicated into the nodes
Monolithic platform – LSF, Maui, Moab (HPC community)
Multi-step – partition resources or dynamically allocate them (Mesos, NSDI 2011) – frameworks can reject the offer (sketch after these bullets)
*** How long a job has to wait
*** Fair sharing
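The shape of that two-level exchange, as I read the Mesos tech report (types invented here; the real interface is the C++ Scheduler class linked under the Mesos entry below): the master pushes offers, and the framework either launches tasks against them or declines.

package sched

// Offer is a bundle of resources on one host (hypothetical type).
type Offer struct {
	Host string
	CPUs float64
	Mem  float64
}

// Framework is the second level of the two-level design.
type Framework interface {
	// ResourceOffers launches tasks against the offers it likes and
	// declines the rest -- declining is how "can reject the offer"
	// shows up in the protocol, and a declined offer goes back into
	// the master's pool for other frameworks.
	ResourceOffers(offers []Offer) (accepted, declined []Offer)
}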
Partition and resolve conflicts via shared state (Omega, EuroSys 2013) – commit sketch below
Issue – Upgrade/patching of scheduler or substrate
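And a sketch of the Omega-style alternative (invented types): every scheduler works against a private copy of the shared cell state and tries to commit its claim back optimistically; a version mismatch means another scheduler won the race, so the loser re-syncs and retries.

package sched

import "sync"

// CellState is the shared cluster state all schedulers see.
type CellState struct {
	mu      sync.Mutex
	Version int64
	Free    map[string]float64 // host -> free CPUs
}

// Claim is one scheduler's proposed allocation, made against the
// state version it last read.
type Claim struct {
	BaseVersion int64
	Host        string
	CPUs        float64
}

// Commit applies the claim only if nobody else committed first and
// the resources are still there; false tells the caller to retry.
func (c *CellState) Commit(cl Claim) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cl.BaseVersion != c.Version || c.Free[cl.Host] < cl.CPUs {
		return false
	}
	c.Free[cl.Host] -= cl.CPUs
	c.Version++
	return true
}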
References
BarrelFish http://www.barrelfish.org/peter-phd-multicore-resource-mgmt.pdf
Tetris – http://research.microsoft.com/en-us/UM/redmond/projects/tetris/index.html
Corona – https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920
(fair-share scheduling)
“This scheduler is able to provide better fairness guarantees because it has access to the full snapshot of the cluster and jobs when making scheduling decisions. It also provides better support for multi-tenant usage by providing the ability to group the scheduler pools into pool groups. A pool group can be assigned to a team that can then in turn manage the pools within its pool group. The pool group concept gives every team fine-grained control over their assigned resource allocation.”
Docker (Fleet/Citadel/Sampi/Mesos)
sampi – https://github.com/mshamber/sampi/blob/master/example/scheduler.go
citadel – https://github.com/citadel/citadel/blob/master/scheduler/resource_manager.go
Docker Swarm
https://github.com/docker/swarm/tree/master/scheduler/strategy (CPU|RAM bin-packing vs. random!)
https://github.com/docker/swarm/tree/master/scheduler/filter (what are the constraints – same storage, same area, some tags?) – sketch below
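Rough shape of Swarm's filter/strategy split (types invented here, not Swarm's actual Go API): filters prune nodes that violate the constraints, then a strategy ranks whatever is left -- bin-packing by CPU/RAM, or just picking at random.

package sched

// Node is a candidate host (hypothetical type).
type Node struct {
	ID      string
	FreeCPU float64
	FreeMem float64
	Labels  map[string]string // storage, area, tags, ...
}

// filter keeps only the nodes satisfying a constraint.
func filter(nodes []Node, ok func(Node) bool) []Node {
	var out []Node
	for _, n := range nodes {
		if ok(n) {
			out = append(out, n)
		}
	}
	return out
}

// binpack prefers the fullest node that still fits, filling machines
// up before spilling onto fresh ones (the CPU|RAM strategy); a
// "random" strategy would pick any element instead.
// Assumes nodes is non-empty after filtering.
func binpack(nodes []Node) Node {
	best := nodes[0]
	for _, n := range nodes[1:] {
		if n.FreeCPU+n.FreeMem < best.FreeCPU+best.FreeMem {
			best = n
		}
	}
	return best
}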
Mesos (http://mesos.berkeley.edu/mesos_tech_report.pdf)
https://github.com/apache/mesos/blob/master/include/mesos/scheduler.hpp
containerization – https://github.com/apache/mesos/blob/master/src/slave/containerizer/linux_launcher.cpp
Filtering (network / filesystem) – https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp
**** MesosContainerizerProcess::isolate
(strings::contains(isolation, "cgroups") ||
strings::contains(isolation, "network/port_mapping") ||
strings::contains(isolation, "filesystem/shared") ||
strings::contains(isolation, "namespaces"))
**** process::reap
**** executorEnvironment (
MesosContainerizerProcess::_launch
MesosContainerizer::recover
Isolation – https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp
(Fair scheduling is dependent on the coordination)
* Mesos/YARN resource managers have a master-slave architecture. Is it me, or have they both adopted an SPMD, MPI rank-x style of job control? Sort of a push (Mesos) vs. pull (YARN) model – see the sketch below.
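The pull side of that contrast, YARN-style (invented types, not the real AMRMClient API): the per-job application master asks the resource manager for containers and picks up grants on its next heartbeat, instead of waiting to be offered anything.

package sched

// Ask describes containers the application master wants.
type Ask struct {
	CPUs  int
	MemMB int
	Count int
}

// Container is a granted slice of a host.
type Container struct {
	Host  string
	CPUs  int
	MemMB int
}

// ResourceManager is the single master the application masters pull
// from; Allocate doubles as the heartbeat: send outstanding asks,
// receive whatever was granted since the last call.
type ResourceManager interface {
	Allocate(asks []Ask) (granted []Container)
}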
Others
Slurm – http://slurm.schedmd.com/ – seems to be deployed for very large clusters. http://slurm.schedmd.com/gang_scheduling.html – the interactivity is pretty cool
Torque – http://www.adaptivecomputing.com/products/open-source/torque/ (http://www.pbsworks.com/)
Profiling – http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36575.pdf
Yarn – Apache Hadoop YARN: Yet Another Resource Negotiator. In Proceedings of SoCC, 2013.
Condor – http://research.cs.wisc.edu/htcondor/
Quincy – http://research.microsoft.com/apps/pubs/default.aspx?id=81516
Vector Bin Packing – http://research.microsoft.com/apps/pubs/default.aspx?id=147927
Proactive-Inria – http://proactive.activeeon.com/
Omega – http://research.google.com/pubs/pub41684.html
Clustera – http://research.cs.wisc.edu/clustera/
Dryad – http://research.microsoft.com/en-us/projects/dryad/
Reservation-based scheduling (Rayon) – http://research.microsoft.com/pubs/204122/rayon.pdf
X-Flex – Alternative to DRF – http://viswa.engin.umich.edu/wp-content/uploads/sites/169/2014/08/X-FlexMW14.pdf
CSFQ (Stoica) – http://www.cs.cmu.edu/~istoica/csfq/
WF2Q – http://spazioinwind.libero.it/andreozzi/thesis/resources/schedulers/WF2Q.pdf
OS
VTRR – https://www.usenix.org/legacy/event/usenix01/full_papers/nieh/nieh.pdf (O(1) in less than 100 lines of code?)
– Orders the clients in the run queue from largest to smallest share
Lottery – https://www.usenix.org/legacy/publications/library/proceedings/osdi/full_papers/waldspurger.pdf (randomized – based on a draw of tickets for the client)
Pegasus – http://www.utdallas.edu/~cxl137330/courses/spring14/AdvRTS/protected/slides/17.pdf
WFQ – clients are ordered in a queue sorted from smallest to largest virtual finish time (sketch at the end of this list)
Encyclopedia – http://www.cs.huji.ac.il/~feit/papers/SchedSurvey97TR.pdf
Time-stamped scheduler comparison – http://www.cse.iitk.ac.in/users/bpssahoo/paper/19_A_Comparative_Analysis_of_Time_Stamped__.pdf
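Since VTRR and WFQ above both boil down to an ordering rule, here is a minimal WFQ-for-CPU sketch of that rule (names are mine): each client carries a virtual finish time, the scheduler always runs the smallest one, and a client that runs is charged quantum/weight.

package sched

// Client is a schedulable entity with a weight (its share).
type Client struct {
	Name   string
	Weight float64
	VFT    float64 // virtual finish time
}

// next picks the client with the smallest virtual finish time and
// advances its VFT as if it ran for one quantum; heavier weights
// advance more slowly, so those clients get picked more often.
// Assumes clients is non-empty.
func next(clients []Client, quantum float64) *Client {
	best := &clients[0]
	for i := 1; i < len(clients); i++ {
		if clients[i].VFT < best.VFT {
			best = &clients[i]
		}
	}
	best.VFT += quantum / best.Weight
	return best
}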
Virtualization
VMware – http://www.vmware.com/resources/techresources/10345 (gang or co-scheduling; sketch at the end of this section)
– related – http://www.cs.huji.ac.il/~feit/papers/FlexCosched03IPDPS.pdf
HyperV – http://blogs.msdn.com/b/virtual_pc_guy/archive/2011/02/14/hyper-v-cpu-scheduling-part-1.aspx
Xen – http://www.hpl.hp.com/personal/Lucy_Cherkasova/papers/per-3sched-xen.pdf
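Gang/co-scheduling in one predicate (sketch with invented types; the VMware paper above covers both the strict form and the relaxed, skew-bounded form ESX actually uses): a VM's vCPUs run together or not at all, so it is runnable only when that many physical CPUs are free at once.

package sched

// VM is a virtual machine with some number of virtual CPUs.
type VM struct {
	Name  string
	VCPUs int
}

// canCoSchedule is the strict co-scheduling test: dispatch the VM
// only if all of its vCPUs can start simultaneously. The relaxed
// variant instead bounds the progress skew between sibling vCPUs.
func canCoSchedule(vm VM, freePCPUs int) bool {
	return vm.VCPUs <= freePCPUs
}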