MiniKube
Running on a local MiniKube cluster
Flux Cloud (as of version 0.1.0) can run on MiniKube! There are two primary use cases for using flux-cloud:
apply is good for many larger experiments that require different container bases and / or take a longer time to run.
submit is good for smaller experiments that might use the same container bases and / or take a shorter time to run.
For the latter (submit) we will bring up the minimum number of MiniClusters required (one per unique combination of container image and size) and launch all jobs across them, using Flux as a scheduler. As of version 0.2.0, both commands use the fluxoperator Python SDK, so bash scripts are only used to bring cloud-specific clusters up and down.
Pre-requisites
You should first install minikube and kubectl.
Run Experiments
Let’s start with a simple experiments.yaml
file, where we have defined a number of different
experiments to run on MiniKube. flux-cloud submit
relies entirely on this experiment file,
and programmatically generates the MiniCluster custom resource definitions
for you, so you don’t need to provide any kind of template.
How does it work?
A YAML file (such as experiments.yaml) can be serialized to JSON, so each section under “jobs” is also JSON, or (in Python) a dictionary of values. Since the values are passed to the Flux Operator Python SDK, we can map them easily according to the following convention. Let’s say we have a job in the experiments listing:
jobs:
  # This is the start of the named job
  reaxc-hns:
    # These are attributes for the MiniCluster (minus repeats)
    command: 'lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite'
    image: ghcr.io/rse-ops/lammps:flux-sched-focal-v0.24.0
    repeats: 5
    working_dir: /home/flux/examples/reaxff/HNS
The content under the job name “reaxc-hns” would be mapped to the MiniCluster container as follows:
from fluxoperator.models import MiniClusterContainer

container = MiniClusterContainer(
    image="ghcr.io/rse-ops/lammps:flux-sched-focal-v0.24.0",
    working_dir="/home/flux/examples/reaxff/HNS",
    command="lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite",
    run_flux=True,
)
Note that in the above, since Go is in camel case and the Python SDK turns it into snake case, workingDir is changed to working_dir.
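The mapping described above can be sketched in plain Python. This is a minimal illustration and not flux-cloud's actual implementation: the helper names `to_snake_case` and `job_to_container_kwargs` are hypothetical, and treating `repeats` as the only key that does not belong to the container is an assumption taken from the comment in the example job.

```python
import re

# Keys that describe how the job is run rather than the container itself.
# Assumption: only "repeats" falls in this group, per the example job above.
NON_CONTAINER_KEYS = {"repeats"}


def to_snake_case(name):
    """Convert a camelCase name (Go style) to snake_case (Python SDK style)."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def job_to_container_kwargs(job):
    """Return container keyword arguments built from one job section."""
    kwargs = {
        to_snake_case(key): value
        for key, value in job.items()
        if key not in NON_CONTAINER_KEYS
    }
    # The example container above sets run_flux=True.
    kwargs.setdefault("run_flux", True)
    return kwargs


job = {
    "command": "lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite",
    "image": "ghcr.io/rse-ops/lammps:flux-sched-focal-v0.24.0",
    "repeats": 5,
    "workingDir": "/home/flux/examples/reaxff/HNS",
}
kwargs = job_to_container_kwargs(job)
print(kwargs)
```

With the fluxoperator package installed, the resulting dictionary could be passed on as `MiniClusterContainer(**kwargs)`.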
Let’s start with this set of experiments. Note that we’ve provided the same container for all of them, meaning that we will only be creating one MiniCluster with that container. If you provide jobs with separate containers, they will be brought up as separate clusters to run (per each unique container, with all jobs matched to it).
# This is intended for MiniKube, so no machine needed.
# We will create a MiniKube cluster of size 2
matrix:
  size: [2]
# Flux Mini Cluster experiment attributes
minicluster:
  name: submit-jobs
  namespace: flux-operator
  # Each of these sizes will be brought up and have commands run across it
  size: [2]
# Each of command and image are required to do a submit!
jobs:
  reaxc-hns:
    command: 'lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite'
    image: ghcr.io/rse-ops/lammps:flux-sched-focal-v0.24.0
    repeats: 5
    working_dir: /home/flux/examples/reaxff/HNS
  sleep:
    command: 'sleep 5'
    image: ghcr.io/rse-ops/lammps:flux-sched-focal-v0.24.0
    repeats: 5
    working_dir: /home/flux/examples/reaxff/HNS
  hello-world:
    command: 'echo hello world'
    image: ghcr.io/rse-ops/lammps:flux-sched-focal-v0.24.0
    repeats: 5
    working_dir: /home/flux/examples/reaxff/HNS
Each experiment is defined by the matrix and variables in an experiments.yaml, as shown above.
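To make the grouping concrete, here is a minimal sketch of how jobs could be matched to MiniClusters and expanded by `repeats`. The function `plan_submit` is hypothetical, and the name pattern is inferred from the job names flux-cloud prints (for example `reaxc-hns-1-minicluster-size-2`); the real implementation may differ.

```python
from collections import defaultdict


def plan_submit(jobs, sizes):
    """Group jobs by (image, size) and expand each job by its repeats.

    Returns a dict mapping (image, size) to a list of (name, command) pairs.
    Each key stands for one MiniCluster that would be created.
    """
    plan = defaultdict(list)
    for size in sizes:
        for job_name, job in jobs.items():
            for i in range(1, job.get("repeats", 1) + 1):
                name = f"{job_name}-{i}-minicluster-size-{size}"
                plan[(job["image"], size)].append((name, job["command"]))
    return dict(plan)


IMAGE = "ghcr.io/rse-ops/lammps:flux-sched-focal-v0.24.0"
jobs = {
    "reaxc-hns": {
        "command": "lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite",
        "image": IMAGE,
        "repeats": 5,
    },
    "sleep": {"command": "sleep 5", "image": IMAGE, "repeats": 5},
    "hello-world": {"command": "echo hello world", "image": IMAGE, "repeats": 5},
}

plan = plan_submit(jobs, sizes=[2])
for (image, size), submissions in plan.items():
    print(f"MiniCluster size {size} with {image}: {len(submissions)} jobs")
```

Because the three jobs share one image and one size, this plan contains a single MiniCluster with 15 submissions.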
Note that the easiest way to get started is to use an existing example, or run:
$ flux-cloud experiment init --cloud minikube
In the example above, we are targeting minikube.
Apply / Run
Ideal if you need to run multiple jobs on different containers
This apply/run workflow will create a new MiniCluster each time (pods up and down) and not use Flux as a scheduler proper. A workflow might look like:
$ flux-cloud up --cloud minikube
$ flux-cloud apply --cloud minikube
$ flux-cloud down --cloud minikube
Or achieve all three with:
$ flux-cloud run --cloud minikube
Let’s run this with our experiments.yaml above in the present working directory, after having already run up:
# Also print output to the terminal (so you can watch!)
$ flux-cloud --debug apply --cloud minikube
# Only save output to output files
$ flux-cloud apply --cloud minikube
At the end of the run, you’ll have an organized output directory with all of your output logs, along with saved metadata about the minicluster, pods, and nodes.
Submit
Ideal for one or more commands across one or more containers and MiniCluster sizes
The idea behind a submit is that we are going to create the minimal number of MiniClusters you need (across the set of unique sizes and images) and then submit all jobs to Flux within the MiniCluster. The submit mode is actually using Flux as a scheduler and not just a “one job” running machine. A basic submit workflow using the config above might look like this:
$ flux-cloud up --cloud minikube
$ flux-cloud submit --cloud minikube
$ flux-cloud down --cloud minikube
Instead of running one job at a time and waiting for its output (as apply does), we submit all the jobs and then poll every 30 seconds to get job statuses.
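The submit-then-poll pattern can be sketched as follows. This is only an illustration under stated assumptions: `get_states` is a hypothetical callable standing in for whatever queries the MiniCluster, the `COMPLETED` state comes from the output in this guide, and the other terminal states are assumed for the example.

```python
import time

# Assumption: "COMPLETED" appears in the flux-cloud output in this guide;
# the other terminal states are included for illustration only.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELED"}


def wait_for_jobs(get_states, interval=30, sleep=time.sleep):
    """Poll until no job is active and return the final {job: state} mapping.

    get_states is a hypothetical callable that returns {job_name: state}.
    sleep is injectable so the loop can be exercised without waiting.
    """
    while True:
        states = get_states()
        active = [n for n, s in states.items() if s not in TERMINAL_STATES]
        print(f"{len(active)} are active.")
        if not active:
            return states
        sleep(interval)


# Canned responses standing in for a real MiniCluster.
responses = iter([
    {"sleep-1": "RUN", "sleep-2": "SCHED"},
    {"sleep-1": "COMPLETED", "sleep-2": "RUN"},
    {"sleep-1": "COMPLETED", "sleep-2": "COMPLETED"},
])
waits = []
final = wait_for_jobs(lambda: next(responses), sleep=waits.append)
```

Passing `sleep` as a parameter keeps the default 30 second interval while letting the example record the waits instead of blocking.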
View full output of submit command
$ flux-cloud --debug submit --cloud minikube
No experiment ID provided, assuming first experiment k8s-size-4-n1-standard-1.
Job experiments file generated 1 MiniCluster(s).
🌀 Bringing up MiniCluster of size 2 with image ghcr.io/rse-ops/lammps:flux-sched-focal-v0.24.0
All pods are in states "Running" or "Completed"
💾 Creating output directory /home/vanessa/Desktop/Code/flux/flux-cloud/examples/up-submit-down/data/minikube
MiniCluster created with credentials:
FLUX_USER=fluxuser
FLUX_TOKEN=d467215d-d07d-4c32-b2b9-41643cda3d7d
All pods are in states "Running" or "Completed"
Found broker pod lammps-job-0-ng8pz
Waiting for http://lammps-job-0-ng8pz.pod.flux-operator.kubernetes:5000 to be ready
🪅️ RestFUL API server is ready!
.
Port forward opened to http://lammps-job-0-ng8pz.pod.flux-operator.kubernetes:5000
Submitting reaxc-hns-1-minicluster-size-2: lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
Submitting reaxc-hns-2-minicluster-size-2: lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
Submitting reaxc-hns-3-minicluster-size-2: lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
Submitting reaxc-hns-4-minicluster-size-2: lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
Submitting reaxc-hns-5-minicluster-size-2: lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
Submitting sleep-1-minicluster-size-2: sleep 5
Submitting sleep-2-minicluster-size-2: sleep 5
Submitting sleep-3-minicluster-size-2: sleep 5
Submitting sleep-4-minicluster-size-2: sleep 5
Submitting sleep-5-minicluster-size-2: sleep 5
Submitting hello-world-1-minicluster-size-2: echo hello world
Submitting hello-world-2-minicluster-size-2: echo hello world
Submitting hello-world-3-minicluster-size-2: echo hello world
Submitting hello-world-4-minicluster-size-2: echo hello world
Submitting hello-world-5-minicluster-size-2: echo hello world
Submit 15 jobs! Waiting for completion...
15 are active.
lmp is in state RUN
lmp is in state RUN
lmp is in state SCHED
lmp is in state SCHED
lmp is in state SCHED
sleep is in state SCHED
sleep is in state SCHED
sleep is in state SCHED
sleep is in state SCHED
sleep is in state SCHED
echo is in state SCHED
echo is in state SCHED
echo is in state SCHED
echo is in state SCHED
echo is in state SCHED
15 are active.
lmp is finished COMPLETED in 28.64 seconds.
lmp is finished COMPLETED in 29.1 seconds.
lmp is in state RUN
lmp is in state RUN
lmp is in state SCHED
sleep is in state SCHED
sleep is in state SCHED
sleep is in state SCHED
sleep is in state SCHED
sleep is in state SCHED
echo is in state SCHED
echo is in state SCHED
echo is in state SCHED
echo is in state SCHED
echo is in state SCHED
13 are active.
lmp is in state RUN
lmp is in state RUN
lmp is in state SCHED
sleep is in state SCHED
sleep is in state SCHED
sleep is in state SCHED
sleep is in state SCHED
sleep is in state SCHED
echo is in state SCHED
echo is in state SCHED
echo is in state SCHED
echo is in state SCHED
echo is in state SCHED
13 are active.
lmp is finished COMPLETED in 36.56 seconds.
lmp is finished COMPLETED in 35.89 seconds.
lmp is in state RUN
sleep is finished COMPLETED in 5.02 seconds.
sleep is finished COMPLETED in 5.02 seconds.
sleep is finished COMPLETED in 5.02 seconds.
sleep is in state RUN
sleep is in state SCHED
echo is in state SCHED
echo is in state SCHED
echo is in state SCHED
echo is in state SCHED
echo is in state SCHED
8 are active.
lmp is finished COMPLETED in 24.6 seconds.
sleep is finished COMPLETED in 5.02 seconds.
sleep is finished COMPLETED in 5.02 seconds.
echo is finished COMPLETED in 0.01 seconds.
echo is finished COMPLETED in 0.02 seconds.
echo is finished COMPLETED in 0.02 seconds.
echo is finished COMPLETED in 0.01 seconds.
echo is finished COMPLETED in 0.01 seconds.
All jobs are complete! Cleaning up MiniCluster...
All pods are terminated.
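The status lines in the output above follow a regular pattern, so per-command runtimes can be pulled out with a short script. This is a minimal sketch, assuming the line format stays exactly as printed here and that each finished job is reported once; `summarize` is a hypothetical helper and not part of flux-cloud.

```python
import re

# Matches lines such as "lmp is finished COMPLETED in 28.64 seconds."
FINISHED = re.compile(r"^(\S+) is finished COMPLETED in ([\d.]+) seconds\.$")


def summarize(lines):
    """Return {command: [durations in seconds]} for finished job lines."""
    durations = {}
    for line in lines:
        match = FINISHED.match(line.strip())
        if match:
            command, seconds = match.group(1), float(match.group(2))
            durations.setdefault(command, []).append(seconds)
    return durations


sample = [
    "lmp is finished COMPLETED in 28.64 seconds.",
    "lmp is in state RUN",
    "sleep is finished COMPLETED in 5.02 seconds.",
    "echo is finished COMPLETED in 0.01 seconds.",
]
summary = summarize(sample)
print(summary)
```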
After submit, you will still have an organized output directory with job output files and metadata.
$ tree -a data/minikube/
data/minikube/
└── k8s-size-4-n1-standard-1
    ├── hello-world-1-minicluster-size-2
    │   └── log.out
    ├── hello-world-2-minicluster-size-2
    │   └── log.out
    ├── hello-world-3-minicluster-size-2
    │   └── log.out
    ├── hello-world-4-minicluster-size-2
    │   └── log.out
    ├── hello-world-5-minicluster-size-2
    │   └── log.out
    ├── meta.json
    ├── reaxc-hns-1-minicluster-size-2
    │   └── log.out
    ├── reaxc-hns-2-minicluster-size-2
    │   └── log.out
    ├── reaxc-hns-3-minicluster-size-2
    │   └── log.out
    ├── reaxc-hns-4-minicluster-size-2
    │   └── log.out
    ├── reaxc-hns-5-minicluster-size-2
    │   └── log.out
    └── .scripts
        └── minicluster-size-2-lammps-job-ghcr.io-rse-ops-lammps-flux-sched-focal-v0.24.0.json
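Given the layout above (experiment directory, then one directory per job with a log.out inside), the logs can be gathered with a few lines of Python. This is a minimal sketch: `collect_logs` is a hypothetical helper, and the demo builds a throwaway directory with stand-in log content rather than reading real flux-cloud output.

```python
import tempfile
from pathlib import Path


def collect_logs(root):
    """Map each job directory name to the text of its log.out file.

    Expects the layout <root>/<experiment>/<job>/log.out.
    """
    logs = {}
    for log_file in sorted(Path(root).glob("*/*/log.out")):
        logs[log_file.parent.name] = log_file.read_text()
    return logs


# Demo against a temporary directory that mimics the tree above.
with tempfile.TemporaryDirectory() as tmp:
    job_dir = Path(tmp) / "k8s-size-4-n1-standard-1" / "hello-world-1-minicluster-size-2"
    job_dir.mkdir(parents=True)
    (job_dir / "log.out").write_text("hello world\n")  # stand-in content
    logs = collect_logs(tmp)

print(logs)
```

Against a real run, the call would be `collect_logs("data/minikube")` from the directory where flux-cloud was executed.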