Repeatable notebook example (2020-02)
Batsim: Impact of smart pointers on Batsim's memory usage
This notebook is an example of a repeatable experiment from one of Batsim's documentation tutorials.
Simulation instances preparation
Here, we want to run two simulations with the same inputs but with two different Batsim versions. This can be done by executing the prepare-instances.bash script in its dedicated environment:
nix-shell env-check-memuse-improvement.nix -A input_preparation_env --command './prepare-instances.bash'
This creates the following files:
tree ./expe
## ./expe
## ├── new
## ├── new.yaml
## ├── old
## └── old.yaml
##
## 2 directories, 2 files
Getting simulation inputs
Here we will simulate the old KTH SP2 workload from the Parallel Workloads Archive. The generate-workload.R script downloads the raw logs, extracts a month in the middle of the trace, then generates a Batsim workload from it. It is called in the input preparation environment:
nix-shell env-check-memuse-improvement.nix -A input_preparation_env --command '(cd ./expe && ../generate-workload.R)'
We will use a platform with enough resources from the Batsim repository. Platform characteristics do not matter much here, as we use delay profiles, which are not sensitive to the job execution context.
nix-shell env-check-memuse-improvement.nix -A input_preparation_env --command 'curl -k -o ./expe/cluster.xml https://framagit.org/batsim/batsim/raw/346e0de311c10270d9846d8ea418096afff32305/platforms/cluster512.xml'
##   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
##                                  Dload  Upload   Total   Spent    Left  Speed
##   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
## 100   802  100   802    0     0   5241      0 --:--:-- --:--:-- --:--:--  5241
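The URL above pins a commit, but the -k option tells curl to skip TLS certificate verification. One way to still make sure that the downloaded platform is the expected one is to compare its SHA-256 digest with a value recorded beforehand from a trusted copy. The following is a minimal sketch: the helper name verify_sha256 is hypothetical, sha256sum from GNU coreutils is assumed to be available in the environment, and no reference digest is given here because the original notebook does not record one.

```shell
# Hypothetical helper (not part of the original notebook).
# Usage: verify_sha256 FILE EXPECTED_DIGEST
# Returns 0 if the SHA-256 digest of FILE equals EXPECTED_DIGEST, 1 otherwise.
verify_sha256() {
    # sha256sum prints "<digest>  <file>"; keep the digest only.
    actual=$(sha256sum "$1" | awk '{ print $1 }')
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 (got $actual)" >&2
        return 1
    fi
}

# Example call, with a digest that you recorded yourself:
# verify_sha256 ./expe/cluster.xml "<digest recorded from a trusted copy>"
```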
Running simulations
This is done by executing robin on the instance files in their dedicated environments. Please note that separating these two environments is mandatory, as a different Batsim version is defined in each environment.
nix-shell env-check-memuse-improvement.nix -A simulation_env --command 'robin ./expe/new.yaml'
nix-shell env-check-memuse-improvement.nix -A simulation_old_env --command 'robin ./expe/old.yaml'
## time="2021-05-17 23:31:18.171" level=info msg="Waiting for valid context" batsim command="valgrind --tool=massif --time-unit=ms --massif-out-file='/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/massif.out' batsim -p '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/cluster.xml' -w '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/kth_month.json' --mmax-workload -e '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/out'" extracted socket endpoint="tcp://localhost:28000" ready timeout (seconds)=10
## time="2021-05-17 23:31:18.188" level=info msg="Starting simulation" batsim cmdfile=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/cmd/batsim.bash batsim command="valgrind --tool=massif --time-unit=ms --massif-out-file='/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/massif.out' batsim -p '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/cluster.xml' -w '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/kth_month.json' --mmax-workload -e '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/out'" batsim logfile=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/log/batsim.log scheduler cmdfile=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/cmd/sched.bash scheduler command="batsched -v easy_bf_fast" scheduler logfile (err)=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/log/sched.err.log scheduler logfile (out)=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/log/sched.out.log simulation timeout (seconds)=604800
## time="2021-05-17 23:32:31.276" level=info msg="Simulation subprocess succeeded" command="batsched -v easy_bf_fast" command file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/cmd/sched.bash process name=Scheduler stderr file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/log/sched.err.log stdout file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/log/sched.out.log
## time="2021-05-17 23:32:31.276" level=info msg="The second process might be killed soon..." potential victim name=Batsim success timeout (seconds)=3600
## time="2021-05-17 23:32:31.447" level=info msg="Simulation subprocess succeeded" command="valgrind --tool=massif --time-unit=ms --massif-out-file='/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/massif.out' batsim -p '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/cluster.xml' -w '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/kth_month.json' --mmax-workload -e '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/out'" command file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/cmd/batsim.bash process name=Batsim stderr file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/log/batsim.log stdout file=/dev/null
## time="2021-05-17 23:32:33.543" level=info msg="Waiting for valid context" batsim command="valgrind --tool=massif --time-unit=ms --massif-out-file='/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/massif.out' batsim -p '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/cluster.xml' -w '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/kth_month.json' --mmax-workload -e '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/out'" extracted socket endpoint="tcp://localhost:28000" ready timeout (seconds)=10
## time="2021-05-17 23:32:33.551" level=info msg="Starting simulation" batsim cmdfile=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/cmd/batsim.bash batsim command="valgrind --tool=massif --time-unit=ms --massif-out-file='/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/massif.out' batsim -p '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/cluster.xml' -w '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/kth_month.json' --mmax-workload -e '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/out'" batsim logfile=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/log/batsim.log scheduler cmdfile=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/cmd/sched.bash scheduler command="batsched -v easy_bf_fast" scheduler logfile (err)=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/log/sched.err.log scheduler logfile (out)=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/log/sched.out.log simulation timeout (seconds)=604800
## time="2021-05-17 23:34:34.477" level=info msg="Simulation subprocess succeeded" command="batsched -v easy_bf_fast" command file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/cmd/sched.bash process name=Scheduler stderr file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/log/sched.err.log stdout file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/log/sched.out.log
## time="2021-05-17 23:34:34.477" level=info msg="The second process might be killed soon..." potential victim name=Batsim success timeout (seconds)=3600
## time="2021-05-17 23:34:35.454" level=info msg="Simulation subprocess succeeded" command="valgrind --tool=massif --time-unit=ms --massif-out-file='/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/massif.out' batsim -p '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/cluster.xml' -w '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/kth_month.json' --mmax-workload -e '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/out'" command file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/cmd/batsim.bash process name=Batsim stderr file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/log/batsim.log stdout file=/dev/null
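The timestamps in the robin output above give a rough measure of how long each run took in real time. As a sketch, the helper below reads such log lines on standard input and prints the number of seconds between the "Starting simulation" message and the last "Simulation subprocess succeeded" message. The function name robin_elapsed_seconds is hypothetical, and the sketch assumes that the robin output of one run was saved to a file and that both timestamps fall on the same day.

```shell
# Hypothetical helper (not part of robin): elapsed real time of one run,
# computed from robin log lines read on standard input.
robin_elapsed_seconds() {
    awk '
        # Convert the time="YYYY-MM-DD HH:MM:SS.mmm" field of a line to
        # seconds since midnight.
        function seconds(line,    i, stamp, t) {
            i = index(line, "time=\"")
            if (i == 0) return -1
            # Skip the 6 characters of time=" and the 11 of "YYYY-MM-DD ".
            stamp = substr(line, i + 17, 12)
            split(stamp, t, ":")
            return t[1] * 3600 + t[2] * 60 + t[3]
        }
        /msg="Starting simulation"/             { start  = seconds($0) }
        /msg="Simulation subprocess succeeded"/ { finish = seconds($0) }
        END { printf "%.3f\n", finish - start }'
}

# Example call, assuming the output of one robin run was saved to new.robin.log:
# robin_elapsed_seconds < new.robin.log
```

Both runs execute Batsim under Valgrind's massif tool, so these durations are useful for comparing the two versions with each other rather than as absolute performance figures.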
Analyzing results
First, we can visually check that the simulation results are similar.
library(tidyverse)
library(viridis)
theme_set(theme_bw())
# batsim-generated summaries
old_schedule = read_csv('./expe/old/out_schedule.csv') %>% mutate(instance='old')
new_schedule = read_csv('./expe/new/out_schedule.csv') %>% mutate(instance='new')
schedules = bind_rows(old_schedule, new_schedule)
schedules %>% as_tibble() %>% rmarkdown::paged_table()
# jobs data
old_jobs = read_csv('./expe/old/out_jobs.csv') %>% mutate(instance='old')
new_jobs = read_csv('./expe/new/out_jobs.csv') %>% mutate(instance='new')
jobs = bind_rows(old_jobs, new_jobs) %>% mutate(color_id=job_id%%5)
jobs_plottable = jobs %>%
    mutate(starting_time = starting_time / (60*60*24),
           finish_time = finish_time / (60*60*24)) %>%
    separate_rows(allocated_resources, sep=" ") %>%
    separate(allocated_resources, into = c("psetmin", "psetmax"), fill="right") %>%
    mutate(psetmax = as.integer(psetmax), psetmin = as.integer(psetmin)) %>%
    mutate(psetmax = ifelse(is.na(psetmax), psetmin, psetmax))
gantts = jobs_plottable %>%
    ggplot(aes(xmin=starting_time,
               ymin=psetmin,
               ymax=psetmax + 0.9,
               xmax=finish_time,
               fill=color_id)) +
    geom_rect(alpha=0.9, color="black", size=0.1, show.legend = FALSE) +
    scale_fill_viridis() +
    facet_wrap(~instance, ncol=1) +
    labs(x='Simulation time (day)', y="Resources")
# Save with a separate call: recent ggplot2 releases reject `plot + ggsave(...)`.
ggsave('./gantts.pdf', plot=gantts, width=15, height=9)
gantts
Aggregated metrics are the same and the Gantt charts look similar.
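The visual check can be complemented by a textual one. The sketch below drops a configurable set of columns from both schedule files and diffs what remains. The helper names drop_columns and compare_csv_ignoring are hypothetical, and so are the column names in the example call: columns that record real time or the Batsim version are expected to differ between the two runs, so the list should be adapted to the header of your own files. The sketch also assumes plain CSV without quoted fields that contain commas.

```shell
# Hypothetical helper: print CSV file $1 without the columns whose names
# appear in the comma-separated list $2.
drop_columns() {
    awk -F',' -v OFS=',' -v drop="$2" '
        BEGIN { n = split(drop, names, ","); for (i = 1; i <= n; i++) skip[names[i]] = 1 }
        # The header line decides which column positions are kept.
        NR == 1 { for (i = 1; i <= NF; i++) keep[i] = !($i in skip) }
        {
            line = ""; first = 1
            for (i = 1; i <= NF; i++) {
                if (keep[i]) {
                    line = (first ? $i : line OFS $i)
                    first = 0
                }
            }
            print line
        }' "$1"
}

# Hypothetical helper: diff CSV files $1 and $2 after dropping the columns
# listed in $3. Returns the exit status of diff (0 when identical).
compare_csv_ignoring() {
    cmp_a=$(mktemp) && cmp_b=$(mktemp) || return 2
    drop_columns "$1" "$3" > "$cmp_a"
    drop_columns "$2" "$3" > "$cmp_b"
    diff "$cmp_a" "$cmp_b"
    cmp_status=$?
    rm -f "$cmp_a" "$cmp_b"
    return "$cmp_status"
}

# Example call (the ignored column names are assumptions):
# compare_csv_ignoring ./expe/old/out_schedule.csv ./expe/new/out_schedule.csv \
#     'batsim_version,scheduling_time,simulation_time'
```

An empty diff means that the remaining aggregated metrics are identical in both runs.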
Let us now take a look at Batsim's memory footprint over time for both runs.
massif-to-csv ./expe/old/massif.out{,.csv}
massif-to-csv ./expe/new/massif.out{,.csv}
old_massif = read_csv('./expe/old/massif.out.csv') %>% mutate(instance='old')
new_massif = read_csv('./expe/new/massif.out.csv') %>% mutate(instance='new')
massif = bind_rows(old_massif, new_massif) %>% mutate(
    total=(stack+heap+heap_extra) / 1e6,
    time=time/1e3)
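memuse = massif %>%
    ggplot(aes(x=time, y=total)) +
    geom_step() +
    facet_wrap(~instance, ncol=1) +
    labs(x='Real time (s)', y="Batsim process's memory consumption (MB)")
# Save with a separate call: recent ggplot2 releases reject `plot + ggsave(...)`.
ggsave('./memuse_over_time.png', plot=memuse, width=15, height=9)
memuse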
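The memory usage pattern did not change much, but overall performance improved a lot: according to the robin timestamps above, Batsim (run under Valgrind's massif tool) finished in about 73 s with the new version, against about 122 s with the old one.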
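To put a number on the memory side of the comparison, the peak footprint of each run can be extracted from the CSV files produced by massif-to-csv. The sketch below sums the stack, heap and heap_extra columns, as the R code above does, and prints the maximum in MB. The helper name massif_peak_mb is hypothetical, and the columns are looked up by header name, so no assumption is made on their order.

```shell
# Hypothetical helper: print the peak of stack+heap+heap_extra, in MB,
# for the massif CSV file given as $1.
massif_peak_mb() {
    awk -F',' '
        NR == 1 {
            # Map each header name to its column position.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("stack" in col) || !("heap" in col) || !("heap_extra" in col)) {
                print "missing stack, heap or heap_extra column" > "/dev/stderr"
                bad = 1
                exit 1
            }
            next
        }
        {
            total = $(col["stack"]) + $(col["heap"]) + $(col["heap_extra"])
            if (total > peak) peak = total
        }
        # Values are in bytes; divide by 1e6 as in the R code above.
        END { if (!bad) printf "%.2f\n", peak / 1e6 }' "$1"
}

# Example calls on the files generated above:
# massif_peak_mb ./expe/old/massif.out.csv
# massif_peak_mb ./expe/new/massif.out.csv
```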