Repeatable notebook example (2020-02)


Batsim: Impact of smart pointers on Batsim’s memory usage

This notebook is an example of a repeatable experiment, taken from one of the tutorials in Batsim’s documentation.

Simulation instances preparation

Here, we want to run two simulations with the same inputs but with different Batsim versions. This can be done by executing the prepare-instances.bash script in its dedicated environment:

nix-shell env-check-memuse-improvement.nix -A input_preparation_env --command './prepare-instances.bash'

This creates the following files:

tree ./expe
## ./expe
## ├── new
## ├── new.yaml
## ├── old
## └── old.yaml
## 
## 2 directories, 2 files

Getting simulation inputs

Here we will simulate the old KTH SP2 workload from the Parallel Workloads Archive. The generate-workload.R script downloads the raw logs, extracts a month in the middle of the trace, then generates a Batsim workload from it. It is called in the input preparation environment:

nix-shell env-check-memuse-improvement.nix -A input_preparation_env --command '(cd ./expe && ../generate-workload.R)'

We will use a platform with enough resources from the Batsim repository. Platform characteristics do not matter much here, as we use delay profiles, which are not sensitive to the job execution context.

nix-shell env-check-memuse-improvement.nix -A input_preparation_env --command 'curl -k -o ./expe/cluster.xml https://framagit.org/batsim/batsim/raw/346e0de311c10270d9846d8ea418096afff32305/platforms/cluster512.xml'
##   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
##                                  Dload  Upload   Total   Spent    Left  Speed
##   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
## 100   802  100   802    0     0   5241      0 --:--:-- --:--:-- --:--:--  5241

Running simulations

This is done by executing robin on each instance file in its dedicated environment. Please note that keeping these two environments separate is mandatory, as each one defines a different Batsim version.

nix-shell env-check-memuse-improvement.nix -A simulation_env --command 'robin ./expe/new.yaml'
nix-shell env-check-memuse-improvement.nix -A simulation_old_env --command 'robin ./expe/old.yaml'
## time="2021-05-17 23:31:18.171" level=info msg="Waiting for valid context" batsim command="valgrind --tool=massif --time-unit=ms --massif-out-file='/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/massif.out' batsim -p '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/cluster.xml' -w '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/kth_month.json' --mmax-workload -e '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/out'" extracted socket endpoint="tcp://localhost:28000" ready timeout (seconds)=10
## time="2021-05-17 23:31:18.188" level=info msg="Starting simulation" batsim cmdfile=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/cmd/batsim.bash batsim command="valgrind --tool=massif --time-unit=ms --massif-out-file='/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/massif.out' batsim -p '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/cluster.xml' -w '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/kth_month.json' --mmax-workload -e '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/out'" batsim logfile=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/log/batsim.log scheduler cmdfile=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/cmd/sched.bash scheduler command="batsched -v easy_bf_fast" scheduler logfile (err)=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/log/sched.err.log scheduler logfile (out)=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/log/sched.out.log simulation timeout (seconds)=604800
## time="2021-05-17 23:32:31.276" level=info msg="Simulation subprocess succeeded" command="batsched -v easy_bf_fast" command file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/cmd/sched.bash process name=Scheduler stderr file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/log/sched.err.log stdout file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/log/sched.out.log
## time="2021-05-17 23:32:31.276" level=info msg="The second process might be killed soon..." potential victim name=Batsim success timeout (seconds)=3600
## time="2021-05-17 23:32:31.447" level=info msg="Simulation subprocess succeeded" command="valgrind --tool=massif --time-unit=ms --massif-out-file='/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/massif.out' batsim -p '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/cluster.xml' -w '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/kth_month.json' --mmax-workload -e '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/out'" command file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/cmd/batsim.bash process name=Batsim stderr file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/new/log/batsim.log stdout file=/dev/null
## time="2021-05-17 23:32:33.543" level=info msg="Waiting for valid context" batsim command="valgrind --tool=massif --time-unit=ms --massif-out-file='/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/massif.out' batsim -p '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/cluster.xml' -w '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/kth_month.json' --mmax-workload -e '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/out'" extracted socket endpoint="tcp://localhost:28000" ready timeout (seconds)=10
## time="2021-05-17 23:32:33.551" level=info msg="Starting simulation" batsim cmdfile=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/cmd/batsim.bash batsim command="valgrind --tool=massif --time-unit=ms --massif-out-file='/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/massif.out' batsim -p '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/cluster.xml' -w '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/kth_month.json' --mmax-workload -e '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/out'" batsim logfile=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/log/batsim.log scheduler cmdfile=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/cmd/sched.bash scheduler command="batsched -v easy_bf_fast" scheduler logfile (err)=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/log/sched.err.log scheduler logfile (out)=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/log/sched.out.log simulation timeout (seconds)=604800
## time="2021-05-17 23:34:34.477" level=info msg="Simulation subprocess succeeded" command="batsched -v easy_bf_fast" command file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/cmd/sched.bash process name=Scheduler stderr file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/log/sched.err.log stdout file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/log/sched.out.log
## time="2021-05-17 23:34:34.477" level=info msg="The second process might be killed soon..." potential victim name=Batsim success timeout (seconds)=3600
## time="2021-05-17 23:34:35.454" level=info msg="Simulation subprocess succeeded" command="valgrind --tool=massif --time-unit=ms --massif-out-file='/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/massif.out' batsim -p '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/cluster.xml' -w '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/kth_month.json' --mmax-workload -e '/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/out'" command file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/cmd/batsim.bash process name=Batsim stderr file=/home/carni/proj/batsim/docs/tuto-reproducible-experiment/expe/old/log/batsim.log stdout file=/dev/null
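
The wall-clock duration of each run can be read off the robin log lines above. The following is a minimal sketch, not part of the original notebook: `robin_elapsed` is a hypothetical helper that assumes the log format shown above (a `time="YYYY-MM-DD HH:MM:SS.mmm"` field, a "Starting simulation" message and a "Simulation subprocess succeeded" message for the Batsim process) and that a run does not cross midnight.

```shell
# Hypothetical helper: reads robin log lines on stdin and prints, for each run,
# the seconds elapsed between "Starting simulation" and the end of the Batsim process.
# Assumption: both timestamps fall on the same day.
robin_elapsed() {
    awk '
        # Convert the time="... HH:MM:SS.mmm" field of a log line to seconds since midnight.
        function secs(line,   stamp, parts, hms) {
            if (!match(line, /time="[^"]*"/)) return -1
            # Drop the leading time=" (6 characters) and the closing quote.
            stamp = substr(line, RSTART + 6, RLENGTH - 7)
            split(stamp, parts, " ")
            split(parts[2], hms, ":")
            return hms[1] * 3600 + hms[2] * 60 + hms[3]
        }
        /msg="Starting simulation"/ { start = secs($0); started = 1 }
        started && /msg="Simulation subprocess succeeded"/ && /process name=Batsim/ {
            printf "%.3f\n", secs($0) - start
            started = 0
        }
    '
}
```

For instance, if the output of the two robin commands has been saved to a file (here a hypothetical `robin.log`), `robin_elapsed < robin.log` prints one duration per simulation.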

Analyzing results

First, we can visually check that the simulation results are similar.

library(tidyverse)
library(viridis)
theme_set(theme_bw())

# batsim-generated summaries
old_schedule = read_csv('./expe/old/out_schedule.csv') %>% mutate(instance='old')
new_schedule = read_csv('./expe/new/out_schedule.csv') %>% mutate(instance='new')
schedules = bind_rows(old_schedule, new_schedule)
schedules %>% as_tibble() %>% rmarkdown::paged_table()

# jobs data
old_jobs = read_csv('./expe/old/out_jobs.csv') %>% mutate(instance='old')
new_jobs = read_csv('./expe/new/out_jobs.csv') %>% mutate(instance='new')
jobs = bind_rows(old_jobs, new_jobs) %>% mutate(color_id=job_id%%5)

jobs_plottable = jobs %>%
    mutate(starting_time = starting_time / (60*60*24),
           finish_time = finish_time / (60*60*24)) %>%
    separate_rows(allocated_resources, sep=" ") %>%
    separate(allocated_resources, into = c("psetmin", "psetmax"), fill="right") %>%
    mutate(psetmax = as.integer(psetmax), psetmin = as.integer(psetmin)) %>%
    mutate(psetmax = ifelse(is.na(psetmax), psetmin, psetmax))

gantts = jobs_plottable %>%
    ggplot(aes(xmin=starting_time,
               ymin=psetmin,
               ymax=psetmax + 0.9,
               xmax=finish_time,
               fill=color_id)) +
    geom_rect(alpha=0.9, color="black", size=0.1, show.legend = FALSE) +
    scale_fill_viridis() +
    facet_wrap(~instance, ncol=1) +
    labs(x='Simulation time (day)', y="Resources")
# Save explicitly: adding ggsave() to a plot with `+` fails on recent ggplot2 versions.
ggsave('./gantts.pdf', plot=gantts, width=15, height=9)
gantts

Aggregated metrics are the same and the Gantt charts look similar.
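
The equality of the aggregated metrics can also be checked without reading the table by eye. The sketch below is an illustration, not part of the original notebook: `csv_columns` and `same_metrics` are hypothetical helpers, and the column names passed to them (for example `makespan` or `mean_waiting_time`) are assumptions about the header of `out_schedule.csv` that should be checked against the actual file. Only selected columns are compared, because fields such as the Batsim version or wall-clock timings are expected to differ between the two runs. The helpers assume a simple CSV without quoted commas.

```shell
# Hypothetical helper: print the given columns (looked up by header name) of a CSV file.
csv_columns() {
    file="$1"; shift
    cols="$*"
    awk -F',' -v cols="$cols" '
        BEGIN { n = split(cols, wanted, " ") }
        NR == 1 {
            # Map each header name to its column index, then validate the request.
            for (i = 1; i <= NF; i++) idx[$i] = i
            for (j = 1; j <= n; j++) {
                if (!(wanted[j] in idx)) {
                    printf "missing column: %s\n", wanted[j] > "/dev/stderr"
                    exit 2
                }
            }
        }
        {
            line = ""
            for (j = 1; j <= n; j++) line = line (j > 1 ? "," : "") $(idx[wanted[j]])
            print line
        }
    ' "$file"
}

# Hypothetical helper: succeed only if both files hold identical values in the given columns.
same_metrics() {
    a="$1"; b="$2"; shift 2
    left=$(csv_columns "$a" "$@") || return 2
    right=$(csv_columns "$b" "$@") || return 2
    [ "$left" = "$right" ]
}
```

A possible call, under the column-name assumption above, is `same_metrics ./expe/old/out_schedule.csv ./expe/new/out_schedule.csv makespan mean_waiting_time && echo identical`.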

Let us now look at Batsim’s memory footprint over time for both runs.

massif-to-csv ./expe/old/massif.out{,.csv}
massif-to-csv ./expe/new/massif.out{,.csv}

old_massif = read_csv('./expe/old/massif.out.csv') %>% mutate(instance='old')
new_massif = read_csv('./expe/new/massif.out.csv') %>% mutate(instance='new')
massif = bind_rows(old_massif, new_massif) %>% mutate(
    total=(stack+heap+heap_extra) / 1e6,
    time=time/1e3)

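memuse = massif %>%
    ggplot(aes(x=time, y=total)) +
    geom_step() +
    facet_wrap(~instance, ncol=1) +
    labs(x='Real time (s)', y="Batsim process's memory consumption (MB)")
# Save explicitly: adding ggsave() to a plot with `+` fails on recent ggplot2 versions.
ggsave('./memuse_over_time.png', plot=memuse, width=15, height=9)
memuse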

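The memory usage pattern did not change much, but the overall run time improved markedly: according to the robin logs above, Batsim (running under Valgrind's massif tool) completed in about 73 s with the new version, versus about 122 s with the old one.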
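
To put numbers next to the plot, the peak footprint and the profiled duration of each run can be extracted from the CSV files. This is a minimal sketch rather than part of the original notebook: `massif_summary` is a hypothetical helper, and it assumes the CSV header contains the `time`, `stack`, `heap` and `heap_extra` columns used in the R code above, with sizes in bytes and time in milliseconds (as set by `--time-unit=ms`).

```shell
# Hypothetical helper: print the peak total memory (MB) and the last timestamp (s)
# found in a CSV file produced by massif-to-csv.
massif_summary() {
    awk -F',' '
        # Header line: map column names to indices so that column order does not matter.
        NR == 1 { for (i = 1; i <= NF; i++) idx[$i] = i; next }
        {
            # Same total as in the R code: stack + heap + heap_extra, bytes to MB.
            total = ($(idx["stack"]) + $(idx["heap"]) + $(idx["heap_extra"])) / 1e6
            if (total > peak) peak = total
            # Milliseconds to seconds; the last snapshot gives the profiled duration.
            last = $(idx["time"]) / 1e3
        }
        END { printf "peak_mb=%.2f duration_s=%.2f\n", peak, last }
    ' "$1"
}
```

Running `massif_summary ./expe/old/massif.out.csv` and `massif_summary ./expe/new/massif.out.csv` then gives one summary line per instance to compare.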