Reset garbage SLUID in registration messages from older slurmd versions

When slurmd is preparing a slurm_node_registration_status_msg_t, it
creates a list of all slurm_step_id_t's that it detects on the node.
A slurmd running 25.05 or older sends uninitialized garbage data in
each slurm_step_id_t as part of slurm_node_registration_status_msg_t.
slurmctld interprets this as an invalid sluid and aborts the job.

This commit resets the sluid to 0 when unpacking these messages. Other
logic in slurmctld already recognizes a sluid of 0 as a job from an
older protocol version.
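The idea can be sketched in isolation as follows. This is only an
illustration, not the Slurm code: demo_step_id_t and demo_reset_sluids()
are hypothetical stand-ins, and the 64-bit width of sluid is an
assumption. The point is that a field the sender never initialized is
overwritten with the "not set" sentinel right after unpacking, while the
fields the sender did fill in are left untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for slurm_step_id_t, reduced to the fields this
 * sketch needs. A sluid of 0 is assumed to mean "no SLUID, job comes
 * from an older protocol version".
 */
typedef struct {
	uint64_t sluid;
	uint32_t job_id;
	uint32_t step_id;
} demo_step_id_t;

/*
 * Hypothetical helper: discard whatever an older sender left in sluid
 * for every unpacked step id. job_id and step_id are not modified.
 */
static void demo_reset_sluids(demo_step_id_t *steps, size_t count)
{
	assert(steps || !count);

	for (size_t i = 0; i < count; i++)
		steps[i].sluid = 0;
}
```

In this sketch the reset is applied unconditionally, mirroring the
change below, which zeroes the field in the loop that unpacks each
step id.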
Ticket: 24312
Changelog: Fix jobs running on slurmd version 25.05.x or older being
aborted when slurmd re-registers with slurmctld.
diff --git a/src/common/slurm_protocol_pack.c b/src/common/slurm_protocol_pack.c
index 7186d62..b0441e2 100644
--- a/src/common/slurm_protocol_pack.c
+++ b/src/common/slurm_protocol_pack.c
@@ -1135,10 +1135,12 @@
goto unpack_error;
safe_xcalloc(node_reg_ptr->step_id, node_reg_ptr->job_count,
sizeof(*node_reg_ptr->step_id));
- for (i = 0; i < node_reg_ptr->job_count; i++)
+ for (i = 0; i < node_reg_ptr->job_count; i++) {
safe_unpack_step_id_members(&node_reg_ptr->step_id[i],
buffer,
smsg->protocol_version);
+ node_reg_ptr->step_id[i].sluid = 0;
+ }
safe_unpack16(&node_reg_ptr->flags, buffer);
@@ -1193,10 +1195,12 @@
goto unpack_error;
safe_xcalloc(node_reg_ptr->step_id, node_reg_ptr->job_count,
sizeof(*node_reg_ptr->step_id));
- for (i = 0; i < node_reg_ptr->job_count; i++)
+ for (i = 0; i < node_reg_ptr->job_count; i++) {
safe_unpack_step_id_members(&node_reg_ptr->step_id[i],
buffer,
smsg->protocol_version);
+ node_reg_ptr->step_id[i].sluid = 0;
+ }
safe_unpack16(&node_reg_ptr->flags, buffer);