Issue: When a Nomad job that uses the exec driver is run, its allocations fail with a container initialization error. The failure has been observed on specific combinations of RHEL and Nomad versions. This article analyzes the error message, explores potential causes, and provides troubleshooting steps to resolve the issue.
Error description:
Driver Failure: failed to launch command with executor: rpc error: code = Unknown desc = unable to start container process: error during container init: read init-p: connection reset by peer
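The error indicates that the executor could not launch the task's command: the RPC call to the executor returned an error with code Unknown, and the underlying failure occurred during the container's initialization phase. The "connection reset by peer" part most likely refers to the local init pipe (init-p) that the runtime uses to communicate with the container's init process, not to a network connection; it means the init process exited before initialization completed.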
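The message is a chain of wrapped errors joined by ": ", so it can be read from the outermost layer (the Nomad task event) to the innermost cause. As a small sketch, the following shell function prints one layer per line. The name split_error_layers is hypothetical and is not part of Nomad.

```shell
#!/usr/bin/env bash
# split_error_layers is a hypothetical helper for reading the message; it is not a Nomad command.
# It splits a wrapped error message on ": " and prints the layers, outermost first.
split_error_layers() {
  awk -F': ' '{ for (i = 1; i <= NF; i++) print $i }'
}

msg='Driver Failure: failed to launch command with executor: rpc error: code = Unknown desc = unable to start container process: error during container init: read init-p: connection reset by peer'
printf '%s\n' "$msg" | split_error_layers
```

The innermost layers (read init-p, connection reset by peer) describe what failed during container initialization; the outer layers only report that the executor could not launch the command.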
Reproduction steps: Run the job below on each combination of RHEL and Nomad versions under test, and compare the actual behavior with the expected behavior.
Step 1: Create a job file with the name httpd.hcl.
job "httpd" {
  group "httpd" {
    task "httpd" {
      driver = "exec"

      config {
        command = "bash"
        args    = ["-c", "while true; do sleep 500; done"]
      }
    }
  }
}
Step 2: Run the Nomad job with the command below.
nomad job run httpd.hcl
Step 3: Check the status of the Nomad job in the Nomad UI, or with the command below on the CLI.
nomad job status httpd
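To confirm that a failed allocation hit this specific error rather than a different driver failure, the allocation's task events can be searched for the error text. The following is a minimal sketch: has_init_reset_error is a hypothetical helper name, and the usage comment assumes that the nomad CLI is on PATH and that the task events shown by nomad alloc status include the full driver error.

```shell
#!/usr/bin/env bash
# has_init_reset_error is a hypothetical helper, not a Nomad command.
# It reads text on stdin and succeeds (exit status 0) only if the text
# contains the container init error described in this article.
has_init_reset_error() {
  grep -q 'error during container init: read init-p: connection reset by peer'
}

# Usage (assumption: <alloc-id> is taken from the Allocations section of
# "nomad job status httpd"):
#   nomad alloc status <alloc-id> | has_init_reset_error && echo "container init error found"
```

The function only matches text, so it can also be applied to saved CLI output or client logs.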
Job Failure screenshot from UI:
Job successfully running screenshot:
Actual behavior:
Running the job file fails with the exec driver error below.
Error message:
Driver Failure: failed to launch command with executor: rpc error: code = Unknown desc = unable to start container process: error during container init: read init-p: connection reset by peer
Expected behavior:
The job will run without any errors.
Steps to mitigate this issue:
| Use case | Linux flavor | Nomad version | Status | Comments |
|---|---|---|---|---|
| Scenario 1 | RHEL 8 | 1.8.0 | Not Running | Either upgrade to RHEL 9 or to Nomad 1.8.1. The issue occurs due to an exec driver failure. |
| Scenario 2 | RHEL 8 | 1.8.1 | Running | Nomad 1.8.1 contains a fix for a bug where exec driver tasks would fail on older versions of glibc [GH-23331]. Driver: Fixed a bug where the exec, java, and raw_exec drivers would not configure cgroups to allow access to devices provided by device plugins [GH-22518]. |
| Scenario 3 | RHEL 8 | 1.6.8 | Running | The Go version for Nomad 1.6.8 is set to 1.21.6, but the first version of Nomad with the bug is 1.6.9, where it is set to 1.22.1 [GH-20212]. |
| Scenario 4 | RHEL 9 | 1.8.1 | Running | Driver: Fixed a bug where the exec, java, and raw_exec drivers would not configure cgroups to allow access to devices provided by device plugins [GH-22518]. |
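The tested combinations can be captured in a small lookup, which may help when checking a client node against these results. This is only a sketch: tested_status is a hypothetical function name, and it covers just the four scenarios tested here, reporting every other combination as untested instead of guessing.

```shell
#!/usr/bin/env bash
# tested_status is a hypothetical lookup over the four scenarios tested in this article.
# Arguments: RHEL major version (for example 8) and Nomad version (for example 1.8.0).
# Any combination that was not tested is reported as "Untested" rather than guessed.
tested_status() {
  case "$1/$2" in
    8/1.8.0) echo "Not Running" ;;
    8/1.8.1|8/1.6.8|9/1.8.1) echo "Running" ;;
    *) echo "Untested" ;;
  esac
}

tested_status 8 1.8.0   # prints: Not Running
tested_status 9 1.8.1   # prints: Running
```

On a client node, nomad version reports the Nomad version, and the VERSION_ID field in /etc/os-release reports the RHEL release to pass in.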
Conclusion:
The error indicates that the executor failed to launch the task's command because the container process could not start: the connection to the container's init process was reset during the initialization phase. The issue was tested on different versions of RHEL and Nomad, and the results for each scenario are listed above. Customers who encounter this error can refer to those recommendations for a solution.
Reference Documents:
Nomad upgrade document
Nomad Specific version upgrade
RHEL 8 to RHEL 9 upgrade steps