# Running GPU workloads

> Create fixed GPU node groups with NVIDIA-enabled instances and deploy GPU-accelerated machine learning or inference workloads on Porter

This guide walks you through setting up GPU infrastructure on Porter, from
creating a GPU node group to deploying your first GPU-accelerated application.

To deploy GPU-enabled workloads on Porter, create a fixed node group and select
a GPU-enabled instance type. This must be a **second** node group in your
cluster, because the default node group is reserved for CPU workloads.

<Info>
  GPUs are only supported on fixed node groups because cost optimization does not currently support GPU instances. This means you select a specific GPU instance type and Porter scales by adding or removing instances of that exact type.
</Info>

## Creating a GPU Node Group

<Steps>
  <Step title="Navigate to Infrastructure">
    From your Porter dashboard, click on the **Infrastructure** tab in the left sidebar.
  </Step>

  <Step title="Select your cluster">
    Click on **Cluster** to view your cluster configuration and existing node groups.
  </Step>

  <Step title="Add a node group">
    Click **Add an additional node group** to open the node group configuration panel.
  </Step>

  <Step title="Configure the GPU node group">
    Configure your GPU node group with the following settings:

    | Setting           | Description                                       |
    | ----------------- | ------------------------------------------------- |
    | **Instance type** | The GPU-enabled instance type used for every node |
    | **Minimum nodes** | The number of nodes kept available at all times   |
    | **Maximum nodes** | The upper limit for autoscaling based on demand   |

    <img src="https://mintcdn.com/porter/bja7Zm50xP-m5T8X/images/provisioning-infrastructure/cost-opt-1.png?fit=max&auto=format&n=bja7Zm50xP-m5T8X&q=85&s=21d2b7ae56f8e0a2bd899241d8bb1863" alt="Fixed Node Group Configuration" width="3208" height="1220" data-path="images/provisioning-infrastructure/cost-opt-1.png" />

    <Warning>
      GPU instances are significantly more expensive than standard instances. Minimum nodes run at all times, so keep that value as low as your latency requirements allow.
    </Warning>
  </Step>

  <Step title="Save and provision">
    Click **Save** to create the node group. Porter will provision the GPU nodes in your cluster. Provisioning may take a few minutes, because GPU nodes often need additional drivers installed.
  </Step>
</Steps>
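
When choosing the minimum and maximum node settings above, it can help to estimate how many nodes your services will need. The following is a minimal sketch, not part of Porter: the function name and parameters are hypothetical. It assumes that each replica's GPUs must all come from a single node and that nothing else on the node consumes GPUs.

```python
import math


def nodes_needed(replicas: int, gpus_per_replica: int, gpus_per_node: int) -> int:
    """Estimate how many nodes of one instance type are needed.

    Assumption: a replica cannot span nodes, so capacity is counted in
    whole replicas per node rather than by pooling GPUs across nodes.
    """
    if replicas < 0 or gpus_per_replica < 1 or gpus_per_node < 1:
        raise ValueError("replicas must be >= 0 and GPU counts must be >= 1")

    # How many replicas fit on one node without splitting a replica.
    replicas_per_node = gpus_per_node // gpus_per_replica
    if replicas_per_node == 0:
        raise ValueError("one replica requests more GPUs than a single node has")

    return math.ceil(replicas / replicas_per_node)
```

For example, `nodes_needed(replicas=5, gpus_per_replica=1, gpus_per_node=4)` returns `2`. If the result is higher than the node group's maximum nodes, some replicas will not be scheduled until the maximum is raised.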

## Deploying a GPU Application

Once your GPU node group is ready, you can deploy applications that use GPU resources.

<Steps>
  <Step title="Create or select your application">
    Navigate to your application in the Porter dashboard, or create a new one if you haven't already.
  </Step>

  <Step title="Go to the Services tab">
    Click on the **Services** tab to view your application's services.
  </Step>

  <Step title="Select the service">
    Click on the service that needs GPU access (e.g., your inference worker or training job).
  </Step>

  <Step title="Assign to the GPU node group">
    Under **General**, find the **Node group** selector and choose your GPU node group from the dropdown.
  </Step>

  <Step title="Configure GPU resources">
    Under **Resources**, configure your GPU requirements:

    | Setting | Recommended Value                                                   |
    | ------- | ------------------------------------------------------------------- |
    | **GPU** | Number of GPUs needed (typically 1)                                 |
    | **CPU** | Match your workload needs (GPU instances have fixed CPU/RAM ratios) |
    | **RAM** | Match your workload needs                                           |

    <Info>
      Request only the GPUs you need. Each GPU request reserves an entire GPU.
    </Info>
  </Step>

  <Step title="Save and deploy">
    Save your changes and redeploy the application. Porter will schedule your workload on the GPU node group.
  </Step>
</Steps>
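
After deploying, you may want to confirm that the service can actually see a GPU. The sketch below is one way to do that from Python at startup. It assumes the `nvidia-smi` binary is available on the container's `PATH`, which depends on your image and node setup; the helper names `parse_gpu_list` and `visible_gpus` are hypothetical.

```python
import shutil
import subprocess


def parse_gpu_list(csv_text: str) -> list[dict[str, str]]:
    """Parse headerless `name, memory.total` CSV rows into dicts."""
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Split on the last comma so the name may itself contain commas.
        name, _, memory = line.rpartition(",")
        gpus.append({"name": name.strip(), "memory_total": memory.strip()})
    return gpus


def visible_gpus() -> list[dict[str, str]]:
    """Return the GPUs that nvidia-smi reports, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # assumption: a missing binary means no GPU access here
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_list(result.stdout)
```

Calling `visible_gpus()` early in your service and logging the result makes it easier to tell a scheduling problem (no GPU visible) from an application problem. An empty list suggests checking the node group assignment and the GPU resource setting described above.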

## Troubleshooting

<AccordionGroup>
  <Accordion title="Pod stuck in Pending state">
    This usually means no GPU node has capacity for the pod. Check that:

    * The node group has scaled up (check Infrastructure → Cluster)
    * Your GPU request doesn't exceed the GPUs available on the instance type
    * The node group's maximum node count hasn't been reached
  </Accordion>

  <Accordion title="Out of memory errors">
    GPU out-of-memory errors indicate that your model or batch size is too large for the GPU's memory (VRAM):

    * Reduce batch size
    * Use a larger GPU instance with more VRAM
  </Accordion>

  <Accordion title="Slow cold starts">
    GPU nodes take longer to start than CPU nodes due to driver initialization:

    * Keep minimum nodes at 1 for latency-sensitive workloads
    * Consider keeping a warm pool of nodes during peak hours
  </Accordion>
</AccordionGroup>
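
For out-of-memory errors, reducing the batch size can be automated. The following is a minimal sketch of one common approach, halving the batch size until a run succeeds. It is not a Porter feature: `run_batch` is a hypothetical callable that you supply, and `oom_error` should be set to whatever out-of-memory exception your framework raises (the default, `MemoryError`, is only a placeholder).

```python
from typing import Callable, Type


def largest_working_batch_size(
    run_batch: Callable[[int], None],
    start: int,
    oom_error: Type[BaseException] = MemoryError,
) -> int:
    """Halve the batch size until run_batch succeeds, then return that size."""
    if start < 1:
        raise ValueError("start must be >= 1")

    batch_size = start
    while True:
        try:
            run_batch(batch_size)
            return batch_size
        except oom_error:
            if batch_size == 1:
                # Even a single item does not fit; re-raise so the caller
                # can move to an instance type with more VRAM.
                raise
            batch_size //= 2
```

If the function re-raises at a batch size of 1, the model itself does not fit in memory, and a larger GPU instance is the remaining option.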
