Sunday 2 May 2010

Faster Memory Transfers

NVidia provide a mechanism to allocate non-paged ('pinned') memory on the host, which can significantly improve host-to-GPU transfer performance.  But does it help in practice?

The main bottleneck in GPU processing is the PCIe bus, which has a relatively low bandwidth.  For many trivial operations this data transfer overhead dominates the overall execution time, negating any benefit of using the GPU.  For normal host-to-GPU data transfers using the cuMemcpy function, a bandwidth of around 2.0-2.5GB/sec is about average for a 16-lane PCI Express bus.  This represents about half the theoretical maximum bandwidth of the PCIe v1.1 bus, and introduces about 1ms of overhead to transfer a 1920x1080 greyscale image.
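As a rough illustration, the sketch below times a single host-to-device copy from ordinary pageable memory using CUDA events.  It uses the runtime API (cudaMemcpy) rather than the driver-API cuMemcpy named above, and the buffer size matches the 1920x1080 greyscale frame in the example; treat it as a minimal benchmark skeleton rather than production code.

/* Minimal sketch: time one host-to-device copy from pageable memory. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const size_t bytes = 1920 * 1080;                 /* one 8-bit greyscale frame */
    unsigned char *hostBuf = (unsigned char*)malloc(bytes);  /* pageable memory */
    memset(hostBuf, 0, bytes);                        /* touch the pages before timing */

    void *devBuf = NULL;
    cudaMalloc(&devBuf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Pageable copy: %.3f ms (%.2f GB/sec)\n",
           ms, (bytes / (ms * 1.0e-3)) / 1.0e9);

    cudaFree(devBuf);
    free(hostBuf);
    return 0;
}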


Figure 1.  A normal cuMemcpy from host-to-device runs at about 2GB/sec

If we use the NVidia cuMemAllocHost function to allocate non-paged memory on the host, we can almost double the bandwidth when copying this buffer to the GPU device memory, achieving nearer 4GB/sec on most systems.  If you are able to write your capture code so that the frame grabber driver will DMA image data directly into one of these page-locked buffers, then that is a worthwhile thing to do.  Unfortunately, that's not always possible.
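Switching the host allocation over to pinned memory is a small change.  The fragment below continues the previous sketch (reusing bytes, devBuf, start, stop and ms) and simply swaps the malloc for a page-locked allocation; cudaMallocHost is the runtime-API counterpart of the driver-API cuMemAllocHost call mentioned above.

/* Same transfer, but from pinned (page-locked) host memory. */
unsigned char *pinnedBuf = NULL;
cudaMallocHost((void**)&pinnedBuf, bytes);      /* page-locked host allocation */
memset(pinnedBuf, 0, bytes);                    /* stand-in for the captured frame */

cudaEventRecord(start, 0);
cudaMemcpy(devBuf, pinnedBuf, bytes, cudaMemcpyHostToDevice);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
printf("Pinned copy:   %.3f ms (%.2f GB/sec)\n",
       ms, (bytes / (ms * 1.0e-3)) / 1.0e9);

cudaFreeHost(pinnedBuf);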

Page-Locked Intermediate Buffer
Sometimes the frame-grabber acquires images into a host memory buffer without giving us the option to acquire directly into our CUDA-allocated page-locked memory.  In this situation, we can either copy our captured image directly to the device memory as in Figure 1, or choose to memcpy into a page-locked buffer prior to transfer across the PCIe bus as in Figure 2.  Since a host memcpy takes time, this erodes some of the benefit of using the page-locked buffer.



Figure 2.  Using a page-locked buffer as a staging post before transfer can still increase performance, despite introducing an additional host memcpy operation from the acquire buffer to the page-locked buffer.

Using a page-locked transfer buffer as shown in Figure 2 is only worthwhile when the cost of the host memcpy operation is low - which requires a relatively high-performance chipset (e.g. ICH10) with fast DDR2 (6.4GB/sec) or DDR3 (8.5GB/sec) memory.  At a minimum, the host-to-host copy must execute faster than 4GB/sec, otherwise the direct copy in Figure 1 is usually faster.  As an example, the approximate time taken to transfer 1GB using paged memory is:

1GB / (2GB/sec) = 500ms

When using the scheme in Figure 2, the total time taken to transfer 1GB from host to the page-locked buffer and then on to the GPU is approximately:

  1GB / (8GB/sec) = 125ms
+ 1GB / (4GB/sec) = 250ms
                  = 375ms


This is an improvement over the straight copy, so it would appear that non-paged memory does help even in this non-ideal situation.  When using a newer P45 chipset with PCI Express v2.0, the maximum achievable transfer bandwidth is higher.  In theory, the PCIe bus on the newer Intel P45 and P35 chipsets will handle 16GB/sec and 8GB/sec respectively, but in practice transfers are limited by main memory bandwidth, reducing host-to-GPU bandwidth to something between 5 and 6GB/sec.
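For completeness, here is a minimal sketch of the Figure 2 staging scheme, written as a helper function under assumed names: acquireBuf stands in for whatever pageable buffer the frame-grabber driver hands back, devBuf is an existing device allocation, and bytes is the frame size.

#include <cuda_runtime.h>
#include <string.h>

/* Copy a captured frame to the GPU via a page-locked staging buffer (Figure 2). */
void uploadViaStagingBuffer(const unsigned char *acquireBuf, void *devBuf, size_t bytes)
{
    unsigned char *staging = NULL;
    cudaMallocHost((void**)&staging, bytes);    /* page-locked staging buffer */

    /* Step 1: host-to-host copy - only cheap with a fast chipset and memory */
    memcpy(staging, acquireBuf, bytes);

    /* Step 2: page-locked host-to-device copy at the higher ~4GB/sec rate */
    cudaMemcpy(devBuf, staging, bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(staging);
}

In a real capture loop you would allocate the staging buffer once and reuse it for every frame, since the page-locked allocation itself is an expensive call.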


The conclusion is that, if at all possible, acquire directly into pinned, page-locked memory.  If that isn't possible, using an intermediate page-locked buffer is still worthwhile, provided the host chipset and memory performance is good.

Direct FrameGrabber-to-GPU DMA
It would be really great to be able to DMA directly from a frame grabber into GPU device memory, avoiding the CPU and main memory entirely, but I don't believe this is possible.  It might be achievable using driver-level transfers akin to those used by DirectShow drivers, but it is not currently possible to obtain a physical address for GPU device memory using CUDA.

Simon Green, from NVidia, says:
"A lot of people have asked for this. It is technically possible for other PCI-E devices to DMA directly into GPU memory, but we don't have a solution yet. We'll keep you posted." - Sep 2009
This is a capability worth waiting for, but don't hold your breath.



Vision Experts
