Devices

Interface

ComputableDAGs.AbstractDevice — Type

AbstractDevice

Abstract base type for every device, such as GPUs, CPUs, or any other compute device. Every subtype needs to implement the interface functions documented below, in particular get_devices and measure_device!.

ComputableDAGs.Machine — Type

Machine

A representation of a machine to execute on. Contains information about its architecture (CPUs, GPUs, maybe more). This representation can be used to make a more accurate cost prediction of a DAG state.

See also: device_types, get_devices

ComputableDAGs.get_devices — Function

get_devices(t::Type{T}; verbose::Bool) where {T <: AbstractDevice}

Interface function that must be implemented for every subtype of AbstractDevice. Returns a Vector{Type} of the devices of the given AbstractDevice type that are available on the current machine.

ComputableDAGs.kernel — Function

kernel(gpu_type::Type{<:AbstractGPU}, graph::DAG, instance)

For a GPU type, a DAG, and a problem instance, return an Expr containing a function with the signature compute_<id>(input::<GPU>Vector, output::<GPU>Vector, n::Int64). The function computes the DAG for the given input and writes the results into the given output vector; it is intended to run on GPUs. Currently, CUDAGPU and ROCmGPU are available if their respective package extensions are loaded.

The generated kernel function uses only the thread ID in the x-dimension as its index; it does not use the block ID. The input and output should therefore be 1-dimensional vectors. For detailed information on GPU programming and the Julia packages, please refer to their respective documentation.

A simple example call for a CUDA kernel might look like the following:

@cuda threads = (32,) always_inline = true cuda_kernel!(cu_inputs, outputs, length(cu_inputs))

Note

Unlike the standard get_compute_function, which generates a callable function and returns it as a RuntimeGeneratedFunction, this function returns an Expr that needs to be eval'd. This is a limitation of RuntimeGeneratedFunctions.jl, which currently cannot wrap GPU kernels. This might change in the future.
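The eval step can be tried on the CPU without any GPU package. The following is only a sketch of the mechanism: compute_demo, its body, and the plain Vector{Float64} arguments are hypothetical stand-ins for what kernel would actually generate, and the loop over id merely imitates one GPU thread per input element.

```julia
# Hypothetical stand-in for an Expr returned by `kernel`: a function definition
# with the (input, output, n) shape described above, using CPU vectors.
kernel_expr = quote
    function compute_demo(input::Vector{Float64}, output::Vector{Float64}, n::Int64)
        # On a GPU, each `id` would be handled by its own thread.
        for id in 1:n
            output[id] = 2.0 * input[id] + 1.0
        end
        return nothing
    end
end

# Evaluating the Expr defines the function and returns it.
compute_demo_fn = eval(kernel_expr)

inputs = [1.0, 2.0, 3.0]
outputs = zeros(3)

# invokelatest avoids world-age errors if this code runs inside a function.
Base.invokelatest(compute_demo_fn, inputs, outputs, length(inputs))
```

After the call, outputs holds one result per input element, which matches the calling convention that a real generated kernel is launched with.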

Size limitation

The generated kernel does not use any internal parallelization, i.e., the DAG is compiled into a serialized function, and each input is processed in a single thread of the GPU. This means the computation parallelizes well across inputs and can fully utilize the GPU for sufficiently large input vectors (assuming the function does not become IO limited etc.). However, it also means that there is a limit to how large the compiled function can be. If it gets too large, the compilation might fail or take too long to complete, the kernel might fail during execution because too much stack memory is required, or other similar problems might occur. If this happens, your problem is likely too large to be compiled to a GPU kernel like this.

Compute Requirements

A GPU function has more restrictions on what can be computed than general functions running on the CPU. In Julia, there are mainly two important restrictions to consider:

Used data types must be stack allocatable, i.e., isbits(x) must be true for arguments and local variables used in ComputeTasks.
Function calls must not be dynamic. This means that type stability is required and the compiler must know in advance which method of a generic function to call. What this specifically entails may change over time and also differs between the target GPU libraries. From experience, using the always_inline = true argument for @cuda calls can help with this.
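The first restriction can be checked ahead of time with Base.isbits and Base.isbitstype. In the sketch below, Momentum and Counter are hypothetical types chosen only to contrast an immutable struct of plain floats with a mutable one.

```julia
# Immutable struct whose fields are all plain bits types.
struct Momentum
    px::Float64
    py::Float64
end

# Mutable structs are heap-allocated and therefore not isbits.
mutable struct Counter
    n::Int
end

checks = (
    float   = isbits(1.0),                 # plain number
    tuple   = isbits((1.0f0, 2)),          # tuple of bits types
    strct   = isbits(Momentum(0.5, 1.5)),  # immutable struct of floats
    mutable = isbits(Counter(0)),          # mutable struct
    array   = isbits([1.0, 2.0]),          # Array lives on the heap
    string  = isbits("text"),              # String is not a bits type
    complex = isbitstype(ComplexF64),      # check on the type instead of a value
)

for (name, ok) in pairs(checks)
    println(rpad(name, 8), " => ", ok)
end
```

Values for which the check is false (here the mutable struct, the Array, and the String) would need to be replaced by isbits equivalents before they can be used inside a ComputeTask that is compiled into a GPU kernel.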

Warning

This feature is currently experimental. There are still some unresolved issues with the generated kernels.

ComputableDAGs.measure_device! — Function

measure_device!(device::AbstractDevice; verbose::Bool)

Interface function that must be implemented for every subtype of AbstractDevice. Measures the compute speed of the given device and writes the result into the device object.
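As a rough illustration of the two interface functions documented in this section, the following sketch implements them for a made-up device type. Everything here is an assumption for illustration: the local AbstractDevice is only a stand-in so the code runs without the package, and DummyDevice, its flops field, and the timing loop are hypothetical rather than taken from ComputableDAGs.

```julia
# Stand-in for ComputableDAGs.AbstractDevice so the sketch runs on its own.
abstract type AbstractDevice end

# Hypothetical device type; the field names are assumptions for illustration.
mutable struct DummyDevice <: AbstractDevice
    id::Int
    flops::Float64   # written by measure_device!; -1.0 means "not measured yet"
end

# Return all DummyDevices "found" on this machine (here: always exactly one).
function get_devices(::Type{T}; verbose::Bool=false) where {T<:DummyDevice}
    devices = [DummyDevice(1, -1.0)]
    verbose && println("Found $(length(devices)) dummy device(s)")
    return devices
end

# Time a fixed number of multiply-adds and store a rough rate in the device.
function measure_device!(device::DummyDevice; verbose::Bool=false)
    n = 1_000_000
    acc = 0.0
    elapsed = @elapsed for i in 1:n
        acc = muladd(acc, 0.5, 1.0)
    end
    device.flops = 2n / max(elapsed, eps())   # two operations per iteration
    verbose && println("Device $(device.id): ~$(device.flops) FLOP/s (acc = $acc)")
    return nothing
end

devices = get_devices(DummyDevice)
measure_device!(devices[1])
```

When extending the real package, the methods would be added to ComputableDAGs.get_devices and ComputableDAGs.measure_device! instead of defining new functions with the same names.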

Detect

ComputableDAGs.get_machine_info — Method

get_machine_info(verbose::Bool)

Return the Machine that the code is currently running on. The parameter verbose defaults to true in interactive sessions.

Measure

ComputableDAGs.measure_devices! — Method

measure_devices!(machine::Machine; verbose::Bool)

Measure FLOPS, RAM, cache sizes, and whatever other properties can be extracted for the devices in the given machine.

ComputableDAGs.measure_transfer_rates! — Method

measure_transfer_rates!(machine::Machine; verbose::Bool)

Measure the transfer rates between devices in the machine.

Implementations

General

ComputableDAGs._gen_let_statement — Method

_gen_let_statement(symbol::Symbol)

Return a let-Expr like <symbol> = <symbol>.
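The shape of such an expression can be reproduced with the Expr constructor. The helper below is a hypothetical re-creation for illustration, not the package's implementation; it also shows how several of these bindings could form the header of a let block.

```julia
# Hypothetical re-creation: build the assignment `<symbol> = <symbol>`.
gen_let_statement_sketch(symbol::Symbol) = Expr(:(=), symbol, symbol)

binding = gen_let_statement_sketch(:data)

# Several bindings collected in a block form the header of a `let` expression,
# here wrapped around a small body that uses the rebound names.
bindings = Expr(:block, gen_let_statement_sketch(:a), gen_let_statement_sketch(:b))
let_expr = Expr(:let, bindings, :(a + b))

println(binding)
println(let_expr)
```

Rebinding a name to itself in a let header creates a new local binding with the same value, which is a common way to capture outer variables inside generated code.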

ComputableDAGs._gen_local_init — Method

_gen_local_init(symbol::Symbol, type::Type)

Return an Expr that initializes the symbol in the local scope. The result looks like local <symbol>::<type>.
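A typed local declaration of this form can be built and used as follows. The helper gen_local_init_sketch and the function typed_local_demo are hypothetical names for illustration and do not come from the package.

```julia
# Hypothetical re-creation: build the declaration `local <symbol>::<type>`.
gen_local_init_sketch(symbol::Symbol, type::Type) =
    Expr(:local, Expr(:(::), symbol, type))

init = gen_local_init_sketch(:acc, Float64)

# Splice the declaration into a generated function body. Because `acc` is
# declared as Float64, the integer assigned below is converted on assignment.
demo = eval(quote
    function typed_local_demo()
        $init
        acc = 1
        return acc
    end
end)

# invokelatest avoids world-age errors if this code runs inside a function.
result = Base.invokelatest(demo)
println(init, " -> ", result, " (", typeof(result), ")")
```

Declaring the type up front keeps the variable type-stable across the generated code, which matters for the GPU restrictions described for kernel.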

ComputableDAGs.cpu_st — Method

cpu_st()

A function returning a Machine that only has a single thread of one CPU. It is the simplest machine definition possible and produces a simple function when used with get_compute_function.

ComputableDAGs.device_types — Method

device_types()

Return a vector of available and implemented device types.

NUMA

ComputableDAGs.NumaNode — Type

NumaNode <: AbstractCPU

Representation of a specific CPU that code can run on. Implements the AbstractDevice interface.

ComputableDAGs.get_devices — Method

get_devices(deviceType::Type{T}; verbose::Bool) where {T <: NumaNode}

Return a Vector of NumaNodes available on the current machine. If verbose is true, print some additional information.

GPUs

ComputableDAGs.CUDAGPU — Type

CUDAGPU <: AbstractGPU

Representation of a specific CUDA GPU that code can run on. Implements the AbstractDevice interface.

Note

This requires CUDA to be loaded to be useful.

ComputableDAGs.ROCmGPU — Type

ROCmGPU <: AbstractGPU

Representation of a specific AMD GPU that code can run on. Implements the AbstractDevice interface.

Note

This requires AMDGPU to be loaded to be useful.

ComputableDAGs.oneAPIGPU — Type

oneAPIGPU <: AbstractGPU

Representation of a specific Intel GPU that code can run on. Implements the AbstractDevice interface.

Note

This requires oneAPI to be loaded to be useful.
