Devices
Interface
ComputableDAGs.AbstractDevice — Type
AbstractDevice
Abstract base type for every device, such as GPUs, CPUs, or any other compute device. Every implementation must provide the interface functions described below.
ComputableDAGs.Machine — Type
Machine
A representation of a machine to execute on. Contains information about its architecture (CPUs, GPUs, maybe more). This representation can be used to make a more accurate cost prediction of a DAG state.
See also: Scheduler
ComputableDAGs.DEVICE_TYPES — Constant
DEVICE_TYPES::Vector{Type}
Global vector of available and implemented device types. Each implementation of an AbstractDevice should add its concrete type to this vector.
See also: device_types, get_devices
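As an illustrative sketch (MyDevice and its fields are hypothetical, not part of ComputableDAGs), registering a new device implementation could look like this:
```julia
using ComputableDAGs

# Hypothetical device type; the fields shown are illustrative only.
mutable struct MyDevice <: ComputableDAGs.AbstractDevice
    FLOPS::Float64
end

# Register the concrete type so it can be discovered alongside the built-in devices.
push!(ComputableDAGs.DEVICE_TYPES, MyDevice)
```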
ComputableDAGs._gen_access_expr — Function
_gen_access_expr(device::AbstractDevice, symbol::Symbol)
Interface function that must be implemented for every subtype of AbstractDevice. Return an Expr or QuoteNode accessing the variable identified by symbol.
ComputableDAGs._gen_local_init — Function
_gen_local_init(fc::FunctionCall, device::AbstractDevice)
Interface function that must be implemented for every subtype of AbstractDevice. Return an Expr or QuoteNode that initializes the access expression returned by _gen_access_expr in the local scope. This expression may be empty. For local variables it should be local <variable_name>::<Type>.
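Continuing the hypothetical MyDevice sketch from above, implementations of these two interface functions might look roughly like the following. The exact expressions a real device needs, and how the variable name is obtained from the FunctionCall, are device-specific, so treat this purely as an assumption-laden sketch:
```julia
# Access a variable by name on this device; for a CPU-like device, quoting the
# symbol so it can be interpolated into generated code is typically enough.
function ComputableDAGs._gen_access_expr(::MyDevice, symbol::Symbol)
    return QuoteNode(symbol)
end

# Declare the variable in local scope. `result` is a placeholder name; a real
# implementation would derive the name (and possibly a type annotation) from
# the FunctionCall.
function ComputableDAGs._gen_local_init(fc::ComputableDAGs.FunctionCall, ::MyDevice)
    return Expr(:local, :result)   # emits `local result`
end
```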
ComputableDAGs.get_devices — Function
get_devices(t::Type{T}; verbose::Bool) where {T <: AbstractDevice}
Interface function that must be implemented for every subtype of AbstractDevice. Returns a Vector of the devices of the given AbstractDevice type that are available on the current machine.
ComputableDAGs.kernel — Function
kernel(gpu_type::Type{<:AbstractGPU}, graph::DAG, instance)
For a GPU type, a DAG, and a problem instance, return an Expr containing a function of signature compute_<id>(input::<GPU>Vector, output::<GPU>Vector, n::Int64), which computes the DAG for each input and writes the results into the given output vector, intended for computation on GPUs. Currently, CUDAGPU and ROCmGPU are available if their respective package extensions are loaded.
The generated kernel function accepts its thread ID in only the x-dimension, and only as thread ID, not as block ID. The input and output should therefore be 1-dimensional vectors. For detailed information on GPU programming and the Julia packages, please refer to their respective documentations.
A simple example call for a CUDA kernel might look like the following:
@cuda threads = (32,) always_inline = true cuda_kernel!(cu_inputs, outputs, length(cu_inputs))
Unlike the standard get_compute_function, which generates a callable RuntimeGeneratedFunction, this returns an Expr that needs to be eval'd. This is a current limitation of RuntimeGeneratedFunctions.jl, which cannot yet wrap GPU kernels. This might change in the future.
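A hedged end-to-end sketch, assuming CUDA.jl is loaded (so the CUDAGPU extension is active), that graph::DAG and instance already exist, and that ResultType is a placeholder for whatever element type the problem instance produces:
```julia
using CUDA
using ComputableDAGs

kernel_expr  = kernel(CUDAGPU, graph, instance)  # Expr defining compute_<id>(...)
cuda_kernel! = eval(kernel_expr)                 # eval'ing the Expr yields the function

cu_inputs = CuVector(inputs)                               # inputs: a Vector of problem inputs
outputs   = CuVector{ResultType}(undef, length(cu_inputs)) # ResultType: placeholder output element type

@cuda threads = (32,) always_inline = true cuda_kernel!(cu_inputs, outputs, length(cu_inputs))
```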
Size limitation
The generated kernel does not use any internal parallelization, i.e., the DAG is compiled into a serialized function that processes each input in a single GPU thread. This means it can be heavily parallelized and use the GPU at 100% for sufficiently large input vectors (assuming the function does not become IO-limited, etc.). However, it also means that there is a limit to how large the compiled function can be. If it becomes too large, compilation might fail or take very long, the kernel might fail during execution if it requires too much stack memory, or similar problems may occur. If this happens, the problem is likely too large to be compiled into a GPU kernel this way.
Compute Requirements
A GPU function has more restrictions on what can be computed than general functions running on the CPU. In Julia, there are mainly two important restrictions to consider:
- Used data types must be stack allocatable, i.e., isbits(x) must be true for arguments and local variables used in ComputeTasks (a quick check is sketched after this list).
- Function calls must not be dynamic. This means that type stability is required and the compiler must know in advance which method of a generic function to call. What this specifically entails may change over time and also differs between the target GPU libraries. From experience, using the always_inline = true argument for @cuda calls can help with this.
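For the first requirement, Julia's isbitstype (or isbits on a value) gives a quick check of whether a type is stack allocatable; the types below are generic examples, not ComputableDAGs API:
```julia
using StaticArrays

isbitstype(Float64)              # true  - fine inside GPU kernels
isbitstype(SVector{4,Float64})   # true  - immutable, inline-allocated static array
isbitstype(Vector{Float64})      # false - heap-allocated, not usable in ComputeTasks
```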
This feature is currently experimental. There are still some unresolved issues with the generated kernels.
ComputableDAGs.measure_device! — Function
measure_device!(device::AbstractDevice; verbose::Bool)
Interface function that must be implemented for every subtype of AbstractDevice. Measures the compute speed of the given device and writes the result into the device.
Detect
ComputableDAGs.get_machine_info — Method
get_machine_info(verbose::Bool)
Return the Machine that the program is currently running on. The parameter verbose defaults to true when running in an interactive session.
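A typical call, assuming the verbose parameter can be left at its default:
```julia
using ComputableDAGs

machine = get_machine_info()   # detect the CPUs (and GPUs, if extensions are loaded)
```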
Measure
ComputableDAGs.measure_devices! — Method
measure_devices!(machine::Machine; verbose::Bool)
Measure FLOPS, RAM, cache sizes, and whatever other properties can be extracted for the devices in the given machine.
ComputableDAGs.measure_transfer_rates! — Method
measure_transfer_rates!(machine::Machine; verbose::Bool)
Measure the transfer rates between the devices in the machine.
Implementations
General
ComputableDAGs.cpu_st — Method
cpu_st()
A function returning a Machine that has only a single thread of one CPU. It is the simplest possible machine definition and produces a simple function when used with get_compute_function.
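A sketch of typical usage; the argument order of get_compute_function and the graph, instance, and input variables are assumptions here, not prescribed by this page:
```julia
using ComputableDAGs

machine = cpu_st()   # single CPU, single thread

# Assumed signature: get_compute_function(graph, instance, machine, context_module)
compute = get_compute_function(graph, instance, machine, @__MODULE__)
result  = compute(input)   # input: one problem input, as defined by the instance
```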
ComputableDAGs.device_types — Method
device_types()
Return the vector of available and implemented device types.
See also: DEVICE_TYPES
ComputableDAGs.entry_device — Method
entry_device(machine::Machine)
Return the "entry" device, i.e., the device that starts CPU threads and GPU kernels, and takes input values and returns the output value.
ComputableDAGs.gen_access_expr — Method
gen_access_expr(fc::FunctionCall)
Dispatch from the given FunctionCall to the interface function _gen_access_expr.
ComputableDAGs.gen_local_init — Method
gen_local_init(fc::FunctionCall)
Dispatch from the given FunctionCall to the interface function _gen_local_init.
NUMA
ComputableDAGs.NumaNode — Type
NumaNode <: AbstractCPU
Representation of a specific CPU that code can run on. Implements the AbstractDevice interface.
ComputableDAGs._gen_access_expr — Method
_gen_access_expr(device::NumaNode, symbol::Symbol)
Interface implementation, dispatched to from gen_access_expr.
ComputableDAGs._gen_local_init — Method
_gen_local_init(fc::FunctionCall, device::NumaNode)
Interface implementation, dispatched to from gen_local_init.
ComputableDAGs.get_devices — Method
get_devices(deviceType::Type{T}; verbose::Bool) where {T <: NumaNode}
Return a Vector of NumaNodes available on the current machine. If verbose is true, print some additional information.
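For example, assuming NumaNode is accessible in the current namespace:
```julia
using ComputableDAGs

numa_nodes = get_devices(NumaNode; verbose = true)
length(numa_nodes)   # number of NUMA domains found on this machine
```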
GPUs
ComputableDAGs.CUDAGPU — Type
CUDAGPU <: AbstractGPU
Representation of a specific CUDA GPU that code can run on. Implements the AbstractDevice interface.
This requires CUDA to be loaded to be useful.
ComputableDAGs.ROCmGPU — Type
ROCmGPU <: AbstractGPU
Representation of a specific AMD GPU that code can run on. Implements the AbstractDevice interface.
This requires AMDGPU to be loaded to be useful.
ComputableDAGs.oneAPIGPU — Type
oneAPIGPU <: AbstractGPU
Representation of a specific Intel GPU that code can run on. Implements the AbstractDevice interface.
This requires oneAPI to be loaded to be useful.
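As a hedged sketch, loading the corresponding GPU package activates the matching extension, after which the device type can be used like any other AbstractDevice, for example with get_devices:
```julia
using CUDA             # loading CUDA.jl activates the CUDAGPU extension
using ComputableDAGs

cuda_gpus = get_devices(CUDAGPU; verbose = true)   # CUDA devices found on this machine
```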