How to define intermediate result when using schedules? #503

hulihan-start · 2023-06-28T00:55:01Z

I followed your test code on: https://github.com/roastduck/FreeTensor/blob/master/test/70.program/test_gpu_conv2d.py
I'm not sure whether 'cache' is the correct keyword for this case, but I get a CUDA error:
ptxas warning : Value of threads per SM for entry kernel0 is out of range. .minnctapersm and .maxntid will be ignored
CUDA error in file '/root/.freetensor/o17vag/run.cu' in line 73 : invalid argument.
Traceback (most recent call last):
  File "/data/not_backed_up/lihhu/FreeTensor_experiments/TransR_scheduler.py", line 87, in <module>
    transr()
  File "/data/not_backed_up/lihhu/FreeTensor_experiments/TransR_scheduler.py", line 84, in transr
    result = eval(func, True, True)
  File "/data/not_backed_up/lihhu/FreeTensor_experiments/TransR_scheduler.py", line 50, in eval
    t1, _ = driver.time()
RuntimeError: cuda error

Here is my code:

import freetensor as ft
import torch
device = ft.GPU(0)
target = device.target()
host = ft.CPU()

h = torch.randint(0, 4096, (4096, ), dtype=torch.int64).cuda(0)
t = torch.randint(0, 4096, (4096, ), dtype=torch.int64).cuda(0)
r = torch.randint(0, 4096, (4096, ), dtype=torch.int64).cuda(0)
eemb = torch.rand(93773, 512).cuda(0)
remb = torch.rand(51, 512).cuda(0)
proj = torch.rand(51, 512, 512).cuda(0)
res = torch.rand(4096, 512).cuda(0)

batch_size = h.shape[0]
dim = eemb.shape[1]
enode = eemb.shape[0]
rnode = remb.shape[0]

def transr():
    def eval(func, print_code=False, time=False):
        func = ft.lower(func, target)
        if print_code:
            print(func, flush=True)
        code = ft.codegen(func, target)
        if print_code:
            print(code, flush=True)
        driver = ft.build_binary(code, device)
        res = torch.zeros(batch_size,).cuda(0)
        
        head = ft.Array(h)
        tail = ft.Array(t)
        relation = ft.Array(r)
        entemb = ft.Array(eemb)
        relemb = ft.Array(remb)
        pemb = ft.Array(proj)
        res = ft.Array(res)
        
        driver.set_args(heads=head, tails=tail, relations=relation, entemb=entemb, relemb=relemb, pemb=pemb, result=res)
        if time:
            t1, _ = driver.time()
            print("time: %s ms" % t1)
        else:
            driver.run()
        B_np = res.torch()
        return B_np
    
    @ft.transform
    def score_func(heads, tails, relations, entemb, relemb, pemb, result):
        heads: ft.Var[(batch_size, ), "int64", "input", "gpu/global"]
        tails: ft.Var[(batch_size, ), "int64", "input", "gpu/global"]
        relations: ft.Var[(batch_size, ), "int64", "input", "gpu/global"]
        entemb: ft.Var[(enode, dim, ), "float32", "input", "gpu/global"]
        relemb: ft.Var[(enode, dim, ), "float32", "input", "gpu/global"]
        pemb: ft.Var[(enode, dim, dim, ), "float32", "input", "gpu/global"]
        result: ft.Var[(batch_size, ), "float32", "output", "gpu/global"]
        inter: ft.Var[(batch_size, dim,), "float32", "cache", "gpu/global"]

        # inter = ft.empty((batch_size, dim), "float32")

        #! label: bx
        for bb in range(batch_size):
            #! label: ty
            for dd in range(dim):
                #! label: tx
                for kk in range(dim):
                    inter[bb, dd] += (entemb[heads[bb], kk] - entemb[tails[bb], kk]) * pemb[relations[bb], kk, dd]
                result[bb] += ft.abs(inter[bb, dd] + relemb[relations[bb], dd])

    s = ft.Schedule(score_func)
    s.parallelize("bx", "blockIdx.x")
    s.parallelize("ty", "threadIdx.y")
    s.parallelize("tx", "threadIdx.x")
    func = s.func()
    result = eval(func, True, True)


transr()

Can you help me to fix this issue? Thank you so much!

roastduck · 2023-06-28T02:21:56Z

Maintainer

You have too many CUDA threads per CUDA block. In this code, you mapped both tx and ty to CUDA threads, so each block would have 512 * 512 threads. On a typical NVIDIA GPU, this number must not exceed 1024 (refer to NVIDIA's documentation for details).
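As a quick sanity check of that arithmetic, here is a minimal plain-Python sketch. The helper name `threads_per_block` and the constant `MAX_THREADS_PER_BLOCK` are hypothetical, introduced only for illustration; the 1024 figure is the limit quoted above, so check the documentation for your specific GPU.

```python
# Limit quoted in the reply above; assumed here, verify it for your GPU.
MAX_THREADS_PER_BLOCK = 1024


def threads_per_block(*extents):
    """Multiply the extents of all loops mapped to threadIdx.* dimensions."""
    total = 1
    for extent in extents:
        total *= extent
    return total


dim = 512  # eemb.shape[1] in the original post

# Both the ty loop (dim) and the tx loop (dim) are mapped to threads.
requested = threads_per_block(dim, dim)
print(requested, requested <= MAX_THREADS_PER_BLOCK)

# A 32 x 32 tile would sit exactly at the limit.
print(threads_per_block(32, 32) <= MAX_THREADS_PER_BLOCK)
```

Running the same check on candidate tile sizes before calling parallelize is a cheap way to catch this class of launch failure early.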

As in the test_gpu_conv2d.py example, a typical way to deal with this is to tile the loops with the split and reorder schedules. The loops should end up looking like this (boundary checks for integer division are omitted):

for bb_out in range(batch_size // tile_size_for_bb):
  for dd_out in range(dim // tile_size_for_dd):
    for kk_out in range(dim // tile_size_for_kk):
      for bb_in in range(tile_size_for_bb):
        for dd_in in range(tile_size_for_dd):
          for kk_in in range(tile_size_for_kk):

where you keep each tile_size_* small enough that the number of threads per block stays within the limit, then map all the *_in loops to threads and all the *_out loops to blocks.
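To make the tiling idea concrete, the following plain-Python sketch shows what splitting one loop into an outer and an inner loop does to the iteration space, including the boundary check that the loop nest above omits. It only illustrates the index arithmetic and is not FreeTensor code; the function name `tiled_indices` is hypothetical.

```python
def tiled_indices(n, tile_size):
    """Visit 0..n-1 through an outer (block) loop and an inner (thread) loop."""
    visited = []
    n_tiles = (n + tile_size - 1) // tile_size  # ceiling division
    for i_out in range(n_tiles):        # would be mapped to blockIdx.*
        for i_in in range(tile_size):   # would be mapped to threadIdx.*
            i = i_out * tile_size + i_in
            if i < n:                   # boundary check for a partial last tile
                visited.append(i)
    return visited


# The tiled loops cover exactly the same iterations as the original loop,
# even when tile_size does not divide n.
print(tiled_indices(10, 4) == list(range(10)))
```

With three tiled loops, the thread count per block is the product of the three tile sizes, so that product is the number to keep within the limit.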

7 replies

roastduck Jul 18, 2023
Maintainer

This is a low-level error returned by CUDA. It may mean that FreeTensor generated incorrect code. If you like, you can set verbose=1 in optimize, lower, or codegen to print the code and check it.

roastduck Jul 18, 2023
Maintainer

Oh, I see what happened. The XXX: ft.Var[...] syntax is only for declaring the types of function parameters. For intermediate variables, use XXX = ft.empty(your_shape, your_scalar_data_type, your_memory_type) instead, for example inter = ft.empty((batch_size, dim,), "float32", "gpu/shared"). Setting "cache" in ft.Var is not a correct usage. I will add a check for this in the frontend.
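A minimal sketch of the two forms side by side, using only the calls that appear in this thread (ft.transform, ft.Var, ft.empty, ft.abs). The kernel `row_abs_sum`, its literal shapes, and the "cpu" memory type are illustrative assumptions and not taken from the original code. The import is guarded only so that the snippet still loads on a machine without FreeTensor installed.

```python
try:
    import freetensor as ft
except ImportError:  # allow reading/loading the sketch without FreeTensor
    ft = None

if ft is not None:

    @ft.transform
    def row_abs_sum(x, y):
        # Function parameters: declared with the `name: ft.Var[...]` syntax.
        x: ft.Var[(4, 8), "float32", "input", "cpu"]
        y: ft.Var[(4,), "float32", "output", "cpu"]

        # Intermediate variable: created with ft.empty, not with ft.Var.
        tmp = ft.empty((8,), "float32", "cpu")

        for i in range(4):
            for j in range(8):
                tmp[j] = ft.abs(x[i, j])
            y[i] = 0
            for j in range(8):
                y[i] += tmp[j]
```

For a GPU kernel, the memory type passed to ft.empty would be one of the GPU memory types used in this thread, such as "gpu/shared" or "gpu/global".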

hulihan-start Jul 18, 2023
Author

It works! Thank you so much!

hulihan-start Jul 18, 2023
Author

Another question: how do I resolve the write-after-read dependence? I followed the conv2d test and used ft.MoveToSide, but the issue still occurs.

Error info is:
The reason is: /data/not_backed_up/lihhu/FreeTensor_experiments/FreeTensor/src/schedule/parallelize.cc:74: Dependence WRITE inter[nn.0 * 8 + nn.1, dd] = 0 after READ inter[nn.0 * 8 + nn.1, dd_1.0 * 8 + dd_1.1] in result[bb, nn.0 * 8 + nn.1, dd_1.0 * 8 + dd_1.1] += inter[nn.0 * 8 + nn.1, dd_1.0 * 8 + dd_1.1] + relemb[relations[bb], (dd_1.0 * 8 + dd_1.1)] along #10 cannot be resolved

@ft.transform
def score_func(heads, tails, relations, entemb, relemb, pemb, result):
    heads: ft.Var[(batch_size, ), "int64", "input", "gpu/global"]
    tails: ft.Var[(batch_size, neg_size, ), "int64", "input", "gpu/global"]
    relations: ft.Var[(batch_size, ), "int64", "input", "gpu/global"]
    entemb: ft.Var[(enode, dim, ), "float32", "input", "gpu/global"]
    relemb: ft.Var[(rnode, dim, ), "float32", "input", "gpu/global"]
    pemb: ft.Var[(rnode, dim, dim, ), "float32", "input", "gpu/global"]
    result: ft.Var[(batch_size, neg_size, dim, ), "float32", "output", "gpu/global"]

    inter = ft.empty((neg_size, dim,), "float32", "gpu/shared")

    #! label: Bx
    for bb in range(batch_size):
        #! label: Ty
        for nn in range(neg_size):
            #! label: init
            for dd in range(dim):
                inter[nn][dd] = 0

            #! label: Tx
            for dd in range(dim):
                #! label: Tz
                for kk in range(dim):
                    inter[nn][dd] += (entemb[heads[bb]][kk] - entemb[tails[bb][nn]][kk]) * pemb[relations[bb]][kk][dd]
                result[bb][nn][dd] += inter[nn][dd] + relemb[relations[bb]][dd]

bx, ty, tx, tz = "Bx", "Ty", "Tx", "Tz"
s = ft.Schedule(score_func)
ty, ly = s.split(ty, nparts=8)
tx, lx = s.split(tx, nparts=64)

s.reorder([bx, ty, ly, tx, lx, tz])
s.move_to("init", ft.MoveToSide.Before, tx)
s.parallelize(bx, "blockIdx.x")
s.parallelize(ty, "threadIdx.y")
s.parallelize(tx, "threadIdx.x")
func = s.func()
result = eval(func, True, True)

roastduck Jul 19, 2023
Maintainer

One of the key ideas in FreeTensor is to ensure that no schedule breaks a dependence, so the final program stays correct. When FreeTensor raises an exception about a dependence, it first prints the AST (before the words "The reason is"), where you can see which dependence blocks the schedule. The error message may contain an ID like #10, which refers to a statement that you can find in the printed AST. If the dependence is real, you simply cannot apply the schedule, or the program would produce wrong results.
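As a generic illustration of why a write-after-read dependence matters (it is not a diagnosis of the code above), the plain-Python sketch below runs two loop iterations that share one scratch cell. Each iteration first writes the cell and then reads it back. The names `run`, `sequential`, and `interleaved` are hypothetical.

```python
def run(steps):
    """Execute (op, iteration) steps in the given order on one shared cell."""
    scratch = 0
    out = {}
    for op, i in steps:
        if op == "write":
            scratch = i + 1   # iteration i initializes the shared cell
        else:
            out[i] = scratch  # iteration i reads the shared cell back
    return out


# Original order: iteration 1 writes only after iteration 0 has read.
sequential = [("write", 0), ("read", 0), ("write", 1), ("read", 1)]
# One order a parallel execution could produce: the write of iteration 1
# lands before the read of iteration 0, breaking the write-after-read order.
interleaved = [("write", 0), ("write", 1), ("read", 0), ("read", 1)]

print(run(sequential))   # {0: 1, 1: 2}
print(run(interleaved))  # {0: 2, 1: 2}
```

Iteration 0 observes a different value in the second ordering, which is the kind of silent miscomputation that rejecting the schedule prevents.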

However, the dependence analysis may be overly conservative: FreeTensor plays safe and rejects a schedule when it does not have enough information. This is especially common when the program contains indirect accesses, as in your case. For example, FreeTensor does not know whether the items in heads are unique (for example, whether heads[0] == heads[1], which may lead to a dependence), so it has to assume there may be a dependence. If FreeTensor thinks there is a dependence but you believe there is not, you can explicitly mark some loops as dependence-free. See https://roastduck.github.io/FreeTensor/guide/hint/#hint-free-of-dependence-by-no_deps.
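The uniqueness question can be illustrated in plain Python. In a loop that updates `target[indices[i]]`, two iterations touch the same element only when the index array repeats a value, and that is a property of the runtime data that a compiler cannot see. The helper below is a hypothetical sketch, not part of FreeTensor.

```python
def conflicting_iterations(indices):
    """Pair each iteration that reuses an index with the first one that used it."""
    first_seen = {}
    conflicts = []
    for iteration, idx in enumerate(indices):
        if idx in first_seen:
            conflicts.append((first_seen[idx], iteration))
        else:
            first_seen[idx] = iteration
    return conflicts


# Unique indices: no two iterations touch the same element.
print(conflicting_iterations([3, 7, 1]))  # []
# A repeated index: iterations 0 and 2 both access element 3.
print(conflicting_iterations([3, 7, 3]))  # [(0, 2)]
```

Marking a loop as dependence-free is only safe when you know from how the data is produced that the first case always holds for the accesses in question.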
