Conversation
It is based on img.width and available cache sizes
|
This seems to block only for the L1 cache: I guess you have to pass … For debugging, you can define a Boolean flag.
|
With the code in 9720ea8 I get 478 ms median(27) and with my version in c097a4f I get 493 ms. We are talking about gaussian with … You mention blocking for multiple levels: with the given image size of 4096x4096, the L1 block_size needs to be <= 1638 elements, while for all other caches the layer conditions are already fulfilled at the original size (L2 would need blocking with <= 13107 and L3 with <= 1M elements). Did I misunderstand what you mean by blocking for other levels?
mapping_cpu.impala (Outdated):

    //print_string(", ");
    //print_int(mask.size_y);
    //print_string(")\n");
    if debug_tiling @{
No @ is required here - the frontend will already discard the conditional code if debug_tiling is false.
|
Your current logic starts …
|
To make debugging programs using PE easier, we will add an instruction to show the "status" during PE, something like …
|
Regarding speed: your version takes 286 ms and the version on master takes only 76 ms on my laptop (gaussian & iteration_advanced).
|
Ok, we've added the … Adding

    let (xtile_dim, ytile_dim) = @get_tile_dims(mask, img);
    pe_info("xtile_dim", xtile_dim);
    pe_info("ytile_dim", ytile_dim);

shows that … That is, some condition in … There are two solutions:
|
To be clear: … Examples: …
That being said, note that usually other optimizations in thorin will fold those loads & stores afterwards. So folding these loads & stores is only critical if control flow depends on such variables. I think we should fold loads & stores during partial evaluation. This is, however, more complicated than it sounds. So don't expect it to be fixed next week :(
|
That was a measurement mistake (last deleted comment). Is there a way to parse command line arguments? I would like to switch between the different versions dynamically, to avoid such mistakes.
|
You can:

    fn impala_main(do_thing: i32) -> () {
        if do_thing == 42 {
            // ...
        }
    }

    extern "C" void impala_main(int);

    int main(int argc, char** argv) {
        int do_thing = parse_args(argc, argv);
        impala_main(do_thing);
    }
    extern "C" {
        fn strcmp(&[u8], &[u8]) -> i32;
    }

    fn main(argc: i32, argv: &[&[u8]]) -> i32 {
        for i in range(1, argc) {
            if !strcmp(argv(i), "-h") {
                usage();
                return(0)
            } else if !strcmp(argv(i), ...) {
                // ... and so on
            }
        }
    }
|
Some of our tests in Impala use argc and argv, for example nbody. I'm going to have a look at your changes tomorrow.
|
I had a look at the code; first a minor remark: …
Looking at execution times: …
So it seems something is going wrong here. … shows the problem: …
I can push these changes to your fork if you give me the permissions.
|
I've pushed my changes. Maybe also good to know: …
|
Enabling fast-math gives: …
|
It is currently tailored to 2D box stencils, but can be generalized.