Conversation
It is based on img.width and available cache sizes
|
This seems to block only for the L1 cache: I guess you have to pass … For debugging, you can define a Boolean flag.
|
With the code in 9720ea8 I get 478 ms median(27) and with my version in c097a4f I get 493 ms. We are talking about gaussian with … You mention blocking for multiple levels: with the given image size of 4096x4096, the L1 block_size needs to be <= 1638 elements, while for all other caches the layer conditions are already fulfilled at the original size (L2 would need blocking with <= 13107 and L3 with <= 1M elements). Did I misunderstand what you mean by blocking for other levels?
mapping_cpu.impala (Outdated):

    //print_string(", ");
    //print_int(mask.size_y);
    //print_string(")\n");
    if debug_tiling @{
No @ is required here - the frontend will already discard the conditional code if debug_tiling is false.
|
Your current logic starts …
|
To make debugging programs using PE easier, we will add an instruction to show the "status" during PE, something like …
|
Regarding speed: your version takes 286 ms and the version on master takes only 76 ms on my laptop (gaussian & iteration_advanced).
|
Ok, we've added the … Adding

    let (xtile_dim, ytile_dim) = @get_tile_dims(mask, img);
    pe_info("xtile_dim", xtile_dim);
    pe_info("ytile_dim", ytile_dim);

shows that … That is, some condition in … There are two solutions:
|
To be clear: … Examples: …
That being said, note that usually other optimizations in thorin will fold those loads & stores afterwards. So folding these loads & stores is only critical if control flow depends on such variables. I think we should fold loads & stores during partial evaluation. This is, however, more complicated than it sounds. So don't expect it to be fixed next week :(
|
That was a measurement mistake (last deleted comment). Is there a way to parse command line arguments? I would like to switch between the different versions dynamically, to avoid such mistakes.
|
You can:

    fn impala_main(do_thing: i32) -> () {
        if do_thing == 42 {
            // ...
        }
    }

    extern "C" void impala_main(int);

    int main(int argc, char** argv) {
        int do_thing = parse_args(argc, argv);
        impala_main(do_thing);
    }
    extern "C" {
        fn strcmp(&[u8], &[u8]) -> i32;
    }

    fn main(argc: i32, argv: &[&[u8]]) -> i32 {
        for i in range(1, argc) {
            if !strcmp(argv(i), "-h") {
                usage();
                return(0)
            } else if !strcmp(argv(i), ...) {
                // ... and so on
            }
        }
    }
|
Some of our tests in Impala use argc and argv, for example nbody. I'm going to have a look at your changes tomorrow.
|
I had a look at the code; first a minor remark: …
Looking at execution times: …
So it seems something is going wrong here. … shows the problem: …
I can push these changes to your fork if you give me the permissions.
|
I've pushed my changes. Maybe also good to know: …
|
Enabling fast-math gives: …
|
It is currently tailored to 2D box stencils, but can be generalized.