Replace Inline::Python with a light Python function server #402
Conversation
Great and quick idea. It seems to me that the concerns are speed and security. Speed seems reasonable. For security, I like that it "only allow functions in the public API of starcheck module". Do we also need to limit access to the server with a key or by obscurity? And/or run the Python in a sandbox or container, or as an unprivileged user?
Maybe I'm being naive, but I think an attack would need this:

It also just occurred to me that the starcheck script could generate a random key, pass that to the server on creation, and then require that key for commands, so that might pretty much make this lock-tight.
Yeah, that's what I meant by "key" as the first choice. It wouldn't need to be SSL (it would just be a startup-generated "password"). And you'd add obscurity by using an available port instead of the hardcoded one (which you mention as the third box in the to-do items anyway).
I also wasn't sure about IPC vs. TCP for this kind of app.

And despite my "It wouldn't need to be SSL" comment, I don't know how much slower this would be if it used standard key security via SSL/TLS.
OK, the communication between client and server is now secured by a shared random key generated at startup. Also, the port is now randomly selected by the OS.
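For the record, a minimal sketch of those two ideas working together — a startup-generated shared key plus an OS-assigned port. The environment variable name, JSON wire format, and all identifiers here are illustrative, not the actual starcheck code:

```python
import json
import os
import secrets
import socket

# Shared secret: in this sketch the parent passes it in via an environment
# variable (generating it here and reporting it on stdout would also work).
KEY = os.environ.get("STARCHECK_SERVER_KEY") or secrets.token_hex(16)

# Bind to port 0 so the OS assigns a free port, then report it to the parent.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("127.0.0.1", 0))
sock.listen(1)
print(sock.getsockname()[1], flush=True)

while True:
    conn, _addr = sock.accept()
    with conn:
        request = json.loads(conn.makefile("rb").readline())
        if request.get("key") != KEY:
            conn.sendall(b'{"error": "bad key"}\n')
            continue
        # ... dispatch request["func"] with request["args"] here ...
```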
@jeanconn - apart from details like logging, this is ready for real review. In order to improve performance I made some code changes that you need to review carefully. Along the way I reformatted sections of code that I was touching, since I just couldn't see the logic otherwise. It might just be worth running
```diff
 foreach my $pixel (@bad_pixels) {
-    my $dy = abs($yag-$pixel->{yag});
-    my $dz = abs($zag-$pixel->{zag});
+    my $dy = abs($pixel_row-$pixel->{row}) * 5;
```
For this comparison, was the previous logic incorrect? We convert the bad pixels to yag, zag and confirm the yag, zag of those pixels is not within dither + 25 arcsec of the star position. Multiplying the pixel values by 5 is not that close to the previous values. I'm wondering if it would make sense to leave this alone for this release and then either update to match the proseco logic (work in pixel space instead) or put that bad pixel check in sparkles (to be integrated in starcheck, #385).
The previous logic was correct and I believe the new logic is also correct. Note that the scale factor is being applied to a delta row/col, so the worst case change is likely < 0.05 arcsec on a check of 16 + 25 arcsec. From the perspective of actual ACA operational requirements that level of difference is inconsequential. Multiplying by the mean scale factor (around 4.996?) would reduce that even more.
My only concern is if this can cause a mismatch with sparkles. But at first glance it doesn't look like sparkles checks that, so that probably can't happen. And it sounds like you know that proseco works in pixel space, so then this change is already accomplishing a good thing, no?
I missed that this was a delta pixel space check in this version and was just seeing pixels * 5. I'm blaming the lack of whitespace around an arithmetic operator (which wasn't a change...).
In regression weeks I was surprised to see a change of > 0.5 arcsec. This one (with HRC dither) put us above the tolerance in flight while passing the check in master. I still think this is OK, but it is not consistent with 0.05 arcsec. Also, I think maybe this check should previously have been done with pix_zero_loc='center' when converting the star positions from yagzag to pixels? But still in the noise overall, I think.
```python
In [17]: from chandra_aca.transform import (pixels_to_yagzag, yagzag_to_pixels)

In [18]: star_yag, star_zag = (1617.65, -1557.65)

In [19]: pix_row, pix_col = (-319, -299)

In [20]: star_row, star_col = yagzag_to_pixels(star_yag, star_zag)

In [21]: pix_yag, pix_zag = pixels_to_yagzag(pix_row, pix_col)

In [22]: pix_zag - star_zag
Out[22]: 44.61728930771119

In [23]: (pix_col - star_col) * 5
Out[23]: 45.14387887280691
```
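For comparison, the 'center' variant mentioned above would just pass pix_zero_loc to both converters. This is a sketch only (assuming both converters accept that keyword, as the discussion above suggests), and the numeric results are intentionally not shown:

```python
from chandra_aca.transform import pixels_to_yagzag, yagzag_to_pixels

# Same star and bad pixel as the session above, but with pix_zero_loc="center"
# so positions refer to pixel centers (outputs not reproduced here).
star_yag, star_zag = (1617.65, -1557.65)
pix_row, pix_col = (-319, -299)
star_row, star_col = yagzag_to_pixels(star_yag, star_zag, pix_zero_loc="center")
pix_yag, pix_zag = pixels_to_yagzag(pix_row, pix_col, pix_zero_loc="center")
print(pix_zag - star_zag, (pix_col - star_col) * 5)
```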
So an open question for you, @taldcroft. Does this match your expectations and look OK to you?
Interesting, so I am changing my expectations of variations in the local plate scale with this:
```python
In [9]: y1, z1 = pixels_to_yagzag(-300, -300)

In [10]: y2, z2 = pixels_to_yagzag(-300, -310)

In [11]: (z1 - z2) / 10
Out[11]: 4.9470113374697124
```
So what really matters to me is whether starcheck can fail a catalog that proseco selected as OK. I'm not quite sure offhand. @jeanconn maybe you have been looking at the code and the threshold that proseco applies?
At the end of the day there is a fundamental point that bad pixels are defined in pixel space and stars are located in yag/zag space, so no matter how you define the distance metric there will be changes depending on the local plate scale.
Right. With regard to bad pixels being defined in pixel space and stars in yag/zag space, agreed. I was just thinking that the converter docs mention that 'center' is often more what you have in mind, and for this check, which is based on star distance to a pixel, the distance to the pixel center is probably more right.
But to your point about how consistent this is with proseco, the guide star exclusion for bad pixels is a little opaque (sorry; see https://github.com/sot/proseco/blob/7a75ae329a3fde4070fbe808633db2abdcf19344/proseco/guide.py#L1330). The candidate star positions are converted to row/col with 'edge' (https://github.com/sot/proseco/blob/master/proseco/core.py#L1253). It looks to me like we're adding 4 to 5 pixels of padding to define the exclusion range, so I would think this could be 1 pixel less conservative than the starcheck check in some cases.
So worst case would be that the starcheck check would give us a red warning and we'd need to double-check this. I was thinking this should probably go as-is and we can re-examine the tolerances in the future if they cause issues.
> I was thinking this should probably go as-is and we can re-examine the tolerances in the future if they cause issues.

YES!
I think this is really great, but common usage for me with starcheck has been to Ctrl-C all the time when I realize I've run with the wrong options or whatever. I just had 8 of these Python servers going. So it seems like the Python server needs to check regularly whether the parent process is alive, and also have some kind of reasonable timeout. And/or the Perl could handle this signal and do the cleanup?
For the action item of "Proper logging, not print statements", it is hard to know what makes sense. The Perl side is a mess of print statements. The new Python code could have better logging, but would we really want to use utils.config_logging for this? One can't use it to log the server startup, for example, because one needs to start up the Python server before doing things like configuring a logger. I'm not sure how much value it would add here.
Use Dumper instead of Dump because at least we know where it came from.
About logging, just whatever fits with the existing standards for starcheck is fine. I didn't really look at what the rest of the Perl code is doing. For the Python code you could just make a logger with
I am not going into the details of this PR, just following the comments, and I wanted to say that I just used it with this week's loads without trouble. And by that I mean:
My testing is not done, but I've updated the PR with some Linux functional and regression test notes.
The testing looks quite thorough, thanks!
About the decode: was that code actually being run? I would assume that it would always fail with an exception if it ran.
Yes, I searched for other decodes and found none. The characteristics-date code with the decode that was hanging around is only called for some historical schedules (so I fixed it so that more regression weeks would run to completion). We could perhaps cut it at this point, define a will-run-only-on-schedules-after date for starcheck, and document that. But I figured not for this release.
jeanconn left a comment
This looks good to me and I'm done with my testing.
```perl
    $obs{$oflsid}->add_guide_summ($oflsid, \%guidesumm);
}
else {
    my $obsid = $obs{$oflsid}->{obsid};
```
I probably should have left this line (the print warning statement is likely also going to spit out an "undefined $obsid" warning), but I'll fix it in a future PR.
As noted above, I just found a tiny defect in a change of mine, but I still think this is fine to go. I just don't want to merge, since some of this is my code with only (recorded) self-review.
Description
Driven by problems getting Inline::Python to work with ska3-prime, this PR replaces all the inline Python calls with calls to a new lightweight server process that runs Python function calls. This is basically the idea that @javierggt had for an API server, but it uses an even lighter framework (than e.g. flask) for performance. The server currently gets called at least thousands of times, though there are probably opportunities to reduce that with some code changes.

Current testing status is that it runs to completion on the DEC1922 loads and produces output that looks about right.
Process management for the Python function server (starcheck/server.py) took a little time to work out, but it seems to be working, more or less. There might still be room for improvement.

The new server includes reasonable exception handling.
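As a sketch of the restrict-to-public-API idea combined with that exception handling — the request shape and all names here are illustrative, not the actual starcheck/server.py code:

```python
import importlib
import json

def call_function(module_name, request):
    """Dispatch one request like {"func": name, "args": [...], "kwargs": {...}}.

    Only attributes in the module's public API (no leading underscore and
    actually callable) are allowed, which limits what a client can invoke.
    """
    module = importlib.import_module(module_name)
    name = request["func"]
    func = getattr(module, name, None)
    if name.startswith("_") or not callable(func):
        return {"error": f"{name} is not a public function of {module_name}"}
    try:
        result = func(*request.get("args", []), **request.get("kwargs", {}))
        return {"result": result}
    except Exception as exc:  # report errors to the client instead of dying
        return {"error": repr(exc)}
```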
To do:

- get_fid_actions sub in parse CM. I think this might be broken for running loads in the far past.

Interface impacts
Testing
Unit tests
Functional tests
Output comparison
Ran starcheck for DEC1922A loads in this branch and in master:
Then:
On ska3-flight HEAD Linux
Functional and regression tests
Ran these in https://icxc.cfa.harvard.edu/aspect/test_review_outputs/starcheck-pr402/
To functionally test the change in the bad pixel check, I introduced a bad pixel near a guide star and compared outputs vs master.
https://icxc.cfa.harvard.edu/aspect/test_review_outputs/starcheck-pr402/badpixcheck/jan0923_master_badpix.html#obsid25550
https://icxc.cfa.harvard.edu/aspect/test_review_outputs/starcheck-pr402/badpixcheck/jan0923_noinline_badpix.html#obsid25550
The check is still functional.
A run of the full regression test set revealed a latent issue with #389, fixed in 5be93b4. I compared this PR code to master plus that same fix, and the regression outputs show only two tiny, understood, and acceptable changes related to the PR update to the bad pixel test.
Performance
This branch runs about 50% slower. For DEC1922A this means about 90 seconds in this branch vs. 60 seconds in master.