
Replace Inline::Python with a light Python function server#402

Merged
taldcroft merged 30 commits into master from no-inline-python
Jan 6, 2023

Conversation

@taldcroft
Member

@taldcroft taldcroft commented Dec 30, 2022

Description

Driven by problems getting Inline::Python to work with ska3-prime, this PR replaces all the inline Python calls with calls to a new lightweight server process that runs Python function calls. This is basically the idea that @javierggt had for an API server, but it uses an even lighter framework (than e.g. flask) for performance. The server currently gets called at least thousands of times, though there are probably opportunities to reduce that with some code changes.

Current testing status is that it runs to completion on the DEC1922 loads and produces output that looks about right.

Process management for the Python function server (starcheck/server.py) took a little time to work out but it seems to be working, more or less. There might still be room for improvement.

The new server includes reasonable exception handling.
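The request/response shape is not spelled out above, so here is a minimal hypothetical sketch of the pattern: JSON-encoded function calls dispatched against an explicit allow-list, with exceptions serialized back to the caller. The names (`make_dispatcher`, `add`) are illustrative only, not the starcheck API.

```python
import json
import traceback

def make_dispatcher(api):
    """Return a handler that runs one JSON-encoded function call.

    `api` is an explicit allow-list mapping names to callables, so only
    listed functions can ever be invoked.
    """
    def handle(request_line):
        try:
            req = json.loads(request_line)
            func = api[req["func"]]  # KeyError for any unlisted function
            result = func(*req.get("args", []), **req.get("kwargs", {}))
            return json.dumps({"result": result, "exception": None})
        except Exception:
            # "Reasonable exception handling": ship the traceback back to
            # the Perl client instead of crashing the server.
            return json.dumps(
                {"result": None, "exception": traceback.format_exc()})
    return handle

# Toy "public API" standing in for the starcheck functions
handle = make_dispatcher({"add": lambda a, b: a + b})
print(handle('{"func": "add", "args": [1, 2]}'))
```

A real server would wrap this handler in a socket loop; the dispatch and error-marshalling shape is the point here.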

To do:

  • Proper logging instead of print statements
  • Improve performance
  • Get a free port from the system in Perl then pass that to the Python server
  • Secure communications by using a random key that is shared between client and server
  • Improve the get_fid_actions sub in parse CM. I think this might be broken for running loads in the far past.
  • Testing
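The free-port and shared-key items in the list above can be sketched with standard-library calls. This shows only the underlying mechanics; in the PR the port and key end up being handled on the Perl side.

```python
import secrets
import socket

def get_free_port():
    """Bind to port 0 so the OS hands back an unused ephemeral port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]

# Random shared key generated at startup and required on every request
key = secrets.token_hex(16)
port = get_free_port()
```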

Interface impacts

Testing

Unit tests

  • No unit tests

Functional tests

Output comparison

Ran starcheck for DEC1922A loads in this branch and in master:

$ ./sandbox_starcheck -dir $SKA/data/mpcrit1/mplogs/2022/DEC1922/oflsa -run_start_time=2022:352:00:00:00 -out out_{pyserver,master}

Then:

(ska3) ➜  starcheck git:(no-inline-python) diff out_pyserver.txt out_master.txt
1,2c1,2
<  ------------  Starcheck 13.17.1.dev21+gd6087bd.d20230101    -----------------
<  Run on Sun Jan  1 06:45:26 EST 2023 by aldcroft from daze
---
>  ------------  Starcheck 13.17.1.dev9+g6057b15    -----------------
>  Run on Sat Dec 31 10:35:06 EST 2022 by aldcroft from daze

On ska3-flight HEAD linux

  • Confirmed that ctrl-c of the sandbox starcheck script results in all Perl and Python processes going away.
  • kill -9 of the calling Perl did not kill the starcheck.server Python process, but it times out appropriately.
  • Did a functional test with dither manually set to run into the load and not match kadi; it completed successfully with only the kadi initial state used for propagation and two warnings recorded in starcheck.txt.
  • Confirmed -help works.
  • Confirmed -verbose 2 and 3 show different levels of output.
  • Confirmed -max_obsids stops processing after N obsids without other side effects.

Functional and regression tests

Ran these in https://icxc.cfa.harvard.edu/aspect/test_review_outputs/starcheck-pr402/

# Long week and IR Zone holds
starcheck -dir /data/mpcrit1/mplogs/2022/DEC2622/oflsa -out dec2622a_flight
/home/jeanconn/git/starcheck_noinline/sandbox_starcheck -dir /data/mpcrit1/mplogs/2022/DEC2622/oflsa -out dec2622a_test
/proj/sot/ska/bin/diff2html dec2622a_flight.txt dec2622a_test.txt > dec2622a_diff.html

# Monitor star
starcheck -dir /data/mpcrit1/mplogs/2022/NOV2822/oflsa -out nov2822a_flight
/home/jeanconn/git/starcheck_noinline/sandbox_starcheck -dir /data/mpcrit1/mplogs/2022/NOV2822/oflsa -out nov2822a_test
/proj/sot/ska/bin/diff2html nov2822a_flight.txt nov2822a_test.txt > nov2822a_diff.html

# Maneuver only loads
starcheck -dir /data/mpcrit1/mplogs/2022/OCT2422/oflsb -out oct2422b_flight
/home/jeanconn/git/starcheck_noinline/sandbox_starcheck -dir /data/mpcrit1/mplogs/2022/OCT2422/oflsb -out oct2422b_test
/proj/sot/ska/bin/diff2html oct2422b_flight.txt oct2422b_test.txt > oct2422b_diff.html

# Replan
starcheck -dir /data/mpcrit1/mplogs/2021/MAY0521/oflsa -out may0521a_flight
/home/jeanconn/git/starcheck_noinline/sandbox_starcheck -dir /data/mpcrit1/mplogs/2021/MAY0521/oflsa -out may0521a_test
/proj/sot/ska/bin/diff2html may0521a_flight.txt may0521a_test.txt > may0521a_diff.html

To functionally test the change in the bad pixel check, I introduced a bad pixel near a guide star and compared outputs vs master.

https://icxc.cfa.harvard.edu/aspect/test_review_outputs/starcheck-pr402/badpixcheck/jan0923_master_badpix.html#obsid25550

https://icxc.cfa.harvard.edu/aspect/test_review_outputs/starcheck-pr402/badpixcheck/jan0923_noinline_badpix.html#obsid25550

Check is still functional.

Run of the full regression test set revealed a latent issue with #389, fixed in 5be93b4. I compared this PR code to master plus that same fix, and the regression outputs show only two tiny, understood, and acceptable changes related to the PR update to the bad pixel test.

********************* 2005/JUL1105/oflsc ********************
********************* 2005/AUG2705/oflsb ********************
********************* 2005/NOV0705/oflsb ********************
********************* 2005/MAR0705/oflsb ********************
********************* 2006/MAR0606/oflsc ********************
********************* 2006/NOV1306/oflsa ********************
********************* 2006/AUG0706/oflsb ********************
********************* 2006/DEC2506/oflsc ********************
********************* 2006/MAR0606/oflsc ********************
********************* 2006/NOV2006/oflsb ********************
********************* 2007/MAR0507/oflsa ********************
********************* 2007/AUG0607/oflsa ********************
********************* 2007/AUG1407/oflsa ********************
********************* 2007/DEC1007/oflsb ********************
********************* 2007/MAR0507/oflsa ********************
********************* 2007/SEP0307/oflsa ********************
********************* 2008/JUL0708/oflsb ********************
********************* 2008/MAY0508/oflsa ********************
********************* 2008/AUG1808/oflsa ********************
********************* 2008/SEP0108/oflsb ********************
********************* 2008/SEP2908/oflsb ********************
********************* 2009/APR2009/oflsc ********************
********************* 2009/FEB1609/oflsa ********************
********************* 2009/FEB2309/oflsc ********************
********************* 2009/JUL0609/oflsb ********************
********************* 2009/JUN2209/oflsb ********************
********************* 2009/NOV3009/oflsb ********************
--- /home/jeanconn/git/starcheck_noinline/test_regress/release/2009/NOV3009/oflsb/starcheck.txt	2023-01-05 17:46:27.310431000 -0500
+++ /home/jeanconn/git/starcheck_noinline/test_regress/fido.cfa.harvard.edu_48d754734f76c6d488de6d18f96188ba4380c1e4/2009/NOV3009/oflsb/starcheck.txt	2023-01-05 23:10:38.313098000 -0500
@@ -81,7 +81,7 @@
 OBSID = 11935 at 2009:339:14:45:41.729   7.5 ACQ | 5.0 GUI | Critical:11 Caution:16
 OBSID = 11936 at 2009:339:15:53:45.926   7.6 ACQ | 5.0 GUI | Critical:10 Caution: 9
 OBSID = 11937 at 2009:339:16:35:42.174   7.6 ACQ | 5.0 GUI | Critical: 9 Caution: 8
-OBSID = 11938 at 2009:339:17:00:52.174   6.5 ACQ | 5.0 GUI | Critical:11 Caution:12
+OBSID = 11938 at 2009:339:17:00:52.174   6.5 ACQ | 5.0 GUI | Critical:10 Caution:12
 OBSID = 11939 at 2009:339:17:26:02.174   7.7 ACQ | 5.0 GUI | Critical:10 Caution: 9
 OBSID = 12037 at 2009:339:17:51:13.756   7.3 ACQ | 5.0 GUI | Critical:11 Caution:11
          ------  2009:340:03:45:00.000   OBC Load Segment Begins     CL340:0304 
@@ -1905,7 +1905,6 @@
 >> CRITICAL: [ 4] Readout Size. 6x6 Should be 8x8
 >> CRITICAL: [ 5] Readout Size. 6x6 Should be 8x8
 >> CRITICAL: [ 6] Readout Size. 6x6 Should be 8x8
->> CRITICAL: [ 7] Nearby ACA bad pixel.  row, col (-319, -299), dy, dz (5, 44) 
 >> CRITICAL: [ 7] Readout Size. 6x6 Should be 8x8
 >> CRITICAL: [ 8] Readout Size. 6x6 Should be 8x8
 >> CRITICAL: [ 9] Readout Size. 6x6 Should be 8x8
********************* 2009/OCT0509/oflsa ********************
********************* 2009/DEC2109/oflsb ********************
********************* 2010/APR1110/oflsb ********************
********************* 2010/APR1210/oflsa ********************
********************* 2010/JAN1110/oflsa ********************
********************* 2010/JUL0510/oflsb ********************
********************* 2010/OCT1110/oflsb ********************
********************* 2010/OCT2510/oflsb ********************
********************* 2011/JAN1711/oflsa ********************
********************* 2011/MAR1411/oflsa ********************
********************* 2011/APR0411/oflsa ********************
********************* 2011/DEC1211/oflsa ********************
********************* 2012/JAN3012/oflsa ********************
********************* 2013/JUL2913/oflsa ********************
********************* 2014/JAN2514/oflsa ********************
********************* 2015/JAN1215/oflsb ********************
--- /home/jeanconn/git/starcheck_noinline/test_regress/release/2015/JAN1215/oflsb/starcheck.txt	2023-01-05 18:45:24.075190000 -0500
+++ /home/jeanconn/git/starcheck_noinline/test_regress/fido.cfa.harvard.edu_48d754734f76c6d488de6d18f96188ba4380c1e4/2015/JAN1215/oflsb/starcheck.txt	2023-01-05 23:52:48.696908000 -0500
@@ -1063,7 +1063,7 @@
 [ 7]  6   185210664   BOT  8x8   0.955   9.205  10.703   1782    555  20   1  120          
 [ 8]  7   185339360   BOT  8x8   0.973   8.591  10.094   -407  -2098  20   1  120          
 
->> CRITICAL: [ 6] Nearby ACA bad pixel.  row, col (-318, -296), dy, dz (2, 25) 
+>> CRITICAL: [ 6] Nearby ACA bad pixel.  row, col (-317, -296), dy, dz (2, 25) 
 >> CAUTION : [1] Guide sum mag diff from agasc mag   0.08276
 >> CAUTION : [2] Guide sum mag diff from agasc mag   0.28450
 >> CAUTION : [3] Guide sum mag diff from agasc mag   0.02349
********************* 2015/DEC1115/oflsa ********************
********************* 2015/DEC1115/oflsb ********************

Performance

This branch runs about 50% slower. For DEC1922A this means about 90 seconds in this branch vs. 60 seconds in master.

@taldcroft taldcroft requested a review from jeanconn December 30, 2022 15:10
@jeanconn
Contributor

Great and quick idea. It seems to me that the concerns are speed and security. Speed seems reasonable. For security, I like that it will "only allow functions in the public API of starcheck module". Do we also need to limit access to the server with a key or by obscurity? And/or run the Python in a sandbox or container, or as an unprivileged user?

@taldcroft
Member Author

taldcroft commented Dec 30, 2022

Maybe I'm being naive, but I think an attack would need this:

  • Be on the HEAD network on the same machine.
  • Discover the random port during the few minutes a week that starcheck is running.
  • Find vulnerabilities in the exposed functions that allow doing something malicious and worthwhile to an attacker. The only functions that are exposed are those within the flight starcheck package.
  • The only credible thing I can think of is using all the memory and making the machine crash. I don't think there is any way to run arbitrary code by calling one of those functions.
  • However, it is a fair point that maybe we should make an explicit allow-list of functions just to be sure. I'm not immediately sure which symbols from starcheck.utils are exposed.

It also just occurred to me that the starcheck script could generate a random key, pass it to the server on creation, and then require that key for commands, which should pretty much lock this down.

@jeanconn
Contributor

jeanconn commented Dec 30, 2022

Yeah, that's what I meant by "key" as the first choice. It wouldn't need to be SSL (it would just be a startup-generated "password"). And you'd add obscurity by using an available port instead of the hardcoded one (which you mention as the third item in the to-do list anyway).

@jeanconn
Contributor

I also wasn't sure about IPC vs TCP for this kind of app.

@jeanconn
Contributor

And despite my "It wouldn't need to be SSL" comment, I don't know how much slower this would be if it used standard key security via SSL/TLS.

@taldcroft
Member Author

OK, the communication between client and server is now secured by a shared random key that is generated by starcheck.pl at the beginning. I think this addresses any credible security issues.

Also the port is now randomly selected by the OS.
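On the server side, the key check could look like this sketch (names are illustrative, not the PR's code; `secrets.compare_digest` gives a constant-time comparison):

```python
import secrets

# In the PR the key comes from starcheck.pl at server startup; it is
# generated here only for illustration.
startup_key = secrets.token_hex(16)

def authorized(request_key):
    """Constant-time comparison of the key sent with each command."""
    return secrets.compare_digest(request_key, startup_key)
```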

@taldcroft
Member Author

@jeanconn - apart from details like logging, this is ready for real review. In order to improve performance I made some code changes that you need to review carefully.

Along the way I reformatted sections of code that I was touching since I just couldn't see the logic otherwise. It might just be worth running perltidy at some point.

@taldcroft taldcroft changed the title WIP: Replace Inline::Python with a light Python function server Replace Inline::Python with a light Python function server Jan 1, 2023
@jeanconn jeanconn mentioned this pull request Jan 3, 2023
foreach my $pixel (@bad_pixels) {
my $dy = abs($yag-$pixel->{yag});
my $dz = abs($zag-$pixel->{zag});
my $dy = abs($pixel_row-$pixel->{row}) * 5;
Contributor

@jeanconn jeanconn Jan 3, 2023

For this comparison, was the previous logic incorrect? We convert the bad pixels to yag, zag and confirm that the yag, zag of those pixels is not within dither + 25 arcsec of the star position. Multiplying the pixel values by 5 is not that close to the previous values. I'm wondering if it would make sense to leave this alone for this release and then either update to match the proseco logic (work in pixel space instead) or put that bad pixel check in sparkles (to be integrated in starcheck #385).

Member Author

The previous logic was correct and I believe the new logic is also correct. Note that the scale factor is being applied to a delta row/col, so the worst case change is likely < 0.05 arcsec on a check of 16 + 25 arcsec. From the perspective of actual ACA operational requirements that level of difference is inconsequential. Multiplying by the mean scale factor (around 4.996?) would reduce that even more.

My only concern is if this can cause a mismatch with sparkles. But at first glance it doesn't look like sparkles checks that, so that probably can't happen. And it sounds like you know that proseco works in pixel space, so then this change is already accomplishing a good thing, no?

Contributor

@jeanconn jeanconn Jan 3, 2023

I missed that this was a delta pixel space check in this version and was just seeing pixels * 5. I'm blaming the lack of whitespace around an arithmetic operator (which wasn't a change...).

Contributor

In regression weeks I was surprised to see a change of > 0.5 arcsec. This one (with HRC dither) put us above the tolerance in flight and passing the check in master. I still think this is OK, but it is not consistent with 0.05 arcsec. Also, I think maybe this check should previously have been done with pix_zero_loc='center' in yagzag_to_pixels for the star positions? But still in the noise overall, I think.

In [17]: from chandra_aca.transform import (pixels_to_yagzag, yagzag_to_pixels)

In [18]: star_yag, star_zag = (1617.65, -1557.65)

In [19]: pix_row, pix_col = (-319, -299)

In [20]: star_row, star_col = yagzag_to_pixels(star_yag, star_zag)

In [21]: pix_yag, pix_zag = pixels_to_yagzag(pix_row, pix_col)

In [22]: pix_zag - star_zag
Out[22]: 44.61728930771119

In [23]: (pix_col - star_col) * 5
Out[23]: 45.14387887280691

Contributor

So an open question for you @taldcroft . Does this match your expectations and look OK to you?

Member Author

Interesting. So I am updating my expectations of variations in the local plate scale based on this:

In [9]: y1, z1 = pixels_to_yagzag(-300, -300)
In [10]: y2, z2 = pixels_to_yagzag(-300, -310)
In [11]: (z1 - z2) / 10
Out[11]: 4.9470113374697124

So what matters to me really is whether starcheck can fail a catalog that proseco selected as OK. I'm not quite sure off hand. @jeanconn maybe you have been looking at the code and the threshold that proseco applies?

At the end of the day there is a fundamental point that bad pixels are defined in pixel space and stars are located in yag/zag space, so no matter how you define the distance metric there will be changes depending on the local plate scale.
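A quick sketch of the arithmetic using the numbers above: the gap between the fixed factor of 5 and the local plate scale of about 4.947 arcsec/pixel, accumulated over the roughly 8-pixel check radius, lands right around the ~0.5 arcsec shifts seen in the regression diffs.

```python
# Plain arithmetic, no chandra_aca needed; the 4.9470 local plate scale
# comes from the In [9]-[11] numbers above.
nominal = 5.0                  # fixed arcsec/pixel factor in the check
local = 4.9470                 # local plate scale near (-300, -300)
check_radius_arcsec = 16 + 25  # dither + 25 arcsec margin
check_radius_pix = check_radius_arcsec / nominal

worst_case_shift = (nominal - local) * check_radius_pix
print(f"{worst_case_shift:.2f} arcsec")  # -> 0.43 arcsec
```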

Contributor

Right. With regard to bad pixels defined in pixel space and stars in yag, zag space, agreed. I was just thinking that the converter mentions that 'center' is often more what you have in mind, and for this check, which is based on "star distance to a pixel", the distance to pixel center is probably more right.

But to your point about how consistent this is with proseco, the guide star exclusion for bad pixels is a little opaque (sorry) https://github.com/sot/proseco/blob/7a75ae329a3fde4070fbe808633db2abdcf19344/proseco/guide.py#L1330 . The candidate star positions are converted to row/col with edge https://github.com/sot/proseco/blob/master/proseco/core.py#L1253 . It looks to me like we're adding 4 to 5 pixels padding to define the exclusion range, so I would think this could be 1 pixel less conservative than the starcheck check for some cases.

So worst case would be that the starcheck check would give us a red warning and we'd need to double-check this. I was thinking this should probably go as-is and we can re-examine the tolerances in the future if they cause issues.

Member Author

I was thinking this should probably go as-is and we can re-examine the tolerances in the future if they cause issues.

YES!

Comment thread starcheck/src/starcheck.pl
Comment thread starcheck/src/starcheck.pl Outdated
Comment thread starcheck/utils.py
Comment thread starcheck/src/starcheck.pl Outdated
Comment thread starcheck/utils.py
@jeanconn
Contributor

jeanconn commented Jan 4, 2023

I think this is really great, but common usage for me with starcheck has been to "ctrl-c" all the time when I realize I've run with the wrong options or whatever. I just had 8 of these Python servers going. So it seems like the Python server needs to check regularly if the parent process is alive and also have some kind of reasonable timeout. And/or the Perl could handle this signal and do the cleanup?
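One way the server could watch the parent, sketched with stdlib calls; the parent pid would be passed in at startup. This is a hypothetical shape, not the PR's actual code.

```python
import os
import threading
import time

def watch_parent(parent_pid, shutdown, poll_s=2.0):
    """Set the shutdown event once the parent process is gone."""
    while not shutdown.is_set():
        try:
            os.kill(parent_pid, 0)  # signal 0: existence probe, sends nothing
        except ProcessLookupError:
            shutdown.set()          # parent died (ctrl-c, kill -9, ...)
            return
        time.sleep(poll_s)

shutdown = threading.Event()
threading.Thread(
    target=watch_parent, args=(os.getppid(), shutdown), daemon=True
).start()
# The serving loop would also use a socket timeout and check `shutdown`.
```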

Comment thread starcheck/src/starcheck.pl Outdated
@jeanconn
Contributor

jeanconn commented Jan 5, 2023

For the action item of " Proper logging not print statements" it is hard to know what makes sense. The Perl side is a mess of print statements. The new python code could have better logging, but would we really want to use utils.config_logging for this? One can't use it to start logging the server startup, for example, because one needs to start up the Python server to do things like configure a logger. I'm not sure how much value it would add here.

Use Dumper instead of Dump because at least we know where it came from.
@taldcroft
Member Author

About logging, just whatever fits with the existing standards for starcheck is fine. I didn't really look at what the rest of the Perl code is doing. For the Python code you could just make a logger with ska_helpers.logger.basic_logger, with a default log level of INFO, and then make all the statements logger.debug(...). The main point is to have code in there that makes it easy to debug in a dev environment if something goes wrong. In theory you could have the main command-line logging level control this, but it's probably not worth it initially.
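A stdlib-only sketch of that suggestion (the exact ska_helpers.logger.basic_logger signature is not shown here, so this sticks to the standard logging module):

```python
import logging

logger = logging.getLogger("starcheck.server")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # default INFO; switch to DEBUG in a dev env

logger.debug("handling call %s", "some_func")  # hidden at the INFO level
logger.info("server ready")
```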

@javierggt
Contributor

javierggt commented Jan 5, 2023

I am not going into the details of this PR, just following the comments, and I wanted to say that I just used it with this week's loads without trouble. And by that I mean:

  • I ran both on HEAD and on Mac OS
  • I could see the logging output for this branch (i.e.: I ran the right version)
  • In both cases starcheck.txt matched the current flight version.

Comment thread starcheck/server.py
@jeanconn
Contributor

jeanconn commented Jan 5, 2023

My testing is not done, but I've updated the PR with some linux functional and regression test notes.

@taldcroft
Member Author

The testing looks quite thorough, thanks!

@taldcroft
Member Author

About the decodes, good catch. Did you search for any outstanding decode() calls?

Was that code actually being run? I would assume that it would fail with an exception always if it ran.

@jeanconn
Contributor

jeanconn commented Jan 6, 2023

Yes, I searched for other decodes and found none. The characteristics date code with the decode that was hanging around is only called for some historical schedules (so I fixed it so more regression weeks would run to completion). We could perhaps cut it at this point and define a will-run-only-on-schedules-after date for starcheck and document. But I figured not for this release.
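For context on why such a leftover decode() always fails with an exception: in Python 3 only bytes has a decode() method, so any code path that reaches a .decode() on a str raises immediately (the string below is just an illustration):

```python
text = "JAN0101"   # already a str after deserialization
try:
    text.decode()  # fine in Python 2, AttributeError in Python 3
except AttributeError as err:
    print(err)

assert b"JAN0101".decode() == "JAN0101"  # decoding bytes still works
```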

Contributor

@jeanconn jeanconn left a comment

This looks good to me and I'm done with my testing.

$obs{$oflsid}->add_guide_summ($oflsid, \%guidesumm);
}
else {
my $obsid = $obs{$oflsid}->{obsid};
Contributor

I probably should have left this line (the print warning statement is likely going to also spit out an "undefined $obsid" warning), but I'll fix it in a future PR.

@jeanconn
Contributor

jeanconn commented Jan 6, 2023

As noted above, I just found a tiny defect in a change of mine, but I still think this is fine to go. I just don't want to merge it myself, since some of this is my code with only (recorded) self-review.
