Fixes regressions when running on a cluster with more or fewer than 4 engines #489

Merged
bgrant merged 5 commits into master from bugfix/cluster-size-regression
Jul 4, 2014

@kwmsmith
Contributor

@kwmsmith kwmsmith commented Jul 3, 2014

This is based on @cowlicks's PR #441, rebased on master.

@kwmsmith kwmsmith added the bug label Jul 3, 2014
@kwmsmith kwmsmith added this to the 0.4 milestone Jul 3, 2014
@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling 9b34b3a on bugfix/cluster-size-regression into 02d25eb on master.

@bgrant
Contributor

bgrant commented Jul 3, 2014

Hey @kwmsmith: I get a failure with 11 engines under Python 3:

$ make test
python -m unittest discover
..........................s........................................................................................................ssss.................................................................................................................................E....................................................
======================================================================
ERROR: test_dist_sizes (distarray.tests.test_metadata_utils.TestGridSizes)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./distarray/tests/test_metadata_utils.py", line 225, in test_dist_sizes
    dist = Distribution(self.context, (2, 3, 4), dist=('n', 'b', 'c'))
  File "./distarray/dist/maps.py", line 594, in __new__
    grid_shape = grid_shape or make_grid_shape(shape, dist, len(targets))
  File "./distarray/metadata_utils.py", line 180, in make_grid_shape
    check_grid_shape_postconditions(out_grid_shape, shape, dist, comm_size)
  File "./distarray/metadata_utils.py", line 69, in check_grid_shape_postconditions
    "= %s" % (shape, grid_shape))
ValueError: all(gs <= s for (s, gs) in zip(shape, grid_shape) if s > 0) not satisfied, shape = (2, 3, 4) and grid_shape = (1, 1, 11)

----------------------------------------------------------------------
Ran 316 tests in 21.922s

FAILED (errors=1, skipped=5)
make: *** [test_client] Error 1

@bgrant
Contributor

bgrant commented Jul 3, 2014

Also lots of HubTimeoutErrors for n=1...

@bgrant
Contributor

bgrant commented Jul 3, 2014

Here's the first error:

$ python -m unittest discover -cf
......s..ss.s...s......s........ssssssE
======================================================================
ERROR: setUpClass (distarray.dist.tests.test_distarray.TestSetItemSlicing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./distarray/testing.py", line 180, in setUpClass
  File "/Users/robertgrant/development/venvs/py33/lib/python3.3/site-packages/IPython/parallel/client/client.py", line 496, in __init__
    self._connect(sshserver, ssh_kwargs, timeout)
  File "/Users/robertgrant/development/venvs/py33/lib/python3.3/site-packages/IPython/parallel/client/client.py", line 615, in _connect
    raise error.TimeoutError("Hub connection request timed out")
IPython.parallel.error.TimeoutError: Hub connection request timed out

----------------------------------------------------------------------
Ran 30 tests in 11.289s

FAILED (errors=1, skipped=12)
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "./distarray/dist/cleanup.py", line 27, in cleanup_all
  File "/Users/robertgrant/development/venvs/py33/lib/python3.3/site-packages/IPython/parallel/client/client.py", line 496, in __init__
    self._connect(sshserver, ssh_kwargs, timeout)
  File "/Users/robertgrant/development/venvs/py33/lib/python3.3/site-packages/IPython/parallel/client/client.py", line 615, in _connect
    raise error.TimeoutError("Hub connection request timed out")
IPython.parallel.error.TimeoutError: Hub connection request timed out
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "./distarray/dist/cleanup.py", line 69, in clear_all
  File "/Users/robertgrant/development/venvs/py33/lib/python3.3/site-packages/IPython/parallel/client/client.py", line 496, in __init__
  File "/Users/robertgrant/development/venvs/py33/lib/python3.3/site-packages/IPython/parallel/client/client.py", line 615, in _connect
    raise error.TimeoutError("Hub connection request timed out")
IPython.parallel.error.TimeoutError: Hub connection request timed out

It seems to be the same under Python 2.

@kwmsmith
Contributor Author

kwmsmith commented Jul 3, 2014

@bgrant the all(gs <= s for (s, gs) in zip(shape, grid_shape) if s > 0) not satisfied, shape = (2, 3, 4) and grid_shape = (1, 1, 11) error is a separate issue, to be addressed in a different PR.

Don't know about the hub timeout stuff, looking into it.
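For reference, the postcondition quoted in the error message can be reproduced in isolation. This is only a minimal sketch: the helper name `grid_shape_fits` is hypothetical and is not distarray's actual function; it simply evaluates the expression from the error text against the values in the traceback.

```python
def grid_shape_fits(shape, grid_shape):
    """Evaluate the condition quoted in the ValueError above.

    Hypothetical helper for illustration only; it is not the
    check_grid_shape_postconditions function from distarray.
    """
    # Each non-empty dimension must have at least as many elements
    # as the number of process-grid slots assigned to it.
    return all(gs <= s for (s, gs) in zip(shape, grid_shape) if s > 0)


shape = (2, 3, 4)

# The grid from the traceback: 11 slots on a dimension of length 4.
print(grid_shape_fits(shape, (1, 1, 11)))  # False

# A grid that stays within every dimension's length.
print(grid_shape_fits(shape, (1, 1, 4)))   # True
```

Assuming the grid has to use all 11 engines and the undistributed ('n') dimension gets a grid size of 1, no valid grid exists for this shape: 11 is prime, so all 11 slots would land on a single dimension of length 3 or 4.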

@kwmsmith
Contributor Author

kwmsmith commented Jul 3, 2014

@bgrant I can't reproduce the hub timeout error, but I did fix something in test_metadata_utils.py. Please try again.

@bgrant
Contributor

bgrant commented Jul 3, 2014

$ dacluster restart -n1
2014-07-03 16:29:56.978 [IPClusterStop] Using existing profile dir: '/Users/robertgrant/development/venvs/py33/ipython-config/profile_default'
2014-07-03 16:29:57.018 [IPClusterStop] Stopping cluster [pid=24265] with [signal=2]
2014-07-03 16:29:57.565 [IPClusterStart] Using existing profile dir: '/Users/robertgrant/development/venvs/py33/ipython-config/profile_default'
2014-07-03 16:29:57.573 [IPClusterStart] Starting ipcluster with [daemon=False]
2014-07-03 16:29:57.574 [IPClusterStart] Creating pid file: /Users/robertgrant/development/venvs/py33/ipython-config/profile_default/pid/ipcluster.pid
2014-07-03 16:29:57.574 [IPClusterStart] Starting Controller with LocalControllerLauncher
2014-07-03 16:29:58.575 [IPClusterStart] Starting 1 Engines with MPIEngineSetLauncher
2014-07-03 16:30:28.581 [IPClusterStart] Engines appear to have started successfully
$ python -m unittest discover -cf
......s..ss.s...s......s........ssssssE
======================================================================
ERROR: setUpClass (distarray.dist.tests.test_distarray.TestSetItemSlicing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./distarray/testing.py", line 180, in setUpClass
  File "/Users/robertgrant/development/venvs/py33/lib/python3.3/site-packages/IPython/parallel/client/client.py", line 496, in __init__
  File "/Users/robertgrant/development/venvs/py33/lib/python3.3/site-packages/IPython/parallel/client/client.py", line 615, in _connect
IPython.parallel.error.TimeoutError: Hub connection request timed out

----------------------------------------------------------------------
Ran 30 tests in 11.282s

FAILED (errors=1, skipped=12)
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "./distarray/dist/cleanup.py", line 27, in cleanup_all
  File "/Users/robertgrant/development/venvs/py33/lib/python3.3/site-packages/IPython/parallel/client/client.py", line 496, in __init__
    self._connect(sshserver, ssh_kwargs, timeout)
  File "/Users/robertgrant/development/venvs/py33/lib/python3.3/site-packages/IPython/parallel/client/client.py", line 615, in _connect
    raise error.TimeoutError("Hub connection request timed out")
IPython.parallel.error.TimeoutError: Hub connection request timed out
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "./distarray/dist/cleanup.py", line 69, in clear_all
  File "/Users/robertgrant/development/venvs/py33/lib/python3.3/site-packages/IPython/parallel/client/client.py", line 496, in __init__
  File "/Users/robertgrant/development/venvs/py33/lib/python3.3/site-packages/IPython/parallel/client/client.py", line 615, in _connect
    raise error.TimeoutError("Hub connection request timed out")
IPython.parallel.error.TimeoutError: Hub connection request timed out

@bgrant
Contributor

bgrant commented Jul 3, 2014

Or with -n3:

$ python -m unittest discover -cf
......s...s.......s......s........sssssssE
======================================================================
ERROR: setUpClass (distarray.dist.tests.test_distributed_io.TestDnpyFileIO)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./distarray/dist/tests/test_distributed_io.py", line 43, in setUpClass
  File "./distarray/testing.py", line 180, in setUpClass
  File "/Users/robertgrant/development/venvs/py33/lib/python3.3/site-packages/IPython/parallel/client/client.py", line 496, in __init__
  File "/Users/robertgrant/development/venvs/py33/lib/python3.3/site-packages/IPython/parallel/client/client.py", line 615, in _connect
IPython.parallel.error.TimeoutError: Hub connection request timed out

----------------------------------------------------------------------
Ran 33 tests in 12.985s

FAILED (errors=1, skipped=11)
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "./distarray/dist/cleanup.py", line 27, in cleanup_all
  File "/Users/robertgrant/development/venvs/py33/lib/python3.3/site-packages/IPython/parallel/client/client.py", line 496, in __init__
    self._connect(sshserver, ssh_kwargs, timeout)
  File "/Users/robertgrant/development/venvs/py33/lib/python3.3/site-packages/IPython/parallel/client/client.py", line 615, in _connect
    raise error.TimeoutError("Hub connection request timed out")
IPython.parallel.error.TimeoutError: Hub connection request timed out
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "./distarray/dist/cleanup.py", line 69, in clear_all
  File "/Users/robertgrant/development/venvs/py33/lib/python3.3/site-packages/IPython/parallel/client/client.py", line 496, in __init__
  File "/Users/robertgrant/development/venvs/py33/lib/python3.3/site-packages/IPython/parallel/client/client.py", line 615, in _connect
IPython.parallel.error.TimeoutError: Hub connection request timed out

@bgrant
Contributor

bgrant commented Jul 4, 2014

Okay, this works for me with n >= 4, and apparently for everyone else with 0 < n < 4, so I'm merging.

bgrant added a commit that referenced this pull request Jul 4, 2014
Fixes regressions when running on a cluster with more or fewer than 4 engines
@bgrant bgrant merged commit e493295 into master Jul 4, 2014
@bgrant bgrant deleted the bugfix/cluster-size-regression branch July 4, 2014 18:31
4 participants