With:
- Transformers: 7d4b3dd
- huggingface/accelerate@78b8126
- pytorch/pytorch@4e4b859 (torch 2.7 candidate)
On:
- 2-card Intel(R) Data Center GPU Max 1550 (aka PVC); note: each card has 2 tiles, so in total 4 torch devices are available (see the sanity check below)
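For reference, a minimal sketch to confirm the device topology described above (assumes a PyTorch build with the XPU backend enabled):

```python
import torch

# On the 2-card PVC system above, each card exposes its 2 tiles as
# separate devices, so torch should report 4 XPU devices in total.
assert torch.xpu.is_available()
print(torch.xpu.device_count())  # expected: 4 on this setup
```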
`test_model_parallel_beam_search` tests for the following models fail with `IndexError: list index out of range`:
```
$ cat spec.py
import torch

DEVICE_NAME = 'xpu'
MANUAL_SEED_FN = torch.xpu.manual_seed
EMPTY_CACHE_FN = torch.xpu.empty_cache
DEVICE_COUNT_FN = torch.xpu.device_count
```
```
$ TRANSFORMERS_TEST_DEVICE_SPEC=spec.py python3 -m pytest -k test_model_parallel_beam_search \
    tests/models/data2vec \
    tests/models/roberta \
    tests/models/roberta_prelayernorm \
    tests/models/xlm_roberta_xl
...
FAILED tests/models/data2vec/test_modeling_data2vec_text.py::Data2VecTextModelTest::test_model_parallel_beam_search - IndexError: list index out of range
FAILED tests/models/roberta/test_modeling_roberta.py::RobertaModelTest::test_model_parallel_beam_search - IndexError: list index out of range
FAILED tests/models/roberta_prelayernorm/test_modeling_roberta_prelayernorm.py::RobertaPreLayerNormModelTest::test_model_parallel_beam_search - IndexError: list index out of range
FAILED tests/models/xlm_roberta_xl/test_modeling_xlm_roberta_xl.py::XLMRobertaXLModelTest::test_model_parallel_beam_search - IndexError: list index out of range
```
The failures are similar in all cases. Here is the full log for one of them:
```
$ TRANSFORMERS_TEST_DEVICE_SPEC=spec.py python3 -m pytest tests/models/data2vec/test_modeling_data2vec_text.py::Data2VecTextModelTest::test_model_parallel_beam_search
======================================= test session starts ========================================
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.5.0
rootdir: /home/dvrogozh/git/huggingface/transformers
configfile: pyproject.toml
plugins: anyio-4.8.0, rich-0.2.0, subtests-0.14.1, xdist-3.6.1, asyncio-0.23.8, timeout-2.3.1, hypothesis-6.122.3, reportlog-0.4.0, dash-2.18.2
asyncio: mode=strict
collected 1 item

tests/models/data2vec/test_modeling_data2vec_text.py F                                       [100%]

============================================= FAILURES =============================================
______________________ Data2VecTextModelTest.test_model_parallel_beam_search _______________________

self = <tests.models.data2vec.test_modeling_data2vec_text.Data2VecTextModelTest testMethod=test_model_parallel_beam_search>

    @require_accelerate
    @require_torch_multi_accelerator
    @pytest.mark.generate
    def test_model_parallel_beam_search(self):
        if "xpu" in torch_device:
            if not (is_ipex_available("2.5") or version.parse(torch.__version__) >= version.parse("2.6")):
                self.skipTest(reason="device_map='auto' does not work with XPU devices")

        for model_class in self.all_generative_model_classes:
            if model_class._no_split_modules is None:
                continue

            config, inputs_dict = self.prepare_config_and_inputs_for_generate()
            model = model_class(config).eval()
            with tempfile.TemporaryDirectory() as tmp_dir:
                model.cpu().save_pretrained(tmp_dir)
>               new_model = model_class.from_pretrained(tmp_dir, device_map="auto")

tests/generation/test_utils.py:693:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/transformers/modeling_utils.py:4195: in from_pretrained
    device_map = infer_auto_device_map(model, dtype=target_dtype, **device_map_kwargs)
../accelerate/src/accelerate/utils/modeling.py:1368: in infer_auto_device_map
    module_size_with_ties, tied_module_names, tied_modules = get_module_size_with_ties(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

tied_params = ['lm_head.decoder.bias'], module_size = 396
module_sizes = defaultdict(<class 'int'>, {'': 147892, 'data2vec_text': 143016, 'data2vec_text.embeddings': 88704, 'data2vec_text.emb...sition_ids': 4096, 'data2vec_text.embeddings.token_type_ids': 4096, 'lm_head.decoder': 0, 'lm_head.decoder.weight': 0})
modules_to_treat = [('lm_head.dense', Linear(in_features=32, out_features=32, bias=True)), ('lm_head.layer_norm', LayerNorm((32,), eps=1e-12, elementwise_affine=True))]

    def get_module_size_with_ties(
        tied_params,
        module_size,
        module_sizes,
        modules_to_treat,
    ) -> Tuple[int, List[str], List[nn.Module]]:
        """
        Calculate the total size of a module, including its tied parameters.

        Args:
            tied_params (`List[str]`): The list of tied parameters.
            module_size (`int`): The size of the module without tied parameters.
            module_sizes (`Dict[str, int]`): A dictionary mapping each layer name to its size.
            modules_to_treat (`List[Tuple[str, nn.Module]]`): The list of named modules to treat.

        Returns:
            `Tuple[int, List[str], List[nn.Module]]`: The total size of the module, the names of the tied modules, and the
            tied modules.
        """
        if len(tied_params) < 1:
            return module_size, [], []
        tied_module_names = []
        tied_modules = []

        for tied_param in tied_params:
>           tied_module_index = [i for i, (n, _) in enumerate(modules_to_treat) if tied_param.startswith(n + ".")][0]
E           IndexError: list index out of range

../accelerate/src/accelerate/utils/modeling.py:1129: IndexError
--------------------------------------- Captured stderr call ---------------------------------------
If you want to use `Data2VecTextLMHeadModel` as a standalone, add `is_decoder=True.`
If you want to use `Data2VecTextLMHeadModel` as a standalone, add `is_decoder=True.`
===================================== short test summary info ======================================
FAILED tests/models/data2vec/test_modeling_data2vec_text.py::Data2VecTextModelTest::test_model_parallel_beam_search - IndexError: list index out of range
======================================== 1 failed in 2.65s =========================================
```
Observations:
- Failures are sensitive to the number of GPUs across which `device_map="auto"` splits the model. The issue happens with 4 XPU devices; it does not happen with 2 XPU devices (run with `ZE_AFFINITY_MASK=0,1`).
- This calculation goes wrong:
```python
tied_param = 'lm_head.decoder.bias'
modules_to_treat = [('lm_head.dense', Linear(in_features=32, out_features=32, bias=True)), ('lm_head.layer_norm', LayerNorm((32,), eps=1e-12, elementwise_affine=True))]
# which gives:
[i for i, (n, _) in enumerate(modules_to_treat) if tied_param.startswith(n + ".")] == []
# and taking index `[0]` on the empty list then raises IndexError
```
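The failing lookup can be reproduced in isolation. A minimal sketch (not the actual `accelerate` code path; the module objects are replaced with `None` placeholders) using the values from the log above:

```python
# The tied parameter's owning module ('lm_head.decoder') is absent from
# modules_to_treat, so the comprehension matches nothing and [0] raises.
tied_param = "lm_head.decoder.bias"
modules_to_treat = [("lm_head.dense", None), ("lm_head.layer_norm", None)]

matches = [i for i, (n, _) in enumerate(modules_to_treat) if tied_param.startswith(n + ".")]
print(matches)  # -> []
matches[0]      # -> IndexError: list index out of range, as in the log above
```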