With:
- Transformers: 7d4b3dd
- huggingface/accelerate@78b8126
- pytorch/pytorch@4e4b859 (torch 2.7 candidate)
On:
- 2-card Intel(R) Data Center GPU Max 1550 (aka PVC); note: each card has 2 tiles, so in total 4 torch devices are available (see the sanity check below)
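For reference, a minimal sketch to confirm the device topology described above (assumes a PyTorch build with the XPU backend enabled):

```python
import torch

# On the 2-card PVC system above, each card exposes its 2 tiles as
# separate devices, so torch should report 4 XPU devices in total.
assert torch.xpu.is_available()
print(torch.xpu.device_count())  # expected: 4 on this setup
```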
`test_model_parallel_beam_search` tests for the following models fail with `IndexError: list index out of range`:
```
$ cat spec.py
import torch

DEVICE_NAME = 'xpu'
MANUAL_SEED_FN = torch.xpu.manual_seed
EMPTY_CACHE_FN = torch.xpu.empty_cache
DEVICE_COUNT_FN = torch.xpu.device_count
```
```
$ TRANSFORMERS_TEST_DEVICE_SPEC=spec.py python3 -m pytest -k test_model_parallel_beam_search \
    tests/models/data2vec \
    tests/models/roberta \
    tests/models/roberta_prelayernorm \
    tests/models/xlm_roberta_xl
...
FAILED tests/models/data2vec/test_modeling_data2vec_text.py::Data2VecTextModelTest::test_model_parallel_beam_search - IndexError: list index out of range
FAILED tests/models/roberta/test_modeling_roberta.py::RobertaModelTest::test_model_parallel_beam_search - IndexError: list index out of range
FAILED tests/models/roberta_prelayernorm/test_modeling_roberta_prelayernorm.py::RobertaPreLayerNormModelTest::test_model_parallel_beam_search - IndexError: list index out of range
FAILED tests/models/xlm_roberta_xl/test_modeling_xlm_roberta_xl.py::XLMRobertaXLModelTest::test_model_parallel_beam_search - IndexError: list index out of range
```
The failures are similar in all cases. Here is the full log for one of them:
```
$ TRANSFORMERS_TEST_DEVICE_SPEC=spec.py python3 -m pytest tests/models/data2vec/test_modeling_data2vec_text.py::Data2VecTextModelTest::test_model_parallel_beam_search
======================================= test session starts ========================================
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.5.0
rootdir: /home/dvrogozh/git/huggingface/transformers
configfile: pyproject.toml
plugins: anyio-4.8.0, rich-0.2.0, subtests-0.14.1, xdist-3.6.1, asyncio-0.23.8, timeout-2.3.1, hypothesis-6.122.3, reportlog-0.4.0, dash-2.18.2
asyncio: mode=strict
collected 1 item

tests/models/data2vec/test_modeling_data2vec_text.py F                                       [100%]

============================================= FAILURES =============================================
______________________ Data2VecTextModelTest.test_model_parallel_beam_search _______________________

self = <tests.models.data2vec.test_modeling_data2vec_text.Data2VecTextModelTest testMethod=test_model_parallel_beam_search>

    @require_accelerate
    @require_torch_multi_accelerator
    @pytest.mark.generate
    def test_model_parallel_beam_search(self):
        if "xpu" in torch_device:
            if not (is_ipex_available("2.5") or version.parse(torch.__version__) >= version.parse("2.6")):
                self.skipTest(reason="device_map='auto' does not work with XPU devices")

        for model_class in self.all_generative_model_classes:
            if model_class._no_split_modules is None:
                continue

            config, inputs_dict = self.prepare_config_and_inputs_for_generate()
            model = model_class(config).eval()
            with tempfile.TemporaryDirectory() as tmp_dir:
                model.cpu().save_pretrained(tmp_dir)
>               new_model = model_class.from_pretrained(tmp_dir, device_map="auto")

tests/generation/test_utils.py:693:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/transformers/modeling_utils.py:4195: in from_pretrained
    device_map = infer_auto_device_map(model, dtype=target_dtype, **device_map_kwargs)
../accelerate/src/accelerate/utils/modeling.py:1368: in infer_auto_device_map
    module_size_with_ties, tied_module_names, tied_modules = get_module_size_with_ties(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

tied_params = ['lm_head.decoder.bias'], module_size = 396
module_sizes = defaultdict(<class 'int'>, {'': 147892, 'data2vec_text': 143016, 'data2vec_text.embeddings': 88704, 'data2vec_text.emb...sition_ids': 4096, 'data2vec_text.embeddings.token_type_ids': 4096, 'lm_head.decoder': 0, 'lm_head.decoder.weight': 0})
modules_to_treat = [('lm_head.dense', Linear(in_features=32, out_features=32, bias=True)), ('lm_head.layer_norm', LayerNorm((32,), eps=1e-12, elementwise_affine=True))]

    def get_module_size_with_ties(
        tied_params,
        module_size,
        module_sizes,
        modules_to_treat,
    ) -> Tuple[int, List[str], List[nn.Module]]:
        """
        Calculate the total size of a module, including its tied parameters.

        Args:
            tied_params (`List[str]`): The list of tied parameters.
            module_size (`int`): The size of the module without tied parameters.
            module_sizes (`Dict[str, int]`): A dictionary mapping each layer name to its size.
            modules_to_treat (`List[Tuple[str, nn.Module]]`): The list of named modules to treat.

        Returns:
            `Tuple[int, List[str], List[nn.Module]]`: The total size of the module, the names of the tied modules, and the
            tied modules.
        """
        if len(tied_params) < 1:
            return module_size, [], []
        tied_module_names = []
        tied_modules = []

        for tied_param in tied_params:
>           tied_module_index = [i for i, (n, _) in enumerate(modules_to_treat) if tied_param.startswith(n + ".")][0]
E           IndexError: list index out of range

../accelerate/src/accelerate/utils/modeling.py:1129: IndexError
--------------------------------------- Captured stderr call ---------------------------------------
If you want to use `Data2VecTextLMHeadModel` as a standalone, add `is_decoder=True.`
If you want to use `Data2VecTextLMHeadModel` as a standalone, add `is_decoder=True.`
===================================== short test summary info ======================================
FAILED tests/models/data2vec/test_modeling_data2vec_text.py::Data2VecTextModelTest::test_model_parallel_beam_search - IndexError: list index out of range
======================================== 1 failed in 2.65s =========================================
```
Observations:
- Failures are sensitive to the number of GPUs across which `device_map="auto"` splits the model. The issue happens with 4 XPU devices; it does not happen with 2 XPU devices (run with `ZE_AFFINITY_MASK=0,1`).
- This calculation goes wrong:
```python
tied_param = 'lm_head.decoder.bias'
modules_to_treat = [('lm_head.dense', Linear(in_features=32, out_features=32, bias=True)), ('lm_head.layer_norm', LayerNorm((32,), eps=1e-12, elementwise_affine=True))]
# which gives:
[i for i, (n, _) in enumerate(modules_to_treat) if tied_param.startswith(n + ".")] == []
# and taking index `[0]` on the empty list then raises IndexError
```
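The failing lookup can be reproduced in isolation. A minimal sketch (not the actual `accelerate` code path; the module objects are replaced with `None` placeholders) using the values from the log above:

```python
# The tied parameter's owning module ('lm_head.decoder') is absent from
# modules_to_treat, so the comprehension matches nothing and [0] raises.
tied_param = "lm_head.decoder.bias"
modules_to_treat = [("lm_head.dense", None), ("lm_head.layer_norm", None)]

matches = [i for i, (n, _) in enumerate(modules_to_treat) if tied_param.startswith(n + ".")]
print(matches)  # -> []
matches[0]      # -> IndexError: list index out of range, as in the log above
```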