update
| @@ -0,0 +1,25 @@ | ||||
| # InternLM Transformers | ||||
|  | ||||
| [English](./README.md) | | ||||
| [简体中文](./README-zh-Hans.md)  | ||||
|  | ||||
| This folder contains the `InternLM` model in transformers format. | ||||
|  | ||||
|  | ||||
| ## Weight Conversion | ||||
|  | ||||
| `convert2hf.py` can convert weights saved during training into the transformers format with a single command. Run the following command in the root directory of the repository: | ||||
|  | ||||
| ```bash | ||||
| python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ./tools/V7_sft.model | ||||
| ``` | ||||
|  | ||||
| The converted model can then be loaded with the `from_pretrained` interface: | ||||
|  | ||||
| ```python | ||||
| >>> from transformers import AutoTokenizer, AutoModel | ||||
| >>> model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda() | ||||
| ``` | ||||
|  | ||||
|  | ||||
| `intern_moss_example.py` shows an example of fine-tuning on the `fnlp/moss-moon-002-sft` dataset with LoRA. | ||||
| @@ -0,0 +1,23 @@ | ||||
| # InternLM Transformers | ||||
|  | ||||
| [English](./README.md) | | ||||
| [简体中文](./README-zh-Hans.md)  | ||||
|  | ||||
| This folder contains the `InternLM` model in transformers format. | ||||
|  | ||||
| ## Weight Conversion | ||||
|  | ||||
| `convert2hf.py` can convert saved training weights into the transformers format with a single command. Execute the command in the root directory of the repository: | ||||
|  | ||||
| ```bash | ||||
| python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ./tools/V7_sft.model | ||||
| ``` | ||||
|  | ||||
| Then, you can load it using the `from_pretrained` interface: | ||||
|  | ||||
| ```python | ||||
| >>> from transformers import AutoTokenizer, AutoModel | ||||
| >>> model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda() | ||||
| ``` | ||||
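|  | ||||
| The conversion script also saves the tokenizer into `hf_ckpt/`, so it can be loaded from the same path. The snippet below is a minimal generation sketch; the prompt text and `max_new_tokens` value are illustrative only: | ||||
|  | ||||
| ```python | ||||
| >>> tokenizer = AutoTokenizer.from_pretrained("hf_ckpt/", trust_remote_code=True) | ||||
| >>> inputs = tokenizer("Hello", return_tensors="pt").to("cuda") | ||||
| >>> output = model.generate(**inputs, max_new_tokens=64) | ||||
| >>> print(tokenizer.decode(output[0], skip_special_tokens=True)) | ||||
| ``` | ||||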
|  | ||||
| `intern_moss_example.py` demonstrates how to use LoRA to fine-tune on the `fnlp/moss-moon-002-sft` dataset. | ||||
| @@ -0,0 +1,120 @@ | ||||
| # coding=utf-8 | ||||
| # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved. | ||||
| # | ||||
| # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX | ||||
| # and OPT implementations in this library. It has been modified from its | ||||
| # original forms to accommodate minor architectural differences compared | ||||
| # to GPT-NeoX and OPT used by the Meta AI team that trained the model. | ||||
| # | ||||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||||
| # you may not use this file except in compliance with the License. | ||||
| # You may obtain a copy of the License at | ||||
| # | ||||
| #     http://www.apache.org/licenses/LICENSE-2.0 | ||||
| # | ||||
| # Unless required by applicable law or agreed to in writing, software | ||||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||
| # See the License for the specific language governing permissions and | ||||
| # limitations under the License. | ||||
| """ InternLM model configuration""" | ||||
|  | ||||
| from transformers.utils import logging | ||||
| from transformers.configuration_utils import PretrainedConfig | ||||
|  | ||||
|  | ||||
| logger = logging.get_logger(__name__) | ||||
|  | ||||
| INTERNLM_PRETRAINED_CONFIG_ARCHIVE_MAP = {} | ||||
|  | ||||
|  | ||||
| class InternLMConfig(PretrainedConfig): | ||||
|     r""" | ||||
|     This is the configuration class to store the configuration of an [`InternLMModel`]. It is used to instantiate an InternLM | ||||
|     model according to the specified arguments, defining the model architecture. Instantiating a configuration with the | ||||
|     defaults will yield a similar configuration to that of the InternLM-7B. | ||||
|  | ||||
|     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the | ||||
|     documentation from [`PretrainedConfig`] for more information. | ||||
|  | ||||
|  | ||||
|     Args: | ||||
|         vocab_size (`int`, *optional*, defaults to 103168): | ||||
|             Vocabulary size of the InternLM model. Defines the number of different tokens that can be represented by the | ||||
|             `inputs_ids` passed when calling [`InternLMModel`] | ||||
|         hidden_size (`int`, *optional*, defaults to 4096): | ||||
|             Dimension of the hidden representations. | ||||
|         intermediate_size (`int`, *optional*, defaults to 11008): | ||||
|             Dimension of the MLP representations. | ||||
|         num_hidden_layers (`int`, *optional*, defaults to 32): | ||||
|             Number of hidden layers in the Transformer encoder. | ||||
|         num_attention_heads (`int`, *optional*, defaults to 32): | ||||
|             Number of attention heads for each attention layer in the Transformer encoder. | ||||
|         hidden_act (`str` or `function`, *optional*, defaults to `"silu"`): | ||||
|             The non-linear activation function (function or string) in the decoder. | ||||
|         max_position_embeddings (`int`, *optional*, defaults to 2048): | ||||
|             The maximum sequence length that this model might ever be used with. Typically set this to something large | ||||
|             just in case (e.g., 512 or 1024 or 2048). | ||||
|         initializer_range (`float`, *optional*, defaults to 0.02): | ||||
|             The standard deviation of the truncated_normal_initializer for initializing all weight matrices. | ||||
|         rms_norm_eps (`float`, *optional*, defaults to 1e-6): | ||||
|             The epsilon used by the rms normalization layers. | ||||
|         use_cache (`bool`, *optional*, defaults to `True`): | ||||
|             Whether or not the model should return the last key/values attentions (not used by all models). Only | ||||
|             relevant if `config.is_decoder=True`. | ||||
|         tie_word_embeddings (`bool`, *optional*, defaults to `False`): | ||||
|             Whether to tie the input and output word embeddings. | ||||
|  | ||||
|     Example: | ||||
|  | ||||
|     ```python | ||||
|     >>> from transformers import InternLMModel, InternLMConfig | ||||
|  | ||||
|     >>> # Initializing an InternLM internlm-7b style configuration | ||||
|     >>> configuration = InternLMConfig() | ||||
|  | ||||
|     >>> # Initializing a model from the internlm-7b style configuration | ||||
|     >>> model = InternLMModel(configuration) | ||||
|  | ||||
|     >>> # Accessing the model configuration | ||||
|     >>> configuration = model.config | ||||
|     ```""" | ||||
|     model_type = "internlm" | ||||
|     _auto_class = "AutoConfig" | ||||
|  | ||||
|     def __init__( | ||||
|         self, | ||||
|         vocab_size=103168, | ||||
|         hidden_size=4096, | ||||
|         intermediate_size=11008, | ||||
|         num_hidden_layers=32, | ||||
|         num_attention_heads=32, | ||||
|         hidden_act="silu", | ||||
|         max_position_embeddings=2048, | ||||
|         initializer_range=0.02, | ||||
|         rms_norm_eps=1e-6, | ||||
|         use_cache=True, | ||||
|         pad_token_id=0, | ||||
|         bos_token_id=1, | ||||
|         eos_token_id=2, | ||||
|         tie_word_embeddings=False, | ||||
|         bias=True, | ||||
|         **kwargs, | ||||
|     ): | ||||
|         self.vocab_size = vocab_size | ||||
|         self.max_position_embeddings = max_position_embeddings | ||||
|         self.hidden_size = hidden_size | ||||
|         self.intermediate_size = intermediate_size | ||||
|         self.num_hidden_layers = num_hidden_layers | ||||
|         self.num_attention_heads = num_attention_heads | ||||
|         self.hidden_act = hidden_act | ||||
|         self.initializer_range = initializer_range | ||||
|         self.rms_norm_eps = rms_norm_eps | ||||
|         self.use_cache = use_cache | ||||
|         self.bias = bias | ||||
|         super().__init__( | ||||
|             pad_token_id=pad_token_id, | ||||
|             bos_token_id=bos_token_id, | ||||
|             eos_token_id=eos_token_id, | ||||
|             tie_word_embeddings=tie_word_embeddings, | ||||
|             **kwargs, | ||||
|         ) | ||||
| @@ -0,0 +1,175 @@ | ||||
| import argparse | ||||
| import math | ||||
| import json | ||||
| import os | ||||
| import re | ||||
| import tempfile | ||||
|  | ||||
| import torch | ||||
| from modeling_internlm import InternLMConfig, InternLMForCausalLM | ||||
| from tokenization_internlm import InternLMTokenizer | ||||
|  | ||||
| NUM_SHARDS = { | ||||
|     "7B": 1, | ||||
| } | ||||
|  | ||||
|  | ||||
| def convert2hf(model_config, states_tp_pps): | ||||
|  | ||||
|     with tempfile.TemporaryDirectory() as folder: | ||||
|         states = merge_pp(states_tp_pps)[0] | ||||
|  | ||||
|         if "embedding.word_embeddings.weight" in states: | ||||
|             embedding_key = "embedding.word_embeddings.weight" | ||||
|         elif "embedding.weight" in states: | ||||
|             embedding_key = "embedding.weight" | ||||
|         else: | ||||
|             print("Could not find the embedding weights. Available state names:", flush=True) | ||||
|             print(list(states.keys()), flush=True) | ||||
|             raise KeyError("unrecognized embedding key in the checkpoint state dict") | ||||
|  | ||||
|         dims_per_head = model_config["hidden_size"] // model_config["num_attention_heads"] | ||||
|         base = 10000.0 | ||||
|         inv_freq = 1.0 / (base ** (torch.arange(0, dims_per_head, 2).float() / dims_per_head)) | ||||
|  | ||||
|         current_states = {} | ||||
|  | ||||
|         current_states["model.embed_tokens.weight"] = states.pop(embedding_key) | ||||
|         current_states["model.norm.weight"] = states.pop("norm.weight") | ||||
|         current_states["lm_head.weight"] = states.pop("head.weight") | ||||
|  | ||||
|         for i in range(model_config["num_layers"]): | ||||
|             states.pop(f"blocks.{i}.mixer.rotary_emb.inv_freq", None) | ||||
|  | ||||
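|             # The training checkpoint stores attention as a single fused Wqkv tensor; | ||||
|             # reshape it to (3, num_heads, head_dim, hidden_size) so it can be split | ||||
|             # into the separate q/k/v projections expected by the HF model. | ||||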
|             wqkv = states.pop(f"blocks.{i}.mixer.Wqkv.weight").reshape( | ||||
|                 3, model_config["num_attention_heads"], -1, model_config["hidden_size"] | ||||
|             ) | ||||
|             bqkv = states.pop(f"blocks.{i}.mixer.Wqkv.bias").reshape(3, model_config["num_attention_heads"], -1) | ||||
|  | ||||
|             current_states[f"model.layers.{i}.self_attn.q_proj.weight"] = wqkv[0].reshape( | ||||
|                 -1, model_config["hidden_size"] | ||||
|             ) | ||||
|             current_states[f"model.layers.{i}.self_attn.q_proj.bias"] = bqkv[0].reshape(-1) | ||||
|             current_states[f"model.layers.{i}.self_attn.k_proj.weight"] = wqkv[1].reshape( | ||||
|                 -1, model_config["hidden_size"] | ||||
|             ) | ||||
|             current_states[f"model.layers.{i}.self_attn.k_proj.bias"] = bqkv[1].reshape(-1) | ||||
|             current_states[f"model.layers.{i}.self_attn.v_proj.weight"] = wqkv[2].reshape( | ||||
|                 -1, model_config["hidden_size"] | ||||
|             ) | ||||
|             current_states[f"model.layers.{i}.self_attn.v_proj.bias"] = bqkv[2].reshape(-1) | ||||
|  | ||||
|             current_states[f"model.layers.{i}.self_attn.o_proj.weight"] = states.pop( | ||||
|                 f"blocks.{i}.mixer.out_proj.weight" | ||||
|             ) | ||||
|             current_states[f"model.layers.{i}.self_attn.o_proj.bias"] = states.pop(f"blocks.{i}.mixer.out_proj.bias") | ||||
|  | ||||
|             current_states[f"model.layers.{i}.mlp.gate_proj.weight"] = states.pop(f"blocks.{i}.mlp.w1.weight") | ||||
|             current_states[f"model.layers.{i}.mlp.down_proj.weight"] = states.pop(f"blocks.{i}.mlp.w3.weight") | ||||
|             current_states[f"model.layers.{i}.mlp.up_proj.weight"] = states.pop(f"blocks.{i}.mlp.w2.weight") | ||||
|  | ||||
|             current_states[f"model.layers.{i}.input_layernorm.weight"] = states.pop(f"blocks.{i}.norm1.weight") | ||||
|             current_states[f"model.layers.{i}.post_attention_layernorm.weight"] = states.pop(f"blocks.{i}.norm2.weight") | ||||
|             current_states[f"model.layers.{i}.self_attn.rotary_emb.inv_freq"] = inv_freq | ||||
|  | ||||
|         config = InternLMConfig( | ||||
|             hidden_size=model_config["hidden_size"], | ||||
|             intermediate_size=compute_intermediate_size(model_config["hidden_size"]), | ||||
|             num_attention_heads=model_config["num_attention_heads"], | ||||
|             num_hidden_layers=model_config["num_layers"], | ||||
|             rms_norm_eps=1e-06, | ||||
|             bias=True, | ||||
|         ) | ||||
|  | ||||
|         if model_config["vocab_size"] != -1: | ||||
|             config.vocab_size = model_config["vocab_size"] | ||||
|  | ||||
|         config.save_pretrained(folder) | ||||
|         torch.save(current_states, os.path.join(folder, "pytorch_model.bin")) | ||||
|  | ||||
|         model = InternLMForCausalLM.from_pretrained(folder, torch_dtype=torch.float16) | ||||
|         del model.config._name_or_path | ||||
|  | ||||
|     return config, model | ||||
|  | ||||
|  | ||||
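| # Rounds hidden_size * 8 / 3 up to the next multiple of 256; for hidden_size 4096 this | ||||
| # gives 11008, matching the default `intermediate_size` in `InternLMConfig`. | ||||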
| def compute_intermediate_size(n): | ||||
|     return int(math.ceil(n * 8 / 3) + 255) // 256 * 256 | ||||
|  | ||||
|  | ||||
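| # Merges pipeline-parallel shards into one state dict per tensor-parallel rank, | ||||
| # renumbering the "blocks.<idx>." layer indices with a cumulative shift so that layers | ||||
| # from later pipeline stages follow those of earlier stages, and stripping any leading | ||||
| # "model." prefix from the resulting keys. | ||||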
| def merge_pp(states_tp_pp): | ||||
|     max_tp = len(states_tp_pp) | ||||
|     max_pp = len(states_tp_pp[0]) | ||||
|  | ||||
|     full_states = [] | ||||
|     for tp in range(max_tp): | ||||
|         layer_shift = 0 | ||||
|  | ||||
|         tp_states = {} | ||||
|         for pp in range(max_pp): | ||||
|             _layer_shift = 0 | ||||
|             states = states_tp_pp[tp][pp] | ||||
|             keys = list(states.keys()) | ||||
|             for key in keys: | ||||
|                 match = re.search(r"\.\d+\.", key) | ||||
|                 if match is not None: | ||||
|                     s, e = match.span() | ||||
|                     layer_idx = int(key[s + 1 : e - 1]) + layer_shift | ||||
|                     _layer_shift = max(_layer_shift, int(key[s + 1 : e - 1])) | ||||
|                     name = key[:s] + f".{layer_idx}." + key[e:] | ||||
|                     tp_states[name] = states[key] | ||||
|                 else: | ||||
|                     tp_states[key] = states[key] | ||||
|             layer_shift += _layer_shift + 1 | ||||
|         full_states.append({(key[6:] if key.startswith("model.") else key): value for key, value in tp_states.items()}) | ||||
|     return full_states | ||||
|  | ||||
|  | ||||
| if __name__ == "__main__": | ||||
|     parser = argparse.ArgumentParser() | ||||
|     parser.add_argument('--src_folder', type=str, default='~/test/')  # folder containing the checkpoint to be converted to HF format | ||||
|     parser.add_argument('--tgt_folder', type=str, default='~/output/')  # destination folder for the converted checkpoint | ||||
|     parser.add_argument('--tokenizer', type=str, default='~/test/tokenizer.model')  # path to the tokenizer model file | ||||
|     args = parser.parse_args() | ||||
|  | ||||
|     def load(fp): | ||||
|         with open(fp, "rb") as f: | ||||
|             pt_data = torch.load(f, map_location="cpu") | ||||
|         return pt_data | ||||
|  | ||||
|     folder = args.src_folder | ||||
|     target_folder = args.tgt_folder | ||||
|     model_config = load(os.path.join(folder, "model_config.pt")) | ||||
|  | ||||
|     fns = list(os.listdir(folder)) | ||||
|  | ||||
|     model_fns = [] | ||||
|     for fn in fns: | ||||
|         if fn.startswith("model_t") and not fn.endswith("md5"): | ||||
|             model_fns.append(fn) | ||||
|  | ||||
|     max_tp, max_pp = -1, -1 | ||||
|     for fn in model_fns: | ||||
|         _, tp, pp = os.path.splitext(fn)[0].split("_") | ||||
|         max_pp = max(max_pp, int(pp[2:]) + 1) | ||||
|         max_tp = max(max_tp, int(tp[2:]) + 1) | ||||
|  | ||||
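|     # Shards are named model_tp{tp}_pp{pp}.pt; only tensor-parallel rank 0 is loaded | ||||
|     # below, which matches NUM_SHARDS for the 7B model (a single tensor-parallel shard). | ||||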
|     states_tp_pps = [[]] | ||||
|  | ||||
|     for pp in range(max_pp): | ||||
|         model_name = f"model_tp0_pp{pp}.pt" | ||||
|         states = load(os.path.join(folder, model_name)) | ||||
|         states_tp_pps[0].append(states) | ||||
|  | ||||
|     config, model = convert2hf(model_config, states_tp_pps) | ||||
|  | ||||
|     os.makedirs(target_folder, exist_ok=True) | ||||
|     model.save_pretrained(target_folder, max_shard_size="20GB") | ||||
|     # TODO There should be a better way to add this. | ||||
|     with open(os.path.join(target_folder, "config.json")) as fp: | ||||
|         config_dict = json.load(fp) | ||||
|     config_dict["auto_map"]["AutoModel"] = "modeling_internlm.InternLMForCausalLM" | ||||
|     with open(os.path.join(target_folder, "config.json"), "w") as fp: | ||||
|         json.dump(config_dict, fp, indent=2) | ||||
|  | ||||
|     tokenizer = InternLMTokenizer(args.tokenizer) | ||||
|     tokenizer.save_pretrained(target_folder) | ||||
| @@ -0,0 +1,137 @@ | ||||
| import copy | ||||
| import warnings | ||||
| from dataclasses import dataclass | ||||
| from typing import Callable, List, Optional | ||||
|  | ||||
| import torch | ||||
| from torch import nn | ||||
| from transformers import AutoModel, AutoTokenizer | ||||
| from transformers.generation.utils import LogitsProcessorList, StoppingCriteriaList | ||||
| from transformers.utils import logging | ||||
|  | ||||
| logger = logging.get_logger(__name__) | ||||
|  | ||||
|  | ||||
| @dataclass | ||||
| class GenerationConfig: | ||||
|     max_length: Optional[int] = None | ||||
|     top_p: Optional[float] = None | ||||
|     temperature: Optional[float] = None | ||||
|     do_sample: Optional[bool] = True | ||||
|     repetition_penalty: Optional[float] = 1.0 | ||||
|  | ||||
|  | ||||
| @torch.inference_mode() | ||||
| def generate_interactive( | ||||
|     model,  | ||||
|     tokenizer, | ||||
|     prompt, | ||||
|     generation_config: Optional[GenerationConfig] = None, | ||||
|     logits_processor: Optional[LogitsProcessorList] = None, | ||||
|     stopping_criteria: Optional[StoppingCriteriaList] = None, | ||||
|     prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor], List[int]]] = None, | ||||
|     additional_eos_token_id: Optional[int] = None, | ||||
|     **kwargs, | ||||
| ): | ||||
|     inputs = tokenizer([prompt], padding=True, return_tensors="pt") | ||||
|     input_length = len(inputs["input_ids"][0]) | ||||
|     for k, v in inputs.items(): | ||||
|         inputs[k] = v.cuda() | ||||
|     input_ids = inputs["input_ids"] | ||||
|     batch_size, input_ids_seq_length = input_ids.shape[0], input_ids.shape[-1] | ||||
|     if generation_config is None: | ||||
|         generation_config = model.generation_config | ||||
|     generation_config = copy.deepcopy(generation_config) | ||||
|     model_kwargs = generation_config.update(**kwargs) | ||||
|     bos_token_id, eos_token_id = generation_config.bos_token_id, generation_config.eos_token_id | ||||
|     if isinstance(eos_token_id, int): | ||||
|         eos_token_id = [eos_token_id] | ||||
|     if additional_eos_token_id is not None: | ||||
|         eos_token_id.append(additional_eos_token_id) | ||||
|     has_default_max_length = kwargs.get("max_length") is None and generation_config.max_length is not None | ||||
|     if has_default_max_length and generation_config.max_new_tokens is None: | ||||
|         warnings.warn( | ||||
|             f"Using `max_length`'s default ({generation_config.max_length}) to control the generation length. " | ||||
|             "This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we" | ||||
|             " recommend using `max_new_tokens` to control the maximum length of the generation.", | ||||
|             UserWarning, | ||||
|         ) | ||||
|     elif generation_config.max_new_tokens is not None: | ||||
|         generation_config.max_length = generation_config.max_new_tokens + input_ids_seq_length | ||||
|         if not has_default_max_length: | ||||
|             logger.warning( | ||||
|                 f"Both `max_new_tokens` (={generation_config.max_new_tokens}) and `max_length`(=" | ||||
|                 f"{generation_config.max_length}) seem to have been set. `max_new_tokens` will take precedence. " | ||||
|                 "Please refer to the documentation for more information. " | ||||
|                 "(https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)", | ||||
|             ) | ||||
|  | ||||
|     if input_ids_seq_length >= generation_config.max_length: | ||||
|         input_ids_string = "input_ids" | ||||
|         logger.warning( | ||||
|             f"Input length of {input_ids_string} is {input_ids_seq_length}, but `max_length` is set to" | ||||
|             f" {generation_config.max_length}. This can lead to unexpected behavior. You should consider" | ||||
|             " increasing `max_new_tokens`." | ||||
|         ) | ||||
|  | ||||
|     # 2. Set generation parameters if not already defined | ||||
|     logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList() | ||||
|     stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList() | ||||
|  | ||||
|     logits_processor = model._get_logits_processor( | ||||
|         generation_config=generation_config, | ||||
|         input_ids_seq_length=input_ids_seq_length, | ||||
|         encoder_input_ids=input_ids, | ||||
|         prefix_allowed_tokens_fn=prefix_allowed_tokens_fn, | ||||
|         logits_processor=logits_processor, | ||||
|     ) | ||||
|  | ||||
|     stopping_criteria = model._get_stopping_criteria( | ||||
|         generation_config=generation_config, stopping_criteria=stopping_criteria | ||||
|     ) | ||||
|     logits_warper = model._get_logits_warper(generation_config) | ||||
|  | ||||
|     unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1) | ||||
|     scores = None | ||||
|     while True: | ||||
|         model_inputs = model.prepare_inputs_for_generation(input_ids, **model_kwargs) | ||||
|         # forward pass to get next token | ||||
|         outputs = model( | ||||
|             **model_inputs, | ||||
|             return_dict=True, | ||||
|             output_attentions=False, | ||||
|             output_hidden_states=False, | ||||
|         ) | ||||
|  | ||||
|         next_token_logits = outputs.logits[:, -1, :] | ||||
|  | ||||
|         # pre-process distribution | ||||
|         next_token_scores = logits_processor(input_ids, next_token_logits) | ||||
|         next_token_scores = logits_warper(input_ids, next_token_scores) | ||||
|  | ||||
|         # sample | ||||
|         probs = nn.functional.softmax(next_token_scores, dim=-1) | ||||
|         if generation_config.do_sample: | ||||
|             next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1) | ||||
|         else: | ||||
|             next_tokens = torch.argmax(probs, dim=-1) | ||||
|  | ||||
|         # update generated ids, model inputs, and length for next step | ||||
|         input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1) | ||||
|         model_kwargs = model._update_model_kwargs_for_generation( | ||||
|             outputs, model_kwargs, is_encoder_decoder=False | ||||
|         ) | ||||
|         unfinished_sequences = unfinished_sequences.mul((min(next_tokens != i for i in eos_token_id)).long()) | ||||
|          | ||||
|         output_token_ids = input_ids[0].cpu().tolist() | ||||
|         output_token_ids = output_token_ids[input_length:] | ||||
|         for each_eos_token_id in eos_token_id: | ||||
|             if output_token_ids[-1] == each_eos_token_id: | ||||
|                 output_token_ids = output_token_ids[:-1] | ||||
|         response = tokenizer.decode(output_token_ids) | ||||
|  | ||||
|         yield response | ||||
|         # stop when each sentence is finished, or if we exceed the maximum length | ||||
|         if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores): | ||||
|             break | ||||
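|  | ||||
|  | ||||
| # Minimal usage sketch for `generate_interactive`. The checkpoint path is a placeholder, | ||||
| # and the prompt / `max_new_tokens` values are illustrative assumptions rather than | ||||
| # recommended settings. | ||||
| if __name__ == "__main__": | ||||
|     tokenizer = AutoTokenizer.from_pretrained("hf_ckpt/", trust_remote_code=True) | ||||
|     model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda() | ||||
|     model.eval() | ||||
|     # Each yielded `response` is the full text generated so far, so print only the newly | ||||
|     # added suffix to get a streaming effect. | ||||
|     printed = "" | ||||
|     for response in generate_interactive(model, tokenizer, "Hello", max_new_tokens=128): | ||||
|         print(response[len(printed):], end="", flush=True) | ||||
|         printed = response | ||||
|     print() | ||||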
| @@ -0,0 +1,69 @@ | ||||
| import torch | ||||
| from torch.utils.data import DataLoader | ||||
| from peft import get_peft_model, LoraConfig, TaskType | ||||
| from transformers import get_linear_schedule_with_warmup | ||||
| from transformers import AutoModelForCausalLM, AutoTokenizer | ||||
| from tqdm import tqdm | ||||
|  | ||||
| from moss_002_sft import get_dataset, collate_fn | ||||
|  | ||||
| model_path = "model_path" | ||||
| data_dir = "moss_002_sft" | ||||
| data_num = -1 | ||||
| test_size = 10 | ||||
| train_batch_size = 1 | ||||
| epochs = 5 | ||||
| val_per_steps = 1000 | ||||
| lr = 9e-6 | ||||
| peft_config = LoraConfig( | ||||
|     task_type=TaskType.CAUSAL_LM, r=32, lora_alpha=32, lora_dropout=0.1, | ||||
|     target_modules=["gate_proj", "down_proj", "up_proj", "q_proj", "k_proj", "v_proj", "o_proj"] | ||||
| ) | ||||
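| # Rank-32 LoRA adapters are attached to every attention projection (q/k/v/o) and MLP | ||||
| # projection (gate/up/down); with PEFT defaults, only these adapter weights are trained. | ||||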
|  | ||||
|  | ||||
| # model | ||||
| model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True) | ||||
| tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) | ||||
| model = get_peft_model(model, peft_config) | ||||
| model.cuda() | ||||
|  | ||||
| # dataset | ||||
| train_dataset, val_dataset = get_dataset(tokenizer, data_dir, num=data_num, test_size=test_size) | ||||
| train_dataloader = DataLoader(train_dataset, batch_size=train_batch_size, shuffle=True, collate_fn=lambda x: collate_fn(x, tokenizer)) | ||||
|  | ||||
| optimizer = torch.optim.AdamW(model.parameters(), lr) | ||||
| scheduler = get_linear_schedule_with_warmup( | ||||
|     optimizer, 1000, epochs * len(train_dataloader) | ||||
| ) | ||||
|  | ||||
| # train | ||||
| fp = open("output", "w") | ||||
| model.train() | ||||
| for epoch in tqdm(range(epochs), desc="Training Epoch"): | ||||
|     batch_bar = tqdm(train_dataloader, desc="Training Batch") | ||||
|     for step, batch in enumerate(batch_bar): | ||||
|         batch = {k:v.cuda() for k, v in batch.items()} | ||||
|         with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16): | ||||
|             output = model(**batch) | ||||
|  | ||||
|         loss = output.loss | ||||
|         loss.backward() | ||||
|         optimizer.step() | ||||
|         scheduler.step() | ||||
|         optimizer.zero_grad() | ||||
|         batch_bar.set_postfix({"loss": loss.item()}) | ||||
|         if (step + 1) % val_per_steps == 0: | ||||
|             fp.write(f"Epoch {epoch} Batch {step}: Loss={loss.item()}\n") | ||||
|             for i in tqdm(range(len(val_dataset)), desc="Generating"): | ||||
|                 data, label = val_dataset[i] | ||||
|                 prefix = tokenizer.decode(data.tolist(), skip_special_tokens=True) | ||||
|                 try: | ||||
|                     generate = model.generate(input_ids=data.unsqueeze(0).cuda(), temperature=0.7, top_k=50, do_sample=True, repetition_penalty=1.02, max_new_tokens=100, top_p=0.9) | ||||
|                     text = tokenizer.decode(generate[0].tolist(), skip_special_tokens=True) | ||||
|                     text = text.replace(prefix, "") | ||||
|                     fp.write(f"Prefix: {prefix}\nGenerated: {text}" + "\n---------------------------------\n") | ||||
|                 except Exception as e: | ||||
|                     fp.write(f"Prefix: {prefix}\nError: {e}" + "\n---------------------------------\n") | ||||
|             fp.write("\n==============================\n") | ||||
|             model.train() | ||||
|             torch.cuda.empty_cache() | ||||
| @@ -0,0 +1,105 @@ | ||||
| import os | ||||
| import copy | ||||
|  | ||||
| import torch | ||||
| from torch.utils.data import Dataset | ||||
| from datasets import load_dataset, Dataset as HFDataset | ||||
|  | ||||
| class SFTDataset(Dataset): | ||||
|     # https://github.com/OpenLMLab/MOSS/blob/main/finetune_moss.py | ||||
|     def __init__(self, dataset): | ||||
|         super().__init__() | ||||
|         self.dataset = dataset | ||||
|  | ||||
|     def __len__(self): | ||||
|         return len(self.dataset) | ||||
|      | ||||
|     def __getitem__(self, index): | ||||
|         data = copy.deepcopy(self.dataset[index]["input_ids"]) | ||||
|         no_loss_spans = copy.deepcopy(self.dataset[index]["no_loss_spans"]) | ||||
|  | ||||
|         data = torch.tensor(data, dtype=torch.long) | ||||
|         label = copy.deepcopy(data) | ||||
|  | ||||
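|         # Spans listed in no_loss_spans (e.g. the instruction prefix) are excluded from | ||||
|         # the loss by setting their labels to -100, the ignore_index of CrossEntropyLoss. | ||||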
|         for no_loss_span in no_loss_spans: | ||||
|             label[no_loss_span[0] : no_loss_span[1]] = -100 | ||||
|  | ||||
|         return data, label | ||||
|      | ||||
| def collate_fn(batch, tokenizer): | ||||
|     batch_input_ids, batch_labels = [], [] | ||||
|     for input_ids, label in batch: | ||||
|         batch_input_ids.append(input_ids) | ||||
|         batch_labels.append(label) | ||||
|  | ||||
|     batch_input_ids = torch.nn.utils.rnn.pad_sequence(batch_input_ids, batch_first=True, padding_value=tokenizer.eos_token_id) | ||||
|     batch_labels = torch.nn.utils.rnn.pad_sequence(batch_labels, batch_first=True, padding_value=-100) | ||||
|  | ||||
|     return { | ||||
|         "input_ids": batch_input_ids, | ||||
|         "attention_mask": (batch_input_ids == tokenizer.eos_token_id).long(), | ||||
|         "labels": batch_labels | ||||
|     } | ||||
|  | ||||
| def process(sample, tokenizer, max_len): | ||||
|     chat = sample["plain_text"].split("<eoa>")[:-1] | ||||
|     num_turns = sample["num_turns"] | ||||
|     meta_instruction = sample["prefix"] | ||||
|  | ||||
|     # encode instruction | ||||
|     instruction_ids = tokenizer.encode(meta_instruction) | ||||
|     assert isinstance(instruction_ids, list), instruction_ids | ||||
|     assert len(instruction_ids) > 0, len(instruction_ids) | ||||
|     input_ids = copy.deepcopy(instruction_ids) | ||||
|     # We do not calculate loss for instruction. | ||||
|     no_loss_spans = [(0, len(instruction_ids))] | ||||
|  | ||||
|     for i in range(num_turns): | ||||
|         # Collect dialogues | ||||
|         cur_turn_ids = [] | ||||
|         cur_no_loss_spans = [] | ||||
|         # Add to cur_turn_ids | ||||
|         cur_turn_ids.extend(tokenizer.encode(chat[i] + "<eoa>")) | ||||
|         # if key == 'Tool Responses': | ||||
|         #     # The format tokens (<|Results|>:...<eor>\n) should have losses.  | ||||
|         #     cur_no_loss_spans.append((len(input_ids + cur_turn_ids) + 5, len(input_ids + cur_turn_ids + cur_ids) - 2)) | ||||
|         if len(input_ids + cur_turn_ids) > max_len: | ||||
|             # Too long, break | ||||
|             break | ||||
|         # Extend input_ids | ||||
|         input_ids.extend(cur_turn_ids) | ||||
|         no_loss_spans.extend(cur_no_loss_spans) | ||||
|  | ||||
|     if len(input_ids) == len(instruction_ids): | ||||
|         # No dialogue, return | ||||
|         return {"input_ids": [], "no_loss_spans": []} | ||||
|     else: | ||||
|         return {"input_ids": input_ids, "no_loss_spans": no_loss_spans} | ||||
|  | ||||
|  | ||||
| def load_data(save_dir, tokenizer, max_len, num=-1) -> HFDataset: | ||||
|     if os.path.exists(save_dir): | ||||
|         print(f"Loading moss-002-sft from {save_dir}") | ||||
|     else: | ||||
|         print(f"Loading moss-002-sft from datasets") | ||||
|         moss_sft = load_dataset("fnlp/moss-002-sft-data", split="train") | ||||
|         moss_sft = moss_sft.map(lambda x:process(x, tokenizer, max_len), num_proc=10) | ||||
|         moss_sft = moss_sft.filter(lambda x:len(x["input_ids"]) != 0) | ||||
|         moss_sft.save_to_disk(save_dir) | ||||
|  | ||||
|     moss_sft = HFDataset.load_from_disk(save_dir) | ||||
|     if num != -1: | ||||
|         moss_sft = moss_sft.select(range(num)) | ||||
|     print(f"Loaded successfully, {len(moss_sft)} samples in total.") | ||||
|      | ||||
|     return moss_sft | ||||
|  | ||||
| def get_dataset(tokenizer, save_dir, max_len=1024, num=-1, test_size=0.1): | ||||
|     moss_sft_data = load_data(save_dir, tokenizer, max_len, num) | ||||
|     moss_sft_split = moss_sft_data.train_test_split(test_size=test_size) | ||||
|     train_dataset = SFTDataset(moss_sft_split["train"]) | ||||
|     val_dataset = SFTDataset(moss_sft_split["test"]) | ||||
|  | ||||
|     return train_dataset, val_dataset | ||||
|  | ||||
| @@ -0,0 +1,998 @@ | ||||
| # coding=utf-8 | ||||
| # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved. | ||||
| # | ||||
| # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX | ||||
| # and OPT implementations in this library. It has been modified from its | ||||
| # original forms to accommodate minor architectural differences compared | ||||
| # to GPT-NeoX and OPT used by the Meta AI team that trained the model. | ||||
| # | ||||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||||
| # you may not use this file except in compliance with the License. | ||||
| # You may obtain a copy of the License at | ||||
| # | ||||
| #     http://www.apache.org/licenses/LICENSE-2.0 | ||||
| # | ||||
| # Unless required by applicable law or agreed to in writing, software | ||||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||
| # See the License for the specific language governing permissions and | ||||
| # limitations under the License. | ||||
| """ PyTorch InternLM model.""" | ||||
| import math | ||||
| from typing import List, Optional, Tuple, Union | ||||
| import threading, queue | ||||
|  | ||||
| import torch | ||||
| import torch.utils.checkpoint | ||||
| from torch import nn | ||||
| from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss | ||||
|  | ||||
| from transformers.activations import ACT2FN | ||||
| from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast | ||||
| from transformers.modeling_utils import PreTrainedModel | ||||
| from transformers.generation.streamers import BaseStreamer | ||||
| from transformers.utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings | ||||
| from configuration_internlm import InternLMConfig | ||||
|  | ||||
|  | ||||
| logger = logging.get_logger(__name__) | ||||
|  | ||||
| _CONFIG_FOR_DOC = "InternLMConfig" | ||||
|  | ||||
| # Copied from transformers.models.bart.modeling_bart._make_causal_mask | ||||
| def _make_causal_mask( | ||||
|     input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0 | ||||
| ): | ||||
|     """ | ||||
|     Make causal mask used for bi-directional self-attention. | ||||
|     """ | ||||
|     bsz, tgt_len = input_ids_shape | ||||
|     mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device) | ||||
|     mask_cond = torch.arange(mask.size(-1), device=device) | ||||
|     mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0) | ||||
|     mask = mask.to(dtype) | ||||
|  | ||||
|     if past_key_values_length > 0: | ||||
|         mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1) | ||||
|     return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length) | ||||
|  | ||||
|  | ||||
| # Copied from transformers.models.bart.modeling_bart._expand_mask | ||||
| def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None): | ||||
|     """ | ||||
|     Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`. | ||||
|     """ | ||||
|     bsz, src_len = mask.size() | ||||
|     tgt_len = tgt_len if tgt_len is not None else src_len | ||||
|  | ||||
|     expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype) | ||||
|  | ||||
|     inverted_mask = 1.0 - expanded_mask | ||||
|  | ||||
|     return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min) | ||||
|  | ||||
|  | ||||
| class InternLMRMSNorm(nn.Module): | ||||
|     def __init__(self, hidden_size, eps=1e-6): | ||||
|         """ | ||||
|         InternLMRMSNorm is equivalent to T5LayerNorm | ||||
|         """ | ||||
|         super().__init__() | ||||
|         self.weight = nn.Parameter(torch.ones(hidden_size)) | ||||
|         self.variance_epsilon = eps | ||||
|  | ||||
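|     # RMSNorm scales x by 1 / sqrt(mean(x^2) + eps) and a learned weight; unlike | ||||
|     # LayerNorm there is no mean subtraction and no bias term. | ||||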
|     def forward(self, hidden_states): | ||||
|         variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True) | ||||
|         hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon) | ||||
|  | ||||
|         # convert into half-precision if necessary | ||||
|         if self.weight.dtype in [torch.float16, torch.bfloat16]: | ||||
|             hidden_states = hidden_states.to(self.weight.dtype) | ||||
|  | ||||
|         return self.weight * hidden_states | ||||
|  | ||||
|  | ||||
| class InternLMRotaryEmbedding(torch.nn.Module): | ||||
|     def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None): | ||||
|         super().__init__() | ||||
|         inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float().to(device) / dim)) | ||||
|         self.register_buffer("inv_freq", inv_freq) | ||||
|  | ||||
|         # Build here to make `torch.jit.trace` work. | ||||
|         self.max_seq_len_cached = max_position_embeddings | ||||
|         t = torch.arange(self.max_seq_len_cached, device=self.inv_freq.device, dtype=self.inv_freq.dtype) | ||||
|         freqs = torch.einsum("i,j->ij", t, self.inv_freq) | ||||
|         # Different from paper, but it uses a different permutation in order to obtain the same calculation | ||||
|         emb = torch.cat((freqs, freqs), dim=-1) | ||||
|         self.register_buffer("cos_cached", emb.cos()[None, None, :, :], persistent=False) | ||||
|         self.register_buffer("sin_cached", emb.sin()[None, None, :, :], persistent=False) | ||||
|  | ||||
|     def forward(self, x, seq_len=None): | ||||
|         # x: [bs, num_attention_heads, seq_len, head_size] | ||||
|         # This `if` block is unlikely to be run after we build sin/cos in `__init__`. Keep the logic here just in case. | ||||
|         if seq_len > self.max_seq_len_cached: | ||||
|             self.max_seq_len_cached = seq_len | ||||
|             t = torch.arange(self.max_seq_len_cached, device=x.device, dtype=self.inv_freq.dtype) | ||||
|             freqs = torch.einsum("i,j->ij", t, self.inv_freq) | ||||
|             # Different from paper, but it uses a different permutation in order to obtain the same calculation | ||||
|             emb = torch.cat((freqs, freqs), dim=-1).to(x.device) | ||||
|             self.register_buffer("cos_cached", emb.cos()[None, None, :, :], persistent=False) | ||||
|             self.register_buffer("sin_cached", emb.sin()[None, None, :, :], persistent=False) | ||||
|         return ( | ||||
|             self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype), | ||||
|             self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype), | ||||
|         ) | ||||
|  | ||||
|  | ||||
| def rotate_half(x): | ||||
|     """Rotates half the hidden dims of the input.""" | ||||
|     x1 = x[..., : x.shape[-1] // 2] | ||||
|     x2 = x[..., x.shape[-1] // 2 :] | ||||
|     return torch.cat((-x2, x1), dim=-1) | ||||
|  | ||||
|  | ||||
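| # Rotary position embedding: q and k are rotated pairwise in 2-D subspaces, | ||||
| # x -> x * cos + rotate_half(x) * sin, with cos/sin gathered at position_ids. | ||||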
| def apply_rotary_pos_emb(q, k, cos, sin, position_ids): | ||||
|     # The first two dimensions of cos and sin are always 1, so we can `squeeze` them. | ||||
|     cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim] | ||||
|     sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim] | ||||
|     cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim] | ||||
|     sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim] | ||||
|     q_embed = (q * cos) + (rotate_half(q) * sin) | ||||
|     k_embed = (k * cos) + (rotate_half(k) * sin) | ||||
|     return q_embed, k_embed | ||||
|  | ||||
|  | ||||
| class InternLMMLP(nn.Module): | ||||
|     def __init__( | ||||
|         self, | ||||
|         hidden_size: int, | ||||
|         intermediate_size: int, | ||||
|         hidden_act: str, | ||||
|     ): | ||||
|         super().__init__() | ||||
|         self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False) | ||||
|         self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False) | ||||
|         self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False) | ||||
|         self.act_fn = ACT2FN[hidden_act] | ||||
|  | ||||
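|     # Gated MLP (SwiGLU when hidden_act is "silu"): down_proj(act_fn(gate_proj(x)) * up_proj(x)). | ||||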
|     def forward(self, x): | ||||
|         return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) | ||||
|  | ||||
|  | ||||
| class InternLMAttention(nn.Module): | ||||
|     """Multi-headed attention from 'Attention Is All You Need' paper""" | ||||
|  | ||||
|     def __init__(self, config: InternLMConfig): | ||||
|         super().__init__() | ||||
|         self.config = config | ||||
|         self.hidden_size = config.hidden_size | ||||
|         self.num_heads = config.num_attention_heads | ||||
|         self.head_dim = self.hidden_size // self.num_heads | ||||
|         self.max_position_embeddings = config.max_position_embeddings | ||||
|  | ||||
|         if (self.head_dim * self.num_heads) != self.hidden_size: | ||||
|             raise ValueError( | ||||
|                 f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" | ||||
|                 f" and `num_heads`: {self.num_heads})." | ||||
|             ) | ||||
|         self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.bias) | ||||
|         self.k_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.bias) | ||||
|         self.v_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.bias) | ||||
|         self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.bias) | ||||
|         self.rotary_emb = InternLMRotaryEmbedding(self.head_dim, max_position_embeddings=self.max_position_embeddings) | ||||
|  | ||||
|     def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int): | ||||
|         return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous() | ||||
|  | ||||
|     def forward( | ||||
|         self, | ||||
|         hidden_states: torch.Tensor, | ||||
|         attention_mask: Optional[torch.Tensor] = None, | ||||
|         position_ids: Optional[torch.LongTensor] = None, | ||||
|         past_key_value: Optional[Tuple[torch.Tensor]] = None, | ||||
|         output_attentions: bool = False, | ||||
|         use_cache: bool = False, | ||||
|     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]: | ||||
|         bsz, q_len, _ = hidden_states.size() | ||||
|  | ||||
|         query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2) | ||||
|         key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2) | ||||
|         value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2) | ||||
|  | ||||
|         kv_seq_len = key_states.shape[-2] | ||||
|         if past_key_value is not None: | ||||
|             kv_seq_len += past_key_value[0].shape[-2] | ||||
|         cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) | ||||
|         query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) | ||||
|         # [bsz, nh, t, hd] | ||||
|  | ||||
|         if past_key_value is not None: | ||||
|             # reuse k, v, self_attention | ||||
|             key_states = torch.cat([past_key_value[0], key_states], dim=2) | ||||
|             value_states = torch.cat([past_key_value[1], value_states], dim=2) | ||||
|  | ||||
|         past_key_value = (key_states, value_states) if use_cache else None | ||||
|  | ||||
|         attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim) | ||||
|  | ||||
|         if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len): | ||||
|             raise ValueError( | ||||
|                 f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" | ||||
|                 f" {attn_weights.size()}" | ||||
|             ) | ||||
|  | ||||
|         if attention_mask is not None: | ||||
|             if attention_mask.size() != (bsz, 1, q_len, kv_seq_len): | ||||
|                 raise ValueError( | ||||
|                     f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}" | ||||
|                 ) | ||||
|             attn_weights = attn_weights + attention_mask | ||||
|             attn_weights = torch.max(attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min)) | ||||
|  | ||||
|         # upcast attention to fp32 | ||||
|         attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype) | ||||
|         attn_output = torch.matmul(attn_weights, value_states) | ||||
|  | ||||
|         if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim): | ||||
|             raise ValueError( | ||||
|                 f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" | ||||
|                 f" {attn_output.size()}" | ||||
|             ) | ||||
|  | ||||
|         attn_output = attn_output.transpose(1, 2) | ||||
|         attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) | ||||
|  | ||||
|         attn_output = self.o_proj(attn_output) | ||||
|  | ||||
|         if not output_attentions: | ||||
|             attn_weights = None | ||||
|  | ||||
|         return attn_output, attn_weights, past_key_value | ||||
|  | ||||
|  | ||||
| class InternLMDecoderLayer(nn.Module): | ||||
|     def __init__(self, config: InternLMConfig): | ||||
|         super().__init__() | ||||
|         self.hidden_size = config.hidden_size | ||||
|         self.self_attn = InternLMAttention(config=config) | ||||
|         self.mlp = InternLMMLP( | ||||
|             hidden_size=self.hidden_size, | ||||
|             intermediate_size=config.intermediate_size, | ||||
|             hidden_act=config.hidden_act, | ||||
|         ) | ||||
|         self.input_layernorm = InternLMRMSNorm(config.hidden_size, eps=config.rms_norm_eps) | ||||
|         self.post_attention_layernorm = InternLMRMSNorm(config.hidden_size, eps=config.rms_norm_eps) | ||||
|  | ||||
|     def forward( | ||||
|         self, | ||||
|         hidden_states: torch.Tensor, | ||||
|         attention_mask: Optional[torch.Tensor] = None, | ||||
|         position_ids: Optional[torch.LongTensor] = None, | ||||
|         past_key_value: Optional[Tuple[torch.Tensor]] = None, | ||||
|         output_attentions: Optional[bool] = False, | ||||
|         use_cache: Optional[bool] = False, | ||||
|     ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]: | ||||
|         """ | ||||
|         Args: | ||||
|             hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)` | ||||
|             attention_mask (`torch.FloatTensor`, *optional*): attention mask of size | ||||
|                 `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. | ||||
|             output_attentions (`bool`, *optional*): | ||||
|                 Whether or not to return the attentions tensors of all attention layers. See `attentions` under | ||||
|                 returned tensors for more detail. | ||||
|             use_cache (`bool`, *optional*): | ||||
|                 If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding | ||||
|                 (see `past_key_values`). | ||||
|             past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states | ||||
|         """ | ||||
|  | ||||
|         residual = hidden_states | ||||
|  | ||||
|         hidden_states = self.input_layernorm(hidden_states) | ||||
|  | ||||
|         # Self Attention | ||||
|         hidden_states, self_attn_weights, present_key_value = self.self_attn( | ||||
|             hidden_states=hidden_states, | ||||
|             attention_mask=attention_mask, | ||||
|             position_ids=position_ids, | ||||
|             past_key_value=past_key_value, | ||||
|             output_attentions=output_attentions, | ||||
|             use_cache=use_cache, | ||||
|         ) | ||||
|         hidden_states = residual + hidden_states | ||||
|  | ||||
|         # Fully Connected | ||||
|         residual = hidden_states | ||||
|         hidden_states = self.post_attention_layernorm(hidden_states) | ||||
|         hidden_states = self.mlp(hidden_states) | ||||
|         hidden_states = residual + hidden_states | ||||
|  | ||||
|         outputs = (hidden_states,) | ||||
|  | ||||
|         if output_attentions: | ||||
|             outputs += (self_attn_weights,) | ||||
|  | ||||
|         if use_cache: | ||||
|             outputs += (present_key_value,) | ||||
|  | ||||
|         return outputs | ||||
|  | ||||
|  | ||||
| INTERNLM_START_DOCSTRING = r""" | ||||
|     This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the | ||||
|     library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads | ||||
|     etc.) | ||||
|  | ||||
|     This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. | ||||
|     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage | ||||
|     and behavior. | ||||
|  | ||||
|     Parameters: | ||||
|         config ([`InternLMConfig`]): | ||||
|             Model configuration class with all the parameters of the model. Initializing with a config file does not | ||||
|             load the weights associated with the model, only the configuration. Check out the | ||||
|             [`~PreTrainedModel.from_pretrained`] method to load the model weights. | ||||
| """ | ||||
|  | ||||
|  | ||||
| @add_start_docstrings( | ||||
|     "The bare InternLM Model outputting raw hidden-states without any specific head on top.", | ||||
|     INTERNLM_START_DOCSTRING, | ||||
| ) | ||||
| class InternLMPreTrainedModel(PreTrainedModel): | ||||
|     config_class = InternLMConfig | ||||
|     base_model_prefix = "model" | ||||
|     supports_gradient_checkpointing = True | ||||
|     _no_split_modules = ["InternLMDecoderLayer"] | ||||
|     _keys_to_ignore_on_load_unexpected = [r"decoder\.version"] | ||||
|  | ||||
|     def _init_weights(self, module): | ||||
|         std = self.config.initializer_range | ||||
|         if isinstance(module, nn.Linear): | ||||
|             module.weight.data.normal_(mean=0.0, std=std) | ||||
|             if module.bias is not None: | ||||
|                 module.bias.data.zero_() | ||||
|         elif isinstance(module, nn.Embedding): | ||||
|             module.weight.data.normal_(mean=0.0, std=std) | ||||
|             if module.padding_idx is not None: | ||||
|                 module.weight.data[module.padding_idx].zero_() | ||||
|  | ||||
|     def _set_gradient_checkpointing(self, module, value=False): | ||||
|         if isinstance(module, InternLMModel): | ||||
|             module.gradient_checkpointing = value | ||||
|  | ||||
|  | ||||
| INTERNLM_INPUTS_DOCSTRING = r""" | ||||
|     Args: | ||||
|         input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): | ||||
|             Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide | ||||
|             it. | ||||
|  | ||||
|             Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and | ||||
|             [`PreTrainedTokenizer.__call__`] for details. | ||||
|  | ||||
|             [What are input IDs?](../glossary#input-ids) | ||||
|         attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): | ||||
|             Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: | ||||
|  | ||||
|             - 1 for tokens that are **not masked**, | ||||
|             - 0 for tokens that are **masked**. | ||||
|  | ||||
|             [What are attention masks?](../glossary#attention-mask) | ||||
|  | ||||
|             Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and | ||||
|             [`PreTrainedTokenizer.__call__`] for details. | ||||
|  | ||||
|             If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see | ||||
|             `past_key_values`). | ||||
|  | ||||
|             If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`] | ||||
|             and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more | ||||
|             information on the default strategy. | ||||
|  | ||||
|             - 1 indicates the head is **not masked**, | ||||
|             - 0 indicates the head is **masked**. | ||||
|         position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): | ||||
|             Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, | ||||
|             config.n_positions - 1]`. | ||||
|  | ||||
|             [What are position IDs?](../glossary#position-ids) | ||||
|         past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): | ||||
|             Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape | ||||
|             `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape | ||||
|             `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`. | ||||
|  | ||||
|             Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention | ||||
|             blocks) that can be used (see `past_key_values` input) to speed up sequential decoding. | ||||
|  | ||||
|             If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that | ||||
|             don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all | ||||
|             `decoder_input_ids` of shape `(batch_size, sequence_length)`. | ||||
|         inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): | ||||
|             Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This | ||||
|             is useful if you want more control over how to convert `input_ids` indices into associated vectors than the | ||||
|             model's internal embedding lookup matrix. | ||||
|         use_cache (`bool`, *optional*): | ||||
|             If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see | ||||
|             `past_key_values`). | ||||
|         output_attentions (`bool`, *optional*): | ||||
|             Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned | ||||
|             tensors for more detail. | ||||
|         output_hidden_states (`bool`, *optional*): | ||||
|             Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for | ||||
|             more detail. | ||||
|         return_dict (`bool`, *optional*): | ||||
|             Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. | ||||
| """ | ||||
|  | ||||
|  | ||||
| @add_start_docstrings( | ||||
|     "The bare InternLM Model outputting raw hidden-states without any specific head on top.", | ||||
|     INTERNLM_START_DOCSTRING, | ||||
| ) | ||||
| class InternLMModel(InternLMPreTrainedModel): | ||||
|     """ | ||||
|     Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`InternLMDecoderLayer`] | ||||
|  | ||||
|     Args: | ||||
|         config: InternLMConfig | ||||
|     """ | ||||
|     _auto_class = "AutoModel" | ||||
|  | ||||
|     def __init__(self, config: InternLMConfig): | ||||
|         super().__init__(config) | ||||
|         self.padding_idx = config.pad_token_id | ||||
|         self.vocab_size = config.vocab_size | ||||
|  | ||||
|         self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx) | ||||
|         self.layers = nn.ModuleList([InternLMDecoderLayer(config) for _ in range(config.num_hidden_layers)]) | ||||
|         self.norm = InternLMRMSNorm(config.hidden_size, eps=config.rms_norm_eps) | ||||
|  | ||||
|         self.gradient_checkpointing = False | ||||
|         # Initialize weights and apply final processing | ||||
|         self.post_init() | ||||
|  | ||||
|     def get_input_embeddings(self): | ||||
|         return self.embed_tokens | ||||
|  | ||||
|     def set_input_embeddings(self, value): | ||||
|         self.embed_tokens = value | ||||
|  | ||||
|     # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask | ||||
|     def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length): | ||||
|         # create causal mask | ||||
|         # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] | ||||
|         combined_attention_mask = None | ||||
|         if input_shape[-1] > 1: | ||||
|             combined_attention_mask = _make_causal_mask( | ||||
|                 input_shape, | ||||
|                 inputs_embeds.dtype, | ||||
|                 device=inputs_embeds.device, | ||||
|                 past_key_values_length=past_key_values_length, | ||||
|             ) | ||||
|  | ||||
|         if attention_mask is not None: | ||||
|             # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] | ||||
|             expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to( | ||||
|                 inputs_embeds.device | ||||
|             ) | ||||
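|             # Both masks are additive: 0 keeps a position and a large negative value (dtype min) blocks it, | ||||
|             # so summing the causal and padding masks blocks a position if either mask does. | ||||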
|             combined_attention_mask = ( | ||||
|                 expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask | ||||
|             ) | ||||
|  | ||||
|         return combined_attention_mask | ||||
|  | ||||
|     @add_start_docstrings_to_model_forward(INTERNLM_INPUTS_DOCSTRING) | ||||
|     def forward( | ||||
|         self, | ||||
|         input_ids: torch.LongTensor = None, | ||||
|         attention_mask: Optional[torch.Tensor] = None, | ||||
|         position_ids: Optional[torch.LongTensor] = None, | ||||
|         past_key_values: Optional[List[torch.FloatTensor]] = None, | ||||
|         inputs_embeds: Optional[torch.FloatTensor] = None, | ||||
|         use_cache: Optional[bool] = None, | ||||
|         output_attentions: Optional[bool] = None, | ||||
|         output_hidden_states: Optional[bool] = None, | ||||
|         return_dict: Optional[bool] = None, | ||||
|     ) -> Union[Tuple, BaseModelOutputWithPast]: | ||||
|         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions | ||||
|         output_hidden_states = ( | ||||
|             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states | ||||
|         ) | ||||
|         use_cache = use_cache if use_cache is not None else self.config.use_cache | ||||
|  | ||||
|         return_dict = return_dict if return_dict is not None else self.config.use_return_dict | ||||
|  | ||||
|         # retrieve input_ids and inputs_embeds | ||||
|         if input_ids is not None and inputs_embeds is not None: | ||||
|             raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time") | ||||
|         elif input_ids is not None: | ||||
|             batch_size, seq_length = input_ids.shape | ||||
|         elif inputs_embeds is not None: | ||||
|             batch_size, seq_length, _ = inputs_embeds.shape | ||||
|         else: | ||||
|             raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds") | ||||
|  | ||||
|         seq_length_with_past = seq_length | ||||
|         past_key_values_length = 0 | ||||
|  | ||||
|         if past_key_values is not None: | ||||
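|             # Each cached key/value tensor has shape (batch, num_heads, past_seq_len, head_dim), | ||||
|             # so dim 2 of the first layer's cache gives the number of previously processed positions. | ||||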
|             past_key_values_length = past_key_values[0][0].shape[2] | ||||
|             seq_length_with_past = seq_length_with_past + past_key_values_length | ||||
|  | ||||
|         if position_ids is None: | ||||
|             device = input_ids.device if input_ids is not None else inputs_embeds.device | ||||
|             position_ids = torch.arange( | ||||
|                 past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device | ||||
|             ) | ||||
|             position_ids = position_ids.unsqueeze(0).view(-1, seq_length) | ||||
|         else: | ||||
|             position_ids = position_ids.view(-1, seq_length).long() | ||||
|  | ||||
|         if inputs_embeds is None: | ||||
|             inputs_embeds = self.embed_tokens(input_ids) | ||||
|         # embed positions | ||||
|         if attention_mask is None: | ||||
|             attention_mask = torch.ones( | ||||
|                 (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device | ||||
|             ) | ||||
|         attention_mask = self._prepare_decoder_attention_mask( | ||||
|             attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length | ||||
|         ) | ||||
|  | ||||
|         hidden_states = inputs_embeds | ||||
|  | ||||
|         if self.gradient_checkpointing and self.training: | ||||
|             if use_cache: | ||||
|                 logger.warning_once( | ||||
|                     "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." | ||||
|                 ) | ||||
|                 use_cache = False | ||||
|  | ||||
|         # decoder layers | ||||
|         all_hidden_states = () if output_hidden_states else None | ||||
|         all_self_attns = () if output_attentions else None | ||||
|         next_decoder_cache = () if use_cache else None | ||||
|  | ||||
|         for idx, decoder_layer in enumerate(self.layers): | ||||
|             if output_hidden_states: | ||||
|                 all_hidden_states += (hidden_states,) | ||||
|  | ||||
|             past_key_value = past_key_values[idx] if past_key_values is not None else None | ||||
|  | ||||
|             if self.gradient_checkpointing and self.training: | ||||
|  | ||||
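|                 # Under gradient checkpointing the layer is recomputed during the backward pass; | ||||
|                 # past_key_value is passed as None because the KV cache is disabled in this branch. | ||||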
|                 def create_custom_forward(module): | ||||
|                     def custom_forward(*inputs): | ||||
|                         # None for past_key_value | ||||
|                         return module(*inputs, output_attentions, None) | ||||
|  | ||||
|                     return custom_forward | ||||
|  | ||||
|                 layer_outputs = torch.utils.checkpoint.checkpoint( | ||||
|                     create_custom_forward(decoder_layer), | ||||
|                     hidden_states, | ||||
|                     attention_mask, | ||||
|                     position_ids, | ||||
|                     None, | ||||
|                 ) | ||||
|             else: | ||||
|                 layer_outputs = decoder_layer( | ||||
|                     hidden_states, | ||||
|                     attention_mask=attention_mask, | ||||
|                     position_ids=position_ids, | ||||
|                     past_key_value=past_key_value, | ||||
|                     output_attentions=output_attentions, | ||||
|                     use_cache=use_cache, | ||||
|                 ) | ||||
|  | ||||
|             hidden_states = layer_outputs[0] | ||||
|  | ||||
|             if use_cache: | ||||
|                 next_decoder_cache += (layer_outputs[2 if output_attentions else 1],) | ||||
|  | ||||
|             if output_attentions: | ||||
|                 all_self_attns += (layer_outputs[1],) | ||||
|  | ||||
|         hidden_states = self.norm(hidden_states) | ||||
|  | ||||
|         # add hidden states from the last decoder layer | ||||
|         if output_hidden_states: | ||||
|             all_hidden_states += (hidden_states,) | ||||
|  | ||||
|         next_cache = next_decoder_cache if use_cache else None | ||||
|         if not return_dict: | ||||
|             return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None) | ||||
|         return BaseModelOutputWithPast( | ||||
|             last_hidden_state=hidden_states, | ||||
|             past_key_values=next_cache, | ||||
|             hidden_states=all_hidden_states, | ||||
|             attentions=all_self_attns, | ||||
|         ) | ||||
|  | ||||
|  | ||||
| class InternLMForCausalLM(InternLMPreTrainedModel): | ||||
|     _auto_class = "AutoModelForCausalLM" | ||||
|  | ||||
|     def __init__(self, config): | ||||
|         super().__init__(config) | ||||
|         self.model = InternLMModel(config) | ||||
|  | ||||
|         self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False) | ||||
|  | ||||
|         # Initialize weights and apply final processing | ||||
|         self.post_init() | ||||
|  | ||||
|     def get_input_embeddings(self): | ||||
|         return self.model.embed_tokens | ||||
|  | ||||
|     def set_input_embeddings(self, value): | ||||
|         self.model.embed_tokens = value | ||||
|  | ||||
|     def get_output_embeddings(self): | ||||
|         return self.lm_head | ||||
|  | ||||
|     def set_output_embeddings(self, new_embeddings): | ||||
|         self.lm_head = new_embeddings | ||||
|  | ||||
|     def set_decoder(self, decoder): | ||||
|         self.model = decoder | ||||
|  | ||||
|     def get_decoder(self): | ||||
|         return self.model | ||||
|  | ||||
|     @add_start_docstrings_to_model_forward(INTERNLM_INPUTS_DOCSTRING) | ||||
|     @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC) | ||||
|     def forward( | ||||
|         self, | ||||
|         input_ids: torch.LongTensor = None, | ||||
|         attention_mask: Optional[torch.Tensor] = None, | ||||
|         position_ids: Optional[torch.LongTensor] = None, | ||||
|         past_key_values: Optional[List[torch.FloatTensor]] = None, | ||||
|         inputs_embeds: Optional[torch.FloatTensor] = None, | ||||
|         labels: Optional[torch.LongTensor] = None, | ||||
|         use_cache: Optional[bool] = None, | ||||
|         output_attentions: Optional[bool] = None, | ||||
|         output_hidden_states: Optional[bool] = None, | ||||
|         return_dict: Optional[bool] = None, | ||||
|     ) -> Union[Tuple, CausalLMOutputWithPast]: | ||||
|         r""" | ||||
|         Args: | ||||
|             labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): | ||||
|                 Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., | ||||
|                 config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored | ||||
|                 (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`. | ||||
|  | ||||
|         Returns: | ||||
|  | ||||
|         Example: | ||||
|  | ||||
|         ```python | ||||
|         >>> from transformers import AutoTokenizer, InternLMForCausalLM | ||||
|  | ||||
|         >>> model = InternLMForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS) | ||||
|         >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER) | ||||
|  | ||||
|         >>> prompt = "Hey, are you consciours? Can you talk to me?" | ||||
|         >>> inputs = tokenizer(prompt, return_tensors="pt") | ||||
|  | ||||
|         >>> # Generate | ||||
|         >>> generate_ids = model.generate(inputs.input_ids, max_length=30) | ||||
|         >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] | ||||
|         "Hey, are you consciours? Can you talk to me?\nI'm not consciours, but I can talk to you." | ||||
|         ```""" | ||||
|  | ||||
|         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions | ||||
|         output_hidden_states = ( | ||||
|             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states | ||||
|         ) | ||||
|         return_dict = return_dict if return_dict is not None else self.config.use_return_dict | ||||
|  | ||||
|         # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn) | ||||
|         outputs = self.model( | ||||
|             input_ids=input_ids, | ||||
|             attention_mask=attention_mask, | ||||
|             position_ids=position_ids, | ||||
|             past_key_values=past_key_values, | ||||
|             inputs_embeds=inputs_embeds, | ||||
|             use_cache=use_cache, | ||||
|             output_attentions=output_attentions, | ||||
|             output_hidden_states=output_hidden_states, | ||||
|             return_dict=return_dict, | ||||
|         ) | ||||
|  | ||||
|         hidden_states = outputs[0] | ||||
|         logits = self.lm_head(hidden_states) | ||||
|  | ||||
|         loss = None | ||||
|         if labels is not None: | ||||
|             # Shift so that tokens < n predict n | ||||
|             shift_logits = logits[..., :-1, :].contiguous() | ||||
|             shift_labels = labels[..., 1:].contiguous() | ||||
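|             # e.g. for labels [t0, t1, t2, t3], the logits at positions 0..2 are scored against t1..t3 | ||||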
|             # Flatten the tokens | ||||
|             loss_fct = CrossEntropyLoss() | ||||
|             shift_logits = shift_logits.view(-1, self.config.vocab_size) | ||||
|             shift_labels = shift_labels.view(-1) | ||||
|             # Enable model parallelism | ||||
|             shift_labels = shift_labels.to(shift_logits.device) | ||||
|             loss = loss_fct(shift_logits, shift_labels) | ||||
|  | ||||
|         if not return_dict: | ||||
|             output = (logits,) + outputs[1:] | ||||
|             return (loss,) + output if loss is not None else output | ||||
|  | ||||
|         return CausalLMOutputWithPast( | ||||
|             loss=loss, | ||||
|             logits=logits, | ||||
|             past_key_values=outputs.past_key_values, | ||||
|             hidden_states=outputs.hidden_states, | ||||
|             attentions=outputs.attentions, | ||||
|         ) | ||||
|  | ||||
|     def prepare_inputs_for_generation( | ||||
|         self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs | ||||
|     ): | ||||
|         if past_key_values: | ||||
|             input_ids = input_ids[:, -1:] | ||||
|  | ||||
|         position_ids = kwargs.get("position_ids", None) | ||||
|         if attention_mask is not None and position_ids is None: | ||||
|             # create position_ids on the fly for batch generation | ||||
|             position_ids = attention_mask.long().cumsum(-1) - 1 | ||||
|             position_ids.masked_fill_(attention_mask == 0, 1) | ||||
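|             # e.g. a left-padded mask [[0, 0, 1, 1]] gives cumsum-1 = [[-1, -1, 0, 1]]; the padded slots | ||||
|             # are overwritten with 1, which is harmless because they are masked out of attention | ||||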
|             if past_key_values: | ||||
|                 position_ids = position_ids[:, -1].unsqueeze(-1) | ||||
|  | ||||
|         # if `inputs_embeds` are passed, we only want to use them in the 1st generation step | ||||
|         if inputs_embeds is not None and past_key_values is None: | ||||
|             model_inputs = {"inputs_embeds": inputs_embeds} | ||||
|         else: | ||||
|             model_inputs = {"input_ids": input_ids} | ||||
|  | ||||
|         model_inputs.update( | ||||
|             { | ||||
|                 "position_ids": position_ids, | ||||
|                 "past_key_values": past_key_values, | ||||
|                 "use_cache": kwargs.get("use_cache"), | ||||
|                 "attention_mask": attention_mask, | ||||
|             } | ||||
|         ) | ||||
|         return model_inputs | ||||
|  | ||||
|     @staticmethod | ||||
|     def _reorder_cache(past_key_values, beam_idx): | ||||
|         reordered_past = () | ||||
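|         # Reorder each layer's cached key/value tensors along the batch dimension so the cache | ||||
|         # follows the beams selected during beam search | ||||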
|         for layer_past in past_key_values: | ||||
|             reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),) | ||||
|         return reordered_past | ||||
|      | ||||
|     def build_inputs(self, tokenizer, query: str, history: List[Tuple[str, str]] = []): | ||||
|         prompt = "" | ||||
|         for record in history: | ||||
|             prompt += f"""<s><|User|>:{record[0]}<eoh>\n<|Bot|>:{record[1]}<eoa>\n""" | ||||
|         if len(prompt) == 0: | ||||
|             prompt += "<s>" | ||||
|         prompt += f"""<|User|>:{query}<eoh>\n<|Bot|>:""" | ||||
|         return tokenizer([prompt], return_tensors="pt") | ||||
|      | ||||
|     @torch.no_grad() | ||||
|     def chat(self,  | ||||
|              tokenizer,  | ||||
|              query: str, | ||||
|              history: List[Tuple[str, str]] = [],  | ||||
|              streamer: Optional[BaseStreamer] = None, | ||||
|              max_new_tokens: int = 1024, | ||||
|              do_sample: bool = True, | ||||
|              temperature: float = 0.8, | ||||
|              top_p: float = 0.8, | ||||
|              **kwargs): | ||||
|         inputs = self.build_inputs(tokenizer, query, history) | ||||
|         inputs = {k: v.to(self.device) for k, v in inputs.items() if torch.is_tensor(v)} | ||||
|         outputs = self.generate(**inputs,  | ||||
|                                 streamer=streamer,  | ||||
|                                 max_new_tokens=max_new_tokens,  | ||||
|                                 do_sample=do_sample,  | ||||
|                                 temperature=temperature,  | ||||
|                                 top_p=top_p,  | ||||
|                                 **kwargs) | ||||
|         outputs = outputs[0].cpu().tolist()[len(inputs["input_ids"][0]):] | ||||
|         response = tokenizer.decode(outputs, skip_special_tokens=True) | ||||
|         response = response.split("<eoa>")[0] | ||||
|         history = history + [(query, response)] | ||||
|         return response, history | ||||
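|  | ||||
|     # Minimal usage sketch (assumes `model` and `tokenizer` were loaded from a converted | ||||
|     # InternLM checkpoint with trust_remote_code=True; names are illustrative): | ||||
|     #   response, history = model.chat(tokenizer, "Hello") | ||||
|     #   response, history = model.chat(tokenizer, "Tell me more", history=history) | ||||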
|      | ||||
|     @torch.no_grad() | ||||
|     def stream_chat(self,  | ||||
|                     tokenizer, | ||||
|                     query: str, | ||||
|                     history: List[Tuple[str, str]] = [],  | ||||
|                     max_new_tokens: int = 1024, | ||||
|                     do_sample: bool = True, | ||||
|                     temperature: float = 0.8, | ||||
|                     top_p: float = 0.8, | ||||
|                     **kwargs): | ||||
|         """ | ||||
|         Return a generator that yields (response, history) tuples as new tokens stream in, e.g.: | ||||
|         ('你好,有什么可以帮助您的吗', [('你好', '你好,有什么可以帮助您的吗')]) | ||||
|         ('你好,有什么可以帮助您的吗?', [('你好', '你好,有什么可以帮助您的吗?')]) | ||||
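|  | ||||
|         Usage sketch (each yield carries the accumulated response so far): | ||||
|             for response, history in model.stream_chat(tokenizer, query): | ||||
|                 print(response) | ||||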
|         """ | ||||
|  | ||||
|         response_queue = queue.Queue(maxsize=20) | ||||
|  | ||||
|         class ChatStreamer(BaseStreamer): | ||||
|             def __init__(self, tokenizer) -> None: | ||||
|                 super().__init__() | ||||
|                 self.tokenizer = tokenizer | ||||
|                 self.queue = response_queue | ||||
|                 self.query = query | ||||
|                 self.history = history | ||||
|                 self.response = "" | ||||
|                 self.received_inputs = False | ||||
|                 self.queue.put((self.response, history + [(self.query, self.response)])) | ||||
|  | ||||
|             def put(self, value): | ||||
|                 if len(value.shape) > 1 and value.shape[0] > 1: | ||||
|                     raise ValueError("ChatStreamer only supports batch size 1") | ||||
|                 elif len(value.shape) > 1: | ||||
|                     value = value[0] | ||||
|  | ||||
|                 if not self.received_inputs: | ||||
|                     # The first received value is input_ids, ignore here | ||||
|                     self.received_inputs = True | ||||
|                     return | ||||
|  | ||||
|                 token = self.tokenizer.decode([value[-1]], skip_special_tokens=True) | ||||
|                 if token.strip() != "<eoa>": | ||||
|                     self.response = self.response + token | ||||
|                     history = self.history + [(self.query, self.response)] | ||||
|                     self.queue.put((self.response, history)) | ||||
|  | ||||
|             def end(self): | ||||
|                 self.queue.put(None) | ||||
|  | ||||
|         def stream_producer(): | ||||
|             return self.chat( | ||||
|                 tokenizer=tokenizer, | ||||
|                 query=query, | ||||
|                 streamer=ChatStreamer(tokenizer=tokenizer), | ||||
|                 history=history,  | ||||
|                 max_new_tokens=max_new_tokens, | ||||
|                 do_sample=do_sample, | ||||
|                 temperature=temperature, | ||||
|                 top_p=top_p, | ||||
|                 **kwargs | ||||
|             ) | ||||
|  | ||||
|         def consumer(): | ||||
|             producer = threading.Thread(target=stream_producer) | ||||
|             producer.start() | ||||
|             while True: | ||||
|                 res = response_queue.get() | ||||
|                 if res is None: | ||||
|                     return | ||||
|                 yield res | ||||
|  | ||||
|         return consumer() | ||||
|  | ||||
|  | ||||
| @add_start_docstrings( | ||||
|     """ | ||||
|     The InternLM Model transformer with a sequence classification head on top (linear layer). | ||||
|  | ||||
|     [`InternLMForSequenceClassification`] uses the last token in order to do the classification, as other causal models | ||||
|     (e.g. GPT-2) do. | ||||
|  | ||||
|     Since it does classification on the last token, it needs to know the position of the last token. If a | ||||
|     `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If | ||||
|     no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the | ||||
|     padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in | ||||
|     each row of the batch). | ||||
|     """, | ||||
|     INTERNLM_START_DOCSTRING, | ||||
| ) | ||||
| class InternLMForSequenceClassification(InternLMPreTrainedModel): | ||||
|     _keys_to_ignore_on_load_missing = [r"lm_head.weight"] | ||||
|  | ||||
|     def __init__(self, config): | ||||
|         super().__init__(config) | ||||
|         self.num_labels = config.num_labels | ||||
|         self.model = InternLMModel(config) | ||||
|         self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False) | ||||
|  | ||||
|         # Initialize weights and apply final processing | ||||
|         self.post_init() | ||||
|  | ||||
|     def get_input_embeddings(self): | ||||
|         return self.model.embed_tokens | ||||
|  | ||||
|     def set_input_embeddings(self, value): | ||||
|         self.model.embed_tokens = value | ||||
|  | ||||
|     @add_start_docstrings_to_model_forward(INTERNLM_INPUTS_DOCSTRING) | ||||
|     def forward( | ||||
|         self, | ||||
|         input_ids: torch.LongTensor = None, | ||||
|         attention_mask: Optional[torch.Tensor] = None, | ||||
|         position_ids: Optional[torch.LongTensor] = None, | ||||
|         past_key_values: Optional[List[torch.FloatTensor]] = None, | ||||
|         inputs_embeds: Optional[torch.FloatTensor] = None, | ||||
|         labels: Optional[torch.LongTensor] = None, | ||||
|         use_cache: Optional[bool] = None, | ||||
|         output_attentions: Optional[bool] = None, | ||||
|         output_hidden_states: Optional[bool] = None, | ||||
|         return_dict: Optional[bool] = None, | ||||
|     ) -> Union[Tuple, SequenceClassifierOutputWithPast]: | ||||
|         r""" | ||||
|         labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): | ||||
|             Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., | ||||
|             config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If | ||||
|             `config.num_labels > 1` a classification loss is computed (Cross-Entropy). | ||||
|         """ | ||||
|         return_dict = return_dict if return_dict is not None else self.config.use_return_dict | ||||
|  | ||||
|         transformer_outputs = self.model( | ||||
|             input_ids, | ||||
|             attention_mask=attention_mask, | ||||
|             position_ids=position_ids, | ||||
|             past_key_values=past_key_values, | ||||
|             inputs_embeds=inputs_embeds, | ||||
|             use_cache=use_cache, | ||||
|             output_attentions=output_attentions, | ||||
|             output_hidden_states=output_hidden_states, | ||||
|             return_dict=return_dict, | ||||
|         ) | ||||
|         hidden_states = transformer_outputs[0] | ||||
|         logits = self.score(hidden_states) | ||||
|  | ||||
|         if input_ids is not None: | ||||
|             batch_size = input_ids.shape[0] | ||||
|         else: | ||||
|             batch_size = inputs_embeds.shape[0] | ||||
|  | ||||
|         if self.config.pad_token_id is None and batch_size != 1: | ||||
|             raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.") | ||||
|         if self.config.pad_token_id is None: | ||||
|             sequence_lengths = -1 | ||||
|         else: | ||||
|             if input_ids is not None: | ||||
|                 sequence_lengths = (torch.ne(input_ids, self.config.pad_token_id).sum(-1) - 1).to(logits.device) | ||||
|             else: | ||||
|                 sequence_lengths = -1 | ||||
|  | ||||
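|         # Pool by selecting the logits at each sequence's last non-padding position (or simply the | ||||
|         # last position when no pad_token_id is set): shape (batch_size, num_labels). | ||||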
|         pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths] | ||||
|  | ||||
|         loss = None | ||||
|         if labels is not None: | ||||
|             labels = labels.to(logits.device) | ||||
|             if self.config.problem_type is None: | ||||
|                 if self.num_labels == 1: | ||||
|                     self.config.problem_type = "regression" | ||||
|                 elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int): | ||||
|                     self.config.problem_type = "single_label_classification" | ||||
|                 else: | ||||
|                     self.config.problem_type = "multi_label_classification" | ||||
|  | ||||
|             if self.config.problem_type == "regression": | ||||
|                 loss_fct = MSELoss() | ||||
|                 if self.num_labels == 1: | ||||
|                     loss = loss_fct(pooled_logits.squeeze(), labels.squeeze()) | ||||
|                 else: | ||||
|                     loss = loss_fct(pooled_logits, labels) | ||||
|             elif self.config.problem_type == "single_label_classification": | ||||
|                 loss_fct = CrossEntropyLoss() | ||||
|                 loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1)) | ||||
|             elif self.config.problem_type == "multi_label_classification": | ||||
|                 loss_fct = BCEWithLogitsLoss() | ||||
|                 loss = loss_fct(pooled_logits, labels) | ||||
|         if not return_dict: | ||||
|             output = (pooled_logits,) + transformer_outputs[1:] | ||||
|             return ((loss,) + output) if loss is not None else output | ||||
|  | ||||
|         return SequenceClassifierOutputWithPast( | ||||
|             loss=loss, | ||||
|             logits=pooled_logits, | ||||
|             past_key_values=transformer_outputs.past_key_values, | ||||
|             hidden_states=transformer_outputs.hidden_states, | ||||
|             attentions=transformer_outputs.attentions, | ||||
|         ) | ||||
| @@ -0,0 +1,242 @@ | ||||
| # coding=utf-8 | ||||
| # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved. | ||||
| # | ||||
| # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX | ||||
| # and OPT implementations in this library. It has been modified from its | ||||
| # original forms to accommodate minor architectural differences compared | ||||
| # to GPT-NeoX and OPT used by the Meta AI team that trained the model. | ||||
| # | ||||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||||
| # you may not use this file except in compliance with the License. | ||||
| # You may obtain a copy of the License at | ||||
| # | ||||
| #     http://www.apache.org/licenses/LICENSE-2.0 | ||||
| # | ||||
| # Unless required by applicable law or agreed to in writing, software | ||||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||
| # See the License for the specific language governing permissions and | ||||
| # limitations under the License. | ||||
|  | ||||
| """Tokenization classes for IntermLM.""" | ||||
| import os | ||||
| from shutil import copyfile | ||||
| from typing import Any, Dict, List, Optional, Tuple | ||||
|  | ||||
| import sentencepiece as spm | ||||
|  | ||||
| from transformers.tokenization_utils import PreTrainedTokenizer | ||||
| from transformers.utils import logging | ||||
|  | ||||
|  | ||||
| logger = logging.get_logger(__name__) | ||||
|  | ||||
| VOCAB_FILES_NAMES = {"vocab_file": "./tokenizer.model"} | ||||
|  | ||||
| PRETRAINED_VOCAB_FILES_MAP = {} | ||||
|  | ||||
|  | ||||
| class InternLMTokenizer(PreTrainedTokenizer): | ||||
|     """ | ||||
|     Construct an InternLM tokenizer, based on a SentencePiece model. | ||||
|  | ||||
|     Args: | ||||
|         vocab_file (`str`): | ||||
|             Path to the vocabulary file. | ||||
|     """ | ||||
|  | ||||
|     vocab_files_names = VOCAB_FILES_NAMES | ||||
|     pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP | ||||
|     model_input_names = ["input_ids", "attention_mask"] | ||||
|     _auto_class = "AutoTokenizer" | ||||
|  | ||||
|     def __init__( | ||||
|         self, | ||||
|         vocab_file, | ||||
|         unk_token="<unk>", | ||||
|         bos_token="<s>", | ||||
|         eos_token="</s>", | ||||
|         pad_token="</s>", | ||||
|         sp_model_kwargs: Optional[Dict[str, Any]] = None, | ||||
|         add_bos_token=True, | ||||
|         add_eos_token=False, | ||||
|         decode_with_prefix_space=False, | ||||
|         clean_up_tokenization_spaces=False, | ||||
|         **kwargs, | ||||
|     ): | ||||
|         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs | ||||
|         super().__init__( | ||||
|             bos_token=bos_token, | ||||
|             eos_token=eos_token, | ||||
|             unk_token=unk_token, | ||||
|             pad_token=pad_token, | ||||
|             clean_up_tokenization_spaces=clean_up_tokenization_spaces, | ||||
|             **kwargs, | ||||
|         ) | ||||
|         self.vocab_file = vocab_file | ||||
|         self.add_bos_token = add_bos_token | ||||
|         self.add_eos_token = add_eos_token | ||||
|         self.decode_with_prefix_space = decode_with_prefix_space | ||||
|         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) | ||||
|         self.sp_model.Load(vocab_file) | ||||
|         self._no_prefix_space_tokens = None | ||||
|  | ||||
|         """ Initialisation""" | ||||
|  | ||||
|     @property | ||||
|     def no_prefix_space_tokens(self): | ||||
|         if self._no_prefix_space_tokens is None: | ||||
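|             # SentencePiece marks word-initial pieces with "▁"; pieces without it continue the previous | ||||
|             # word, so no leading space should be re-added in front of them when decoding. | ||||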
|             vocab = self.convert_ids_to_tokens(list(range(self.vocab_size))) | ||||
|             self._no_prefix_space_tokens = {i for i, tok in enumerate(vocab) if not tok.startswith("▁")} | ||||
|         return self._no_prefix_space_tokens | ||||
|  | ||||
|     @property | ||||
|     def vocab_size(self): | ||||
|         """Returns vocab size""" | ||||
|         return self.sp_model.get_piece_size() | ||||
|  | ||||
|     @property | ||||
|     def bos_token_id(self) -> Optional[int]: | ||||
|         return self.sp_model.bos_id() | ||||
|  | ||||
|     @property | ||||
|     def eos_token_id(self) -> Optional[int]: | ||||
|         return self.sp_model.eos_id() | ||||
|  | ||||
|     def get_vocab(self): | ||||
|         """Returns vocab as a dict""" | ||||
|         vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)} | ||||
|         vocab.update(self.added_tokens_encoder) | ||||
|         return vocab | ||||
|  | ||||
|     def _tokenize(self, text): | ||||
|         """Returns a tokenized string.""" | ||||
|         return self.sp_model.encode(text, out_type=str) | ||||
|  | ||||
|     def _convert_token_to_id(self, token): | ||||
|         """Converts a token (str) in an id using the vocab.""" | ||||
|         return self.sp_model.piece_to_id(token) | ||||
|  | ||||
|     def _convert_id_to_token(self, index): | ||||
|         """Converts an index (integer) in a token (str) using the vocab.""" | ||||
|         token = self.sp_model.IdToPiece(index) | ||||
|         return token | ||||
|  | ||||
|     def _maybe_add_prefix_space(self, tokens, decoded): | ||||
|         if tokens and tokens[0] not in self.no_prefix_space_tokens: | ||||
|             return " " + decoded | ||||
|         else: | ||||
|             return decoded | ||||
|  | ||||
|     def convert_tokens_to_string(self, tokens): | ||||
|         """Converts a sequence of tokens (string) in a single string.""" | ||||
|         current_sub_tokens = [] | ||||
|         out_string = "" | ||||
|         prev_is_special = False | ||||
|         for token in tokens: | ||||
|             # make sure that special tokens are not decoded using sentencepiece model | ||||
|             if token in self.all_special_tokens: | ||||
|                 if not prev_is_special: | ||||
|                     out_string += " " | ||||
|                 out_string += self.sp_model.decode(current_sub_tokens) + token | ||||
|                 prev_is_special = True | ||||
|                 current_sub_tokens = [] | ||||
|             else: | ||||
|                 current_sub_tokens.append(token) | ||||
|                 prev_is_special = False | ||||
|         out_string += self.sp_model.decode(current_sub_tokens) | ||||
|         out_string = self.clean_up_tokenization(out_string) | ||||
|         out_string = self._maybe_add_prefix_space(tokens=tokens, decoded=out_string) | ||||
|         return out_string[1:] | ||||
|  | ||||
|     def save_vocabulary(self, save_directory, filename_prefix: Optional[str] = None) -> Tuple[str]: | ||||
|         """ | ||||
|         Save the vocabulary and special tokens file to a directory. | ||||
|  | ||||
|         Args: | ||||
|             save_directory (`str`): | ||||
|                 The directory in which to save the vocabulary. | ||||
|  | ||||
|         Returns: | ||||
|             `Tuple(str)`: Paths to the files saved. | ||||
|         """ | ||||
|         if not os.path.isdir(save_directory): | ||||
|             logger.error(f"Vocabulary path ({save_directory}) should be a directory") | ||||
|             return | ||||
|         out_vocab_file = os.path.join( | ||||
|             save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"] | ||||
|         ) | ||||
|  | ||||
|         if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file): | ||||
|             copyfile(self.vocab_file, out_vocab_file) | ||||
|         elif not os.path.isfile(self.vocab_file): | ||||
|             with open(out_vocab_file, "wb") as fi: | ||||
|                 content_spiece_model = self.sp_model.serialized_model_proto() | ||||
|                 fi.write(content_spiece_model) | ||||
|  | ||||
|         return (out_vocab_file,) | ||||
|  | ||||
|     def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None): | ||||
|         if self.add_bos_token: | ||||
|             bos_token_ids = [self.bos_token_id] | ||||
|         else: | ||||
|             bos_token_ids = [] | ||||
|  | ||||
|         output = bos_token_ids + token_ids_0 | ||||
|  | ||||
|         if token_ids_1 is not None: | ||||
|             output = output + token_ids_1 | ||||
|  | ||||
|         if self.add_eos_token: | ||||
|             output = output + [self.eos_token_id] | ||||
|  | ||||
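|         # Default layout is "<s> + token_ids_0 [+ token_ids_1]"; </s> is only appended when add_eos_token=True. | ||||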
|         return output | ||||
|  | ||||
|     def get_special_tokens_mask( | ||||
|         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False | ||||
|     ) -> List[int]: | ||||
|         """ | ||||
|         Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding | ||||
|         special tokens using the tokenizer `prepare_for_model` method. | ||||
|  | ||||
|         Args: | ||||
|             token_ids_0 (`List[int]`): | ||||
|                 List of IDs. | ||||
|             token_ids_1 (`List[int]`, *optional*): | ||||
|                 Optional second list of IDs for sequence pairs. | ||||
|             already_has_special_tokens (`bool`, *optional*, defaults to `False`): | ||||
|                 Whether or not the token list is already formatted with special tokens for the model. | ||||
|  | ||||
|         Returns: | ||||
|             `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. | ||||
|         """ | ||||
|         if already_has_special_tokens: | ||||
|             return super().get_special_tokens_mask( | ||||
|                 token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True | ||||
|             ) | ||||
|  | ||||
|         if token_ids_1 is None: | ||||
|             return [1] + ([0] * len(token_ids_0)) + [1] | ||||
|         return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1] | ||||
|  | ||||
|     def create_token_type_ids_from_sequences( | ||||
|         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None | ||||
|     ) -> List[int]: | ||||
|         """ | ||||
|         Create a mask from the two sequences passed to be used in a sequence-pair classification task. InternLM does | ||||
|         not make use of token type ids, therefore a list of zeros is returned. | ||||
|  | ||||
|         Args: | ||||
|             token_ids_0 (`List[int]`): | ||||
|                 List of IDs. | ||||
|             token_ids_1 (`List[int]`, *optional*): | ||||
|                 Optional second list of IDs for sequence pairs. | ||||
|  | ||||
|         Returns: | ||||
|             `List[int]`: List of zeros. | ||||
|         """ | ||||
|         eos = [self.eos_token_id] | ||||
|  | ||||
|         if token_ids_1 is None: | ||||
|             return len(token_ids_0 + eos) * [0] | ||||
|         return len(token_ids_0 + eos + token_ids_1 + eos) * [0] | ||||