Text generation with aitextgen and GPT-2


This page is the follow-up to Générer du texte en python avec textgenrnn (Generating text in Python with textgenrnn).


Which graphics card to choose in June 2020

Choosing the Best GPU for Deep Learning in 2020

RTX 2060 (6 GB): if you want to explore deep learning in your spare time (360 €).
RTX 2070 or 2080 (8 GB): if you are serious about deep learning, but your GPU budget is $600-800. Eight GB of VRAM can fit the majority of models.
RTX 2080 Ti (11 GB): if you are serious about deep learning and your GPU budget is ~$1,200. The RTX 2080 Ti is ~40% faster than the RTX 2080.
Titan RTX and Quadro RTX 6000 (24 GB): if you are working on SOTA models extensively, but don't have budget for the future-proofing available with the RTX 8000 (4000 €).
Quadro RTX 8000 (48 GB): you are investing in the future and might even be lucky enough to research SOTA deep learning in 2020 (5500 €).
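
As a rough way to relate these VRAM sizes to model size, the sketch below estimates the memory taken by a model's training state alone. It assumes 32-bit floats and an Adam-style optimizer (weights, gradients, and two moment buffers, so 16 bytes per parameter), and it ignores activations, which grow with batch size and sequence length. The function name and the figure of about 124 million parameters for the smallest GPT-2 are illustrative inputs, not part of the quoted article.

```python
def training_state_gib(n_params, bytes_per_value=4, optimizer_buffers=2):
    """Estimate GiB needed for weights, gradients and optimizer buffers.

    Assumption: fp32 values and an Adam-like optimizer that keeps
    `optimizer_buffers` extra values per parameter. Activations are ignored.
    """
    values_per_param = 1 + 1 + optimizer_buffers  # weights + gradients + buffers
    return n_params * values_per_param * bytes_per_value / 1024**3


if __name__ == "__main__":
    # Smallest GPT-2 (about 124 million parameters).
    print(f"{training_state_gib(124_000_000):.2f} GiB")
```

Under these assumptions the fixed training state of a small GPT-2 stays under 2 GiB, so on the cards listed here most of the memory goes to activations, which is where batch size and sequence length come in.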

Resources

aitextgen

aitextgen, by minimaxir on github.com, is an improvement on textgenrnn, also by minimaxir on github.com.
docs.aitextgen.io
aitextgen is a robust Python tool for text-based AI training and generation using OpenAI's GPT-2 architecture.

OpenAI

OpenAI on Wikipedia: OpenAI is a "capped-profit" artificial intelligence company based in San Francisco. The company's goal is to promote and develop human-friendly artificial intelligence that benefits all of humanity. OpenAI developed an artificial intelligence named GPT-2 that can write news articles and works of fiction. It relies on a text generator that takes in the words it receives and determines the most plausible continuation, which it renders in the same style. It performs well enough that its output is reported to be indistinguishable from text written by a human.
The researchers delayed publishing their work because they considered GPT-2 "too dangerous": the AI could eventually be used for malicious purposes such as generating negative or positive product reviews, spam, conspiracy texts, or even fake news.

Controversy over GPT-2

Transformers

Transformers: State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0. Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, T5, CTRL…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with thousands of pretrained models in 100+ languages and deep interoperability between PyTorch & TensorFlow 2.0.

BERT

BERT (natural language processing) on Wikipedia: In natural language processing, BERT, short for Bidirectional Encoder Representations from Transformers, is a language model developed by Google in 2018. This method significantly improved natural language processing algorithms.
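
The word "bidirectional" is what separates BERT from GPT-2: BERT's encoder lets every token attend to the tokens on both sides, while GPT-2 only looks to the left, which is what makes it usable as a text generator. A minimal sketch of the two attention masks, in plain Python with hypothetical helper names:

```python
def causal_mask(n):
    """GPT-2 style: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """BERT style: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


if __name__ == "__main__":
    for name, mask in (("causal", causal_mask(4)),
                       ("bidirectional", bidirectional_mask(4))):
        print(name)
        for row in mask:
            print(" ", row)
```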

CamemBERT

The CamemBERT model was proposed in CamemBERT: a Tasty French Language Model by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook’s RoBERTa model released in 2019. It is a model trained on 138GB of French text.

FlauBERT

https://huggingface.co/transformers/model_doc/flaubert.html

The FlauBERT model was proposed in the paper FlauBERT: Unsupervised Language Model Pre-training for French by Hang Le et al. It’s a transformer pre-trained using a masked language modeling (MLM) objective (BERT-like).
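
The masked language modeling objective mentioned here hides some input tokens and trains the model to recover them. The sketch below shows only that corruption step on a whitespace-tokenized sentence. It is simplified: the 15% masking rate follows the original BERT recipe, but every selected token is replaced by `[MASK]` (BERT also sometimes keeps the token or swaps in a random one), and the function name is hypothetical.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Return (corrupted tokens, {position: original token})."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Always mask at least one token so short sentences still yield a target.
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = corrupted[pos]
        corrupted[pos] = mask_token
    return corrupted, targets


if __name__ == "__main__":
    sentence = "Maître Corbeau , sur un arbre perché , tenait en son bec un fromage".split()
    corrupted, targets = mask_tokens(sentence)
    print(" ".join(corrupted))
    print(targets)
```

During pre-training the model sees the corrupted sequence and is scored only on the positions stored in `targets`.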

La Fontaine's Fables

Follow-up to Avec des fables de La Fontaine (With La Fontaine's fables)

training.py

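from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2Config


def get_config():
    """GPT-2 configuration for a model trained from scratch."""
    return GPT2Config(
        vocab_size=20000,  # must cover the vocabulary of the trained tokenizer
        n_positions=1024,
        n_ctx=1024,
        n_embd=768,
        n_layer=12,
        n_head=12,
        bos_token_id=0,
        eos_token_id=0,
        max_length=1024,
        # Stored as an extra attribute only: the config printed in the
        # Configuration section still shows attn_pdrop, embd_pdrop and
        # resid_pdrop at 0.1.
        dropout=0.0,
    )


def training():
    file_name = "./fables.txt"

    # Train a tokenizer on the corpus; the vocab and merges files named
    # below are the ones it is expected to write.
    train_tokenizer(file_name)
    vocab_file = "aitextgen-vocab.json"
    merges_file = "aitextgen-merges.txt"

    # Untrained model built from the configuration and the new tokenizer.
    config = get_config()
    ai = aitextgen(vocab_file=vocab_file, merges_file=merges_file, config=config)

    # Encode the corpus into sets of 64 tokens.
    data = TokenDataset(file_name,
                        vocab_file=vocab_file,
                        merges_file=merges_file,
                        block_size=64)

    # testing.py later loads the result from ./trained_model/pytorch_model.bin.
    ai.train(data, batch_size=32, num_steps=60000)
    ai.generate(5, prompt="Le chien et le lion")


if __name__ == "__main__":
    training()

"""
Run 5
11,472 sets of tokens from ./fables.txt
Loss: 0.094 — Avg: 0.093
"""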
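
The numbers above give a feel for how heavily this small corpus was trained on. Assuming each of the 11,472 token sets reported in the log counts as one training sample and that batches draw from them evenly (my reading of the log, not something the aitextgen output states), 60,000 steps at a batch size of 32 correspond to well over a hundred passes through the data. That would be consistent with the very low loss, and it suggests the model may largely memorize the fables.

```python
def passes_over_dataset(num_steps, batch_size, num_samples):
    """Approximate number of times each sample is seen during training."""
    return num_steps * batch_size / num_samples


if __name__ == "__main__":
    # Values taken from training.py and its log output.
    passes = passes_over_dataset(num_steps=60000, batch_size=32, num_samples=11472)
    print(f"about {passes:.0f} passes over the dataset")
```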

testing.py

from aitextgen import aitextgen
from aitextgen.utils import GPT2Config


class AiTextGen:

    def __init__(self):
        """Load the model trained by training.py from ./trained_model."""

        self.prompt = "Romeo: "
        self.config = self.get_config()
        print("Config loaded:\n", self.config)

        self.vocab_file = "aitextgen-vocab.json"
        self.merges_file = "aitextgen-merges.txt"

        print("Loading the PyTorch model ...")
        self.ai = aitextgen(model="./trained_model/pytorch_model.bin",
                            vocab_file=self.vocab_file,
                            merges_file=self.merges_file,
                            config=self.config)
        print("Done.")

    def get_config(self):
        """Same configuration as in training.py."""
        return GPT2Config(
            vocab_size=20000,
            n_positions=1024,
            n_ctx=1024,
            n_embd=768,
            n_layer=12,
            n_head=12,
            bos_token_id=0,
            eos_token_id=0,
            max_length=1024,
            dropout=0.0,
        )

    def get_irc_response(self, prompt, len_max, temp):
        """Return a list with one generated text, or an empty list
        if prompt is not a string."""
        if not isinstance(prompt, str):
            return []
        return self.ai.generate(n=1,
                                prompt=prompt,
                                max_length=len_max,
                                temperature=temp,
                                return_as_list=True)

    def interactif(self):
        """Ask for the start of a sentence and print the generated text."""
        while True:
            try:
                prompt = input("Enter the start of a sentence:\n")
            except (EOFError, KeyboardInterrupt):
                # Ctrl-D or Ctrl-C ends the session.
                print("\nBye.")
                break
            if len(prompt) > 4:
                resp = self.get_irc_response(prompt, len_max=100, temp=0.8)
                print(f"\n\nLa Fontaine did not write:\n{resp[0]}\n\n")
            else:
                print("Too short, try again.")


if __name__ == "__main__":

    atg = AiTextGen()
    atg.interactif()
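
To get a sense of the size of the model that `get_config()` describes, the parameter count can be worked out from the configuration values alone. The sketch below assumes the standard GPT-2 layout (token and position embeddings, then per layer an attention block and a four-times-wider feed-forward block with layer norms, with the output layer sharing the token embedding weights); the function name is hypothetical.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Count trainable parameters of a GPT-2 style model (tied output layer)."""
    d = n_embd
    embeddings = vocab_size * d + n_positions * d
    attention = (d * 3 * d + 3 * d) + (d * d + d)        # QKV projection + output projection
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)  # expand to 4*d, project back to d
    layer_norms = 2 * (2 * d)                             # two layer norms per block
    per_layer = attention + feed_forward + layer_norms
    final_layer_norm = 2 * d
    return embeddings + n_layer * per_layer + final_layer_norm


if __name__ == "__main__":
    n = gpt2_param_count(vocab_size=20000, n_positions=1024, n_embd=768, n_layer=12)
    print(f"{n:,} parameters")
```

With a 20,000-token vocabulary this comes to roughly 101 million parameters, against about 124 million for the released small GPT-2 with its 50,257-token vocabulary. The number of heads (`n_head`) splits the attention computation but does not change the count.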

Configuration

Training a GPT-2 Model From Scratch

from aitextgen import aitextgen
from aitextgen.utils import GPT2Config

config = GPT2Config(vocab_size=20000,
                    n_positions=1024,
                    n_ctx=1024,
                    n_embd=768,
                    n_layer=12,
                    n_head=12,
                    bos_token_id=0,
                    eos_token_id=0,
                    max_length=1024,
                    dropout=0.0)

print(config)
"""
GPT2Config {"activation_function": "gelu_new",
            "attn_pdrop": 0.1,
            "bos_token_id": 0,
            "dropout": 0.0,
            "embd_pdrop": 0.1,
            "eos_token_id": 0,
            "initializer_range": 0.02,
            "layer_norm_epsilon": 1e-05,
            "max_length": 1024,
            "model_type": "gpt2",
            "n_ctx": 1024,
            "n_embd": 768,
            "n_head": 12,
            "n_layer": 12,
            "n_positions": 1024,
            "resid_pdrop": 0.1,
            "summary_activation": null,
            "summary_first_dropout": 0.1,
            "summary_proj_to_labels": true,
            "summary_type": "cls_index",
            "summary_use_proj": true,
            "vocab_size": 20000}
"""
 
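# Same tokenizer files and prompt as in training.py.
vocab_file = "aitextgen-vocab.json"
merges_file = "aitextgen-merges.txt"
prompt = "Le chien et le lion"

ai = aitextgen(model="./trained_model/pytorch_model.bin", vocab_file=vocab_file,
               merges_file=merges_file, config=config)

ai.generate(n=1, prompt=prompt, max_length=100, temperature=0.8, return_as_list=True)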
"""
See this article by Huggingface engineer Patrick von Platen for how sampling and
these parameters are used in practice.
 
n: Number of texts generated.
max_length: Maximum length of the generated text (default: 200; for GPT-2, the maximum is 1024.)
prompt: Prompt that starts the generated text and is included in the generated text. (used to be prefix in previous tools)
temperature: Controls the "craziness" of the text (default: 0.7)
top_k: If nonzero, limits the sampled tokens to the top k values. (default: 0)
top_p: If nonzero, limits the sampled tokens to the cumulative probability
 
Some lesser-known-but-still-useful-parameters that are unique to Transformers:
 
num_beams: If greater than 1, executes beam search for cleaner text.
repetition_penalty: If greater than 1.0, penalizes repetition in a text to avoid infinite loops.
length_penalty: If greater than 1.0, penalizes text proportional to the length
no_repeat_ngram_size: Token length to avoid repeating given phrases.
 
input_context = 'The dog'
input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context
"""
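
The effect of `temperature`, `top_k` and `top_p` is easier to see on a toy distribution. The sketch below is a plain-Python illustration of the usual definitions (divide the logits by the temperature, keep the k most likely tokens, then keep the smallest set whose cumulative probability reaches p). It is not the Transformers implementation, and the function name and example logits are made up.

```python
import math


def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=0.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    highest = max(scaled.values())
    weights = {tok: math.exp(value - highest) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda item: item[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    if top_p > 0.0:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept
    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in ranked)
    return {tok: prob / norm for tok, prob in ranked}


if __name__ == "__main__":
    logits = {"renard": 2.0, "corbeau": 1.5, "loup": 0.5, "fromage": -1.0}
    print(filtered_distribution(logits, temperature=0.8))
    print(filtered_distribution(logits, temperature=0.8, top_k=2))
    print(filtered_distribution(logits, temperature=0.8, top_p=0.9))
```

The next token is then drawn at random from the returned distribution, so lower temperatures and tighter `top_k` or `top_p` values make the output more predictable.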

La Labomedia emails

NFS cheat sheet

NFS share
