Understand and Modularize Generator Optimization in ELECTRA-style Pretraining

Chengyu Dong, Liyuan Liu, Hao Cheng, Jingbo Shang, Jianfeng Gao, Xiaodong Liu
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:8244-8259, 2023.

Abstract

Despite the effectiveness of ELECTRA-style pre-training, its performance depends on the careful selection of the auxiliary generator's model size, leading to high trial-and-error costs. In this paper, we present the first systematic study of this problem. Our theoretical investigation highlights the importance of controlling the generator capacity in ELECTRA-style training; meanwhile, we find that the original ELECTRA design fails to handle this properly, which causes the sensitivity issue. Specifically, because adaptive optimizers such as Adam largely cancel the weighting of individual losses in the joint objective, the original design cannot effectively control generator training through the loss weight. To regain control over the generator, we modularize generator optimization by fully decoupling the generator optimizer from the discriminator optimizer, instead of relying on a weighted combination of objectives. This simple technique significantly reduces the sensitivity of ELECTRA training and yields considerable performance gains over the original design.
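
The key change described above, replacing the single weighted objective with fully decoupled optimizers, can be illustrated with a minimal PyTorch-style sketch. This is only an assumption-laden illustration: the tiny linear modules, learning rates, and toy losses below are hypothetical stand-ins, not the paper's actual generator, discriminator, or hyperparameters.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the MLM generator and the replaced-token-detection
# discriminator; the real models are transformer encoders.
generator = nn.Linear(16, 16)
discriminator = nn.Linear(16, 2)

def toy_losses():
    # Dummy scalar losses standing in for the generator (MLM) loss and the
    # discriminator (RTD) loss.
    x = torch.randn(8, 16)
    return generator(x).pow(2).mean(), discriminator(x).pow(2).mean()

# Original-style joint step (for contrast): one optimizer over all parameters,
# with the discriminator loss scaled by a weight. Per the abstract, Adam's
# per-parameter normalization largely cancels the effect of this weight.
lam = 50.0
joint_opt = torch.optim.Adam(
    list(generator.parameters()) + list(discriminator.parameters()), lr=1e-4
)
gen_loss, disc_loss = toy_losses()
joint_opt.zero_grad()
(gen_loss + lam * disc_loss).backward()
joint_opt.step()

# Modularized step: each module gets its own optimizer, so generator training
# can be controlled directly through its own hyperparameters (e.g., a smaller
# learning rate), independent of the discriminator.
gen_opt = torch.optim.Adam(generator.parameters(), lr=5e-5)
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
gen_loss, disc_loss = toy_losses()
gen_opt.zero_grad()
disc_opt.zero_grad()
# The two toy modules share no parameters, so each loss only produces
# gradients for its own module; each optimizer then steps separately.
(gen_loss + disc_loss).backward()
gen_opt.step()
disc_opt.step()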

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-dong23c,
  title     = {Understand and Modularize Generator Optimization in {ELECTRA}-style Pretraining},
  author    = {Dong, Chengyu and Liu, Liyuan and Cheng, Hao and Shang, Jingbo and Gao, Jianfeng and Liu, Xiaodong},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {8244--8259},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/dong23c/dong23c.pdf},
  url       = {https://proceedings.mlr.press/v202/dong23c.html},
  abstract  = {Despite the effectiveness of ELECTRA-style pre-training, their performance is dependent on the careful selection of the model size for the auxiliary generator, leading to high trial-and-error costs. In this paper, we present the first systematic study of this problem. Our theoretical investigation highlights the importance of controlling the generator capacity in ELECTRA-style training. Meanwhile, we found it is not handled properly in the original ELECTRA design, leading to the sensitivity issue. Specifically, since adaptive optimizers like Adam will cripple the weighing of individual losses in the joint optimization, the original design fails to control the generator training effectively. To regain control over the generator, we modularize the generator optimization by decoupling the generator optimizer and discriminator optimizer completely, instead of simply relying on the weighted objective combination. Our simple technique reduced the sensitivity of ELECTRA training significantly and obtains considerable performance gain compared to the original design.}
}
Endnote
%0 Conference Paper
%T Understand and Modularize Generator Optimization in ELECTRA-style Pretraining
%A Chengyu Dong
%A Liyuan Liu
%A Hao Cheng
%A Jingbo Shang
%A Jianfeng Gao
%A Xiaodong Liu
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-dong23c
%I PMLR
%P 8244--8259
%U https://proceedings.mlr.press/v202/dong23c.html
%V 202
%X Despite the effectiveness of ELECTRA-style pre-training, their performance is dependent on the careful selection of the model size for the auxiliary generator, leading to high trial-and-error costs. In this paper, we present the first systematic study of this problem. Our theoretical investigation highlights the importance of controlling the generator capacity in ELECTRA-style training. Meanwhile, we found it is not handled properly in the original ELECTRA design, leading to the sensitivity issue. Specifically, since adaptive optimizers like Adam will cripple the weighing of individual losses in the joint optimization, the original design fails to control the generator training effectively. To regain control over the generator, we modularize the generator optimization by decoupling the generator optimizer and discriminator optimizer completely, instead of simply relying on the weighted objective combination. Our simple technique reduced the sensitivity of ELECTRA training significantly and obtains considerable performance gain compared to the original design.
APA
Dong, C., Liu, L., Cheng, H., Shang, J., Gao, J. & Liu, X. (2023). Understand and Modularize Generator Optimization in ELECTRA-style Pretraining. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:8244-8259. Available from https://proceedings.mlr.press/v202/dong23c.html.