While the advent of large language models (LLMs) has marked a transformative phase in AI, existing models often fall short of the specific needs of the public sector and other users in Europe.
Proprietary and third-party LLMs offer powerful capabilities but have limitations in language diversity, particularly for low-resource languages (those with fewer speakers and less linguistic data available).
While the exact data sources are rarely reported in detail, a widely used resource is Common Crawl, in which most EU languages are heavily underrepresented. For example, Latvian accounts for only 0.09% of the total dataset, Irish for 0.07% and Maltese for 0.03%; taken together, the least-represented half of all EU official languages accounts for merely 2.4%. Other challenges include data quality, copyright safety, transparency and freedom from bias.
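As a rough illustration of how such shares can be estimated (this is not how the figures above were produced), a sample of Common Crawl's extracted-text (WET) files can be scanned with an off-the-shelf language identifier. The model and file names below are placeholders.

```python
# Rough sketch: estimating per-language shares of a Common Crawl text (WET) sample
# with fastText language identification. File paths are placeholders.
from collections import Counter

import fasttext                                        # pip install fasttext
from warcio.archiveiterator import ArchiveIterator     # pip install warcio

lid = fasttext.load_model("lid.176.bin")               # fastText language-ID model
char_counts = Counter()

with open("CC-MAIN-sample.warc.wet.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":            # WET text records only
            continue
        text = record.content_stream().read().decode("utf-8", errors="ignore")
        if not text.strip():
            continue
        labels, _ = lid.predict(text.replace("\n", " ")[:2000])
        lang = labels[0].removeprefix("__label__")
        char_counts[lang] += len(text)

total = sum(char_counts.values())
for lang in ("lv", "ga", "mt"):                        # Latvian, Irish, Maltese
    print(f"{lang}: {100 * char_counts[lang] / total:.2f}% of sampled characters")
```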
Contributing to the European ecosystem of LLMs
The European Commission is working to address these limitations by using the high-quality multilingual data generated by the EU institutions to contribute to the European ecosystem of LLMs, with models better suited to the EU’s multilingual landscape.
This work is part of DG Translation’s partnership with the Directorate-General for Communications Networks, Content and Technology (DG CONNECT) for AI-based multilingual services under the Digital Europe programme.
AI for a multilingual Europe
Models that cover only a limited number of languages and underperform on low-resource languages are major obstacles for multilingual organisations and societies. This is particularly relevant for a multilingual Europe and for European AI projects that require a broad range of EU languages.
- The future EU institutional LLM – enhanced with formal texts from the EU institutions – will be able to power our existing AI-based multilingual services.
- It is intended to be a powerful addition to European sovereign AI, complementing existing efforts and contributing to a diverse landscape of European solutions. Its enhanced EU language capabilities and EU knowledge will be better tailored for use by EU public administrations, small businesses, academia and non-governmental organisations.
- The EU LLM project is a stepping stone in the EU’s ambitions to become a major player in AI innovation and strategic technologies, while being able to rely on its own digital systems and tools.
Creating a high-quality EU LLM
To realise this vision, DG Translation’s expert engineers are carrying out continued pre-training of existing open-source LLMs, with the goal of improving their multilingual capabilities. This work draws on 2 key European assets:
- the supercomputers provided by the European High Performance Computing Joint Undertaking (EuroHPC JU)
- the datasets of the European Advanced Multilingual Information System (Euramis), a unique and voluminous corpus of multilingual text from all the EU institutions
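Euramis stores aligned translation segments, so one plausible first step is to split an aligned export into per-language plain text for pre-training. The sketch below assumes a TMX export, a common exchange format for translation memories; the file name and format are illustrative assumptions, not a description of the actual pipeline.

```python
# Illustrative only: splitting aligned translation-memory segments (TMX export)
# into per-language plain-text files for continued pre-training.
# The input file name is a placeholder, not an actual Euramis export.
import xml.etree.ElementTree as ET
from pathlib import Path

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"   # the xml:lang attribute
out_dir = Path("pretraining_text")
out_dir.mkdir(exist_ok=True)

outputs = {}                                              # language code -> open file handle
for tu in ET.parse("euramis_sample.tmx").getroot().iter("tu"):
    for tuv in tu.iter("tuv"):
        lang = (tuv.get(XML_LANG) or tuv.get("lang") or "und").lower()[:2]
        seg = (tuv.findtext("seg") or "").strip()
        if not seg:
            continue
        if lang not in outputs:
            outputs[lang] = (out_dir / f"{lang}.txt").open("a", encoding="utf-8")
        outputs[lang].write(seg + "\n")

for handle in outputs.values():
    handle.close()
```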
Better coverage of EU languages
In addition, DG Translation’s proximity to language professionals and their direct feedback give us a major advantage in preparing data and evaluating models. The resulting models are expected to demonstrate better coverage of all EU official languages and an improved ability to handle EU topics.
At various stages of the EU LLM project, the European Commission has been able to access supercomputing infrastructure through 3 projects with the EuroHPC JU. First, the goal was to develop a cutting-edge skillset and demonstrate the ability to train large AI models on the MeluXina supercomputer in Luxembourg. This was followed by more advanced and intensive training of large language models on the Leonardo supercomputer in Bologna.
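Both systems are batch-scheduled with SLURM, so a multi-node training job typically derives its rank and world size from the scheduler’s environment. The sketch below shows one common way to do this with PyTorch; it assumes the job script exports MASTER_ADDR (and optionally MASTER_PORT) for the rendezvous and is not taken from the project’s actual set-up.

```python
# Sketch: initialising multi-node, multi-GPU training from SLURM environment variables,
# as commonly done on SLURM-managed supercomputers. Assumes the job script exports
# MASTER_ADDR (and optionally MASTER_PORT) for the rendezvous.
import os

import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])          # global rank of this process
world_size = int(os.environ["SLURM_NTASKS"])    # total number of processes
local_rank = int(os.environ["SLURM_LOCALID"])   # index of this process on its node

torch.cuda.set_device(local_rank)               # bind the process to one GPU
dist.init_process_group(
    backend="nccl",                             # GPU-to-GPU collectives
    rank=rank,
    world_size=world_size,
    init_method=f"tcp://{os.environ['MASTER_ADDR']}:{os.environ.get('MASTER_PORT', '29500')}",
)

if rank == 0:
    print(f"Initialised {world_size} processes for distributed training")
```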
The EU LLM will be built with European technology at its core, which is why an existing European open-source LLM has been selected for the project. Models created by Mistral AI (Mixtral 8x7B and 8x22B) are being enhanced with our Euramis language data.
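As a minimal sketch of what continued pre-training of such a checkpoint can look like with the open-source Hugging Face stack (the project’s actual code, data and hyperparameters are not described here), the outline below further trains a Mixtral base model on plain-text files; all paths and settings are placeholders.

```python
# Minimal, illustrative continued pre-training of an open-source Mixtral checkpoint
# on plain-text files (e.g. the per-language files from the data-preparation sketch).
# Hyperparameters and paths are placeholders, not the project's settings.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "mistralai/Mixtral-8x7B-v0.1"
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token                       # the tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

data = load_dataset("text", data_files={"train": "pretraining_text/*.txt"})["train"]
data = data.map(lambda batch: tok(batch["text"], truncation=True, max_length=2048),
                batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="eu-llm-checkpoints",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=32,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=data,
    # Causal-LM objective: every parameter of the model is updated, which is what
    # distinguishes continued pre-training from narrow fine-tuning.
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```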
What are the main features of DG Translation’s project?
- Comprehensive inclusion of all 24 EU official languages
Each low-resource language is represented with at least 1 billion tokens (units of text); a rough token-count sketch follows after this list.
- Bottom-up approach
The Euramis data used is meticulously curated in line with stringent quality standards. It is aligned with core European values and devoid of any copyright infringements.
- Pre-training approach
The project focuses on pre-training a state-of-the-art LLM so that the whole model is updated (rather than being fine-tuned, which would narrow its scope).
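As a rough illustration of the 1-billion-token floor mentioned in the first feature above, per-language corpora can be tokenised, compared against the target and upsampled where they fall short. The tokenizer choice and file layout below are assumptions made for the sake of the example.

```python
# Illustrative check of a per-language token floor: count tokens in each language's
# corpus and derive an upsampling factor where it falls below 1 billion tokens.
# Tokenizer choice and file layout are placeholders.
from pathlib import Path

from transformers import AutoTokenizer

TARGET_TOKENS = 1_000_000_000                          # at least 1 billion tokens per language
tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

for path in sorted(Path("pretraining_text").glob("*.txt")):
    n_tokens = 0
    with path.open(encoding="utf-8") as f:
        for line in f:
            n_tokens += len(tok(line, add_special_tokens=False).input_ids)
    factor = max(1.0, TARGET_TOKENS / max(n_tokens, 1))
    print(f"{path.stem}: {n_tokens:,} tokens -> upsample x{factor:.1f} to reach the floor")
```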