Technology Innovation Institute (TII), a global research centre, has launched NOOR, the world’s largest Arabic natural language processing (NLP) model to date. The NOOR model carries out a range of cross-domain tasks simply from natural language instructions.
To build NOOR, researchers at TII designed an end-to-end pipeline for the collection of high-quality data, including crawling, filtering, and curation at scale. TII’s experts also built optimized services for extreme-scale distributed training and serving, to deliver applications with efficient inference and model specialization.
TII’s team of advanced researchers and experts at its Artificial Intelligence (AI) Cross-Centre Unit joined forces on this initiative with LightOn, a technology company that unlocks extreme-scale machine intelligence for businesses, to revolutionize Arabic NLP models.
Prof. Mérouane Debbah, Chief Researcher, Digital Science Research Centre and AI Cross-Centre Unit, TII, stated: “With NOOR, TII has expanded the scope of the modern standard Arabic model by leveraging know-how in large language models to build cross-disciplinary, cutting-edge expertise in this new generation of AI research.”
NOOR’s training dataset is the world’s largest high-quality cross-domain Arabic dataset, combining web data with books, poetry, news articles, and technical information to significantly widen the applicability of the model.
Dr. Ebtesam Almazrouei, Director, AI Cross-Centre Unit, TII, stated: “Large language models have taken the world of natural language processing by storm, and we are proud to introduce this cutting-edge model with 10 billion parameters – the world’s largest Arabic NLP model. The uniquely large Arabic dataset collected to train the model is the result of months of work that included curating, scraping, and filtering of varied sources.”
Dr. Almazrouei pointed out that the NOOR model is based on the widely used Transformer architecture. As a decoder-only model, similar in structure to GPT-3, it is designed to tackle generative tasks, with an architecture upgraded to reflect the latest developments in machine learning, including improvements such as better positional embeddings. To help ensure quality at scale in the NOOR dataset, the TII team designed an automated filtering pipeline based on machine learning techniques. These tools identify text similar to high-quality references and safeguard the model from exposure to spam content.
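As a rough illustration of filtering by similarity to quality references, the sketch below scores each document against a small set of reference texts and keeps those above a threshold. It is only a toy stand-in: the article does not describe TII's actual machine learning pipeline, and the bag-of-words cosine similarity, the function names, and the `threshold` value are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens (\\w+ also matches Arabic letters in Python 3)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_document(text, references, threshold=0.2):
    """Keep a document if it resembles at least one quality reference.

    The threshold is a hypothetical value chosen for illustration only.
    """
    vec = bag_of_words(text)
    best = max((cosine(vec, bag_of_words(r)) for r in references), default=0.0)
    return best >= threshold


references = ["the model is trained on books and news articles"]
candidates = [
    "news articles and books train the model",
    "buy cheap pills now click here",
]
kept = [doc for doc in candidates if keep_document(doc, references)]
```

A production pipeline would more plausibly use a trained classifier than raw token overlap, but the shape is the same: score each crawled document against trusted material and drop whatever falls below a cutoff.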
Leveraging state-of-the-art 3D parallelism, NOOR was trained on a High-Performance Computing resource with 128 A100 GPUs, allowing for the distribution of computations and ensuring efficient use of the available hardware resources.
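3D parallelism generally means splitting work along three axes at once: data, tensor, and pipeline parallelism, with the three degrees multiplying to the total GPU count. The sketch below enumerates the ways 128 GPUs could be factored this way and maps a global GPU rank to a position in the resulting grid. The article does not state which layout NOOR used; the cap of 8 on the tensor-parallel degree and the rank ordering are assumptions for this example.

```python
def parallel_layouts(world_size, max_tensor=8):
    """List (data, tensor, pipeline) degrees whose product equals world_size.

    max_tensor is an assumed cap on the tensor-parallel degree.
    """
    layouts = []
    for tensor in range(1, max_tensor + 1):
        if world_size % tensor:
            continue
        rest = world_size // tensor
        for pipeline in range(1, rest + 1):
            if rest % pipeline == 0:
                layouts.append((rest // pipeline, tensor, pipeline))
    return layouts


def rank_to_coords(rank, tensor, pipeline):
    """Map a global rank to (data, pipeline, tensor) indices.

    Assumed ordering: the tensor index varies fastest, then pipeline, then data.
    """
    tensor_idx = rank % tensor
    pipeline_idx = (rank // tensor) % pipeline
    data_idx = rank // (tensor * pipeline)
    return data_idx, pipeline_idx, tensor_idx


layouts = parallel_layouts(128)
# One hypothetical layout: 16 data-parallel replicas, each split 4 ways
# by tensor parallelism and 2 ways by pipeline parallelism.
coords = rank_to_coords(127, tensor=4, pipeline=2)
```

Choosing among such layouts is a trade-off between communication cost and memory per GPU, which is why the best factorization depends on the model size and the hardware interconnect.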
Dr. Almazrouei also noted that this was only the first step in TII’s efforts to contribute to the wider UAE Strategy for Artificial Intelligence, through supporting AI integration across key sectors of the economy.