A Not-so-Dangerous AI in the Persian Language

Afshin Khashei
10 min read · Mar 9, 2020
Persian Calligraphy — Ali Shirazi

If you haven’t heard the story of the AI that was “too dangerous to release”, get yourself a large bag of popcorn and continue reading.

About one year ago, on February 14, 2019, OpenAI published a paper about their new language model, GPT-2. GPT-2 could generate coherent paragraphs of text at a quality that had not been seen before. It was trained on a 40GB text corpus extracted from eight million web pages. At that time, OpenAI decided not to release the model due to concerns about malicious applications of such a powerful technology.

Looking at one of the examples they published in the initial announcement explains why they were worried. Remember that GPT-2 is a language model, so it predicts (i.e. generates) a plausible next word given the previous words of a text:

INPUT TEXT (HUMAN-WRITTEN)
Recycling is good for the world.
NO! YOU COULD NOT BE MORE WRONG!!

MODEL COMPLETION (MACHINE-WRITTEN, Best of 25 TRIES)
Recycling is NOT good for the world. It is bad for the environment, it is bad for our health, and it is bad for our economy. I’m not kidding. Recycling is not good for the environment. It is destructive to the earth and it is a major contributor to global warming. Recycling is not good for our health. It contributes to obesity and diseases like heart disease and cancer. Recycling is bad for our economy. It increases the cost of a product, and in turn, the price of everything that is made with that product. Recycling is not good for our nation. We pay a tremendous price for the privilege of having the world’s most advanced and efficient recycling system. Recycling is a huge, colossal waste of time, energy, money, and resources. And THAT is why we need to get back to basics and get back to basics in our recycling efforts. One of the best ways to start is to look at the process of creating a paper product. When you make a paper product, it is basically a long chain of materials. Everything from the raw materials (wood, cardboard, paper, etc.), to the reagents (dyes, solvents, etc.) to the printing equipment (chemicals, glue, paper, ink, etc.), to the packaging, to the packaging materials (mercury, chemicals, etc.) to the processing equipment (heating, cooling, etc.), to the packaging materials, to the packaging materials that are shipped overseas and to the packaging materials that are used in the United States. Each step along the way creates tons of waste that we constantly have to clean up. The process of making a paper product is a very wasteful one. But the end result is something that all of us need to consume. And if we want to keep the recycling process running efficiently, then we really need to think about each and every step that goes into making a paper product.

Interesting, isn’t it? Due to the probabilistic approach used to generate the text, the model can produce a completely different output on each execution (inference). This example, and the other examples published here, were hand-picked by a human from multiple attempts to generate a continuation of the same input text, but without any further editing or refinement. The example above is the best of 25.
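To make the sampling process and the “best of 25 tries” selection concrete, here is a minimal sketch (not the author’s code) that draws several continuations from the publicly released English GPT-2 using the Hugging Face transformers library. Because every next token is sampled from the predicted distribution, each run produces different text, and a human can then pick the best sample.

```python
# Minimal sampling sketch with the public English GPT-2 checkpoint
# (illustrative only; the model name and generation settings are assumptions).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Recycling is good for the world.\nNO! YOU COULD NOT BE MORE WRONG!!"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# do_sample=True draws each next token from the predicted distribution,
# so the 25 continuations below will all differ from each other.
outputs = model.generate(
    input_ids,
    do_sample=True,
    top_k=40,
    max_length=200,
    num_return_sequences=25,
    pad_token_id=tokenizer.eos_token_id,
)
for i, sample in enumerate(outputs):
    print(f"--- sample {i + 1} ---")
    print(tokenizer.decode(sample, skip_special_tokens=True))
```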

After this, many people who were pissed off by “Open”AI not releasing their model started chatting about the possibility of replicating their work. Apart from some details missing from the initial paper, the main challenge was the training cost. OpenAI didn’t specify how many hours they spent on training, but the hardware they used cost about US$256 an hour to rent at that time (source). Considering the model size, the total cost of running one training session was estimated to be around US$50K, but with the trial and error required to debug and find the best hyperparameters, pre-training GPT-2 in a different language could cost up to US$500K. A pretty disappointing and heartbreaking fact for many people.

Finally, in November 2019, after a lot of online philosophical debate around AI ethics, OpenAI changed its mind and decided to go ahead and release its X-Large pre-trained model: a 48-layer model with 1.5 billion parameters.

Thanks to Adam D King’s online tool, everyone can now Talk To the GPT-2 Transformer. This online tool is based on GPT-2 X-Large, the model that was originally used to generate the example above. I do not believe in fortune telling, but once in a while I go to this website and give it an opening line about a friend, or one of the big philosophical questions, as input text to see what GPT-2 has to say. The result is usually entertaining!

Pre-training GPT-2 Persian

GPT-2 is a Transformer-based language model. Google introduced the Transformer architecture in June 2017, in an influential paper with a funny title: “Attention Is All You Need”. The key building block of the Transformer is “self-attention”. Self-attention has put recurrent neural networks (RNNs) on the deprecation path for many natural language processing (NLP) applications.

While both RNNs and Transformers are suitable for processing sequences (e.g. text as a sequence of words), the Transformer model can deal with longer sequences much better than an RNN (on the order of thousands of words versus hundreds).

On the other hand, the Transformer model requires more memory and processing power than an RNN for the same input length. At the same time, the Transformer architecture allows better parallelization, which leads to better utilization of powerful GPUs. An RNN, by nature, requires sequential computation that cannot be parallelized.
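As a rough illustration of what self-attention computes, here is a minimal NumPy sketch of single-head scaled dot-product attention (no masking and no multi-head split, so it is only a simplified picture of what GPT-2 actually uses). Note that the whole sequence is processed with a few matrix products, which is what makes the architecture so easy to parallelize on GPUs and TPUs.

```python
# Single-head scaled dot-product self-attention, illustrative only.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # how much each position attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ v                               # weighted mix of value vectors

seq_len, d_model, d_k = 5, 16, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
out = self_attention(x,
                     rng.normal(size=(d_model, d_k)),
                     rng.normal(size=(d_model, d_k)),
                     rng.normal(size=(d_model, d_k)))
print(out.shape)  # (5, 8): one output vector per input position
```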

Despite knowing about the unaffordable (>US$50K) training cost, I wanted to get one of these models trained for the Persian language. Before this, I had trained Persian language models for controlled text generation research using RNNs (see: Bobol-Zaban), but a Transformer could push my results to a whole new level.

Before everything else, I needed a large dataset. GPT-2 was trained on a 40GB text corpus, but there was no open-source dataset for the Persian language anywhere near this size. The largest public dataset was the Persian Wikipedia dump, at ~650 MB. There were a couple of other datasets owned by private research groups that I couldn’t get access to, but even all together they were not close to 40GB.

Using a crawler pipeline built on top of Scrapy and MongoDB, I started creating my own dataset. I downloaded more than 11 million web pages and extracted their Persian text content. After spending a few weeks cleaning up the documents, normalizing the text, and filtering out duplicate and short content, a 27GB dataset of raw Persian text was ready. The data was collected from 20 different sites, ranging from blogs to news websites.
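The cleaning and filtering steps can be sketched roughly like this. It is an illustrative sketch, not the actual pipeline, and the MongoDB collection and field names in the usage comment are hypothetical.

```python
# Rough sketch of post-crawl cleanup: normalize whitespace, drop short
# documents, and drop exact duplicates via content hashing (assumptions only).
import hashlib
import re

def clean_corpus(documents, min_chars=500):
    seen_hashes = set()
    for text in documents:
        text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
        if len(text) < min_chars:                    # filter out short content
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:                    # filter out exact duplicates
            continue
        seen_hashes.add(digest)
        yield text

# Hypothetical usage with documents stored by the crawler in MongoDB:
# cleaned = clean_corpus(doc["text"] for doc in db.pages.find())
```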

Now the next problem was the computation cost. The memory requirement for a Transformer model depends on many factors:

  • The length of the input context: GPT-2 was trained to generate the next word by taking the previous 1024 subwords as input.
  • Model size: GPT-2 has been released in four different sizes, with 12 to 48 layers and 124M to 1.5B parameters.
  • Training batch size: A higher batch size usually gives better results.
  • The number of subwords (dictionary size): GPT-2 uses byte pair encoding (BPE), a technique that breaks longer words into smaller pieces and builds a dictionary of valid subwords to encode the input. For instance, a word like “training” can be encoded into two subwords, “train” and “ing”. BPE does this at the Unicode byte level and supports all Unicode languages. GPT-2 used a dictionary of ~50,000 subwords (see the sketch after this list).
  • Floating-point precision: 16-bit vs. 32-bit.
  • Optimizer algorithm: Different optimizers have different memory footprints.
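Here is a minimal sketch of the subword idea, learning a byte-level BPE vocabulary with the Hugging Face tokenizers library. The corpus path and vocabulary size are placeholders, and this is not the exact tooling OpenAI used for GPT-2.

```python
# Learn a byte-level BPE subword vocabulary from raw text (illustrative only).
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["persian_corpus.txt"],  # placeholder path to raw text
                vocab_size=50000,
                min_frequency=2)

encoding = tokenizer.encode("بازیافت زباله")
print(encoding.tokens)      # learned subword pieces (depend on the training data)
print(encoding.ids)         # the integer ids that are fed to the model
tokenizer.save_model(".")   # writes vocab.json and merges.txt
```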

I looked into many options for all of the above parameters. I reduced the input length to 128 and 256 subwords, used the small and medium GPT-2 hyperparameters, reduced the batch size, normalized the input text, reduced the dictionary size, switched the precision to 16-bit, and tried different optimizers, but no matter what I did, I couldn’t get a meaningful result on the 6GB GPU that I had at home. The smaller models were not of good quality, and the larger models either didn’t fit into memory or didn’t converge fast enough within a few days at the very small batch size that I could use.

After a few months of banging my head against the wall, I came across the Google TFRC program, which supports researchers with free Cloud TPU hours. Tensor Processing Units (TPUs) are Google’s custom-developed hardware, optimized to accelerate machine learning workloads. I applied for the program and got access to five beautiful TPU v3-8 instances with 128GB of memory each. GPT-2 was trained on a TPU v3-256 with 4096GB of memory. Still far from what I had, but thanks to Google’s generous offer, plus a few hundred dollars of my own money to pay for the VM and storage, I was back in the game!

The 128GB of memory that I had with a TPU v3-8 was still not enough to train a 1.5B-parameter GPT-2 like the OpenAI version, but after trying different options, I managed to train a model with a hyperparameter set similar to GPT-2 medium (355M parameters) and a smaller input size (256). Instead of BPE, I used unigram encoding with 25,000 subwords optimized for Persian, along with the largest batch size that fit into the 128GB of memory. Training this model took more than a month. The result was amazing, sometimes scary, but not as dangerous as GPT-2. It’s just a good start.
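The article does not name the tool used to build the unigram vocabulary; SentencePiece is one common choice, so the following is a hedged sketch under that assumption, with a placeholder corpus path and model prefix.

```python
# Train a 25,000-subword unigram tokenizer with SentencePiece (assumed tooling).
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    "--input=persian_corpus.txt "     # placeholder path to the raw text corpus
    "--model_prefix=fa_unigram "      # writes fa_unigram.model and fa_unigram.vocab
    "--vocab_size=25000 "
    "--model_type=unigram "
    "--character_coverage=0.9995"     # keep nearly all Persian characters
)

sp = spm.SentencePieceProcessor()
sp.Load("fa_unigram.model")
print(sp.EncodeAsPieces("بازیافت زباله کاری است"))  # subword pieces
print(sp.EncodeAsIds("بازیافت زباله کاری است"))     # integer ids for training
```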

If you can’t read and write Persian, I’m afraid you’ll have to miss the fun part. The examples below are the top three results out of 25 generated samples for the same input (about recycling). The inference process on a Core i7 laptop with 16GB of RAM and a 4GB GPU is pretty slow: it takes about 10 seconds for a batch of five samples.

First Sample:

INPUT TEXT (HUMAN-WRITTEN)

آیا بازیافت زباله کاری است که به بهبود محیط زیست می‌انجامد ؟

MODEL COMPLETION (MACHINE-WRITTEN, Best of 25 TRIES)

به نظر می‌رسد که پاسخ مثبت است .
زباله‌های شهری به دلایل مختلف ، بسیاری از افراد را در معرض آلودگی قرار می‌دهد .
بسیاری از این زباله‌ها غیرقابل بازیافت هستند .
امروزه ، بازیافت زباله به یکی از دغدغه‌های اصلی جوامع بشری تبدیل شده است .
در کشورهای توسعه یافته ، بازیافت زباله‌ها ، یکی از مسائل اساسی در برنامه‌ریزی‌های شهری است .
در کشورهای در حال توسعه هم ، بازیافت زباله‌ها ، در دستور کار قرار دارد .
یکی از نکات مهم در بحث بازیافت زباله ، این است که این زباله‌ها را در چه مکان‌هایی می‌توان بازیافت کرد ؟
در کشور ما ، به دلیل اینکه کشور ما فاقد قانون و ضابطه تفکیک زباله است و در این زمینه ، قوانین و ضابطه‌ای وجود ندارد ، بازیافت زباله‌های شهری ، به عنوان یک چالش مطرح می‌شود .
در بسیاری از کشورها ، زباله‌های شهری را به شکل‌های گوناگون در معابر عمومی می‌ریزند ، در حالی که زباله‌های تفکیک شده در محل‌های مشخص و مشخصی ، تفکیک و نگهداری می‌شود .

Second Sample:

INPUT TEXT (HUMAN-WRITTEN)

آیا بازیافت زباله کاری است که به بهبود محیط زیست می‌انجامد ؟

MODEL COMPLETION (MACHINE-WRITTEN, Best of 25 TRIES)

پاسخ این است که بازیافت زباله‌ها وظیفه‌ای است که هیچ کس نمی‌تواند از عهده آن برآید .
به عبارت دیگر هیچ کسی نمی‌تواند محیط زیست را نجات دهد .
با گسترش و توسعه تجارت زباله ، انسان‌ها به تدریج زباله را به زباله دان خانه‌ها ، ادارات ، فروشگاه‌ها ، کارخانجات و سایر جای جای جهان می‌آورند .
اما از آنجا که از نظر زیست محیطی ، زباله‌ها به عنوان زباله‌های خانگی محسوب می‌شوند ، باید به این نکته توجه داشت که مدیریت صحیح و صحیح زباله‌های خانگی ، می‌تواند به حفظ محیط زیست کمک کند

Third Sample:

INPUT TEXT (HUMAN-WRITTEN)

آیا بازیافت زباله کاری است که به بهبود محیط زیست می‌انجامد ؟

MODEL COMPLETION (MACHINE-WRITTEN, Best of 25 TRIES)

بعید به نظر میرسد .اما به نظر میرسد این موضوع بیشتر به یک شوخی شبیه باشد تا یک واقعیت .در این مقاله قصد داریم که به این پرسش بپردازیم .
این پرسش در باره بازیافت زباله‌ها و چگونگی کاهش آن است .
این موضوع در واقع به یک موضوع کاملا علمی تبدیل شده است .
در این مقاله سعی داریم به سوالاتی که در این زمینه مطرح شده است ، پاسخ دهیم .
از جمله این که آیا بازیافت زباله‌ها باید در مکان هایی که افراد کمتری در آنجا زندگی می‌کنند انجام شود یا خیر ؟
برای این که در این زمینه بتوانیم به پاسخ هایی مناسب دست یابیم ، می‌خواهیم به چند سوال زیر پاسخ دهیم .
چه باید کرد ؟
در این مقاله سعی داریم به این سوال پاسخ دهیم که آیا باید زباله‌ها را با مواد شیمیایی ، بهداشتی و یا آب به طور کامل تمیز کرد ؟
این موضوع فقط در مورد زباله‌های پلاستیکی صدق میکند یا مواد پلاستیکی ، شیمیایی و یا آب ؟
آیا می‌توان این زباله‌ها را با مواد شیمیایی ، بهداشتی و یا آب تمیز کرد ؟
چگونه میتوان از این زباله‌ها در مکانی که افراد کمتری در آنجا زندگی می‌کنند استفاده کرد ؟

Conclusion

Today, pre-training a language model like GPT-2 for the Persian language is far beyond the reach of individual researchers and small organizations, and it is hard to say when this is going to become easier. Just a couple of weeks after Google introduced Reformer, a new model that could reduce the memory requirements of future language models, Microsoft introduced a new language model called Turing-NLG, with 17B parameters; 11 times larger than GPT-2. Meanwhile, it is possible to train smaller Transformer-based models for specific use cases or toy projects.

Acknowledgment

This project is supported by Cloud TPUs from Google’s TensorFlow Research Cloud (TFRC).

Links (updated on Dec 2020)
