Dynamic normalization of microtexts using SPMAT

Norlela Samsudin; Mazidah Puteh; Ahmad Nazmi Fadzal; Mohd Tajul Hasnan Mohd Tajuddin

Advances in communication and Internet technology have led to a mass of data being available online. People communicate with each other through applications such as Facebook, Twitter, Short Message Service and e-forums. Entries or posts from these applications are known as microtexts. A microtext is normally very short, very noisy and does not follow correct sentence structure in either the English language or the Malay language. A high occurrence of noisy texts reduces accuracy when microtexts are processed. This paper proposes a prototype of a system known as Sistem Penterjemahan Mesej Atas Talian (SPMAT). The objective of the system is to 'clean' noisy texts in microtexts that are created online by Malaysians. A total of 5,000 Facebook messages, 5,000 Twitter messages and 5,000 e-forum messages were collected. From these sources, several lists, such as a common noisy texts list, a common acronyms list, an artificial abbreviations list and a bi-gram index, were created and used in the normalization process. In addition, the system kept the messages that it normalized and updated the corpus when instructed by the user. The methods used in SPMAT were tested with 100 online messages. The results indicated that 80% of the noisy texts in these messages were identified and cleaned correctly.
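The abstract does not give implementation details, so the following is only a minimal sketch of how list-based normalization with a bi-gram index and a user-triggered corpus update might look. The names `NOISY_TERMS`, `BIGRAMS`, `normalize` and `update_corpus`, as well as every list entry and count, are hypothetical illustrations and are not taken from SPMAT.

```python
from collections import Counter

# Hypothetical lookup list: noisy token -> candidate clean forms.
# A real system would build this from the collected messages.
NOISY_TERMS = {
    "yg": ["yang"],
    "sy": ["saya"],
    "dgn": ["dengan"],
    "x": ["tidak"],
    "mkn": ["makan", "makanan"],  # ambiguous: needs context to resolve
}

# Hypothetical bi-gram index: (previous word, word) -> observed count.
BIGRAMS = Counter({
    ("saya", "makan"): 5,
    ("saya", "makanan"): 1,
})


def normalize(message, noisy_terms=NOISY_TERMS, bigrams=BIGRAMS):
    """Replace known noisy tokens, using bi-gram counts to pick a candidate."""
    cleaned = []
    for token in message.lower().split():
        candidates = noisy_terms.get(token)
        if not candidates:
            cleaned.append(token)  # unknown token: keep it unchanged
            continue
        previous = cleaned[-1] if cleaned else None
        # Choose the candidate seen most often after the previous clean word;
        # on a tie, max() keeps the first candidate in the list.
        best = max(candidates, key=lambda c: bigrams[(previous, c)])
        cleaned.append(best)
    return " ".join(cleaned)


def update_corpus(message, bigrams=BIGRAMS):
    """Add the bi-grams of a user-approved clean message to the index."""
    tokens = message.lower().split()
    bigrams.update(zip(tokens, tokens[1:]))
```

In this sketch, `normalize("sy mkn dgn kawan")` resolves the ambiguous token `mkn` to `makan` because that pair has the higher count after `saya`. Calling `update_corpus` on approved messages shifts those counts over time, which is one plausible reading of the "dynamic" aspect described above.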
