News
- Inauguration of the CLSP.ORG platform.
The time has come to provide a platform through which academic researchers across the globe can strengthen their abilities in computational models, logic theories, scientific studies and other practices for the benefit of society, culture and countries….
- CLSP Project Progressing.
CLSP (a non-profit organization) was initiated with minimal resources and only for natural languages. The indigenous and foreign experts involved in its formation then contributed, and the platform was extended to Computation, Logic, Science & Practices. It …
Urdu Parser
- Introduction
- This project develops a parser for Urdu, a South Asian language. The parser is expected to have a significant impact on the state of the art in automatic Urdu language processing. The project started in 2011 and is still in progress. By-products of the project to date include a semi-semantic part-of-speech tagset, a semi-semantic syntactic tagset, a functional tagset, annotation guidelines, a grammar that encodes sufficient information for parsing the morphologically rich language Urdu, and a part-of-speech tagged corpus that can be used to train part-of-speech taggers. These resources will be enhanced further and can be used for natural language processing tasks such as probabilistic parsing, training of POS taggers, disambiguation of spoken sentences, grammar development, language identification, sources for linguistic inquiry and psychological modeling, pattern matching, etc. The resources developed to date have been published and will be available in the respective sections below.
- Urdu Parser
- For the development of the Urdu parser, the 1400 annotated sentences of the URDU.KON-TB treebank were divided into 80% training data and 20% test data. A context-free grammar was extracted from the training data and given to the Urdu parser once the parser had been developed. The 20% test portion was in turn split into 10% held-out data and 10% final test data (each 10% of the full treebank), so the final test set contains 140 sentences with an average length of 13.73 words per sentence. The held-out data was used during the development of the parser. The Urdu parser is an extended version of the dynamic programming algorithm known as the Earley parsing algorithm. The extensions made, along with the issues faced during development, are presented in the published work below. All items that can occur in normal text have been considered, e.g., punctuation, null elements, diacritics, headings, regard titles, Hadees (the statements of prophets), anaphora within a sentence, and others. The PARSEVAL measures are used to evaluate the results of the Urdu parser. By applying a sufficiently rich grammar along with the extended parsing model, the parser achieves an f-score of 87% and outperforms the other parsers in its domain. The publications regarding this Urdu parser are as follows:
- Abbas, Q. 2015. Morphologically rich Urdu grammar parsing using Earley algorithm. Natural Language Engineering (NLE), Vol. 21(2), pp. 1-36, ISSN: 1351-3249, DOI: 10.1017/S1351324915000133, Cambridge University Press, UK.
- Abbas, Q. 2014. Building Computational Resources: The URDU.KON-TB Treebank and the Urdu Parser. Doctoral Dissertation, University of Konstanz, Germany.
- Abbas, Q. 2014. Exploiting Language Variants Via Grammar Parsing Having Morphologically Rich Information. In Proceedings of the EMNLP'2014 Language Technology for Closely Related Languages and Language Variants, Association for Computational Linguistics, pp. 35-45, Qatar.
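For readers unfamiliar with the Earley algorithm that the Urdu parser extends, the following is a minimal sketch of the plain textbook recognizer in Python. It is not the extended parser described in the publications above: the grammar, the lexicon, the category labels (`S`, `NP`, `VP`, `N`, `V`, `AUX`) and the romanized example words are all hypothetical illustrations, not taken from URDU.KON-TB or its tagsets. The sketch also assumes a grammar without empty right-hand sides, so it does not handle null elements.

```python
from collections import namedtuple

# An Earley item: rule lhs -> rhs, position of the dot, and the chart
# index at which recognition of this rule started.
Item = namedtuple("Item", "lhs rhs dot origin")


def earley_recognize(tokens, grammar, lexicon, start="S"):
    """Return True if `tokens` can be derived from `start`.

    grammar: dict mapping a nonterminal to a list of right-hand sides
             (tuples of nonterminals and POS tags). Assumption: no
             empty right-hand sides.
    lexicon: dict mapping a word to the set of POS tags it may carry.
    """
    n = len(tokens)
    chart = [[] for _ in range(n + 1)]
    seen = [set() for _ in range(n + 1)]

    def add(i, item):
        # Each item is stored at most once per chart position.
        if item not in seen[i]:
            seen[i].add(item)
            chart[i].append(item)

    for rhs in grammar.get(start, []):
        add(0, Item(start, rhs, 0, 0))

    for i in range(n + 1):
        j = 0
        while j < len(chart[i]):  # chart[i] may grow while we scan it
            item = chart[i][j]
            j += 1
            if item.dot < len(item.rhs):
                nxt = item.rhs[item.dot]
                if nxt in grammar:
                    # Predictor: expand the expected nonterminal.
                    for rhs in grammar[nxt]:
                        add(i, Item(nxt, rhs, 0, i))
                elif i < n and nxt in lexicon.get(tokens[i], ()):
                    # Scanner: the next word carries the expected POS tag.
                    add(i + 1, item._replace(dot=item.dot + 1))
            else:
                # Completer: advance every item that was waiting for item.lhs.
                for parent in list(chart[item.origin]):
                    if (parent.dot < len(parent.rhs)
                            and parent.rhs[parent.dot] == item.lhs):
                        add(i, parent._replace(dot=parent.dot + 1))

    return any(it.lhs == start and it.dot == len(it.rhs) and it.origin == 0
               for it in chart[n])


# Hypothetical toy grammar with verb-final order; labels are illustrative only.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("N",)],
    "VP": [("NP", "V", "AUX"), ("V", "AUX")],
}
# Hypothetical toy lexicon over romanized words.
LEXICON = {
    "larka": {"N"},
    "kitab": {"N"},
    "parhta": {"V"},
    "hai": {"AUX"},
}

print(earley_recognize("larka kitab parhta hai".split(), GRAMMAR, LEXICON))
print(earley_recognize("kitab hai larka".split(), GRAMMAR, LEXICON))
```

The recognizer only answers whether a sentence is covered by the grammar. Producing parse trees, which PARSEVAL scoring requires, would additionally need back-pointers on completed items.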