Urdu Parser

Introduction
This project develops a parser for Urdu, a South Asian language. The parser is intended to advance the state of the art in automatic Urdu language processing. The project started in 2011 and is still in progress. Its by-products to date include a semi-semantic part-of-speech tagset, a semi-semantic syntactic tagset, a functional tagset, annotation guidelines, a grammar that encodes enough information to parse the morphologically rich Urdu language, and a part-of-speech tagged corpus that can be used to train part-of-speech taggers. These resources will be enhanced further and can be used for natural language processing tasks such as probabilistic parsing, training of POS taggers, disambiguation of spoken sentences, grammar development, language identification, linguistic inquiry and psychological modeling, pattern matching, etc. The resources developed to date have been published and are available in the respective sections below.
Urdu Parser
To develop the Urdu parser, the 1400 annotated sentences of the URDU.KON-TB treebank were first divided into 80% training data and 20% test data. A context-free grammar was extracted from the training data and supplied to the Urdu parser once the parser had been developed. The 20% portion was then split in half, giving 10% held-out data and 10% final test data. The final test data thus contained 140 sentences with an average length of 13.73 words per sentence, and the held-out data was used during the development of the parser. The Urdu parser is an extended version of the Earley parsing algorithm, a dynamic programming algorithm. The extensions, along with the issues faced during development, are presented in the published work below. All items that can occur in normal text have been considered, e.g., punctuation, null elements, diacritics, headings, regard titles, Hadees (the statements of prophets), anaphora within a sentence, and others. The PARSEVAL measures are used to evaluate the parser's results. With a sufficiently rich grammar and the extended parsing model, the parser achieves an F-score of 87% and outperforms the other parsers in its domain. The publication on this Urdu parser is given below.
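As a side note on the data split and the PARSEVAL measures described in this section, the following is a minimal Python sketch, not the project's own evaluation code. The function names `split_sizes` and `parseval`, the bracket representation as `(label, start, end)` tuples, and the example labels are all assumptions made for illustration; they are not taken from URDU.KON-TB or its tagsets.

```python
from collections import Counter


def split_sizes(total, train_pct=80, heldout_pct=10):
    """Return (train, held-out, test) sentence counts for a treebank.

    Hypothetical helper: the test share is whatever remains after the
    training and held-out shares are taken.
    """
    train = total * train_pct // 100
    heldout = total * heldout_pct // 100
    test = total - train - heldout
    return train, heldout, test


def parseval(gold, predicted):
    """Labeled bracket precision, recall and F-score.

    Each bracket is a (label, start, end) tuple. Brackets are compared
    as multisets, so repeated identical brackets are not collapsed.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection: brackets present in both analyses.
    matched = sum((gold_counts & pred_counts).values())
    precision = matched / sum(pred_counts.values()) if pred_counts else 0.0
    recall = matched / sum(gold_counts.values()) if gold_counts else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # 1400 sentences split 80/10/10, as described in the text.
    print(split_sizes(1400))

    # Toy example with made-up labels: three of four brackets agree.
    gold = [("S", 0, 4), ("NP", 0, 2), ("VP", 2, 4), ("NP", 2, 3)]
    pred = [("S", 0, 4), ("NP", 0, 2), ("VP", 2, 4), ("PP", 2, 3)]
    print(parseval(gold, pred))
```

For 1400 sentences the split yields 1120 training, 140 held-out and 140 test sentences, which matches the 140 test sentences reported above. Real PARSEVAL scoring tools usually apply further conventions (for example, how punctuation and null elements are treated), which this sketch leaves out.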