Computational Learning Strategies & Practices

An organization to promote computation, logic, science and practices

News

Inauguration of the CLSP.ORG platform.

It is time to provide a platform through which academic researchers across the globe can strengthen their abilities in computational models, logic theories, scientific studies and other practices for the benefit of society, culture and countries….

CLSP Project Progressing.

CLSP, a non-profit organization, was initiated with minimal resources and was at first concerned only with natural languages. Local and foreign experts involved in its formation then contributed, and the platform was extended to Computation, Logic, Science & Practices. It …

The URDU.KON-TB Treebank

This project covers the development of the URDU.KON-TB treebank for the South Asian language Urdu, together with its annotation scheme, evaluation and guidelines. The development of a reliable treebank and a parser will have a significant impact on the state of the art in automatic Urdu language processing. The project started in 2011 and is still in progress. Its by-products to date include a semi-semantic part of speech tagset, a semi-semantic syntactic tagset, a functional tagset, annotation guidelines, a grammar encoding sufficient information for parsing the morphologically rich language Urdu, and a part of speech tagged corpus that can be used to train part of speech taggers. These resources will be enhanced further and can be used for natural language processing tasks such as probabilistic parsing, training of POS taggers, disambiguation of spoken sentences, grammar development, language identification, linguistic inquiry and psychological modeling, pattern matching, etc. The resources developed to date are published and will be available in the respective sections below.

Corpus Collection

The raw corpus used for the URDU.KON-TB Treebank contains 1400 sentences collected from the Urdu Wikipedia and the Jang newspaper. It covers local and international news, social stories, sports, culture, finance, religion, travel, etc. An ongoing effort is adding 600 more sentences, which will increase the size of the corpus from 1400 to 2000. Corpus updates will be provided soon.

Annotation Scheme

The hierarchical annotation scheme adopted combines phrase structure with hyper dependency structure. A semi-semantic part of speech tag set, a semi-semantic syntactic tag set and a functional tag set were designed and then further revised during the annotation of the raw corpus. The annotation of the sentences was performed manually. Because it adds morphological, part of speech, syntactic, semantic, clausal, grammatical and miscellaneous features, the annotation scheme is linguistically rich. This annotation resulted in a treebank known as the URDU.KON-TB treebank. The published work on the annotation scheme is as follows:

Abbas, Q. 2014. Semi-Semantic Part of Speech Annotation and Evaluation. In Proceedings of ACL 8th Linguistic Annotation Workshop (COLING), Association for Computational Linguistics, pp. 75-81, Dublin, Ireland.
Abbas, Q. 2014. Building Computational Resources: The URDU.KON-TB Treebank and the Urdu Parser. Doctoral Dissertation, University of Konstanz, Germany.
Abbas, Q. 2012. Building a Hierarchical Annotated Corpus of Urdu: The URDU.KON-TB Treebank. Lecture Notes in Computer Science (LNCS), Vol. 7181(1), pp. 66-79, ISSN 0302-9743, Springer-Verlag Berlin/Heidelberg.
Download Semi-Semantic Part Of Speech Tagset Of URDU.KON-TB Treebank

Annotation Evaluation

To evaluate the annotation scheme, Krippendorff’s α coefficient was selected. This is a statistical measure of inter-annotator agreement. One hundred randomly selected sentences from the URDU.KON-TB treebank were given to five trained annotators for annotation. The annotated sentences were then evaluated using Krippendorff’s α coefficient. The α values of inter-annotator agreement obtained for part of speech, syntactic and functional annotation were 0.964, 0.817 and 0.806, respectively. All three values exceed 0.8, the level commonly taken to indicate reliable, near-perfect agreement. The published work on the annotation evaluation will be provided here soon.
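As an illustration of how such an agreement score can be computed, the following is a minimal sketch of Krippendorff’s α for nominal labels, built from the standard coincidence-matrix definition. It is not the evaluation code used in the project. The function name `krippendorff_alpha_nominal`, the input layout (one list of labels per annotated item, `None` for a missing label) and the tags "N" and "V" are assumptions made for this example and are not taken from the URDU.KON-TB tagsets.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: one sequence per annotated item, holding the label each
    annotator assigned to that item (None marks a missing label).
    This layout is an assumption of this sketch.
    """
    # Coincidence matrix: every ordered pair of labels given to the same
    # item by two different annotators, weighted by 1 / (m - 1), where m
    # is the number of labels the item received.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two labels cannot be paired
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Observed and expected disagreement (both scaled by n, which cancels
    # in the ratio below).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label occurs")

    return 1.0 - observed / expected


# Hypothetical example: four tokens, two annotators, placeholder tags.
sample = [["N", "N"], ["N", "V"], ["V", "V"], ["V", "V"]]
alpha = krippendorff_alpha_nominal(sample)  # about 0.533
```

A value of 1 means the annotators agree on every item, while a value near 0 means agreement is no better than chance. In an evaluation like the one described above, each item would be a token or constituent, and α would be computed separately for the part of speech, syntactic and functional layers.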

Annotation Guidelines

The annotation guidelines devised in the development of the URDU.KON-TB treebank were revised during and after the annotation evaluation. The updated version will be provided here soon.