Start-Tagger (Starikov’s Tagger, or StarT) annotates a text with part-of-speech (POS) tags. It is based on the well-known bidirectional inference algorithm, in which the POS tag assigned to a token depends on the POS tags of the tokens to its left and to its right.

Part-of-speech tagging has been widely used in corpus linguistics, and in recent decades it has become an indispensable component of fields such as text mining and text classification/categorization. Its application in these fields faces one major problem: tagging is time-consuming and considerably slows down an NLP system when performed dynamically. To make POS tagging faster, we modified the bidirectional inference algorithm by excluding two parameters that can be computed on the fly once the remaining parameters are known. For details, see our paper [16] in the list of publications.
As a result, StarT runs much faster than its closest analogue, the T&T tagger developed by Japanese researchers, which employs the same algorithm.
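The general idea of letting a token's tag depend on the tags to both its left and its right can be illustrated with a toy sketch. This is not StarT's actual model or the T&T tagger's: the lexicon, the pair bonuses, and the function names `score` and `tag_sentence` are hypothetical, and the decoding order shown (tag the most confident token first, then let its tag inform its neighbours) is just one simple way to use context from both directions.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicon: candidate tags for each word with a base score.
LEXICON: Dict[str, Dict[str, float]] = {
    "the": {"DT": 1.0},
    "can": {"MD": 0.6, "NN": 0.3, "VB": 0.1},
    "rusts": {"VBZ": 0.7, "NNS": 0.3},
}

# Hypothetical bonus for an adjacent (left_tag, right_tag) pair.
PAIR_BONUS: Dict[Tuple[str, str], float] = {
    ("DT", "NN"): 0.5,
    ("NN", "VBZ"): 0.4,
    ("MD", "VB"): 0.5,
}


def score(word: str, tag: str, left: Optional[str], right: Optional[str]) -> float:
    """Score a candidate tag using whichever neighbouring tags are already known."""
    s = LEXICON[word][tag]
    if left is not None:
        s += PAIR_BONUS.get((left, tag), 0.0)
    if right is not None:
        s += PAIR_BONUS.get((tag, right), 0.0)
    return s


def tag_sentence(words: List[str]) -> List[str]:
    """Repeatedly fix the single most confident (position, tag) decision."""
    tags: List[Optional[str]] = [None] * len(words)
    while None in tags:
        best: Optional[Tuple[float, int, str]] = None
        for i, word in enumerate(words):
            if tags[i] is not None:
                continue  # already decided
            left = tags[i - 1] if i > 0 else None
            right = tags[i + 1] if i + 1 < len(words) else None
            for tag in LEXICON[word]:
                s = score(word, tag, left, right)
                if best is None or s > best[0]:
                    best = (s, i, tag)
        assert best is not None
        _, pos, tag = best
        tags[pos] = tag
    return [t for t in tags if t is not None]


print(tag_sentence(["the", "can", "rusts"]))  # ['DT', 'NN', 'VBZ']
print(tag_sentence(["can", "rusts"]))         # ['NN', 'VBZ']
```

In the second call, "rusts" is tagged first, and its tag (right-hand context) is what moves "can" from its default MD reading to NN.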

The table below shows the results of tests conducted on a machine with a 2.8 GHz Pentium 4 processor and 768 MB of RAM.

 

Text size    T&T tagger        StarT
10 KB        2 sec.            << 1 sec.
50 KB        9 sec.            < 1 sec.
100 KB       17 sec.           1 sec.
500 KB       1 min. 22 sec.    3 sec.
1000 KB      2 min. 50 sec.    6 sec.

We also evaluated the quality of StarT against that of the tagger used in the American National Corpus (ANC) by comparing each tagger's output with texts annotated manually by human experts. StarT's accuracy was 99.27%, i.e. an error rate of 0.73% (about 7.3 mistakes per 1,000 words), while the ANC tagger's accuracy was 99.33%, i.e. an error rate of 0.67% (about 6.7 mistakes per 1,000 words).
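For readers who want to reproduce this kind of comparison, the sketch below shows how token-level accuracy against a manually annotated reference is typically computed, and how an accuracy figure converts to mistakes per 1,000 words. The function names are hypothetical and the tag sequences are made-up illustrations, not the data used in the evaluation above.

```python
from typing import Sequence


def tagging_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of tokens whose predicted tag matches the manual annotation."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of tokens")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty text")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def errors_per_thousand(accuracy: float) -> float:
    """Convert an accuracy in [0, 1] to expected mistakes per 1,000 words."""
    return (1.0 - accuracy) * 1000.0


# Made-up four-token example: one tag out of four is wrong.
print(tagging_accuracy(["DT", "NN", "VBZ", "NN"], ["DT", "NN", "VBZ", "JJ"]))  # 0.75

# The reported accuracies expressed as mistakes per 1,000 words.
print(round(errors_per_thousand(0.9927), 1))  # 7.3
print(round(errors_per_thousand(0.9933), 1))  # 6.7
```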

The ANC texts were chosen for the comparison because the ANC is the most recent corpus and employs the most modern software. In developing StarT, we took the ANC as a model and used the same tagset.

To use StarT, open a text from a directory by clicking the “load and tag” button. You can then copy the annotated text to an external editor.



StarT processes English texts in .txt format, runs on Windows machines, and requires the .NET Framework.

© Viatcheslav Yatsko, 2011-2013