Release date: 2010
Author: Pasquier, C.

GeniaJ 1 is a Java implementation of the Genia tagger (part-of-speech tagging and shallow parsing for biomedical texts), version 3.0.1 of April 16, 2007. The original version was developed in C++ by Yoshimasa Tsuruoka of the Tsujii Laboratory at the University of Tokyo and distributed under the modified BSD licence. The datasets are identical to those of the original C++ version, and the output of this Java version should be identical to the output of the original.

For more information about the original software, see:

  • Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun’ichi Tsujii, Developing a Robust Part-of-Speech Tagger for Biomedical Text, Advances in Informatics - 10th Panhellenic Conference on Informatics, LNCS 3746, pp. 382-392, 2005.


Prepare a text file containing one sentence per line, then execute the program with:

java -Xmx500m -jar GeniaJ.jar < RAWTEXT > TAGGEDTEXT
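If the source text is not yet split into one sentence per line, the input file could be prepared with something like the following. This is only a minimal sketch: the class name SentencePerLine and the method splitSentences are hypothetical and not part of GeniaJ, and the splitting relies on the general-purpose java.text.BreakIterator from the Java standard library, which is not tuned for biomedical text and may mis-split around abbreviations or unusual names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical helper, not part of GeniaJ: rewrites free text as one sentence per line.
class SentencePerLine {

    // Splits text into sentences with the JDK's generic English sentence rules.
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) throws IOException {
        // Read all of standard input, joining the original lines with spaces.
        String text;
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            text = in.lines().collect(Collectors.joining(" "));
        }
        // Print one sentence per line, ready to be used as RAWTEXT.
        for (String sentence : splitSentences(text)) {
            System.out.println(sentence);
        }
    }
}
```

Once compiled, the helper reads free text on standard input and writes the sentences to standard output, so its output can be redirected to the RAWTEXT file. It is worth checking the result by hand before tagging, since a wrong sentence boundary will affect the tagger's output.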

The tagger outputs the base forms, part-of-speech (POS) tags, chunk tags, and named entity (NE) tags in the following tab-separated format.

word1   base1   POStag1 chunktag1 NEtag1
word2   base2   POStag2 chunktag2 NEtag2
  :       :        :       :        :

Chunks are represented in the IOB2 format (B for BEGIN, I for INSIDE, and O for OUTSIDE).


> echo "Inhibition of NF-kappaB activation reversed the anti-apoptotic effect of isochamaejasmin." | java -Xmx500m -jar GeniaJ.jar

Inhibition      Inhibition      NN      B-NP     O
of              of              IN      B-PP     O
NF-kappaB       NF-kappaB       NN      B-NP     B-protein
activation      activation      NN      I-NP     O
reversed        reverse         VBD     B-VP     O
the             the             DT      B-NP     O
anti-apoptotic  anti-apoptotic  JJ      I-NP     O
effect          effect          NN      I-NP     O
of              of              IN      B-PP     O
isochamaejasmin isochamaejasmin NN      B-NP     O
.               .               .       O        O

You can easily extract four noun phrases (“Inhibition”, “NF-kappaB activation”, “the anti-apoptotic effect”, and “isochamaejasmin”) from this output by looking at the chunk tags: each phrase starts at a B-NP tag and continues through the I-NP tags that follow it. In the same way, the named entity tags identify a protein name (“NF-kappaB”).
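The extraction described above can be sketched in a few lines of Java. The class ChunkExtractor and the method extractSpans are hypothetical names used for illustration and are not part of GeniaJ. The sketch assumes that each output line holds the five tab-separated fields described earlier, and that multi-word named entities use the same B-/I- prefixes as the chunk tags.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of GeniaJ: groups tagged words into phrases.
class ChunkExtractor {

    /**
     * Collects the spans of a given type from tagger output lines.
     * column is the zero-based field holding the tags (3 = chunk tag, 4 = NE tag);
     * type is the tag name without its prefix, e.g. "NP" or "protein".
     */
    static List<String> extractSpans(List<String> taggedLines, int column, String type) {
        List<String> spans = new ArrayList<>();
        StringBuilder current = null;
        for (String line : taggedLines) {
            String[] fields = line.trim().split("\t");
            // Blank or short lines are treated as being outside any span.
            String tag = fields.length > column ? fields[column] : "O";
            if (tag.equals("B-" + type)) {
                // A B- tag closes any open span and starts a new one.
                if (current != null) {
                    spans.add(current.toString());
                }
                current = new StringBuilder(fields[0]);
            } else if (tag.equals("I-" + type) && current != null) {
                // An I- tag extends the span that is currently open.
                current.append(' ').append(fields[0]);
            } else if (current != null) {
                // Any other tag ends the open span.
                spans.add(current.toString());
                current = null;
            }
        }
        if (current != null) {
            spans.add(current.toString());
        }
        return spans;
    }

    public static void main(String[] args) throws IOException {
        // Read the tagger output from standard input.
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        System.out.println("NP: " + extractSpans(lines, 3, "NP"));
        System.out.println("protein: " + extractSpans(lines, 4, "protein"));
    }
}
```

Piping the tagger output for the example sentence into this program should list the four noun phrases and the protein name mentioned above. Passing the column index as a parameter lets the same loop serve both the chunk tags and the named entity tags.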

