All corpora in PoeTree have the same structure. Every poem is stored in a standalone JSON file with the following schema:
id: str - id of the poem (points to a source file)
title: str|None - title of the poem
year_created: int|list|None - year when the poem was created (may precede the date of publication); when the source data give a time span (e.g. 1800-1802), it is stored as a list [1800,1802]
neighbors: list - the 20 most similar poems (Fuzz partial ratio) by the same author; each one is encoded as a list [id, similarity], in descending order
duplicate: str|False - whether the poem is considered a duplicate of another poem; if so, the id of the primary variant is given here (this is not a straightforward application of neighbors, but involves a set of rules and a number of manually decided cases)
author: { dict|list - metadata on the poem's author; when the poem has multiple authors, this is a list holding multiple dicts
    name: str - name of the author as it appears in the source (may be a pen name); if the author is unknown, it is stored here as [anonymous]
    viaf: str|None - VIAF id
    wiki: str|None - Wikidata id
    country: str|None - country the author is from (ISO 639-1); None if unknown or of no importance for the corpus
    born: int|None - year when the author was born
    died: int|None - year when the author died
},
source: { dict - metadata on the poem's source (book)
    id: str|None - id of the book
    title: str|None - book title
    year_published: int|None - year when the book was published
    publisher: str|None - name of the publisher
    place: str|None - place of publication (city)
    corpus: str - name of the corpus from which the data come (e.g. the Russian corpus comprises rpc [Russian poetic corpus] and zelenkov [texts tagged by J. Zelenkov])
},
body: [{ list of dicts - each dict is a line of the poem and holds its text and annotation
    id: int - index of the line; zero-based, incremented through the entire poem (does not restart in a new stanza)
    stanza_id: int - index of the stanza; zero-based
    text: str - text of the line
    part: str|False - if a verse line is divided into multiple text lines, this indicates whether this is the initial part (value="I"), a medial part (value="M"), or the final part (value="F"); False if the line is not divided
    words: [{ list of dicts - annotation of each word in CoNLL-U format
        id: int - word index, integer starting at 1 for each new sentence
        form: str - word form or punctuation symbol
        lemma: str - lemma or stem of the word form
        upos: str - universal part-of-speech tag
        xpos: str - language-specific part-of-speech tag; underscore if not available
        feats: str - list of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available
        head: int - head of the current word, which is either a value of ID or zero (0)
        deprel: str - universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one
        deps: str - enhanced dependency graph in the form of a list of head-deprel pairs
        sentence: int - sentence index; integer starting at 1
        multiword: { dict - this key is optional; it appears only if the given word is part of a multiword token
            form: str - form of the multiword
            id: int - multiword index, integer starting at 1 for each new sentence; e.g. "vámonos" produces two words, "vámos" and "nos"; each has its own unique id, and both are assigned to a multiword with form "vámonos" and a single id
        },
    }, ... ],
}, ... ],
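Several fields in this schema can take more than one shape (author is a dict or a list, year_created is an int, a list, or None, part is a string or False), so code reading a poem usually has to normalize them first. The following is a minimal Python sketch of how that might look, not an official PoeTree loader. The helper names (load_poem, authors_of, year_span, stanzas, verse_lines), the sample record, and the choice to join divided line parts with a single space are all assumptions made for illustration.

```python
import json
from itertools import groupby


def load_poem(path):
    """Read one poem from a standalone JSON file (path is hypothetical)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def authors_of(poem):
    """Return author metadata as a list, whether stored as a dict or a list."""
    author = poem["author"]
    return author if isinstance(author, list) else [author]


def year_span(poem):
    """Return (start, end) for year_created, or None if the year is unknown."""
    year = poem.get("year_created")
    if year is None:
        return None
    if isinstance(year, list):  # time span such as [1800, 1802]
        return (year[0], year[-1])
    return (year, year)


def stanzas(poem):
    """Group line texts by stanza_id, keeping lines in order of their id."""
    lines = sorted(poem["body"], key=lambda line: line["id"])
    return [
        [line["text"] for line in group]
        for _, group in groupby(lines, key=lambda line: line["stanza_id"])
    ]


def verse_lines(poem):
    """Merge divided verse lines (parts "I", "M", "F") into single strings.

    Joining parts with one space is an assumption of this sketch.
    """
    verses = []
    for line in poem["body"]:
        if line["part"] in ("M", "F") and verses:
            verses[-1] += " " + line["text"]
        else:  # part is False (undivided) or "I" (start of a divided line)
            verses.append(line["text"])
    return verses


# Invented record that follows the schema; not taken from any corpus.
sample = {
    "id": "example-0001",
    "title": None,
    "year_created": [1800, 1802],
    "author": {"name": "[anonymous]", "viaf": None, "wiki": None,
               "country": None, "born": None, "died": None},
    "body": [
        {"id": 0, "stanza_id": 0, "text": "First line", "part": False},
        {"id": 1, "stanza_id": 0, "text": "Second line,", "part": "I"},
        {"id": 2, "stanza_id": 0, "text": "continued", "part": "F"},
        {"id": 3, "stanza_id": 1, "text": "Last line", "part": False},
    ],
}
```

With a real file, load_poem would supply the dict that the other helpers take; the in-memory sample is used here only so the sketch runs without any corpus data.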
Supported by the Czech Science Foundation (GA23-07727S)