All corpora in PoeTree share the same structure. Every poem is stored in a standalone JSON file with the following schema:
id: | str | id of the poem (points to a source file) |
title: | str|None | title of the poem |
year_created: | int|list|None | year the poem was created (may precede the date of publication); when the source data give a time span (e.g. 1800-1802), it is stored as a list [1800,1802] |
neighbors: | list | the 20 most similar poems by the same author (Fuzz partial ratio), each encoded as a list [id, similarity], sorted by similarity in descending order |
duplicate: | str|False | False unless the poem is considered a duplicate of another poem, in which case the id of the primary variant is given here (this is not a straightforward application of neighbors; it involves a set of rules and a number of manually decided cases) |
author: { | dict|list | metadata on the poem's author; when the poem has multiple authors, this is a list of such dicts |
name: | str | name of the author as it appears in the source (may be a pen name); if the author is unknown, [anonymous] is stored here |
viaf: | str|None | VIAF id |
wiki: | str|None | Wikidata id |
country: | str|None | country the author is from (ISO 639-1); None if unknown or of no importance for the corpus |
born: | int|None | year the author was born |
died: | int|None | year the author died |
}, |
source: { | dict | metadata on poem's source (book) |
id: | str|None | id of the book |
title: | str|None | book title |
year_published: | int|None | year the book was published |
publisher: | str|None | name of the publisher |
place: | str|None | place of publication (city) |
corpus: | str | name of the corpus the data come from (e.g. the Russian corpus comprises rpc [Russian poetic corpus] and zelenkov [texts tagged by J. Zelenkov]) |
}, |
body: [{ | list of dicts | each dict is a line of the poem and holds its text and annotation |
id: | int | index of the line; zero-based, incremented through the entire poem (does not restart at a new stanza) |
stanza_id: | int | index of the stanza; zero-based |
text: | str | text of the line |
part: | str|False | if a verse line is divided into multiple text lines, indicates whether this is the initial part (value="I"), a medial part (value="M"), or the final part (value="F"); False for an undivided line |
words: [{ | list of dicts | annotation of each word in CoNLL-U format |
id: | int | word index, integer starting at 1 for each new sentence |
form: | str | word form or punctuation symbol |
lemma: | str | lemma or stem of word form |
upos: | str | universal part-of-speech tag |
xpos: | str | language-specific part-of-speech tag; underscore if not available |
feats: | str | list of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available |
head: | int | head of the current word, which is either a value of ID or zero (0) |
deprel: | str | universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one |
deps: | str | enhanced dependency graph in the form of a list of head-deprel pairs |
sentence: | int | sentence index; integer starting at 1 |
multiword: { | dict | optional key; present only if the word is part of a multiword token |
form: | str | form of the multiword token |
id: | int | multiword index, integer starting at 1 for each new sentence; e.g. "vámonos" produces two words, "vámos" and "nos", each with its own unique id, and both are assigned to the multiword token with form "vámonos" and a single id |
}, |
}, ... ], |
}, ... ], |
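
As a rough illustration of how the variable-typed fields in this schema might be handled, here is a minimal Python sketch. The sample poem is invented for the example and contains only a few of the keys; the helper names (`authors`, `year_span`, `stanzas`, `verse_lines`) are hypothetical and not part of PoeTree.

```python
import json
from itertools import groupby

# Invented sample following the schema above; real files hold more keys.
SAMPLE = json.loads("""
{
  "id": "demo-0001",
  "title": null,
  "year_created": [1800, 1802],
  "author": {"name": "[anonymous]", "born": null, "died": null},
  "body": [
    {"id": 0, "stanza_id": 0, "text": "First line", "part": false},
    {"id": 1, "stanza_id": 0, "text": "A divided", "part": "I"},
    {"id": 2, "stanza_id": 0, "text": "verse line", "part": "F"},
    {"id": 3, "stanza_id": 1, "text": "Last line", "part": false}
  ]
}
""")


def authors(poem):
    """Return the authors as a list, whether stored as one dict or a list of dicts."""
    value = poem["author"]
    return value if isinstance(value, list) else [value]


def year_span(poem):
    """Return (first, last) year of creation, or None if the year is unknown."""
    year = poem.get("year_created")
    if year is None:
        return None
    if isinstance(year, list):
        return (year[0], year[-1])
    return (year, year)


def stanzas(poem):
    """Group the text lines by stanza_id, keeping their original order."""
    return [
        [line["text"] for line in group]
        for _, group in groupby(poem["body"], key=lambda line: line["stanza_id"])
    ]


def verse_lines(poem):
    """Join text lines marked "M" or "F" onto the preceding part of the verse line."""
    merged = []
    for line in poem["body"]:
        if line["part"] in ("M", "F") and merged:
            merged[-1] += " " + line["text"]
        else:
            merged.append(line["text"])
    return merged
```

With a real corpus file, `SAMPLE` would be replaced by the result of `json.load` on the opened file. Joining divided lines with a single space is a simplification chosen for this sketch.
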
# | VAL | DESCRIPTION |
---|---|---|
(1) | alcaic | Alcaic strophe; four-line stanzas with basic meter-patterns: [patterns not preserved] |
(2) | arte mayor | eight-line stanzas with rhyme scheme abbaacca |
(3) | asclepiad 4 | four-line stanzas with basic meter-patterns: [patterns not preserved] |
(4) | burns | Burns stanza; six-line stanzas rhyming aaabab, the lines being I4.I4.I4.I2.I4.I2 or T4.T4.T4.T2.T4.T2 |
(5) | elegiac | Elegiac couplet; two-line stanzas consisting of a hexameter followed by a pentameter |
(6) | english sonnet | poem consisting of three four-line stanzas and one two-line stanza |
(7) | glyconic | four-line stanzas with basic meter-patterns: [patterns not preserved] |
(8) | ghazal | poem containing radif (i.e. a repeated word at the end of a line) distributed according to the scheme rr(xr)+x? |
(9) | huitain | eight-line stanzas with rhyme scheme ababbcbc or abbaacac |
(10) | heroic | two-line stanzas with rhyme scheme aa consisting of lines I5.I5 |
(11) | italian sonnet | poem consisting of two four-line and two three-line stanzas |
(12) | limerick | poem consisting of one stanza with rhyme scheme aabba |
(13) | madrigal | three-line stanzas, the last line being identical in each stanza |
(14) | onegin | fourteen-line stanzas with rhyme scheme ababccddeffegg |
(15) | ottava rima | eight-line stanzas with rhyme scheme abababcc |
(16) | qasida | two-line stanzas with rhyme scheme aa.ax.ax... or aa.aa.aa... or aa.ba.ca... |
(17) | rhyme royal | seven-line stanzas with rhyme scheme ababbcc |
(18) | ritornello | poem consisting of one stanza with rhyme scheme axa |
(19) | rondel | poem consisting of three stanzas, in which only two rhymes are used, the first line being identical with the last, or the first line being identical with the penultimate and at the same time the second line being identical with the last |
(20) | rondeau | poem consisting of two or three stanzas; only two rhymes; the initial part of the first line (or the whole first line) is repeated at the end of the last line of each stanza except the first |
(21) | sapphic | four-line stanzas with basic meter-patterns: [patterns not preserved] |
(22) | sapphic barock | four-line stanzas with basic meter-patterns: [patterns not preserved] |
(23) | sapphic german | four-line stanzas with basic meter-patterns: [patterns not preserved] |
(24) | sapphic iamb | four-line stanzas with basic meter-patterns: [patterns not preserved] |
(25) | sestina | poem consisting of six-line stanzas, the final stanza having three lines |
(26) | sicilian | Sicilian octave; eight-line stanzas with rhyme scheme abababab |
(27) | spenserian | nine-line stanzas with rhyme scheme ababbcbcc, consisting of lines I5.I5.I5.I5.I5.I5.I5.I5.I6 |
(28) | terza rima | poem consisting of three-line stanzas and a final one-line or four-line stanza, with rhyme scheme aba bcb cdc... yzy z, or aba bcb cdc... yzyz |
(29) | triolet | poem consisting of a single eight-line stanza; the first, fourth, and seventh lines being identical, and the second and last lines being identical |
(30) | venus and adonis | six-line stanzas with rhyme scheme ababcc |
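
The rhyme schemes listed in this table can be compared against a stanza once its rhymes are reduced to a canonical letter sequence. The sketch below assumes the caller already has one rhyme-class label per line (any hashable value); where such labels come from is not covered here. `SCHEMES`, `canonical_scheme` and `candidate_forms` are hypothetical names, and only rows defined purely by a fixed stanza rhyme scheme are included.

```python
from string import ascii_lowercase

# Rhyme schemes copied from the table above. The metre conditions given for
# burns and spenserian are ignored here, so a match is only a candidate.
SCHEMES = {
    "abbaacca": ["arte mayor"],
    "aaabab": ["burns"],
    "ababbcbc": ["huitain"],
    "abbaacac": ["huitain"],
    "ababccddeffegg": ["onegin"],
    "abababcc": ["ottava rima"],
    "ababbcc": ["rhyme royal"],
    "abababab": ["sicilian"],
    "ababbcbcc": ["spenserian"],
    "ababcc": ["venus and adonis"],
}


def canonical_scheme(labels):
    """Map arbitrary rhyme labels to a, b, c... in order of first appearance."""
    seen = {}
    letters = []
    for label in labels:
        if label not in seen:
            # Sketch limitation: supports at most 26 distinct rhymes per stanza.
            seen[label] = ascii_lowercase[len(seen)]
        letters.append(seen[label])
    return "".join(letters)


def candidate_forms(labels):
    """Return the table values whose rhyme scheme equals the stanza's scheme."""
    return SCHEMES.get(canonical_scheme(labels), [])
```

For example, the labels `[3, 7, 3, 7, 7, 9, 9]` reduce to `ababbcc`, which the table lists for rhyme royal.
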
# | VAL | APPEARS IN | DESCRIPTION |
---|---|---|---|
(1) | iamb | cs, de, ru1, ru2 | accentual-syllabic iambic meter: wS... |
(2) | trochee | cs, de, ru1, ru2 | accentual-syllabic trochaic meter: Sw... |
(3) | dactyl | cs, de, ru1, ru2 | accentual-syllabic dactylic meter: Sww... |
(4) | amphibrach | cs, de, ru1, ru2 | accentual-syllabic amphibrachic meter: wSw... |
(5) | anapest | de, ru1, ru2 | accentual-syllabic anapestic meter: wwS... |
(6) | creticus | de | accentual-syllabic cretic meter: SwS... |
(7) | paeon | ru2 | accentual-syllabic paeonic (quaternary) meter: Swww..., wSww..., wwSw..., or wwwS... |
(8) | logaedic | cs, de, ru1, ru2 | accentual-syllabic logaedic meter: w?(Sw|Sww)+ |
(9) | syllabic | cs, de, fr, ru2 | line in syllabic verse |
(10) | quantitative | cs | line in quantitative verse |
(11) | accentual | de, ru2 | line in accentual verse |
(12) | quotation | cs | line is a quotation that deviates from the current meter |
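
To make the S/w notation in this table concrete, the following sketch checks which repeated foot reproduces a given pattern string. It assumes patterns are plain strings over `S` and `w`; the names `FEET`, `matching_meters` and `is_logaedic` are hypothetical. It is a simplification: short patterns can match several meters, and a binary line with a longer ending (e.g. `wSwSww`) is not recognised.

```python
import re

# Feet of the regular accentual-syllabic meters, as given in the table above.
FEET = {
    "iamb": "wS",
    "trochee": "Sw",
    "dactyl": "Sww",
    "amphibrach": "wSw",
    "anapest": "wwS",
    "creticus": "SwS",
}


def matching_meters(pattern):
    """Return meters whose repeated foot starts with `pattern`.

    The last foot may be cut short, so 'wSwSw' still counts as iambic.
    """
    hits = []
    for name, foot in FEET.items():
        repeated = foot * (len(pattern) // len(foot) + 1)
        if pattern and repeated.startswith(pattern):
            hits.append(name)
    return hits


def is_logaedic(pattern):
    """Test the logaedic expression from the table: w?(Sw|Sww)+.

    Purely trochaic or dactylic patterns satisfy this expression as well.
    """
    return re.fullmatch(r"w?(Sw|Sww)+", pattern) is not None
```
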
# | VAL | APPEARS IN | DESCRIPTION |
---|---|---|---|
(1) | alexandrine | cs, fr | 12- or 13-syllable line with a constant word boundary after the 6th syllable; in cs this applies only to iambic meter (I6) |
(2) | p?[dt]+ | cs, de | specification of the basic pattern for AS logaedic meters: p = initial w-position; d = dactylic foot (Sww); t = trochaic foot (Sw) |
(3) | hexameter | cs, de, ru2 | AS imitation of quantitative dactylic hexameter (logaedic line): Sww?Sww?Sww?Sww?Sww?Sw; in de, meter patterns for hexameter and pentameter are not available and are given just as XXX* |
(4) | pentameter | cs, de, ru2 | AS imitation of quantitative dactylic pentameter (logaedic line): Sww?Sww?Sw?Sww?Sww?S; in de, meter patterns for hexameter and pentameter are not available and are given just as XXX* |
(5) | (.)*\+(.)* | fr, ru2 | caesured verse; the plus sign may separate integers in syllabic verse (such as 4+6 in fr) or meters in accentual-syllabic verse (such as T3m+T3m in ru2); in the latter case, however, there is no clear border between this and a logaedic line |
(6) | dolnik | ru2 | subtype of accentual verse; no more than 2 unstressed syllables in a row |
(7) | taktovik | ru2 | subtype of accentual verse; no more than 3 unstressed syllables in a row |
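
Several of these values are compact encodings that are easy to unpack. The sketch below expands a `p?[dt]+` label into its S/w pattern using the mapping from the table, splits a caesured label at the plus sign, and measures the longest run of unstressed syllables, which is the quantity the dolnik and taktovik rows constrain. All function names are hypothetical, and the run check is only the condition stated in the table, not a full classifier.

```python
import re


def expand_logaedic(label):
    """Expand a p?[dt]+ label into an S/w pattern."""
    if not re.fullmatch(r"p?[dt]+", label):
        raise ValueError(f"not a p?[dt]+ label: {label!r}")
    # p = initial w-position, d = dactylic foot, t = trochaic foot
    units = {"p": "w", "d": "Sww", "t": "Sw"}
    return "".join(units[ch] for ch in label)


def split_caesura(label):
    """Split a caesured label such as '4+6' or 'T3m+T3m' at each plus sign."""
    return label.split("+")


def longest_unstressed_run(pattern):
    """Length of the longest run of w-positions in an S/w pattern."""
    runs = re.findall(r"w+", pattern)
    return max((len(run) for run in runs), default=0)


def within_dolnik_limit(pattern):
    """True if no more than 2 unstressed syllables occur in a row."""
    return longest_unstressed_run(pattern) <= 2


def within_taktovik_limit(pattern):
    """True if no more than 3 unstressed syllables occur in a row."""
    return longest_unstressed_run(pattern) <= 3
```
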
# | VAL | APPEARS IN | DESCRIPTION |
---|---|---|---|
(1) | s | cs, de, ru1, ru2 | Strong ending. Meter pattern ends with an S-position |
(2) | w | cs, de, ru1, ru2 | Weak ending. Meter pattern ends with Sw |
(3) | a | cs, de, ru1, ru2 | Acatalectic/dactylic ending. Meter pattern ends with Sww. In ru1, the ending is derived from the meter pattern generated by PP, so the annotation corresponds to the cs and de standards (binary meters cannot have an a-ending). In ru2, it is taken from the labels in RPC, so binary meters may have an a-ending. |
(4) | h | ru2 | Hyperdactylic ending. Meter pattern ends with Swww |
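
The four ending codes follow directly from the tail of the meter pattern, so they can be derived with a longest-suffix-first check. This is a minimal sketch assuming patterns are strings over `S` and `w`; the function name `ending` is hypothetical, and the corpus-specific rules mentioned for ru1 and ru2 are not modelled.

```python
def ending(pattern):
    """Return 'h', 'a', 'w' or 's' according to how the S/w pattern ends.

    Longer suffixes are tested first. Returns None if the pattern contains
    no S-position close enough to the end to match any of the four codes.
    """
    for suffix, code in (("Swww", "h"), ("Sww", "a"), ("Sw", "w"), ("S", "s")):
        if pattern.endswith(suffix):
            return code
    return None
```
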