PoeTree

dataset » json schema

All corpora in PoeTree have the same structure. Every poem is stored in a standalone JSON file with the following schema:

id:

str

id of the poem (points to a source file)

title:

str|None

title of the poem

year_created:

int|list|None

year when the poem was created (may precede the date of publication); when the source data give a time span (e.g. 1800-1802), it is stored as a list [1800,1802]

neighbors:

list

20 most similar poems (Fuzz partial ratio) by the same author; each one is encoded as a list [id, similarity], and the entries are in descending order

duplicate:

str|False

whether the poem is considered a duplicate of another poem; if so, the id of the primary variant is given here (this is not a straightforward application of neighbors, but involves a set of rules and a number of manually decided cases)

author: {

dict|list

metadata on the poem's author; when the poem has multiple authors, this is a list of dicts

name:

str

name of the author as it appears in the source (may be a pen name); if the author is unknown, it is stored here as [anonymous]

viaf:

str|None

VIAF id

wiki:

str|None

wikidata id

country:

str|None

country the author is from (ISO 639-1); None if unknown or of no importance for the corpus

born:

int|None

year when the author was born

died:

int|None

year when the author died

source: {

dict

metadata on the poem's source (book)

id:

str|None

id of the book

title:

str|None

book title

year_published:

int|None

year when the book was published

publisher:

str|None

name of the publisher

place:

str|None

place where published (city)

corpus:

str

name of the corpus the data come from (e.g. the Russian corpus comprises rpc [Russian poetic corpus] and zelenkov [texts tagged by J. Zelenkov])

body: [{

list of dicts

each dict is a line of the poem and holds its text and annotation

id:

int

index of the line; zero-based, incremented through the entire poem (does not restart in a new stanza)

stanza_id:

int

index of stanza; zero-based indexing

text:

str

text of the line

part:

str|False

if a verse-line is divided into multiple text-lines, this indicates whether the line is the initial part (value="I"), a medial part (value="M"), or the final part (value="F"); False if the line is not divided

words: [{

list of dicts

annotation of each word in CoNLL-U format

id:

int

word index, integer starting at 1 for each new sentence

form:

str

word form or punctuation symbol

lemma:

str

lemma or stem of word form

upos:

str

universal part-of-speech tag

xpos:

str

language-specific part-of-speech tag; underscore if not available

feats:

str

list of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available

head:

int

head of the current word, which is either a value of ID or zero (0)

deprel:

str

universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one

deps:

str

enhanced dependency graph in the form of a list of head-deprel pairs

sentence:

int

sentence index; integer starting at 1

multiword: {

dict

This key is optional. It appears only if the given word is part of a multiword token

form:

str

form of the multiword

id:

int

multiword index, integer starting at 1 for each new sentence; e.g. "vámonos" produces two words, "vámos" & "nos"; each has its own unique id, and both are assigned to a multiword with the form "vámonos" and a single id

}, ... ],
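
Several fields in the schema above are polymorphic (year_created is int|list|None, author is dict|list, part is str|False), so a consumer will usually want to normalize them before use. The following is a minimal Python sketch under stated assumptions, not an official PoeTree API: the helper names (load_poem, year_span, authors, stanzas, verse_lines) are hypothetical, JSON null/false are assumed to arrive as Python None/False via json.load, and joining the parts of a divided verse-line with a single space is an illustrative choice.

```python
import json
from itertools import groupby


def load_poem(path):
    """Read one standalone poem file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def year_span(year_created):
    """Return (first, last) year, or None when the year is unknown."""
    if year_created is None:
        return None
    if isinstance(year_created, list):  # time span such as [1800, 1802]
        return (year_created[0], year_created[-1])
    return (year_created, year_created)


def authors(poem):
    """Always return a list of author dicts, even for a single author."""
    author = poem["author"]
    return author if isinstance(author, list) else [author]


def stanzas(poem):
    """Group the line dicts of `body` by `stanza_id`, keeping line order."""
    lines = sorted(poem["body"], key=lambda line: line["id"])
    grouped = groupby(lines, key=lambda line: line["stanza_id"])
    return [list(group) for _, group in grouped]


def verse_lines(poem):
    """Join text-lines marked "I"/"M"/"F" into whole verse-lines.

    Assumption: the parts of one verse-line are adjacent in `body`.
    """
    verses, pending = [], []
    for line in sorted(poem["body"], key=lambda line: line["id"]):
        if line["part"] in ("I", "M"):
            pending.append(line["text"])
        elif line["part"] == "F":
            verses.append(" ".join(pending + [line["text"]]))
            pending = []
        else:  # False: a non-divided line
            verses.append(line["text"])
    if pending:  # divided line without a final part; keep what is there
        verses.append(" ".join(pending))
    return verses
```

With these helpers, downstream code can treat every poem as having a list of authors and an optional (first, last) year pair, regardless of how the individual file encodes them.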

VAL

DESCRIPTION

(1)

alcaic

Alcaic strophe; four-line stanzas with basic meter-patterns:

wSwSwSwwSww	OR	SwSwwSwwSww	OR	wSwSwSwwSwS
wSwSwSwwSww		SwSwwSwwSww		wSwSwSwwSwS
wSwSwSwSw		SwSwSwSww		wSwSwSwSw
SwwSwwSwSw		SwwSwwSwSw		SwwSwwSwSw
(cs)		(cs)		(de)

(2)

arte mayor

eight-line stanzas with abbaacca rhyme scheme

(3)

asclepiad 4

four-line stanzas with basic meter-patterns:

SwSwwSwSwwSww	OR	SwSwwSSwwSwS
SwSwwSwSwwSww		SwSwwSSwwSwS
SwSwwSw		SwSwwSw
SwSwwSww		SwSwwSwS
(cs)		(de)

(4)

burns

Burns stanza; six-line stanzas rhyming aaabab, the lines being I4.I4.I4.I2.I4.I2 or T4.T4.T4.T2.T4.T2

(5)

elegiac

Elegiac couplet; two-line stanzas consisting of a hexameter followed by a pentameter

(6)

english sonnet

poem consisting of three four-line and one two-line stanzas

(7)

glyconic

four-line stanzas with basic meter-patterns:

SwSwwSw[Sw]

SwSwwSw

(8)

ghazal

poem containing radif (i.e. a repeated word at the end of a line) distributed according to the scheme rr(xr)+x?

(9)

huitain

eight-line stanzas with rhyme scheme ababbcbc or abbaacac

(10)

heroic

two-line stanzas with rhyme scheme aa consisting of lines I5.I5

(11)

italian sonnet

poem consisting of two four-line and two three-line stanzas

(12)

limerick

poem consisting of one stanza with rhyme scheme aabba

(13)

madrigal

three-line stanzas, the last line being identical in each stanza

(14)

onegin

fourteen-line stanzas with rhyme scheme ababccddeffegg

(15)

ottava rima

eight-line stanzas with rhyme scheme abababcc

(16)

qasida

two-line stanzas with rhyme scheme aa.ax.ax... or aa.aa.aa... or aa.ba.ca...

(17)

rhyme royal

seven-line stanzas with rhyme scheme ababbcc

(18)

ritornello

poem consisting of one stanza with rhyme scheme axa

(19)

rondel

poem consisting of three stanzas, in which only two rhymes are used, the first line being identical with the last, or the first line being identical with the penultimate and at the same time the second line being identical with the last

(20)

rondeau

poem consisting of two or three stanzas; only two rhymes; the initial part of the first line (or the whole first line) is repeated at the end of the last line of each stanza except the first

(21)

sapphic

four-line stanzas with basic meter-patterns:

SwSwSwwSwSw

SwwSw

(22)

sapphic barock

four-line stanzas with basic meter-patterns:

SwwSwSwSwSw

SwwSw

(23)

sapphic german

four-line stanzas with basic meter-patterns:

SwwSwSwSwSw

SwSwwSwSwSw

SwSwSwwSwSw

SwwSw

(24)

sapphic iamb

four-line stanzas with basic meter-patterns:

wSwSwSwSwSw

wSwSw

(25)

sestina

poem consisting of six-line stanzas, the final stanza having three lines

(26)

sicilian

Sicilian octave; eight-line stanzas with rhyme scheme abababab

(27)

spenserian

nine-line stanzas with rhyme scheme ababbcbcc, consisting of lines I5.I5.I5.I5.I5.I5.I5.I5.I6

(28)

terza rima

poem consisting of three-line stanzas and final one-line or four-line stanza with rhyme scheme of aba bcb cdc... yzy z, or aba bcb cdc... yzyz

(29)

triolet

poem consisting of a single eight-line stanza; the first, fourth, and seventh lines being identical, and the second and last lines being identical

(30)

venus and adonis

six-line stanzas with rhyme scheme ababcc
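
Some of the forms listed above are defined by a rhyme scheme alone, and the ghazal scheme rr(xr)+x? is written as a regular expression. The Python sketch below illustrates how such definitions can be checked. It is an illustration only and not the procedure used to annotate the corpora: the names FIXED_SCHEMES, canonical, stanza_form, and looks_like_ghazal are hypothetical, per-line rhyme labels are assumed to come from elsewhere, and the radif is assumed to be the final word of the first line. Forms that also constrain meter or poem length (burns, heroic, spenserian, limerick) are deliberately left out.

```python
import re

# Stanza forms from the table that are defined by a rhyme scheme only.
FIXED_SCHEMES = {
    "abbaacca": "arte mayor",
    "ababbcbc": "huitain",
    "abbaacac": "huitain",
    "ababccddeffegg": "onegin",
    "abababcc": "ottava rima",
    "ababbcc": "rhyme royal",
    "abababab": "sicilian",
    "ababcc": "venus and adonis",
}

# Ghazal scheme from the table: r = line ends with the radif, x = it does not.
GHAZAL_SCHEME = re.compile(r"rr(xr)+x?")


def canonical(labels):
    """Relabel rhymes in order of first appearance: 'xyxy' becomes 'abab'."""
    mapping = {}
    for label in labels:
        if label not in mapping:
            mapping[label] = chr(ord("a") + len(mapping))
    return "".join(mapping[label] for label in labels)


def stanza_form(labels):
    """Return the matching VAL for one stanza, or None if no scheme fits."""
    return FIXED_SCHEMES.get(canonical(labels))


def looks_like_ghazal(line_endings):
    """Check the radif distribution of a whole poem against rr(xr)+x?."""
    if not line_endings:
        return False
    radif = line_endings[0]  # assumption: radif = last word of line 1
    scheme = "".join("r" if end == radif else "x" for end in line_endings)
    return GHAZAL_SCHEME.fullmatch(scheme) is not None
```

Note that (xr)+ requires at least one x-r pair after the opening rr, so a two-line poem whose lines both end in the same word does not match.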

#	VAL	APPEARS IN	DESCRIPTION
(1)	iamb	cs, de, ru1, ru2	accentual-syllabic iambic meter: wS...
(2)	trochee	cs, de, ru1, ru2	accentual-syllabic trochaic meter: Sw...
(3)	dactyl	cs, de, ru1, ru2	accentual-syllabic dactylic meter: Sww...
(4)	amphibrach	cs, de, ru1, ru2	accentual-syllabic amphibrachic meter: wSw...
(5)	anapest	de, ru1, ru2	accentual-syllabic anapestic meter: wwS...
(6)	creticus	de	accentual-syllabic cretic meter: SwS...
(7)	paeon	ru2	accentual-syllabic paeonic (quaternary) meter: Swww...|wSww...|wwSw...|wwwS...
(8)	logaedic	cs, de, ru1, ru2	accentual-syllabic logaedic meter: w?(Sw|Sww)+
(9)	syllabic	cs, de, fr, ru2	line in syllabic verse
(10)	quantitative	cs	line in quantitative verse
(11)	accentual	de, ru2	line in accentual verse
(12)	quotation	cs	line is a quotation that deviates from the current meter
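
The foot shapes in the table above (wS..., Sw..., Sww..., and so on) and the logaedic expression w?(Sw|Sww)+ can be read as patterns over strings of S and w. The Python sketch below shows one simplified way to test a meter pattern against them. It is not how the corpora were annotated: the names FEET, LOGAEDIC, candidate_meters, and is_logaedic_shape are hypothetical, and a pattern is treated as fitting a foot type if it is a truncation of that foot repeated, which ignores extra unstressed syllables at the line end.

```python
import re

# Basic feet as given in the DESCRIPTION column.
FEET = {
    "iamb": "wS",
    "trochee": "Sw",
    "dactyl": "Sww",
    "amphibrach": "wSw",
    "anapest": "wwS",
    "creticus": "SwS",
}

# Logaedic shape from the table: optional upbeat, then trochees and dactyls.
LOGAEDIC = re.compile(r"w?(Sw|Sww)+")


def candidate_meters(pattern):
    """List every foot type whose repetition starts with `pattern`."""
    names = []
    for name, foot in FEET.items():
        repeated = foot * (len(pattern) // len(foot) + 1)
        if len(pattern) >= len(foot) and repeated.startswith(pattern):
            names.append(name)
    return names


def is_logaedic_shape(pattern):
    """True if the whole pattern matches w?(Sw|Sww)+."""
    return LOGAEDIC.fullmatch(pattern) is not None
```

Short patterns can be ambiguous (SwS fits both trochee and creticus), and purely trochaic or dactylic patterns also match the logaedic expression, so these checks give necessary conditions rather than a classification.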

#	VAL	APPEARS IN	DESCRIPTION
(1)	alexandrine	cs, fr	12- or 13-syllable line with a constant word boundary after the 6th syllable; in CS this applies only to iambic meter (I6)
(2)	p?[dt]+	cs, de	specification of the basic pattern for AS logaedic meters: p = initial w-position; d = dactylic foot (Sww); t = trochaic foot (Sw)
(3)	hexameter	cs, de, ru2	AS imitation of quantitative dactylic hexameter (logaedic line): Sww?Sww?Sww?Sww?Sww?Sw; in DE, meter patterns for hexameter and pentameter are not available and are given only as XXX*
(4)	pentameter	cs, de, ru2	AS imitation of quantitative dactylic pentameter (logaedic line): Sww?Sww?Sw?Sww?Sww?S; in DE, meter patterns for hexameter and pentameter are not available and are given only as XXX*
(5)	(.)\+(.)	fr, ru2	caesured verse; the plus sign may separate integers in syllabic verse (such as 4+6 in fr) or meters in accentual-syllabic verse (such as T3m+T3m in ru2); in the latter case, however, there is no clear border between this and a logaedic line
(6)	dolnik	ru2	subtype of accentual verse; no more than 2 unstressed syllables in a row
(7)	taktovik	ru2	subtype of accentual verse; no more than 3 unstressed syllables in a row
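
The p?[dt]+ labels, the caesura notation, and the dolnik/taktovik limits in the table above can each be turned into a small check. The Python sketch below follows the definitions as stated in the table; the function names (expand_logaedic, split_caesura, max_weak_run, respects_subtype) are hypothetical and the code is an illustration, not part of PoeTree.

```python
import re

LOGAEDIC_LABEL = re.compile(r"p?[dt]+")

# p = initial w-position, d = dactylic foot, t = trochaic foot.
UNITS = {"p": "w", "d": "Sww", "t": "Sw"}

# Maximum number of unstressed syllables in a row, per the table.
WEAK_RUN_LIMIT = {"dolnik": 2, "taktovik": 3}


def expand_logaedic(label):
    """Expand a label such as 'pdt' into its S/w meter pattern."""
    if LOGAEDIC_LABEL.fullmatch(label) is None:
        raise ValueError(f"not a p?[dt]+ label: {label!r}")
    return "".join(UNITS[char] for char in label)


def split_caesura(label):
    """Split a caesured label such as '4+6' or 'T3m+T3m' into its halves."""
    return label.split("+")


def max_weak_run(pattern):
    """Length of the longest run of w-positions in an S/w pattern."""
    return max((len(run) for run in re.findall(r"w+", pattern)), default=0)


def respects_subtype(pattern, subtype):
    """True if the pattern stays within the limit for dolnik or taktovik."""
    return max_weak_run(pattern) <= WEAK_RUN_LIMIT[subtype]
```

For example, expand_logaedic("ttdtt") yields SwSwSwwSwSw. The run-length check only tests the limit stated in the table; it does not by itself distinguish accentual verse from regular accentual-syllabic meters, which also satisfy it.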

#	VAL	APPEARS IN	DESCRIPTION
(1)	s	cs, de, ru1, ru2	Strong ending. Meter pattern ends with an S-position
(2)	w	cs, de, ru1, ru2	Weak ending. Meter pattern ends with Sw
(3)	a	cs, de, ru1, ru2	Acatalectic/dactylic ending. Meter pattern ends with Sww. In ru1, the ending is derived from the meter pattern generated by PP, so the annotation corresponds to the cs & de standards (binary meters cannot have an a-ending). In ru2, this is taken from the labels in RPC; binary meters may have an a-ending.
(4)	h	ru2	Hyperdactylic ending. Meter pattern ends with Swww
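
Since each ending in the table above is defined by how the meter pattern terminates (S, Sw, Sww, Swww), the value can be derived by counting trailing w-positions. The following is a minimal Python sketch of that rule; the function name ending_of is hypothetical, and the sketch does not reproduce corpus-specific conventions such as the ru1/ru2 difference described for the a-ending.

```python
# Number of w-positions after the last S, mapped to the VAL in the table.
ENDINGS = {0: "s", 1: "w", 2: "a", 3: "h"}


def ending_of(pattern):
    """Return 's', 'w', 'a', or 'h' for an S/w pattern, else None."""
    if "S" not in pattern:
        return None  # no strong position, so no ending can be derived
    trailing_weak = len(pattern) - len(pattern.rstrip("w"))
    return ENDINGS.get(trailing_weak)
```

Patterns with more than three trailing w-positions fall outside the four listed values and return None.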