Graphs representing duplicates found among the works of particular authors. Use the buttons below to switch between corpora. Use the slider to set the threshold. The number following an author's name gives the number of components found in their works at the given threshold. Click an author's name to display their graph.
We define the similarity of two poems \(A\) and \(B\), containing \(|A|\) and \(|B|\) characters respectively with \(|A| \geq |B|\), as:
$$\textrm{sim}(A,B) = 1 - \frac{\textrm{min}\{\textrm{lev}(a_1, B), \ldots, \textrm{lev}(a_n, B)\}}{|B|}$$
where \(\{a_1, \ldots, a_n\}\) is the set of all substrings of \(A\) and \(\textrm{lev}(a_x, B)\) is the Levenshtein distance between \(a_x\) and \(B\). Thus \(\textrm{sim}(A,B) = 1\) exactly when \(B\) occurs verbatim within \(A\).
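The definition above can be sketched in Python as follows. This is a minimal illustration, not the PoeTree implementation: the function names `min_substring_distance` and `sim` are assumptions, and instead of enumerating every substring of \(A\) the sketch uses the standard approximate-substring-matching variant of the Levenshtein dynamic program (an alignment of \(B\) may start at any position in \(A\) at no cost), which yields the same minimum.

```python
def min_substring_distance(a: str, b: str) -> int:
    """Smallest Levenshtein distance between b and any substring of a."""
    # prev[i] = cost of aligning b[:i] with a substring of a that ends
    # at the current position; before reading any of a, that cost is i.
    prev = list(range(len(b) + 1))
    best = prev[-1]  # aligning b with the empty substring costs len(b)
    for ch in a:
        # The empty prefix of b matches anywhere in a at zero cost,
        # which is what lets the match start at any position in a.
        curr = [0]
        for i, bc in enumerate(b, start=1):
            curr.append(min(
                prev[i] + 1,               # extra character in a
                curr[i - 1] + 1,           # character of b left unmatched
                prev[i - 1] + (ch != bc),  # match or substitution
            ))
        best = min(best, curr[-1])  # the match may end at any position
        prev = curr
    return best


def sim(a: str, b: str) -> float:
    """Similarity as defined above; the shorter text plays the role of B."""
    if len(a) < len(b):
        a, b = b, a
    if not b:
        raise ValueError("similarity is undefined for an empty text")
    return 1 - min_substring_distance(a, b) / len(b)
```

For example, `sim("hello world", "wurld")` is 0.8 here, because the closest substring, "world", is one substitution away from a text of five characters.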
In PoeTree we keep a list of the 20 most similar texts for each poem under the key neighbors.
The deduplication itself is done as follows:
For each author in each corpus, construct an undirected graph whose nodes represent the author's poems and in which an edge exists between \(A\) and \(B\) if \(\textrm{sim}(A, B) > 0.75\) (an empirically set threshold above which poems are considered duplicates).
For each component of each graph, mark one of its nodes as the primary variant and the rest as its duplicates as follows:
  if the component is complete:
    limit the primary variant candidates to the poems with the highest number of lines
    if multiple candidates remain and the year of creation/publication is known for all of them, limit the candidate set to the earliest ones
    if multiple candidates remain, select the primary variant at random
  else if the component is a star and the central node is a poem with the highest number of lines: mark the central node as the primary variant
  else: determine the primary variant manually
If a poem is found this way to be a duplicate of another one, the id of its primary variant is stored under the key duplicate.
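The procedure above can be sketched for a single author as follows. This is a hedged illustration under stated assumptions, not the project's code: the input layout (a dict of poem ids to `n_lines` and `year`, plus a dict of precomputed pairwise similarities) and the function names `components`, `pick_primary` and `deduplicate` are hypothetical. A component with a single poem counts as complete, and a tie for the highest number of lines still lets the central node of a star qualify.

```python
import random


def components(nodes, edges):
    """Return (list of connected components, adjacency dict)."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:  # depth-first traversal of one component
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps, adj


def pick_primary(comp, adj, poems, rng=random):
    """Return the id of the primary variant, or None if manual review is needed."""
    k = len(comp)
    degree = {p: len(adj[p]) for p in comp}
    max_lines = max(poems[p]["n_lines"] for p in comp)
    if all(d == k - 1 for d in degree.values()):  # complete component
        cands = [p for p in comp if poems[p]["n_lines"] == max_lines]
        if len(cands) > 1 and all(poems[p]["year"] is not None for p in cands):
            earliest = min(poems[p]["year"] for p in cands)
            cands = [p for p in cands if poems[p]["year"] == earliest]
        return rng.choice(sorted(cands))
    # Star: exactly one centre linked to every other node, all others are leaves.
    centres = [p for p in comp if degree[p] == k - 1]
    if len(centres) == 1 and all(degree[p] == 1 for p in comp if p != centres[0]):
        if poems[centres[0]]["n_lines"] == max_lines:
            return centres[0]
    return None


def deduplicate(poems, sims, threshold=0.75, rng=random):
    """Return ({duplicate id: primary id}, [components left for manual review])."""
    edges = [pair for pair, s in sims.items() if s > threshold]
    comps, adj = components(poems, edges)
    duplicate_of, manual = {}, []
    for comp in comps:
        primary = pick_primary(comp, adj, poems, rng)
        if primary is None:
            manual.append(comp)
        else:
            duplicate_of.update({p: primary for p in comp if p != primary})
    return duplicate_of, manual
```

In this sketch the returned mapping corresponds to what would be stored under the key duplicate, while components that are neither complete nor a qualifying star are only collected, since the source resolves them by hand.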
Supported by the Czech Science Foundation (GA23-07727S)