Similarity detection based on document matrix model and edit distance algorithm

Niewiarowski, Artur

doi:10.24423/cames.277


Authors:

Artur Niewiarowski

Abstract

This paper presents a new algorithm with an objective of analyzing the similarity measure between two text documents. Specifically, the main idea of the implemented method is based on the structure of the so-called "edit distance matrix" (similarity matrix). Elements of this matrix are filled with a formula based on Levenshtein distances between sequences of sentences. The Levenshtein distance algorithm (LDA) is used as a replacement for various implementations of stemming or lemmatization methods. Additionally, the proposed algorithm is fast, precise, and may be implemented for analyzing very large documents (e.g., books, diploma works, newspapers, etc.). Moreover, it seems to be versatile for the most common European languages such as Polish, English, German, French and Russian. The presented tool is intended for all employees and students of the university to detect the level of similarity regarding analyzed documents. Results obtained in the paper were confirmed in the tests shown in the article.
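
The general idea of a sentence-level similarity matrix built from Levenshtein distances can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the abstract does not state the formula used to fill the matrix, so the normalization below (one minus the distance divided by the longer sentence length) is an assumption, and the function names levenshtein, normalized_similarity and similarity_matrix are hypothetical.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for maximal distance."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def similarity_matrix(doc_a: List[str], doc_b: List[str]) -> List[List[float]]:
    """Entry [i][j] compares sentence i of doc_a with sentence j of doc_b."""
    return [[normalized_similarity(s, t) for t in doc_b] for s in doc_a]


if __name__ == "__main__":
    first = ["The student studies algorithms.", "Results were confirmed."]
    second = ["The students studied algorithms.", "An unrelated sentence."]
    for row in similarity_matrix(first, second):
        print([round(value, 2) for value in row])
```

Because the edit distance between inflected forms such as "studies" and "studied" is small, near-identical sentences still score close to 1.0 without any stemming or lemmatization step, which is the role the abstract assigns to the Levenshtein distance.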

Metrics

Record ID: CUT8cb55a7a15ca4ab29f3949bf5afbcfbe
Publication categories: scientific article/chapter; reviewed work
Author: Artur Niewiarowski, Department of Computer Science (F/F-1), Faculty of Computer Science and Telecommunications (F)
Journal series: Computer Assisted Methods in Engineering and Science, ISSN 2299-3649
Issue year: 2019
Vol: 26
No: 3-4
Pages: 163-175
Other elements of collation: tables; charts; Bibliography (on pages): 174-175; Bibliography (number of entries): 18; Abstract designation: English abstract; Numbering in journal: Vol. 26, No. 3-4
Keywords in English: plagiarism detection, plagiarism system, edit distance, Levenshtein distance, similarity measure, text mining, information retrieval
DOI: 10.24423/cames.277
URL: https://cames.ippt.pan.pl/index.php/cames/article/view/277
Language: eng (en) English
License: Open licence other than CC
Score (nominal): 70

Cite

Uniform Resource Identifier: https://cris.pk.edu.pl/info/article/CUT8cb55a7a15ca4ab29f3949bf5afbcfbe/

URN: urn:pkr-prod:CUT8cb55a7a15ca4ab29f3949bf5afbcfbe



Knowledge base: Cracow University of Technology
