Similarity detection based on document matrix model and edit distance algorithm
Authors:
- Artur Niewiarowski
Abstract
This paper presents a new algorithm with an objective of analyzing the similarity measure between two text documents. Specifically, the main idea of the implemented method is based on the structure of the so-called "edit distance matrix" (similarity matrix). Elements of this matrix are filled with a formula based on Levenshtein distances between sequences of sentences. The Levenshtein distance algorithm (LDA) is used as a replacement for various implementations of stemming or lemmatization methods. Additionally, the proposed algorithm is fast, precise, and may be implemented for analyzing very large documents (e.g., books, diploma works, newspapers, etc.). Moreover, it seems to be versatile for the most common European languages such as Polish, English, German, French and Russian. The presented tool is intended for all employees and students of the university to detect the level of similarity regarding analyzed documents. Results obtained in the paper were confirmed in the tests shown in the article.
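The idea of a sentence-level similarity matrix built from Levenshtein distances can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the formula used to fill the matrix, so the normalization `1 - distance / max(length)` is an assumption, as are the naive regex sentence splitter and all function names (`levenshtein`, `similarity`, `split_sentences`, `similarity_matrix`).

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(s: str, t: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s, t) / longest


def split_sentences(text: str) -> list[str]:
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.lower() for p in parts if p]


def similarity_matrix(doc_a: str, doc_b: str) -> list[list[float]]:
    """Rows are sentences of doc_a, columns are sentences of doc_b."""
    rows = split_sentences(doc_a)
    cols = split_sentences(doc_b)
    return [[similarity(r, c) for c in cols] for r in rows]


if __name__ == "__main__":
    matrix = similarity_matrix("The cat sat. Dogs bark.",
                               "The cat sits. Birds sing.")
    for row in matrix:
        print([round(value, 2) for value in row])
```

Because the comparison works on characters, inflected forms such as "sat" and "sits" still score as close without any stemming or lemmatization step, which is the role the abstract assigns to the Levenshtein distance. How the matrix entries are aggregated into a single document-level score is left out here, since the abstract does not describe it.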
- Record ID
- CUT8cb55a7a15ca4ab29f3949bf5afbcfbe
- Journal series
- Computer Assisted Methods in Engineering and Science, ISSN 2299-3649
- Issue year
- 2019
- Vol
- 26
- No
- 3-4
- Pages
- 163-175
- Other elements of collation
- tables; charts; Bibliography (on pp.) - 174-175; Bibliography (number of items) - 18; Abstract designation - abstract in English; Numbering in journal - Vol. 26, No. 3-4
- Keywords in English
- plagiarism detection, plagiarism system, edit distance, Levenshtein distance, similarity measure, text mining, information retrieval
- DOI
- DOI:10.24423/cames.277
- URL
- https://cames.ippt.pan.pl/index.php/cames/article/view/277
- Language
- eng (en) English
- Score (nominal)
- 70
- Uniform Resource Identifier
- https://cris.pk.edu.pl/info/article/CUT8cb55a7a15ca4ab29f3949bf5afbcfbe/
- URN
- urn:pkr-prod:CUT8cb55a7a15ca4ab29f3949bf5afbcfbe