Thesis: Evaluating the Degree of Similarity of Vietnamese Texts

Today, with the growth of the Internet, exchanging and sharing documents has become
commonplace, and a growing number of documents such as articles, books, graduation
theses, reports, and projects are digitized and distributed online. Users can find the
information they need quickly and easily. However, alongside the benefit of a rich
source of reference material, “copying” is becoming a serious problem. To help address
it, the question is how to evaluate the degree of similarity between texts and how to
identify the copied content within a text.

Worldwide, research on and applications of natural language processing (NLP) have a
long history and have achieved notable results. In recent years NLP has become a
leading field of science and technology, with widely used applications that bring
substantial benefit to society, such as search, machine translation, information
extraction, text summarization, text mining, the semantic web, and artificial
intelligence, as well as the problem of comparing texts and evaluating their
similarity. Many research results exist on similarity evaluation for English text,
including a number of useful studies and applications, particularly in detecting
“copying” or “plagiarism” [15, 39, 80, 90, 109]. While NLP attracts growing interest
from scientists around the world, research on Vietnamese language processing has not
yet achieved comparably positive results and the published work remains limited, so
contributions from scientists and research groups are much needed to make Vietnamese
processing more effective. Evaluating the similarity of text units, and on that basis
detecting copied content, therefore still poses many challenges that require study. In
particular, because Vietnamese has many distinctive linguistic characteristics,
processing it requires methods and techniques different from those used for other
languages.


MINISTRY OF EDUCATION AND TRAINING
THE UNIVERSITY OF DANANG (ĐẠI HỌC ĐÀ NẴNG)
--- ---
HỒ PHAN HIẾU
EVALUATING THE DEGREE OF SIMILARITY
OF VIETNAMESE TEXTS
Specialization: COMPUTER SCIENCE
Code: 62 48 01 01
DOCTORAL THESIS IN ENGINEERING
Scientific supervisors:
1. Assoc. Prof. Dr. Võ Trung Hùng
2. Dr. Nguyễn Thị Ngọc Anh
Đà Nẵng, 10/2019
DECLARATION
My name is Hồ Phan Hiếu. I declare that this is my own research work. The content
and research results presented in this thesis are truthful, and all references are
cited with their sources clearly indicated in accordance with regulations.
Author
PhD candidate Hồ Phan Hiếu
TABLE OF CONTENTS
DECLARATION
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
INTRODUCTION
1. Problem statement
2. Research objectives
3. Research subjects and scope
4. Research methods
5. Research tasks and results achieved
6. Structure of the thesis
7. Main contributions of the thesis
CHAPTER 1. OVERVIEW OF THE STATE OF RESEARCH
Some concepts used in the thesis
Some characteristics of the Vietnamese language
Overview
Some difficulties and ambiguities in processing Vietnamese text
Text representation models
Introduction
Text representation models
Remarks and evaluation
Methods for computing text similarity
Approaches
The string matching problem
Text comparison and its application to copy detection
Introduction
Issues related to copying
Copy detection at PAN
Conclusion of Chapter 1
CHAPTER 2. TEXT COMPARISON BASED ON THE VECTOR MODEL
Introduction
Computing text similarity in the vector model
Representing text in the vector model
Methods for weighting index terms
Methods for computing similarity
Remarks
Some text comparison methods based on the vector model
Text vectorization model
An improved method using the Cosine measure
Evaluation of the methods based on the vector model
Creating a dataset to evaluate the algorithms
Evaluating the algorithms based on the vector model
Remarks
Conclusion of Chapter 2
CHAPTER 3. TEXT COPY DETECTION BASED ON THE DISCRETE WAVELET TRANSFORM
Problem statement
Problem formulation
Proposed idea
Theoretical basis of the DWT and the Haar filter
Theoretical basis of the DWT
The Haar filter
DNA sequences
Proposed model of the copy detection system
Introduction
Proposed system model for the DWT-based method
Proposed data conversion process
Proposed processing methods and algorithms
Data preprocessing
Digitization process
Algorithm for the Haar filter
Organizing data for the source DNA set
Proposed algorithm for detecting similarity
Encoding the data and computing the DNA of the text under evaluation
Comparison and decision making
Complexity of the similarity detection algorithm
Experimental results of the DWT-based method
Experimental data
Experimental results
Evaluation
Conclusion of Chapter 3
CHAPTER 4. DEVELOPING A COPY DETECTION SYSTEM FOR VIETNAMESE TEXT
System description
Purpose
User groups
General model
Building a Vietnamese text data warehouse
Introduction
Architecture of the data warehouse system
Solution for building the data warehouse
Evaluation of the data warehouse
Deploying the text copy detection system
Proposed development directions for handling big data
Introduction
Proposed processing solution
Proposed method for representing DNA with tensors
Conclusion of Chapter 4
CONCLUSION AND FUTURE WORK
1. Conclusion
2. Future work
LIST OF PUBLISHED SCIENTIFIC WORKS
REFERENCES
LIST OF ABBREVIATIONS
CGs Conceptual Graphs
DM Data Mart
(local data warehouse)
DNA DeoxyriboNucleic Acid
(DNA sequence)
DW Data Warehouse
DWT Discrete Wavelet Transform
GA Genetic Algorithms
IDF Inverse Document Frequency
LSI Latent Semantic Indexing
NDD Near Duplicate Detection
NLP Natural Language Processing
PAN Plagiarism Analysis, Authorship Identification, and Near-Duplicate
detection
(annual international conference on plagiarism)
SVD Singular Value Decomposition
TF Term Frequency
CSDL Cơ sở dữ liệu
(database)
ĐHĐN Đại học Đà Nẵng
(The University of Danang)
LIST OF TABLES
Table 1.1. Methods and algorithms for evaluating text similarity
Table 1.2. Comparison and evaluation of some string matching algorithms
Table 1.3. Some text copy detection methods
Table 1.4. Results of the first-ranked teams in the EPD task
Table 2.1. Sample documents for comparison with the estimated values
Table 2.2. Summary of the results of the methods
Table 3.1. Summary comparison of the corpora at the PAN competitions
Table 3.2. Values set for the experiments
Table 3.3. Experimental results
Table 4.1. Number of test documents added to the data warehouse
LIST OF FIGURES
Figure 1.1. Relationship between and
Figure 1.2. The text modeling process
Figure 1.3. General processing model for copy detection [124]
Figure 2.1. Vector model forming the Term/Document weight matrix
Figure 2.2. Example of the angle formed by two vectors d1 and d2
Figure 2.3. Vectorization process at the word level
Figure 2.4. Vectorization process at the sentence level
Figure 2.5. Chart comparing the algorithm results on the document set
Figure 2.6. Chart comparing texts at the word and sentence levels
Figure 3.1. Description of the processing used to detect text copying
Figure 3.2. Multiresolution analysis using the DWT
Figure 3.3. Signal path through the DWT [50]
Figure 3.4. The Haar wavelet waveform
Figure 3.5. Proposed model of the text similarity detection system
Figure 3.6. Processing steps for evaluating the text to be checked
Figure 3.7. Model for creating the Vietnamese test dataset
Figure 3.8. prec and rec values obtained at different thresholds
Figure 3.9. Experimental results with threshold ε = 10^-11
Figure 3.10. Result interface of one experimental run
Figure 4.1. Copy detection process
Figure 4.2. Detailed architecture of the data warehouse system
Figure 4.3. Process for building the data warehouse
Figure 4.4. Process for handling and adding documents to the data warehouse
Figure 4.5. Interface of the experimental system
Figure 4.6. Model for detecting and marking similar content
Figure 4 ...
[65] Li, Y., Bandar, Z., McLean, D., and O'shea, J., "A Method for Measuring
Sentence Similarity and Its Application to Conversational Agents", in
FLAIRS Conference, 2004, pp. 820-825.
[66] Li, Y., McLean, D., Bandar, Z. A., and Crockett, K., "Sentence similarity
based on semantic nets and corpus statistics", IEEE Transactions on
Knowledge & Data Engineering, pp. 1138-1150, 2006.
[67] Li, Y. and Ngom, A., "Classification of clinical gene-sample-time microarray
expression data via tensor decomposition methods", in International Meeting
on Computational Intelligence Methods for Bioinformatics and Biostatistics,
Springer, 2010, pp. 275-286.
[68] Liang, C.-W. and Chen, P.-Y., "DWT based text localization", International
Journal of Applied Science and Engineering, vol. 2, pp. 105-116, 2004.
[69] Lin, C.-Y., "Rouge: A package for automatic evaluation of summaries", in
Text Summarization Branches Out: Proceedings of the ACL-04 Workshop,
Barcelona, Spain, 2004, pp. 74-81.
[70] Lin, D., "An information-theoretic definition of similarity", in Proceedings of
the Fifteenth International Conference on Machine Learning (ICML),
Madison, Wisconsin, USA, 1998, pp. 296-304.
[71] Liu, N., Zhang, B., Yan, J., Chen, Z., Liu, W., Bai, F., et al., "Text
representation: From vector to tensor", in Fifth IEEE International Conference
on Data Mining (ICDM’05), IEEE, 2005, pp. 725-728.
[72] Ljubesic, N., Boras, D., Bakaric, N., and Njavro, J., "Comparing measures of
semantic similarity", in Proceedings of the 30th International Conference on
Information Technology Interfaces (ITI2008), IEEE, 2008, pp. 675-682.
[73] Lyon, C., Barrett, R., and Malcolm, J., "Plagiarism is easy, but also easy to
detect", pp. 57-65, 2006.
[74] M K, V. and K, K., "A Survey on Similarity Measures in Text Mining",
Machine Learning and Applications: An International Journal (MLAIJ), vol.
3, pp. 19-28, 2016.
[75] Majumder, G., Pakray, P., Gelbukh, A., and Pinto, D., "Semantic textual
similarity methods, tools, and applications: A survey", Computación y
Sistemas, vol. 20, pp. 647-665, 2016.
[76] Mallat, S. G., "A theory for multiresolution signal decomposition: the wavelet
representation", IEEE transactions on pattern analysis and machine
intelligence, vol. 11, pp. 674-693, 1989.
[77] Manku, G. S., Jain, A., and Das Sarma, A., "Detecting near-duplicates for web
crawling", in Proceedings of the 16th International Conference on World Wide
Web, ACM, 2007, pp. 141-150.
[78] Matsuo, Y. and Ishizuka, M., "Keyword extraction from a single document
using word co-occurrence statistical information", International Journal on
Artificial Intelligence Tools, vol. 13, pp. 157-169, 2004.
[79] Melamed, I. D., Green, R., and Turian, J. P., "Precision and recall of machine
translation", in Companion Volume of the Proceedings of HLT-NAACL 2003-
Short Papers, San Francisco, CA, USA, Morgan Kaufmann, 2003, pp. 61-63.
[80] Meuschke, N. and Gipp, B., "State-of-the-art in detecting academic
plagiarism", International Journal for Educational Integrity, vol. 9, pp. 50-71,
2013.
[81] Michailidis, P. D. and Margaritis, K. G., "On-line string matching algorithms:
Survey and experimental results", International Journal of Computer
Mathematics, vol. 76, pp. 411-434, 2001.
[82] Miller, G. and Fellbaum, C., Wordnet: An electronic lexical database, MIT
Press Cambridge, 1998.
[83] Mozgovoy, M., Kakkonen, T., and Sutinen, E., "Using natural language
parsers in plagiarism detection", in Workshop on Speech and Language
Technology in Education (SLaTE'07), Farmington, PA, USA, 2007, pp. 77-79.
[84] Nahnsen, T., Uzuner, O., and Katz, B., "Lexical chains and sliding locality
windows in content-based text similarity detection", in Companion Volume to
the Proceedings of Conference including Posters/Demos and tutorial
abstracts, AIM, 2005, pp. 150-154.
[85] Nguyen, L. T., Toan, N. X., and Dien, D., "Vietnamese plagiarism detection
method", in Proceedings of the Seventh Symposium on Information and
Communication Technology, ACM, 2016, pp. 44-51.
[86] Nguyen, N. A. T., Yang, H.-J., and Kim, S., "HOKF: High Order Kalman
Filter for Epilepsy Forecasting Modeling", Biosystems, vol. 158, pp. 57-67,
2017.
[87] Niwattanakul, S., Singthongchai, J., Naenudorn, E., and Wanapu, S., "Using
of Jaccard coefficient for keywords similarity", in Proceedings of the
International MultiConference of Engineers and Computer Scientists, Hong
Kong, 2013, pp. 380-384.
[88] Okada, I., Saito, M., Oida, Y., Yamato, H., Hiekata, K., and Miura, S.,
"Development of the Method for the Appropriate Selection of the Successor
by Applying Metadata to the Standardization Reports and Members", in Joint
International Semantic Technology Conference, Springer, 2012, pp. 255-266.
[89] Olkkonen, J. T., Discrete Wavelet Transforms-Theory and Application,
InTechOpen, 2011.
[90] Osman, A. H., Salim, N., and Abuobieda, A., "Survey of text plagiarism
detection", Computer Engineering and Applications Journal (ComEngApp),
vol. 1, pp. 37-45, 2012.
[91] Osman, A. H., Salim, N., Binwahlan, M. S., Alteeb, R., and Abuobieda, A.,
"An improved plagiarism detection scheme based on semantic role labeling",
Applied Soft Computing, vol. 12, pp. 1493-1502, 2012.
[92] Paliwal, S., Singh, R. S., and Mandoria, H., "A Survey on various text
detection and extraction techniques from videos and images", International
Journal of Computer Science Engineering and Information Technology
Research (IJCSEITR), vol. 6, pp. 1-10, 2016.
[93] Park, B. K. and Song, I. Y., "Toward total business intelligence incorporating
structured and unstructured data", in Proceedings of the 2nd International
Workshop on Business intelligencE and the WEB, ACM, 2011, pp. 12-19.
[94] Park, L. A., Ramamohanarao, K., and Palaniswami, M., "A novel document
retrieval method using the discrete wavelet transform", ACM Transactions on
Information Systems (TOIS), vol. 23, pp. 267-298, 2005.
[95] Pataki, M., "Distributed similarity and plagiarism search", in Proceedings of
the Automation and Applied Computer Science Workshop, Budapest,
Hungary, 2006, pp. 121-130.
[96] Phan, A. H. and Cichocki, A., "Tensor decompositions for feature extraction
and classification of high dimensional datasets", Nonlinear theory and its
applications, IEICE, vol. 1, pp. 37-68, 2010.
[97] Philip, S., Shola, P., and Ovye, A., "Application of content-based approach in
research paper recommendation system for a digital library", International
Journal of Advanced Computer Science and Applications, vol. 5, 2014.
[98] Phương, N. H. and Sơn, V. M., "Tách ảnh dùng biến đổi Wavelet và phân tích
thành phần độc lập", Tạp chí Phát triển Khoa học và Công nghệ, vol. 11, pp.
5-16, 2008.
[99] Popivanov, I., "Efficient similarity queries over time series data using
wavelets", in Proceedings of the 18th International Conference on Data
Engineering, San Jose, Calif, USA, 2002, pp. 273-282.
[100] Potthast, M., et al, "Overview of the 1st International Competition on
Plagiarism Detection", In Stein, B., et al (Ed), PAN’09, pp. 1-9, 2009.
[101] Potthast, M., Hagen, M., Gollub, T., Tippmann, M., Kiesel, J., Rosso, P., et
al., "Overview of the 5th International Competition on Plagiarism Detection",
in CLEF (Working Notes), 2013.
[102] Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., and Rosso, P.,
"Overview of the 2nd International Competition on Plagiarism Detection",
Proceedings of PAN at CLEF, 2010.
[103] Rahman, M. and Chow, T. W., "Content-based hierarchical document
organization using multi-layer hybrid network and tree-structured features",
Expert Systems with Applications, vol. 37, pp. 2874-2881, 2010.
[104] Rahman, M., Yang, W. P., Chow, T. W., and Wu, S., "A flexible multi-layer
self-organizing map for generic processing of tree-structured data", Pattern
Recognition, vol. 40, pp. 1406-1424, 2007.
[105] Raviraj, P. and Sanavullah, M., "The modified 2D-Haar Wavelet
Transformation in image compression", Middle-East Journal of Scientific
Research, vol. 2, pp. 73-78, 2007.
[106] Reddy, G. S., Rajinikanth, T., and Rao, A. A., "Clustering and Classification
of Text Documents Using Improved Similarity Measure", International
Journal of Computer Science and Information Security, vol. 14, p. 39, 2016.
[107] Řehůřek, R., "Semantic-based plagiarism detection", Masaryk University,
2008.
[108] Ritter, H. and Kohonen, T., "Self-organizing semantic maps", Biological
cybernetics, vol. 61, pp. 241-254, 1989.
[109] Rubini, P. and Leela, M. S., "A survey on plagiarism detection in text mining",
International Journal of Research in Computer Applications and Robotics,
vol. 1, pp. 117-119, 2013.
[110] Runeson, P., Alexandersson, M., and Nyholm, O., "Detection of duplicate
defect reports using natural language processing", in Proceedings of the 29th
International Conference on Software Engineering (ICSE'07), IEEE, 2007,
pp. 499-510.
[111] Salton, G., Automatic text processing: The transformation, analysis, and
retrieval of information by computer, 1989.
[112] Salton, G., "Developments in automatic text retrieval", Science, vol. 253, pp.
974-980, 1991.
[113] Salton, G. and Buckley, C., "Term-weighting approaches in automatic text
retrieval", Information Processing & Management, vol. 24, pp. 513-523, 1988.
[114] Salton, G., Wong, A., and Yang, C. S., "A vector space model for automatic
indexing", Communications of the ACM, vol. 18, pp. 613-620, 1975.
[115] Si, A., Leong, H. V., and Lau, R. W., "Check: a document plagiarism detection
system", in Proceedings of the 1997 ACM Symposium on Applied Computing,
ACM, 1997, pp. 70-77.
[116] Sidorov, G., Gelbukh, A., Gómez-Adorno, H., and Pinto, D., "Soft similarity
and soft cosine measure: Similarity of features in vector space model",
Computación y Sistemas, vol. 18, pp. 491-504, 2014.
[117] Singh Choudhry, M., Kapoor, R., Abhishek, Gupta, A., and Bharat, B., A
survey on different discrete wavelet transforms and thresholding techniques
for EEG denoising, 2016.
[118] Singla, N. and Garg, D., "String matching algorithms and their applicability in
various applications", International Journal of Soft Computing and
Engineering, vol. 1, pp. 218-222, 2012.
[119] Sorokina, D., Gehrke, J., Warner, S., and Ginsparg, P., "Plagiarism detection
in arXiv", in Proceedings of the 6th International Conference on Data Mining
(ICDM'06), IEEE, 2006, pp. 1070-1075.
[120] Sowa, J. F., "Conceptual graphs for a data base interface", IBM Journal of
Research and Development, vol. 20, pp. 336-357, 1976.
[121] Stanković, R. S. and Falkowski, B. J., "The Haar wavelet transform: its status
and achievements", Computers & Electrical Engineering, vol. 29, pp. 25-44,
2003.
[122] Stein, B., Barrón Cedeño, L. A., Eiselt, A., Potthast, M., and Rosso, P.,
"Overview of the 3rd International Competition on Plagiarism Detection", in
CEUR Workshop Proceedings, 2011.
[123] Stein, B., Rosso, P., Stamatatos, E., Koppel, M., and Agirre, E., "3rd PAN
workshop on uncovering plagiarism, authorship and social software misuse",
in 25th Annual Conference of the Spanish Society for Natural Language
Processing (SEPLN), San Sebastian, Spain, 2009, pp. 1-77.
[124] Stein, B., zu Eissen, S. M., and Potthast, M., "Strategies for retrieving
plagiarized documents", in Proceedings of the 30th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval,
ACM, 2007, pp. 825-826.
[125] Su, Z., Ahn, B.-R., Eom, K.-Y., Kang, M.-K., Kim, J.-P., and Kim, M.-K.,
"Plagiarism detection using the Levenshtein distance and Smith-Waterman
algorithm", in Proceedings of the 3rd International Conference on Innovative
Computing Information and Control (ICICIC'08), IEEE, 2008, pp. 569-569.
[126] Tang, X. L., Wang, X. R., and Wang, M., "Text Summarization Using Hybrid
Parallel Genetic Algorithm", in Advanced Materials Research, Trans Tech
Publ, 2011, pp. 1073-1076.
[127] Taufin M Jeeralbhavi, D. J. D. P., Shivananda V. Seeri, "Text Extraction and
Localization From Captured Images", International Journal on Recent and
Innovation Trends in Computing and Communication (IJRITCC), vol. 4, pp.
119-121, 2016.
[128] Tesar, R., Strnad, V., Jezek, K., and Poesio, M., "Extending the single words-
based document model: a comparison of bigrams and 2-itemsets", in
Proceedings of the 2006 ACM Symposium on Document Engineering, ACM,
2006, pp. 138-146.
[129] Thụy, H. Q., Giáo trình Khai phá Dữ liệu Web, Giáo dục Việt Nam, 2009.
[130] Toi, N. X., Hung, N. V., and Son, P. B., "A unified plagiarism detection
framework", VNU Journal of Science: Mathematics-Physics, vol. 27, pp. 55-
62, 2011.
[131] Torres, S. and Gelbukh, A., "Comparing similarity measures for original WSD
lesk algorithm", Research in Computing Science, vol. 43, pp. 155-166, 2009.
[132] Vidakovic, B., Statistical modeling by wavelets, John Wiley & Sons, 2009.
[133] Wahlstrom, S., "Evaluation of String Searching Algorithms", in IDT Mini-
conference on Interesting Results in Computer Science and Engineering,
2004.
[134] Weber-Wulff, D., Möller, C., Touras, J., and Zincke, E., "Plagiarism detection
software test 2013", 2013.
[135] Xexéo, G., de Souza, J., Castro, P. F., and Pinheiro, W. A., "Using wavelets
to classify documents", in 2008 IEEE/WIC/ACM International Conference on
Web Intelligence and Intelligent Agent Technology, IEEE, 2008, pp. 272-278.
[136] Xiao, C., Wang, W., Lin, X., Yu, J. X., and Wang, G., "Efficient similarity
joins for near-duplicate detection", Proceedings of the ACM International
Conference on Management of Data, pp. 1033-1044, 2011.
[137] Yang, H. and Callan, J., "Near-duplicate detection by instance-level
constrained clustering", in Proceedings of the 29th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval,
ACM, 2006, pp. 421-428.
[138] Zhang, J., Sun, X., and Wang, J., "Semantic Keyword-based Text Copy
Detection Method", Advanced Science and Technology Letters, vol. 49, pp.
253-261, 2014.
[139] Zimmermann, H. J., Fuzzy set theory and its applications, Kluwer Academic
Publishers, Boston/Dordrecht/London, 2001.
[140] Zini, M., Fabbri, M., Moneglia, M., and Panunzi, A., "Plagiarism detection
through multilevel text comparison", in Proceedings of the Second
International Conference on Automated Production of Cross Media Content
for Multi-Channel Distribution (AXMEDIS'06), IEEE, 2006, pp. 181-185.
[141] Zou, D., Long, W.-J., and Ling, Z., "A cluster-based plagiarism detection
method", in Notebook Papers of CLEF 2010 LABs and Workshops, 2010.
