W.S. Rogoza, G.V. Ishchenko
Èlektron. model. 2024, 46(6):08-28
https://doi.org/10.15407/emodel.46.06.008
ABSTRACT
Information retrieval on the Web, in databases, and in other sources of text documents includes tasks that require analyzing the relationships between documents and their constituent elements. Establishing such relationships allows search engine users to retrieve the documents they need from vast sources of information using concise search requests, and to choose effective methods for processing the retrieved documents when analyzing their content. A classification of methods for processing text documents using forward and reverse indices is proposed, which makes it possible to generalize the properties of document search and processing methods.
Elementary examples of applying the methods are given, which allow the reader to quickly grasp the essence of the issues discussed in the article and to better understand the principles behind these methods and their suitability for specific information retrieval tasks.
KEYWORDS
information retrieval, text documents, models of direct and reverse document indexing.