DIRECT AND INVERSE PROBLEMS OF INFORMATION RETRIEVAL OF TEXT DOCUMENTS

W.S. Rogoza, G.V. Ishchenko

Èlektron. model. 2024, 46(6):08-28

https://doi.org/10.15407/emodel.46.06.008

ABSTRACT

Information retrieval on the Web, in databases, and in other sources of text documents includes tasks that require analyzing the relationships between documents and their constituent elements. Establishing such relationships allows search engine users to retrieve the documents they need from vast sources of information using concise search requests, and to choose effective methods for processing the retrieved documents when analyzing their content. A classification of methods for processing text documents using forward and reverse indices is proposed, which makes it possible to generalize the properties of document search and processing methods.

Elementary examples of the application of these methods are given. They allow the reader to grasp quickly the essence of the issues discussed in the article and to better understand the principles on which these methods are built and their suitability for specific information retrieval tasks.

KEYWORDS

information retrieval, text documents, models of direct and reverse document indexing.

