DIRECT AND INVERSE PROBLEMS OF INFORMATION RETRIEVAL OF TEXT DOCUMENTS

W.S. Rogoza, G.V. Ishchenko

Èlektron. model. 2024, 46(6):08-28

https://doi.org/10.15407/emodel.46.06.008

ABSTRACT

Information retrieval on the Web, in databases, and in other sources of text documents includes tasks that require analyzing the relationships between documents and their constituent elements. Establishing such relationships allows search engine users to retrieve the documents they need from very large information sources using concise search requests, and to choose effective methods for processing the retrieved documents when analyzing their content. A classification of methods for processing text documents using forward and reverse indices is proposed, which makes it possible to generalize the properties of document search and processing methods.

Elementary examples of applying the methods are given so that the reader can quickly grasp the essence of the issues discussed in the article and better understand the principles on which these methods are built and their suitability for specific information retrieval tasks.

KEYWORDS

information retrieval, text documents, models of direct and reverse document indexing.
