About: HTTP 404 is a research topic. Over the lifetime, 6 publications have been published within this topic receiving 44 citations. The topic is also known as: 404 response code & error 404.
TL;DR: The study found that only 18.91% (1290 out of 6820) of the URLs cited in articles of two Indian LIS journals published between 2002 and 2010 were extracted, and that the half-life of URL citations increased from 6.33 years to 13.85 years after missing URLs were recovered from the Wayback Machine.
TL;DR: In this article, a system and related techniques monitor a user's attempt to access a Web site or other network site, and detect failed access attempts such as HTTP 404 messages or others.
Abstract: A system and related techniques monitor a user's attempt to access a Web site or other network site, and detect failed access attempts such as HTTP 404 messages or others. Rather than pass the access failure message directly through to the user, the system may communicate with a search service or other index of stored or cached Web pages or other content images. The user may be given a choice via a dialogue to view a stored version of the site they were attempting to access, so that some or all of the desired information may still be accessed. In embodiments, the user may be directed to differing sources of the identical or similar desired content, via a priority stack. In further embodiments, the operator of the Web site or other content source may choose to apply a cached content override to opt out of making stored content available to searchers or other users, for example for digital rights management purposes.
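The fallback behaviour described in this abstract can be illustrated with a minimal sketch. It is not the patented system: the names `Source`, `resolve`, `FAILURE_STATUSES`, and `opted_out` are hypothetical, the priority stack is modelled as an ordered list, and the operator override is modelled as a simple set of URLs. No network access is performed; each source is just a lookup function.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple

# Assumption: only 404 is treated as a failed access here; the abstract
# also mentions "others" without naming them.
FAILURE_STATUSES = {404}


@dataclass
class Source:
    """A hypothetical store of cached or archived copies."""
    name: str
    lookup: Callable[[str], Optional[str]]  # returns stored content or None


def resolve(
    url: str,
    status: int,
    sources: Iterable[Source],
    opted_out: frozenset = frozenset(),
) -> Optional[Tuple[str, str]]:
    """Return (source name, stored content) for a failed request, else None."""
    if status not in FAILURE_STATUSES:
        return None  # no failure detected: pass the original response through
    if url in opted_out:
        return None  # operator applied a cached-content override
    for source in sources:  # sources are consulted in priority order
        content = source.lookup(url)
        if content is not None:
            return source.name, content
    return None  # nothing stored anywhere: the failure message stands


if __name__ == "__main__":
    cache = Source("search-cache", {"http://example.com/a": "cached A"}.get)
    archive = Source("archive", {"http://example.com/b": "archived B"}.get)
    print(resolve("http://example.com/b", 404, [cache, archive]))
```

In a fuller version, the returned pair would feed the dialogue that offers the user the stored copy rather than being displayed directly.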
TL;DR: The present study ascertains the proportion of missing web references in 5-10 year-old research papers from five leading open access journals in library and information science; the results suggest that the number of web citations has increased and that the share of missing web citations grows with each passing year.
Abstract: The present study attempts to ascertain the proportion of missing web references of 5-10 year-old research papers of the five leading open access (OA) journals in library and information science. The results suggest that the number of web citations has increased from 41.60% of all citations in 1998 to 53.32% in 2002. But a substantial quantity of web citations (32.09%) was found to be missing. The percentage of missing web citations goes on increasing with each passing year – ten-year-old publications having the highest number of missing citations, i.e., 39.96% and five-year-old publications having the lowest number of missing citations (25.89%). 0.92% of citations had moved to a new URL address and 74.14% of missing citations resulted in an HTTP 404 (page not found) error.
TL;DR: This paper proposes augmenting these binary responses with a model for selecting and ranking recommended web pages in a Web archive, enhancing both HTTP 404 responses and HTTP 200 responses by surfacing web pages in the archive that the user may not know existed.
Abstract: When a user requests a web page from a web archive, the user will typically either get an HTTP 200 if the page is available, or an HTTP 404 if the web page has not been archived. This is because web archives are typically accessed by Uniform Resource Identifier (URI) lookup, and the response is binary: the archive either has the page or it does not, and the user will not know of other archived web pages that exist and are potentially similar to the requested web page. In this paper, we propose augmenting these binary responses with a model for selecting and ranking recommended web pages in a Web archive. This is to enhance both HTTP 404 responses and HTTP 200 responses by surfacing web pages in the archive that the user may not know existed. First, we check if the URI is already classified in DMOZ or Wikipedia. If the requested URI is not found, we use machine learning to classify the URI using DMOZ as our ontology and collect candidate URIs to recommend to the user. The classification is in two parts, a first-level classification and a deep classification. Next, we filter the candidates based on whether they are present in the archive. Finally, we rank candidates based on several features, such as archival quality, web page popularity, temporal similarity, and URI similarity. We calculated the F1 score for different methods of classifying the requested web page at the first level. We found that using all-grams from the URI after removing numerals and the top-level domain (TLD) produced the best result with F1 = 0.59. For the deep-level classification, we measured the accuracy at each classification level. For second-level classification, the micro-average F1 = 0.30 and for third-level classification, F1 = 0.15. We also found that 44.89% of the correctly classified URIs contained at least one word that exists in a dictionary and 50.07% of the correctly classified URIs contained long strings in the domain.
In comparison with the URIs from our Wayback access logs, only 5.39% of those URIs contained only words from a dictionary, and 26.74% contained at least one word from a dictionary. These percentages are low and may affect the ability for the requested URI to be correctly classified.
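The final two steps of the pipeline in this abstract, filtering candidates by presence in the archive and then ranking them, can be sketched as follows. This is an illustration under stated assumptions, not the authors' model: the `Candidate` class, the `rank_candidates` function, the feature values in [0, 1], and the equal weights are all hypothetical, since the abstract names the features but not how they are combined.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Set


@dataclass
class Candidate:
    """A candidate URI with the four features named in the abstract."""
    uri: str
    archival_quality: float
    popularity: float
    temporal_similarity: float
    uri_similarity: float


def rank_candidates(
    candidates: Iterable[Candidate],
    archived_uris: Set[str],
    weights: Sequence[float] = (0.25, 0.25, 0.25, 0.25),  # assumed equal weights
) -> List[Candidate]:
    """Drop candidates missing from the archive, then sort by weighted score."""
    present = [c for c in candidates if c.uri in archived_uris]

    def score(c: Candidate) -> float:
        features = (
            c.archival_quality,
            c.popularity,
            c.temporal_similarity,
            c.uri_similarity,
        )
        return sum(w * f for w, f in zip(weights, features))

    # Highest score first, so the best recommendation leads the list.
    return sorted(present, key=score, reverse=True)


if __name__ == "__main__":
    pool = [
        Candidate("http://example.org/a", 0.9, 0.1, 0.5, 0.5),
        Candidate("http://example.org/b", 0.8, 0.8, 0.8, 0.8),
        Candidate("http://example.org/c", 1.0, 1.0, 1.0, 1.0),  # not archived
    ]
    archived = {"http://example.org/a", "http://example.org/b"}
    for c in rank_candidates(pool, archived):
        print(c.uri)
```

A linear combination is only one possible choice; a learned ranking function could replace `score` without changing the filter step.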
TL;DR: In this paper, the authors examined the vanishing nature of URLs and recovery of vanished URLs through Internet Archive and Google search engine, and found that 66.19 percent of the total vanished URLs were recovered by the Internet Archive while Google managed to recover only 30.70 percent.
Abstract: This article examines the vanishing nature of URLs and the recovery of vanished URLs through the Internet Archive and the Google search engine. For that purpose, the study investigates the URLs cited in the articles of two LIS journals published during 2009-2013. A total of 226 articles published in two open access LIS journals were selected. Of the 5197 citations in the 226 articles, 21.05 percent (1094) were URLs. The study found that 38.12 percent (417 out of 1094) of the URLs were missing and the remaining 61.88 percent were active at the time of the URL check with the W3C link checker. The HTTP 404 error message, “page not found”, was the most frequently encountered message and represented 54.2 percent of all HTTP error messages. The Internet Archive and the Google search engine were used to recover vanished URLs. The Internet Archive recovered 66.19 percent of the vanished URLs, whereas Google recovered only 30.70 percent. Recovery through the Internet Archive and Google increased the active URL rate from 61.88 percent to 87.11 percent and 73.58 percent respectively. The study found that the Internet Archive is a more efficient tool for recovering vanished URLs than the Google search engine.
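The percentages in this abstract can be cross-checked from its own counts. The short sketch below uses only figures stated in the abstract; the function name `active_after_recovery` is an illustrative choice. It shows that 417 missing URLs amount to 38.12 percent when measured against the 1094 cited URLs, and that the post-recovery rates follow from adding the recovered share of missing URLs to the active share.

```python
TOTAL_CITATIONS = 5197
TOTAL_URLS = 1094
MISSING_URLS = 417

# Share of citations that are URLs, and share of URLs that are missing.
url_share_pct = 100 * TOTAL_URLS / TOTAL_CITATIONS
missing_pct = 100 * MISSING_URLS / TOTAL_URLS
active_pct = 100 - missing_pct


def active_after_recovery(active: float, missing: float, recovered: float) -> float:
    """Active rate once `recovered` percent of the missing URLs are restored."""
    return active + missing * recovered / 100


if __name__ == "__main__":
    print(round(url_share_pct, 2))   # share of citations that are URLs
    print(round(missing_pct, 2))     # share of URLs found missing
    print(round(active_pct, 2))      # share of URLs still active
    # Reported rounded values reproduce the reported post-recovery rates.
    print(round(active_after_recovery(61.88, 38.12, 66.19), 2))  # Internet Archive
    print(round(active_after_recovery(61.88, 38.12, 30.70), 2))  # Google
```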