Paper Title

Caching HTTP 404 Responses Eliminates Unnecessary Archival Replay Requests

Authors

Kritika Garg, Himarsha R. Jayanetti, Sawood Alam, Michele C. Weigle, Michael L. Nelson

Abstract

Upon replay, JavaScript on archived web pages can generate recurring HTTP requests that lead to unnecessary traffic to the web archive. In one example, an archived page averaged more than 1000 requests per minute. These requests are not visible to the user, so if a user leaves such an archived page open in a browser tab, they would be unaware that their browser is continuing to generate traffic to the web archive. We found that web pages that require regular updates (e.g., radio playlists, updates for sports scores, image carousels) are more likely to make such recurring requests. If the resources requested by the web page are not archived, some web archives may attempt to patch the archive by requesting the resources from the live web. If the requested resources are unavailable on the live web, the resources cannot be archived, and the responses remain HTTP 404. Some archived pages continue to poll the server as frequently as they did on the live web, while some pages poll the server even more frequently if their requests return HTTP 404 responses, creating a high amount of unnecessary traffic. On a large scale, such web pages are effectively a denial of service attack on the web archive. Significant computational, network and storage resources are required for web archives to archive and then successfully replay pages as they were on the live web, and these resources should not be spent on unnecessary HTTP traffic. Our proposed solution is to optimize archival replay using Cache-Control HTTP response headers. We implemented this approach in a test environment and cached HTTP 404 responses that prevented the browser's requests from reaching the web archive server.
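
The abstract proposes attaching Cache-Control headers to replayed responses so that the browser answers repeated requests for a missing resource from its own cache. The following is a minimal sketch of that idea as a WSGI middleware, not the authors' implementation: the names `with_404_cache_control` and `Cache404Middleware`, the `max-age` directive, and the one-hour default lifetime are all illustrative assumptions, since the abstract does not state which directives or lifetimes were used.

```python
from typing import Callable, List, Tuple

Headers = List[Tuple[str, str]]


def with_404_cache_control(headers: Headers, status: str, max_age: int = 3600) -> Headers:
    """Return a copy of `headers`, adding Cache-Control when the status is 404.

    `max_age` (seconds) is a hypothetical default, not a value from the paper.
    """
    if not status.startswith("404"):
        # Leave every non-404 response untouched.
        return list(headers)
    # Drop any existing Cache-Control header so the response carries only one.
    kept = [(name, value) for name, value in headers if name.lower() != "cache-control"]
    kept.append(("Cache-Control", f"max-age={max_age}"))
    return kept


class Cache404Middleware:
    """WSGI middleware that marks HTTP 404 responses as cacheable by the browser."""

    def __init__(self, app: Callable, max_age: int = 3600) -> None:
        self.app = app
        self.max_age = max_age

    def __call__(self, environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Rewrite the headers before they are passed on to the server.
            new_headers = with_404_cache_control(headers, status, self.max_age)
            return start_response(status, new_headers, exc_info)

        return self.app(environ, patched_start_response)
```

Wrapping a replay application as `Cache404Middleware(replay_app, max_age=3600)`, where `replay_app` stands for any WSGI application, would let a browser reuse a cached 404 for up to an hour instead of asking the archive again each time the page polls. A longer lifetime removes more traffic, but it also delays the point at which a resource that is archived later becomes visible to that browser.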
