Wayback Machine

History and Technical Information
– The Wayback Machine began archiving cached web pages in 1996.
– Founders Brewster Kahle and Bruce Gilliat launched the machine in October 2001.
– The machine addresses the problem of web content vanishing or websites shutting down.
– Users can see archived versions of web pages across time.
– The name ‘Wayback Machine’ is a reference to a fictional time-traveling device.
– The software crawls the web and downloads publicly accessible information and data files.
– Archive-It.org was developed in 2005 to overcome inconsistencies in partially cached websites.
– Crawls are contributed from various sources, including third parties and internal sources.
– Worldwide Web Crawls have been capturing the global web since 2010.
– Documents and resources are stored under time-stamped URLs.
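
The time-stamped URLs mentioned above can be illustrated with a short sketch. It assumes the publicly visible address pattern https://web.archive.org/web/<timestamp>/<original URL>, with the timestamp written as 14 digits (YYYYMMDDhhmmss). The function names are hypothetical, and this is not the Archive's own code.

```python
from datetime import datetime
from typing import Tuple

# Assumed public URL pattern; not taken from the Archive's own software.
WAYBACK_PREFIX = "https://web.archive.org/web/"
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # 14 digits: YYYYMMDDhhmmss


def build_snapshot_url(original_url: str, captured_at: datetime) -> str:
    """Combine a capture time and the original address into one snapshot URL."""
    stamp = captured_at.strftime(TIMESTAMP_FORMAT)
    return f"{WAYBACK_PREFIX}{stamp}/{original_url}"


def parse_snapshot_url(snapshot_url: str) -> Tuple[datetime, str]:
    """Split a snapshot URL back into its capture time and original address."""
    if not snapshot_url.startswith(WAYBACK_PREFIX):
        raise ValueError("not a snapshot URL of the assumed form")
    rest = snapshot_url[len(WAYBACK_PREFIX):]
    # Everything before the first slash is the timestamp; the rest is the page.
    stamp, _, original_url = rest.partition("/")
    return datetime.strptime(stamp, TIMESTAMP_FORMAT), original_url
```

Keeping the capture time inside the address means that the same page can be stored many times, with each copy distinguished only by its timestamp.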

Storage Capacity and Accessibility
– In 2003, the Wayback Machine was growing at a rate of 12 terabytes per month.
– The data is stored on custom-designed PetaBox rack systems.
– As of 2009, the Wayback Machine contained approximately three petabytes of data.
– The storage architecture was migrated to Sun Open Storage in 2009.
– The Wayback Machine’s storage capacity was increased by 700 terabytes in 2011.
– The Wayback Machine allows users to access archived versions of websites.
– A new version of the machine with an updated interface was made available in 2011.
– Features like Changes, Summary, and a graphical site map were added.
– The Beta version of the new Wayback Machine has a more complete index of crawled materials.
– The classic Wayback Machine’s index updates were phased out in 2011.
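
As a rough sketch of how a program, rather than a person using the web interface, might look up an archived version of a page, the example below builds a query and reads a reply. The endpoint (https://archive.org/wayback/available) and the JSON field names are assumptions based on the Archive's public availability lookup and should be checked against its current documentation. The function names and the sample reply are hypothetical and hand-written, not real archive data.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint of the Internet Archive's availability lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url: str, timestamp: str = "") -> str:
    """Build the lookup address for a page, optionally near a given timestamp."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp  # e.g. "20110101" (YYYYMMDD)
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text: str) -> Optional[str]:
    """Return the snapshot URL from a reply, or None if nothing is archived."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Hand-written reply in the assumed response shape (illustrative only).
SAMPLE_RESPONSE = """
{
  "url": "example.com",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "http://web.archive.org/web/20110101000000/http://example.com/",
      "timestamp": "20110101000000"
    }
  }
}
"""
```

The sketch deliberately makes no network request. Fetching the address returned by `availability_query` and passing the body to `closest_snapshot` would be the natural next step.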

Milestones and Updates
– As of 2009, the Wayback Machine had saved over 38.2 billion web pages.
– In January 2013, the Internet Archive announced a milestone of 240 billion URLs.
– The Save a Page feature was introduced in October 2013.
– In December 2014, the Wayback Machine contained 435 billion web pages.
– As of November 2023, the Wayback Machine had archived more than 865 billion web pages.

Legal Issues and Policies
– Wayback Machine respected robots.txt to determine if a website would be crawled or if its archives would be publicly viewable.
– Website owners could opt out of the Wayback Machine through robots.txt.
– Previously archived pages became unavailable if a site blocked the Internet Archive.
– Website owners could directly request the Internet Archive to stop crawling or archiving their site.
– The Internet Archive has said it is not interested in preserving material from people who do not want it in the collection.
– The Wayback Machine’s retroactive exclusion policy is based on recommendations from the University of California, Berkeley.
– The Archive followed this policy to help avoid litigation.
– The policy began to relax in 2017, when the Wayback Machine stopped honoring robots.txt for U.S. government websites.
– Under the relaxed policy, removing a site requires an explicit exclusion request.
– The policy was changed after reports that defunct sites had been inadvertently excluded from the archive.
– In Europe, the Wayback Machine may violate copyright laws, because only the content creator can decide where their content is published or duplicated.
– The content creator can therefore request the removal of their pages from the Archive.
– The exclusion policies for the Wayback Machine can be found in the FAQ section.
– The Archive has faced legal cases specifically related to its archiving efforts.
– The Internet Archive removed various sites critical of Scientology from the Wayback Machine.
– The removal was initially attributed to a request by the site owner.
– It was later clarified that lawyers from the Church of Scientology had demanded the removal.
– In a separate matter, the Wayback Machine was used as evidence to defend a client in a trademark dispute.
– Healthcare Advocates accused the Internet Archive of copyright infringement and of violating the DMCA and the Computer Fraud and Abuse Act.
– The plaintiff claimed that the Archive should have removed previous copies of its website, and that material had remained publicly visible on the Wayback Machine.
– The lawsuit was settled out of court after the Archive fixed the problem.
– Suzanne Shell filed a lawsuit demanding payment for archiving her website.
– The Internet Archive filed a declaratory judgment action seeking a determination of non-violation of copyright.
– Shell’s counterclaims were dismissed except for the one alleging breach of contract.
– The lawsuit was then settled, with the Archive recognizing that Shell held a valid copyright.
– Shell stated that she respected the historical value of the Archive’s goal and had never intended to harm it.
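
The robots.txt opt-out described earlier in this section can be sketched with Python's standard robots.txt parser. The crawler name `ia_archiver` is an assumption about which user agent a site owner would target. The helper name and the two sample files are hypothetical, and the sketch does not reproduce the Archive's actual exclusion logic.

```python
from urllib.robotparser import RobotFileParser

# Assumed crawler name that a site owner would name in robots.txt.
ARCHIVE_AGENT = "ia_archiver"


def is_crawl_allowed(robots_txt: str, page_url: str,
                     agent: str = ARCHIVE_AGENT) -> bool:
    """Report whether the given robots.txt text lets `agent` fetch `page_url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, page_url)


# Hypothetical robots.txt that opts the whole site out for the archive crawler.
OPT_OUT = """\
User-agent: ia_archiver
Disallow: /
"""

# Hypothetical robots.txt that blocks only one directory for every crawler.
OPEN_SITE = """\
User-agent: *
Disallow: /private/
"""
```

With these samples, `is_crawl_allowed(OPT_OUT, "http://example.com/page.html")` is false, while the same call with `OPEN_SITE` is true. Under a retroactive exclusion policy, the first result would also hide copies that had already been archived.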

Uses, Limitations, and Technical Details
– Scholars have written roughly 350 articles on the Wayback Machine, studying both its data storage and the pages in its archive.
– The Wayback Machine saves most hyperlinks, keeping them active.
– Journalists use the Wayback Machine to view dead websites, track changes, and hold politicians accountable.
– The Wayback Machine has been used to expose battlefield lies and verify information.
– It is heavily used by Wikipedia editors for content creation and reference access.
– In 2014, there was a six-month lag between the time a site was crawled and the time it became available for viewing; the lag has since been reduced to 3 to 10 hours.
– The Wayback Machine has limited search facilities, relying on site descriptions rather than page content.
– It cannot completely archive web pages with interactive features like Flash or JavaScript forms.
– YouTube comments are no longer displayed on archived video pages.
– Archived pages may have broken hyperlinks and missing images, and the web crawler cannot archive orphan pages.
– In Netbula LLC v. Chordiant Software Inc., a robots.txt file on Netbula’s website caused the Wayback Machine to remove access to archived pages from that site.
– Netbula objected to disabling the blockage, but the court ordered it to do so temporarily.
– In the Telewizja Polska USA

Sources: https://en.wikipedia.org/wiki/Wayback_Machine