LinuxDevCenter.com


Beyond Browsing the Web

Getting files from the Web

When I want to save the contents of a URL to a file, I often use GNU wget. It keeps the file's original timestamp, it's smaller and faster to use than a browser, and it shows the progress of the download as it runs. (You can get it from the Debian wget package or directly from any GNU archive.)



So if I'm grabbing a webcam image, I'll do something like:

wget http://example.org/cam/cam.jpeg

This saves a copy of the image file as cam.jpeg, with the same modification timestamp as the file on the example.org server.
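To confirm that the timestamp came across, you can print the saved file's modification time. The following is a minimal sketch that assumes GNU date; file_mtime is a hypothetical helper name, not part of wget. The time shown should match the original only if the server reported a modification time for the file.

```shell
#!/bin/sh
# Sketch: print a file's modification time (GNU date assumed).
# file_mtime is a hypothetical helper name, not part of wget.
file_mtime() {
    # With GNU date, -r FILE reports the file's last modification time.
    date -r "$1" '+%Y-%m-%d %H:%M:%S'
}

# Usage, after running the wget command above:
#   file_mtime cam.jpeg
```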

If you interrupt a download before it's finished, use the -c option to resume from the point it left off:

wget -c http://example.org/cam/cam.jpeg

Archiving an entire site

To archive a single Web site, use the -m ("mirror") option, which saves files with the exact timestamp of the originals, if possible, and turns on recursive retrieval to download everything. To set the number of tries wget makes when a retrieval fails, use the -t option with a numeric argument: -t3 is usually good for safely retrieving across the net, while -t0 allows an infinite number of retries, for when your network connection is really bad but you want to archive something regardless of how long it takes. Finally, use the -o option with a filename as an argument to write a progress log to that file; it can be useful to examine in case anything goes wrong. Once the archival process is complete and you've determined that it was successful, you can delete the logfile.

For example, to mirror the Web site at http://www.bloofga.org, allowing up to three tries for each file and writing wget's messages to a logfile called mirror.log, type:

wget -m -t3 -o mirror.log http://www.bloofga.org/
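One way to decide whether the mirror was successful is to look at wget's exit status and scan the logfile. The sketch below wraps the command above in a shell function; mirror_and_check and the WGET variable are hypothetical names introduced for illustration. wget's log format varies between versions, so the grep pattern is only a rough filter, not a complete check.

```shell
#!/bin/sh
# Sketch: mirror a site, then summarize the logfile.
# mirror_and_check and WGET are hypothetical names, not part of wget.
mirror_and_check() {
    url=$1
    log=${2:-mirror.log}

    # Same options as the command above: mirror, three tries, log to a file.
    # Set WGET to substitute another command when trying the function out.
    ${WGET:-wget} -m -t3 -o "$log" "$url"
    status=$?

    # A non-zero exit status means wget reported a problem.
    if [ "$status" -ne 0 ]; then
        echo "wget exited with status $status; see $log" >&2
    fi

    # Count the log lines that mention an error (case-insensitive).
    if [ -f "$log" ]; then
        errors=$(grep -ci 'error' "$log" || true)
        echo "$errors line(s) mentioning an error in $log"
    fi

    return "$status"
}

# Usage:
#   mirror_and_check http://www.bloofga.org/ mirror.log
```

If the count is zero and the exit status is zero, the logfile can probably be deleted; otherwise, read through it before relying on the archive.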

To continue an interrupted archive, use the -nc ("no clobber") option, which skips files that have already been downloaded. For this option to work the way you want it to, be sure to be in the same directory that you were in when you started to archive the site. (Because -m also turns on timestamping, some versions of wget refuse to combine it with -nc; if yours does, use -r -l inf in place of -m.)

For example, to continue an interrupted mirror of the www.bloofga.org site, while making sure that existing files aren't downloaded again and allowing up to three tries for each file, type:

wget -nc -m -t3 http://www.bloofga.org/
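Since a recursive wget run saves files under a directory named after the host by default, a quick check for that directory can catch the mistake of resuming from the wrong place. The following is a rough sketch; resume_mirror and the WGET variable are hypothetical names. It uses -r -l inf rather than -m, on the assumption that your wget may reject -nc together with the timestamping that -m enables.

```shell
#!/bin/sh
# Sketch: resume a mirror only from the directory where it was started.
# resume_mirror and WGET are hypothetical names, not part of wget.
resume_mirror() {
    host=$1

    # A recursive wget run normally creates a directory named after the host.
    if [ ! -d "$host" ]; then
        echo "no $host/ directory here; cd to where the mirror was started" >&2
        return 1
    fi

    # -nc skips files already on disk; -r -l inf recurses without a depth limit.
    ${WGET:-wget} -nc -r -l inf -t3 "http://$host/"
}

# Usage:
#   resume_mirror www.bloofga.org
```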

Next week: Quick tools for command-line image transformations.

Michael Stutz was one of the first reporters to cover Linux and the free software movement in the mainstream press.


Read more Living Linux columns.




©2009, O'Reilly Media, Inc.