wget -r -H -nc -np -nH --cut-dirs=1 -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'
wget -r -H -nc -np -nH --cut-dirs=1 -A .pdf,.epub -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'
wget -r -H -nc -np -nH --cut-dirs=1 -R .tar,.zip -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'
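
Each of these commands reads ./itemlist.txt, a plain text file with one archive.org item identifier per line (not full URLs; see -i and -B below). A minimal example file might look like this (the identifiers here are placeholders; substitute your own):

gov.archives.arc.1155023
commonsense00pain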
-r
recursive download; required in order to move from the item identifier down into its individual files

-H
enable spanning across hosts when doing recursive retrieving (the initial URL for the directory will be on archive.org, and the individual file locations will be on a specific datanode)

-nc
no clobber; if a local copy of a file already exists, don’t download it again (useful if you have to restart the wget at some point, as it avoids re-downloading all the files that were already done during the first pass)

-np
no parent; ensures that the recursion doesn’t climb back up the directory tree to other items (by, for instance, following the “../” link in the directory listing)

-nH
no host directories; when using -r, wget will create a directory tree to stick the local copies in, starting with the hostname ({datanode}.us.archive.org/), unless -nH is provided

--cut-dirs=1
completes what -nH started by skipping the hostname; when saving files on the local disk (from a URL like http://{datanode}.us.archive.org/{drive}/items/{identifier}/{identifier}.pdf), skip the /{drive}/items/ portion of the URL, too, so that all {identifier} directories appear together in the current directory, instead of being buried several levels down in multiple {drive}/items/ directories
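
As a sketch of the resulting layout (the datanode name and drive number are hypothetical; your values will differ):

without -nH and --cut-dirs=1:  ./ia800500.us.archive.org/5/items/{identifier}/{identifier}.pdf
with    -nH and --cut-dirs=1:  ./{identifier}/{identifier}.pdf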

-e robots=off
archive.org datanodes contain robots.txt files telling robotic crawlers not to traverse the directory structure; in order to recurse from the directory to the individual files, we need to tell wget to ignore the robots.txt directive

-i ../itemlist.txt
location of the input file listing all the URLs to use; “../itemlist.txt” means the list of items appears one level up in the directory structure, in a file called “itemlist.txt” (the path is relative, so “./itemlist.txt”, as in the commands above, reads the file from the current directory instead; you can call the file anything you want, so long as you specify its actual name after -i)

-B 'http://archive.org/download/'
base URL; gets prepended to the text read from the -i file (this is what allows us to have just the identifiers in the itemlist file, rather than the full URL on each line)
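
For example, if a line of the itemlist file reads gov.archives.arc.1155023 (a placeholder identifier), wget will request:

http://archive.org/download/gov.archives.arc.1155023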

-l depth, --level=depth
Specify the maximum recursion depth. The default maximum depth is 5. This option is helpful when you are downloading items that contain external links or URLs in either the item’s metadata or other text files within the item. Here’s an example command that avoids downloading external links contained in an item’s metadata:
wget -r -H -nc -np -nH --cut-dirs=1 -l 1 -e robots=off -i ../itemlist.txt -B 'http://archive.org/download/'

-A, -R
accept-list and reject-list, either limiting the download to certain kinds of files or excluding certain kinds of files; for instance, adding the following options to your wget command would download all files except those whose names end with _orig_jp2.tar or _jpg.pdf:
wget -r -H -nc -np -nH --cut-dirs=1 -R _orig_jp2.tar,_jpg.pdf -e robots=off -i ../itemlist.txt -B 'http://archive.org/download/'
Conversely, the following command would download only files whose names contain “zelazny”, while still rejecting PostScript (.ps) files (note the quotes around the pattern, which keep the shell from expanding the * wildcards):

wget -r -H -nc -np -nH --cut-dirs=1 -A '*zelazny*' -R .ps -e robots=off -i ../itemlist.txt -B 'http://archive.org/download/'