Today our link checking script reported 225 broken links. Most of these were due to Adobe changing the location of their Acrobat Reader download page. Usually when this happens I’m too lazy to figure out how to script the update. But this number of links finally tipped the scales in favor of my being too lazy to update them by hand. It turned out that most of them were produced by one or two dynamic pages, but at least I learned something.
First I used grep to store the list of files containing the broken link in a text file:
steve@oracledev:~/perforce/depot/mainline/weblive/wwwroot$ grep -rl http://www.adobe.com/products/acrobat/readstep2.html . > /tmp/acrobat_link_files.txt
Then I marked those files for edit in Perforce:
steve@oracledev:~/perforce/depot/mainline/weblive/wwwroot$ cat /tmp/acrobat_link_files.txt | p4 -x - edit
Then I used sed to update the link in those files. If you’ve used Perl-style substitutions, the s|old|new|g syntax will look familiar:
steve@oracledev:~/perforce/depot/mainline/weblive/wwwroot$ cat /tmp/acrobat_link_files.txt | xargs sed -i 's|http://www.adobe.com/products/acrobat/readstep2.html|http://get.adobe.com/reader/|g'
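The xargs command reads the file names from acrobat_link_files.txt and passes them to sed as arguments, packing many names into each sed invocation rather than running sed once per line. By default xargs splits its input on whitespace, so this assumes none of the paths contain spaces. The -i switch to sed tells it to update each given file in place.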
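Here is a minimal sketch of the same grep, xargs and sed -i steps run against a throwaway directory, in case you want to try them without touching real files. It assumes GNU grep, sed and xargs; the scratch directory and the a.html, b.html and c.html names are made up for illustration, and the Perforce step is left out.

```shell
# Scratch directory with made-up files, so nothing real is touched.
demo_dir=$(mktemp -d)
printf '<a href="http://www.adobe.com/products/acrobat/readstep2.html">Reader</a>\n' > "$demo_dir/a.html"
printf '<a href="http://www.adobe.com/products/acrobat/readstep2.html">Reader</a>\n' > "$demo_dir/b.html"
printf 'no link here\n' > "$demo_dir/c.html"

# xargs packs its input into as few command invocations as it can:
printf 'a\nb\nc\n' | xargs echo    # prints "a b c" (one echo, three arguments)

# -F searches for the URL as a fixed string; -Z ends each file name with a NUL byte.
# xargs -0 reads the NUL-separated names, so paths containing spaces survive.
grep -rlZF 'http://www.adobe.com/products/acrobat/readstep2.html' "$demo_dir" \
  | xargs -0 sed -i 's|http://www\.adobe\.com/products/acrobat/readstep2\.html|http://get.adobe.com/reader/|g'

# Show one of the rewritten files.
cat "$demo_dir/a.html"
```

The NUL-separated variant is optional. The plain version works fine as long as no path contains whitespace.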
Perhaps next time I’ll really get my Unix geek on and figure out how to do it in 1 line instead of 3.
Update: I’ve got it down to 1 command! The tee command copies its stdin to multiple outputs. Here it sends the file list to the p4 edit command, through bash’s >(...) process substitution, and also to stdout, which feeds xargs. We need to redirect p4 edit’s output to /dev/null, or else that will also get sent down the pipe and xargs will hand it to sed as if it were a list of file names.
steve@oracledev:~/perforce/depot/mainline/weblive/wwwroot/students/ugrad$ grep -rl http://certification.cornell.edu . | tee >(p4 -x - edit 1>/dev/null) | xargs sed -i 's|http://certification.cornell.edu/\?|https://certification.cornell.edu/|g'
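To see the tee trick on its own, here is a small sketch that runs without a Perforce server. The fake_p4_edit function is a hypothetical stand-in for p4 -x - edit that just logs the names it receives, and the scratch files are invented for the demo. It needs bash (for the process substitution) and GNU sed.

```shell
#!/usr/bin/env bash
# Process substitution, >(...), is a bash feature; plain sh will not parse it.
work_dir=$(mktemp -d)
log_file=$(mktemp)
printf 'see http://certification.cornell.edu\n' > "$work_dir/one.html"
printf 'nothing to change\n' > "$work_dir/two.html"

# Hypothetical stand-in for `p4 -x - edit`: logs each file name read from stdin.
# It writes only to the log, so nothing leaks into the pipe that feeds xargs.
fake_p4_edit() {
  while IFS= read -r file; do
    echo "opened for edit: $file"
  done >> "$log_file"
}

# tee copies grep's output to the stand-in and to the pipe read by xargs.
grep -rlF 'http://certification.cornell.edu' "$work_dir" \
  | tee >(fake_p4_edit) \
  | xargs sed -i 's|http://certification\.cornell\.edu/\?|https://certification.cornell.edu/|g'

# Show the rewritten file.
cat "$work_dir/one.html"
```

One thing to keep in mind: bash does not wait for the command inside >(...) to finish, so the log (or the real p4 edit) may still be running when sed starts. If the files have to be opened before they are modified, running the two steps one after the other is the safer choice.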
Update: I’ve created a shell script to make this easier:
steve@oracledev:~/depot/mainline/common/scripts/bin$ ./bulk_update_urls.sh
Usage: ./bulk_update_urls.sh http://original.url.net/ http://new.url.net/ /path/to/target/dir
steve@oracledev:~/depot/mainline/common/scripts/bin$ ./bulk_update_urls.sh http://www.payments.cornell.edu/Travel_Forms.cfm http://www.dfa.cornell.edu/dfa/payments/essentials/advances/index.cfm ~/depot/mainline/websw/intraroot/
//depot/mainline/websw/intraroot/howdoi/mngcourse/host.html#3 - opened for edit
//depot/mainline/websw/intraroot/howdoi/travel/cashProcedures.html#1 - opened for edit
//depot/mainline/websw/intraroot/howdoi/travel/tranform.html#3 - opened for edit
//depot/mainline/websw/intraroot/howdoi/travel.html#14 - opened for edit
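The script itself isn't shown here, so what follows is only a rough sketch of how something like it could be put together, not the actual bulk_update_urls.sh. The function body, the variable names and the decision to skip the Perforce step when p4 isn't installed are all my assumptions. It also assumes bash and GNU grep, sed and xargs.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the real bulk_update_urls.sh is not shown in this post.
bulk_update_urls() {
  if [ "$#" -ne 3 ]; then
    echo "Usage: bulk_update_urls http://original.url.net/ http://new.url.net/ /path/to/target/dir" >&2
    return 1
  fi
  local old_url=$1 new_url=$2 target_dir=$3
  local file_list old_pattern new_text
  file_list=$(mktemp)

  # Fixed-string search, so dots in the URL are not treated as wildcards.
  grep -rlF -- "$old_url" "$target_dir" > "$file_list"

  if [ -s "$file_list" ]; then
    # Open the files for edit first. Skipped when p4 is not installed
    # (an assumption, so the sketch can be tried outside a Perforce workspace).
    if command -v p4 >/dev/null 2>&1; then
      p4 -x - edit < "$file_list"
    fi

    # Escape characters that are special to sed in the pattern and the replacement.
    old_pattern=$(printf '%s\n' "$old_url" | sed 's/[][\.*^$|]/\\&/g')
    new_text=$(printf '%s\n' "$new_url" | sed 's/[\&|]/\\&/g')

    # One file name per line; -d '\n' keeps names with spaces intact.
    xargs -d '\n' sed -i "s|$old_pattern|$new_text|g" < "$file_list"
  fi
  rm -f "$file_list"
}

# Example run against a scratch directory with a made-up file.
scratch=$(mktemp -d)
printf '<a href="http://old.example.net/page.html">x</a>\n' > "$scratch/with space.html"
bulk_update_urls 'http://old.example.net/page.html' 'http://new.example.net/?a=1&b=2' "$scratch"
cat "$scratch/with space.html"
```

Doing the p4 edit before the sed, rather than in parallel through tee, means the files are opened before anything is rewritten.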