Finals are finally here, and I need the slides from all my courses to review. Downloading them manually is a heavy load, both physically and mentally. I used to crawl files with cURL, but crawling only a certain kind of file (PDFs, in this case) is not straightforward with it. It turns out wget handles this kind of type-filtered crawling perfectly.
wget -r -A.pdf http://my.course.domain/~category/
This recursively crawls every PDF under the given root (the wildcard form -A '*.pdf' works too).
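As an aside, -A also accepts a comma-separated list of suffixes, which can be handy when a course mixes formats. A minimal sketch (the URL is a placeholder, as above):

```shell
# Accept PDF and PowerPoint files in a single recursive pass;
# -A matches candidate files against these suffixes.
wget -r -A pdf,ppt,pptx http://my.course.domain/~category/
```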
After a few seconds, every PDF had been crawled except the slides, which were exactly what I wanted. Scrolling up, I saw:
HTTP request sent, awaiting response... 401 Authorization Required Username/Password Authentication Failed.
Well, the slides were protected by credentials that only students know. All right then, a minimal change should solve this (--auth-no-challenge makes wget send the Basic auth header up front instead of waiting for a 401 challenge):
wget -r --auth-no-challenge --user='secret' --password='secret_pwd' http://my.course.domain/~category/
But the same failure message kept popping up.
After some searching, it turned out the site was blocking requests from crawler user agents. Spoofing a browser user agent got me through.
wget -r -A.pdf --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0" --user='secret' --password='secret_pwd' http://my.course.domain/~category/
It worked like a charm. All 200 MB of slides were down in a few seconds.
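For a tidier crawl, a few extra wget flags may help. A sketch assuming the same placeholder URL and credentials as above:

```shell
# -np: never ascend to the parent directory, so the crawl stays under ~category/
# -nd: save all files into the current directory instead of mirroring the site tree
# -w 1: wait one second between requests, to be gentle on the course server
wget -r -np -nd -w 1 -A.pdf \
     --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0" \
     --user='secret' --password='secret_pwd' \
     http://my.course.domain/~category/
```

Without -np, a stray link back to the parent directory can send wget crawling far beyond the course folder you actually want.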