Background

Finals week finally arrives. I need all the course slides for review, but downloading them manually is a heavy load, both physically and mentally. I used to crawl files with cURL, but crawling only a certain kind of file (PDFs, in this case) is not straightforward that way. I found that wget supports this kind of type-filtered crawling perfectly.

Basics

wget -r -A.pdf http://my.course.domain/~category/

This recursively crawls all PDFs under the given root (wildcards such as *.pdf work too).
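
If the course pages also link upward or to other file types, two more flags are handy: -np (--no-parent) keeps wget from climbing above the starting directory, and -A takes a comma-separated accept list. A variant I'd expect to work (same placeholder URL as above, untested):

wget -r -np -A '*.pdf,*.ppt' http://my.course.domain/~category/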

Authentication

After a few seconds, all the PDFs had been crawled except for the slides (which were exactly what I wanted). When I scrolled up, I saw

HTTP request sent, awaiting response... 401 Authorization Required

Username/Password Authentication Failed.

Well, the slides were protected by credentials that only students know. All right then, a minimal change should solve this:

wget -r -A.pdf --auth-no-challenge --user='secret' --password='secret_pwd' http://my.course.domain/~category/

But the same failure message kept popping up.
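
A side note on those flags: passing --password on the command line leaves the secret in your shell history. wget can prompt for it interactively instead via --ask-password, a standard wget option, though I didn't end up needing it here:

wget -r -A.pdf --auth-no-challenge --user='secret' --ask-password http://my.course.domain/~category/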

Solution

After some searching, I found that the website was blocking requests carrying a crawler's default User-Agent string. I tried spoofing a browser User-Agent, and it went through.

wget -r -A.pdf --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0" --user='secret' --password='secret_pwd' http://my.course.domain/~category/

It worked like a charm. All 200 MB of slides were there after a few seconds.
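
If I ever rerun this, I'd fold everything into one command and add a small delay between requests to be polite to the server. --wait (seconds between retrievals) and -P (output directory) are standard wget options, though I haven't tested this exact combination against the course server:

wget -r -A.pdf --wait=1 -P ./slides --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0" --auth-no-challenge --user='secret' --password='secret_pwd' http://my.course.domain/~category/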