Techie Frequently Asked Questions

Question: How did you download an entire website?
Answer: First, I didn't download all of it. I downloaded what I wanted of it (which was a lot of it).

There are actually several 'website download' programs out there. HTTrack is the best I have seen.

However, for this, I did something else.

Question: Okay, so what did you specifically do??
Answer:

First, I set the WOTC Forums to show me all threads, back to the beginning.

Next, I copied the link for the first, second, and LAST page of each discussion group (e.g. Magic and Spells).

I then used those to create the links to the other pages in that discussion group.

e.g.
Example Group has 60 pages' worth of thread topics. I have Page 1, Page 2, and Page 60.

the format is

http://www.wizards.com/forums/F=1 for page 1

http://www.wizards.com/forums/F=1&Page=1 for page 2

http://www.wizards.com/forums/F=1&Page=59 for page 60

and we can see the common set up/format.

So I dropped that into Excel, without the page #.

I then copied it under itself a bunch of times.

Then, in the cell beside the first link, I put a '3'.

Under that, I put a formula that increases as it goes down the cells (e.g. 4, 5).

I then copied those columns to Notepad, and removed the tabs.

I now have the links to every page in that discussion group.

Repeat for all discussion groups.
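The spreadsheet steps above could also be done in a few lines of code. This is only a sketch in Python, not what I actually used; the function name group_page_urls is made up for illustration, and it assumes the numbering shown in the example links (page 1 is the bare link, and page N is the bare link plus &Page=N-1).

```python
def group_page_urls(base_url: str, total_pages: int) -> list[str]:
    """Return one URL per listing page of a discussion group.

    Assumes the pattern from the example links: page 1 is the bare
    URL, and page N (N >= 2) is the bare URL plus "&Page=<N-1>".
    """
    urls = [base_url]
    for page in range(2, total_pages + 1):
        urls.append(f"{base_url}&Page={page - 1}")
    return urls
```

Calling group_page_urls("http://www.wizards.com/forums/F=1", 60) gives the 60 links for Example Group, ready to paste into a download manager.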

I dropped those into a program called Flashget, and set it up so each Discussion Group's pages go into their own directory on my computer

e.g.
C:\Downloads\WOTC\F1_MagicandSpells

I then tell Flashget to download them all.

Once that was done, I opened each downloaded file with Word, and told it to replace all instances of "<" with "^p<", and all instances of "><" with ">^p<"

That separates all HTML tags, including links, onto separate lines.

I did this via a big macro

(i.e. if this is the loop #, open file #, regardless of file, run these replaces, save, then increase the loop # by 1).
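For anyone without Word, the same tag-splitting idea can be sketched in Python. This is not the macro I used, just a rough equivalent; split_tags is a made-up name, and it simply puts a line break in front of every "<" that is not already at the start of a line.

```python
import re


def split_tags(html: str) -> str:
    """Put every HTML tag at the start of its own line."""
    # Insert a newline before each "<" unless one is already there.
    text = re.sub(r"(?<!\n)<", "\n<", html)
    # The very first tag does not need a leading blank line.
    return text.lstrip("\n")
```

Text that follows a tag stays on that tag's line, which is the same result the Word replace gives. Looping it over every file in a directory does what the macro loop did.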

I then merged all the downloaded files in a given directory via the old DOS type command with >> (specifically, type *.* >> BIGFILE.txt).
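If the DOS command is not available, the merge can be sketched in Python. The function name merge_files and the choice to add a line break after each file are my own assumptions for this example; the type command does not add that line break.

```python
from pathlib import Path


def merge_files(directory: str, output_name: str = "BIGFILE.txt") -> Path:
    """Append every file in a directory to one big file, in name order."""
    folder = Path(directory)
    target = folder / output_name
    # List the sources first so the output file is never merged into itself.
    sources = sorted(p for p in folder.iterdir() if p.is_file() and p != target)
    with target.open("ab") as out:  # "ab" appends, like >> does
        for source in sources:
            out.write(source.read_bytes())
            out.write(b"\n")  # assumption: keep files from running together
    return target
```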

Next, I imported each of the merged files into a Microsoft Access database (into a longtext/memo field), with an extra field for the name of the original discussion topic

Then, it was a simple matter to find how threads appeared in that big table, and delete anything that was not a thread.

The end result was a list of all thread #s, with first, second, and last page, and thread topic.

I copied all the first pages + thread topics to their own list

I copied the thread pages to another list, and separated out the page # (e.g. last page is page 30). Any without page #s got a nice 1 (or 0, I don't remember how the WOTC forums were set up in that regard).
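I did this part in Access, but the same filtering can be sketched in Python. Everything here is an assumption for illustration: the name thread_page_counts is made up, and the pattern assumes thread links look like Thread=1000 or Thread=1000&Page=30 (the real forum HTML may have differed). It keeps only lines that mention a thread and remembers the highest page # seen for each one, defaulting to 1.

```python
import re

# Assumed link shape: "Thread=<number>" with an optional "&Page=<number>".
# "&amp;" is allowed too, since HTML often escapes the ampersand.
THREAD_LINK = re.compile(r"Thread=(\d+)(?:&(?:amp;)?Page=(\d+))?")


def thread_page_counts(lines):
    """Map each thread # to the highest page # found for it (at least 1)."""
    counts = {}
    for line in lines:
        for match in THREAD_LINK.finditer(line):
            thread_id = int(match.group(1))
            page = int(match.group(2)) if match.group(2) else 1
            counts[thread_id] = max(counts.get(thread_id, 1), page)
    return counts
```

Lines with no thread link are simply ignored, which takes the place of deleting everything that was not a thread.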

So I had something like http://www.wizards.com/forums/Thread=1000 Pages: 30

and I knew from looking at the HTML that the second page was http://www.wizards.com/forums/Thread=1000&Page=2, etc.

So I made a little program that took that list and, for each thread, generated page 1 to whatever # of pages it had.

The end result:
http://www.wizards.com/forums/Thread=1000 Pages: 30 became
http://www.wizards.com/forums/Thread=1000&Page=1
http://www.wizards.com/forums/Thread=1000&Page=2
http://www.wizards.com/forums/Thread=1000&Page=3
http://www.wizards.com/forums/Thread=1000&Page=4
...
http://www.wizards.com/forums/Thread=1000&Page=29
http://www.wizards.com/forums/Thread=1000&Page=30
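The little program is long gone, but the idea can be sketched in Python like this. The name expand_line is made up, and it assumes each input line looks like the example above (the link, then " Pages: " and a number); a line with no page count is treated as a single page.

```python
def expand_line(line: str) -> list[str]:
    """Turn "<thread url> Pages: N" into the N page links for that thread."""
    url, _, count = line.partition(" Pages:")
    # Assumption: a thread listed without a page count has just one page.
    pages = int(count) if count.strip() else 1
    return [f"{url.strip()}&Page={n}" for n in range(1, pages + 1)]
```

Running it over every line of the thread list and joining the results gives the full download list.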

Drop all that into Flashget again, and tell it to download them all.

I then used Flashget's download log to rename the downloaded files (Flashget will rename files with characters that are not in the alphabet) to something closer to what they were on the WOTC Forums.
e.g.
http://www.wizards.com/forums/Thread=1000&Page=1 was saved as Thread(somenumber), and was then renamed to Thread_1000_P_01
http://www.wizards.com/forums/Thread=1000&Page=2 was saved as Thread(somenumber), and was then renamed to Thread_1000_P_02
http://www.wizards.com/forums/Thread=1000&Page=29 was saved as Thread(somenumber), and was then renamed to Thread_1000_P_29
http://www.wizards.com/forums/Thread=1000&Page=30 was saved as Thread(somenumber), and was then renamed to Thread_1000_P_30
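The naming rule itself is easy to sketch in Python. This only shows how the new name comes from the link; it does not read Flashget's log (I won't guess at that format here), and archive_name is a made-up name. It assumes two-digit page numbers, as in the examples above.

```python
import re


def archive_name(url: str) -> str:
    """Build a name like Thread_1000_P_01 from a thread page link."""
    match = re.search(r"Thread=(\d+)(?:&Page=(\d+))?", url)
    if match is None:
        raise ValueError(f"not a thread link: {url}")
    thread_id = match.group(1)
    # Assumption: a link with no Page value is page 1.
    page = int(match.group(2) or 1)
    return f"Thread_{thread_id}_P_{page:02d}"
```

Padding the page # with a leading zero keeps the files in the right order when they are sorted by name, which matters for the merge step later.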

I then merged all those files, one thread at a time, then pulled them into another program that grabbed the author/time of each post, and then the post itself, and put it into the simpler format that's on the website.

That removed all the formatting, embedded images, links to the WOTC website, and other things that would no longer work. That also made it easier to view (and print, if anyone wants to do that).

Since then, however, I've written my own database program in Microsoft Access 2010 using Visual Basic for Applications, instead of using Flashget and Excel.