Thread: Scraping a site from archive.org

  1. #1
    Established Member
    Join Date
    Apr 2012
    Posts
    294
    Thanks
    99
    Thanked 217 Times in 110 Posts
    Rep Power
    9

    Scraping a site from archive.org

    Just bought another couple of domains.

    Am still looking for a tool that can scrape archive.org (or Google cache) to get embro content.

    Does anyone know of such a tool?

    By the way, for those who don't know it, WinHTTrack is really useful for scraping existing (non-WordPress) sites, and there are plugins that can import the HTML.

    The problem is scraping from cache or archive. Anyone know of such a thing?

  2. The Following User Says Thank You to mikeb For This Useful Post:

    Chabrenas (April 11th, 2012)

  3. #2
    Established Member
    Join Date
    Apr 2012
    Posts
    118
    Thanks
    8
    Thanked 81 Times in 47 Posts
    Rep Power
    3
    I had a similar situation with about 50 articles. I just set up my pages/posts in WP first and then copied and pasted the articles from the Wayback Machine into WP. It didn't take long.

    If it's "thousands of pages" my answer would be different. You would probably need a custom script.
    Last edited by Clinton; April 11th, 2012 at 11:10 AM. Reason: forum rules

  4. The Following User Says Thank You to Matteo For This Useful Post:

    mikeb (April 11th, 2012)

  5. #3
    Top Contributor
    Join Date
    Oct 2010
    Location
    Cotswolds
    Posts
    787
    Thanks
    175
    Thanked 739 Times in 373 Posts
    Rep Power
    23
    Have you tried HTTrack on archive.org?
    Their URL structure seems to be simply http://web.archive.org/web/xxxxxxxxxxxxx/ followed by the website's normal domain, structure and URLs.
    All they do is rewrite every link in the code with their prefix in front, so it should work... arguably.
    You could then just run a text editor search and replace over it all to remove their prefix.
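
The search and replace step could also be scripted. Below is a minimal Python sketch, not a tested tool. The regular expression is an assumption about what the archive prefix looks like (a numeric timestamp, sometimes followed by a two-letter modifier such as "im_"), and the name strip_wayback_prefix and the example.com sample are made up for illustration.

```python
import re

# Assumed shape of the archive prefix: optional scheme and host (rewritten
# links are sometimes root-relative), "/web/", a numeric timestamp, an
# optional two-letter modifier such as "im_", then a slash. The lookahead
# makes sure an original http(s) URL follows, so ordinary "/web/..." paths
# on the site itself are left alone.
WAYBACK_PREFIX = re.compile(
    r"(?:https?://web\.archive\.org)?/web/\d{4,14}(?:[a-z]{2}_)?/(?=https?://)"
)


def strip_wayback_prefix(html: str) -> str:
    """Remove the archive prefix so links point at the original URLs again."""
    return WAYBACK_PREFIX.sub("", html)


sample = (
    '<a href="http://web.archive.org/web/20110101000000/'
    'http://example.com/about.html">About</a>'
)
print(strip_wayback_prefix(sample))
# prints: <a href="http://example.com/about.html">About</a>
```

Running it over every saved file does the same job as the text editor approach. Check a few pages by hand first, because the archive may inject other markup that this pattern does not touch.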

    If not, then using cURL you can grab a page into a local variable, read the URLs in it, and go off and read any linked pages in the same way. It's a fairly simple scraper to write.
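
A rough sketch of that kind of scraper follows. It uses Python's standard library in place of cURL, which is a swap made for this example and not something from the thread. The names crawl, fetch, LinkCollector, the site filter and the max_pages limit are all hypothetical. The download function is passed in as a parameter so the link-following logic can be tried without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download one page and return its text (needs network access)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, site, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl from start_url.

    Only links whose URL contains `site` are followed, so the crawl stays
    on the archived copy of one domain. Returns a dict of URL -> HTML.
    """
    pages = {}
    queue = deque([start_url])
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in pages:
            continue
        html = fetch_page(url)  # the "grab a page into a local variable" step
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve the link against the current page and drop "#fragment".
            # Assumption: archived pages carry absolute or root-relative
            # links; urljoin may collapse the "//" inside the archive path
            # when resolving plain relative links.
            absolute = urldefrag(urljoin(url, href)).url
            if site in absolute and absolute not in pages:
                queue.append(absolute)
    return pages
```

To use it, call crawl with an archive URL for the site's front page and the bare domain as the site argument, then write each entry of the returned dict to disk. Keep max_pages small and add a pause between requests before pointing it at a real archive.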

    Alasdair

  6. The Following 2 Users Say Thank You to akirk For This Useful Post:

    Chabrenas (April 11th, 2012), mikeb (April 11th, 2012)

  7. #4
    Established Member
    Join Date
    Apr 2012
    Posts
    294
    Thanks
    99
    Thanked 217 Times in 110 Posts
    Rep Power
    9
    Quote Originally Posted by akirk View Post
    have you tried httrack on archive.org?
    it seems that their structure for URLs is simply: http://web.archive.org/web/xxxxxxxxxxxxx/the website's normal domain / structure / urls
    and all they do is replace all links in the code with their bit in front - so should work...arguably
    you could then just run a text editor search and replace over it all to remove their bit...

    if not - then using CURL you can grab a page into a local variable - read the URLs and go off and read any linking pages in the same way - fairly simple scraper to write...

    Alasdair
    Tried it, and it fails. I don't know enough about HTTrack to take it further.

    Are you volunteering? If you can and want to write that script, drop me a quote by PM.

  8. #5
    Top Contributor
    Join Date
    Oct 2010
    Location
    Cotswolds
    Posts
    787
    Thanks
    175
    Thanked 739 Times in 373 Posts
    Rep Power
    23
    Mike,

    Just tried it and it is possible with HTTrack, at least on the simple website I tried.
    I think it is to do with the settings: I had it going up and down, and staying on the same top-level domain, I think.
    It may be worth playing with it a bit more.

    Alasdair

  9. The Following User Says Thank You to akirk For This Useful Post:

    mikeb (April 12th, 2012)

  10. #6
    Dormant Account
    Join Date
    Jan 2011
    Location
    USA
    Posts
    97
    Thanks
    49
    Thanked 64 Times in 39 Posts
    Rep Power
    2
    Quote Originally Posted by mikeb View Post
    Just bought another couple of domains.

    Am still looking for a tool that can scrape archive.org (or google cache) to get embro content.

    Does anyone know of such a tool?
    Isn't Warrick the standard tool for this job?

  11. The Following 3 Users Say Thank You to sitemaster For This Useful Post:

    Chabrenas (April 11th, 2012), Clinton (April 11th, 2012), mikeb (April 11th, 2012)

  12. #7
    Premium Member
    Join Date
    Oct 2010
    Location
    East Yorkshire
    Posts
    1,674
    Blog Entries
    6
    Thanks
    284
    Thanked 1,466 Times in 756 Posts
    Rep Power
    46
    Warrick didn't work the last time I tried, but grynge came up with some fixes a while back that made it work. The details can be found if you look around - READ THE SITE, we've got it somewhere.

  13. #8
    Established Member
    Join Date
    Apr 2012
    Posts
    294
    Thanks
    99
    Thanked 217 Times in 110 Posts
    Rep Power
    9
    Warrick seems to have closed its doors.

  14. #9
    Premium Member
    Join Date
    Aug 2010
    Location
    Adelaide
    Posts
    2,553
    Blog Entries
    6
    Thanks
    1,344
    Thanked 1,570 Times in 840 Posts
    Rep Power
    52
    Warrick is back up for business, but there is a massive wait unless you can get the Perl script running. Normally I could do a backup for you, but I am away on holidays. If it can wait another couple of weeks, I won't have a problem doing the backup then.

  15. The Following User Says Thank You to grynge For This Useful Post:

    mikeb (April 12th, 2012)

  16. #10
    Established Member
    Join Date
    Apr 2012
    Posts
    294
    Thanks
    99
    Thanked 217 Times in 110 Posts
    Rep Power
    9
    Many thanks. Where can I find a copy of the Perl script?
