having fun with code

Crawling sitemaps with Python

This is a basic script I created to crawl an XML sitemap file (nested sitemaps are not supported). For each URL in the sitemap, it reports whether the server processed the request successfully or returned some kind of error.

    #!/usr/bin/env python3
    from sys import argv
    from re import findall
    from urllib.request import Request, urlopen
    from datetime import datetime

    # Initialization: the sitemap URL comes first, then the log identifier
    sitemapUrl = argv[1]
    procId = argv[2]
    print('[%s]' % procId, 'Crawling sitemap:', sitemapUrl)

    # Request a URL and log its status code, response time and any error
    def testURL(url):
        start = datetime.now()
        msg = ''
        code = -1
        req = Request(url)

        try:
            response = urlopen(req)
            code = response.status
        except IOError as e:
            # HTTP errors carry a status code; connection errors only a reason
            if hasattr(e, 'code'):
                code = e.code
                msg = '[Error: %s]' % e.code
            elif hasattr(e, 'reason'):
                msg = '[Error: %s]' % e.reason
            else:
                msg = '[Error: %s]' % e

        # total_seconds() keeps the whole duration, not just the sub-second part
        delta = datetime.now() - start
        elapsedMs = delta.total_seconds() * 1000
        print('[%2s]' % procId, '[%d]' % code, '[%03dms]' % elapsedMs, msg, '>>', url)

    # Load sitemap and process
    req = Request(sitemapUrl)
    sitemapSource = urlopen(req).read().decode('utf-8')
    linksList = findall('<loc>(.*?)</loc>', sitemapSource)
    print(len(linksList), 'links found.')

    for link in linksList:
        testURL(link)
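
To make the extraction step concrete, here is a minimal sketch that runs the same `<loc>` pattern against a small inline sitemap, so it needs no network access. The sample XML and the `extract_links` and `is_sitemap_index` helpers are made up for illustration and are not part of the script above; the sketch assumes Python 3.

```python
from re import findall

# Hypothetical sample; a real sitemap would be downloaded first.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>"""


def extract_links(sitemap_source):
    """Return the text of every <loc> element, in document order."""
    return findall(r'<loc>(.*?)</loc>', sitemap_source)


def is_sitemap_index(sitemap_source):
    """Rough check for a nested sitemap (a sitemap index file)."""
    return '<sitemapindex' in sitemap_source


links = extract_links(SAMPLE_SITEMAP)
print(len(links), 'links found.')
print('Sitemap index:', is_sitemap_index(SAMPLE_SITEMAP))
```

In a sitemap index the `<loc>` values point to other sitemaps rather than to pages, which is why nested sitemaps would need a second pass that the script does not make.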

The script expects two parameters: the URL of the XML sitemap and an identifier that is printed with each log line.
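
If you want a usage message when a parameter is missing, the two positional parameters could be declared with `argparse`. This is only a sketch, assuming Python 3; the `parse_arguments` function and the `sitemap_url` and `proc_id` names are my own choices for illustration, and the script itself reads `argv` directly.

```python
import argparse


def parse_arguments(args):
    """Parse the two positional parameters: sitemap URL, then identifier."""
    parser = argparse.ArgumentParser(description='Crawl an XML sitemap.')
    parser.add_argument('sitemap_url', help='URL of the XML sitemap')
    parser.add_argument('proc_id', help='identifier printed with each log line')
    return parser.parse_args(args)


# Parse an explicit list here; a real script would pass sys.argv[1:].
options = parse_arguments(['http://example.com/sitemap.xml', '1'])
print('[%s]' % options.proc_id, 'Crawling sitemap:', options.sitemap_url)
```

Calling it with fewer than two values makes `argparse` print a usage message and exit instead of raising an `IndexError`.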

It is not very fast, because it requests one URL at a time, but you can easily run multiple instances from the command line:

    ./sitemap_crawler.py http://example.com/sitemap.xml 1 &
    ./sitemap_crawler.py http://example.com/sitemap.xml 2 &
    ./sitemap_crawler.py http://example.com/sitemap.xml 3 &
    ./sitemap_crawler.py http://example.com/sitemap.xml 4 &
    ./sitemap_crawler.py http://example.com/sitemap.xml 5 &
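
A different way to get concurrency, not the one used above, is a thread pool inside a single process. Note the difference: the commands above start five crawlers that each request every URL, while the sketch below splits one list of URLs among the workers. It assumes Python 3; `fetch_status` and `check_all`, the 10-second timeout and the default of 5 workers are all illustrative choices of mine. The demo at the end uses a stand-in fetcher so it runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_status(url):
    """Return the HTTP status code for url, or -1 if no response arrived."""
    try:
        with urlopen(url, timeout=10) as response:  # timeout is an assumption
            return response.status
    except HTTPError as e:
        return e.code  # the server answered, but with an error status
    except OSError:
        return -1      # DNS failure, refused connection, timeout, ...


def check_all(urls, fetch=fetch_status, workers=5):
    """Check urls concurrently; return (url, status) pairs in input order."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(zip(urls, pool.map(fetch, urls)))


# Stand-in fetcher (hypothetical statuses) instead of real requests.
fake_statuses = {'http://example.com/': 200, 'http://example.com/missing': 404}
for url, status in check_all(fake_statuses, fetch=fake_statuses.get):
    print('[%d]' % status, '>>', url)
```

To crawl for real, call `check_all(links)` with the list of links extracted from the sitemap and leave `fetch` at its default.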

Enjoy!


About the blog

This is a blog about development, focused mainly on Javascript but also other languages like python, shell scripts and more.

About the author

Eneko Alonso is a software engineer and UI developer with more than eight years of experience in software and web development. He lives in San Luis Obispo, California and works at LEVEL Studios.
