This is a basic script I created to crawl an XML sitemap file (nested sitemaps are not supported). For each URL in the sitemap, it reports whether the server processed the request successfully or returned some kind of error.

#!/usr/bin/env python
from sys import argv
from re import findall
from socket import setdefaulttimeout
from urllib2 import Request, urlopen
from datetime import datetime

# Initialization: argv[1] is the sitemap URL, argv[2] is the instance identifier
procId = argv[2]
sitemapUrl = argv[1]
print '[%s]'%procId, "Crawling sitemap:", sitemapUrl

# Test url: request it and print the status code, elapsed time and any error
def testURL(url):
    start = datetime.now()
    msg = ''
    code = -1   # stays -1 if the request fails
    req = Request(url)

    try:
        response = urlopen(req)
        code = response.code
    except IOError, e:
        if hasattr(e, 'reason'):
            msg = '[Error: %s]' % e.reason
        elif hasattr(e, 'code'):
            msg = '[Error: %s]' % e.code

    delta = datetime.now() - start
    # total_seconds() covers the whole duration; delta.microseconds alone drops full seconds
    print '[%02s]'%procId, '[%d]'%code, '[%03dms]'%(delta.total_seconds()*1000), msg, '>>', url
    return

# Load sitemap and process
req = Request(sitemapUrl)
htmlSource = urlopen(req).read()
linksList = findall('<loc>(.*?)</loc>', htmlSource)
print len(linksList), "links found."

for link in linksList:
    testURL(link)

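The script above is written for Python 2 (`urllib2`, `print` statements). As a rough sketch only, the same two steps (pulling the `<loc>` entries out of the sitemap and testing a URL) might look like this on Python 3. The names `find_links` and `check_url`, the `opener` parameter and the 10-second timeout are my assumptions for illustration and are not part of the original script. Unlike the script above, this sketch also records the status code when the server answers with an HTTP error.

```python
import re
import time
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def find_links(sitemap_xml):
    """Return the text of every <loc> element found in the sitemap source."""
    return re.findall(r'<loc>(.*?)</loc>', sitemap_xml)


def check_url(url, opener=urlopen, timeout=10):
    """Request url and return (status_code, elapsed_ms, message).

    status_code is -1 when no HTTP status was received.
    `opener` is a hypothetical hook so the function can be tried offline;
    the timeout value is an assumption, not taken from the original script.
    """
    start = time.monotonic()
    code, msg = -1, ''
    try:
        with opener(url, timeout=timeout) as response:
            code = response.getcode()
    except HTTPError as e:
        # The server answered, but with an error status (4xx/5xx).
        code = e.code
        msg = '[Error: %s]' % e.reason
    except URLError as e:
        # No HTTP response at all (DNS failure, refused connection, ...).
        msg = '[Error: %s]' % e.reason
    elapsed_ms = int((time.monotonic() - start) * 1000)
    return code, elapsed_ms, msg
```

To use it, fetch the sitemap text, pass it to `find_links`, and call `check_url` on each link, printing the returned tuple in whatever log format you prefer.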
The script expects two parameters: the URL of the XML sitemap and an identifier that will be printed in each log line.
It is not very fast, but you can easily run multiple instances from the command line:

./sitemap_crawler.py http://example.com/sitemap.xml 1 &
./sitemap_crawler.py http://example.com/sitemap.xml 2 &
./sitemap_crawler.py http://example.com/sitemap.xml 3 &
./sitemap_crawler.py http://example.com/sitemap.xml 4 &
./sitemap_crawler.py http://example.com/sitemap.xml 5 &

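If starting several processes by hand is inconvenient, a different approach is to check the URLs concurrently inside one process. The sketch below is not what the script does: it swaps the multiple-instance trick for Python 3's `concurrent.futures.ThreadPoolExecutor`, and each URL is checked once rather than once per instance. `status_of`, `check_all`, the 10-second timeout and the default of 5 workers are hypothetical names and values chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError
from urllib.request import urlopen


def status_of(url):
    """Return the HTTP status for url, or -1 if no response was received."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.getcode()
    except HTTPError as e:
        # The server answered with an error status (4xx/5xx).
        return e.code
    except OSError:
        # URLError, timeouts and other socket errors are all OSError subclasses.
        return -1


def check_all(urls, check=status_of, workers=5):
    """Run check(url) for every URL on a pool of worker threads.

    Returns a list of (url, result) pairs in the same order as urls.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order even though the calls overlap in time.
        results = list(pool.map(check, urls))
    return list(zip(urls, results))
```

Calling `check_all(links)` with the list of links taken from the sitemap would return one `(url, status)` pair per link. Threads suit this job because the work is almost entirely waiting on the network.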
Enjoy!
