Crawling sitemaps with Python

Sep 09 2009 Published by Eneko Alonso under uncategorized

This is a basic script I wrote to crawl an XML sitemap file (it does not support nested sitemaps). For each URL in the sitemap, it reports whether the server processed the request successfully or returned some kind of error.

    #!/usr/bin/env python
    # Written for Python 2 (urllib2, print statement)
    from sys import argv
    from re import findall
    from urllib2 import Request, urlopen
    from datetime import datetime

    # Initialization: sitemap URL first, then an identifier for the log
    sitemapUrl = argv[1]
    procId = argv[2]
    print '[%s]'%procId, "Crawling sitemap:", sitemapUrl

    # Request a URL and print its status code, response time and any error
    def testURL(url):
      start = datetime.now()
      msg = ''
      code = -1  # stays -1 if no response was received
      req  = Request(url)

      try:
        response = urlopen(req)
        code = response.code
      except IOError, e:
        if hasattr(e, 'reason'):
          msg = '[Error: %s]' % e.reason
        elif hasattr(e, 'code'):
          msg = '[Error: %s]' % e.code

      delta = datetime.now() - start
      # Include whole seconds; microseconds alone wraps around after 1s
      elapsedMs = delta.seconds*1000 + delta.microseconds/1000
      print '[%02s]'%procId, '[%d]'%code, '[%03dms]'%elapsedMs, msg, '>>', url

    # Load sitemap and process
    req = Request(sitemapUrl)
    htmlSource = urlopen(req).read()
    linksList = findall('<loc>(.*?)</loc>', htmlSource)
    print len(linksList), "links found."

    for link in linksList:
      testURL(link)

The script expects two parameters: the URL of the XML sitemap and an identifier that will be printed in the log.
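The link-extraction step can also be done with an XML parser instead of a regular expression. The following is a minimal sketch in Python 3, not part of the original script; the function name `extract_links` and the sample document are made up for illustration. It assumes the sitemap uses the standard sitemaps.org namespace and, like the script above, it does not follow nested sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files
SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def extract_links(sitemap_xml):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + 'loc') if loc.text]

# Hypothetical sample sitemap, for illustration only
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>"""

print(extract_links(sample))
# prints ['http://example.com/', 'http://example.com/about']
```

Unlike the regular expression, the parser also copes with `<loc>` elements that span several lines or contain escaped characters such as `&amp;`.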

The script is not very fast, but you can easily run multiple instances from the command line:

    ./sitemap_crawler.py http://example.com/sitemap.xml 1 &
    ./sitemap_crawler.py http://example.com/sitemap.xml 2 &
    ./sitemap_crawler.py http://example.com/sitemap.xml 3 &
    ./sitemap_crawler.py http://example.com/sitemap.xml 4 &
    ./sitemap_crawler.py http://example.com/sitemap.xml 5 &
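Note that each instance started this way requests every URL in the sitemap, so the commands above multiply the load rather than split the work. If the goal is to check each URL once, only faster, a thread pool inside a single process is one alternative. This is a rough sketch in Python 3, not the original script; `check_url`, `check_all` and the `workers` and `timeout` defaults are hypothetical names and values chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def check_url(url, timeout=10):
    """Return (url, status): an HTTP status code, or an error string."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return url, response.status
    except HTTPError as e:
        # The server answered, but with an error status
        return url, e.code
    except (URLError, OSError, ValueError) as e:
        # No usable answer: DNS failure, timeout, malformed URL, ...
        return url, 'error: %s' % e

def check_all(urls, checker=check_url, workers=5):
    """Check every URL once, running up to `workers` requests at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the results in the same order as the input URLs
        return list(pool.map(checker, urls))
```

Calling `check_all(linksList)` would then return one `(url, status)` pair per link, in sitemap order.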

Enjoy!


I’m ready for some fun with Python 3.0

Dec 06 2008 Published by Eneko Alonso under uncategorized

I just found out that Python 3.0 (Python 3k) has finally been released. This is very good news :) Let’s find out how to install it on Mac OS X Leopard.


Making your objects sortable

Nov 28 2008 Published by Eneko Alonso under uncategorized

Making your objects sortable in Python is very simple: add a __cmp__ method with the logic to compare two objects and you are done!

    class person:
      def __init__(self, name, age):
        self.name = name
        self.age = age

      def __str__(self):
        return 'Person %s (%d)' % (self.name, self.age)

      def __cmp__(self, other):
        # Compare by name, then by age. Tuples keep the age numeric, so
        # 9 sorts before 10 (joining name and age as a string would not).
        return cmp((self.name, self.age), (other.name, other.age))

    lista = [
      person('Ren Smith', 24),
      person('Aohn Doe', 31),
      person('Aohn Doe', 22),
      person('Eneko Alonso', 30),
      person('Ren Gomas', 34)
    ]

    # Original order (p avoids shadowing the person class)
    for p in lista:
      print p
    print '-----'
    # Sorted order, using __cmp__
    for p in sorted(lista):
      print p

As you can see, you can totally customize your __cmp__ method and compare any class members.
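Python 3 removed both `__cmp__` and the built-in `cmp()`, so the approach above only works on Python 2. A rough Python 3 equivalent, sketched here as an illustration rather than taken from the original code, defines `__eq__` and `__lt__` and lets `functools.total_ordering` derive the remaining comparisons. The `_key` helper is a name introduced for this example.

```python
from functools import total_ordering

@total_ordering
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __str__(self):
        return 'Person %s (%d)' % (self.name, self.age)

    def _key(self):
        # Hypothetical helper: sort by name first, then by age
        return (self.name, self.age)

    def __eq__(self, other):
        if not isinstance(other, Person):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, Person):
            return NotImplemented
        return self._key() < other._key()

people = [
    Person('Ren Smith', 24),
    Person('Aohn Doe', 31),
    Person('Aohn Doe', 22),
    Person('Eneko Alonso', 30),
    Person('Ren Gomas', 34),
]

for p in sorted(people):
    print(p)
```

When the ordering is only needed in one place, `sorted(people, key=lambda p: (p.name, p.age))` gives the same result without touching the class.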

Download the code: http://enekoalonso.com/svn/python/classes/sorting-objects.py


Copying objects vs. copying references

Nov 26 2008 Published by Eneko Alonso under uncategorized

C++ is a language that allows the creation of static instances of objects, that is, objects used without pointers. This is why copy constructors are needed, since copying or cloning objects is common. Other languages, like Delphi, do not allow static variables that hold objects. Instead, all objects are accessed through pointers. So in C++, assigning one instance of a class to another creates a copy of the object, while in Delphi only the reference to the second instance is copied.

When I started learning Python a year ago, I wanted to know whether objects were cloned on assignment or whether only the reference to the object was assigned. Here is how I found out:

    class Item():
      def __init__ (self):
        self.text = ""
      def sayIt(self):
        print self.text

    A = Item()
    B = Item()
    A.text = "testing A"
    B.text = "testing B"
    A.sayIt()
    B.sayIt()
    A = B
    A.sayIt()
    B.sayIt()
    A.text = "testing A 2"
    A.sayIt()
    B.sayIt()

This code prints six messages. The first two are obviously different, since A and B are two separate instances in memory. When B is assigned to A, one of two things can happen: either B is cloned and A becomes a new instance equal to B, or the reference to B is assigned to A, so A and B become the same instance in memory (B). Whichever happens, both messages after the assignment will correspond to the B instance (“testing B”).

But what will happen when we modify A again? Will B also be modified?

Here is the output:

testing A
testing B
testing B
testing B
testing A 2
testing A 2

So A and B are now the same object in memory: Python assigned the reference instead of cloning the object.
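The same conclusion can be checked directly with the `is` operator, and the standard `copy` module provides an explicit clone when one is wanted. The snippet below is a small Python 3 sketch along those lines; the variable names and the `text` argument to the constructor are chosen for this illustration and are not part of the original test.

```python
import copy

class Item:
    def __init__(self, text=""):
        self.text = text

a = Item("testing A")
b = a              # assignment: b refers to the same object as a
c = copy.copy(a)   # explicit shallow copy: c is a separate object

a.text = "changed"

print(b is a, b.text)  # prints: True changed
print(c is a, c.text)  # prints: False testing A
```

`copy.copy` makes a shallow copy, so attributes that are themselves mutable objects would still be shared; `copy.deepcopy` clones those as well.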


Fibonacci

Nov 26 2008 Published by Eneko Alonso under uncategorized

Fibonacci is a fast-growing sequence that can easily overflow fixed-size integer variables. Fortunately, Python doesn’t have this problem, since its integers can grow to arbitrary size.

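    #!/usr/bin/env python
    def fib(n):
      if n==0 or n==1:
        ret = n
      else:
        # Both calls recompute the same smaller values again and again
        ret = fib(n-1) + fib(n-2)
      return ret

    # Warning: the running time grows exponentially with i, so this loop
    # becomes impractically slow long before it reaches the end
    for i in range(1000):
      print fib(i)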

The example above has a serious performance problem: the recursive calls recompute the n-1 and n-2 values over and over. Storing the results in a list solves this, although we now rely on the amount of available memory.

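    #!/usr/bin/env python
    # Cache of computed values: fibs[n] holds fib(n).
    # Seeding it with fib(0) and fib(1) keeps index and n aligned.
    fibs = [0, 1]
    def fib(n):
      if n < len(fibs):
        ret = fibs[n]
      else:
        ret = fib(n-1) + fib(n-2)
        # fib(n-1) filled the cache up to n-1, so this lands at index n
        fibs.append(ret)
      return ret

    # range(1001) so that the last value printed is fib(1000)
    for i in range(1001):
      print fib(i)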

Thus, Fib(1000) = 43,466,557,686,937,456,435,688,527,675,040,625,802,
564,660,517,371,780,402,481,729,089,536,555,417,949,051,890,403,879,840,079,
255,169,295,922,593,080,322,634,775,209,689,623,239,873,322,471,161,642,996,
440,906,533,187,938,298,969,649,928,516,003,704,476,137,795,166,849,228,875
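If only a single value is needed, the memory concern mentioned above can be avoided by keeping just the last two numbers of the sequence. This is an alternative sketch in Python 3, not the version used in this post; `fib_iter` is a name introduced for this example.

```python
def fib_iter(n):
    """Return the n-th Fibonacci number, keeping only two values in memory."""
    a, b = 0, 1  # fib(0), fib(1)
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib_iter(10))              # prints 55
print(len(str(fib_iter(1000))))  # prints 209, the number of digits in Fib(1000)
```

The loop also avoids recursion entirely, so large values of n do not run into Python's recursion limit.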
