Jayd Saucedo

Working in Unfamiliar Languages
Recently I got the idea that I should make my own search engine to scrape a site that I frequent and find pages that are relevant to my interests. It seemed like a pretty simple task: make a list of pages to scan, scan the pages, and search for whatever keyword I specified. If that keyword (or keywords) existed on a page, I'd store that page in a database for later inspection. The only problem I foresaw was that I might accidentally DDoS the server if I wasn't careful about my timing.

My area of expertise is web-based applications, and the languages I know best are pretty much designed for the web. Even though this program is web related, it obviously shouldn't be a web-based application. It's the sort of program that needs to run for several hours before it completes its task, and that's not a job you want a web server to perform. So I decided to write it in Python. I know Python's syntax and I've made a couple of simple applications in it, but for me the difference between coding in PHP and coding in Python is comparable to the difference between swimming downstream and swimming upstream.

Luckily there are resources designed to help out people like me! The first website I visited was Rosetta Code. The first thing I needed to know was how Python handles opening web pages.

For example, in PHP the code would be:

print(file_get_contents("http://www.rosettacode.org"));

but in Python it would be:

import urllib
url = urllib.urlopen("http://www.rosettacode.org")
print url.read()
url.close()
That got me off to a good start. From there I needed to loop through every page I wanted to search and run a regular expression match on the HTML contents of each one. Whenever I checked a page, and again whenever I found a match, I would make a note of it in its own database file.
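The urllib snippet above is Python 2. As a rough sketch of the same fetch-then-match step in Python 3, where urlopen lives in urllib.request, something like the following should work. The helper names fetch and page_matches are made up for illustration, and the UTF-8 decoding is an assumption about the pages being scanned.

```python
import re
import urllib.request


def fetch(url, timeout=10):
    """Download a URL and return its body as text (hypothetical helper)."""
    # The with block closes the connection, like url.close() in Python 2.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # Assumption: pages are UTF-8; undecodable bytes are replaced.
        return resp.read().decode("utf-8", errors="replace")


def page_matches(html, pattern):
    """Return True if the regular expression occurs anywhere in the page."""
    return re.search(pattern, html) is not None
```

With these in place, page_matches(fetch("http://www.rosettacode.org"), "Rosetta") would do the download and the search in one line.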

Eventually my loop ended up looking like this:

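import re
import time

# i, cont, poop, section, opener, ratioless, errorOn and failStreak are set up
# earlier in the script; opener is assumed to be a urllib2 opener.
while i < len(poop) and cont:
    print str(i)+".) opening page "+str(poop[i])+' -- '+time.strftime("%H:%M:%S")
    try:
        # None means no POST data (a plain GET); give up after 10 seconds
        resp = opener.open('http://www.mysearch.com/pagenum?id='+str(poop[i]), None, 10)
        res = resp.read()
        print "finding match"
        checked = open('db/'+section+"_checked.txt", "a")
        found = open('db/'+section+"_found.txt", "a")
        if re.search('my search term', res):
            print "found!"
            found.write(poop[i]+'\n')
            ratioless.append(poop[i])
        else:
            print "nope"
        checked.write(poop[i]+'\n')
        checked.close()
        found.close()
        time.sleep(6)  # pause between pages so the server isn't hammered
        i = i + 1
        failStreak = 0
    except Exception:  # a bare except would also swallow Ctrl-C
        if errorOn == poop[i]:
            # second failure in a row on this page: skip it
            i = i + 1
            print "still can't open, moving on"
        else:
            # first failure: remember the page and try it again
            errorOn = poop[i]
            print "couldn't open"
        failStreak = failStreak + 1
        print str(failStreak) + " failures"
        time.sleep(60)  # give the site a minute to recover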
It comes complete with error checking and some precautions against accidentally DDoSing the site. I wait 6 seconds after each page, and if the site times out I wait a minute before trying again. If the same page fails twice in a row, I just move on.
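The timing rules are easier to see with the file handling stripped out. Here is a minimal Python 3 sketch of just that policy, not the script I actually ran. The crawl function and its parameters are hypothetical, and fetch is whatever function you supply to download one page and raise an exception on failure.

```python
import time


def crawl(page_ids, fetch, sleep=time.sleep, page_delay=6, retry_delay=60):
    """Visit each page, retrying a failed page once before skipping it.

    fetch(page_id) is a caller-supplied (hypothetical) function that returns
    the page text or raises an exception on a timeout or error.
    Returns (pages, skipped): downloaded text by id, and the ids given up on.
    """
    pages, skipped = {}, []
    for page_id in page_ids:
        for attempt in (1, 2):
            try:
                pages[page_id] = fetch(page_id)
            except Exception:
                # The site may be struggling, so back off before anything else.
                sleep(retry_delay)
                if attempt == 2:
                    skipped.append(page_id)  # failed twice in a row, move on
            else:
                sleep(page_delay)  # throttle between successful downloads
                break
    return pages, skipped
```

Passing sleep in as a parameter is only there so the policy can be tried out without actually waiting; in real use the default time.sleep does the job.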

Rosetta Code is a good site, but PHP is so easy because of the number of built-in functions that just do everything for you. Eventually I got so uncreative with trying to describe my problem to Google that I just typed in something along the lines of "php explode function for python", and that query led me directly to Php2Python, a website designed to give you all the PHP functions in Python syntax.

Using Php2Python I learned that Python has an equivalent of explode, the string method split, and now I use it for DB management:

poop = open('db/'+section+".txt", "r").read().split('\n')
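For anyone coming from PHP, here is a small Python 3 sketch of how split lines up with explode, plus one quirk worth knowing when reading a file this way. The sample strings are made up for illustration.

```python
# PHP: explode(",", "a,b,c")   ->   Python: "a,b,c".split(",")
parts = "a,b,c".split(",")

# PHP: implode(",", $parts)    ->   Python: ",".join(parts)
joined = ",".join(parts)

# A file that ends with a newline leaves an empty string at the end of
# split("\n"); splitlines() drops it.
text = "101\n102\n103\n"
with_split = text.split("\n")       # ['101', '102', '103', '']
with_splitlines = text.splitlines()  # ['101', '102', '103']

print(parts, joined)
print(with_split)
print(with_splitlines)
```

If the list of page ids ends with a blank entry, the loop will try to open a page with an empty id, so splitlines may be the safer choice.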
I'm glad to see there are resources that let you easily translate your knowledge of one language into the syntax of another. Now my program's chugging away and feeding me the results I've been waiting for!