Archive for the ‘Programming Languages’ Category

Quick Python search and replace script

Friday, October 28th, 2011

We have a client machine that is under a bit of load and has a ton of modified files. Normally we would just restore from the last backup or the previous generation backup, but over 120k files have been exploited since June 2011. Since the machine is doing quite a bit of work, we need to throttle our replacements so that we don’t kill the server.

#!/usr/bin/python
"""

Quick search and replace to replace an exploit on a client's site while
trying to keep the load disruption on the machine to a minimum.

Replace the variable exploit with the code to be removed. By default,
this script starts at the current directory. max_load is the one-minute
load average threshold; when it is exceeded, the script sleeps in
30-second intervals until the load drops.

"""

import os
import re
import time

path = '.'
max_load = 10

exploit = """
<script>var i,y,x="3cblahblahblah3e";y='';for(i=0;i
""".strip()

file_exclude = re.compile(r'\.(gif|jpe?g|swf|css|js|flv|wmv|mp3|mp4|pdf|ico|png|zip)$',
                          re.IGNORECASE)

def check_load():
    # Sleep in 30-second intervals until the one-minute load average drops
    # back below max_load, so we don't add work while the box is busy.
    load_avg = int(os.getloadavg()[0])
    while load_avg > max_load:
        time.sleep(30)
        load_avg = int(os.getloadavg()[0])

def getdir(path):
    # Recursively walk the tree, checking the load before each directory
    # and skipping static/binary files that can't carry the exploit.
    check_load()
    for file in os.listdir(path):
        file_path = os.path.join(path, file)
        if os.path.isdir(file_path):
            getdir(file_path)
        else:
            if not file_exclude.search(file_path):
                process_file(file_path)

def process_file(file_path):
    # Read the file and, if the exploit string is present, strip it out
    # and rewrite the file in place (truncate, rewind, write).
    file = open(file_path, 'r+')
    contents = file.read()
    if exploit in contents:
        print 'fixing:', file_path
        contents = contents.replace(exploit, '')
        file.truncate(0)
        file.seek(0, os.SEEK_SET)
        file.write(contents)
    file.close()

getdir(path)

Thankfully, since this server is run as www-data rather than SetUID, the damage wasn’t as bad as it could have been.

AVL Tree, doubly linked lists and iterations

Wednesday, September 28th, 2011

Over the past few months I’ve been mulling over a Network Performance and Intrusion Detection box that could sit on a network and make route suggestions to the border.

There are some basic CompSci fundamentals involved in analyzing the data to find the source and preferred path for each incoming packet. When you’re dealing with flows and traffic capture at multiple gigabits per second, you need something very quick.

The problem we have is the representation of an IP address in memory. Obviously we can use a tuple to store the octets or words, but, the Internet isn’t fully allocated and there are a lot of empty netblocks.

Since the data is sparse, and for each packet we need to know whether it falls in a netblock we’ve seen, which ASN announced it, and the preferred path, we have a small problem.
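
One common trick, sketched below, is to pack the four octets into a single 32-bit integer so that addresses sort and compare as plain numbers; this is just a standard-library sketch of the representation, not code from the actual project.

#!/usr/bin/python
# Pack a dotted-quad IPv4 address into a single 32-bit integer so that
# addresses order and compare as plain numbers.

import socket
import struct

def ip_to_int(ip):
    return struct.unpack('!I', socket.inet_aton(ip))[0]

print ip_to_int('66.244.19.68')                       # 1123291972
print ip_to_int('4.0.0.1') < ip_to_int('66.244.0.0')  # True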

Our first issue comes from the fact that we might see an announcement of:

66.244.0.0/18

which contains all of the IP addresses from 66.244.0.0 to 66.244.63.255. So, large block announcements (and overlapping announcements) become a potential problem. Currently, there is an announcement for 4.0.0.0/8 and one for 4.0.0.0/9. Any IP address in 4.0.0.0/9 (over 8.3 million addresses) is handled by one announcement but is also contained within 4.0.0.0/8.
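
To make those ranges and the overlap concrete, here’s a small sketch (plain integer math, standard library only, mirroring the examples above rather than any production code) that computes the address range a CIDR prefix covers and checks containment between two announcements.

#!/usr/bin/python
# Compute the address range covered by a CIDR prefix and check whether
# one announcement is contained within another.

import socket
import struct

def ip_to_int(ip):
    return struct.unpack('!I', socket.inet_aton(ip))[0]

def int_to_ip(value):
    return socket.inet_ntoa(struct.pack('!I', value))

def cidr_range(prefix):
    # Return (first, last) integer addresses covered by 'a.b.c.d/len'.
    ip, length = prefix.split('/')
    length = int(length)
    start = ip_to_int(ip) & (0xffffffff << (32 - length)) & 0xffffffff
    end = start | (0xffffffff >> length)
    return start, end

def contains(outer, inner):
    # True if every address of 'inner' falls inside 'outer'.
    o_start, o_end = cidr_range(outer)
    i_start, i_end = cidr_range(inner)
    return o_start <= i_start and i_end <= o_end

start, end = cidr_range('66.244.0.0/18')
print int_to_ip(start), '-', int_to_ip(end)   # 66.244.0.0 - 66.244.63.255
print contains('4.0.0.0/8', '4.0.0.0/9')      # True
print contains('4.0.0.0/9', '4.128.0.0/9')    # False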

With this in mind, we’ve ruled out a simple binary search tree, as our data would leave it quite imbalanced. We could store the octet tuples in a tree three levels deep, but there are portions of the Internet filled with /24s that would be quite dense. Since those portions of netspace are not only densely packed but quite active, we know we’re going to spend a lot of time traversing those nodes.
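
For reference, here’s what that three-octet tree looks like in its simplest form (nested dicts with made-up announcement data, not the structure actually used); note that fully populating a /18 would mean filling in 64 third-octet entries.

#!/usr/bin/python
# Sketch of the three-level octet tree discussed above, using nested dicts.
# The announcement data is made up; populating 66.244.0.0/18 for real would
# mean filling in all 64 of its third-octet entries.

tree = {}

def add_block(first, second, third, info):
    tree.setdefault(first, {}).setdefault(second, {})[third] = info

def find(ip):
    # Walk one level per octet; returns None for unseen netblocks.
    a, b, c, _ = (int(o) for o in ip.split('.'))
    return tree.get(a, {}).get(b, {}).get(c)

add_block(66, 244, 19, {'prefix': '66.244.0.0/18', 'asn': 306})

print find('66.244.19.68')   # announcement info for 66.244.0.0/18
print find('192.0.2.1')      # None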

Back to the issue at hand: we have an IP address, 66.244.19.68/32, and we need to find the source announcement for it. Assuming for the moment that we ignore overlapping announcements, we need to determine that the IP address is announced in 66.244.0.0/18, that it originates from ASN306, and which of our current peers is the ‘best path’ according to BGP4.

BGP4 prefers routes based on the shortest AS path. Each hop through a provider adds an ASN to the path so that we know how to get to a destination. While it seems logical that the shortest ASPATH would also be the fastest path, that isn’t always the case. There are times when a packet traverses a congested provider backbone, and avoiding that provider would give faster connection and transfer speeds for that destination. This happens more frequently than you might think: providers dump traffic onto free exchanges rather than paying for transit, and run those connections at 95% capacity during peak hours. Performance analysis is a separate issue, though; here we just need to find the most specific announcement for a particular destination.
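
As a toy illustration of that default selection rule, the sketch below just picks the peer advertising the shortest AS path for a prefix; the peer names and paths are made up (private-use ASNs plus the origin ASN from the example above).

#!/usr/bin/python
# Toy illustration of BGP4's default preference for the shortest AS path.
# Peer names and AS paths are made up.

routes = {
    'peer-a': [64512, 64513, 306],         # three AS hops to the origin
    'peer-b': [64514, 64515, 64516, 306],  # four AS hops
}

best_peer = min(routes, key=lambda peer: len(routes[peer]))
print best_peer, routes[best_peer]         # peer-a [64512, 64513, 306]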

There are several data structures that come to mind:

* Sorted List – iterate through sequentially, use a binary search, etc.
* Red/Black tree – faster than a plain binary tree, but route announcements can be withdrawn, and those deletions could leave us with a more imbalanced tree.
* AVL tree – more strictly balanced than red/black, great for lookups, with a slight performance penalty on deletions; but we don’t delete many items very often.

So, the AVL tree probably wins in this case, but we’re not comparing for equality: we need to find the announcement that contains an IP address. Memory is cheap, and storing every single netblock with an associative pointer to its source announcement in a simple hash would give the fastest lookups. With IPv4, keying on only the first three octets would mean roughly 16.8 million entries; with IPv6, the same hash would need on the order of 1.84×10^19 entries. In either case, populating that full hash would take a bit of time, so instead we modify our AVL tree lookup slightly.
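
Those sizes are just the keyspace arithmetic (one entry per IPv4 /24 versus one per IPv6 /64):

print 2 ** 24            # 16777216 -- about 16.8 million IPv4 /24s
print '%e' % (2 ** 64)   # 1.844674e+19 possible IPv6 /64s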

We need to find the first node of the tree with a value greater than our IP address, then take the previous node and verify that the IP address is within that announcement. It should be, if our routing table is complete; if not, it could be a ‘martian’ or a destination that falls through to our default route.
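
Here’s a rough sketch of that lookup in Python, using a sorted list and bisect in place of the balanced tree: find the first entry greater than the address, step back one, and verify containment. The table is tiny made-up sample data (the origin ASN for 66.244.0.0/18 comes from the example above; the other is a placeholder), and overlapping announcements are ignored.

#!/usr/bin/python
# Sketch of the 'find the successor, step back, verify' lookup described
# above, using a sorted list + bisect in place of a balanced tree.

import bisect
import socket
import struct

def ip_to_int(ip):
    return struct.unpack('!I', socket.inet_aton(ip))[0]

def cidr_to_range(prefix):
    # Assumes the prefix is already the aligned network address.
    ip, length = prefix.split('/')
    start = ip_to_int(ip)
    return start, start | (0xffffffff >> int(length))

# (prefix, origin ASN) -- made-up sample table, no overlapping prefixes.
sample_table = [('4.0.0.0/9', 64512), ('66.244.0.0/18', 306)]

announcements = sorted(cidr_to_range(prefix) + (prefix, asn)
                       for prefix, asn in sample_table)
starts = [entry[0] for entry in announcements]

def lookup(ip):
    # Find the first announcement starting above the address, step back to
    # the previous entry, then verify the address is inside that netblock;
    # anything else is a martian or falls through to the default route.
    addr = ip_to_int(ip)
    idx = bisect.bisect_right(starts, addr) - 1
    if idx < 0:
        return None
    start, end, prefix, asn = announcements[idx]
    if start <= addr <= end:
        return prefix, asn
    return None

print lookup('66.244.19.68')   # ('66.244.0.0/18', 306)
print lookup('192.0.2.1')      # None -- not covered by our tiny table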

Since we’ve chosen Erlang, we get its built-in General Balanced Tree implementation (gb_trees), which has some decent performance advantages over a generic AVL tree, but we need to modify the traversal slightly for our use case. Luckily, we can do this.

I raise this issue because I read a blog post about a student claiming he’d never put anything he learned in college to use. I think he’s just not solving the right problems. I learned about trees, sorts, linked lists in college and still put things like this to use today.

I remember back in the mid ’80s, when I was attending UMBC for a degree in Computer Science. Since I already had programming experience, I was able to take two experimental courses in place of the general CompSci track, which introduced me to Ada, taught by one of the few experts at the time. A friend taking CS270 – Pascal mentioned that his teacher was giving a lecture on doubly linked lists and a new method of traversal he had come up with. As I lived on campus and my advisors had completely screwed up my class schedule, I had a lot of free time between classes and opted to sit in.

Of course, the teacher decided to give a pop quiz that day before the lecture, which I passed along to my neighbor. The TA asked why I wasn’t taking the quiz, and I said I wasn’t in the class and was only there to hear the lecture. This raised a small commotion, which brought the professor over to ask why I refused to take the quiz. I explained that I had heard through a friend that he was giving a lecture on doubly linked lists with a different method of traversal and I was interested in hearing it – I wasn’t a member of the class, and if he preferred, I would leave, but I wanted to hear the lecture.

I think something happened – this teacher had a reputation for being very intelligent but very gruff. He asked me to come with him to the front of the class and said,

Ladies and Gentlemen, this is Chris Davies. He doesn’t attend this class, but, wanted to hear my lecture. Here’s a student that actually wanted to be here to learn while you’re all disappointed with having to show up to hear me talk.

He gave his lecture and we did talk a few more times after that about data retrieval and data structures.

Yes, twenty-six years later, I am still using the education I paid for.

Pyramid Apex – putting it in production

Monday, August 15th, 2011

After quite a bit of work we’ve finally gotten Pyramid Apex to a point where I can deploy it on two production apps to make sure things are working as I expect they should.

If you’re developing a Pyramid Application and are using Authentication/Authorization, I18N/L10N, Flash Messages and a Form Library, take a look at Pyramid Apex, a library Matthew Housden and I wrote to make it easier to quickly develop Pyramid applications.

It supports OpenID, local authentication with bcrypt password storage, and a number of other basic features.
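
As a rough idea of what wiring it in looks like, here’s a minimal sketch of a Pyramid application pulling in an add-on this way; it assumes Pyramid Apex exposes the standard config.include() hook, and the route and settings shown are placeholders rather than the library’s actual configuration keys.

#!/usr/bin/python
# Minimal sketch of including an add-on in a Pyramid app.  Assumes the
# standard config.include() hook; settings and routes are placeholders.

from wsgiref.simple_server import make_server

from pyramid.config import Configurator
from pyramid.response import Response

def home(request):
    return Response('hello from a pyramid_apex-enabled app')

if __name__ == '__main__':
    settings = {}  # a real app would add the add-on's settings here
    config = Configurator(settings=settings)
    config.include('pyramid_apex')      # assumed include path
    config.add_route('home', '/')
    config.add_view(home, route_name='home')
    app = config.make_wsgi_app()
    make_server('0.0.0.0', 6543, app).serve_forever()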

Google+, Python, and mechanize

Sunday, July 3rd, 2011

Since Google+’s release, I’ve wanted access to an API. I’m told soon. I couldn’t wait.

#!/usr/bin/env python

import mechanize

cj = mechanize.LWPCookieJar()
try:
    # Reuse cookies from a previous run if we have them.
    cj.load("cookies.txt")
except IOError:
    pass

br = mechanize.Browser()
br.set_cookiejar(cj)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# Present a browser-like User-Agent string.
br.addheaders = [('User-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.810.0 Safari/535.1 cd34/0.9b')]
br.open('https://www.google.com/accounts/ServiceLogin?service=oz&passive=1209600&continue=https://plus.google.com/up/start/')

# The login form is the first form on the page.
br.select_form(nr=0)

# The Email control is flagged readonly; unlock it so we can set a value.
br.form.find_control("Email").readonly = False
br.form['Email'] = 'email@address.com'
br.form['Passwd'] = 'supersecretpasswordhere'

br.submit()

# Dump the links on the landing page to confirm the login worked.
for l in br.links():
    print l

cj.save("cookies.txt")

A weekend with Tornado

Tuesday, June 29th, 2010

After working on a Pylons project for a week or so, I felt a minor part of it didn’t need the complexity of a full framework. Some quick benchmarking of the most minimal Pylons/SQLAlchemy project I could muster came in around 200 requests per second, which put me at roughly 12 million requests per day based on the typical traffic curve.

Within 15 minutes of installing Tornado and starting from their simple hello world example, I imported SQLAlchemy and boosted this to 280 requests per second. As I didn’t really need any of the ORM’s features, I switched to tornado.database, which isn’t much more than a bare wrapper around python-mysql. Even with a single worker process, I was able to get 870 requests per second. 56 million requests per day, without any tuning?
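
For reference, the sort of minimal Tornado app this test was built around looks roughly like the sketch below; the database host, credentials, and query are placeholders, and tornado.database was later split out of Tornado as the separate torndb package.

#!/usr/bin/env python
# Rough sketch of a minimal Tornado app using tornado.database, along the
# lines of the test described above.  Host, credentials, and query are
# placeholders.

import tornado.database
import tornado.httpserver
import tornado.ioloop
import tornado.web

# tornado.database is a thin wrapper around MySQLdb; a single connection
# is shared by the single worker process.
db = tornado.database.Connection(
    host="127.0.0.1:3306", database="test", user="test", password="secret")

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        # A trivial query stands in for the real lookup being benchmarked.
        row = db.get("SELECT 1 AS result")
        self.write("Hello, world: %d" % row.result)

application = tornado.web.Application([
    (r"/", MainHandler),
])

if __name__ == "__main__":
    http_server = tornado.httpserver.HTTPServer(application)
    http_server.listen(8888)
    tornado.ioloop.IOLoop.instance().start()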

I’m reasonably impressed. Once I put it on production hardware, I’m thinking I’ll easily be able to count on double those numbers if not more.

Next weekend, Traffic Server.
