Anandtech uses Trackback spam on my blog to increase traffic?

February 20th, 2012

I’ve been dealing with a rather persistent comment spam issue on my blog. Akismet seems to ignore this particular type of trackback spam and has let hundreds slip through over the last few months. Since my blog really isn’t that busy, it is very easy to identify the spam. I can’t turn off trackbacks entirely, as I do get legitimate trackbacks from other people referencing my site to show how they solved a problem.

A couple of weeks ago I installed Simple Trackback Validation with Topsy Blocker, which has done a good job of catching the ones Akismet has problems detecting. Since I run a rather complex setup for testing my plugins, I submitted a bug fix, which the author promptly applied. I don’t know why Akismet doesn’t flag keyword keyword… as potential spam, but almost every trackback Akismet has missed follows that pattern, and so far the other plugin has caught every single one.

This morning, I noticed two comments had been posted in the last seven hours, and, when I looked, I saw trackbacks claiming to come from Anandtech.com.

While my site isn’t that busy, it ranks fairly well for some search phrases and in some social circles. The most popular post on my blog was written over two years ago and was for Mac OS X Snow Leopard. Other posts have been more popular over a thirty-day window, but that post has stood the test of time even though it is now two operating system generations behind.

But I do have sites that send trackback spam to that particular page quite frequently, and Akismet misses them about 85% of the time. You wouldn’t know it from the Akismet graphs, which claim a much higher success rate, almost the inverse of the failure rate I see.

When I get links from Anandtech.com, I do often check them to make sure they are actually linking, and in a few spot checks in the past they were, but those links appear to have been cleaned from their forums and from the two posts where I remember them being listed. Today, however, the trackback was on a very new article which had no relation to the page it linked to, and, obviously, no link from their site pointed to my site.

I looked back through the approved comments and found 71 other trackbacks, investigated a number of pages, and, as you might suspect, my link wasn’t present anywhere.

This is where the analysis turns a bit sinister. Why did they pick a page that detailed an issue with Varnish and gzip-compressed pages to link to an article Anandtech wrote yesterday? The age of the post? The original post was from December 2009, though it is the outlier in the stats.

Upon looking at thirty of the trackbacks, a curious pattern emerged. Since adding the social media buttons for Google+, Twitter, and Facebook, I’ve had a quick metric to gauge post popularity, and, lo and behold, Anandtech is targeting posts that have high tweet counts, with the exception of the original outlier.

The original post is linked to a post that deals with Social Game Design, which is also a popular post. There may have been a trackback on that page which I deleted ages ago, and their trackback bot just spidered it.

Or, it could be completely dumb and have just taken the URLs from topsy.com, if they looked far enough back through my history – except for the original outlier.

But Anandtech, Welcome to the Comment Blacklist.

30 Days Review from Concept to Beta – SnapReplay.com

February 15th, 2012

It’s been a long road of late nights, early mornings, and quite a bit of learning and coding. When I came up with the idea, I selected node.js and socket.io to develop the project. Shortly thereafter, I started working on the Android app, testing my mettle with Java, a language I hadn’t used in 9+ years. Seventeen days after the concept, we had our first test, resulting in an archive of Roger Waters’ ‘The Wall’.

The Android app was refined from a usability standpoint, with the site UX lagging behind the infrastructure and databases being built to support it, but always moving forward: 133 commits pushed, some of them massive rewrites to make functionality work quite differently than originally anticipated. Iteration sometimes led to major backend changes while keeping the site functioning and live as more people poked and prodded and were generally impressed with the idea.

Reminding myself that this is an app that was built ‘in my spare time’, I’m still somewhat amused that 1k+ lines of JavaScript, 1.2k+ lines of Java, 100+ lines of PHP, and a few dozen lines of Python were written and I have a workable site. I have started to write some Objective-C, but nothing more than a few tests and demo scripts to dive into writing the iPhone app.

With that said, I’m looking for people to beta test the Android app and give feedback. Extra bonus points if you’re willing to go to a concert or event and use the app. Even more bonus points if you can recruit a second (or third) person to take pictures at the same concert. If you have an iPhone and are willing to beta test, get in contact with me. That app isn’t completed yet, but it is based very heavily on the Android app’s flow, as we have spent a lot of time working out the interface. It is ugly, but we believe it flows well and is very usable. There is still a minor lag bug that appears to be related to JSON parsing.

I’m still happy with Node.js and socket.io. There are some things I will probably change with expressjs, which I might handle with my own middleware, or might submit as a fork/pull request since it is functionality others may find valuable. I’ve reworked a lot of functionality to improve the user experience and have gotten a lot of good feedback.

I’m unsure whether I would write it entirely in node.js again, but, it was still a great learning experience. I might rewrite it later, but, for now, it seems to do just fine.

So, if you can try the app, or find a few of your friends to help out, it would be appreciated.

Thank you.

SnapReplay.com

285 WordPress Sites, upgraded in 11 minutes – and they weren’t MultiSite

February 12th, 2012

A number of our clients run WordPress, but, for some reason, keeping them updated is a problem. Sites are uploaded, run on autopilot, and forgotten… until they are hacked. Last week a client asked why his WordPress 2.8 site was hacked. WordPress 2.8 was released in June 2009, and there have been 25 WordPress releases since. We checked a few of his sites and found a few different versions running, but how many other clients were running old WordPress versions? The results were shocking.

Finding WordPress sites on shared storage

First, we need to find each of the individual clients’ WordPress installations.

find /var/www -type f -wholename \*wp-includes/version.php|awk '{ print "grep -H \"wp_version =\" " $1 }' | sh > /var/tmp/wpversions

From this, with a little cut and sort trickery, we end up with a histogram of the versions in use.
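The trickery is roughly the following (a sketch; it assumes the grep -H output format above, where the version number is the second single-quote-delimited field on each line):

cut -d"'" -f2 /var/tmp/wpversions | sort | uniq -c | sort -rn

Which gives: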

35 3.1
29 2.8.5
25 3.2.1
24 2.9.2
20 2.8.4
19 3.0.1
16 2.8.2
16 2.1
15 2.6
13 2.7.1
8 3.3.1
6 2.9.1
6 2.8
6 2.3
5 3.3
5 2.7
4 3.1.2
4 2.8.1
3 3.1.1
3 3.0.5
3 3.0.4
3 2.8.6
2 3.1.3
2 3.0
2 2.0.3
1 3.1.4
1 2.9
1 2.6.3
1 2.5.1
1 2.3.3
1 2.3.2
1 2.2.2
1 2.2.1
1 2.2
1 2.1.3
1 2.0.5
1 2.0.4

Yes, we have two 2.0.3 installations in production use out of 285 sites. Of them, only 8, or less than 3%, are running the current version, 3.3.1.

Clearly this is a problem.

We have a few options. One is to use the upgrade process inside WordPress, which requires us to communicate with each client, or to write a quick script that gives us admin privileges to do the upgrade. Or, we could use bash.

The magic

Our filesystem structure is set up so that each user has their own UID/GID, and the paths where the domains are located are fairly static. The script takes the path of the wp-includes/version.php file, strips off the pieces it needs, copies the contents of the unpacked WordPress archive over the installation, changes ownership from root back to the user that owns the directory, and then hits wp-admin/upgrade.php to run the database upgrade.

There are two variables that need to be set:

WORDPRESS_TMPDIR – set this to the directory where you have untarred and ungzipped the WordPress archive.

BASE_PATH – set this to the base path of the machine’s web root.

The script

#!/bin/bash

# cd34, 20120212
# 
# find /var/www -type f -wholename \*wp-includes/version.php|awk '{ print "grep -H \"wp_version =\" " $1 }' | sh > /var/tmp/wpversions
# 
# if you want to really save time:
# awk < /var/tmp/wpversions '{ print "/path/to/wpu.sh " $1 }' | sh -x

# set this to match your temporary directory location for WordPress
WORDPRESS_TMPDIR=/var/tmp/wordpress
# wget -O /var/tmp/latest.tar.gz http://wordpress.org/latest.tar.gz
# cd /var/tmp
# tar xzf latest.tar.gz

#set this to signify the base path of your machine's web root
BASE_PATH=/var/www

if [ "X" == "$1X" ];
then
  echo "Needs a pathname for the version.php file"
  echo
  echo "$0 /var/www/domain.com/wp-includes/version.php"
  echo
  echo 'You can include data after version.php, i.e. :$version from find command'
else
  WP_INCLUDE_PATH=$1
  # strip /wp-includes/version.php (and anything appended after it) to get the install path
  WP_PATH=${WP_INCLUDE_PATH%%/wp-includes/version.php*}
  # the directory under BASE_PATH is the domain name
  DOMAIN=${WP_PATH##$BASE_PATH/}

  # pull the numeric owner and group of the install directory out of stat's output
  TMP=`stat "$WP_PATH" | grep Uid:`
  TMP_GID=${TMP##*Gid: ( }
  DGID=${TMP_GID%%/*}
  TMP_UID=${TMP##*Uid: ( }
  DUID=${TMP_UID%%/*}

  # overlay the new WordPress files, hand anything now owned by root back to the site owner,
  # then hit upgrade.php to run the database upgrade
  cp -Rp "$WORDPRESS_TMPDIR"/* "$WP_PATH"
  chown -R --from=root $DUID.$DGID "$WP_PATH"
  /usr/bin/wget -q -O /dev/null "http://$DOMAIN/wp-admin/upgrade.php?step=1"
  echo "Upgraded: http://$DOMAIN"
fi

The code for this and a few other tools that I’ve written can be found in cd34-tools, hosted on code.google.com.

Designing MySQL Indexes (Indices)

February 7th, 2012

I’ve imported some data and I know how I want to query it, but, using the provided schema, we end up with a table of 8.1 million rows and we know very little about the composition of the data.

We know our query will be against two fields, name and timezone.

The two fields we’re dealing with:

name              | varchar(200)
timezone          | varchar(60)

The first thing we’re faced with is the fact that our name column is 200 characters. We could do:

create index name_timezone on geo_city (name,timezone);

But, that’s not going to be very efficient. Let’s do a little checking to see what our data really looks like:

mysql> select avg(length(name)) from geo_city;
+-------------------+
| avg(length(name)) |
+-------------------+
|           13.6786 |
+-------------------+
1 row in set (13.53 sec)
mysql> select max(length(name)) from geo_city;
+-------------------+
| max(length(name)) |
+-------------------+
|               154 |
+-------------------+
1 row in set (12.71 sec)

So, now that we’ve got an idea of what our data looks like, we can adjust our index. You might be thinking, let’s just create our index:

create index name_timezone on geo_city (name(20),timezone);

Note: When an index is created in MySQL, any key value is space-padded to the full length of the key specified when the index was created. An index on a varchar(255) will take up 255 bytes per row, even if the data you are storing is only 35 bytes. Smaller index files mean faster lookups, especially if we can get the index down to a size that can be cached in memory.

That might be fine in most cases, but, when we are testing for ranges, any equality conditions that come after the range condition in the index are thrown out of the key lookup. For example:

select * from tablename where condition_a>=1 and condition_a<=2 and condition_b=3;

In the above case, an index on (condition_a, condition_b) would ignore the fact that the index contained condition_b. Range checks must go after equality checks, so here we want an index on (condition_b, condition_a).
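As a concrete sketch using the placeholder names from the example above (the index name b_a is just made up for illustration):

create index b_a on tablename (condition_b, condition_a);

With that column order, the equality check on condition_b narrows the rows first and the range on condition_a is applied within that slice, so both columns participate in the key lookup.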

Note: As stated above, indexes are space-padded to the length of the key. On a range check with a BTree index, the query plan will only use the conditions that match the index order until the first range condition is hit; after that, MySQL will use a memory or on-disk temporary table depending on the size of the results. Remember, any time you use a TEXT or BLOB field, any temporary table created is ALWAYS created on disk. Make sure your /tmp spool is not on a software RAID partition, and mount it with noatime.
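For example, a dedicated /tmp partition could be mounted with an /etc/fstab line along these lines (the device and filesystem, /dev/sdb1 and ext4, are just assumptions for illustration):

/dev/sdb1   /tmp   ext4   noatime,nodev,nosuid   0   2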

Since we intend to use this table for jQuery autocomplete lookups, our query will look something like:

select * from geo_city where name like 'wel%';
select * from geo_city where substr(name,1,3)='wel';

Note: when doing this, make sure your index collation is case insensitive (a _ci collation). You can run show table status like 'geo_city'; to check that the collation is set to utf8_general_ci. If the collation is not case insensitive, condition checks for Wel and wel will return different results.

Rule number one of a good index is to make sure you have high cardinality, or uniqueness. A city name has a much higher cardinality than our timezone, but, since we’ll be doing range checks on the name, the combined cardinality of timezone plus city name will make our index lookups quicker.

Some basics on MySQL query plan execution

select * from geo_city where name like 'wel%';

select * from geo_city where substr(name,1,3)='wel';

The first does a range check, which we can verify with an explain, but the second one uses no index. You might be inclined to believe the second condition would be faster, but, when you have a calculated field on the left-hand side of the condition, MySQL is forced to do a table scan and evaluate the expression for every row. This will certainly cause performance problems.
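To see the difference yourself, compare the two plans (assuming an index whose leading column is name, the first should report a range access type while the second falls back to scanning every row):

explain select * from geo_city where name like 'wel%';
explain select * from geo_city where substr(name,1,3)='wel';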

Now we’re faced with the timezone. Timezones have created problems for ages. The names aren’t consistent, but the Olson database contains a fairly consistent table of names, using the largest populated city in each timezone as a marker. While this generally works, it is a little confusing for people in some timezones, as they wouldn’t associate a city in China with their location in Australia. Moreover, the timezone names are somewhat inconsistent, so we will convert the humanized timezone names to a timezone offset. Our offsets now span a maximum of 6 characters (-05:00, +00:00, +08:30, etc.), and we can create an index on tz_offset,name(20), which gives us a 26-character key over 8.1 million rows and an index of roughly 260MB. With our primary key index and the index on geoname_id (kept for data integrity during upgrades), the table files look like this:

-rw-rw---- 1 mysql mysql     25868 Feb  7 13:46 geo_city.frm
-rw-rw---- 1 mysql mysql 968730120 Feb  7 13:49 geo_city.MYD
-rw-rw---- 1 mysql mysql 392612864 Feb  7 13:51 geo_city.MYI
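For reference, the combined index can be created with something like this (a sketch: it assumes the tz_offset varchar(6) column has already been added and populated from the Olson data, and uses tzname as the index name to match the EXPLAIN output below):

create index tzname on geo_city (tz_offset, name(20));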

Now, our query:

select * from geo_city where tz_offset='-05:00' and name like 'wel%';

will use an index and should return results very quickly.

A quick test:

select name,admin1_code,latitude,longitude from geo_city where tz_offset='-05:00' and name like 'Wel%';
1036 rows in set (0.04 sec)

mysql> explain select name,admin1_code,latitude,longitude from geo_city where tz_offset='-05:00' and name like 'Wel%';
+----+-------------+----------+-------+---------------+--------+---------+------+------+-------------+
| id | select_type | table    | type  | possible_keys | key    | key_len | ref  | rows | Extra       |
+----+-------------+----------+-------+---------------+--------+---------+------+------+-------------+
|  1 | SIMPLE      | geo_city | range | nt,tzname     | tzname | 82      | NULL | 4231 | Using where |
+----+-------------+----------+-------+---------------+--------+---------+------+------+-------------+
1 row in set (0.00 sec)

Now, on to the next scaling issue – fixing GIS to not drill through the planet to determine distances.

Android App Graphics

February 6th, 2012

While writing an Android app, I found no list anywhere that covered the numerous graphic assets that would be needed.

* Two Screenshots – each in one of 320 x 480, 480 x 800, 480 x 854, 1280 x 720, or 1280 x 800
* High Res Icon – 512 x 512
* Promo Graphic – 180w x 120h
* Feature Graphic – 1024 x 500
* Launcher Graphic – 72×72, 48×48, 36×36

The screenshots are relatively easy to take from DDMS. The rest of the graphic items need to be created in addition to the graphics used within the application itself.
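If you prefer the command line, something like this also works as an alternative to the DDMS screen capture (a sketch, assuming a device or emulator running Android 4.0 or later connected over adb):

adb shell screencap -p /sdcard/screenshot.png
adb pull /sdcard/screenshot.png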

* Android Brand Guidelines
* Icon Design Guidelines
* Launcher Design Icon Guidelines
* Graphic Assets for your application

Apple Guidelines

* Icons/Images
