
Murmurhash2 in PHP without the extension

Murmurhash is a nice and speedy hashing algorithm that is handy for creating hash values based on strings. I use it often as benchmarks suggest it is one of the speedier implementations out there. Murmurhash can create 32-bit or 128-bit outputs.

In PHP, if you are able to install extensions, then you can simply install the murmurhash extension * (see bottom of page for instructions) and be done with it. If you’re on shared hosting, here is an extensionless alternative to produce 32-bit outputs based on the 2nd version of the murmurhash algorithm.

Do note, it is many times slower than the extension implementation, simply because it’s a user-created function. The code itself is relatively efficient and mostly bitshifting anyway. I had to knock this together because I needed murmurhash in a shared hosting environment where installing extensions is not an option.
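Here’s a sketch of such a function, following the reference MurmurHash2 algorithm. It assumes 64-bit PHP, so the 32-bit multiplications fit in a native integer before being masked back down:

    <?php
    // MurmurHash2 (32-bit) in plain PHP; assumes 64-bit PHP integers
    function murmurhash2($data, $seed = 0) {
        $m = 0x5bd1e995;
        $len = strlen($data);
        $h = ($seed ^ $len) & 0xffffffff;
        $i = 0;
        // Mix the string four bytes (little-endian) at a time
        while ($len >= 4) {
            $k = ord($data[$i]) | (ord($data[$i + 1]) << 8)
               | (ord($data[$i + 2]) << 16) | (ord($data[$i + 3]) << 24);
            $k = ($k * $m) & 0xffffffff;
            $k ^= $k >> 24;
            $k = ($k * $m) & 0xffffffff;
            $h = (($h * $m) & 0xffffffff) ^ $k;
            $i += 4;
            $len -= 4;
        }
        // Handle the trailing bytes (deliberate switch fall-through, as in C)
        switch ($len) {
            case 3: $h ^= ord($data[$i + 2]) << 16;
            case 2: $h ^= ord($data[$i + 1]) << 8;
            case 1: $h ^= ord($data[$i]);
                    $h = ($h * $m) & 0xffffffff;
        }
        // Final avalanche
        $h ^= $h >> 13;
        $h = ($h * $m) & 0xffffffff;
        $h ^= $h >> 15;
        return $h;
    }

    echo murmurhash2('hello world');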

* More recently, the particular link for instructions on how to install the murmurhash extension is no longer available. Here is the general gist of how to install the extension:
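The steps below are the usual build-from-source dance for a PHP extension; the repository URL and php.ini path are illustrative, so substitute your own:

    # Fetch the extension source (URL illustrative), then build and install it
    git clone https://github.com/example/php-murmurhash.git
    cd php-murmurhash
    phpize
    ./configure
    make
    sudo make install

    # Enable it in php.ini (path varies by distribution), then restart Apache/PHP
    echo "extension=murmurhash.so" | sudo tee -a /etc/php5/apache2/php.ini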

Tip: Storing MD5 Values (and other string/binary representations)

A common occurrence I have noticed in MySQL apps is that MD5 values are stored as 32 byte values rather than 16. Just to ‘rehash’, an MD5 value is a 16 byte value, typically used as a unique fixed-length signature of a string, useful for identifying unique strings or one-way encryption of passwords. The binary representation takes 16 bytes, though a human readable hexadecimal version takes twice as many.

The same goes for any of the other hashing techniques. They tend to output a friendly hex format, useful in a number of cases, like in Javascript or within a particular format such as CSV or TSV (raw binary bytes would mess up the delimiting of data). When you’re looking to store these values though, most of the time it makes sense to have them in their shorter binary representation.

Another common example is IP addresses: I often see VARCHAR(16) for IPv4 addresses. Perhaps when IPv6 is more commonplace we will see VARCHAR(64) instead. IPv4 addresses are 32-bit values and can be stored as an UNSIGNED INT (4 bytes), while IPv6 addresses are 128-bit. There isn’t a native 16-byte integer type in MySQL, so a BINARY(16) or two UNSIGNED BIGINT fields would do, though perhaps software will address this as IPv6 gains adoption.
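For IPv4 the conversions are already built in, both in MySQL (INET_ATON()/INET_NTOA()) and in PHP (ip2long()/long2ip()). A quick illustration, with a table name that is just for the example:

    -- IPv4 stored as a 4-byte integer rather than a VARCHAR
    CREATE TABLE visits (ip INT UNSIGNED NOT NULL);
    INSERT INTO visits VALUES (INET_ATON('192.168.0.1'));
    SELECT INET_NTOA(ip) FROM visits;  -- returns '192.168.0.1'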

When doing lookups on these kinds of fields, you want them as small as possible so that they can fit neatly into indexes and less processing time is spent evaluating them.

The following is a simple test to compare lookup speeds on a CHAR(32) MD5 column versus a BINARY(16) one:
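A sketch of such a test (table names are illustrative); the second insert is repeated to double the row count each pass until around 2^20 rows are in place:

    CREATE TABLE md5_char   (hash CHAR(32)   NOT NULL, KEY (hash));
    CREATE TABLE md5_binary (hash BINARY(16) NOT NULL, KEY (hash));

    -- Seed one row: first half zero-padded, second half random
    INSERT INTO md5_char
        SELECT CONCAT(REPEAT('0', 16), RIGHT(MD5(RAND()), 16));

    -- Run this repeatedly (~20 times) to double the row count each pass
    INSERT INTO md5_char
        SELECT CONCAT(REPEAT('0', 16), RIGHT(MD5(RAND()), 16)) FROM md5_char;

    -- Mirror the same values in the compact representation
    INSERT INTO md5_binary SELECT UNHEX(hash) FROM md5_char;

    -- Time equivalent lookups against each column
    SELECT COUNT(*) FROM md5_char
        WHERE hash = CONCAT(REPEAT('0', 16), RIGHT(MD5('x'), 16));
    SELECT COUNT(*) FROM md5_binary
        WHERE hash = UNHEX(CONCAT(REPEAT('0', 16), RIGHT(MD5('x'), 16)));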

The MD5 values that are inserted are deliberately left-padded with 0’s to emphasise the fact that field lengths do make a difference when searching on a field, regardless of whether the field is indexed or not. This is because we’re only populating the table with ~2^20 rows, whereas random MD5s have 2^128 possible values. If we just used random MD5s then MySQL would only have to examine the first byte or two due to our small dataset, and there would be a negligible difference in our small sample. Over millions of runs, or a larger dataset… the difference grows.

The output should show the BINARY(16) lookups completing noticeably faster than their CHAR(32) equivalents.

A Quick and Efficient URL Shortener Using PHP and MySQL

URL shorteners have proliferated in the past few years, mainly due to the confines of data length that mobile and social networks like Twitter apply. The following code example shows how to make a simple and efficient URL shortener, with plenty of scope for improvement.

Although in this example I am going to use localhost as the serving domain, you’d be looking to use as short a domain name as possible in production. You have a fairly good chance of securing a 5 character domain that’d result in 7 character short URLs to begin with.

This example assumes you’re on a 64-bit system that uses a web server capable of rewriting, and uses PHP/MySQL. I use the hashing function murmurhash() as it’s known to be very quick and effective with short strings, though it’s not native to standard PHP installations. You can follow these instructions to install murmurhash. If you do change the hashing method, just ensure you also update the table schema in MySQL (one field is used for the hash values).

Also, I’ve used MySQL partitioning which makes lookups more efficient, but it’s not necessary for the working of the script.

The HTML layout is extremely simple; in fact, you’ll only see a form and the display of newly created short URLs.

Other than displaying HTML contents, the concept is quite simple:

— Allow a user to submit a long URL and convert it into a short one
— Deal with requests for short URLs and redirect them to long ones.

This, essentially, is what a URL shortener is. Some services will give you nice statistics and graphs about the people who visit a short URL. The code provided here is simple and extensible enough for you to do that should you wish.

There are 3 locations where data is stored in this script: 2 MySQL tables and one flatfile. One table is for inserts, one for selects, and the flatfile contains the long URLs in the system.

Insert Table

When new URLs are added, a quickish method is needed to see whether the URL already exists in the database or not. This is done by creating a 64-bit hash from the contents of the URL, though you can use whichever hashing method and size of data you wish. 8 bytes is a fairly good size for avoiding collisions while not being too large a key.

insert_table is partitioned (if you choose to) and contains three other fields…

— fileoffset – A pointer to a position in the flatfile that contains the URL in question. Since hashes can collide, all matching hash values are checked until the corresponding URL is found (or not found).
— urllength – Also part of the primary key, this is used to further reduce the potential result set in the case of collisions. More than one result will appear only for URLs that match both the hash and the urllength.
— id – The unique incremental ID of the URL; this is converted into short URLs. In cases where someone submits a URL to be shortened that already exists in the database, this datapoint is used.

After a long URL is submitted, the short version is returned to the end user for them to use.

Select Table

select_table simply holds the unique incremental ID of the URL and a fileoffset for where to find the long URL. It is used when someone loads up a short URL and needs redirecting to a long URL.

URL file

The URL file is simply a raw list of the long URLs entered into the system.

The Code

Without further ado, here’s the code in order to try it out.

First off, we want to redirect all requests that may be shortened URLs to our single script. Something like this in .htaccess does the trick.
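    RewriteEngine On
    # Anything that is not the script itself and contains no further
    # forward slash is handed to the shortener
    RewriteCond %{REQUEST_URI} !shortener\.php
    RewriteRule ^[^/]*$ shortener.php [L,QSA]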

This rule basically means “if the URL is not shortener.php and does not contain a forward slash, redirect to the URL shortener”. This will allow you to create extra pages on the domain that won’t be redirected, but they’ll need to have a forward slash included in the URL.

Now, add this SQL schema to a database of your choice.
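Something like the following; the exact column types are assumptions you can tune to taste:

    -- The insert-side table: lookups by hash when new URLs are submitted
    CREATE TABLE insert_table (
        hash       BIGINT UNSIGNED   NOT NULL,  -- 64-bit hash of the long URL
        urllength  SMALLINT UNSIGNED NOT NULL,  -- length of the long URL
        fileoffset BIGINT UNSIGNED   NOT NULL,  -- position of the URL in the flatfile
        id         INT UNSIGNED      NOT NULL,  -- incremental ID, becomes the short URL
        PRIMARY KEY (hash, urllength, fileoffset)
    ) ENGINE=InnoDB;

    -- The select-side table: lookups by ID when short URLs are requested
    CREATE TABLE select_table (
        id         INT UNSIGNED      NOT NULL AUTO_INCREMENT,
        fileoffset BIGINT UNSIGNED   NOT NULL,
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;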

Not required but recommended, also apply the following SQL. Partitions can be created during table creation but I’ve separated the two concepts here for clarity.
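    -- Split the tables by their key columns; 64 partitions is an arbitrary choice
    ALTER TABLE insert_table PARTITION BY KEY (hash) PARTITIONS 64;
    ALTER TABLE select_table PARTITION BY KEY (id)   PARTITIONS 64;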

Note that partitions are usually stored as separate files and mostly treated as separate tables. You can have up to 8192 partitions on more recent versions of MySQL; they take longer to create but certainly should help scaling. You may run into operating system/user open file limits if you use a higher number, though that is trivial to change after a quick Google.

In more recent versions of MySQL you can also reference particular partitions directly, which can help the query optimiser pick the correct/minimal partitions, particularly when using JOINs.
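For example, in MySQL 5.6+ (the partition name here is illustrative; KEY partitions are auto-named p0, p1, …):

    SELECT id FROM insert_table PARTITION (p42)
        WHERE hash = 1234567890 AND urllength = 20;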

Here’s the PHP driving it all…
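What follows is a condensed sketch of the script. The credentials, paths and the murmurhash() return value (taken here to be the 64-bit integer the schema stores) are assumptions to adapt, the flatfile is assumed to be newline-delimited, and PHP 7+ is assumed for intdiv():

    <?php
    // shortener.php — condensed sketch of the flow described above
    $db       = new PDO('mysql:host=localhost;dbname=shortener', 'user', 'pass');
    $urlfile  = '/var/www/urls.dat';
    $alphabet = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

    // Convert an incremental ID to a short base-62 token and back again
    function encode_id($id, $alphabet) {
        $out = '';
        do { $out = $alphabet[$id % 62] . $out; $id = intdiv($id, 62); } while ($id > 0);
        return $out;
    }
    function decode_id($token, $alphabet) {
        $id = 0;
        foreach (str_split($token) as $c) { $id = $id * 62 + strpos($alphabet, $c); }
        return $id;
    }

    if (!empty($_POST['url'])) {
        // New submission: hash it, then walk any colliding rows of the same length
        $url  = $_POST['url'];
        $hash = murmurhash($url);   // assumed to return the 64-bit hash
        $len  = strlen($url);
        $fp   = fopen($urlfile, 'a+b');
        $stmt = $db->prepare('SELECT id, fileoffset FROM insert_table
                              WHERE hash = ? AND urllength = ?');
        $stmt->execute([$hash, $len]);
        foreach ($stmt as $row) {
            fseek($fp, $row['fileoffset']);
            if (fread($fp, $len) === $url) {   // already shortened, reuse its ID
                exit('http://localhost/' . encode_id($row['id'], $alphabet));
            }
        }
        // Unseen URL: append it to the flatfile, then record it in both tables
        fseek($fp, 0, SEEK_END);
        $offset = ftell($fp);
        fwrite($fp, $url . "\n");
        $db->prepare('INSERT INTO select_table (fileoffset) VALUES (?)')
           ->execute([$offset]);
        $id = (int) $db->lastInsertId();
        $db->prepare('INSERT INTO insert_table (hash, urllength, fileoffset, id)
                      VALUES (?, ?, ?, ?)')->execute([$hash, $len, $offset, $id]);
        exit('http://localhost/' . encode_id($id, $alphabet));
    }

    if (!empty($_SERVER['REDIRECT_URL']) && $_SERVER['REDIRECT_URL'] !== '/') {
        // Short URL request: decode the token, find the offset, redirect
        $id   = decode_id(ltrim($_SERVER['REDIRECT_URL'], '/'), $alphabet);
        $stmt = $db->prepare('SELECT fileoffset FROM select_table WHERE id = ?');
        $stmt->execute([$id]);
        if ($row = $stmt->fetch()) {
            $fp = fopen($urlfile, 'rb');
            fseek($fp, $row['fileoffset']);
            header('Location: ' . rtrim(fgets($fp)), true, 301);
            exit;
        }
        header('HTTP/1.1 404 Not Found');
        exit('Unknown short URL');
    }

    // Otherwise, show the submission form
    echo '<form method="post">
            <input type="text" name="url" size="60">
            <input type="submit" value="Shorten">
          </form>';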

Some Notes and Possible Improvements

— As is, the shortener is fine for personal use or use within a trusted network.

— You may want to log some details about users’ submitted URLs, in case particular users become problematic by submitting garbage, filling up your database or linking to material you believe should not be linked to.

— Along the same lines, you may want to rate limit new URL submissions per IP or per session cookie.

— Some basic statistics about the number of visitors to each URL may be useful.

— Along the same lines, you may want to build a memory cache based on popular URLs.

— Occasionally you will want to rebuild the partitions in the insert_table. This is because the hash values are inserted in a random order and gradually fragment the table. Having partitions helps you do this process incrementally and continue to be able to serve requests (you would need some kind of indication that a particular partition is ‘busy’ and copy the contents of the partition somewhere temporary in order to continue serving requests for data within it, until the partition is fully rebuilt).

— You could sacrifice some processing for more optimal storage of URLs by converting common components of a URL into an integer flag. For example, most if not all URLs would be either HTTP or HTTPS, and it only requires 1 bit of information to distinguish them from each other. “www.” in hostnames is another common component. TLDs are another, as are file extensions.

— Pre-filling the URL file with a large amount of space would avoid fragmentation of the file due to the small incremental writes on it. You’d want to mark somewhere how much data is actually in the file, as the script currently just seeks to the end of it to write new data. (The same idea could be applied to the MySQL insert_table.)

— Having multiple disks (preferably SSDs) would obviously help with IO contention. Also, having a more logical ordering of the URLs (by length, for instance) could rid you of fileoffsets in the database altogether, because you’d only need to know the unique ID and the length of a short URL in order to find its longer counterpart. I deliberately made this example code simple in order to not have too many open file handles.

Storing InnoDB Tables on Multiple Directories and Disks

InnoDB by default stores all data in one large file, typically referred to as the InnoDB tablespace. Without customisation, a file named ibdata1 at the root of your MySQL data directory will contain all your data and indexes.

One problem that has been noted with this setup is that space cannot be reclaimed when you delete data from your tables, so in the long run the data file can grow to an awkward (and redundant) size.

However, there is the option to have each InnoDB table as a separate file. For this you must use the innodb_file_per_table option in your my.cnf file.
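In my.cnf (note the setting only affects tables created after it is enabled; existing tables stay in the shared tablespace until rebuilt):

    [mysqld]
    innodb_file_per_table = 1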

MyISAM tables have 3 files per table: table.FRM for the table format, table.MYD for data, and table.MYI for indexes. InnoDB also has the .frm file but stores both data and indexes in table.ibd. This setup has the slight advantage of having fewer files open to access your tables, which can become important, particularly for partitioned tables or setups with a low open_files_limit setting.

By using separate files per table, the opportunity arises to split your data up across different directories and disks. This can be done easily with MyISAM tables by specifying a DATA DIRECTORY and INDEX DIRECTORY in your CREATE TABLE syntax, allowing you to spread your tables across directories and disks with ease. The problem is that InnoDB ignores these specifications…

I encountered this problem whilst trying to take advantage of a 2xHDD and 2xSSD (solid state drive) setup, with the intention of putting regularly accessed tables on the SSDs to speed things up. InnoDB’s default behaviour seemed to prevent me from doing this, however there is a workaround. The solution is to create a separate database and create a symbolic link of the folder containing your new database, which is an elegant or messy solution depending on whether you have a new or an existing project.

Consider the following example where your default MySQL data directory is /var/lib/mysql and you have another folder/partition/disk you wish to use at /home/mysql.

Log in to MySQL and create the following database (just for testing):
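    CREATE DATABASE db1;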

Exit MySQL into the command line to create the symbolic link. The following commands are for Linux systems:
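    # Create the new database's directory on the other disk, hand it to the
    # mysql user, and link it into the data directory under the database name
    mkdir /home/mysql/db2
    chown mysql:mysql /home/mysql/db2
    ln -s /home/mysql/db2 /var/lib/mysql/db2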

…and back into MySQL
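Since MySQL (pre-8.0) treats any directory inside its data directory as a database, db2 should now simply appear:

    SHOW DATABASES;
    CREATE TABLE db2.test (id INT) ENGINE=InnoDB;  -- its .ibd lands on /home/mysql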

Now you should have two databases, and if you check /var/lib/mysql, the db1 folder should be there with its data, alongside a symbolic link to db2, which now resides on /home/mysql.

And that is how it is done! The drawback is that if you have existing scripts referencing your database, you’re going to have to update your queries to reference db2 for all the tables that now reside in your symlinked database. This is a hassle, but I’m sure some release of MySQL in the near future will honour the DATA DIRECTORY and INDEX DIRECTORY syntax that works so well for MyISAM tables.

A word of caution: to save having to rewrite queries to reference db2, you could of course use symbolic links for the individual table files of a pre-existing database you wish to have in a separate directory. ALTER TABLE queries will break this setup, so beware if you decide to go down that route.

All in all, it is a bit of an unusual workaround for what shouldn’t really be a problem, given the versatility of MySQL, but for the meantime it is a handy one.

When InnoDB is Slow in phpMyAdmin

You may have recently switched over from MyISAM tables to InnoDB, or in fact used InnoDB for a long time. phpMyAdmin has been a mainstay tool for quick viewing and editing of databases, but unfortunately it can seem to grind to a halt after you click to view a particular database.

This is mainly due to the way that SELECT COUNT(*) FROM table is calculated. MyISAM keeps that kind of metadata at hand, so it can return the value instantly, while InnoDB does not. phpMyAdmin doesn’t account for this major difference, which means that if you have big InnoDB tables, or a number of medium-sized ones, loading up a database’s details can take seconds, even minutes. It also sucks up resources while it laboriously calculates statistics about your tables.

If you are not too interested in the general stats of a database, and more interested in viewing and manipulating the tables, there is a small hack you can make to one of phpMyAdmin’s PHP files that will load tables up instantly, namely libraries/database_interface.lib.php.

At around line 290 there is a variable $sql declared (it is declared a number of times in the script, but we’re interested in this instance). Edit the $sql declaration along these lines (the exact line number and query vary by phpMyAdmin version):
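    // Swap the per-table statistics query for a cheap listing of table names.
    // PMA_backquote() is phpMyAdmin's identifier-quoting helper of that era.
    $sql = 'SHOW TABLES FROM ' . PMA_backquote($db) . ';';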

The small downside is that you can’t see row counts and some other general metadata, but it’s a small price to pay to continue using phpMyAdmin as a quick GUI reference to your database. I can verify that a 600M row database (10 tables) originally took about 6 seconds to load, but loads instantly after this fix. A smaller 150M row database (40 tables, lots of table partitions) would take up to 30 seconds to load and hang my browser… now only takes a second to spark up.

Credit goes to Richard Dale in the Source Forge forum, who created this workaround. I thought I would dedicate a post to this as the issue does not seem to be too prominent. With MySQL soon to use InnoDB as its default storage engine, no doubt this issue will come to the fore, and phpMyAdmin will implement a more permanent fix.

Creating CSS Sprites with PHP

When you have a number of small images that appear on a page, or across a number of pages that you know one particular client is going to visit, it makes sense to use CSS sprites to speed up the rendering of your web pages.

This has two advantages, namely:

  • It reduces the number of HTTP requests a client has to make to download the images on your page
  • It speeds up the rendering of the images and the web-page, due to limits on the number of HTTP connections between a server and a client.

The slight downside is that you may make a CSS sprite that contains images that don’t appear on a particular page, and therefore the client doesn’t need. Some consideration has to be put into these situations to weigh the pros and cons of using a sprite.

A fairly comprehensive analysis of using sprites can be found at css-sprites.com.

Using PHP to Create Sprites

It’s very easy to create a sprite using PHP, which can generate a single image from a collection of images, and also generate the CSS for you. This is a huge timesaver if you intend to make a number of sprites.

The following simple class will perform these steps upon instantiation:
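A sketch of the class, using the GD extension and stacking the images in a single vertical column (the class and output names are illustrative):

    <?php
    // Build $output.png and $output.css from all $x by $y PNGs in $folder
    class css_sprite {
        public function __construct($folder, $output, $x, $y) {
            // Gather the PNGs in $folder that match the requested dimensions
            $files = array();
            foreach (glob(rtrim($folder, '/') . '/*.png') as $file) {
                list($w, $h) = getimagesize($file);
                if ($w == $x && $h == $y) { $files[] = $file; }
            }
            if (!$files) { return; }

            // Sprite height depends on how many images will be put into it
            $sprite = imagecreatetruecolor($x, $y * count($files));
            imagealphablending($sprite, false);
            imagesavealpha($sprite, true);
            imagefill($sprite, 0, 0, imagecolorallocatealpha($sprite, 0, 0, 0, 127));

            // Append each image and log its position as a CSS class;
            // a simple counter differentiates the class names
            $css = '';
            foreach ($files as $i => $file) {
                $img = imagecreatefrompng($file);
                imagecopy($sprite, $img, 0, $i * $y, 0, 0, $x, $y);
                imagedestroy($img);
                $css .= ".sprite$i { background: url($output.png) 0 -" . ($i * $y)
                      . "px no-repeat; width: {$x}px; height: {$y}px; }\n";
            }
            imagepng($sprite, "$output.png");
            imagedestroy($sprite);
            file_put_contents("$output.css", $css);
        }
    }

    // Build a sprite and stylesheet from all 16x16 PNGs in ./icons
    new css_sprite('icons', 'sprite', 16, 16);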

A summary of what’s going on…

  • Accept 4 arguments to perform the sprite and CSS creation:
    1. $folder, the folder to read images from
    2. $output, the filename given to the output, $output.css and $output.png
    3. $x,$y, the dimensions of the images you want to consider, all other images are ignored. If you wish to have images of variable size you will also want to do some mathematical optimization, to fit the images into the smallest sprite dimension possible.
  • The $folder you provide will then be scanned for matching files
  • The sprite image is then created, with a size according to the number of images that will be put into it. A CSS file is also created.
  • For each image in the folder, the image is appended to the sprite image in its relevant position, and the position is logged in the CSS file. A simple counter is used to differentiate the classes declared in the CSS file.

A Simple PHP .htpasswd Manager

Sometimes simplicity is also convenient, and in the case of authentication on the most popular web server, Apache, .htpasswd fits the bill.

You may opt for a PHP/MySQL login which in itself is simple enough, but requires the availability of MySQL. If you are on a cheap shared host, MySQL may not be available or is available for a fee.

About .htpasswd

.htpasswd can allow you to authenticate users and restrict access to particular areas of your site. Usernames and passwords are stored in a plain text file with the passwords encrypted, while the default Apache configuration ‘hides’ files beginning with .ht from direct viewing. To enable .htpasswd, you must declare in .htaccess (or apache2.conf) that a particular directory requires authentication, and indicate what file contains your username and password combinations.

An example .htaccess file indicating that /adminarea/ requires authentication, that should reside in the same folder you require authentication for. Note that the path you reference to the .htpasswd file must be an absolute path:
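    AuthType Basic
    AuthName "Admin Area"
    AuthUserFile /var/www/adminarea/.htpasswd
    Require valid-user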

And an example .htpasswd file, residing at /var/www/adminarea/.htpasswd for the purposes of this demonstration. It holds one username:encrypted-password pair per line (the hashes below are illustrative):
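    admin:$apr1$v2lUoZ9h$Gt1XkH7pTJx0aFyQxQJfV.
    bob:$apr1$Qf3kZ0aW$o9B0cW3kEjZkXyPzL1mRt/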

Managing .htpasswd

Because some shared hosts restrict your available tools and access, it is sometimes helpful to have a PHP script that can manage your .htpasswd file when you are unable to use the command line. If SSH is disabled or command line execution is forbidden, as is often the case on shared hosts, your only other current option is to manage htpasswd via an admin area such as CPanel, which can be slow and inefficient. With this in hand, you may find the following script of interest.

This class will add users, delete users, check whether a user exists, and update existing users’ passwords…
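Here is a sketch of the class; it uses crypt() with a two-character salt, a format Apache’s basic authentication understands:

    <?php
    // A sketch of the manager; the interface matches the summary further down
    class htpasswd {
        private $handle;
        private $path;

        public function __construct($path) {
            $this->path   = $path;
            $this->handle = fopen($path, 'r+');  // every method reads/writes this
        }

        public function user_exists($username) {
            rewind($this->handle);
            while (($line = fgets($this->handle)) !== false) {
                list($user) = explode(':', trim($line));
                if ($user === $username) { return true; }
            }
            return false;
        }

        public function user_add($username, $password) {
            if ($this->user_exists($username)) { return false; }
            $salt = substr(md5(mt_rand()), 0, 2);
            fseek($this->handle, 0, SEEK_END);
            fwrite($this->handle, $username . ':' . crypt($password, $salt) . "\n");
            return true;
        }

        public function user_delete($username) {
            // Rebuild the file contents minus the deleted user's line
            rewind($this->handle);
            $kept = '';
            while (($line = fgets($this->handle)) !== false) {
                list($user) = explode(':', trim($line));
                if ($user !== $username) { $kept .= $line; }
            }
            file_put_contents($this->path, $kept);
            return true;
        }

        public function user_update($username, $password) {
            if (!$this->user_exists($username)) { return false; }
            $this->user_delete($username);
            return $this->user_add($username, $password);
        }
    }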

This gives you some basic examples to work with:
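    $htpasswd = new htpasswd('/var/www/adminarea/.htpasswd');
    $htpasswd->user_add('alice', 'hunter2');
    var_dump($htpasswd->user_exists('alice')); // bool(true)
    $htpasswd->user_update('alice', 'correcthorse');
    $htpasswd->user_delete('alice');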

Summary of .htpasswd Class

Before running, ensure you have your .htaccess and .htpasswd files created already, .htpasswd can be left blank.

$htpasswd->htpasswd('/var/www/adminarea/.htpasswd'); – Upon initiating the class, a file pointer is created, as all the functions require reading from and writing to the file

$htpasswd->user_exists($username) – Accepts the $username variable and reads .htpasswd line by line until a username within it matches $username. Returns false if $username is not matched.

$htpasswd->user_add($username,$password) – Accepts $username and $password and checks whether the user already exists. If not, $username and $password are written to a new line at the end of the .htpasswd file.

$htpasswd->user_delete($username) – Deletes $username from the .htpasswd file. The file is iterated through line by line and a string is built containing all the details minus $username’s credentials. The resulting string is then written to .htpasswd, effectively deleting $username’s account.

$htpasswd->user_update($username,$password) – Updates $password for $username. The file is read line by line until $username is matched, after which the password is updated. The function will return false if the username is not found.

Some Simple PHP Password Generation Functions

Passwords may well become a thing of the past in the not-so-distant future, as processing power doubles year on year and makes the cracking of passwords in a brute force manner almost trivial.

This calculator of password complexity against brute force attacks illustrates how easy it is to compromise a password in an offline environment.

Fortunately for us web developers, things take an awful lot longer in an online scenario, due to the fact that network latency and distance limit the number of password-guessing attempts that can occur in a reasonable timeframe. Various other layers of protection can be added too: blacklisting IPs, locking accounts after a number of failed login attempts, and generally paying attention to nefarious requests.

There have been numerous instances where a public-facing vulnerability has resulted in password crackers gaining access to an entire database of usernames and passwords. The worst cases involve passwords that are not one-way encrypted at all: no computation is necessary to exploit them, and unfortunately they can expose users who use the same passwords across different sites.

Not-so-bad cases involve one-way encryption of passwords using what are considered to be weaker algorithms, like MD5, without salting.

In summary, when storing passwords you need to use salting, preferably on a per-password basis, and an algorithm that is considered secure. Have a look at this article (and comments) for a more involved look at attack scenarios and how to prevent them. From the user’s standpoint, using any kind of dictionary word or concatenation of words is asking for trouble, particularly if the database that stores them does not salt passwords.

However…

If you are generating passwords for your users, you will want to have a level of complexity that won’t be subject to brute force attacks. The following code is a simple PHP password generator that will include alphanumeric characters and optionally other printable characters (you can add more if you like). This is good for instances where you need to generate a random password for a user.
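A sketch of such a generator (random_int() needs PHP 7; swap in mt_rand() on older versions):

    <?php
    // Generate a random password from alphanumerics, plus optional symbols
    function generate_password($length = 12, $extra_chars = false) {
        $chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
        if ($extra_chars) {
            $chars .= '!@#$%^&*()-_=+[]{};:,.<>?';  // add more if you like
        }
        $password = '';
        $max = strlen($chars) - 1;
        for ($i = 0; $i < $length; $i++) {
            $password .= $chars[random_int(0, $max)];
        }
        return $password;
    }

    echo generate_password(16, true);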

This fairly rudimentary function generates passwords of reasonable complexity, depending on length.

For my own personal passwords, it is a pain to be signed up to so many websites and have to memorise or maintain a list. Software like KeePass has sprung up to save you the hassle of remembering the long and complex passwords that are desired to maintain a good degree of confidence that your account(s) are secure.

The following function is similar in notion, passwords do not have to be remembered here, simply the domain name that you use them on:
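One way to do it; the exact scheme here (SHA-256 of a master passphrase plus the domain, base64-trimmed to 16 characters) is an assumption to tweak:

    <?php
    // Derive a deterministic per-site password from a master secret and domain
    function site_password($master, $domain, $length = 16) {
        $hash = hash('sha256', $master . '|' . $domain, true);
        // Trim the base64 form to printable password characters
        return substr(rtrim(base64_encode($hash), '='), 0, $length);
    }

    echo site_password('my master passphrase', 'example.com');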

There is a caveat. Due to some temporal lapse in concentration of some web developers, some password fields insist on restricting which characters may appear in a password, or insist on certain groups of characters being present. This is fine for preventing people entering overly simple passwords, but can cause issues with my implementation above. For those occasions you may want to use just part of the hex representation of the hash and calculate that some of the characters should become uppercase.

Storing Websites in Memory Using PHP

The proliferation of content management systems has allowed many more people to get a site online, which is a great thing. These content management systems tend to be quite abstract and “one size fits all”, so they often suffer from code bloat (and from the security aspect, popular software is always a bigger prize target for hackers…). The bare bones of the CMSes themselves are so abstract, with tiny functions and hooks, that serving a web request becomes a complicated matter. That’s not to say that these content management systems are slow, though they are certainly more resource intensive than serving static files.

For files in general (be it an image file, PHP script or static HTML file), operating systems are good at caching regularly accessed files from disk. The popular content management systems also tend to keep a cache in memory of the most regularly accessed files, to save reading them from a much slower disk; memcached is an oft-mentioned service used for this. Some applications, like MySQL’s InnoDB engine, take care of their own file and memory caching, while MyISAM defers file caching decisions to the operating system. A file based content management system will typically be very quick when all the regularly used files it accesses are in the disk cache.

For all other requests out of the disk cache, disk seeking is required, which is many, many times slower than using a cache or memory. See this short conversation about disk seeks and why they are the bottleneck in today’s computer world. Apparently, ‘disks are the new tape’… though SSDs are a very nice intermediate solution.

With that in hand, how about making a site that is fast and simple, and runs at (or very close to) your hardware limits? Loading a page of static content should be very quick, regardless of what content management system is used to generate the content. The following code is an example of storing a small website in memory, rather than on disk. If we wanted more speed, then we’d likely want to code this in something like C, removing the need for PHP and Apache. PHP has a range of semaphore and shared memory functions that allow you to store data in memory, persistent between web requests. Shared memory also allows you to share memory across applications, so your Java, Perl, C or whatever other language is able to access the same shared memory segment. So how can shared memory be used?
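The core System V primitives look like this:

    <?php
    // Attach a segment, store a value, read it back — even from another process
    $key = ftok(__FILE__, 'a');            // derive an IPC key from this file
    $shm = shm_attach($key, 1024 * 1024);  // 1 MB segment
    shm_put_var($shm, 1, 'hello from shared memory');
    echo shm_get_var($shm, 1);
    shm_detach($shm);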

Storing HTML Templates in Memory

This example takes a 12 Megabyte Bootstrap template and compresses it into 3.3 Megabytes of shared memory. It assumes a reasonable knowledge of PHP in order to tweak it to your liking.

1. Download the template and extract the contents of the file into a web accessible folder. For the purposes of this example the folder is called test and resides in /var/www/test/, which is accessible in the browser via http://localhost/test

2. Create an .htaccess file in /var/www with the following contents. If you are not using Apache, then use the URL rewriting engine available on your preferred web server.
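    RewriteEngine On
    # Hand everything under /test (bar the script itself) to server.php;
    # the originally requested path arrives in $_SERVER['REDIRECT_URL']
    RewriteCond %{REQUEST_URI} !^/test/server\.php
    RewriteRule ^test/ test/server.php [L]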

3. Save this PHP file as server.php in the test folder
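What follows is a condensed sketch of the script. The content types, segment key and size are illustrative, files are assumed to sit flat in the test folder to keep it short, and PHP 7+ is assumed:

    <?php
    // server.php — hold the site's files in shared memory, gzip-compressed,
    // keyed by inode number, and serve them on request
    class shm {
        const KEY   = 0x5348544d;  // arbitrary fixed IPC key
        const SIZE  = 4194304;     // 4 MB segment
        const TYPES = ['html' => 'text/html', 'css' => 'text/css',
                       'js' => 'application/javascript', 'png' => 'image/png',
                       'jpg' => 'image/jpeg', 'gif' => 'image/gif'];

        public static function create() {
            // Compress every matching file in this folder into the segment
            $seg = shm_attach(self::KEY, self::SIZE);
            foreach (glob(__DIR__ . '/*.*') as $file) {
                $ext = strtolower(pathinfo($file, PATHINFO_EXTENSION));
                if (isset(self::TYPES[$ext])) {
                    shm_put_var($seg, fileinode($file),
                                gzencode(file_get_contents($file), 9));
                }
            }
            shm_detach($seg);
        }

        public static function destroy() {
            shm_remove(shm_attach(self::KEY, self::SIZE));
        }

        public static function get() {
            $url = $_SERVER['REDIRECT_URL'] ?? '';
            if (strpos($url, '..') !== false) { exit; }  // no directory traversal
            $path = __DIR__ . '/' . basename($url);
            $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
            $inode = @fileinode($path);                  // 1 disk seek at most
            $seg = shm_attach(self::KEY, self::SIZE);
            if ($inode === false || !isset(self::TYPES[$ext])
                    || !shm_has_var($seg, $inode)) {
                header('HTTP/1.1 404 Not Found');
                exit;
            }
            $gz = shm_get_var($seg, $inode);
            header('Content-Type: ' . self::TYPES[$ext]);
            if (strpos($_SERVER['HTTP_ACCEPT_ENCODING'] ?? '', 'gzip') !== false) {
                header('Content-Encoding: gzip');
                echo $gz;            // serve the compressed copy directly
            } else {
                echo gzdecode($gz);  // client can't handle gzip; inflate first
            }
            shm_detach($seg);
        }
    }

    @shm::destroy(); shm::create();  // run once to build the segment (step 4),
                                     // then comment this line out
    shm::get();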

4. Run the script once, preferably from the command line, putting an exit(); after @shm::destroy();shm::create();. All the files in the folder test matching the content types we’re interested in will now be held in the shared memory segment, in a compressed format. Assuming that went well for you, comment out the @shm::destroy();shm::create(); line altogether, simply leaving the call to shm::get();. If you have problems, ensure that everything is located in the right place and that you are allowed at least 4 Megabytes of shared memory. 8 Megabytes is a fairly common default, so you should be OK there.

A Quick Rundown of How it Works

1. Apache receives a request to your test folder and sees that it should be internally rewritten to /test/server.php. This populates the variable $_SERVER['REDIRECT_URL'] with the originally requested URL.

2. A call is made to shm::get() from our script

3. $_SERVER['REDIRECT_URL'] has its directory component stripped. It is checked to ensure there is no directory traversal, which could lead to requests like http://localhost/test/../../../secrets.txt exposing sensitive data accessible to the web server.

4. fileinode() is called to get the inode number of the file requested. This touches the hard disk with 1 disk seek when it is not already cached, but it’s quite likely it will be cached. The shared memory segment is then opened and checked to ensure the contents of the file exist in our shared memory, otherwise a 404 response is returned to the client. Inodes were used as a simple way to convert a pathname to an integer, which shm_get_var() and shm_put_var() require as the unique identification of a variable. You’re perfectly able to use a quick hashing scheme like murmurhash() in order to get the integer you need, though you’d have to consider possible collisions (however unlikely). To ensure no collisions, run through the files an extra time and check that each hash is generated only once across all filenames.

5. The extension of the URL is examined to determine which content type to return.

6. The client’s request headers are evaluated to see whether they’ll accept HTTP compression; if so, they are served the compressed content directly. Otherwise, the contents of shared memory are uncompressed and served. Most clients can deal with compression, and storing the content compressed saves memory (3.3 Megabytes versus 12 Megabytes).

7. The content is served to the client with the appropriate Content-Type and Content-Encoding.

Some Possible Improvements

  • You may want the ability to dynamically add new files into the shared memory segment. Bear in mind security considerations: you do not want to allow just anyone to add their own content, so you will want some kind of authentication, or to separate the creation/editing operations of the segment from the selecting of values from it.
  • Consider an alternative to using fileinode(), and you can avoid touching the disk entirely. If you use a hashing method, you could use some of the shared memory as a linked list to deal with collisions.
  • The content types are listed in 3 separate places in this example; you may want to at least dynamically create the .htaccess file to reduce that to 2.