Storing Websites in Memory Using PHP

The proliferation of content management systems has allowed many more people to get a site online, which is a great thing. These content management systems tend to be quite abstract and “one size fits all”, so they often suffer from code bloat (and on the security front, popular software is always a bigger prize target for attackers…). The cores of these CMSs are so abstract, built from myriad tiny functions and hooks, that serving a web request becomes a complicated matter. That’s not to say that these content management systems are slow, though they are certainly more resource intensive than serving static files.

For files in general (be it an image, a PHP script or a static HTML file), operating systems are good at caching regularly accessed files from disk. The popular content management systems also tend to keep an in-memory cache of the most regularly accessed data, to save reading it from a much slower disk; Memcached is an oft-mentioned service used for this. Some applications, like MySQL’s InnoDB engine, take care of their own file and memory caching, while MyISAM defers file caching decisions to the operating system. A file-based content management system will typically be very quick when all the files it regularly accesses are in the disk cache.

For requests that fall outside the disk cache, disk seeking is required, which is many, many times slower than reading from a cache or memory. See this short conversation about disk seeks and why they are the bottleneck in today’s computers. Apparently, ‘disks are the new tape’… though SSDs are a very nice intermediate solution.

With that in mind, how about making a site that is simple and performs at (or very close to) your hardware limits? Loading a page of static content should be very quick, regardless of which content management system generated it. The following code is an example of storing a small website in memory rather than on disk. If we wanted more speed still, we’d likely code this in something like C, removing the need for PHP and Apache. PHP has a range of semaphore and shared memory functions that allow you to store data in memory persistently between web requests. Shared memory also lets you share memory across applications, so your Java, Perl, C or whatever other language is able to access the same shared memory segment. So how can shared memory be used?
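As a minimal sketch of the idea, PHP's System V shared memory functions can store a value that outlives the script that wrote it (the segment key and size here are arbitrary choices for illustration):

```php
<?php
// Minimal sketch of PHP's System V shared memory functions.
// The key (0xdead) and segment size are arbitrary choices for this example.
$segment = shm_attach(0xdead, 1024 * 1024); // attach to (or create) a 1 MB segment

// Store a value under an integer key; it persists after this script exits,
// so a later web request (or another process) can read it back.
shm_put_var($segment, 1, 'Hello from shared memory');

// Read it back -- in practice this would happen in a different process.
echo shm_get_var($segment, 1), "\n";

shm_remove($segment); // delete the segment when done (omit this to keep it persistent)
shm_detach($segment);
```

Requires the sysvshm extension, which is compiled into most Linux builds of PHP.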

Storing HTML Templates in Memory

This example takes a 12 Megabyte Bootstrap template and compresses it into 3.3 Megabytes of shared memory. It assumes a reasonable knowledge of PHP in order to tweak it to your liking.

1. Download the template and extract the contents of the file into a web accessible folder. For the purposes of this example the folder is called test and resides in /var/www/test/, which is accessible in the browser via http://localhost/test

2. Create an .htaccess file in /var/www with the following contents. If you are not using Apache, use the URL rewriting engine available on your preferred web server.
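A minimal .htaccess along these lines might look as follows; the exact rules are an assumption, written for Apache's mod_rewrite:

```apache
RewriteEngine On
# Send everything under /test (except server.php itself) to server.php.
# An internal rewrite preserves the originally requested path in REDIRECT_URL.
RewriteCond %{REQUEST_URI} !server\.php$
RewriteRule ^test/ /test/server.php [L]
```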

3. Save this PHP file as server.php in the test folder
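The following is an illustrative skeleton of such a server.php, reconstructed from the description in this post rather than the original code; the class and method names mirror the shm::create()/shm::destroy()/shm::get() calls referred to below, but the key, size, paths and content-type list are assumptions:

```php
<?php
// Illustrative skeleton: stores gzip-compressed files from the test folder
// in a System V shared memory segment, keyed by inode number.
class shm
{
    const KEY  = 0x53484d;   // arbitrary segment key
    const SIZE = 4194304;    // 4 Megabyte segment
    const DIR  = '/var/www/test/';

    private static $types = array(
        'html' => 'text/html', 'css' => 'text/css', 'js' => 'application/javascript',
        'png'  => 'image/png', 'jpg' => 'image/jpeg', 'gif' => 'image/gif',
    );

    // Walk the test folder and load matching files into shared memory.
    public static function create()
    {
        $seg = shm_attach(self::KEY, self::SIZE);
        $it  = new RecursiveIteratorIterator(new RecursiveDirectoryIterator(self::DIR));
        foreach ($it as $file) {
            $ext = strtolower(pathinfo($file, PATHINFO_EXTENSION));
            if (isset(self::$types[$ext])) {
                shm_put_var($seg, fileinode($file), gzencode(file_get_contents($file)));
            }
        }
        shm_detach($seg);
    }

    public static function destroy()
    {
        shm_remove(shm_attach(self::KEY, self::SIZE));
    }

    // Serve the originally requested file from shared memory.
    public static function get()
    {
        $url = rtrim($_SERVER['REDIRECT_URL'], '/');
        if (strpos($url, '..') !== false) {            // block directory traversal
            header('HTTP/1.1 403 Forbidden'); exit;
        }
        $inode = @fileinode($_SERVER['DOCUMENT_ROOT'] . $url);
        $seg   = shm_attach(self::KEY, self::SIZE);
        if ($inode === false || !shm_has_var($seg, $inode)) {
            header('HTTP/1.1 404 Not Found'); exit;
        }
        $ext = strtolower(pathinfo($url, PATHINFO_EXTENSION));
        header('Content-Type: ' . self::$types[$ext]);
        $body = shm_get_var($seg, $inode);
        if (isset($_SERVER['HTTP_ACCEPT_ENCODING'])
            && strpos($_SERVER['HTTP_ACCEPT_ENCODING'], 'gzip') !== false) {
            header('Content-Encoding: gzip');          // the client will decompress
            echo $body;
        } else {
            echo gzdecode($body);                      // decompress for old clients
        }
    }
}

@shm::destroy();
shm::create();   // comment these two lines out after the first run
shm::get();
```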

4. Run the script once, preferably from the command line, putting an exit(); after @shm::destroy();shm::create();. All the files in the test folder matching the content types we’re interested in will now be held in the shared memory segment, in a compressed format. Assuming that went well for you, comment out the @shm::destroy();shm::create(); line altogether, simply leaving the call to shm::get();. If you have problems, ensure that everything is located in the right place and that you are allowed at least 4 Megabytes of shared memory. 8 Megabytes is a fairly common default, so you should be OK there.

A Quick Rundown of How it Works

1. Apache receives a request to your test folder and sees that it should be internally rewritten to /test/server.php. This populates the variable $_SERVER['REDIRECT_URL'] with the originally requested URL.

2. A call is made to shm::get() from our script

3. $_SERVER['REDIRECT_URL'] has the trailing directory stripped. It is checked to ensure there is no directory traversal, which could lead to requests like http://localhost/test/../../../secrets.txt returning sensitive information accessible to the web server.

4. fileinode() is called to get the inode number of the requested file. This touches the hard disk with one disk seek when the inode is not already cached, but it is quite likely to be cached. The shared memory segment is then opened and checked to ensure the contents of the file exist in our shared memory; otherwise a 404 response is returned to the client. Inodes were used as a simple way to convert a pathname to an integer, which shm_get_var() and shm_put_var() require as the unique identification of a variable. You are perfectly able to use a quick hashing scheme like murmurhash() to get the integer you need, though you would have to consider possible collisions (however unlikely). To ensure there are none, run through the files an extra time and check that each hash is generated only once across all filenames.
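The hashing alternative can be sketched like this; crc32() stands in for murmurhash() here since it is built into PHP, and the function name is my own:

```php
<?php
// A hypothetical alternative to fileinode(): derive the integer key from the
// pathname itself with crc32(), and make one extra pass to detect collisions.
function build_keys(array $paths)
{
    $keys = array();
    foreach ($paths as $path) {
        $hash = crc32($path);           // pathname -> integer, no disk access
        if (isset($keys[$hash])) {
            // Collision: two paths map to the same integer. Handle it here,
            // e.g. by salting the path and rehashing, or chaining entries.
            throw new RuntimeException("Hash collision: $path vs {$keys[$hash]}");
        }
        $keys[$hash] = $path;
    }
    return $keys;                       // hash => path, usable with shm_put_var()
}

$keys = build_keys(array('/test/index.html', '/test/css/style.css'));
```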

5. The extension of the URL is examined to determine which content type to return.

6. The client’s request headers are evaluated to see whether it accepts HTTP compression; if so, it is served compressed content. Otherwise, the contents of shared memory are uncompressed before being served. Most clients can deal with compression, and storing the content compressed saves memory (3.3 Megabytes versus 12 Megabytes).

7. The content is served to the client with the appropriate Content-Type and Content-Encoding.

Some Possible Improvements

  • You may want the ability to dynamically add new files into the shared memory segment. Bear in mind the security considerations: you do not want to allow just anyone to add their own content, so you will want some kind of authentication, or to separate the creation/editing operations on the segment from the selecting of values from it.
  • Consider an alternative to using fileinode(), and you can avoid touching the disk entirely. If you use a hashing method, you could use some of the shared memory as a linked list to deal with collisions.
  • The content types are listed in 3 separate places in this example; you may want to at least dynamically create the .htaccess file to reduce that to 2.

A Simple PHP/XML Sitemap Generator

XML Sitemaps are a useful method for search engines to quickly pick up new content they would otherwise have to find via hyperlinks on the web. Sitemaps can also provide useful metadata about URLs, such as the last modification date, how often the content changes and the relative importance of a URL in comparison to the rest of the site.

The following code allows you to easily maintain an XML sitemap using PHP. You can:

  • Generate a sitemap from scratch
  • Add URLs to the sitemap
  • Edit a URL’s metadata in the sitemap
  • Delete URLs from the sitemap

The script is very simple and currently only takes into account a URL and its last modification date, with the latter being optional. This allows you to easily populate the sitemap with existing and new URLs and easily indicate when content has been updated. You do not have to worry about adding duplicate entries, as the DOM handling takes care of that.

If you have a large number of URLs, you will need to tweak the script to accommodate more than one sitemap, which would simply mean passing a unique filename into the invocation of the class instance.

Simply save the contents of the two scripts below and try it for yourself. The references to Sitemap.xml assume that you are in your domain’s root directory; otherwise, ensure your Sitemap.xml is saved in the root directory to avoid complications, unless it is referenced in a sitemap index file.
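An illustrative sketch of such a class, built on DOMDocument; the class and method names are my own assumptions, not the original post's code:

```php
<?php
// Sketch of a sitemap maintenance class using DOMDocument.
class Sitemap
{
    private $dom, $urlset, $file;

    public function __construct($file = 'Sitemap.xml')
    {
        $this->file = $file;
        $this->dom  = new DOMDocument('1.0', 'UTF-8');
        $this->dom->formatOutput = true;
        if (is_file($file)) {
            $this->dom->load($file);   // continue maintaining an existing sitemap
            $this->urlset = $this->dom->getElementsByTagName('urlset')->item(0);
        } else {
            $this->urlset = $this->dom->createElement('urlset');
            $this->urlset->setAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
            $this->dom->appendChild($this->urlset);
        }
    }

    // Add a URL, or replace its entry if it already exists (no duplicates).
    public function add($loc, $lastmod = null)
    {
        $this->delete($loc);
        $url = $this->dom->createElement('url');
        $url->appendChild($this->dom->createElement('loc', $loc));
        if ($lastmod !== null) {
            $url->appendChild($this->dom->createElement('lastmod', $lastmod));
        }
        $this->urlset->appendChild($url);
    }

    public function delete($loc)
    {
        foreach ($this->urlset->getElementsByTagName('loc') as $node) {
            if ($node->nodeValue === $loc) {
                $this->urlset->removeChild($node->parentNode);
                return;
            }
        }
    }

    public function save()
    {
        $this->dom->save($this->file);
    }
}
```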

Example code to populate a sitemap
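A sketch of populating the sitemap, assuming a Sitemap class with add() and save() methods as described in this post (the class and method names are assumptions):

```php
<?php
// Populating a sitemap: new URLs are added, the lastmod date is optional.
$sitemap = new Sitemap('Sitemap.xml');
$sitemap->add('http://example.com/');
$sitemap->add('http://example.com/about', date('Y-m-d'));
$sitemap->save();
```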

Example on how to update the last modified value
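With no duplicates allowed, updating the last-modified value can be a matter of re-adding the same URL (again assuming the hypothetical add()/save() methods):

```php
<?php
// Re-adding an existing URL replaces its entry with the new lastmod date.
$sitemap = new Sitemap('Sitemap.xml');
$sitemap->add('http://example.com/about', date('Y-m-d'));
$sitemap->save();
```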

How to Create Website Thumbnails with PHP and Firefox

Being able to create thumbnail images of websites can be particularly useful to website owners and visitors alike, and thumbnails are generally aesthetically pleasing on a page.

There are a number of ways you can create thumbnails, some better than others, while some lack fairly essential features like the ability to render Flash before the thumbnail image is generated. Without that particular feature, thumbnails of Flash websites appear as blank pages.

Here are two options that are available to you and can be adjusted accordingly, one exclusively for Linux users:

PHP / Firefox

If you are a Firefox user, I recommend the Pearl Crescent Page Saver plugin for Firefox. Free and paid versions are available, with the latter offering slightly more features should you need them.

The basic requirements of the script are:

  • PHP: Or any other scripting language that can iterate through the list of URLs you would like to make thumbnails of
  • Firefox: The browser that is used to render webpages you want to make thumbnails of
  • Page Saver Plugin: The plugin that interacts with Firefox to generate a thumbnail
  • ImageMagick: Not essential, but is very handy for post-processing of images, i.e. resizing.
  • Access to the command line

If you do not have ImageMagick installed, you will want to remove the last 2 lines of code, as they resize the image.

Note that you will want to close all your browser windows while testing out this script, or at least create a separate Firefox profile. Ideally you would run a separate Firefox profile on a separate display/screen.

As an aside, I enjoy using the Firefox browser extension MozRepl… and there is a nice GitHub repository on how to use the Firefox internals (JavaScript, essentially) to create screenshots. This removes the need to install the Pearl Crescent extension, though that particular extension has some nice customisation built in.

Bash / Konqueror

I am using Ubuntu; your flavour of Linux may require different commands. The following packages/software are required for the bash script to run correctly:

  • Xvfb: provides a virtual display for the browser to render to
  • Konqueror: the browser used to render the pages
  • ImageMagick: used to capture and resize the thumbnails

Save the following as thumbnails.sh
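The original script accompanied this post; the following is an illustrative reconstruction. The paths, the crude sleep-based wait and the capture method (ImageMagick's import) are all assumptions:

```shell
#!/bin/bash
# Illustrative reconstruction of thumbnails.sh, not the original script.
# Reads "URLID URL" pairs from a list file, renders each page in Konqueror
# on a virtual display, and captures it with ImageMagick's `import`.

LIST="${1:-thumbnails.txt}"

Xvfb :2 -screen 0 1024x768x24 &   # mock display; nothing appears on screen
XVFB_PID=$!
export DISPLAY=:2

while read -r URLID URL; do
    konqueror "$URL" &            # load the page on the virtual display
    KONQ_PID=$!
    sleep 10                      # crude wait for the page (and Flash) to render
    import -window root "/var/www/thumbs/$URLID.png"
    convert "/var/www/thumbs/$URLID.png" -resize 200x150 "/var/www/thumbs/$URLID.png"
    kill $KONQ_PID
done < "$LIST"

kill $XVFB_PID
```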

Save the following as thumbnails.txt and also create a directory for the thumbnails to reside in, for example /var/www/thumbs/
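The expected format is an ID followed by a URL on each line, an assumption based on the $URLID and $URL variables mentioned below:

```
1 http://example.com/
2 http://example.org/
3 http://www.kernel.org/
```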

The following command will then iterate through the list in thumbnails.txt
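Assuming a thumbnails.sh that takes the list file as an argument, the invocation might simply be:

```shell
bash thumbnails.sh thumbnails.txt
```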

Some notes regarding the latter script and both methods in general

  • I use PHP to generate the thumbnails.txt input files; the bash script iterates through each line, accesses the $URL and saves the thumbnail with the filename $URLID from that line
  • Xvfb initiates a mock display that Konqueror uses to render web pages. You don’t actually need to see the browser working through the list
  • You may want to have a default page and load that up before calling each URL, so that a ‘default thumb’ can be used when a webpage is very slow to load.
  • You have to shut the browser down before running these scripts, otherwise the browser invocation will complain that it is already running. This is why I prefer the Konqueror bash script, as I use Firefox to browse. Alternatively, you can set up separate Firefox profiles that won’t cause the program to grumble when you invoke a new window.

There are lots of solutions for acquiring screenshots; these are just a couple to give you an idea of how it is done.

Using PHP DOM Functions to Parse HTML and Find Links

When developing websites, there are a million and one reasons you will find yourself needing to parse some HTML for snippets of information. On the face of it, a simple regular expression will do the trick most of the time, particularly when you are in control of the HTML you are fetching.

When parsing other people’s HTML, you soon find that the tag soup that makes up the World Wide Web produces situations and code fragments your regular expression was never built to accommodate, resulting in false positives, false negatives… and generally the unexpected.

PHP’s DOM functions are made specifically for parsing XML and (X)HTML. So, when you need to parse some SGML-family language, turn to these functions and stay away from regular expressions; the comprehensive DOM library will add, edit and delete any attribute, tag or HTML within tags with its suite of functions.

The following example shows how easy it is to collect hyperlinks from a page or file without the problems of broken HTML, attributes with missing or no quotes, or any other hassle that may impede the collection of links:
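A minimal sketch of the technique, fed an intentionally broken snippet of HTML:

```php
<?php
// Collect hyperlinks from (possibly broken) HTML using PHP's DOM functions.
$html = '<html><body>
    <a href=http://example.com/one>Unquoted attribute</a>
    <p><a href="/two">Unclosed paragraph
    <a href="three.html">Link</a></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);          // @ silences warnings about the broken markup

$links = array();
foreach ($dom->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}
print_r($links);                 // all three hrefs, despite the tag soup
```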

Broken Link Checker Using PHP and cURL

Whether operating a commercial site, a directory, or a personal site, it is important to ensure you do not have ‘dead’ links on your website. Broken links (links that point to inactive domains or 404 pages) are of little use to your site visitors and may jeopardise any good search engine rankings you have, as it can be inferred that a site carrying broken links is not well maintained.

To remedy any potential problem, a script that periodically checks the links on your pages means you can quickly alter or remove links that are no longer active or useful.

The following script will do this task for you, using PHP and cURL, with a simple HTML parser to find links on a page. Simply enter a URL into the form, and the results will appear in an iframe on the same page.
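The core of the check can be sketched as follows (a reconstruction, not the original script; the form and iframe handling are omitted, and the function names are my own):

```php
<?php
// Sketch of a broken-link check: fetch each URL with cURL and flag anything
// that fails to resolve or returns an error status.
function is_broken($code)
{
    // 0 means cURL could not connect at all (dead domain, timeout, etc.)
    return $code === 0 || $code >= 400;
}

function check_links(array $urls)
{
    $results = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_NOBODY         => true,   // HEAD request: we only need the status
            CURLOPT_FOLLOWLOCATION => true,   // judge the final target, not redirects
            CURLOPT_TIMEOUT        => 10,
            CURLOPT_RETURNTRANSFER => true,
        ));
        curl_exec($ch);
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
        $results[$url] = is_broken($code) ? "BROKEN ($code)" : "OK ($code)";
    }
    return $results;
}
```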

Simple PHP & MySQL Pagination

When looking at MySQL output, it is sometimes more convenient to split the records returned into separate pages and include hyperlinks to further pages in the result set, a layout often referred to as pagination.

The following is an example of such pagination. Change the MySQL query in the example at the foot of the code to see it working for yourself, remembering to connect to your MySQL database beforehand. This code is designed for simplicity rather than considering the finer details of pagination (mentioned below).

First off, create a test table if you wish to test the code:
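An illustrative equivalent of the original SQL (the table and column names are assumptions):

```sql
CREATE TABLE test_pagination (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(64)  NOT NULL
);
```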

Add some test data
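For example (again assuming the hypothetical test_pagination table):

```sql
INSERT INTO test_pagination (name) VALUES
    ('alpha'), ('bravo'), ('charlie'), ('delta'), ('echo'),
    ('foxtrot'), ('golf'), ('hotel'), ('india'), ('juliet'),
    ('kilo'), ('lima'), ('mike'), ('november'), ('oscar');
```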

This is the simple PHP class to illustrate basic pagination:
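The following is an illustrative reconstruction of such a class, not the original code; the SQL usage at the bottom assumes mysqli and a connection made beforehand:

```php
<?php
// Sketch of a simple pagination class: builds a LIMIT clause for the current
// page and hyperlinks to every page in the result set.
class Pagination
{
    private $perPage;

    public function __construct($perPage = 10)
    {
        $this->perPage = $perPage;
    }

    // LIMIT clause for the current page (pages are 1-based).
    public function limitClause($page)
    {
        $offset = (max(1, (int)$page) - 1) * $this->perPage;
        return "LIMIT $offset, {$this->perPage}";
    }

    // Hyperlinks to every page, given the total row count.
    public function links($totalRows, $currentPage, $baseUrl = '?page=')
    {
        $pages = max(1, (int)ceil($totalRows / $this->perPage));
        $html  = '';
        for ($p = 1; $p <= $pages; $p++) {
            $html .= ($p === (int)$currentPage)
                ? "<strong>$p</strong> "            // current page, no link
                : "<a href=\"$baseUrl$p\">$p</a> ";
        }
        return trim($html);
    }
}

// Usage sketch, assuming an open mysqli connection in $db:
// $pager = new Pagination(10);
// $page  = isset($_GET['page']) ? (int)$_GET['page'] : 1;
// $rows  = $db->query('SELECT * FROM test_pagination ' . $pager->limitClause($page));
// $total = $db->query('SELECT COUNT(*) AS c FROM test_pagination')->fetch_assoc();
// echo $pager->links($total['c'], $page);
```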

Produces something like…

In most cases and in particular for small tables, this method of pagination is fine as it’s a relatively inexpensive computation and allows you to jump to any page you like.

For larger tables you will find that an alternative method is preferred, this post goes into detail why. The post is useful in understanding the general concepts regarding performance and pagination from the MySQL point of view.

Forking with PHP from the command line

Forking new processes is an extremely handy technique in programming that allows you to run tasks in parallel to one another, from a single invocation of a program.

You may be interested in forking if:

  • You have a multi-processor/threaded CPU and want to utilise it more effectively
  • You want something to run in the background while your main thread of execution continues
  • You have a set of tasks that take an appreciable time to complete, but do not rely on one another’s results

As ever, an introduction to the concept is available in the PHP manual.

It is worth noting early on that forking is slightly different to threading, which is described in more detail in this StackOverflow question. Historically, threading has not been available in PHP, though there have been developments in remedying that.

One popular example usage is HTTP fetching. Fetching is a relatively slow process because of all the latency involved in talking to servers across the world. If you have a queue of 1000 URLs to fetch and each URL takes 3 seconds to fetch, it will take 3000 seconds to fetch all the URLs. Slow or unresponsive servers mean that your average is higher, and that URLs later in the queue have to wait for all the slower URLs in front of them to be fetched.

With forking (or threading), you can split the workload between instances of the script. In the URL fetching example for instance, you could create 10 forks of the fetching script that will fetch 100 URLs each. This should dramatically speed up the time it takes to fetch all the URLs, because if one particular URL is slow, your 9 other forked scripts will still be fetching the URLs in their queue.

I have provided skeleton code below to give you an idea of how it can work for you.
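The following is an illustrative sketch of such a skeleton using pcntl_fork(), which requires the CLI-only pcntl extension; the URL list and worker count are placeholders:

```php
<?php
// Skeleton of forked URL fetching from the command line.
$urls    = array('http://example.com/a', 'http://example.com/b'); // placeholder job list
$workers = 4;

for ($i = 0; $i < $workers; $i++) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    }
    if ($pid === 0) {
        // Child process: take every Nth job starting at our own offset,
        // so no two children ever handle the same URL.
        for ($j = $i; $j < count($urls); $j += $workers) {
            // fetch $urls[$j] here, e.g. with cURL or file_get_contents()
        }
        exit(0); // a child must exit, or it would continue the parent's loop
    }
}

// Parent: wait for every child to finish before exiting.
while (pcntl_waitpid(-1, $status) > 0);
```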

One important thing to consider when forking scripts is to avoid the nastiness of a fork bomb or the unpredictability of a race condition. Bear these concepts in mind as you delve into the usefulness of multi-tasking with forks or threads.

Workarounds for this problem are quite easy. With a text file as input, for instance, you would want each script instance to grab every 10th line, so the 1st fork would grab the 1st line, the 11th line, the 21st line and so on. Alternatively, you can have one fork that “serves” lines to the other forks, so that each line is only issued once. If you are using a database as input and it has an auto-increment field, simply use a modulus of the auto-increment value as a quick’n’dirty way to delegate an equal number of rows to each fork. Essentially, you are looking to keep each fork busy while avoiding allocating the same job twice.
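The modulus delegation can be sketched as follows (the function name is my own):

```php
<?php
// Delegating rows to forks via a modulus of an auto-increment id: fork $i of
// $n only handles rows where id % $n == $i, so no row is handled twice.
function rows_for_fork(array $ids, $forkIndex, $forkCount)
{
    return array_values(array_filter($ids, function ($id) use ($forkIndex, $forkCount) {
        return $id % $forkCount === $forkIndex;
    }));
}

print_r(rows_for_fork(array(1, 2, 3, 4, 5, 6), 0, 3)); // ids 3 and 6
```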

HTTP Fetching in PHP Without cURL

On some shared hosting accounts, the cURL, fopen or file_get_contents functions may be disabled ‘for security reasons’, yet you can still achieve HTTP fetching using the socket functions.

This simple class of code will allow you to do a wide range of HTTP fetching, and should be easy enough to customise should you feel the need. Check out the 3 examples provided at the foot of the code to try it out. You should be able to use this code with plain old PHP, with no extra extensions.

You may find httpbin.org useful in testing your network requests.

Here are some simple examples to get started with…
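As an illustrative sketch of the socket approach (a minimal GET built on fsockopen(); the original class supported a wider range of requests, and the function name is my own):

```php
<?php
// HTTP fetching with fsockopen() alone -- no cURL, fopen wrappers or
// file_get_contents() needed.
function http_get($host, $path = '/', $port = 80, $timeout = 10)
{
    $fp = fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($fp === false) {
        return false;
    }
    $request = "GET $path HTTP/1.0\r\n"     // HTTP/1.0 avoids chunked encoding
             . "Host: $host\r\n"
             . "Connection: close\r\n\r\n";
    fwrite($fp, $request);

    $response = '';
    while (!feof($fp)) {
        $response .= fread($fp, 8192);      // read until the server closes
    }
    fclose($fp);

    // Split headers from body at the first blank line.
    list($headers, $body) = explode("\r\n\r\n", $response, 2);
    return array('headers' => $headers, 'body' => $body);
}

// Example (httpbin.org is handy here, as mentioned above):
// $result = http_get('httpbin.org', '/get');
// echo $result['headers'];
```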