A Simple PHP/XML Sitemap Generator

XML sitemaps help search engines quickly pick up new content that they would otherwise have to discover by following hyperlinks on the web. Sitemaps can also provide useful metadata about each URL, such as its last modification date, how often the content changes and the relative importance of the URL compared with the rest of the site.

The following code allows you to easily maintain an XML sitemap using PHP. You can:

  • Generate a sitemap from scratch
  • Add URLs to the sitemap
  • Edit a URL’s metadata in the sitemap
  • Delete URLs from the sitemap

The script is very simple and currently only takes into account a URL and its last modification date, with the latter being optional. This allows you to easily populate the sitemap with existing and new URLs and to indicate when content has been updated. You do not have to worry about adding duplicate URLs, as the DOM handling takes care of that.
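As a rough illustration of the approach described above, here is a minimal sketch built on PHP’s DOMDocument. The class name SimpleSitemap and its method names are hypothetical and chosen for this example only. It assumes the standard sitemaps.org namespace and treats the lastmod value as a plain date string.

```php
<?php
// Hypothetical sketch: SimpleSitemap and its methods are illustrative names.
class SimpleSitemap
{
    const NS = 'http://www.sitemaps.org/schemas/sitemap/0.9';

    private $file;
    private $doc;

    public function __construct($file = 'Sitemap.xml')
    {
        $this->file = $file;
        $this->doc = new DOMDocument('1.0', 'UTF-8');
        $this->doc->preserveWhiteSpace = false;
        $this->doc->formatOutput = true;
        if (is_file($file) && filesize($file) > 0) {
            $this->doc->load($file); // reuse the existing sitemap
        } else {
            $this->doc->appendChild($this->doc->createElementNS(self::NS, 'urlset'));
        }
    }

    // Return the <url> element whose <loc> equals $loc, or null if absent.
    private function find($loc)
    {
        foreach ($this->doc->getElementsByTagName('url') as $url) {
            $locNode = $url->getElementsByTagName('loc')->item(0);
            if ($locNode !== null && $locNode->textContent === $loc) {
                return $url;
            }
        }
        return null;
    }

    // Add a URL, or update its lastmod if the URL is already listed.
    public function set($loc, $lastmod = null)
    {
        $url = $this->find($loc);
        if ($url === null) {
            $url = $this->doc->createElementNS(self::NS, 'url');
            $locNode = $this->doc->createElementNS(self::NS, 'loc');
            $locNode->appendChild($this->doc->createTextNode($loc));
            $url->appendChild($locNode);
            $this->doc->documentElement->appendChild($url);
        }
        if ($lastmod !== null) {
            $old = $url->getElementsByTagName('lastmod')->item(0);
            $new = $this->doc->createElementNS(self::NS, 'lastmod', $lastmod);
            if ($old !== null) {
                $url->replaceChild($new, $old);
            } else {
                $url->appendChild($new);
            }
        }
    }

    // Remove a URL; returns true if it was present.
    public function delete($loc)
    {
        $url = $this->find($loc);
        if ($url === null) {
            return false;
        }
        $url->parentNode->removeChild($url);
        return true;
    }

    public function toXml()
    {
        return $this->doc->saveXML();
    }

    public function save()
    {
        return $this->doc->save($this->file) !== false;
    }
}

$sitemap = new SimpleSitemap(sys_get_temp_dir() . '/Sitemap-example.xml');
$sitemap->set('http://www.example.com/');
$sitemap->set('http://www.example.com/', date('Y-m-d')); // same URL, no duplicate
echo $sitemap->toXml();
```

Because set() looks for an existing loc before creating a new entry, calling it twice with the same URL only updates the last modification date. Call save() to write the result back to the file passed to the constructor.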

If you have a large number of URLs (a single sitemap file is limited to 50,000 URLs), you will need to tinker with the script to accommodate more than one sitemap, which would simply mean passing a unique filename when instantiating the class.

Simply save the contents of the two scripts below and try it for yourself. The references to Sitemap.xml assume that you are in your domain’s root directory. Otherwise, you should ensure your Sitemap.xml is saved in the root directory to avoid complications, unless it is referenced in a sitemap index file.

Example code to populate a sitemap

Example on how to update the last modified value

How to Create Website Thumbnails with PHP and Firefox

Thumbnail images of websites can be particularly useful to website owners and visitors alike, and they are generally aesthetically pleasing when they appear on a page.

There are a number of ways to create thumbnails, some better than others. Some lack fairly essential features, such as the ability to render Flash before the thumbnail image is generated. Without that particular feature, thumbnails of Flash websites appear as blank pages.

Here are two options that can be adjusted to suit your needs, one of them exclusively for Linux users:

PHP / Firefox

If you are a Firefox user, I recommend getting the Pearl Crescent Page Saver plugin for Firefox. Free and paid versions are available, with the latter offering slightly more features should you require them.

The basic requirements of the script are:

  • PHP: Or any other scripting language that can iterate through the list of URLs you would like to make thumbnails of
  • Firefox: The browser that is used to render webpages you want to make thumbnails of
  • Page Saver Plugin: The plugin that interacts with Firefox to generate a thumbnail
  • ImageMagick: Not essential, but very handy for post-processing of images, e.g. resizing.
  • Access to the command line

If you do not have ImageMagick installed, you will want to remove the last 2 lines of code, as they resize the image.
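To give an idea of what the resizing step can look like, here is a minimal sketch that builds an ImageMagick convert command from PHP. The function name buildResizeCommand, the file paths and the 200x150 size are assumptions made for this example and are not taken from the original script.

```php
<?php
// Build an ImageMagick command that resizes a screenshot to fit within
// $width x $height pixels. Paths and sizes here are illustrative only.
function buildResizeCommand($source, $target, $width = 200, $height = 150)
{
    return sprintf(
        'convert %s -resize %dx%d %s',
        escapeshellarg($source),   // quote paths so odd characters are safe
        (int) $width,
        (int) $height,
        escapeshellarg($target)
    );
}

$command = buildResizeCommand('/tmp/screenshot.png', '/var/www/thumbs/1.png');
echo $command, PHP_EOL;

// To actually run it (requires ImageMagick on the PATH):
// exec($command, $output, $exitCode);
```

The -resize option keeps the aspect ratio by default, so the thumbnail fits inside the given box rather than being stretched to it.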

Note that you will want to close all your browser windows while testing out this script, or at least create a separate Firefox profile. Ideally you would run a separate Firefox profile on a separate display/screen as well.

As an aside, I enjoy using the Firefox browser extension MozRepl… and there’s a nice GitHub repository on how to use the Firefox internals (JavaScript, essentially) to create screenshots. This removes the need to install the Pearl Crescent extension, though that particular extension has some nice customisation built in.

Bash / Konqueror

I am using Ubuntu; your flavour of Linux may require different commands. The following packages/software are required for the bash script to run correctly:

Save the following as thumbnails.sh

Save the following as thumbnails.txt and also create a directory for the thumbnails to reside in, for example /var/www/thumbs/

The following command will then iterate through the list in thumbnails.txt

Some notes regarding the latter script and both methods in general

  • I use PHP to generate the thumbnails.txt input file. The bash script iterates through each line, accesses the $URL and saves the thumb with the filename $URLID given on that line
  • Xvfb initiates a mock display that Konqueror uses to render web pages. You don’t actually need to see the browser working through the list
  • You may want to have a default page and load that up before calling each URL. This is so that a ‘default thumb’ can be used when a webpage is very slow to load.
  • You have to shut the browser down before running these scripts, otherwise browser invocation will complain that the browser is already running. This is why I like the Konqueror bash script more, as I use Firefox to browse. Alternatively, you can set up separate Firefox profiles that won’t cause the program to grumble when you invoke a new window.
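The first note above mentions generating the input file with PHP. A minimal sketch of that step follows, under the assumption that each line holds an ID and a URL separated by a single space. The exact line format expected by the bash script is not shown here, so treat the layout, the function name buildThumbnailList and the sample data as hypothetical.

```php
<?php
// Turn an array of id => url pairs into the text of an input file, one
// "ID URL" pair per line. The line format is an assumption for illustration.
function buildThumbnailList(array $urlsById)
{
    $lines = array();
    foreach ($urlsById as $id => $url) {
        // Skip anything that does not look like a URL.
        if (filter_var($url, FILTER_VALIDATE_URL) === false) {
            continue;
        }
        $lines[] = $id . ' ' . $url;
    }
    return $lines ? implode("\n", $lines) . "\n" : '';
}

$urls = array(
    1 => 'http://www.example.com/',
    2 => 'not a url',
    3 => 'http://www.example.org/page',
);
echo buildThumbnailList($urls);

// To write the file: file_put_contents('thumbnails.txt', buildThumbnailList($urls));
```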

There are lots of ways to acquire screenshots; these are just a couple to give you an idea of how it is done.

Using PHP DOM Functions to Parse HTML and Find Links

When developing websites, there are a million and one reasons that you will find yourself needing to parse some HTML to find snippets of information. On the face of it, most of the time a simple regular expression will do the trick, particularly when you are in control of the HTML you are fetching.

When parsing other people’s HTML, you soon find that the tag soup that makes up the World Wide Web produces situations and code segments your regular expression was never built to accommodate, resulting in false positives, false negatives… and generally the unexpected.

PHP’s DOM functions are specifically made for XML and X/HTML parsing. So, when you need to parse an SGML-derived language, turn to these functions and stay away from regular expressions. The comprehensive DOM library can add, edit and delete any attribute, tag or HTML within tags with its suite of functions.

The following example shows how easy it is to collect hyperlinks from a page or file without the problem of broken HTML, attributes with missing/no quotes, or any other hassle that may impede the collection of links:
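A minimal sketch of this kind of link collection, using DOMDocument::loadHTML and getElementsByTagName, might look like the following. The function name extractLinks and the sample markup are assumptions made for illustration.

```php
<?php
// Collect the href values of all <a> tags in a chunk of (possibly broken) HTML.
function extractLinks($html)
{
    if (trim($html) === '') {
        return array();
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them;
    // real-world HTML produces plenty.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Unquoted attribute, unclosed tags: the parser copes with both.
$html = '<p><a href=/about>About<a href="http://example.com/">Example</p>';
print_r(extractLinks($html));
```

To read from a page or file instead, pass the result of file_get_contents() to extractLinks(). Relative links such as /about are returned as written, so resolve them against the page URL if you need absolute addresses.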

Broken Link Checker Using PHP and cURL

Whether operating a commercial site, a directory, or a personal site, it is important to ensure you do not have ‘dead’ links on your website. Broken links (links that point to inactive domains or 404 pages) are of little use to your site visitors and may jeopardise any good search engine rankings you have, as broken links can suggest that your site is not well maintained.

To remedy any potential problem, run a script that periodically checks the links on your pages so that you can quickly alter or remove links that are no longer active or useful.

The following script will do this task for you, using PHP and cURL, with a simple HTML parser to find links on a page. Simply enter a URL into the form, and the results will appear in an iframe on the same page.
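The core of such a checker is a request per link followed by a look at the HTTP status code. Here is a minimal sketch of that step, assuming the cURL extension is available. The function names fetchStatus and classifyStatus, the timeout values and the rule that any 2xx or 3xx response counts as alive are choices made for this example.

```php
<?php
// Classify an HTTP status code. 0 means no response was received at all.
function classifyStatus($status)
{
    if ($status === 0) {
        return 'unreachable';
    }
    if ($status >= 200 && $status < 400) {
        return 'ok';
    }
    return 'broken';
}

// Ask for the headers only and return the final HTTP status code,
// or 0 if the request failed (DNS error, timeout, refused connection).
function fetchStatus($url, $timeout = 10)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_NOBODY         => true,  // HEAD request; some servers reject these
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,  // report the status after redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_TIMEOUT        => $timeout,
    ));
    curl_exec($ch);
    return (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}

foreach (array(200, 301, 404, 500, 0) as $code) {
    echo $code, ' => ', classifyStatus($code), PHP_EOL;
}

// Checking a real link:
// echo classifyStatus(fetchStatus('http://www.example.com/'));
```

If a server answers a HEAD request with an error but serves the page normally, a fallback is to retry with CURLOPT_NOBODY set to false before reporting the link as broken.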

Simple PHP & MySQL Pagination

When looking at MySQL output, it is sometimes more convenient to split up the number of records returned into separate pages and include hyperlinks to further pages in the result set, a layout often referred to as pagination.

The following is an example of such pagination. Change the MySQL query in the example at the foot of the code to see it working for yourself, remembering to connect to your MySQL database beforehand. This code is designed for simplicity rather than considering the finer details of pagination (mentioned below).

First off, create a test table if you wish to test the code:

Add some test data

This is the simple PHP class to illustrate basic pagination:

Produces something like…

In most cases, and in particular for small tables, this method of pagination is fine, as it’s a relatively inexpensive computation and allows you to jump to any page you like.
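The arithmetic behind this style of pagination is small enough to sketch on its own. The function name paginate and the table name test_table below are hypothetical. The sketch assumes you already know the total row count, for example from a COUNT(*) query.

```php
<?php
// Work out the LIMIT/OFFSET values and page count for a result set.
function paginate($totalRows, $perPage, $page)
{
    $perPage    = max(1, (int) $perPage);
    $totalPages = max(1, (int) ceil($totalRows / $perPage));
    // Clamp the requested page into the valid range.
    $page = min(max(1, (int) $page), $totalPages);

    return array(
        'page'       => $page,
        'totalPages' => $totalPages,
        'limit'      => $perPage,
        'offset'     => ($page - 1) * $perPage,
    );
}

$p = paginate(95, 10, 3);
printf(
    "SELECT * FROM test_table ORDER BY id LIMIT %d OFFSET %d\n",
    $p['limit'],
    $p['offset']
);
// prints: SELECT * FROM test_table ORDER BY id LIMIT 10 OFFSET 20
```

Casting the page number to an integer and clamping it also keeps arbitrary request input out of the query.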

For larger tables you will find that an alternative method is preferred; this post goes into detail on why. The post is useful in understanding the general concepts regarding performance and pagination from the MySQL point of view.

Forking with PHP from the command line

Forking new processes is an extremely handy technique in programming that allows you to run tasks in parallel with one another, from a single invocation of a program.

You may be interested in forking if:

  • You have a multi-processor/threaded CPU and want to utilise it more effectively
  • You want something to run in the background while your main thread of execution continues
  • You have a set of tasks that take an appreciable time to complete, but do not rely on the results of one another to complete.

As ever, an introduction to the concept is available in the PHP manual.

It is worth noting early on that forking is slightly different from threading, which is described in more detail in this StackOverflow question. Historically, threading has not been available in PHP, though there have been developments in remedying that.

One popular example usage is HTTP fetching. Fetching is a relatively slow process because of all the latency involved in talking to servers across the world. If you have a queue of 1000 URLs to fetch and each URL takes 3 seconds to fetch, it will take 3000 seconds to fetch all the URLs. Slow or unresponsive servers push that average higher, and URLs later in the queue have to wait for all the slower URLs in front of them to be fetched.

With forking (or threading), you can split the workload between instances of the script. In the URL fetching example for instance, you could create 10 forks of the fetching script that will fetch 100 URLs each. This should dramatically reduce the time it takes to fetch all the URLs (roughly 300 seconds rather than 3000 in the ideal case), because if one particular URL is slow, your 9 other forked scripts will still be fetching the URLs in their queue.
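The split described above can be sketched roughly as follows, assuming the pcntl extension is available (it normally is only on the command line under Unix-like systems). The function name runInForks and the placeholder task are hypothetical. A real fetcher would replace the echo with its HTTP request.

```php
<?php
// Split $jobs into one chunk per worker, fork a child for each chunk and
// wait for all children. Returns the number of children that exited cleanly.
function runInForks(array $jobs, $workers, callable $task)
{
    $workers = max(1, (int) $workers);
    $chunks  = array_chunk($jobs, max(1, (int) ceil(count($jobs) / $workers)));

    $pids = array();
    foreach ($chunks as $chunk) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException('Could not fork');
        }
        if ($pid === 0) {
            // Child: work through its own chunk, then exit so it never
            // falls through into the parent's code (avoids a fork bomb).
            foreach ($chunk as $job) {
                $task($job);
            }
            exit(0);
        }
        $pids[] = $pid; // parent: remember the child and carry on forking
    }

    $finished = 0;
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);
        if (pcntl_wifexited($status) && pcntl_wexitstatus($status) === 0) {
            $finished++;
        }
    }
    return $finished;
}

if (function_exists('pcntl_fork')) {
    // Placeholder job list; the URLs are made up for this example.
    $urls = array();
    for ($i = 1; $i <= 20; $i++) {
        $urls[] = "http://www.example.com/page/$i";
    }
    $done = runInForks($urls, 4, function ($url) {
        echo getmypid(), ' would fetch ', $url, PHP_EOL;
    });
    echo "$done forks finished", PHP_EOL;
}
```

Each child receives a separate chunk, so no URL is handed out twice. The order of the output lines varies from run to run because the children execute in parallel.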

I have provided skeleton code below to give you an idea of how it can work for you.

One important thing to consider when forking scripts is to avoid the nastiness of a fork bomb or the unpredictability of a race condition. Bear these concepts in mind as you delve into the usefulness of multi-tasking with forks or threads.

Workarounds for this problem are quite easy. In a text file for instance, you would want each script instance to grab every 10th line, so the 1st fork would grab the 1st line, the 11th line, the 21st line etc. Alternatively, you can have one fork that “serves” lines to the other forks (like in the example above), so that each line is only issued once. If you’re using a database as input and it has an auto-increment field, you can simply use a modulus of the auto-increment value as a quick’n’dirty way to delegate an equal number of rows to each fork. Essentially, you’re looking to keep each fork busy and avoid allocating the same job twice.
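The every-10th-line idea comes down to a modulus on the line index. A minimal sketch follows, with the hypothetical function name linesForFork and fork numbers counted from zero.

```php
<?php
// Return the lines that belong to fork number $forkIndex (0-based) when the
// work is shared between $forkCount forks. Line i goes to fork i % $forkCount.
function linesForFork(array $lines, $forkIndex, $forkCount)
{
    $mine = array();
    foreach (array_values($lines) as $i => $line) {
        if ($i % $forkCount === $forkIndex) {
            $mine[] = $line;
        }
    }
    return $mine;
}

$lines = range(1, 25); // stand-in for the lines of a text file
print_r(linesForFork($lines, 0, 10)); // first fork: lines 1, 11 and 21
```

For a database table the same rule can sit in the query itself, for example with a condition of the form MOD(id, 10) = 3 for the fourth fork, where id is the auto-increment column.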

HTTP Fetching in PHP Without cURL

On some shared hosting accounts, cURL, fopen or file_get_contents functions may be disabled ‘for security reasons’, yet you can still achieve HTTP fetching using socket functions.

This simple class will allow you to do a wide range of HTTP fetching. It should be easy enough to customise should you feel the need to. Check out the 3 examples provided at the foot of the code to try it out. You should be able to use this code with plain old PHP, with no extra functionality.

You may find httpbin.org useful in testing your network requests.
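To show the general shape of socket-based fetching, here is a minimal sketch using fsockopen. It is a simplified stand-in and not the class described above: the function names are hypothetical, it handles plain HTTP GET requests only, and it speaks HTTP/1.0 so that the server does not reply with chunked encoding.

```php
<?php
// Build a raw HTTP/1.0 GET request.
function buildGetRequest($host, $path = '/')
{
    return "GET {$path} HTTP/1.0\r\n"
        . "Host: {$host}\r\n"
        . "Connection: close\r\n\r\n";
}

// Split a raw response into status code, headers (lower-cased names) and body.
function parseResponse($raw)
{
    $parts       = explode("\r\n\r\n", $raw, 2);
    $headerLines = explode("\r\n", $parts[0]);
    $statusLine  = array_shift($headerLines);
    $status = preg_match('#^HTTP/\d(?:\.\d)? (\d{3})#', $statusLine, $m) ? (int) $m[1] : 0;

    $headers = array();
    foreach ($headerLines as $line) {
        if (strpos($line, ':') !== false) {
            list($name, $value) = explode(':', $line, 2);
            $headers[strtolower(trim($name))] = trim($value);
        }
    }
    return array(
        'status'  => $status,
        'headers' => $headers,
        'body'    => isset($parts[1]) ? $parts[1] : '',
    );
}

// Open a socket, send the request and read until the server closes the
// connection. Returns false if the connection could not be opened.
function httpGet($host, $path = '/', $port = 80, $timeout = 10)
{
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($socket === false) {
        return false;
    }
    stream_set_timeout($socket, $timeout);
    fwrite($socket, buildGetRequest($host, $path));
    $raw = stream_get_contents($socket);
    fclose($socket);
    return parseResponse($raw);
}

// Parsing a canned response, so the example runs without network access:
$demo = parseResponse("HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello");
echo $demo['status'], ' ', $demo['headers']['content-type'], ' ', $demo['body'], PHP_EOL;

// A live request would look like:
// $response = httpGet('httpbin.org', '/get');
```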

Here are some simple examples to get started with…