Unit Test Generated PDFs with PHPUnit and PDFBox

Generating PDF documents is one of the features that are hard to cover with unit tests.

The command line tool PDFBox with the option ExtractText comes in handy:


This application will extract all text from the given PDF document.

This allows us to test the textual content of the document or to search for specific strings inside it.

It gets interesting with the option -html, which converts the PDF to HTML instead. This makes structure and formatting at least remotely testable.
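
Once the HTML is available, structure can be checked with DOM and XPath queries. The following is only a sketch: the sample markup is a hypothetical stand-in, and the HTML that PdfBox really produces will differ in detail, so the queries have to be adapted to the actual output.

```php
<?php
// Hypothetical sample of converted HTML (an assumption, not real PdfBox output)
$html = '<html><head><title>Invoice</title></head><body>'
      . '<div><p><b>Invoice 2012-001</b></p><p>Total: 42.00 EUR</p></div>'
      . '</body></html>';

$dom = new DOMDocument();
// Collect libxml warnings internally instead of printing them
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Collect the text of all bold elements
$boldTexts = array();
foreach ($xpath->query('//b') as $node) {
    $boldTexts[] = trim($node->textContent);
}

// Count the paragraphs
$paragraphCount = $xpath->query('//p')->length;

print_r($boldTexts);
echo $paragraphCount, "\n";
```

In a test case the two results could then be compared with assertEquals() against the expected headline and paragraph count.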

Unfortunately the tool does not work with streams, so we have to use temporary files. A simple example of a function that receives a PDF document as a string, converts it to HTML with PdfBox and returns the HTML string:

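/**
 * @param string $streamIn binary string with generated PDF
 * @return string HTML string
 */
function htmlFromPdf($streamIn)
{
  // tempnam() requires a directory and a file name prefix
  $pdf = tempnam(sys_get_temp_dir(), 'pdf');
  file_put_contents($pdf, $streamIn);
  $txt = tempnam(sys_get_temp_dir(), 'txt');
  // escape the file names in case the temp directory contains spaces
  exec('java -jar pdfbox-app-x.y.z.jar ExtractText -encoding UTF-8 -html '
    . escapeshellarg($pdf) . ' ' . escapeshellarg($txt));
  $streamOut = file_get_contents($txt);
  unlink($pdf);
  unlink($txt);
  return $streamOut;
}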

For regression tests or refactoring it is sometimes enough to test that the generated PDF did not change in comparison to a reference PDF. This could be achieved with a hash value, but the PDF itself is not byte-for-byte identical every time, probably due to timestamps. A hash of the converted HTML is sufficient, though:

        // In PHPUnit test case:
        $converter = new PdfBox();
        $html = $converter->htmlFromPdfStream($pdf);
        $this->assertEquals('336edd9ee49b57e6dba5dc04602765056ce05b91', sha1($html), 'Hash of PDF content');
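
If the hash turns out to be too brittle, one option is to normalize the HTML before hashing. The sketch below collapses whitespace runs first. The helper name normalizedHtmlHash() is hypothetical, and whether whitespace differences occur at all in the converted output is an assumption.

```php
<?php
/**
 * Hypothetical helper: hash HTML after collapsing runs of whitespace,
 * so that pure whitespace differences do not change the hash.
 *
 * @param string $html
 * @return string SHA-1 hash of the normalized HTML
 */
function normalizedHtmlHash($html)
{
    $normalized = trim(preg_replace('/\s+/', ' ', $html));
    return sha1($normalized);
}

$a = "<p>Total:  42.00 EUR</p>\n";
$b = "<p>Total: 42.00 EUR</p>";
$c = "<p>Total: 43.00 EUR</p>";

var_dump(normalizedHtmlHash($a) === normalizedHtmlHash($b)); // bool(true)
var_dump(normalizedHtmlHash($a) === normalizedHtmlHash($c)); // bool(false)
```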

In this example I use a self-written class PdfBox, which encapsulates the call to Apache PdfBox. The code is available under the BSD Licence on GitHub: PHP PdfBox

PHP PdfBox

Requirements

  1. Java Runtime Environment, with “java” in the system path. To test this, run java -version on the command line. If you see information about the Java version, everything is fine.
  2. Apache PdfBox as executable JAR file. You can download it here: http://pdfbox.apache.org/downloads.html
  3. The PHP function exec() for executing system commands must not be disabled. On shared hosts it usually is disabled for security reasons; for local execution of unit tests it shouldn’t be a problem to allow exec(). PHP-CLI, i.e. PHP on the command line, usually uses a different php.ini configuration file than PHP-CGI for the web. The command php --ini shows which INI files are loaded in CLI mode. If necessary, edit these to remove exec from the disable_functions list.
  4. A PSR-0 compatible autoloader, as shipped with most frameworks. Otherwise you will need to include the single PHP files.

Usage

First you’ll have to specify the full path to the PdfBox JAR. Afterwards you can call the conversion methods, for example:

use SGH\PdfBox;

//$pdf = GENERATED_PDF;
$converter = new PdfBox;
$converter->setPathToPdfBox('/usr/bin/pdfbox-app-1.7.0.jar');
$text = $converter->textFromPdfStream($pdf);
$html = $converter->htmlFromPdfStream($pdf);
$dom  = $converter->domFromPdfStream($pdf);

The following conversion methods exist:

  • string textFromPdfStream($content, $saveToFile = null)
  • string htmlFromPdfStream($content, $saveToFile = null)
  • DomDocument domFromPdfStream($content, $saveToFile = null)
  • string textFromPdfFile($fileName, $saveToFile = null)
  • string htmlFromPdfFile($fileName, $saveToFile = null)
  • DomDocument domFromPdfFile($fileName, $saveToFile = null)

The first parameter is either the PDF as a binary string ($content) or the file name of a PDF ($fileName). The second parameter, if provided, is a file name for the output. The text or HTML will be saved in this file.

A few additional PdfBox options can be useful as well:

// Only extract pages 2-5
$converter->getOptions()
    ->setStartPage(2)
    ->setEndPage(5);

// ignore corrupt PDF objects
$converter->getOptions()
    ->setForce(true);

Everything else should be clear from the PhpDoc comments. Happy Testing!

PHP: References and Memory

Never ever use references in PHP just to reduce memory load. PHP handles that perfectly with its internal copy-on-write mechanism. Example:

$a = str_repeat('x', 100000000); // Memory used ~ 100 MB
$b = $a;                         // Memory used ~ 100 MB
$b = $b . 'x';                   // Memory used ~ 200 MB
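
The effect can be observed with memory_get_usage(). This is a rough sketch with a smaller string (10 MB, chosen only to keep the demo light); the exact numbers depend on the PHP version and its memory manager.

```php
<?php
$size = 10000000; // 10 MB instead of 100 MB

$start = memory_get_usage();
$a = str_repeat('x', $size);
$afterCreate = memory_get_usage();

$b = $a;                 // assignment only: both variables share one string
$afterAssign = memory_get_usage();

$b = $b . 'x';           // write: now a second string is allocated
$afterWrite = memory_get_usage();

printf("after create: +%.1f MB\n", ($afterCreate - $start) / 1e6);
printf("after assign: +%.1f MB\n", ($afterAssign - $afterCreate) / 1e6);
printf("after write:  +%.1f MB\n", ($afterWrite - $afterAssign) / 1e6);
```

The second line of output should stay close to zero, while the first and third each show roughly the size of the string.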

You should only use references if you know exactly what you are doing and need them for functionality (and that’s almost never, so you could as well just forget about them). PHP references are quirky and can result in some unexpected behaviour.
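
A well-known example of such a quirk is a foreach loop by reference followed by a second loop that reuses the same variable name. The sketch below is an illustration added here, not taken from the linked discussion.

```php
<?php
$items = array(1, 2, 3);

foreach ($items as &$item) {
    // nothing is modified on purpose
}
// $item is still a reference to the last element of $items here

foreach ($items as $item) {
    // every iteration assigns to $item, i.e. to the last element of $items
}

print_r($items); // the last element is now 2, not 3
```

Calling unset($item) after the first loop breaks the reference and avoids the problem.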

Question and Answer on StackOverflow

PHP: “Mocking” built-in functions like time() in Unit Tests

A common problem in unit testing PHP code is testing something that depends on the current time. For a deterministic test it should be possible to set the time in your test script without actually changing the system settings. In this article I’ll describe how this is usually done with OOP and then present an alternative solution with much less code that makes use of the new features in PHP 5.3.

The usual approach would be a wrapper class like this:

class Calendar
{
    public function time()
    {
        return time();
    }
    public function date($format, $time = null)
    {
        return date($format, $time ?: $this->time());
    }
    // ...
}

Now any class that uses date/time functions has to be modified to use the Calendar class via Dependency Injection:

class SomeClass
{
    /**
     * @var Calendar
     */
    private $calendar;

    public function __construct(Calendar $calendar = null)
    {
        $this->calendar = $calendar ?: new Calendar;
    }
    public function oneHourAgo()
    {
        return $this->calendar->date('H:i:s', $this->calendar->time() - 3600);
    }
}
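
With this constructor a test can hand in a replacement that returns a fixed time. The following sketch uses a hand-written stub instead of a mocking framework. The class FixedCalendar is hypothetical, the two classes from above are repeated in condensed form so the example runs on its own, and the nullable type hint (?Calendar) assumes PHP 7.1 or later.

```php
<?php
date_default_timezone_set('UTC'); // fixed time zone so the result is reproducible

class Calendar
{
    public function time()
    {
        return time();
    }
    public function date($format, $time = null)
    {
        return date($format, $time ?: $this->time());
    }
}

class SomeClass
{
    private $calendar;

    public function __construct(?Calendar $calendar = null)
    {
        $this->calendar = $calendar ?: new Calendar;
    }
    public function oneHourAgo()
    {
        return $this->calendar->date('H:i:s', $this->calendar->time() - 3600);
    }
}

// Hypothetical stub: always reports the timestamp it was constructed with
class FixedCalendar extends Calendar
{
    private $now;

    public function __construct($now)
    {
        $this->now = $now;
    }
    public function time()
    {
        return $this->now;
    }
}

$noon = gmmktime(12, 0, 0, 1, 15, 2012);
$subject = new SomeClass(new FixedCalendar($noon));
echo $subject->oneHourAgo(), "\n"; // 11:00:00
```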

Then you mock the Calendar class in your tests and pass it to the test subject. I won’t go into further details because you probably know the concept of mocking and how to do this in your favourite unit testing framework. After all, this article is not about mocking classes, because I have:

A simpler solution with namespaces

If you are using PHP 5.3 namespaces you are lucky because you won’t need all this overhead and probably no changes in your classes at all. The trick is to override built-in functions in your current namespace. Consider this namespaced version of the class from above:

namespace My\Namespace;

class SomeClass
{
    public function oneHourAgo()
    {
        return date('H:i:s', time() - 3600);
    }
}

As you can see, no overhead, just a straightforward call to date() and time(). To test this with specific times we implement a test case as follows (Example in PHPUnit but works as well with other frameworks):

namespace My\Namespace;

require_once 'PHPUnit/Framework/TestCase.php';

/**
 * Override time() in current namespace for testing
 *
 * @return int
 */
function time()
{
	return SomeClassTest::$now ?: \time();
}

class SomeClassTest extends \PHPUnit_Framework_TestCase
{
	/**
	 * @var int $now Timestamp that will be returned by time()
	 */
	public static $now;

	/**
	 * @var SomeClass $someClass Test subject
	 */
	private $someClass;

	/**
	 * Create test subject before test
	 */
	protected function setUp()
	{
		parent::setUp();
		$this->someClass = new SomeClass;
	}
	/**
	 * Reset custom time after test
	 */
	protected function tearDown()
	{
		self::$now = null;
	}

	/*
	 * Test cases
	 */
	public function testOneHourAgoFromNoon()
	{
		self::$now = strtotime('12:00');
		// oneHourAgo() formats with 'H:i:s', so seconds are included
		$this->assertEquals('11:00:00', $this->someClass->oneHourAgo());
	}
	public function testOneHourAgoFromMidnight()
	{
		self::$now = strtotime('0:00');
		$this->assertEquals('23:00:00', $this->someClass->oneHourAgo());
	}
}

The crucial point here is that we implement a new function named exactly like a built-in function. You cannot replace functions, but since this one is defined in the namespace \My\Namespace it does not replace anything. In fact it is a new function with the fully qualified name \My\Namespace\time().

The test subject now calls time() as an unqualified name, so PHP looks for the function in the current namespace first. That is \My\Namespace\time() in our example. I recommend the section about name resolution rules in the manual for further reading.
Important implication: It does not work if you use the global functions with fully qualified names (i.e. \time()) in your test subjects!
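
The difference between the two kinds of calls can be seen in isolation. This is a minimal sketch; the namespace Demo and both function names are made up for the illustration.

```php
<?php
namespace Demo;

// Shadows the built-in time() for unqualified calls inside namespace Demo
function time()
{
    return 1000;
}

function unqualifiedCall()
{
    return time();  // resolves to \Demo\time()
}

function qualifiedCall()
{
    return \time(); // always the built-in function
}

var_dump(unqualifiedCall());      // int(1000)
var_dump(qualifiedCall() > 1000); // bool(true)
```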

You can implement this function however you like. I decided to make the return value configurable within the test case via a static property that is reset after each test; if it is not set, the real time is used.

I hope this solution helps. It may feel hackish, but for me it made testing a lot easier!

Anonymous function calls in PHP

Anonymous function calls are a well-known pattern in JavaScript but there are also use cases in PHP where they make sense. Of course PHP 5.3 with its Lambda Functions is required!

But let me first introduce the pattern briefly:

In JavaScript you often have code that just has to be executed when loaded but you really don’t want to pollute the global namespace. The solution is to create an anonymous function and call it directly:

(function() {
  var some, local, variables;
  // do something
})();

Why should you want this in PHP?

Imagine an application that is not fully object oriented (yes this is the reality and yes, sometimes that even makes sense) and has some include files which execute code directly. Now if you don’t unset all local variables at the end you leave a big mess in the global namespace. See the analogy?

Unfortunately the following is not possible with PHP 5.3 Lambda Functions (directly calling a function expression only became valid syntax in PHP 7):

(function() {
  $localvar = 'foo';
  // do something
})(); // Parse error: syntax error, unexpected '('

But there is still good old call_user_func(). Since Lambda Functions are objects of the Closure class, which in turn is of the callable “type”, it fits our needs perfectly:

call_user_func(function() {
  $localvar = 'foo';
  // do something
});

Now what if there are variables from the outer scope that you need or want to change? Of course you could use the global keyword, but that only works if you are referring to the global scope, and there is a better way: the use keyword.

Look at this example:

// test.php
function test() {
  $readMe = 'Hello';
  $writeMe = 'World';
  include 'include.php';
  echo 'index.php: ', $readMe, ' ', $writeMe, "\n";
}
test();

// include.php
call_user_func(function() use ($readMe, &$writeMe) {
  $temp = $readMe . ' ' . $writeMe;
  echo 'include.php: ', $temp, "\n";
  $readMe = 'Good Bye';
  $writeMe = 'Internet';
});

test.php results in the following output:
include.php: Hello World
index.php: Hello Internet

Let’s go through it step-by-step:

The $readMe and $writeMe variables are present in the local scope of test(). Since we also include include.php there, the scope stays the same within this file.

Using the anonymous function call we then open a new scope but take over $readMe and $writeMe with the use statement. Any other local variable from test() will not be present!

It is important to know that variables passed with use work similarly to function parameters, so a copy is used by default (call-by-value). You can change this behaviour to call-by-reference exactly like in function declarations, with a “&” denoting a reference.
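
One consequence worth knowing: a by-value copy is taken when the closure is created, not when it is called. A small sketch with made-up variable names:

```php
<?php
$counter = 1;

$byValue = function () use ($counter) {
    return $counter; // copy taken when the closure was created
};
$byReference = function () use (&$counter) {
    return $counter; // sees the current value of the outer variable
};

$counter = 2;

var_dump($byValue());     // int(1)
var_dump($byReference()); // int(2)
```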

Now our function prints the contents of $readMe and $writeMe to the screen (“Hello World”) and assigns new values to both variables.

Afterwards, back in test() the $writeMe variable that was passed by reference will hold this new value whereas $readMe is still the same because it was passed by value. Therefore the output is “Hello Internet”.

Of course $temp will not be set, it was a local variable in the anonymous function scope and destroyed after its execution.

One step further

To keep code maintainable, including a file should not have side effects on variables. The anonymous function call ensures this on the one hand, but use parameters passed by reference bypass it again on the other. To keep this transparent I recommend wrapping the include statement itself instead of (or even in addition to) wrapping the code inside the included file:

// test.php
// ...
  call_user_func(function() use (&$writeMe) {
    include 'include.php';
  });
// ...

Now, regardless of what include.php contains, it is clear that it will only have effects on $writeMe.
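
The same idea can be put into a reusable function. This is a variation rather than the exact pattern from above: the helper includeIsolated() is hypothetical and hands variables to the included file by value only. Note that the included file also sees the helper's own $file and $vars.

```php
<?php
/**
 * Hypothetical helper: include a file in a fresh function scope.
 * Only the entries of $vars (plus $file and $vars themselves) are visible
 * to the included file, and nothing it sets leaks back to the caller.
 */
function includeIsolated($file, array $vars = array())
{
    return call_user_func(function () use ($file, $vars) {
        extract($vars);
        return include $file;
    });
}

// Demo with a throwaway include file
$path = tempnam(sys_get_temp_dir(), 'inc');
file_put_contents($path, '<?php $leaked = "set in include"; return strtoupper($greeting);');

$result = includeIsolated($path, array('greeting' => 'hello'));
unlink($path);

var_dump($result);        // string(5) "HELLO"
var_dump(isset($leaked)); // bool(false)
```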

Conclusion

If you work with procedural code and include files (even if they are just configuration files), anonymous function calls are a good way to keep the code more maintainable. They constrain unwanted side effects and make wanted side effects more visible without further documentation. So the two extra lines should really be worth the effort!

Update

Someone at reddit pointed out namespaces so let me clarify: Namespaces do not solve this problem!

Only classes, functions and constants are namespaced. A namespace does not have its own variable scope, so code in different namespaces shares any variables of the enclosing scope!

Propel 1.5.5: propel-gen fails

Recently I couldn’t get the Propel generator running anymore. First I suspected conflicts due to different versions that were installed at the same time, but an update and forcing usage of the latest version did not help.

It turns out that Propel cannot work with the latest versions of Phing, everything starting from Phing 2.4.2. So unfortunately the following was necessary:

pear install -f phing/phing-2.4.2

(-f is the “force” parameter to overwrite newer versions)

Et voilà, no more errors.

PHP: Undefined constant __COMPILER_HALT_OFFSET__

This notice occasionally showed up in a file with __halt_compiler(). It took me some time to understand the problem: the error only appeared when I refreshed the page within a short time, so after a while I suspected the opcode cache to be the cause. It was indeed related to APC, and I found the following bug report:
http://pecl.php.net/bugs/bug.php?id=15788&edit=2

Solution: Upgrade APC or just use other methods to store data than at the end of a PHP script 😉

PHP Fatal Error Handler

Code snippet inspired by this article:

fatalerrorhandler.php

<?php
// report all but fatal errors 
error_reporting((E_ALL | E_STRICT) ^ (E_ERROR | E_CORE_ERROR | E_COMPILE_ERROR));

// fatal error handler as shutdown function
register_shutdown_function('fatalErrorHandler'); 

function fatalErrorHandler() {
	// error_get_last() returns null if no error occurred at all
	$error = error_get_last();
	if ($error !== null && ($error['type'] & (E_ERROR | E_CORE_ERROR | E_COMPILE_ERROR))) {
		echo '<h1>Fatal Error, shutting down...</h1>';
		echo '<pre>' . var_export($error, true) . '</pre>';
	} else {
		echo 'Regular Shutdown, no fatal errors.';
	}
}

test1.php

<?php
require 'fatalerrorhandler.php';

// Fatal Error (E_ERROR)
unknown_function_call();

test2.php

<?php
require 'fatalerrorhandler.php';

// E_USER_ERROR
trigger_error('...', E_USER_ERROR);

test3.php

<?php
require 'fatalerrorhandler.php';

// Notice (E_NOTICE)
echo $unknown_var;
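
Which branch the handler takes for each test script follows from the bit mask check. The sketch below isolates that check; the helper name isFatalErrorType() is hypothetical and only mirrors the condition used in fatalErrorHandler().

```php
<?php
/**
 * Hypothetical helper mirroring the bit mask check in fatalErrorHandler().
 *
 * @param int $type error type as reported by error_get_last()
 * @return bool
 */
function isFatalErrorType($type)
{
    $fatalMask = E_ERROR | E_CORE_ERROR | E_COMPILE_ERROR;
    return ($type & $fatalMask) !== 0;
}

var_dump(isFatalErrorType(E_ERROR));      // bool(true)
var_dump(isFatalErrorType(E_USER_ERROR)); // bool(false)
var_dump(isFatalErrorType(E_NOTICE));     // bool(false)
```

With this mask, an error of type E_USER_ERROR as triggered in test2.php is not treated as fatal by the handler.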