file handling is one of those things developers don't pay a lot of attention to anymore; we read in files with, for instance, file_get_contents, or write with its 'put' companion, and it just works. most of the time.
file handling in the real world, beyond our development environment, though, can be a lot trickier. for many years, my company's primary focus was rescuing other people's projects that had been abandoned or were just plain Too Broken, and i have seen many instances where a casual approach to reading and writing files was the cause of production-crashing bugs.
in this article, we're going to go over how to write file handling functions that are safer and more reliable.
the flyover
the basic topics we're going to go over in this article are:
- handling filesystem problems before access
- reading files with generators
- a simple file write function
- putting it together with a fun but not particularly useful 'file copy' function
note: all of the examples here assume we are reading text files to keep things shorter.
reading files
when we read a file with file_get_contents() or even fopen() and a loop, we're putting a lot of faith in the filesystem; faith that the file exists, faith that we have the permissions to read it, and so on. on our development box, this faith is usually justified. but if we're writing wares that run in the wild, especially if it's on someone else's servers that we don't control, that faith can cause problems.
catching filesystem errors early
there are basically three things we have to check before we can successfully read a file:
- the file exists
- we have permission to read the file
- opening the file doesn't fail for some other reason
fortunately, php provides us with functions to check each of these. let's build a validate_file_read() function that does those checks.
/**
 * Validate we can read a file
 * @param String $path_to_file The path to the file we want to read
 * @return void
 */
function validate_file_read(String $path_to_file):void {
    // does the file exist?
    !file_exists($path_to_file) ? die("File '$path_to_file' does not exist") : null;
    // do we have permissions to read the file?
    !is_readable($path_to_file) ? die("File '$path_to_file' is not readable") : null;
    // do we get any other errors trying to open the file
    $fp = fopen($path_to_file, "r");
    $fp === false ? die("Could not open file '$path_to_file'") : null;
    fclose($fp);
}
this function tests all three of our requirements and, if one fails, kills the script with an error message.
first, we confirm the file exists with file_exists. this function returns a boolean, so we can test it with an if() statement or, as shown here, a ternary operator. next, we use is_readable to verify that we actually have the permissions required to open and read the file. note that php, when used as a web language, usually runs under a special user, e.g. 'www-data', that has limited permissions.
finally, we try opening the file to confirm that there is nothing else wrong. we note here that fopen() is one of those annoying php functions that returns the 'mixed' type. on success, we get our file pointer. on failure, we get boolean false.
we probably aren't going to use this function 'as is'; it's basically just to illustrate the process of file validation. going forward, we will be implementing the contents of this function in other, more immediately useful functions.
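if we want the calling code to decide what happens on a failure, rather than killing the script, the same three checks can throw exceptions instead. this is only a sketch: the function name assert_file_readable() and the choice of RuntimeException are assumptions made for illustration, not something the rest of this article depends on.

```php
<?php
/**
 * Hypothetical variant of validate_file_read() that throws instead of dying.
 *
 * @param string $path_to_file The path to the file we want to read
 * @return void
 */
function assert_file_readable(string $path_to_file): void {
    // does the file exist?
    if (!file_exists($path_to_file)) {
        throw new RuntimeException("File '$path_to_file' does not exist");
    }
    // do we have permissions to read the file?
    if (!is_readable($path_to_file)) {
        throw new RuntimeException("File '$path_to_file' is not readable");
    }
    // the @ suppresses the warning fopen() emits on failure;
    // we check the return value ourselves instead
    $fp = @fopen($path_to_file, "r");
    if ($fp === false) {
        throw new RuntimeException("Could not open file '$path_to_file'");
    }
    fclose($fp);
}

// usage: the caller decides what to do about the failure
$missing = sys_get_temp_dir() . "/no-such-file-" . uniqid();
try {
    assert_file_readable($missing);
} catch (RuntimeException $e) {
    print "cannot read: " . $e->getMessage() . PHP_EOL;
}
```

the checks are identical to the ones above; only the failure handling changes.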
handling large files
once, many years ago, i worked on a rescue project that had a central feature of processing very large files uploaded by users. the system would frequently fail if the file was too large, and the 'solution' the original development team implemented was setting php's memory_limit to -1 (no limit) and buying a boatload of ram. it was clumsy, expensive, and still didn't stop the client from losing business because of errors.
the solution we implemented was to migrate all the file reads to generators so that the wares only ever held one line of a file in memory at a time.
let's take a look at how we would read a file using a generator:
/**
 * Generator to read a file line-by-line
 *
 * @param String $file The path to the file to read
 * @return Generator
 */
function read_generator(String $file):Generator {
    // open the file for reading
    $fp = fopen($file, "r");
    $fp === false ? die("Could not open file '$file'") : null;
    // read file one line at a time; fgets() returns false at the end of the file
    while (($line = fgets($fp)) !== false) {
        yield $line;
    }
    // close the file pointer
    fclose($fp);
}
// entry point
foreach (read_generator("testfile") as $line) {
    print "processing line... ".$line;
}
if you've never used generators before, this may be a bit confusing, and we will do a short overview of generators below.
we see in the read_generator() function that we accept a path to a file as an argument and then open that file for reading. we then proceed to loop to the end of the file, reading one line at a time. instead of appending each line to a buffer string or array, however, we yield that line so it can be dealt with by the code that called the function. no buffer means no risk of running out of memory with large files!
generators implement a simple iterator that allows us to loop over the results using foreach. we can see in our loop at the entry point that we can treat a call to read_generator() the same way that we would treat foreach-ing over an array that we got from calling, say, file.
this construct gives us (most of) the convenience of having our entire file in an array of lines without the risk of us blowing through our memory roof if the file is very large. of course there are some limitations to this technique compared to having an in-memory array of file lines; we can't call count or use array_map or the like, but the payoff is safety and reliability.
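losing count and array_map doesn't mean losing what they do. here is a rough sketch of generator-friendly stand-ins; map_lines() and count_lines() are hypothetical names invented for this example, and the inline generator is a stand-in for read_generator() so the sketch runs without a file on disk.

```php
<?php
/**
 * Hypothetical helper: apply $fn to every value an iterable produces,
 * one value at a time, without building an array.
 */
function map_lines(callable $fn, iterable $lines): Generator {
    foreach ($lines as $line) {
        yield $fn($line);
    }
}

/**
 * Hypothetical helper: count the values of an iterable without storing them.
 */
function count_lines(iterable $lines): int {
    $count = 0;
    foreach ($lines as $unused) {
        $count++;
    }
    return $count;
}

// a stand-in for read_generator() so the sketch runs on its own
$lines = (function (): Generator {
    yield "one\n";
    yield "two\n";
    yield "three\n";
})();

// each line is uppercased as it is read; nothing is buffered
foreach (map_lines('strtoupper', $lines) as $line) {
    print $line;
}
```

one thing to keep in mind: a generator can only be walked once, so counting the lines and then processing them means opening the file twice. php also ships iterator_count() for the counting case.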
a bit about generators
generators are not widely used in php, which is a shame because they are a very powerful tool to have.
essentially, a generator is a function that executes up to the next yield statement every time the calling code asks it for another value, maintaining the function's state in between.
let's look at this example:
/**
 * A sample generator function
 */
function samplegenerator() {
    // we set the state of $j here on the first call
    $j = 10;
    // we then yield three times, incrementing $j each time
    yield $j; // first yield
    $j++;
    yield $j; // second yield
    $j++;
    yield $j; // third yield
    // there are no more yields, so the end of the generator's iterator is reached
    print "there are no more yields, the generator's iterator ends.";
}
foreach(samplegenerator() as $l) {
    print $l.PHP_EOL;
}
in our foreach() we step through the generator that samplegenerator() gives us until it has no more yield statements. now, let's look at the generator function itself and how it behaves on each of these steps.
on the first step, the code executes up to the first yield. this means our function sets the value of the internal variable $j to 10 and then yields it, essentially returning the value of $j to the calling loop. our calling loop assigns the value of $j, 10 this time, to the variable $l and prints it.
on the second step, execution resumes and advances to the second yield. along the way, the value of $j, which has persisted in the function, is incremented by one to 11. it is then returned by yield. on the third step, we find that $j still holds the 11 set by the previous step. we increment again, and advance to the third and final yield.
on the last step, there are no more yields. execution continues to the bottom of the function and it terminates. our generator is now 'empty', and our calling foreach loop ends.
if we run this script, we will see output like this:
10
11
12
there are no more yields, the generator's iterator ends.
generators are not limited to being used as iterators, either. php provides methods on the Generator class like current and next that allow us finer-grained control.
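to make that concrete, here is a small sketch that drives a generator by hand with current(), next() and valid(). the countdown() function is a made-up example, not something we use elsewhere in this article.

```php
<?php
/**
 * Hypothetical three-step generator used only for this illustration.
 */
function countdown(): Generator {
    yield 3;
    yield 2;
    yield 1;
}

$gen = countdown();                // nothing in the function body has run yet
print $gen->current() . PHP_EOL;   // runs to the first yield, prints 3
$gen->next();                      // resumes and runs to the second yield
print $gen->current() . PHP_EOL;   // prints 2
$gen->next();                      // resumes and runs to the third yield
print $gen->current() . PHP_EOL;   // prints 1
$gen->next();                      // no more yields: the generator finishes
var_dump($gen->valid());           // bool(false)
```

this is the same stepping that foreach does for us behind the scenes.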
writing files
reading files is great, but at some point we're going to want to write them as well.
fortunately, writing files requires less work than reading them; there's no need for generators. however, we will still have to do some error checking.
catching errors
like reading, writing files can result in errors, and we want to catch those errors before they happen.
in general, there are four potential errors we want to check for:
- the target directory we want to write to does not exist
- we don't have permission to write the file
- we don't have enough disk space for our new file
- optionally, we're overwriting a file that's already there
let's look at a function that tests all those conditions:
/**
 * Validate we can write a file
 * @param String $file_contents The contents we want to write
 * @param String $path_to_file The path to the file we want to write to
 * @return void
 */
function validate_file_write(String $file_contents, String $path_to_file):void {
    // does the target directory exist?
    !file_exists(dirname($path_to_file)) ? die("Target directory does not exist") : null;
    // is the target directory writable?
    !is_writable(dirname($path_to_file)) ? die("Target directory is not writable") : null;
    // do we have enough diskspace
    strlen($file_contents) > disk_free_space(dirname($path_to_file)) ? die("File '$path_to_file' is too big to write. Not enough space on disk.") : null;
    // optional: are we clobbering an existing file?
    file_exists($path_to_file) ? die("Target output file already exists at '$path_to_file'") : null;
}
again, this is not a function we would typically use in real life; it's just an example to show how we test our file writes.
we used file_exists to check for errors when we were reading files, and we're using it again here. however, this time we're checking to see if the directory we want to write to is there or not. we have a file path in $path_to_file that we want to write to, but is this a valid path? we test that by getting the directory of the path with dirname and then running file_exists to see if the directory is there. despite its name, file_exists also handles directories!
next, we check if we have permissions to write our file with is_writable. this function is the companion to the is_readable we used in our file read example, and it works the same way. again, we're testing if the directory is writable since the file itself does not exist yet.
then there's the issue of diskspace. running out of diskspace is never fun. fortunately, checking how much room we have on a drive is fairly straightforward with disk_free_space. we note, here, that this function takes our target directory as an argument. this is because we're not really checking for available space on the disk, but on the partition. the path to the directory tells disk_free_space which partition we're interested in. once we have our available space in bytes, we can check it against the size of the contents we want to write.
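putting those checks in front of an actual write might look something like the sketch below. the name safe_write(), the $overwrite flag, and the use of exceptions where the examples above call die() are all assumptions made for this illustration.

```php
<?php
/**
 * Hypothetical write helper: run the checks, then write.
 *
 * @param string $file_contents The contents we want to write
 * @param string $path_to_file  The path to the file we want to write to
 * @param bool   $overwrite     Assumed flag: allow clobbering an existing file
 * @return int The number of bytes written
 */
function safe_write(string $file_contents, string $path_to_file, bool $overwrite = false): int {
    $dir = dirname($path_to_file);
    // does the target directory exist?
    if (!is_dir($dir)) {
        throw new RuntimeException("Target directory '$dir' does not exist");
    }
    // is the target directory writable?
    if (!is_writable($dir)) {
        throw new RuntimeException("Target directory '$dir' is not writable");
    }
    // optional: are we clobbering an existing file?
    if (!$overwrite && file_exists($path_to_file)) {
        throw new RuntimeException("Target output file already exists at '$path_to_file'");
    }
    // disk_free_space() returns false on failure, so check that before comparing
    $free = disk_free_space($dir);
    if ($free === false || strlen($file_contents) > $free) {
        throw new RuntimeException("Not enough space on disk to write '$path_to_file'");
    }
    // LOCK_EX asks for an exclusive lock on the file while we write
    $bytes = file_put_contents($path_to_file, $file_contents, LOCK_EX);
    if ($bytes === false) {
        throw new RuntimeException("Could not write file '$path_to_file'");
    }
    return $bytes;
}
```

note that the checks and the write are separate steps, so another process could still change the filesystem in between; that is why the sketch also tests the return value of file_put_contents().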
building a cp function
now that we can safely read and write files, let's put it all together into a function that copies a text file by reading it line-by-line and applying a transform and filter function on each line. note that this function is just for demonstration and is probably not something you would use in real life!
let's look at the function cp:
/**
 * Copies file $in to destination $out with optional line filter and transformation.
 *
 * @param String $in Path to input file
 * @param String $out Path to target output file
 * @param Callable $transform Optional. Function to apply to each line on copy
 * @param Callable $filter Optional. Function that returns boolean test on line and copies line on true.
 * @return void
 */
function cp(String $in, String $out, ?callable $transform = null, ?callable $filter = null):void
{
    /**
     * Assign identity functions as default for transform and filter
     */
    $transform = $transform ?? fn ($n) => $n;
    $filter = $filter ?? fn ($n) => true;
    /**
     * Preflight we can read and write.
     *
     * @param String $in Path to the input file
     * @param String $out Path to the target output file
     * @return void
     */
    $preflight = function (String $in, String $out):void {
        !file_exists($in) ? die("File '$in' does not exist") : null;
        !is_readable($in) ? die("File '$in' is not readable") : null;
        !file_exists(dirname($out)) ? die("Target directory does not exist") : null;
        !is_writable(dirname($out)) ? die("Target directory is not writable") : null;
        file_exists($out) ? die("Target output file already exists at '$out'") : null;
        // check disk space
        filesize($in) > disk_free_space(dirname($out)) ? die("File '$in' is too big to copy") : null;
    };
    /**
     * File readline generator
     * @param String $file Path to the file to read
     * @return Generator
     */
    $read = function (String $file):Generator {
        // open file and handle error
        $fp = fopen($file, "r");
        $fp === false ? die("Could not open file '$file'") : null;
        // yield each line; fgets() returns false at the end of the file
        while (($line = fgets($fp)) !== false) {
            yield $line;
        }
        // cleanup
        fclose($fp);
    };
    /**
     * Confirm that our filesystem is good before starting copy
     */
    $preflight($in, $out);
    /**
     * Read from input file and write to output file one line at a time
     * testing the filter and applying the transformation
     */
    $fp = fopen($out, 'w');
    $fp === false ? die("Could not open file '$out' for writing") : null;
    foreach ($read($in) as $line) {
        $filter($line) ? fwrite($fp, $transform($line)) : null;
    }
    // cleanup
    fclose($fp);
} // cp
the basic steps this function follows are:
- 'preflight' that we can read and write our target files
- read the input file, one line at a time, using a generator
- apply a filter function on each line, determining if that line will be copied to the output file or not
- apply a transform function on each line, altering it
- write the line to the target output file
when we look at this function, the first things that catch our notice are the transform and filter arguments. these are of type callable; basically they are functions that we will apply to each line as we copy it from the in file to the out file.
the filter function determines if we copy the line at all. this function takes the line of the file as an argument and its body applies a test to that line. if the filter function returns true, we copy the line. if it returns false, we don't.
let's take a look at a filter function we might use as an argument here:
$filter = fn($line) => !str_contains($line, 'two');
cp($in, $out, null, $filter);
here we create an anonymous function using php's arrow notation and assign it to the variable named $filter. this function checks if the line contains the word 'two' and returns false if it does. if we pass this as our $filter argument, then any line in our input file that contains the word 'two' will not be copied to our output file.
the $transform argument is similar. it is also a function we pass to cp, however its purpose is to modify the line we are copying. here's an example:
$transform = fn($line) => ucfirst($line);
cp($in, $out, $transform, null);
this transform changes the first letter of the line to uppercase and returns the line. if we pass this function as an argument to cp, then every line in the out file will have its first letter uppercase.
in the body of the cp function, we see that the first thing we do is assign default values to $transform and $filter if they are null:
$transform = $transform ?? fn($n) => $n;
$filter = $filter ?? fn($n) => true;
our default filter function always returns true, so it copies every line, and our default transform function simply returns the line unmodified.
the next thing we see is a preflight function. we didn't need to wrap this code in an anonymous function, but it's done that way here to keep it neat and separate.
the preflight function is where we do all the tests to ensure that our read and write will work; the stuff we've covered already.
next is another anonymous function: read. this is our generator function for reading the file line-by-line. it behaves exactly the same as the read_generator() we looked at before; the only difference is that this is an anonymous function inside our main function and is assigned to a variable.
finally, we get to the foreach call that iterates over the generator. each line it takes is tested against the filter function. if it passes, the transform function is applied to the line and it is written to the output file. copying is achieved.
to show how this cp function is used, we'll run a few basic tests to copy this text file, which is a list of the sonic youth records i own.
$ cat /tmp/in.txt
confusion is sex
bad moon rising
evol
sister
daydream nation
goo
dirty
experimental jet set, trash and no star
now let's run our cp function with a transform that uppercases the first letter of each line, and a filter function that removes any line that contains the letter 'o'.
$in = "/tmp/in.txt";
$out = "/tmp/out.txt";
$transform = fn($line) => ucfirst($line);
$filter = fn($line) => !str_contains($line, 'o');
cp($in, $out, $transform, $filter);
the results, predictably enough, are:
Sister
Dirty
of course, we can also just copy the file without any filtering or transforming:
$in = "/tmp/in.txt";
$out = "/tmp/out.txt";
cp($in, $out, null, null);
and our new file is the same as the original.
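if we'd rather confirm that than take it on faith, one rough way is to compare hashes of the two files. the helper name files_match() is hypothetical, and the sketch uses php's built-in copy() on temporary files as a stand-in for our cp() so that it runs on its own.

```php
<?php
/**
 * Hypothetical helper: true if two files have identical contents.
 */
function files_match(string $a, string $b): bool {
    $hash_a = hash_file('sha256', $a);
    $hash_b = hash_file('sha256', $b);
    // hash_file() returns false on failure; never treat two failures as a match
    return $hash_a !== false && $hash_a === $hash_b;
}

// stand-in files so the sketch runs without /tmp/in.txt
$in  = tempnam(sys_get_temp_dir(), "in");
$out = tempnam(sys_get_temp_dir(), "out");
file_put_contents($in, "sister\ndirty\n");
copy($in, $out); // php's built-in copy, standing in for our cp()

print files_match($in, $out) ? "files match" : "files differ";
print PHP_EOL;

// cleanup
unlink($in);
unlink($out);
```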
conclusion
it may not seem like an important thing to write safer file-access code. after all, most php applications are for the web and run in controlled environments where the filesystem is predictable.
however, the effort is minimal and file access errors, if not properly caught and handled, can be catastrophic. i have seen clients who have lost tens of thousands of dollars in business because of unsafe file access code. safety has its rewards.