In this post I want to describe my thought process from when I wrote a PHP script to upload image files to my website.
It is super easy to save an uploaded file with PHP: all the information about the uploaded files is provided in the superglobal array $_FILES.
Now, I want to make sure my script is safe to use, so that another authenticated user besides myself can upload files to the website without me having to fear that a malicious user places a malicious file on the server.
I started to get scared tbh, because I was shocked by how little control you have as a web developer when a user sends a file to the web server. In fact, I learned that you cannot even prevent a user from sending a file to the web server. So it comes down to the question: How do I validate the uploaded file?
I gave it a lot of thought and came to the conclusion that you can always improve the security of your website by using restrictions. The more I restrict the properties of the uploaded file, the more security I get, or so I suppose.
Let's do that with image files.
So in my PHP script I grab all the information about the image file that was just uploaded and check it against the limits I set.
The goal is to make sure the image file is what it claims to be.
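Before any of the checks below, it may help to confirm that the upload itself completed. This is a minimal sketch, assuming a single-file form field; uploadLooksComplete() is a hypothetical helper, not part of my script, and it only inspects the keys PHP documents for a $_FILES entry.

```php
<?php
// Hypothetical helper: checks one entry of $_FILES before any further validation.
function uploadLooksComplete(array $file): bool
{
    // PHP sets 'error' to UPLOAD_ERR_OK (0) only when the transfer succeeded.
    if (!isset($file['error']) || $file['error'] !== UPLOAD_ERR_OK) {
        return false;
    }
    // 'tmp_name' is the temporary path on the server, 'size' the byte count.
    return isset($file['tmp_name'], $file['size'])
        && is_string($file['tmp_name'])
        && $file['tmp_name'] !== ''
        && $file['size'] > 0;
}
```

In a real request you would call it as uploadLooksComplete($_FILES["file"]) and probably also check is_uploaded_file() on the temporary path.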
1. File Size
$maxFileSize = 1024 * 1024 * 10; // Max. 10 MB
$maxImgSize = 4000; // Max. 4000px (for width and height)
I'll admit, I would rather not restrict those too much. I believe every smartphone today shoots images with at least 2000px and 3-4 MB, and that may be an understatement.
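The two limits above could be enforced roughly like this. It is only a sketch: withinImageLimits() is a hypothetical name, and it assumes that PHP's built-in getimagesize() can read the dimensions of the file types I allow.

```php
<?php
// Hypothetical helper: true only if the file respects both limits.
function withinImageLimits(string $path, int $maxFileSize, int $maxImgSize): bool
{
    $bytes = filesize($path);
    if ($bytes === false || $bytes === 0 || $bytes > $maxFileSize) {
        return false;
    }
    // getimagesize() returns false when it cannot parse the file as an image.
    $info = @getimagesize($path);
    if ($info === false) {
        return false;
    }
    [$width, $height] = $info;
    return $width > 0 && $height > 0
        && $width <= $maxImgSize && $height <= $maxImgSize;
}
```

With the values from above the call would be withinImageLimits($_FILES["file"]["tmp_name"], $maxFileSize, $maxImgSize).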
2. File Extension And Mime/Media Type
I want to allow only jpg, png, gif, bmp and webp images to be uploaded. I think that covers enough ground.
In this array I set the file extensions together with their respective mime/media types.
$imgWhiteList = array("jpg" => "image/jpeg",
"jpeg" => "image/jpeg",
"gif" => "image/gif",
"bmp" => "image/bmp",
"png" => "image/png",
"webp" => "image/webp");
Now I use the following function to get the file extension from the uploaded file's name. If the given file extension is not whitelisted, this function returns FALSE. So calling this function is also a validation step. The code is self-explanatory:
function getFileExtension($name):string|false
{
    // split file name by dots
    $arr = explode('.', strval($name));
    // last array element has to be the file extension
    $ext = array_pop($arr);
    $ext = mb_strtolower(strval($ext));
    // Return file extension string if whitelisted
    if(array_key_exists($ext, $GLOBALS["imgWhiteList"])) {
        return $ext;
    }
    return FALSE;
}
if(!$ext = getFileExtension($_FILES["file"]["name"])) {
    die("Invalid file type");
}
// $ext is now your file extension
// Check the mime type like this:
if($imgWhiteList[$ext] !== mime_content_type($_FILES["file"]["tmp_name"]))
{
    die("Invalid media type");
}
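The extension lookup above could also be written without $GLOBALS by passing the whitelist in and letting pathinfo() do the splitting. This is a sketch of that variant; allowedExtension() is a hypothetical name, not the function from my script.

```php
<?php
// Hypothetical variant of getFileExtension() that takes the whitelist as an argument.
function allowedExtension(string $name, array $whiteList): string|false
{
    // pathinfo() returns the part after the last dot, or '' if there is none.
    $ext = strtolower(pathinfo($name, PATHINFO_EXTENSION));
    return array_key_exists($ext, $whiteList) ? $ext : false;
}
```

Note that only the last extension counts, so a name like "shell.php.jpg" still passes as "jpg". The extension alone proves nothing about the content.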
But do I really need to check the mime media type?
At this point my research started to intensify and I came to understand that you can detect the extension and media type of a file, but both are still data that can be manipulated. Additionally, the PHP function mime_content_type() uses the "magic.mime" file in your PHP installation to determine the file type.
I didn't know what that is.
The official PHP documentation and the top comments below enlightened me a bit:
https://www.php.net/manual/en/function.mime-content-type
But I am still not sure how reliable this technique really is.
But I'm not giving up, I want to have a safe and reliable upload script, even when I have to use magic!
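For comparison, the same kind of content-based detection is available through the finfo class of PHP's fileinfo extension. The sketch below assumes that extension is enabled; detectMimeType() is a hypothetical helper. It looks at the bytes of the file, not at its name.

```php
<?php
// Hypothetical helper: MIME type detected from the file's content, or false on failure.
function detectMimeType(string $path): string|false
{
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    return $finfo->file($path);
}
```

Because the result comes from the leading bytes, a file that merely starts like a GIF is reported as a GIF too, so this is a plausibility check rather than proof.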
3. The Magic Bytes
So while I was tirelessly googling things like "help how to protect against virus file php upload" I eventually came across something often referred to as Magic Numbers or Magic Bytes. It seems that most binary file formats contain a kind of signature: the "real" type of a file can be read from its first few bytes. The length of this signature varies between file types. But luckily we live in the age of knowledge and Wikipedia has this to offer:
https://en.wikipedia.org/wiki/List_of_file_signatures
Now to be quick, I made a plan:
- Take the signature info of the desired file type from Wikipedia
- Read the first few bytes of the uploaded file and check them against the signature from Wikipedia
- Write boolean functions that do this and use them finally to check if the file is what it claims to be.
This is an example for GIF images. According to Wikipedia, a GIF image file has to start with a 6-byte signature:
Either "47 49 46 38 37 61" (GIF87a)
or "47 49 46 38 39 61" (GIF89a)
So the trick is simple: for GIF files, read the first six bytes and check whether they match either of the values above.
function magicBytesGIF($file):bool
{
    // 'rb' reads the file in binary mode
    if(!$handle = fopen($file, 'rb')) return FALSE;
    $readBytes = fread($handle, 6);
    fclose($handle);
    if(!$readBytes) return FALSE;
    $readBytes = strtoupper(bin2hex($readBytes));
    if($readBytes === "474946383761"
    OR $readBytes === "474946383961") {
        return TRUE;
    }
    return FALSE;
}
BOOM. A solid boolean function to check if the uploaded image.gif is REALLY a gif and not a -
I don't even know what I am protecting myself against, I just don't want my website to be hacked.
Oh my god, reading up on cyber security can make a man paranoid!
Anyways.
Here is another example with JPG. JPG files can have 5 different byte signatures, as follows:
1. "FF D8 FF DB" (4 bytes)
2. "FF D8 FF E0 00 10 4A 46 49 46 00 01" (12 bytes)
3. "FF D8 FF EE" (4 bytes)
4. "FF D8 FF E1 ?? ?? 45 78 69 66 00 00" (12 bytes)
5. "FF D8 FF E0" (4 bytes)
So I only need to read the first 12 bytes once, and then check whether they start with any of the 5 signatures above.
In variant 4 the question marks mean it can be any value in that position (those two bytes hold the length of the Exif segment, which differs from file to file). It felt like deciphering ancient writings, but I shall not be scared, this is solved with a simple regex.
function magicBytesJPG($file):bool
{
    if(!$handle = fopen($file, 'rb')) return FALSE;
    // Read the first 12 bytes once; the 4-byte signatures are a prefix of them
    $readBytes12 = fread($handle, 12);
    fclose($handle);
    if(!$readBytes12) return FALSE;
    $readBytes12 = strtoupper(bin2hex($readBytes12));
    $readBytes4 = substr($readBytes12, 0, 8); // 4 bytes = 8 hex characters
    // It must be one of these:
    if($readBytes4 === "FFD8FFDB" OR $readBytes4 === "FFD8FFEE"
    OR $readBytes4 === "FFD8FFE0"
    OR $readBytes12 === "FFD8FFE000104A4649460001"
    OR preg_match("/^FFD8FFE1[A-F0-9]{4}457869660000/", $readBytes12)) {
        return TRUE;
    }
    return FALSE;
}
Now one might ask, why not simply read a bigger chunk and compare that? At the beginning I just read the first 20 bytes or so of each file, regardless of its type, and tried to compare the values with the known signatures.
But I got results which were to me equally unexpected and confusing.
At first I thought the byte values (or at least their hexadecimal translations) change depending on how many bytes you read. They don't. The hex string just gets longer, so an exact comparison against a 4-byte signature can never match, and every further fread() on the same handle continues where the previous one stopped instead of starting at the beginning again. That is why I compare only the first 8 hex characters for the short signatures.
An engineer may laugh at me now, but hey, I am still learning!
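A small experiment shows what consecutive fread() calls on one handle return: the bytes of a file do not change with the read length, but each call continues where the previous one stopped. The sketch assumes a made-up 16-byte JPEG-like header written to a temporary file; the variable names are only for illustration.

```php
<?php
// Write a hypothetical 16-byte header that starts like a JFIF file.
$path = tempnam(sys_get_temp_dir(), 'sig');
file_put_contents($path, "\xFF\xD8\xFF\xE0\x00\x10JFIF\x00\x01\x01\x01\x00\x48");

$handle  = fopen($path, 'rb');
$first12 = fread($handle, 12); // bytes 1 to 12
$next4   = fread($handle, 4);  // bytes 13 to 16, NOT the first four
rewind($handle);               // move the file pointer back to the start
$first4  = fread($handle, 4);  // now really bytes 1 to 4
fclose($handle);
unlink($path);

echo strtoupper(bin2hex($first12)), PHP_EOL; // FFD8FFE000104A4649460001
echo strtoupper(bin2hex($next4)), PHP_EOL;   // 01010048
echo strtoupper(bin2hex($first4)), PHP_EOL;  // FFD8FFE0
```

So either rewind() between reads, or read once and compare prefixes of the result.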
One last thing before we finish: Bitmaps are fairly uncomplicated.
function magicBytesBMP($file):bool
{
    if(!$handle = fopen($file, 'rb')) return FALSE;
    $readBytes = fread($handle, 2);
    fclose($handle);
    if(!$readBytes) return FALSE;
    // file signature bitmap "42 4D" (2 Bytes always)
    if(strtoupper(bin2hex($readBytes)) === "424D") {
        return TRUE;
    }
    return FALSE;
}
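Since all of these functions have the same shape, they could be folded into one table-driven check. The following is only a sketch: IMAGE_SIGNATURES and matchesSignature() are hypothetical names. The GIF, JPG and BMP patterns are the signatures used above; the PNG ("89 50 4E 47 0D 0A 1A 0A") and WebP ("RIFF", four size bytes, "WEBP") patterns are the published signatures of those formats, which I have not covered in this post.

```php
<?php
// Hypothetical table: extension => regex over the upper-case hex of the first 12 bytes.
const IMAGE_SIGNATURES = [
    'gif'  => '/^47494638(37|39)61/',
    'jpg'  => '/^FFD8FF(DB|EE|E0|E1[0-9A-F]{4}457869660000)/',
    'jpeg' => '/^FFD8FF(DB|EE|E0|E1[0-9A-F]{4}457869660000)/',
    'bmp'  => '/^424D/',
    'png'  => '/^89504E470D0A1A0A/',
    'webp' => '/^52494646[0-9A-F]{8}57454250/',
];

function matchesSignature(string $path, string $ext): bool
{
    $pattern = IMAGE_SIGNATURES[$ext] ?? null;
    if ($pattern === null) {
        return false; // unknown extension: nothing to compare against
    }
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $head = fread($handle, 12); // longest signature in the table is 12 bytes
    fclose($handle);
    if ($head === false || $head === '') {
        return false;
    }
    return preg_match($pattern, strtoupper(bin2hex($head))) === 1;
}
```

A call like matchesSignature($_FILES["file"]["tmp_name"], $ext) would then replace the separate magicBytes functions.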
So I learned a new technique to better control which files get uploaded to the server.
I'm curious what everyone else thinks: are the magic bytes useful?
You may now think, what the hell did I just read?! Was that a tutorial, a presentation or an elaborate question?
I'll tell you, it was my first blog post. Hope you enjoyed.
Top comments (3)
I don't want to make you more paranoid, but there's nothing stopping someone uploading, for example, a PHP file that starts like this:
and calling it "foo.jpg"
Then checking the file extension, you think it's a jpeg.
Checking the file size seems reasonable.
Checking the magic bytes... well:
It reads as a big GIF.
And finally, running it:
It still works as a PHP file.
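A minimal sketch of the experiment described above, working on a string instead of an uploaded file. startsWithGifSignature() is a hypothetical stand-in for the post's magicBytesGIF(), and the payload is a made-up example.

```php
<?php
// Hypothetical stand-in for magicBytesGIF(), working on a string.
function startsWithGifSignature(string $data): bool
{
    return strncmp($data, 'GIF87a', 6) === 0
        || strncmp($data, 'GIF89a', 6) === 0;
}

// A text file whose first six bytes happen to be a valid GIF signature.
$payload = 'GIF89a<?php echo "still PHP"; ?>';

$looksLikeGif = startsWithGifSignature($payload); // the signature check passes
$containsPhp  = str_contains($payload, '<?php');  // yet PHP code is inside

// A GIF parser reads the four bytes after the signature as little-endian
// width and height, which here are the characters "<?ph".
$dims = unpack('vwidth/vheight', substr($payload, 6, 4));
echo $dims['width'], ' x ', $dims['height'], PHP_EOL; // 16188 x 26736
```

The huge dimensions are why it reads as a big GIF. A dimension limit would reject this particular file, but an attacker can just as well write plausible width and height bytes before the code.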
This is how so many Wordpress sites got hacked all the time, through unsecured WYSIWYG editors that let you upload images, and then those "images" could be run straight out of the known uploads directory.
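One common way to take the sting out of this, sketched under assumptions: never keep the client's file name, and store the file under a random name with a whitelisted extension, ideally in a directory the web server neither executes nor serves directly. safeStorageName() and the target path in the comment are hypothetical.

```php
<?php
// Hypothetical helper: random 32-character hex name plus an already validated extension.
function safeStorageName(string $ext): string
{
    return bin2hex(random_bytes(16)) . '.' . $ext;
}

// Possible use inside an upload handler (path is an assumption):
// move_uploaded_file($_FILES["file"]["tmp_name"], "/var/uploads/" . safeStorageName($ext));
```

That way a name like "foo.php" or "foo.jpg.php" never reaches the disk, whatever the file contains.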
Oh I just tested it and you're right!
Just as I thought, byte signature can be bypassed as well ...
But I am confused about how the PHP file that snuck in pretending to be an image, can be executed on the server?
If a file claims to be an image file and you call its URL, the browser won't run the PHP code on the server, but tries to display the image, which fails, right?
I admit now that I mention it, it makes the byte signature check a bit redundant.
The reason I started this topic is just that I think it is not enough to just validate the filename and size, when I'm using any upload function on my website.
Maybe my research so far already gave me the right idea: There is never 100% security. Guess I have to keep searching for new techniques.
Thanks for your comment, and don't worry nobody makes me paranoid like myself lol
Hey Yassine! Great article! And beautiful that you take care of secure file uploads. But Ben Sinclair is right: you must do the exact opposite. Hackers use the Magic Numbers to fake an image file type and upload a hidden webshell instead.
So what you must do is detect malicious code in a file that claims to be an image. That is not so easy because webshells may have hundreds of different signatures and may also be Base encoded.
Not sure if this library does it perfectly. But it looks promising: github.com/jvoisin/php-malware-finder
Anyway, something that might be of interest to you is tryhackme.com. Especially the OWASP rooms are awesome to learn and understand how to develop secure web apps.
Best of luck!