Using PHP to help parse any file

The internal structure of a data file matters for several reasons. Chief among them is the ability to build extensions that rely on, or at least deeply understand, that structure. If you do not know the structure, you cannot extract the data in an open manner, you cannot modify the data except through approved channels (i.e. the vendor's applications), and you cannot create data-centered extensions.

Authors

Written 2010 by Will Johnson for Fast Forward Technologies
Email Fast Forward Technologies at fft2001@aol.com
or post your comments for public view far below.
Creative Commons Attribution 3.0 License


Obviously, if a data file contains only ASCII character strings and is comma-delimited, tab-delimited, bar-delimited, or delimited in some other clear and obvious way, then we do not need a parsing helper. Opening the data file in Notepad shows just ASCII character strings, and the delimiters stand out. Sometimes, however, software vendors try to be a bit tricky and use some obscure form of data storage. That is when you need a program like the one we're about to build.

To enable our parsing-helper program to be used as a general service, we'll first create a form. This will allow a web reader to enter any URL, which PHP can open just like a local file (provided the server's allow_url_fopen setting is enabled). In addition, the web server owner could enter the name of any file local to the server, which would be opened in the same manner. Any reader of this article who copies the entire code could then provide this parsing service to anyone else, or use it for their own local files.

When the submit button is clicked, we want to call our special PHP file parser. So our initial form, which we may call ParseMyFile.html, just looks like this.

<html><head><title>Parse a file showing hex and text</title></head>
<body>
    <form method="POST" action="showfile.php">
    File: <input type="text" name="filename"><br>
    <input type="reset"><input type="submit"><br>
    </form>
</body>
</html>

Now we will create the file showfile.php in the same directory. The first thing to realize is that we're going to display a line of hex above a line of text, and then repeat that pair all the way down the page. We want the columns to line up properly, so we want to use a fixed-width font. Most fonts use what is called "proportional spacing", meaning that wide letters get wide spaces and thin letters like "i" and "t" get thin spaces. We don't want that! It would keep the columns from lining up from top to bottom. In contrast to "proportional spacing", some fonts use "fixed-width spacing", where every letter gets the same number of pixels as every other letter. One such font is called "Courier", so let's set our font to Courier with a command like this, and then turn on PHP processing.
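To make the "line of hex above a line of text" idea concrete, here is a minimal sketch of how one chunk of bytes could be turned into two aligned lines. It is an illustration only, not the code this article builds: the function name hex_and_text_lines is hypothetical, and showing non-printable bytes as a dot is an assumption made for the example.

```php
<?php
// Hypothetical helper (not part of the article's showfile.php):
// returns the hex line and the text line for one chunk of bytes.
function hex_and_text_lines(string $chunk): array
{
    $hexline = '';
    $textline = '';
    $len = strlen($chunk);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($chunk[$i]);
        // Two hex digits plus one space: every byte occupies three columns.
        $hexline .= sprintf('%02X ', $byte);
        // Assumption: printable ASCII is shown as-is, anything else as a dot.
        $char = ($byte >= 32 && $byte <= 126) ? $chunk[$i] : '.';
        // Pad the character to the same three columns.
        $textline .= $char . '  ';
    }
    return [$hexline, $textline];
}

list($hex, $text) = hex_and_text_lines("Hi\n!");
echo $hex, "\n", $text, "\n";
```

Because each byte takes exactly three columns on both lines, the two lines only stay aligned in a fixed-width font. In a browser, runs of spaces also collapse unless the output is wrapped in a pre element or the spaces are emitted as non-breaking spaces.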

<html><body><div style = "font-family:Courier">
<?php


The first thing we want to do is get the file name that was passed in. To do that, we execute this command:

$filename = $_POST["filename"];

You can see in the form above that we gave our File input field the name "filename"; that's why we can use "filename" here as the key of the $_POST array element. The keys of the $_POST array are simply the names of your form fields.
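If the page is requested without the form, the "filename" key will not be present in $_POST. A minimal sketch of a more defensive lookup might look like the following; the helper name posted_field is hypothetical, and the simulated array stands in for $_POST so the example can run on its own.

```php
<?php
// Hypothetical helper: fetch one field from a POST-style array.
// Returns null when the field is missing, not a string, or empty after trimming.
function posted_field(array $post, string $name): ?string
{
    if (!isset($post[$name]) || !is_string($post[$name])) {
        return null;
    }
    $value = trim($post[$name]);
    return $value === '' ? null : $value;
}

// Simulated request data; in showfile.php this would be $_POST.
$fake_post = ['filename' => '  data.bin '];
var_dump(posted_field($fake_post, 'filename')); // string(8) "data.bin"
var_dump(posted_field($fake_post, 'missing'));  // NULL
```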

PHP's fopen function can open a URL just as readily as a local file. With a local file, however, we can check whether it exists and complain if it doesn't. So let's do this to distinguish between a user-entered URL and a user-entered local file name:

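// Only a local name can be checked with file_exists(); a URL goes straight to fopen()
if (substr($filename, 0, 7) != "http://") {
  if (!file_exists($filename)) {
    exit("file ".$filename." doesn't exist");
  }
}
// fopen() requires a mode argument; "rb" means read-only, binary
$handle = fopen($filename, "rb");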
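The same idea can be wrapped in small functions that also accept https:// addresses and report a failed open instead of continuing with an invalid handle. This is only a sketch under stated assumptions: looks_like_url and open_for_parsing are hypothetical names, and opening a URL this way only works when allow_url_fopen is enabled on the server.

```php
<?php
// Hypothetical helper: true when the name starts with http:// or https://.
function looks_like_url(string $name): bool
{
    return preg_match('#^https?://#i', $name) === 1;
}

// Hypothetical helper: open a local file or URL for binary reading.
// Returns a stream resource, or null when the source cannot be opened.
function open_for_parsing(string $name)
{
    if (!looks_like_url($name) && !is_file($name)) {
        return null; // a local name that does not exist
    }
    // "rb" = read-only, binary. The @ hides PHP's warning so the caller
    // can report the failure in its own words.
    $handle = @fopen($name, 'rb');
    return $handle === false ? null : $handle;
}

// Usage with a temporary local file standing in for a user-entered name.
$tmp = tempnam(sys_get_temp_dir(), 'pf');
file_put_contents($tmp, "abc");
$handle = open_for_parsing($tmp);
echo $handle === null ? "could not open\n" : "opened\n";
if ($handle !== null) {
    fclose($handle);
}
unlink($tmp);
```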


Next we want to read the contents in block mode (fread is binary-safe), determine the length of what we just read, and initialize some variables that we're going to use in a loop to parse the contents.


$contents = fread($handle,4096);
$len_contents = strlen($contents);
$chr_pos = 0; $hexline = ''; $textline = ''; $done = false; $loop = 0;


What are these extra variables for?

$len_contents = the length of what we actually read. We cannot assume we got 4096 characters just because we asked for them; we may already be at the end of the file.

$chr_pos = the byte-wise position of our character pointer. We're going to move it one character at a time as we parse the data.

$hexline = where we store our partial hex results until we're ready to print a line

$textline = where we store our partial text results until we're ready to print a line

$done = Boolean that is set to true to end the parsing loop

$loop = the number of times we have had to repeat the fread command as we traverse the entire file. It is incremented each time we decide we need to read more of the file.
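To see why $len_contents and $loop are needed, here is a small, self-contained sketch that reads a source in 4096-byte blocks until nothing is left. It is an illustration under stated assumptions, not the loop this article builds: an in-memory stream of 5000 bytes stands in for the opened file, and the $block_sizes array is a hypothetical addition used only to record what each fread returned.

```php
<?php
// Assumption for the example: an in-memory stream holding 5000 bytes,
// so a single 4096-byte read cannot return everything.
$handle = fopen('php://memory', 'w+b');
fwrite($handle, str_repeat('A', 5000));
rewind($handle);

$block_sizes = []; // hypothetical: records the length of each block read
$loop = 0;
$done = false;
while (!$done) {
    $contents = fread($handle, 4096);
    // fread may return fewer bytes than requested, or nothing at end of file.
    $len_contents = ($contents === false) ? 0 : strlen($contents);
    if ($len_contents === 0) {
        $done = true;                    // nothing left to read
    } else {
        $block_sizes[] = $len_contents;  // a real parser would process $contents here
        $loop++;                         // one more block fetched
    }
}
fclose($handle);

echo "blocks read: $loop\n";
echo "block sizes: ", implode(', ', $block_sizes), "\n";
```

The last block is shorter than the size requested, which is why the parsing loop has to work from the measured length rather than from 4096.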


--> Forward to Using PHP to help parse any file (Page 2)