MIME type handling

A project log for micro HTTP server in C

Connect your browser to your smart devices, using a minimalist HTTP compliant server written in POSIX/C

Yann Guidon / YGDES, 03/21/2017 at 23:37

There's a "little detail" that I have to examine right now.

For now, the response headers are hardwired, so I can specify the MIME type inside a single, long character string. That works because there are only 2 or 3 responses.

When the server serves files from the local filesystem, the response headers must also contain the MIME type, which is unknown. Any type of file could be served. And I don't want to be bothered with this.

Usually, classic servers like Apache manage a (long) list of file types, recognised by the filename's extension. This means : for each request, isolate the extension (don't get caught in traps such as PHP files with options) then look it up in the (configurable) list. How inconvenient.
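For comparison, here is a minimal sketch of that classic extension lookup. The table contents, the fallback type and the name mime_from_extension() are assumptions made for illustration; a real server's list is much longer and configurable.

```c
#include <string.h>
#include <strings.h>   /* strncasecmp() (POSIX) */

/* Hypothetical, deliberately tiny table. */
struct mime_entry { const char *ext; const char *type; };

static const struct mime_entry mime_table[] = {
    { "html", "text/html"  },
    { "css",  "text/css"   },
    { "txt",  "text/plain" },
    { "png",  "image/png"  },
    { "jpg",  "image/jpeg" },
};

/* Returns the MIME type for a path, or a generic fallback
 * when there is no extension or it is not in the table. */
const char *mime_from_extension(const char *path)
{
    size_t end = strcspn(path, "?");   /* ignore "?options" after the name */
    size_t dot = end;

    /* walk back to the last '.' of the last path component */
    while (dot > 0 && path[dot - 1] != '.' && path[dot - 1] != '/')
        dot--;
    if (dot == 0 || path[dot - 1] != '.')
        return "application/octet-stream";

    size_t len = end - dot;
    for (size_t i = 0; i < sizeof mime_table / sizeof mime_table[0]; i++) {
        if (strlen(mime_table[i].ext) == len &&
            strncasecmp(mime_table[i].ext, path + dot, len) == 0)
            return mime_table[i].type;
    }
    return "application/octet-stream";
}
```

Even this toy version needs a table, a string scan and a special case for the "?" trap, on every single request.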

For the sake of simplicity, I don't want to handle the type myself. I want the developer/maintainer to do it explicitly, which saves some effort on my side. The metadata must then reside on the filesystem...


My first idea was to add one small metadata file for each data file, using the same filename but with the added ".type" suffix.

That's ugly. Apple's Macs do something similar (with the ._ prefix) and it's horrible, AND the maintainer must create these, one file for each chunk of data...

These tiny MIME type files might not have to reside in the same directory as their data. The data can go in a /data directory and the corresponding types in the /type directory... then a script would populate /type once and for all.

The type files can be simple symbolic links to a few actual files, or even dangling symbolic links whose target string is the actual metadata/type. But that's still one additional file for each content file.
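The dangling-symlink variant can be read with readlink(), which returns the link's target string without trying to follow it. A minimal sketch, where the function name read_type_link() and the "<datafile>.type" naming are assumptions for illustration:

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <unistd.h>

/* Reads the MIME type stored as the target string of the symlink
 * `type_path` (hypothetical naming: "<datafile>.type").
 * Returns the length of the type, or -1 on error or possible truncation. */
ssize_t read_type_link(const char *type_path, char *buf, size_t bufsize)
{
    if (bufsize == 0)
        return -1;

    ssize_t n = readlink(type_path, buf, bufsize - 1);
    if (n < 0 || (size_t)n >= bufsize - 1)
        return -1;          /* error, or the target may have been cut short */

    buf[n] = '\0';          /* readlink() does not NUL-terminate */
    return n;
}
```

Such a link could be created with "ln -s image/png logo.png.type" : the target does not have to exist.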


The number of files can be reduced if the filename contains the metadata as well. However this greatly complicates the procedure from the server's point of view. All I want to do is call stat() to see if the file exists, then read() it. If I don't know the end of the file's name in advance, I have to manually scan the directory for a filename starting with the given string. Meh.

On top of that, it's not really practical when the files are deployed because their names must be changed.


Where can we put the metadata then ? Let's look at what stat() tells us :

struct stat {
    dev_t     st_dev;     /* ID of device containing file */
    ino_t     st_ino;     /* inode number */
    mode_t    st_mode;    /* protection */
    nlink_t   st_nlink;   /* number of hard links */
    uid_t     st_uid;     /* user ID of owner */
    gid_t     st_gid;     /* group ID of owner */
    dev_t     st_rdev;    /* device ID (if special file) */
    off_t     st_size;    /* total size, in bytes */
    blksize_t st_blksize; /* blocksize for file system I/O */
    blkcnt_t  st_blocks;  /* number of 512B blocks allocated */
    time_t    st_atime;   /* time of last access */
    time_t    st_mtime;   /* time of last modification */
    time_t    st_ctime;   /* time of last status change */
};

One dirty, dirty way would be to put it in the group ID.

OOooohhh that's naughty and the setup would be sordid. Additionally, the user/group names can't contain the dash and slash characters !


Yet another approach uses symbolic links.

All the site's files would be links to actual files, each located in a directory dedicated to its MIME type. The server just follows the symlink, gets the MIME type from the directory name, then reads/serves the file. This looks pretty easy for the server but administration (and site development) would be terrible, as this breaks the directory structures...


________________________________________________________________________________
The constraints are piling up:

Ideally, no change should be required to make a directory ready to serve.

The metadata must be looked up once (during site installation), probably with a single script.

What does it do ? It scans the directories and checks for known filename extensions. But then where does it write it ?

Again, I want to use the filesystem as a sort of database, so all the searches and buffering are handled by the kernel. But what form should this take ?

One thing I know is that there is no advantage in using symbolic links over plain files because the data are so small. Symlinks are actual files that contain another filename and the sizes are almost the same : a filename is 6 to 12 characters long on average, a MIME type string is 8 to 20 characters long, and either way a file usually gobbles a whole block (4096 bytes) on disk !

There is however an alternative to symbolic links : hard links. They just point to the inode of a file that contains the desired data. No room is wasted.

Still, where could these hardlinks be stored ? Taking the URL from the HTTP request, there are 2 possibilities :

The second method is a bit more complex : you have to scan the URL string, make sure there is no buffer overflow, and the result would litter the site with pseudofiles...

The first method is easier in many respects : it creates a 3-character subdirectory where the whole site's structure is mirrored. The metadata filename can be created directly by overwriting the four characters "GET " with a space and the fixed-length directory name, so "GET /path" becomes " .../path". For this, I chose to use three dots : ".../" because it is very discreet ! (try it : "mkdir ...")
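The in-place substitution can be sketched as follows. The function name meta_path_in_place() is an assumption for illustration, and the URL is assumed to be NUL-terminated already.

```c
#include <string.h>

/* Overwrites the 4 characters "GET " at the start of `req` with " ..."
 * and returns a pointer to the metadata path (".../path"), or NULL if
 * the request does not start with "GET /". */
char *meta_path_in_place(char *req)
{
    if (strncmp(req, "GET /", 5) != 0)
        return NULL;

    memcpy(req, " ...", 4);   /* same length, so nothing has to move */
    return req + 1;           /* skip the leading space */
}
```

The data filename is still available at req + 5, just after the slash, so both names share the same buffer and no copy is needed.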

Creating the hardlinks may require superuser rights, depending on who owns the files and how the system is configured. Usually this is not a problem : embedded systems need root access anyway, and Raspbian lets anybody use sudo. Another issue is that hardlinks can't cross filesystems, so they must be created on the target filesystem, and they may not survive a trip through an archive file (.zip in particular).

If root rights can't be granted, or if the files are not processed on the target, symlinks still work even though they are not optimal.
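An installation tool could try the hardlink first and fall back to a symlink, as in this minimal sketch. The name attach_type() and the idea of one shared file per MIME type are assumptions for illustration; `type_file` is assumed to be an absolute path, because a relative symlink target would be resolved from the link's own directory.

```c
#define _POSIX_C_SOURCE 200809L
#include <unistd.h>

/* Makes `meta_path` refer to the shared type file `type_file`.
 * Tries a hard link first, then falls back to a symbolic link.
 * Returns 0 on success, -1 if both attempts fail. */
int attach_type(const char *type_file, const char *meta_path)
{
    if (link(type_file, meta_path) == 0)
        return 0;

    /* link() can fail with EPERM (not allowed) or EXDEV (other filesystem) */
    return symlink(type_file, meta_path);
}
```

Either way, the server can open() and read() `meta_path` exactly the same way.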


For the server, the algorithm is :

  1. Ensure the request starts with "GET /"
  2. Scan the URL, parse and make sure the string is reasonable (bounded in size, no special characters). Escape sequences may or may not be processed. Not processing them is safer, so complain if a "%" is encountered. Also complain if you find two dots in a row, to prevent going into the "..." directory or back up in the filesystem.
  3. Find the first space character after the URL and replace it with a NUL character ('\0'). There, the URL is turned into a (relative) filename (if you skip the initial slash).
  4. open() said filename.
  5. fstat() said file to get the size and access rights : check the owner
    (note : open() goes first because fstat() on the open descriptor saves one filename lookup, compared to stat() followed by open())
  6. Overwrite the four characters "GET " with " ..." (a space and three dots) : the metadata filename starts at the second character.
  7. open() said filename.
  8. read() the contents. If successful, dump them into the response header.
  9. read() the data/contents file and write() it to the network
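The steps above can be sketched as two small functions. This is only a sketch under several assumptions that are not in the log : the names split_request(), copy_fd() and serve(), the MAX_URL bound, the hardwired status line, and metadata files that hold complete header lines such as "Content-Type: image/png" followed by CR LF. The owner check of step 5 is reduced to a "regular file" test, and the current directory is assumed to be the site's root.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAX_URL 255   /* assumed bound */

/* Steps 1-3 and 6: validate the request line in place.
 * On success, *data points to "path" and *meta to ".../path". */
int split_request(char *req, char **data, char **meta)
{
    if (strncmp(req, "GET /", 5) != 0)
        return -1;

    char *p = req + 5;
    size_t n = 0;
    for (; p[n] != ' '; n++) {
        unsigned char c = (unsigned char)p[n];
        /* rejects NUL, control and non-ASCII characters, and escapes */
        if (c < 0x21 || c >= 0x7f || c == '%' || n >= MAX_URL)
            return -1;
        if (c == '.' && p[n + 1] == '.')
            return -1;              /* no "..", hence no "..." either */
    }
    if (n == 0)
        return -1;                  /* bare "GET /" is not handled here */

    p[n] = '\0';
    memcpy(req, " ...", 4);
    *data = p;
    *meta = req + 1;
    return 0;
}

static int copy_fd(int in, int out)
{
    char buf[1024];
    ssize_t n;
    while ((n = read(in, buf, sizeof buf)) > 0)
        if (write(out, buf, (size_t)n) != n)
            return -1;              /* short write treated as an error */
    return n < 0 ? -1 : 0;
}

/* Steps 4, 5, 7, 8, 9: returns 0 when headers and data were sent to `out`. */
int serve(int out, const char *data, const char *meta)
{
    static const char status[] = "HTTP/1.1 200 OK\r\n";
    struct stat st;
    int rc = -1;
    int mfd = -1;
    int dfd = open(data, O_RDONLY);                   /* step 4 */
    if (dfd < 0)
        return -1;

    if (fstat(dfd, &st) == 0 && S_ISREG(st.st_mode)   /* step 5 */
        && (mfd = open(meta, O_RDONLY)) >= 0          /* step 7 */
        && write(out, status, sizeof status - 1) == (ssize_t)(sizeof status - 1)
        && copy_fd(mfd, out) == 0                     /* step 8 */
        && write(out, "\r\n", 2) == 2                 /* end of headers */
        && copy_fd(dfd, out) == 0)                    /* step 9 */
        rc = 0;

    if (mfd >= 0)
        close(mfd);
    close(dfd);
    return rc;
}
```

A caller would run split_request() on the received buffer, then serve(socket, data, meta), and answer with an error page whenever either one returns -1.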

That sounds reasonable...

Furthermore, since the metadata file's contents are sent as-is in the header, we can add special header data, stored along with the MIME type, on a file-by-file basis. For example : static data (CSS, HTML...) can be pre-compressed and the associated metadata file can include the necessary headers, which are sent along with the type.


Epilog: see MIME sniffing

https://mimesniff.spec.whatwg.org/

Maybe I "could" just avoid sending the MIME type and let the browser sniff it, but that's not a good idea.



20170430:

I might have found an even lazier way to specify the MIME type ! It's possible because its format is very stable.

Just include it in the URL: if it starts with the keyword "MIME/" then parse the following 2 words, then the rest is the file's path...

No need to touch ANYTHING on the filesystem. The HTML files however must be adapted....

This causes other issues with the HTML files : relative paths are not possible anymore. One way to solve this is to put the type after the path, for example behind a "??" keyword. Example:

GET /path/path/filename.png??image/png
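Splitting such a URL could look like the sketch below. The name split_mime_suffix() and the accepted character set are assumptions for illustration. Since the type ends up in the response header, the sketch is deliberately strict and refuses anything that is not a plain "word/word", which also keeps CR and LF out of the header.

```c
#include <ctype.h>
#include <string.h>

/* Splits "path??type/subtype" in place at the first "??".
 * Returns a pointer to the MIME type, or NULL when the marker is
 * missing or the type does not look like "word/word". */
char *split_mime_suffix(char *url)
{
    char *mark = strstr(url, "??");
    if (mark == NULL)
        return NULL;

    char *type = mark + 2;
    size_t slashes = 0, i;
    for (i = 0; type[i] != '\0'; i++) {
        char c = type[i];
        if (c == '/')
            slashes++;
        else if (!isalnum((unsigned char)c) && c != '+' && c != '-' && c != '.')
            return NULL;            /* spaces, CR, LF... are all refused */
    }
    /* exactly one slash, neither first nor last */
    if (slashes != 1 || type[0] == '/' || type[i - 1] == '/')
        return NULL;

    *mark = '\0';                   /* the URL now ends where the path ends */
    return type;
}
```

With the example above, the URL is cut down to "/path/path/filename.png" and the function returns "image/png".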
