Dec 12, 2021

What makes a good file name

One of the topics included in a research data management workshop with which I am involved is "what is a good file name". The question may seem trivial, but naming files is an everyday practice that has real consequences for data usability and is rarely discussed.

When looking for existing recommendations, we found a popular presentation Naming Things by Jenny Bryan, a related presentation Project Structure by Danielle Navarro, and a collection of File naming best practices from the MIT Libraries. It was interesting to see that, although the three sources agree on core principles, they make different specific recommendations. So the question "what is a good file name" is not as trivial as it might first seem. In this post I pick apart a few of these recommendations to better understand where they might come from and whether I agree with them.

If you are interested in the subject, the discussion also spawned an entertaining and comprehensive appendix to the DataLad Handbook written by my colleague – How to name a file: Interoperability considerations.

General principles

The three sources listed at the beginning agree that the goal of file naming is to make life easier when working with files recorded a while ago. The three rules they give for file names are best summarised as:

machine readable,
human readable,
easy to sort and search,

and it's hard not to agree on that.

Spaces are bad, but what about hyphens?

Sources also agree that spaces are best avoided. This comes, mainly, from command line usage, which is a very popular way of interacting with data. In a command line, space is used to separate arguments of a command, so a file name (single argument) with spaces needs to be put in quotation marks "like this" or written with an escape character like\ this, which is plainly annoying. Even bigger problems await when a file name needs to be passed between programs, and you may find yourself escaping quotation symbols or escape characters themselves.
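The effect of a space on argument splitting can be reproduced without opening a terminal. The sketch below uses Python's standard shlex module, which splits a string roughly the way a POSIX shell would; the file name like this.txt and the helper split_command are made up for illustration.

```python
import shlex


def split_command(command: str) -> list[str]:
    """Split a command line into arguments using POSIX shell rules."""
    return shlex.split(command)


# Unquoted: the space splits the file name into two separate arguments.
unquoted = split_command("cat like this.txt")

# Quoted or escaped: the file name survives as a single argument.
quoted = split_command('cat "like this.txt"')
escaped = split_command(r"cat like\ this.txt")

# shlex.quote adds the quoting needed to pass a name safely to a shell.
safe = shlex.quote("like this.txt")

print(unquoted)  # ['cat', 'like', 'this.txt']
print(quoted)    # ['cat', 'like this.txt']
print(escaped)   # ['cat', 'like this.txt']
print(safe)      # 'like this.txt'
```

A name without spaces, such as like_this.txt, needs none of this.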

A common recommendation, with which I agree, is to replace spaces with underscores and hyphens. The two can be combined and given different meanings, with underscores separating units of information, and hyphens separating words within a unit for better readability. For example, sample_green_city-landscape.jpg has three units of information: category (sample), color (green) and content description (city landscape). Every common programming language has a facility for easily splitting the name into these three blocks.
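As one possible illustration in Python, the example name can be split with nothing more than the standard library. The helper name split_units is hypothetical.

```python
from pathlib import Path


def split_units(filename: str) -> list[str]:
    """Drop the extension and split the name into underscore-separated units."""
    stem = Path(filename).stem
    return stem.split("_")


units = split_units("sample_green_city-landscape.jpg")
print(units)  # ['sample', 'green', 'city-landscape']

# Hyphens separate the words inside a single unit.
words = units[2].split("-")
print(words)  # ['city', 'landscape']
```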

This can be taken further: the name can be constructed from key-value pairs, with hyphens separating key from value, and underscores separating pairs. This makes the pattern self-explanatory and can be useful if there are many possible units of information, but different subsets are used between files. For example, the previous name could be rewritten as kind-sample_color-green_content-citylandscape.jpg. Again, a programming language of your choice will have a key-value based representation and an easy way to create it from this name.
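A minimal Python sketch of this idea turns the name into a dictionary and back. The function names parse_name and build_name are my own, and the sketch assumes that every unit contains a hyphen and that keys contain neither hyphens nor underscores.

```python
from pathlib import Path


def parse_name(filename: str) -> dict[str, str]:
    """Parse 'key-value_key-value.ext' into a dictionary (extension dropped)."""
    stem = Path(filename).stem
    result = {}
    for unit in stem.split("_"):
        # Split on the first hyphen only, so a value may itself contain hyphens.
        key, value = unit.split("-", 1)
        result[key] = value
    return result


def build_name(fields: dict[str, str], extension: str) -> str:
    """Build a file name from key-value pairs, keeping their insertion order."""
    stem = "_".join(f"{key}-{value}" for key, value in fields.items())
    return f"{stem}.{extension}"


fields = parse_name("kind-sample_color-green_content-citylandscape.jpg")
print(fields)
# {'kind': 'sample', 'color': 'green', 'content': 'citylandscape'}

print(build_name(fields, "jpg"))
# kind-sample_color-green_content-citylandscape.jpg
```

A unit without a hyphen raises a ValueError here, which may or may not be the behaviour you want for files that do not follow the pattern.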

Interestingly, one of the sources recommended that only underscores should be used, and hyphens must be avoided. The reasoning? That it's easy to confuse a hyphen (-) with an en dash (–, a typographic character used, for example, in ranges of numbers) or an em dash (—, a typographic character often used to indicate a pause in a sentence). I admit, I have fallen prey to a dash in a file name once, and it took me a while to find out why the file name did not match. However, pretty much the only way a dash can find its way into a file name is when it gets copied from a document editor; the keyboard only has a hyphen. In my opinion, the advantages of having two word-separating characters far outweigh this risk, and I was surprised to see an assertive statement "don't use hyphens and dashes".
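If a stray dash is suspected, it can be found programmatically rather than by squinting at the screen. The sketch below relies on the Unicode general category "Pd" (dash punctuation), which covers the plain hyphen as well as the en and em dashes; the helper find_unicode_dashes and the example names are hypothetical.

```python
import unicodedata


def find_unicode_dashes(filename: str) -> list[tuple[int, str]]:
    """Return (position, Unicode name) of each dash that is not a plain hyphen."""
    return [
        (index, unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(filename)
        if unicodedata.category(char) == "Pd" and char != "-"
    ]


# "\u2013" is an en dash, written as an escape so that it stays visible in code.
suspicious = find_unicode_dashes("sample_green_city\u2013landscape.jpg")
clean = find_unicode_dashes("sample_green_city-landscape.jpg")

print(suspicious)  # [(17, 'EN DASH')]
print(clean)       # []
```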

Does a date belong in a name?

One recommendation that seems pervasive is to start a filename with a date, formatted according to ISO 8601, e.g. 2021-12-10-file-naming.org. It's hard to deny that if a date is used, ISO 8601 is the way to go (as alphabetic sorting produces chronological order). However, there is probably too much emphasis on including the date at all.
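The sorting property is easy to check. In the sketch below, all file names except 2021-12-10-file-naming.org are invented for illustration, and the day-first list is only there for contrast.

```python
from datetime import date, datetime

iso_names = [
    "2021-12-10-file-naming.org",
    "2022-01-05-second-post.org",
    "2021-02-28-first-post.org",
]

# The same (hypothetical) files with a day-month-year prefix instead.
dmy_names = [
    "10-12-2021-file-naming.org",
    "05-01-2022-second-post.org",
    "28-02-2021-first-post.org",
]


def iso_date(filename: str) -> date:
    """Read the leading YYYY-MM-DD date of a file name."""
    return date.fromisoformat(filename[:10])


def dmy_date(filename: str) -> date:
    """Read the leading DD-MM-YYYY date of a file name."""
    return datetime.strptime(filename[:10], "%d-%m-%Y").date()


# With ISO 8601, plain alphabetic sorting is already chronological.
print(sorted(iso_names) == sorted(iso_names, key=iso_date))  # True

# With day-first dates, alphabetic order and chronological order disagree.
print(sorted(dmy_names) == sorted(dmy_names, key=dmy_date))  # False
```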

If the date is essential information (e.g. weather data collected daily) or the chronology is of particular importance (e.g. meeting notes, blog posts), then sure. Other than that? Probably redundant in most cases. Arguably, it may even be considered bad practice, especially when used to represent changes over time (honestly, that's hardly different from adding "_v1", "_v2", etc.); that problem is better solved by using actual version control. Perhaps some of the drive for including the date comes from the fact that the creation date may or may not survive file transfer between computers. But again, since version control tools associate each change with a date, this problem is also solved.

Is numbering helpful?

Another common recommendation is to include a number at or near the beginning: 01_this or analysis02_that. As with dates, one thing is for sure: if you are including numbers and there is a chance of reaching 10 or more, use zero padding (because alphabetic sorting compares names character by character, 10_x will land before 2_x but not before 02_x). But I'm not certain that the recommendation is universal. In some cases, adding numbers will certainly be easy and look nice. In others, however, you won't know the definitive order in advance, and if something needs to be inserted in the middle, you will have to rename all subsequent files one by one. In these cases, at least for code, an alternative strategy would be to record the correct order in a free-form description (readme file) or some sort of master script (e.g. makefile). And again, using version control provides an additional perspective. If the outcome of including numerals is 02_first-attempt and 04_an-improved-approach, then it really could just be the-analysis, with the evolution stored in version control history.
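The zero-padding point can be verified in a few lines of Python. The names below follow the 10_x and 2_x pattern used above and are otherwise arbitrary.

```python
numbers = [1, 2, 10]

# Without padding, character-by-character comparison puts 10 before 1 and 2.
unpadded = sorted(f"{n}_x" for n in numbers)
print(unpadded)  # ['10_x', '1_x', '2_x']

# With two-digit zero padding, alphabetic order matches numeric order.
padded = sorted(f"{n:02d}_x" for n in numbers)
print(padded)  # ['01_x', '02_x', '10_x']
```

Two digits are enough for up to 99 files; the width in the format specification (02d) would need to grow for longer series.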