What makes a good file name
One of the topics included in a research data management workshop with which I am involved is "what is a good file name". The question may seem trivial, but naming files is an everyday practice with real consequences for data usability, and it isn't talked about very often.
When looking for existing recommendations, we found a popular presentation, Naming Things by Jenny Bryan; a related presentation, Project Structure by Danielle Navarro; and a collection of File naming best practices from the MIT Libraries. It was interesting to see that, although the three sources agree on core principles, they make different specific recommendations. So the question "what is a good file name" is not as trivial as it might first seem. In this post I pick apart a few of these recommendations to better understand where they might come from and whether I agree with them.
If you are interested in the subject, the discussion also spawned an entertaining and comprehensive appendix to the DataLad Handbook, written by my colleague and titled How to name a file: Interoperability considerations.
General principles
The three sources listed at the beginning agree that the goal of file naming is to make life easier when working with files recorded a while ago. The three rules they give for file names are best summarised as:
- machine readable,
- human readable,
- easy to sort and search,
and it's hard to disagree with that.
Spaces are bad, but what about hyphens?
Sources also agree that spaces are best avoided. This comes mainly from command line usage, which is a very popular way of interacting with data. On a command line, a space separates the arguments of a command, so a file name (a single argument) that contains spaces needs to be put in quotation marks ("like this") or written with an escape character (like\ this), which is plainly annoying. Even bigger problems await when a file name needs to be passed between programs, and you may find yourself escaping the quotation marks or the escape characters themselves.
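As a rough illustration of the splitting described above, Python's standard shlex module can mimic how a POSIX shell breaks a command line into arguments. The file names here are made up for the example.

```python
import shlex

# Split command lines roughly the way a POSIX shell splits them into arguments.
unquoted = shlex.split("cat my file.txt")
quoted = shlex.split('cat "my file.txt"')
escaped = shlex.split(r"cat my\ file.txt")

print(unquoted)  # ['cat', 'my', 'file.txt'] : the name became two arguments
print(quoted)    # ['cat', 'my file.txt']
print(escaped)   # ['cat', 'my file.txt']

# shlex.quote adds whatever quoting a name needs before it is handed to a shell.
print(shlex.quote("my file.txt"))  # 'my file.txt' (single quotes included)
print(shlex.quote("my_file.txt"))  # my_file.txt (nothing to quote)
```

With an underscore instead of a space, the name passes through unchanged, which is the practical argument for avoiding spaces in the first place.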
A common recommendation, with which I agree, is to replace spaces with underscores and hyphens. The two can be combined and given different meanings, with underscores separating units of information and hyphens separating words within a unit for better readability. For example, sample_green_city-landscape.jpg has three units of information: category (sample), color (green) and content description (city landscape). Every common programming language has a facility for easily splitting the name into these three blocks.
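In Python, for instance, the split could look like the sketch below. The variable names are chosen here for illustration, and the code assumes the name always has exactly three underscore-separated units.

```python
from pathlib import Path

name = "sample_green_city-landscape.jpg"

# Drop the extension, then split on underscores to get the units of information.
stem = Path(name).stem
category, color, content = stem.split("_")

# Within a unit, hyphens separate the individual words.
words = content.split("-")

print(category, color, words)  # sample green ['city', 'landscape']
```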
This can be taken further: the name can be constructed from key-value pairs, with hyphens separating key from value, and underscores separating pairs. This makes the pattern self-explanatory and can be useful if there are many possible units of information, but different subsets are used between files. For example, the previous name could be rewritten as kind-sample_color-green_content-citylandscape.jpg. Again, a programming language of your choice will have a key-value based representation and an easy way to create it from this name.
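A minimal sketch of that conversion in Python, where the key-value representation is a dictionary. The function name parse_key_value_name is hypothetical, and the code assumes every underscore-separated part contains at least one hyphen.

```python
from pathlib import Path


def parse_key_value_name(name):
    """Turn 'key-value_key-value.ext' into a dictionary of keys and values."""
    stem = Path(name).stem
    # Split each pair on the first hyphen only, so a value may contain hyphens.
    return dict(pair.split("-", 1) for pair in stem.split("_"))


info = parse_key_value_name("kind-sample_color-green_content-citylandscape.jpg")
print(info)  # {'kind': 'sample', 'color': 'green', 'content': 'citylandscape'}
```

A part without a hyphen makes dict() raise a ValueError, which can double as a cheap check that a name follows the pattern.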
Interestingly, one of the sources recommended that only underscores should be used, and hyphens must be avoided. The reasoning? That it's easy to confuse a hyphen (-) with an en dash (–, a typographic character used, for example, in ranges of numbers) or an em dash (—, a typographic character often used to indicate a pause in a sentence). I admit, I have fallen prey to a dash in a file name once, and it took me a while to find out why the file name did not match. However, pretty much the only way a dash can find its way into a file name is when it gets copied from a document editor; most keyboards only have a hyphen key. In my opinion, the advantages of having two word-separating characters far outweigh this risk, and I was surprised to see an assertive statement "don't use hyphens and dashes".
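If a stray dash is a concern, it can be detected rather than avoided by banning hyphens. The sketch below relies on the Unicode general category "Pd" (dash punctuation), which covers the hyphen as well as the en and em dashes. The function name and file names are made up for the example.

```python
import unicodedata


def typographic_dashes(name):
    """Return dash characters in name other than the plain ASCII hyphen."""
    return [
        ch for ch in name
        if unicodedata.category(ch) == "Pd" and ch != "-"
    ]


print(typographic_dashes("report-2021.txt"))       # []
print(typographic_dashes("report\u20132021.txt"))  # one en dash (U+2013)
```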
Does a date belong in a name?
One recommendation that seems pervasive is to start a filename with a date, formatted according to ISO 8601, e.g. 2021-12-10-file-naming.org. It's hard to deny that if a date is used, ISO 8601 is the way to go (as alphabetic sorting produces chronological order). However, there is probably too much emphasis on including the date at all.
If the date is essential information (e.g. weather data collected daily) or the chronology is of particular importance (e.g. meeting notes, blog posts), then sure. Other than that? Probably redundant in most cases. Arguably, it may even be considered bad practice, especially when used to represent changes over time (honestly, that's hardly different from adding "_v1", "_v2", etc.); that problem is better solved by using actual version control. Perhaps some of the drive for including the date comes from the fact that the creation date may or may not survive file transfer between computers. But again, since version control tools associate each change with a date, this problem is also solved.
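For the cases where a date does belong in the name, the sorting argument for ISO 8601 is easy to check. In this sketch only the first file name comes from the example above; the other names, and the day-first variants, are invented for comparison.

```python
from datetime import date

iso_names = [
    "2021-12-10-file-naming.org",
    "2020-11-30-first-post.org",
    "2021-02-03-second-post.org",
]
day_first_names = [
    "10-12-2021-file-naming.org",
    "30-11-2020-first-post.org",
    "03-02-2021-second-post.org",
]

# Plain string sorting puts ISO 8601 dates in chronological order...
print(sorted(iso_names))
# ...but sorts day-first dates by day of month, ignoring month and year.
print(sorted(day_first_names))


def leading_date(name):
    """Parse the ISO 8601 date that makes up the first ten characters."""
    return date.fromisoformat(name[:10])


print(leading_date("2021-12-10-file-naming.org"))  # 2021-12-10
```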
Is numbering helpful?
Another common recommendation is to include a number at or near the beginning: 01_this or analysis02_that. As with dates, one thing is for sure: if you are including numbers and there is a chance of reaching 10 or more, use zero padding (because alphabetic sorting compares names character by character, 10_x will land before 2_x but not before 02_x). But I'm not certain that the recommendation is universal.
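The effect of zero padding on sorting can be seen with a few names built in Python. The width of two digits is an assumption that fits up to 99 files.

```python
numbers = (1, 2, 10)

unpadded = [f"{n}_x" for n in numbers]
padded = [f"{n:02d}_x" for n in numbers]  # 02d pads with zeros to two digits

# String sorting compares character by character, so "10_x" comes first.
print(sorted(unpadded))  # ['10_x', '1_x', '2_x']
print(sorted(padded))    # ['01_x', '02_x', '10_x']
```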
In some cases, adding numbers will certainly be easy and look nice. In others, however, you won't know the definitive order in advance, and if something needs to be inserted in the middle, you will have to rename all subsequent files in sequence. In these cases, at least for code, an alternative strategy would be to record the correct order in a free-form description (readme file) or some sort of master script (e.g. a makefile). And again, using version control provides an additional perspective. If the outcome of including numerals is 02_first-attempt and 04_an-improved-approach, then it really could just be the-analysis, with the evolution stored in version control history.
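One possible shape for such a master script, sketched in Python rather than as a makefile. The script names in STEPS are hypothetical; the point is only that the order lives in one list, so inserting a step means adding one entry instead of renaming files.

```python
import sys

# Hypothetical analysis scripts; the list order is the execution order.
STEPS = [
    "clean-data.py",
    "fit-model.py",
    "make-figures.py",
]


def build_commands(steps, interpreter=sys.executable):
    """Return one command (as an argument list) per step, in order."""
    return [[interpreter, step] for step in steps]


if __name__ == "__main__":
    # Print the commands instead of running them; each list could be
    # passed to subprocess.run to execute the steps for real.
    for command in build_commands(STEPS):
        print(" ".join(command))
```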