journal de recherches

Let's bring garbage collection to media uploads

Tags: web, architecture

Traditional content websites that allow rich text editing chronically suffer from a never-ending media ingestion syndrome: uploads pile up and nothing is ever removed. Knowing when an asset can be deleted is not trivial unless the problem is tackled from the beginning of an application's lifecycle.

I’ve seen multiple approaches in the wild, and tried (too late) to add garbage collection to projects midway. Here’s what I encountered, from the least to the most certain:

  • Notifying the user, on a media page derived from the filesystem, that an image might be suitable for deletion.
  • Keeping track of post thumbnails and page headers, and deleting those images when the parent content is deleted. When rich text comes into play, this is insufficient, because images embedded in the text itself are not tracked.
  • Periodically running a job that compares the list of images present in the filesystem with the list of images that should be reachable, according to an analysis of the various post types and content types.
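
As an illustration of the last approach, here is a minimal sketch of such a periodic job in Python. It assumes the set of reachable paths has already been computed elsewhere by analysing the content; the function name `find_orphans` and the layout of the media directory are hypothetical.

```python
from pathlib import Path


def find_orphans(media_root: Path, reachable: set[str]) -> set[str]:
    """Return files under media_root that are absent from `reachable`.

    Paths are compared as POSIX-style strings relative to media_root.
    `reachable` is assumed to come from an analysis of all content types.
    """
    on_disk = {
        p.relative_to(media_root).as_posix()
        for p in media_root.rglob("*")
        if p.is_file()
    }
    # Anything on disk that no content refers to is a deletion candidate.
    return on_disk - reachable
```

The weakness is visible in the signature: the result is only as good as `reachable`. If the analysis forgets a single content type, files that are still in use show up as orphans.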

Let’s define “media” as “anything that isn’t directly text, but referenced inside of it”.

Those approaches range from ineffective to dangerous. What would be really nice is to keep track of every uploaded media item, its kind, and a reference count of its uses. Inserting or updating a container datatype, such as a post or a page, would run an asset delta calculation to find newly used and newly unused media, and update their respective reference counts.
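
A minimal sketch of that delta calculation in Python could look like the following. It assumes, purely for illustration, that content refers to assets with a `media:<id>` token; the helper names (`extract_refs`, `asset_delta`, `apply_delta`) are hypothetical, and the count here is the number of containers using an asset, not the number of occurrences.

```python
import re
from collections import Counter

# Hypothetical convention: assets appear in content as "media:<id>".
MEDIA_REF = re.compile(r"media:([A-Za-z0-9_-]+)")


def extract_refs(body: str) -> set[str]:
    """Return the set of media ids referenced by a piece of content."""
    return set(MEDIA_REF.findall(body))


def asset_delta(old_body: str, new_body: str) -> tuple[set[str], set[str]]:
    """Return (newly used ids, newly unused ids) between two versions.

    An insert is a delta from "", a delete is a delta to "".
    """
    old, new = extract_refs(old_body), extract_refs(new_body)
    return new - old, old - new


def apply_delta(refcounts: Counter, added: set[str], removed: set[str]) -> None:
    """Update reference counts in place, refusing to go below zero."""
    for media_id in removed:
        if refcounts[media_id] <= 0:
            raise ValueError(f"refcount underflow for {media_id!r}")
    for media_id in added:
        refcounts[media_id] += 1
    for media_id in removed:
        refcounts[media_id] -= 1
```

Raising on underflow is a deliberate choice in this sketch: a count that would go negative means the tracking has already drifted from the content, and failing the write is safer than deleting on bad data.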

This would bring the benefit of a consistent media library, able to show where and how a specific asset is used throughout the site.

Then, every media item whose reference count drops to zero could be collected after some time. Alternatively, some could be marked ineligible for collection and stay in the media library until they bitrot.
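
The collection rule itself can be sketched in a few lines of Python. The `Asset` record, its field names, and the 30-day grace period are assumptions made for the example, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Asset:
    # Hypothetical record for one entry of the media library.
    media_id: str
    refcount: int = 0
    pinned: bool = False  # marked ineligible for collection
    unused_since: Optional[datetime] = None  # set when refcount hits zero


def collectable(
    assets: list[Asset],
    now: datetime,
    grace: timedelta = timedelta(days=30),  # arbitrary example value
) -> list[str]:
    """Return ids of assets that are unused, unpinned, and past the grace period."""
    return [
        a.media_id
        for a in assets
        if a.refcount == 0
        and not a.pinned
        and a.unused_since is not None
        and now - a.unused_since >= grace
    ]
```

The grace period leaves room for an editor to undo a change, or to reuse a freshly unreferenced image, before it disappears.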

These requirements would be quite trivial to implement if and only if we have total control over the various editing widgets on the site, the data structures are consistent, and the content -> asset delta routines are kept up to date. Ideally, content insertion and updating should fail if the content -> asset delta routine produces a wrong result. To check this, known and hidden sources, along with their expected results, could be used.