Leaves may die and be reborn, but limbs persist until destroyed. As they relate to Forust, limbs are page objects and leaves represent their data. Up until now I’ve written about two different routes for saving leaves, but I may have it figured out. I’m adopting the TSV idea, but some objects keyed by their URLs will also be included, and the data itself will be spread across multiple objects.
First, why not just use a database, especially SimpleDB which is already based on S3? The answer is difficult for me to formulate, but it comes down to the gut feeling that such data is fluid and impermanent. I am not opposed to using a database for transient information and analytics, but I will not use one for Forust storage. Chalk it up to crankiness if you like; I have certainly tired of arguing with myself over it, and I’m moving on.
Now, there are a couple of priorities behind what I’m about to explain: minimize S3 transactions and never destroy. Obviously the latter is relative, as certain junk and trademark/copyright material must be burned, but in general I aim to never lose data.
Why not use straight-up TSV for the page’s leaves and segments? The answer lies in the abstraction: S3 objects are atomic, distributed, and decidedly not files as we would normally treat them. If I could simply append to them, then there’d be no issue. Consider what happens if two users try to add a comment on the same page at exactly the same time; the internet is great for producing these types of fringe scenarios. Both operations to add the comments will succeed, but one is going to get “stomped”. Not good.
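To make the stomping concrete, here is a minimal sketch in Python. It assumes a plain dict standing in for S3 (a write replaces the whole object, there is no append), and the key name, function names, and sample comments are all made up for illustration.

```python
# Toy stand-in for S3: writing a key replaces the whole object.
store = {"page.tsv": "alice\tfirst!\n"}

def read_for_append(key):
    """Step 1 of a fake 'append': fetch the current whole object."""
    return store[key]

def write_appended(key, snapshot, row):
    """Step 2: write the snapshot plus one new row, replacing the object."""
    store[key] = snapshot + row

# Two requests interleave: both read before either one writes.
snap_a = read_for_append("page.tsv")
snap_b = read_for_append("page.tsv")
write_appended("page.tsv", snap_a, "bob\tnice photo\n")
write_appended("page.tsv", snap_b, "carol\tlol\n")

# Both writes "succeeded", yet bob's comment is gone.
lost_update = "bob" not in store["page.tsv"]
```

Neither request sees an error, which is exactly why this kind of loss is so easy to miss.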
Why not use separate S3 objects then? Pages will move and be given new URLs while the old URL must be modified to forward there. With S3 objects based on a page’s URL, they’re going to get lost now and then when trying to perform this rename. I came up with some ways around this, but it’s easy to imagine it getting out of sync and either losing data or duplicating it.
Alright, so how can I possibly do this? Well, I’m going to combine these ideas using some extra (albeit more complicated) server logic. Let’s start at the beginning and walk through a page’s lifetime.
Initially, when a user creates a new page the content and the limb are one. The name of the file they’re uploading will be embedded in the URL they’re given:
/2009/05/27/141500-image.jpg.html
The above is both the public URL and the S3 URL; there is no need to create a separate “head” object. When a leaf is added to the limb, say a comment, a new S3 object is created:
/2009/05/27/141500-image.jpg.html[timestamp]$tsv
If a new content segment is added, the same type of action occurs, but the URL it ends up at ends with “$seg”:
/2009/05/27/141500-image.jpg.html[timestamp]$seg
Segments use the “f0” meta tag whereas TSV objects use something else (but not Content-Type, since a user can cause that to be set, which would be dangerous).
The timestamp is there for ordering purposes and to prevent stomping (not perfect, but I’ll solve it further when I get there, probably by using the PHP session ID too).
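The naming scheme above can be sketched in a few lines of Python. This assumes the brackets around the timestamp are just a placeholder, and the timestamp format (fixed-width digits down to the microsecond, so that string order matches time order) is my own guess rather than anything settled; `leaf_key` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def leaf_key(page_key, when, kind):
    """Build the key for a new leaf ('tsv') or segment ('seg') object."""
    if kind not in ("tsv", "seg"):
        raise ValueError("kind must be 'tsv' or 'seg'")
    # Assumed format: fixed-width digits, so keys sort chronologically.
    stamp = when.strftime("%Y%m%d%H%M%S%f")
    return f"{page_key}{stamp}${kind}"

page = "/2009/05/27/141500-image.jpg.html"
when = datetime(2009, 5, 27, 14, 20, 0, tzinfo=timezone.utc)
comment_key = leaf_key(page, when, "tsv")
segment_key = leaf_key(page, when, "seg")
```

Two writes landing in the same microsecond would still collide here, which is the gap a per-user suffix such as a session ID would have to close.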
All of these objects are loaded in order to build a map of the page. If a segment is not referenced in any TSV then its mere existence causes it to be in the map. The TSV files are formatted similarly to how I imagined earlier today, except the first column is a timestamp. The two types of data rows (implied by existence and in a TSV) are sorted by timestamp prior to interpretation to ensure new ones override old ones.
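Here is a rough Python sketch of that map-building step. The three-column row layout (timestamp, name, value), the fixed-width timestamp strings, and the `build_page_map` name are assumptions for illustration only; the real TSV columns are not pinned down here.

```python
def build_page_map(tsv_bodies, segment_stamps):
    """Merge explicit TSV rows and implied segment rows into one map.

    tsv_bodies: list of TSV texts, each row 'timestamp<TAB>name<TAB>value'.
    segment_stamps: dict of {segment key: timestamp} for existing '$seg' objects.
    """
    rows = []
    for body in tsv_bodies:
        for line in body.splitlines():
            if line.strip():
                stamp, name, value = line.split("\t", 2)
                rows.append((stamp, name, value))
    # A segment that no TSV mentions is in the map simply because it exists.
    referenced = {name for _, name, _ in rows}
    for key, stamp in segment_stamps.items():
        if key not in referenced:
            rows.append((stamp, key, "segment"))
    page_map = {}
    for stamp, name, value in sorted(rows):  # oldest first
        page_map[name] = value               # newer rows override older ones
    return page_map

tsvs = ["20090527150000\ttitle\tMy dog\n", "20090527142000\ttitle\tMy cat\n"]
segs = {"/p.html20090527143000$seg": "20090527143000"}
page_map = build_page_map(tsvs, segs)
```

The order in which the objects happen to be loaded does not matter, because everything is sorted by timestamp before it is interpreted.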
Fine, you say, but what happens when the page is categorized or renamed? If it’s given a new URL then it breaks the URL-relationship of all those tsv’s and seg’s.
Yup, totally, and that’s okay! Let me explain …
Renaming a page under this system involves two new objects: a tsv under the old structure indicating the page is relocated and a new page object that is a tsv rather than a seg. The new page object will have all the previous TSV data rolled up into it, including the rename directive and all the segments that had linkage implied by their paths. The segments thus retain their old URLs, but are linked to the limb via explicit TSV rows rather than their paths.
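A minimal Python sketch of the rename, under heavy assumptions: rows are (timestamp, name, value) tuples, the rename directive is a row named "moved", and `rename_page` is a hypothetical helper. It only computes the two objects to write; it never touches or deletes anything that already exists.

```python
def rows_to_tsv(rows):
    """Serialize rows as tab-separated lines, oldest first."""
    return "".join("\t".join(row) + "\n" for row in sorted(rows))

def rename_page(old_key, new_key, old_rows, implied_segments, stamp):
    """Return {key: body} for the two new objects a rename creates.

    old_rows: (timestamp, name, value) tuples from the old TSV objects.
    implied_segments: {segment key: timestamp} linked only by their paths.
    """
    directive = (stamp, "moved", new_key)  # hypothetical rename directive
    rolled = list(old_rows)
    # Segments keep their old keys but gain explicit rows in the new object.
    rolled += [(ts, key, "segment") for key, ts in implied_segments.items()]
    rolled.append(directive)
    return {
        f"{old_key}{stamp}$tsv": rows_to_tsv([directive]),  # relocation notice
        new_key: rows_to_tsv(rolled),                       # rolled-up page
    }

old = "/2009/05/27/141500-image.jpg.html"
objs = rename_page(
    old, "/cats/image.jpg.html",
    [("20090527142000", "comment", "lol")],
    {old + "20090527143000$seg": "20090527143000"},
    "20090601120000",
)
```

If the second write fails, the old page still works and the relocation notice points at nothing yet, so a retry (or the cleanup job) can finish the move without any data having been lost.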
There are issues here, but none of them are related to losing data. Given that there will be unnecessary / stray S3 objects after a rename, a CRON job will have the responsibility of cleaning that crap up. Over time all these fragmented objects will cause page loading performance to drop considerably, and it will be the job of said scheduled task(s) to fix it. I hate to use this term, but it’s kinda like defragmenting. Yes, it sounds awful, but I think it can work out well.
Data duplication could be a problem, so one rule I’m stating now is that duplicate TSV rows (implied or not) are completely ignored. The only reason a row will be duplicated is if there is some sort of synchronization issue, because the timestamp column prevents two pieces of equivalent content added separately from being treated as duplicates. Why would you want that? Well, hypothetically if each comment is simply its text then you wouldn’t be able to say just “lol” more than once. God forbid!
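The duplicate rule is simple enough to sketch in Python. Rows are assumed to be (timestamp, name, value) tuples, which is an illustrative layout rather than a final one, and `drop_duplicate_rows` is a made-up name.

```python
def drop_duplicate_rows(rows):
    """Keep the first copy of each row; exact repeats are ignored."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

rows = [
    ("20090527142000", "comment", "lol"),
    ("20090527142000", "comment", "lol"),  # same row synced twice: ignored
    ("20090527150000", "comment", "lol"),  # a second "lol" later on: kept
]
unique_rows = drop_duplicate_rows(rows)
```

Only a row that matches in every column, timestamp included, counts as a duplicate.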
Caching is an obvious advantage but also a necessity. Note that S3 objects are never modified by this system, so it’s perfectly safe to cache them indefinitely. The removal of data is actually done by adding new objects which indicate what data to cancel out. Legally there are problems with this approach, and I’ll need some way to purge cached items that must be permanently deleted, but I’ll get to that.
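As a sketch of removal by addition, here is one way it could look in Python. The "cancel" row, which names the timestamp of the row it hides, is purely my assumption about how such a marker might be encoded, as is the (timestamp, name, value) layout.

```python
def visible_rows(rows):
    """Hide rows that any 'cancel' row points at; nothing is deleted."""
    cancelled = {value for _, name, value in rows if name == "cancel"}
    return [
        row for row in rows
        if row[1] != "cancel" and row[0] not in cancelled
    ]

rows = [
    ("20090527142000", "comment", "spam spam spam"),
    ("20090527150000", "comment", "nice photo"),
    ("20090528090000", "cancel", "20090527142000"),  # hides the first row
]
shown = visible_rows(rows)
```

The spam row is still sitting in storage and in any cache, which is why material that legally has to go needs a real purge on top of this.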
Posted by Neil Obremski