You're staring at a seemingly insurmountable pile of content that needs to be moved. For whatever reason, you are moving to a new content management system. How are you going to migrate this stuff? A natural conclusion is that you'll have to manually review, edit, and move content. That is understandably intimidating and would probably take a huge amount of time and effort. A manual approach has other problems too, especially on a large site: quality suffers when different editors make inconsistent decisions. And even if you are going to do a lot of manual work, you should still look at the problem as a whole and break it down into steps rather than dealing with each piece of content in isolation.
So how can we turn this mess of content into something that is easier to deal with? By following these steps:
- Divide the problem into manageable pieces
- Take time to pilot and estimate
- Re-assess based on your pilot, and migrate
Step 1. Divide content into manageable pieces: type, cut, automate
The first step is to break up your content into types. You may be lucky and your content is already neatly arranged or easy to identify. Or you may have work to do to figure out your content types. You will need to do content analysis anyway to move into your CMS, so the first step in the process is categorizing your content by type. To use a container ship analogy, instead of looking at the shipyard full of containers as a whole, start looking at each type of container separately.
Why break your content into types? You'll need the types anyway, both to define the metadata (which will help enable site behaviors) and to define the ongoing editorial requirements. For the purposes of planning your migration, breaking content into types will help you decide what to cut. You may feel that your press releases over five years old don't need to be moved, or that the encyclopedia you tried that didn't work can be deleted.
One of the main take-away points here is that you want to define the rules for cutting content. For smaller sets of content this may not be needed, and reviewing a spreadsheet that lists all the content may work. But hopefully you'll be able to define rules that mean you never have to take a second look at what was cut. Defining rules may allow you to cut wide swaths of content that aren't contributing value to your site. Some other ways of deciding what to cut include web analytics (cut anything that hasn't received page views in the last month), site section (not just the encyclopedia entries, but the whole encyclopedia site), metadata that already exists (the topic "parcheesi" doesn't interest anyone, so delete anything tagged to it in the current system), or even contributor/source (perhaps everything that intern entered three years ago can be safely deleted).
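The cutting rules described above can be written down as code, so that every item gets the same decision no matter who runs the inventory. The following is only a minimal sketch: the `ContentItem` fields, the rule order, and the thresholds are assumptions made up for illustration, not part of any particular CMS.

```python
from dataclasses import dataclass, field
from datetime import date


# Hypothetical inventory record; the field names are assumptions for illustration.
@dataclass
class ContentItem:
    url: str
    content_type: str
    published: date
    page_views_last_month: int
    section: str
    tags: set = field(default_factory=set)


def cut_reason(item, today):
    """Return the first rule that cuts this item, or None to keep it."""
    age_years = (today - item.published).days / 365.25
    if item.content_type == "press_release" and age_years > 5:
        return "press release older than five years"
    if item.page_views_last_month == 0:
        return "no page views in the last month"
    if item.section == "encyclopedia":
        return "whole site section is being retired"
    if "parcheesi" in item.tags:
        return "tagged with a topic nobody reads"
    return None


def apply_cut_rules(items, today):
    """Split an inventory into kept items and (item, reason) pairs for cut items."""
    kept, cut = [], []
    for item in items:
        reason = cut_reason(item, today)
        if reason is None:
            kept.append(item)
        else:
            cut.append((item, reason))
    return kept, cut
```

Recording the reason next to each cut item gives you something to show stakeholders who ask why a page disappeared, without anyone having to re-review it.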
Next, you want to decide what you can automate. You've already decided what will be migrated, so now you need to decide what's going to be moved automatically. Automating isn't an either/or proposition, and in general you want to automate as much as possible. Again, the idea here is to define the rules about what will be automatically migrated and what will not. This will help you estimate your effort and prioritize. The biggest factor in what can and cannot be automatically migrated is the structure and "regularity" of the content. If you had a bunch of content that was already in a CMS, only referred to a common CSS stylesheet, was only in standard HTML, already had high quality metadata, and had all the relationships between content clearly defined, then you could migrate that easily. That's because in that example the content is structured and regular. Chances are, your content is more interesting and varied (and hopefully less interesting once in the new system), so you will need to look at samples of your content and try to determine how regular and structured it is. If you can see patterns (finding them can be more art than science), then you can hopefully automate the migration of some portion of some or all content types.
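One rough way to put a number on "regularity" is to run a sample of pages through a simple check and see what share passes. The sketch below uses Python's standard `html.parser`. The tag whitelist and the "no inline styles" rule are assumptions chosen for illustration, so substitute whatever patterns you actually find in your own samples.

```python
from html.parser import HTMLParser

# Assumed whitelist of "standard" tags; adjust to what your templates really use.
ALLOWED_TAGS = {
    "p", "a", "h1", "h2", "h3", "ul", "ol", "li",
    "em", "strong", "img", "br", "blockquote",
}


class RegularityChecker(HTMLParser):
    """Collects reasons why a page would be awkward to migrate automatically."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            self.problems.append(f"unexpected tag <{tag}>")
        if any(name == "style" for name, _ in attrs):
            self.problems.append(f"inline style on <{tag}>")


def is_regular(html):
    """True if the page body uses only whitelisted tags and no inline styles."""
    checker = RegularityChecker()
    checker.feed(html)
    checker.close()
    return not checker.problems


def automatable_share(samples):
    """Fraction of sampled pages that pass the regularity check."""
    if not samples:
        return 0.0
    return sum(is_regular(page) for page in samples) / len(samples)
```

The share this produces is only a first guess at the "Percent Automated" figure for a content type. It says nothing about metadata quality or relationships between content, which need their own checks.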
After Step 1 of typing, and then finding rules for cutting and automating, you will have a more useful inventory of your content, with numbers like this:
Content Type | Original Count | Percent Cut | Percent Automated (of what remains) | Manual Count | Automated Count |
Press Release | 20,000 | 50% | 50% | 5,000 | 5,000 |
Article | 100,000 | 20% | 90% | 8,000 | 72,000 |
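The manual and automated counts in the table follow from the first three columns: the cut is applied to the original count, and the automation percentage is applied to what remains. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def split_counts(original, pct_cut, pct_automated):
    """Return (manual_count, automated_count) for one content type.

    pct_cut applies to the original count; pct_automated applies to
    whatever remains after the cut.
    """
    remaining = original * (100 - pct_cut) // 100
    automated = remaining * pct_automated // 100
    manual = remaining - automated
    return manual, automated


# Rows from the table above.
press_release = split_counts(20_000, 50, 50)   # (5000, 5000)
article = split_counts(100_000, 20, 90)        # (8000, 72000)
```

Keeping the calculation in one place makes it cheap to try a different cut or automation percentage and see how the manual count moves.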
Step 2. Take time to pilot and estimate
At this point you have some big numbers based on educated guesses, but you could still be wildly off on how many items can be automated. Now take this one step further and try to estimate how long the manual migration and the automated migration will take. You might say that the press releases just need a small amount of manual massaging, so each one will take an hour on average. But perhaps those articles that you want to manually touch require actual rewriting, so they will take four hours each. Using the numbers from the table above, the manual migration of the press releases is 5,000 * 1 hour = 5,000 hours, and the effort for the articles is 8,000 * 4 hours = 32,000 hours. This is probably more manual effort than you're willing to take on, so you would probably cycle back through Step 1 to see what other assumptions you might be able to change (for instance, the level of quality you are aiming for).
Let's say that after you do your further analysis, you think a larger percentage of your content can be automatically migrated. Well, until the rubber hits the road, you won't really know if those assumptions are true. At this point, you should pilot an actual migration of a section of the site. Then you can actually see how much can be migrated and, by sampling the results, you can discover how good the migration quality is (and potentially cycle through iterations to improve quality).
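The effort estimate is simple multiplication, but writing it out makes it easy to re-run as your assumptions change. This sketch uses the per-item hours and manual counts from the text. The conversion to person-years assumes roughly 2,000 working hours per person per year, which is an assumption added here for illustration.

```python
# Manual counts from the table and hours per item from the text.
MANUAL_COUNTS = {"Press Release": 5_000, "Article": 8_000}
HOURS_PER_ITEM = {"Press Release": 1, "Article": 4}


def manual_hours(manual_counts, hours_per_item):
    """Hours of manual migration effort per content type."""
    return {t: manual_counts[t] * hours_per_item[t] for t in manual_counts}


def person_years(total_hours, hours_per_person_year=2_000):
    """Convert hours to person-years; 2,000 hours/year is an assumption."""
    return total_hours / hours_per_person_year


effort = manual_hours(MANUAL_COUNTS, HOURS_PER_ITEM)
total = sum(effort.values())
# effort is {"Press Release": 5000, "Article": 32000}; total is 37000 hours,
# or 18.5 person-years under the 2,000-hour assumption.
```

Seeing the total as person-years rather than hours tends to make the case for another pass through Step 1 on its own.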
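Checking quality by sampling can be sketched as two small helpers: one that draws a reproducible random sample of migrated items for a reviewer, and one that turns the reviewer's pass/fail verdicts into a rate. The function names and the shape of the review results (a mapping from item ID to pass/fail) are assumptions for illustration.

```python
import random


def sample_for_qa(migrated_ids, sample_size, seed=0):
    """Pick a reproducible random sample of migrated items to review by hand."""
    ids = list(migrated_ids)
    rng = random.Random(seed)  # fixed seed so the same sample can be re-checked
    return rng.sample(ids, min(sample_size, len(ids)))


def pass_rate(review_results):
    """Share of reviewed items judged acceptable.

    review_results maps item ID to True (acceptable) or False (needs work).
    """
    if not review_results:
        raise ValueError("no items were reviewed")
    return sum(review_results.values()) / len(review_results)
```

Re-drawing the sample with the same seed after each change to the automation rules lets you compare pass rates across iterations of the pilot on the same items.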
Step 3. Re-assess based on your pilot, and migrate
After the pilot is complete, it's time to re-assess where you are. Did the automation yield the quality you were expecting? Are there ways of improving the quality through the automation rules? Are you going to need to jettison more content, or lessen the quality? You may decide that you need to do manual QA on a sample of the automated output for some content types, to confirm that the quality holds up over a large batch of content (this may also vary based on content type). See "Why estimate? I'm not getting more resources for this site migration."
In sum, the overall process of migrating content is to divide the content into manageable pieces, pilot and estimate, and then re-assess and migrate.
After you have done the pilot and re-assessed, you should:
- Have a good estimate of the effort it will take
- Get a flavor for the types of issues you will be encountering
- Have a way of breaking down the problem for the actual migration
Also see the previous entry on ensuring quality during migration.
Be Careful!
This article jumps straight into the middle of migration planning, and assumes you've already taken care of other important planning, including:
- Defining a compelling vision
- Considering your content strategy
- Conducting a proof of concept on your CMS
- Treating this not as a one-shot deal, but as a question of how to steward this content and the site over time
Remember that a migration is almost never just about content: you also need to consider relationships, teams, and tools.
A final note: you may feel a bit like a deer in headlights over the question of whether something can be automated or not. I've heard some interesting excuses from systems integrators / development shops about why something cannot be automated, and they may sound like valid reasons to you. Hopefully, by conducting your pilots and estimates, you will have compelling reasons to automate as much as possible. As mentioned above, seeing patterns and being able to apply them can be more art than science, and it certainly should go beyond throwing a tool like HTMLtidy at the problem.