Sep 14, 2011
Migrating from ContentDM to Omeka

Seems like the Omeka forums get a lot of traffic from people looking to migrate from ContentDM to Omeka. I, personally, get inquiries about this all the time (for some unknown reason). So I figured I may as well share what I know about the process here so I can just send a link or you can find it on Google or whatever.
It’s worth noting at the outset that I don’t know anything about generating ContentDM export files (partially because even the ContentDM documentation is proprietary, or at least hidden behind a login). But I do know that every time someone has sent me a cDM export file, it is in tab-delimited format (UPDATE: here is the Tab-delimited export documentation), which is basically a plain text spreadsheet. I gather these spreadsheets can be produced pretty easily so I think we can start by assuming you already have the spreadsheet and need to prep it for import into Omeka, which will be done in this tutorial using the CSV Import plugin. If you are not already familiar with how that plugin works, check out the documentation page at Omeka.org before continuing.
You probably need to use Excel
First, open the tab-delimited spreadsheet in Microsoft Excel. You will later save this file in CSV format. If you prefer using non-MS spreadsheet software, you are probably out of luck. As far as I know, neither OpenOffice/LibreOffice nor Apple’s Numbers has an equivalent ‘Text-to-Columns…’ function, which will be used in this process.
Remove unwanted and problematic rows/columns, Rename column headers
Once you have your spreadsheet open in Excel, go ahead and remove any ContentDM-specific administrative metadata, or anything else you don’t wish to carry over to Omeka during the migration. At this point, you should probably rename the column headings to something meaningful. This will help with the crosswalk step later.
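If you would rather script this cleanup than do it by hand, here is a minimal Python sketch of the same idea: read the tab-delimited text, drop the columns you don't want, rename the headers, and write CSV. The column names in the sample ("CONTENTdm number", "Subject") are made up for illustration; your export will have its own.

```python
import csv
import io

def prep_export(tsv_text, drop=(), rename=None):
    """Convert tab-delimited text to CSV text, dropping and renaming columns."""
    rename = rename or {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Keep the original column order, minus anything listed in `drop`.
    keep = [name for name in reader.fieldnames if name not in drop]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Header row: use the new name where one is given, else the old name.
    writer.writerow([rename.get(name, name) for name in keep])
    for row in reader:
        writer.writerow([row[name] for name in keep])
    return out.getvalue()

# Hypothetical two-line export; real exports have many more columns.
sample = "Title\tSubject\tCONTENTdm number\nPress photo\tCleveland\t59\n"
print(prep_export(sample, drop={"CONTENTdm number"}, rename={"Subject": "Subjects"}))
```

For a real file, you would read the export from disk and write the result to a .csv file instead of working with strings.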
Breaking out semicolon-delimited values
Next, you will want to check for columns having multiple entries within a single cell. At the very least, this will probably include the Subjects column (because archivists/librarians are never satisfied with assigning just one subject term… subject classification being the Lay’s potato chip of librarianship). By default, these multiple subjects will be separated by a semicolon (e.g. “Librarianship — Potato Chip Analogies; Librarianship — Puns; Librarianship — Personality Disorders;” ). Instead of copying each one of these entries into a new Subjects column, you can just use the “Data > Text to Columns…” feature in Excel. I recommend using a separate worksheet for this step as the new columns will overwrite your existing ones if you are not careful.
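For anyone working outside Excel, the same split can be sketched in a few lines of Python. This is only a rough equivalent of Text to Columns, assuming your rows are already loaded as dictionaries; the numbered column names it generates ("Subjects 1", "Subjects 2", ...) are my own convention, not something ContentDM or Omeka requires.

```python
def split_multivalue(cell, sep=";"):
    """Split one cell on the separator, trimming spaces and empty leftovers."""
    return [part.strip() for part in cell.split(sep) if part.strip()]

def expand_column(rows, column, sep=";"):
    """Replace one multi-value column with numbered single-value columns.

    rows is a list of dicts; returns (new_rows, new_column_names).
    """
    split = [split_multivalue(row.get(column, ""), sep) for row in rows]
    # The widest row decides how many new columns are needed.
    width = max((len(values) for values in split), default=0)
    names = [f"{column} {i}" for i in range(1, width + 1)]
    new_rows = []
    for row, values in zip(rows, split):
        new_row = {key: value for key, value in row.items() if key != column}
        # Pad short rows with empty strings so every row has every column.
        padded = values + [""] * (width - len(values))
        new_row.update(zip(names, padded))
        new_rows.append(new_row)
    return new_rows, names

rows = [
    {"Title": "A", "Subjects": "Librarianship -- Puns; Librarianship -- Potato Chip Analogies;"},
    {"Title": "B", "Subjects": "Librarianship -- Puns"},
]
new_rows, names = expand_column(rows, "Subjects")
print(names)
```

Because it builds new rows rather than editing in place, the original data is left alone, which sidesteps the overwrite problem mentioned above.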
Getting the File Path URLs
So you’ve shaped up all the metadata on the spreadsheet. Now you need to define the path to the item file in ContentDM. This is probably the trickiest part to come up with on your own, especially if you are not so familiar with ContentDM. Basically, our starting point will be the cDM “Reference URL.” Those look something like this: http://images.ulib.csuohio.edu/u?/press,59. Assuming you have a whole column of Reference URLs, you need to run a Find and Replace to create your file path (again, I recommend doing this in a separate worksheet so you don’t accidentally overwrite important data). Let’s begin.
UPDATE: in ContentDM version 6+, ShowFile is replaced by GetFile. Adjust the following instructions as needed (i.e. in the first Find and Replace, swap out showfile.exe with getfile.exe).
Start with something like:
http://images.ulib.csuohio.edu/u?/press,59
FIND this:
u?/
… and REPLACE with this:
cgi-bin/showfile.exe?CISOROOT=/
Next, FIND this:
,
…and REPLACE with:
&CISOPTR=
So now we have something that looks like:
http://images.ulib.csuohio.edu/cgi-bin/showfile.exe?CISOROOT=/press&CISOPTR=59
This is a working file path that can be used by the CSV Import plugin to ingest the item file along with the metadata record.
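The two Find and Replace steps above can also be sketched as a small Python function, in case you would rather script the conversion. The function name and the `script` parameter are my own; pass "getfile.exe" if you are on ContentDM 6+, per the update above.

```python
def reference_to_file_url(reference_url, script="showfile.exe"):
    """Turn a cDM Reference URL into a file URL, mirroring the two replacements."""
    # First replacement: "u?/" becomes the script path plus CISOROOT.
    url = reference_url.replace("u?/", f"cgi-bin/{script}?CISOROOT=/", 1)
    # Second replacement: the comma becomes the CISOPTR parameter.
    return url.replace(",", "&CISOPTR=", 1)

print(reference_to_file_url("http://images.ulib.csuohio.edu/u?/press,59"))
```

Each replacement is limited to the first match so that a stray comma elsewhere in the URL is not touched.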
In some instances, you might need to tweak this process. For example, if your ContentDM installation includes JP2 or TIF files (or some other unfriendly image format) but you don’t want the hassle of building a custom display wrapper into your Omeka theme, you can append some additional query string parameters to your file URL.
So if you want ContentDM to serve up a JPG instead of a JP2 (or other…) file, add this to your file column:
&DMSCALE=100.00000&DMWIDTH=1600&DMHEIGHT=1600&DMX=0&DMY=0&DMTEXT=&REC=1&DMTHUMB=0&DMROTATE=0
…using this Excel function (where A2 is the first column/cell in need of appending):
=CONCATENATE(A2,"&DMSCALE=100.00000&DMWIDTH=1600&DMHEIGHT=1600&DMX=0&DMY=0&DMTEXT=&REC=1&DMTHUMB=0&DMROTATE=0")
Finally, you need to swap showfile with getimage in the file URL above by running one more find and replace in Excel.
So now, your file path looks like:
http://images.ulib.csuohio.edu/cgi-bin/getimage.exe?CISOROOT=/press&CISOPTR=59&DMSCALE=100.00000&DMWIDTH=1600&DMHEIGHT=1600&DMX=0&DMY=0&DMTEXT=&REC=1&DMTHUMB=0&DMROTATE=0
This will return a JPG file, which is pretty handy. These additional parameters will vary by installation and file type. I don’t know what all of the parameters are or even what each one does; only that this usually works. Again, this is a case where actual ContentDM documentation would be really handy. UPDATE: Keep in mind that this only works with image file types. For more details, check out the GetImage documentation.
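Here is a rough Python version of the CONCATENATE step and the showfile-to-getimage swap, for anyone scripting this. The parameter string is copied from the example above; the function name and the `width`/`height` arguments are my own additions, and I am assuming (without documentation to confirm it) that the other parameters can stay fixed.

```python
def to_getimage_url(file_url, width=1600, height=1600):
    """Swap showfile for getimage and append the image query parameters."""
    params = (
        f"&DMSCALE=100.00000&DMWIDTH={width}&DMHEIGHT={height}"
        "&DMX=0&DMY=0&DMTEXT=&REC=1&DMTHUMB=0&DMROTATE=0"
    )
    return file_url.replace("showfile.exe", "getimage.exe", 1) + params

print(to_getimage_url(
    "http://images.ulib.csuohio.edu/cgi-bin/showfile.exe?CISOROOT=/press&CISOPTR=59"
))
```

Exposing the width and height makes it easy to experiment with larger values if 1600 turns out to be too small for some of your images.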
Metadata Crosswalk
It’s usually a good idea to plan out your metadata crosswalk in advance, especially if you have multiple export files (and you should if your collection is bigger than a few hundred items; more on that later). Remember that Omeka – out of the box – only uses the first 15 Dublin Core elements. You may need to add a new Item Type or install Dublin Core Extended in order to find/create an appropriate home for your legacy/custom metadata in Omeka.
Batching
To avoid server timeouts, you should consider breaking your spreadsheets into manageable batches. I try not to import more than a few hundred items at a time, and even then one of the two servers involved is likely to time out or throw an error or something. Keeping the batches small makes it easier to isolate problems, avoid import errors, and undo problematic imports.
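Splitting a big CSV by hand gets tedious, so here is a minimal Python sketch that cuts CSV text into batches, repeating the header row at the top of each one. The default batch size of 200 is just my reading of "a few hundred items"; adjust it to whatever your servers tolerate.

```python
import csv
import io

def split_csv(csv_text, batch_size=200):
    """Split CSV text into a list of CSV texts, each with the header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    batches = []
    for start in range(0, len(body), batch_size):
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)  # every batch needs its own header row
        writer.writerows(body[start:start + batch_size])
        batches.append(out.getvalue())
    return batches

# Tiny made-up example: three items, two per batch.
text = "Title,File\nA,1\nB,2\nC,3\n"
for number, batch in enumerate(split_csv(text, batch_size=2), start=1):
    print(f"batch {number}:")
    print(batch)
```

Each returned string can be written to its own file (batch-1.csv, batch-2.csv, and so on) and imported separately.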
Using the CSV Import Plugin
From here, just follow the standard instructions for using the CSV Import plugin.
Bugs, Known Issues, and Limitations
As of version 1.3, there are still some quirks. For example, your file path – that hideous long URL you worked so hard to create – will become the actual name of your imported/migrated file. In some instances, your files may be ingested sans file extension (e.g. .pdf, .jpg, .mp3), which can cause various headaches (though it’s worth noting that these files will generally display inline on your site, due to the way most Omeka themes handle media files, and will only break down when someone tries to download the file, in which case they would need to manually add the file extension). From time to time, you could have an import that hangs indefinitely, never finishing and never failing — and thus not easily “undo-able” (at least, the “Undo Import” button will not be visible). In such a case, you can manually create that button by entering the following URL pattern into your address bar:
http://[PATH TO YOUR OMEKA INSTALLATION]/admin/csv-import/index/undo-import/id/[IMPORT ID]
– this is on the plugin documentation page by the way, as are several other points in this tutorial.
One of the biggest limitations of the CSV import strategy is that you will probably have issues migrating compound objects and other multi-file items, primarily because of the way ContentDM formats the export file and serves compound objects online and partially due to limitations in the way the plugin works with Omeka. Basically, you need all the files for an item to be in the same row as all of the other item-level metadata (e.g. in columns like “File 1,” “File 2,” “File 3,” etc). And there is currently no way to use the CSV Import plugin to assign file-level metadata. For example, if you had a postcard in your ContentDM collection and it had distinct metadata for each side (say, for front.jpg and verso.jpg), along with general metadata for the object as a whole, something is going to be lost in the migration without some serious elbow grease.
The Omeka Dev forums are the best place to report bugs, inquire about error messages, discuss workarounds, and submit patches. The general Omeka Forums are also great for more basic questions; happily, most questions get answered in fairly short order. Please do not post support questions here. Please do, however, feel free to leave general comments, suggestions for improvement, requests for clarification, etc.
IMAGE NOTE: poorly Photoshopped post image contains assets by multiple artists and designers, including the amazing “Bob” sketch from Matt Haley’s unreleased but totally awesome sounding Twin Peaks: Season 3 graphic novel.

This is a nice write-up. One other gotcha worth mentioning in this context is that Omeka currently doesn’t create derivative images (e.g. thumbnails) from PDF originals. From what I’ve seen in the user forums this is because the conversion requires fairly recent versions of Ghostscript and ImageMagick. You can go ahead and import these files anyway via CSV and then supply the necessary (or not, depending on which theme you use) JPEG thumbnails after the fact. After much beginner’s trial and error I worked out a way to do this. There’s a brief description of my workflow on the Omeka Forums here: http://omeka.org/forums/topic/thumbnails-for-pdf-files#post-5646 I’m certain there’s a better way but this was good enough for my needs.
A second import-related gotcha has to do with the text you select for your DC Titles. If your users are going to be searching primarily on this field, make sure the relevant content includes words of at least four characters; by default, shorter words get passed over by MySQL’s full-text search. Advanced search includes a regex option, so the problem doesn’t show up there, but then again that route is not as quick and simple either.
Finally, though just mentioned (if I recall correctly) as an aside in the Omeka docs, there is the possibility of re-using your original import spreadsheet to push bulk revisions back onto your data in Omeka. I haven’t looked into this personally but it should in theory be possible (for example, for those of us who don’t have much experience doing this the ‘right’ way with SQL commands).
For pulling out individual images, I like the syntax
http://images.ulib.csuohio.edu/cgi-bin/showfile.exe?CISOROOT=/press&CISOPTR=59
If you save the file from a browser, or with a recent version of curl using the -O and -J options, you’ll notice that it gets the nice, short filename that CDM uses internally (in this case 60.jpg) rather than something long and junky (though I haven’t tested to see how Omeka handles the import).
The other cool thing about this syntax is that it allows you to retrieve the XML *.cpd files directly off the server if you know their numbers and it can be used with things that aren’t images.
The best way to export data from CDM is custom XML as that will give you all fields, real filenames, and you won’t have weird parsing problems with repeated fields that might contain the delimiter you used.
As an aside regarding the method you mentioned for extracting JPEGs, it’s important to make sure the DMWIDTH and DMHEIGHT parameters are big enough to contain the largest dimensions you have, or you’ll truncate the images (set those parameters to 10000 and see how the images in your examples change).
The online CDM documentation, which is accessible to nonusers, can be handy, particularly the part regarding exporting:
http://www.contentdm.com/help6/collection-admin/exporting3.asp
There are other sections that are handy to browse for purposes of wrapping your mind around things. But as is the case with virtually all products, the coolest stuff is not documented.
Hi Kyle, thanks for the helpful replies.
You are right about using showfile vs getimage in the file URL. That was a mistake on my part and I’ll update the post to reflect your suggestion. I seem to recall that we needed to use getimage for certain file types, but don’t recall the specifics at the moment (if I can track that info down from old email threads, I will clarify in a post revision). But generally, you are correct, that showfile is preferable. Thanks for catching that.
I will also incorporate the documentation links and notes about query strings, which will be really helpful for me personally and probably for others as well.
Again, thanks for the insider scoop. I work in a university but not at the library so some of these details can be hard to extract when working across departmental lines.