Archive for June, 2011


Maintaining Alfresco Index Health

Alfresco uses Lucene to provide services for indexing and searching metadata and content. Though Lucene is a reliable subsystem for most of our customers, when trouble does occur in Alfresco, the Lucene index is often involved. A customer asked me today for advice on maintaining the health of Alfresco indexes. After getting some ideas from support, I decided to document the advice for the larger community.

The live Lucene indexes cannot be backed up without corruption, so Alfresco is configured to dump a snapshot of the indexes each night. This index backup can then be included in your Alfresco backup routine (copied off the production machine) and used during a restore of Alfresco. On restore, Alfresco will reload the indexes from the last backup, and index the metadata for content that was added after that point in time. Once the metadata index is complete, users can interact with the system again. The system will complete any missing full-content indexes in the background without impacting accessibility. While this process completes, users will be able to search for and find documents based on the metadata, but not on document content.

The most important reminder the team had was to verify that index backups are being performed and that you can successfully restore from the backups. Too often we (as IT professionals) neglect to confirm that our backup plans perform as designed.
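
One way to automate the first half of that check is a small script that confirms the nightly snapshot is recent. The following is a minimal sketch in Python, not an Alfresco tool: the directory path and the 26-hour threshold are assumptions, so substitute the location your installation actually writes its index backup to.

```python
import os
import time


def newest_mtime(directory):
    """Return the latest modification time of any file under directory, or None if it holds no files."""
    newest = None
    for root, _dirs, files in os.walk(directory):
        for name in files:
            mtime = os.path.getmtime(os.path.join(root, name))
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def backup_is_fresh(directory, max_age_hours=26, now=None):
    """True if at least one file under directory was modified within max_age_hours."""
    now = time.time() if now is None else now
    newest = newest_mtime(directory)
    if newest is None:
        # Missing or empty directory: treat as no backup at all.
        return False
    return (now - newest) <= max_age_hours * 3600


if __name__ == "__main__":
    # Hypothetical path: replace with your own index backup location.
    backup_dir = "/opt/alfresco/alf_data/backup-lucene-indexes"
    print("fresh" if backup_is_fresh(backup_dir) else "STALE OR MISSING")
```

A recent timestamp only shows that the snapshot ran. It does not replace a periodic test restore.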

Details on how to perform a restore from index backups are available here:

http://docs.alfresco.com/3.4/topic/com.alfresco.Enterprise_3_4_0.doc/tasks/restore-lucene-indexes.html

One strategy for reducing the time necessary to restore Alfresco is to dump the indexes more often than once per day. You can set the schedule to back up the indexes as frequently as you would like, but the backup temporarily prevents content from being indexed on the node doing the backup. It shouldn’t be a long pause, but the more often you do it the harder it will be for that node to catch up on its indexing.

Here is documentation on how to change the schedule for Lucene index backups:

http://docs.alfresco.com/3.4/topic/com.alfresco.Enterprise_3_4_0.doc/tasks/luceneindex-backup.html

Our support team also has an index checking tool that can help validate the consistency of the indexes. They normally only use it when diagnosing a problem, as it can take a long time to return and can affect system performance while it is running. If you do script a scheduled run of the tool, performance impacts can be managed by running it on a cluster instance that is not servicing user requests. I believe this checker is currently an Enterprise-only feature.

I don’t believe we have detailed documentation on the tool, but it is accessible on your Alfresco server by going here:

[host]:[port]/alfresco/service/enterprise/admin/indexcheck

Support also has a method for evaluating index performance which is documented here:

http://wiki.alfresco.com/current/index.php?title=Index_Merging_Performance

Be aware that the default settings for index merging are appropriate in most cases, and tweaking those settings is more likely to reduce performance than to help.

Because all of these maintenance steps will impact the performance of the Alfresco instance on which they are run, we recommend configuring Alfresco in an N+1 redundant cluster. During normal system operation, you can use the extra instance to perform maintenance of this type rather than have that server accepting a full load of users.

The next major release of Alfresco is code-named Project Swift and will likely be Alfresco Enterprise 4.0. One of our major architectural changes is to move to Solr as our search architecture. This should help with Lucene reliability, as well as give Alfresco customers access to the many search features that Solr adds to Lucene.

Feel free to add any additional tips you have for index health in the comments.

Thanks go to Andy Hunt for technical corrections on this blog post.


Alfresco Content Formats

Alfresco is format-agnostic. It will store any bytes you hand it, and along with storage you get powerful functionality such as permission enforcement, categorization, versioning, automation rules, workflow, check-in / check-out, metadata, and metadata search (among other capabilities). In addition to these core content-agnostic capabilities, Alfresco will also examine the MIME type to see if additional operations can be performed such as full content indexing, thumbnailing, transformation, automated metadata extraction, and content preview in Alfresco Share.

I am regularly asked to provide a list of formats for which Alfresco supports these extra capabilities. Unfortunately, such a list depends on the underlying technologies that we use and it changes very quickly. In this post I hope to provide some guidelines on how Alfresco will treat various formats. These lists are not exhaustive, as I have only listed formats that I have personally tested.

Alfresco is designed to be very extensible. If Alfresco does not have a capability that you would like for a specific format, it is worth checking with us to see if we are aware of an Alfresco Partner or community project that has provided the desired functionality.

Text

If Alfresco can convert a MIME type to text, it will extract that text and index it for full content searching. Share will also display a preview of the text, including formatting where possible. Alfresco uses OpenOffice for conversions that retain formatting, and Apache Tika to do conversions that require clean text. Apache Tika is also used for conversions from Word to HTML in Web QuickStart.

Formats:

  • Text based files (obviously) such as TXT, CSV, RTF, XML, and HTML. Single-byte and multi-byte encodings work fine.
  • File formats that are readable by OpenOffice, such as MS Office and Open XML formats (XLS, DOC, PPT, XLSX, DOCX, and PPTX), WordPerfect files (WPD), and Open Document files (ODS, ODT, ODP).
  • The text layer of PDF files (be aware that some PDF files just contain an image of the text).

Previews are created by transforming the content to a PDF using OpenOffice, then using the open source utility pdf2swf (from SWFTools) to convert it to a Flash movie. Because the preview depends on how well OpenOffice renders the original, the formatting for some MS Office documents will not be rendered correctly in the preview. In our Swift release it will be possible to incorporate a third-party transformer (such as MS Office) to help with these extractions and transformations.

Images

Alfresco can do image transformations, such as thumbnail generation. If Alfresco can convert an image format to PDF, then it will also display the image for preview in Share as described above. ImageMagick is the library used under the hood for transformations, and there are very few file formats that do not work.

Formats:

  • JPG
  • PNG
  • TIFF
  • GIF

Apache Tika is used for extracting metadata such as EXIF data.

Audio and Video

Alfresco Share does not currently have a previewer for audio and video, but will in the Swift release.

In the Swift release, one of three previewers will be selected depending on the content MIME type and browser capability. If the browser has Flash 10+ available, Strobe from the OSMF will be used. Flashfox will be used for older versions of Flash. If Flash is not available, or if neither player supports the content MIME type, HTML5 will be used to deploy the content. The available players will be configuration driven. Previews use progressive downloading so that playback can be started before the whole video has been downloaded.
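
The selection rule described above can be sketched as a small decision function. This is an illustration in Python, not the Share implementation: the MIME type tables are hypothetical placeholders (the real lists are configuration driven), and the fallback from Strobe to Flashfox when Flash 10+ is present is my reading of "neither player supports the content MIME type".

```python
# Hypothetical support tables; the actual lists come from configuration.
STROBE_TYPES = {"video/mp4", "video/x-flv", "audio/mpeg"}
FLASHFOX_TYPES = {"video/mp4", "video/x-flv"}


def choose_previewer(mime_type, flash_major_version=None):
    """Pick a previewer from the MIME type and the browser's Flash major version.

    flash_major_version is None when Flash is not available.
    """
    if flash_major_version is not None:
        # Flash 10+ can use Strobe when it handles this type.
        if flash_major_version >= 10 and mime_type in STROBE_TYPES:
            return "Strobe"
        # Any Flash version can fall back to Flashfox when it handles this type.
        if mime_type in FLASHFOX_TYPES:
            return "Flashfox"
    # No Flash, or neither Flash player supports the type.
    return "HTML5"
```

For example, `choose_previewer("video/mp4", 9)` selects Flashfox, while the same content in a browser without Flash falls through to HTML5.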

There are no transformers bundled for audio and video, but the Share Extras project shows how to define transformers based on FFmpeg, which we can’t ship due to license concerns.

Though I have seen the audio and video preview demoed, I haven’t played with it yet so I can’t tell you what video and audio formats will be supported. Based on the underlying libraries, I would expect common modern formats will work just fine.

Metadata Extraction

Most metadata extraction is done with Apache Tika. Only a limited number of extractors are wired out of the box, but the hooks are available for customers to wire additional extraction.

Unknown Formats

I hesitate to call them “unsupported” formats, because Alfresco can manage these files just fine (even really large AutoCAD files) but you won’t get previews or metadata extraction. Of course this is not a complete list, but I thought it would be helpful to specify common requests that we do not currently fulfill.

  • MS Visio
  • MS Project
  • AutoCAD (though a partner plugin can provide preview, thumbnails, and metadata)
  • iWork formats (though engineering is currently working on this one)

Closing

Though the details of this post will quickly become out of date, I hope that the information presented will continue to guide you in determining what capabilities Alfresco will have for a given format. Our open-source approach of leveraging the best underlying libraries the community has to offer means that our capabilities expand quickly. By looking at the tools we harness you can get a sense for what a specific release can do.

Feel free to list in the comments any additional formats you have tried so that others benefit. I apologize for the cumbersome comments system, as spammers have caused me problems in the past.


Alfresco Content Modeling Tips

Here are some tips for building custom content models in Alfresco:

Keep Properties in Custom Aspects

Alfresco provides developers with powerful tools for modeling content and automating functionality. Aspects are one of the biggest. Because Java developers are familiar with method annotations, most quickly comprehend how aspects allow a developer to cross-cut the content model and group custom properties into a bundle. However, aspects have the additional property that they can be added or removed programmatically. This makes aspects more flexible than traditional content types, which cannot be removed once applied to an asset.

For this reason, aspects can help our content model keep pace with the inevitable evolution of business needs. Next time you create a custom type, instead of defining the custom properties as part of that type, define them as part of a custom aspect that is mandatory for the custom type. In the future you can remove the aspect if you need to remove the properties (after migrating any information that needs preserving, of course).

Use CamelCase, Not Underscores

Left over from my pre-Java days is a fondness for underscores. I think underscores improve readability, especially with identifiers like RFI_Number. However, using underscores in aspect, type, or property names had unexpected consequences when the name was converted to a key for localization or display in a form field.

I asked internally about this, and I found a difference of opinion about whether underscores should be legal in a content model. Since there are some instances where underscores are used in the core product’s content model, any problems with underscores are probably bugs. But the convention of using CamelCase is predominant and it is what our automated tests cover. It is safest to use CamelCase as is common in the Java world.
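
If you have already drafted names with underscores, a small helper can suggest CamelCase equivalents. This is a minimal sketch in Python with no connection to Alfresco's own tooling; the `acme` prefix in the usage example is hypothetical, and the function is intended for names you are still drafting.

```python
def to_camel_case(name):
    """Join the underscore-separated parts of a local name, capitalizing each part after the first.

    Any namespace prefix before the last ':' is left untouched.
    """
    prefix, sep, local = name.rpartition(":")
    parts = [p for p in local.split("_") if p]
    if not parts:
        # Nothing usable after the prefix: return the input unchanged.
        return name
    rest = [p[0].upper() + p[1:] for p in parts[1:]]
    return prefix + sep + parts[0] + "".join(rest)
```

For example, `to_camel_case("acme:project_start_date")` gives `acme:projectStartDate`, and a name without underscores is returned unchanged.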

