Introducing Embed.ly client for .NET

First of all, if you haven’t heard of Embed.ly you really should check it out: http://embed.ly

At its core, Embedly is an oEmbed provider. oEmbed is a format for allowing an embedded representation of a URL on third party sites. The simple API allows a website to display embedded content (such as photos or videos) when a user posts a link to that resource, without having to parse the resource directly.

If you’ve ever posted a link on Facebook and been impressed that it automatically added a title, some descriptive text and one or more preview images to select from, or automatically included a playable video, and you want to build something like that into your own site, then this is for you.

Client libraries already existed for several other languages but none for .NET, so I developed this one. I’ve been using it on a forum app to automatically detect links to videos and images, build up a gallery of both, and make them playable within the post. You can also use embedly on the client via a jQuery plugin, but then you lose the ability to build up the gallery and index the additional content. If someone has posted a link to a Bob Dylan video then I’d like that post to be returned when someone searches for ‘Dylan’.

The response from embedly can also include a flag to indicate if the URL is considered ‘safe’ (based on Google’s safe-browsing API).

Example

Here is an example of the original content posted showing how the link is converted into a video and the additional information retrieved.

Original Post

The user makes a post and just copies and pastes a regular link

[Image: original]

Embedly Enhanced

The HTML is parsed and sanitized (using HtmlAgilityPack and a custom Html cleaning library) and the discovered URL is checked with Embedly. We told Embedly we wanted a preview of 640px maximum width, so the html snippet returned fits perfectly and shows a playable preview:

[Image: with-embedly]

Thumbnail Gallery

Embedly also returns static thumbnail images which are perfect to add to a gallery of content:

[Image: video-library]

Additional Content

As well as the html preview and thumbnail, embedly returns the title, description and other information, which can enhance the hosting page or make the content more searchable on our site:

[Image: video-preview]

Embedly provides a much richer experience to the end user.

So what does the .NET client do?

Basically, it provides an easy way to make requests to embedly and get strongly-typed results back. It automatically requests the details of the providers that embedly fully supports, and has a high-performance, regex-less way of matching URLs against them to see if they are supported (doing 500+ regex lookups against each URL is too slow when batch processing).
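The exact matching scheme is internal to the library, but to illustrate the idea, a hostname pre-filter along these lines avoids the per-URL regex scans (the shapes of Domain and Subdomains here are my assumptions based on the provider properties shown later):

// hypothetical sketch - index providers by host so most URLs are ruled in or
// out with a single lookup instead of running 500+ regexes against each one
var providersByHost = client.Providers
    .SelectMany(p => p.Subdomains.Concat(new[] { p.Domain }),
                (p, host) => new { host, provider = p })
    .ToLookup(x => x.host, x => x.provider);

bool possiblySupported = providersByHost[url.Host].Any();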

Requests to embedly can be filtered based on the provider information making it easy to limit requests to YouTube videos or Amazon products or perhaps any video or photo provider.

When requesting more than one URL the client will automatically batch them into a single HTTP request to embedly (which supports up to 20 URLs per request) and uses async downloading to handle the response without blocking or using valuable CPU time.

Finally, a caching mechanism helps avoid re-requesting URLs that you have recently checked. It operates at the URL level – the individual URL results are cached, not the entire embedly response (which could cover 20 URLs) – so if you requested 60 URLs and 40 had already been requested, only a single HTTP request would be sent to embedly, whatever sequence they were requested in.

The caching can be disabled completely if required, and there is also an InMemory cache provided as well as examples of an ADO / SQL Client cache and a MongoDB cache (which is the one I’m using myself).

What doesn’t it do?

At the moment it works for the base oEmbed endpoint only but I plan on adding support for the Preview and Objectify endpoints in the future.

Where do I get it?

You can download the source from GitHub: https://github.com/CaptainCodeman/embedly-dotnet or get a binary version as a NuGet package:

http://nuget.org/List/Packages/embedly

NOTE: I’ll probably be splitting the NuGet version into a core / base package and separate cache providers to avoid bloating the dependencies.

How do I use it?

The source includes a sample project showing some of the ways you can use it but I’ll give a brief summary here.

Create a client

All requests go through the client which, at a minimum, needs an embedly account key, which you can store however you want (the sample shows it stored in a .config file using the standard .NET ConfigurationManager). You can sign up for a free account at http://embed.ly/pricing to get a key.

var key = ConfigurationManager.AppSettings["embedly.key"]; 
var client = new Client(key);

Use a Cache

If you want to use a cache then this should be passed into the client constructor. Here’s an example using the MongoDB cache:

var key = ConfigurationManager.AppSettings["embedly.key"]; 
var database = ConfigurationManager.ConnectionStrings["embedly.cache"]; 
var cache = new MongoResponseCache(database.ConnectionString); 
var client = new Client(key, cache);

The final optional parameter when creating a client is the embedly request timeout. If the HTTP request to embedly takes longer than this, it is aborted and an exception is returned instead of the embedly result. The default timeout is 30 seconds.
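For illustration, a hedged sketch – the exact type of the timeout argument is my assumption, so check the constructor overloads:

// assumes the timeout is passed as a TimeSpan in the final constructor argument
var client = new Client(key, cache, TimeSpan.FromSeconds(10));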

List Embed.ly Providers

Once you have a client then you can see the list of providers that embedly supports:

foreach (var provider in client.Providers) 
{ 
    Console.WriteLine("{0} {1}", provider.Type, provider.Name); 
}

Check if a URL is supported:

Embed.ly supports over 200 different providers (all the big names like YouTube), although it will return results for non-provider-backed requests too.

var url = new Uri(@"http://www.youtube.com/watch?v=YwSZvHqf9qM"); 
var supported = client.IsUrlSupported(url);

Get provider information for a URL:

You can get the provider for a URL (this does not make any additional requests to embedly beyond the initial retrieval of the provider list itself).

var url = new Uri(@"http://www.youtube.com/watch?v=YwSZvHqf9qM"); 
var supported = client.IsUrlSupported(url); 
Console.WriteLine("Supported      : {0}", supported); 
Console.WriteLine();

var provider = client.GetProvider(url); 
Console.WriteLine("PROVIDER"); 
Console.WriteLine("About          : {0}", provider.About); 
Console.WriteLine("DisplayName    : {0}", provider.DisplayName); 
Console.WriteLine("Domain         : {0}", provider.Domain); 
Console.WriteLine("Favicon        : {0}", provider.Favicon); 
Console.WriteLine("Name           : {0}", provider.Name); 
Console.WriteLine("Regexs         : {0}", string.Join(", ", provider.Regexs)); 
Console.WriteLine("Subdomains     : {0}", string.Join(", ", provider.Subdomains)); 
Console.WriteLine("Types          : {0}", provider.Type);

Get the oEmbed information for a single URL:

The API supports single URL requests.

var url = new Uri(@"http://www.youtube.com/watch?v=YwSZvHqf9qM"); 
var result = client.GetOEmbed(url, new RequestOptions { MaxWidth = 320 }); 

// basic response information 
var response = result.Response; 
Console.WriteLine("Type           : {0}", response.Type); 
Console.WriteLine("Version        : {0}", response.Version);

// link details 
var link = result.Response.AsLink; 
Console.WriteLine("Author         : {0}", link.Author); 
Console.WriteLine("AuthorUrl      : {0}", link.AuthorUrl); 
Console.WriteLine("CacheAge       : {0}", link.CacheAge); 
Console.WriteLine("Description    : {0}", link.Description); 
Console.WriteLine("Provider       : {0}", link.Provider); 
Console.WriteLine("ProviderUrl    : {0}", link.ProviderUrl); 
Console.WriteLine("ThumbnailHeight: {0}", link.ThumbnailHeight); 
Console.WriteLine("ThumbnailUrl   : {0}", link.ThumbnailUrl); 
Console.WriteLine("ThumbnailWidth : {0}", link.ThumbnailWidth); 
Console.WriteLine("Title          : {0}", link.Title); 
Console.WriteLine("Url            : {0}", link.Url);

// video specific details 
var video = result.Response.AsVideo; 
Console.WriteLine("Width          : {0}", video.Width); 
Console.WriteLine("Height         : {0}", video.Height); 
Console.WriteLine("Html           : {0}", video.Html);

Get oEmbed information for a list of URLs:

Any IEnumerable<Uri> list of URLs can be processed as a batch. The .NET client will return results as they arrive.

var results = client.GetOEmbeds(urls, new RequestOptions { MaxWidth = 320 });

Limit the URLs to request to supported providers only:

Embedly can return results for ‘unsupported’ providers, but the supported ones typically have richer content.

var results = client.GetOEmbeds(urls, provider => provider.IsSupported);

Limit the URLs to request to a single provider:

A lambda expression enables the request to be filtered on any property of the provider identified for a URL.

var results = client.GetOEmbeds(urls, provider => provider.Name == "youtube");

Limit the URLs to request based on the type of provider:

Each provider has a Type to indicate the content they return so if you are only interested in video links you can filter on that type.

var results = client.GetOEmbeds(urls, provider => provider.Type == ProviderType.Video);

NOTE: ‘urls’ is an IEnumerable<Uri> in the above.

NOTE: RequestOptions enables a number of additional request arguments to be specified, see: http://embed.ly/docs/endpoints/arguments
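For illustration, a hedged sketch – MaxWidth appears in the samples above, while the other property name is my assumption based on the embedly argument list:

var options = new RequestOptions
{
    MaxWidth = 640,
    MaxHeight = 480   // assumed property mapping to embedly's 'maxheight' argument
};
var results = client.GetOEmbeds(urls, options);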

The Result returned contains the original request (URL and any matching provider), an Exception (if the HTTP request failed) or a Response, which could be an embedly Error (used, for instance, to indicate that the URL being inspected doesn’t exist) or one of the specific response types (Link, Photo, Rich and Video).

Convenience extension methods enable the results to be filtered:

result.Success()

Returns only the results that were successful

result.Failed()

Returns results that failed (HTTP error during request to embedly)

result.Errors()

Returns results where embedly responded with an error code, i.e. the request to embedly was successful but perhaps the URL doesn’t exist

result.Link()

Returns results that are of type Link

result.Photos()

Returns results that are of type Photo

result.Richs()

Returns results that are of type Rich

result.Videos()

Returns results that are of type Video
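For example, combining a batch request with the Videos() filter to list the playable results:

foreach (var result in results.Videos())
{
    var video = result.Response.AsVideo;
    Console.WriteLine("{0} ({1}x{2})", video.Title, video.Width, video.Height);
}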

If you are iterating over multiple results and want to handle them correctly then the first step is to check each result’s Exception property. If there was an exception during the HTTP request to embedly then this will be set. If it is null then the request to embedly was successful, in that embedly returned a response, but that response may be an Error, a Link, a Photo, a Rich or a Video. The Response.Type will indicate the response type, and the As[type] property is a convenient way to get the Response as the particular type.

foreach (var result in results) 
{ 
    if (result.Exception == null) 
    { 
        Console.WriteLine("{0} found for {1} ({2})", result.Response.Type, result.Request.Url, result.Request.Provider.Name); 
        switch (result.Response.Type) 
        { 
            case ResourceType.Error: 
                var error = result.Response.AsError; 
                Console.WriteLine("  code:{0} message:{1}", error.ErrorCode, error.ErrorMessage); 
                break; 
            case ResourceType.Link: 
                var link = result.Response.AsLink; 
                Console.WriteLine("  title:{0}", link.Title); 
                Console.WriteLine("  url:{0}", link.Url); 
                break; 
            case ResourceType.Photo: 
                var photo = result.Response.AsPhoto; 
                Console.WriteLine("  title:{0} ({1}x{2})", photo.Title, photo.Width, photo.Height); 
                Console.WriteLine("  url:{0}", photo.Url); 
                break; 
            case ResourceType.Rich: 
                var rich = result.Response.AsRich; 
                Console.WriteLine("  title:{0} ({1}x{2})", rich.Title, rich.Width, rich.Height); 
                Console.WriteLine("  url:{0}", rich.Url); 
                break; 
            case ResourceType.Video: 
                var video = result.Response.AsVideo; 
                Console.WriteLine("  title:{0} ({1}x{2})", video.Title, video.Width, video.Height); 
                Console.WriteLine("  url:{0}", video.Url); 
                break; 
        } 
    } 
    else 
    { 
        Console.WriteLine("Exception requesting {0} : {1}", result.Request.Url, result.Exception);                
    } 
}

Logging

The library uses the Common.Logging 2 library so you can plug it into whatever your preferred logging framework is. The log output isn’t very rich right now, but I’ll be expanding that in future so you can peek into what is happening.

Reactive Extensions

The other dependency is the Reactive Extensions, which I’m new to, but it made caching the individual URL responses much easier than it would otherwise be. The push vs pull model allows the pipeline to be split, with cached items going to the return pipeline immediately and non-cached requests going through the full download pipeline. I’ll try and make a further post describing how this works.
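In the meantime, here’s a rough sketch of the idea (not the library’s actual code – the cache type and the DownloadBatch helper are hypothetical stand-ins, and current Rx operator names are used):

// split the URL stream: cache hits are pushed to the results immediately,
// misses are batched (embedly accepts up to 20 URLs per request) and downloaded
IObservable<Result> GetResults(IObservable<Uri> urls, IResponseCache cache)
{
    var hits = urls.Where(url => cache.Contains(url))
                   .Select(url => cache.Get(url));
    var misses = urls.Where(url => !cache.Contains(url))
                     .Buffer(20)
                     .SelectMany(batch => DownloadBatch(batch));
    return hits.Merge(misses); // recombine into a single result stream
}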

Roadmap

I’d like to add support for the other embedly endpoints (Preview and Objectify) although I’m not using them myself at the moment – let me know if you’d find these useful.

Some custom Windows Performance Counters would probably be good to track how many requests are going through the library and what the cache-hit ratio is.

The current caching system is very simple and doesn’t have much support for expiring items which should be added.

Feedback

If you find the library useful or have any comments or suggestions to improve things I’d welcome any feedback.

Running MongoDb on Microsoft Windows Azure with CloudDrive

I’ve been playing around with the whole CQRS approach and think MongoDb works really well for the query side of things. I also figured it was time I tried Azure, so I had a look round the web to see if there were instructions on how to run MongoDb on Microsoft’s Azure cloud. It turned out there were only a few mentions of it, or a general approach that should work, but no detailed instructions on how to do it. So, I figured I’d give it a go and, for a total-Azure-newbie, it didn’t turn out to be too difficult.

Obviously you’ll need an Azure account, which you may get with MSDN, or you can sign up for their ‘free’ account which has a limited number of hours included before you have to start paying. One thing to be REALLY careful of though – just deploying an app to Azure starts the clock running, and leaving it deployed but turned off still counts as hours, so be sure to delete any experimental deployments you make after trying things out!!

First of all though, it’s important to understand where MongoDb would fit with Azure. Each web or worker role runs as a virtual machine which has an amount of local storage included depending on the size of the VM; currently the four pre-defined VMs are:

  • Small: 1 core processor, 1.7GB RAM, 250GB hard disk
  • Medium: 2 core processors, 3.5GB RAM, 500GB hard disk
  • Large: 4 core processors, 7GB RAM, 1000GB hard disk
  • Extra Large: 8 core processors, 15GB RAM, 2000GB hard disk

This local storage is only temporary though: while it can be used for processing by the role instance running, it isn’t available to any other instance, and when the instance is moved, upgraded or recycled it is lost forever (as in, gone for good).

For permanent storage Azure offers SQL-type databases (which we’re not interested in), Table storage (which would be an alternative to MongoDb but harder to query and with more limitations) and Blob storage.

We’re interested in Blob storage or more specifically Page-Blobs which support random read-write access … just like a disk drive. In fact, almost exactly like a disk drive because Azure provides a new CloudDrive which uses a VHD drive image stored as a Page-Blob (so it’s permanent) and can be mounted as a disk-drive within an Azure role instance.

The VHD images can range from 16 MB to 1 TB, and apparently you only pay for the storage that is actually used, not the zeroed bytes (although I haven’t tested this personally).

So, let’s look at the code to create a CloudDrive, mount it in an Azure worker role and run MongoDb as a process that can use the mounted CloudDrive for its permanent storage, so that everything is kept between machine restarts. We’ll also create an MVC role to test direct connectivity to MongoDb between the two VMs using internal endpoints, so that we don’t incur charges for Queue storage or Service Bus messages.

The first step is to create a ‘Windows Azure Cloud Service’ project in Visual Studio 2010 and add both an MVC 2 and Worker role to it.

We will need a copy of the mongod.exe to include in the worker role so just drag and drop that to the project and set it to be Content copied when updated. Note that the Azure VMs are 64-bit instances so you need the 64-bit Windows version of MongoDb.

We’ll also need to add a reference to the .NET MongoDb client library to the web role. I’m using the mongodb-csharp one but you can use one of the others if you prefer.

Our worker role needs a connection to the Azure storage account, which we’re going to call ‘MongoDbData’.

The other configured setting that we need to define is some local storage allocated as a cache for use with the CloudDrive; we’ll call this ‘MongoDbCache’. For this demo we’re going to create a 4 GB cache which will match the 4 GB drive we’ll create for MongoDb data. I haven’t played enough to evaluate performance yet, but from what I understand this cache acts a little like the write-cache that you can turn on for your local hard drive.

The last piece before we can crack on with some coding is to define an endpoint, which is how the Web Role / MVC app will communicate with the MongoDb server on the Worker Role. This basically tells Azure that we’d like an IP address and a port to use, and it makes sure that we can use it and no one else can. It should be possible to make the endpoint public to the world if you wanted, but that isn’t the purpose of this demo. The endpoint is called ‘MongoDbEndpoint’ and set to Internal / TCP:
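For reference, in ServiceDefinition.csdef the storage setting, local cache and endpoint described above look something like this (a sketch – the exact schema depends on your SDK version):

<WorkerRole name="WorkerRole1">
  <ConfigurationSettings>
    <Setting name="MongoDbData" />
  </ConfigurationSettings>
  <LocalResources>
    <LocalStorage name="MongoDbCache" sizeInMB="4096" cleanOnRoleRecycle="false" />
  </LocalResources>
  <Endpoints>
    <InternalEndpoint name="MongoDbEndpoint" protocol="tcp" />
  </Endpoints>
</WorkerRole>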

Now for the code and first we’ll change the WorkerRole.cs file in the WorkerRole1 project (as you can see, I put a lot of effort into customizing the project names!). We’re going to need to keep a reference to the CloudDrive that we’re mounting and also the MongoDb process that we’re going to start so that we can shut them down cleanly when the instance is stopping:

private CloudDrive _mongoDrive;
private Process _mongoProcess;

In the OnStart() method I’ve added some code copied from the Azure SDK Thumbnail sample – this prepares the CloudStorageAccount configuration so that we can use the method CloudStorageAccount.FromConfigurationSetting() to load the details from configuration (this just makes it easier to switch to using the Dev Fabric on our local machine without changing code). I’ve also added a call to StartMongo() and created an OnStop() method which simply closes the MongoDb process and unmounts the CloudDrive when the instance is stopping:

public override bool OnStart()
{
    // Set the maximum number of concurrent connections
    ServicePointManager.DefaultConnectionLimit = 12;

    DiagnosticMonitor.Start("DiagnosticsConnectionString");

    #region Setup CloudStorageAccount Configuration Setting Publisher

    // This code sets up a handler to update CloudStorageAccount instances when their corresponding
    // configuration settings change in the service configuration file.
    CloudStorageAccount.SetConfigurationSettingPublisher((configName, configSetter) =>
    {
        // Provide the configSetter with the initial value
        configSetter(RoleEnvironment.GetConfigurationSettingValue(configName));

        RoleEnvironment.Changed += (sender, arg) =>
        {
            if (arg.Changes.OfType<RoleEnvironmentConfigurationSettingChange>()
                .Any((change) => (change.ConfigurationSettingName == configName)))
            {
                // The corresponding configuration setting has changed, propagate the value
                if (!configSetter(RoleEnvironment.GetConfigurationSettingValue(configName)))
                {
                    // In this case, the change to the storage account credentials in the
                    // service configuration is significant enough that the role needs to be
                    // recycled in order to use the latest settings. (for example, the
                    // endpoint has changed)
                    RoleEnvironment.RequestRecycle();
                }
            }
        };
    });
    #endregion

    // For information on handling configuration changes
    // see the MSDN topic at http://go.microsoft.com/fwlink/?LinkId=166357.
    RoleEnvironment.Changing += RoleEnvironmentChanging;

    StartMongo();

    return base.OnStart();
}

public override void OnStop()
{
    _mongoProcess.Close();
    _mongoDrive.Unmount();

    base.OnStop();
}

Next is the code to create the CloudDrive and start the MongoDb process running:

private void StartMongo()
{
    // local cache drive we'll use on the VM
    LocalResource localCache = RoleEnvironment.GetLocalResource("MongoDbCache");

    Trace.TraceInformation("MongoDbCache {0} {1}", localCache.RootPath, localCache.MaximumSizeInMegabytes);
    // we'll use all the cache space we can (note: InitializeCache doesn't work with trailing slash)
    CloudDrive.InitializeCache(localCache.RootPath.TrimEnd('\\'), localCache.MaximumSizeInMegabytes);

    // connect to the storage account
    CloudStorageAccount storageAccount = CloudStorageAccount.FromConfigurationSetting("MongoDbData");

    // client for talking to our blob files
    CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();

    // the container that our drive is going to live in
    CloudBlobContainer drives = blobClient.GetContainerReference("drives");

    // create blob container (it has to exist before creating the cloud drive)
    try {drives.CreateIfNotExist();} catch {}

    // get the url to the vhd page blob we'll be using
    var vhdUrl = blobClient.GetContainerReference("drives").GetPageBlobReference("MongoDb.vhd").Uri.ToString();
    Trace.TraceInformation("MongoDb.vhd {0}", vhdUrl);

    // create the cloud drive
    _mongoDrive = storageAccount.CreateCloudDrive(vhdUrl);
    try
    {
        _mongoDrive.Create(localCache.MaximumSizeInMegabytes);
    }
    catch (CloudDriveException)
    {
        // thrown if the drive already exists - safe to ignore
    }

    // mount the drive and get the root path of the drive it's mounted as
    var dataPath = _mongoDrive.Mount(localCache.MaximumSizeInMegabytes, DriveMountOptions.Force) + @"\";
    Trace.TraceInformation("Mounted as {0}", dataPath);

    // get the internal endpoint that we're going to use for MongoDb
    var ep = RoleEnvironment.CurrentRoleInstance.InstanceEndpoints["MongoDbEndpoint"];

    // create the process to host mongo
    _mongoProcess = new Process();
    var startInfo = _mongoProcess.StartInfo;
    // so we can redirect streams
    startInfo.UseShellExecute = false;
    // we don't need a window, it's hard to see the monitor from here (jk)
    startInfo.CreateNoWindow = true;
    // the mongo daemon is included in our project in the current directory
    startInfo.FileName = @"mongod.exe";
    startInfo.WorkingDirectory = Environment.CurrentDirectory;
    // specify the ip address and port for MongoDb to use and also the path to the data
    startInfo.Arguments = string.Format(@"--bind_ip {0} --port {1} --dbpath {2} --quiet", ep.IPEndpoint.Address, ep.IPEndpoint.Port, dataPath);
    // capture mongo output to Azure log files
    startInfo.RedirectStandardError = true;
    startInfo.RedirectStandardOutput = true;
    _mongoProcess.ErrorDataReceived += (sender, evt) => WriteLine(evt.Data);
    _mongoProcess.OutputDataReceived += (sender, evt) => WriteLine(evt.Data);

    Trace.TraceInformation("Mongo Process {0}", startInfo.Arguments);

    // start mongo going
    _mongoProcess.Start();
    _mongoProcess.BeginErrorReadLine();
    _mongoProcess.BeginOutputReadLine();
}


So, that’s the server-side, oops, I mean Worker Role setup which will now run MongoDb and persist the data permanently. We could get fancier and have multiple roles with slave / sharded instances of MongoDb but they will follow a similar pattern.

The client-side in the Web Role MVC app is very simple; the only extra work we need to do is figure out the IP address and port to use when connecting to MongoDb, which Azure sets up for us. The RoleEnvironment lets us get to this, and I believe (but could be wrong, so don’t quote me) that the App Fabric part of Azure handles the communication between roles to pass this information. Once we have it we can create our connection to MongoDb as normal and save NoSQL JSON documents to our heart’s content …

var workerRoles = RoleEnvironment.Roles["WorkerRole1"];
var workerRoleInstance = workerRoles.Instances[0];
RoleInstanceEndpoint ep = workerRoleInstance.InstanceEndpoints["MongoDbEndpoint"];

string connectionString = string.Format("Server={0}:{1}", ep.IPEndpoint.Address, ep.IPEndpoint.Port);

var mongo = new Mongo(connectionString);
mongo.Connect();
var db = mongo.GetDatabase("notes");

I hope you find this useful. I’ll try and add some extra notes to explain the code and the thinking behind it in more detail and will post some follow ups to cover deploying the app to Azure and what I’ve learned of that process.

RenderSubAction alternative to RenderAction for Sub-Controllers in MVC

The ASP.NET MVC Futures assembly contains several RenderAction extension methods for HtmlHelper to allow another action to be rendered at some point within a view. Typically, this allows each controller to handle different responsibilities rather than things being combined into the parent.

So, for example, a PersonController is responsible for retrieving and assembling the model to represent a Person and passing it to the View for rendering, but it should not handle Contacts – the display and CRUD operations on contacts should be handled by a ContactController, and RenderAction is a convenient way to insert a list of contacts for a person into the person display view.

So, we have a PersonController which will retrieve a Person model and pass it to the Display view. Inside this Display view, we have a call to render a list of contacts for that person:

<% Html.RenderSubAction("List", "Contact", new { personId = Model.Id }); %>

I’ve come across two problems when using this though:

1. If the parent controller action requested uses the HTTP POST method then the controller action picked up for all child actions will also be the POST version (if there is one). This is rarely the desired behavior though – I’d only expect to be sending a POST to the ContactController when I want to change something related to a contact and not when updating a person.

2. If the [ValidateInput(false)] attribute is used to allow HTML code to be posted (imagine a ‘Biography’ field on Person with a nice WYSIWYG TinyMCE Editor control …) then the request will fail unless all the child actions are also marked with the same attribute. I would prefer to mark only the methods where I specifically want a POST request containing HTML input to be allowed, as in the sketch below.
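For context, here’s roughly what the one action I do want to accept HTML looks like (a sketch with hypothetical names):

public class PersonController : Controller
{
    // only this POST action should need to accept HTML (the Biography field)
    [AcceptVerbs(HttpVerbs.Post)]
    [ValidateInput(false)]
    public ActionResult Edit(int id, FormCollection form)
    {
        // ... update the person, including the HTML Biography ...
        return RedirectToAction("Display", new { id });
    }
}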

So, I created a set of alternative RenderSubAction extension methods which address both these issues:

1. Whatever the HTTP method used for the parent action, the routing will match the GET version for child actions called.

2. The state of the [ValidateInput()] attribute will be set on all child actions called.

The code is below … just reference the namespace that you put it in within your web.config file and then change the RenderAction method to RenderSubAction – the method signatures are identical so it is a drop-in replacement.
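For example, if the extension class lives in a hypothetical MyApp.Helpers namespace, registering it for all views looks like:

<system.web>
  <pages>
    <namespaces>
      <add namespace="MyApp.Helpers" />
    </namespaces>
  </pages>
</system.web>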

I’d be interested in any feedback on this approach.

public static class HtmlHelperExtensions
{
    public static void RenderSubAction<TController>(this HtmlHelper helper,
        Expression<Action<TController>> action) where TController : Controller
    {
        RouteValueDictionary routeValuesFromExpression = ExpressionHelper
            .GetRouteValuesFromExpression(action);
        helper.RenderRoute(routeValuesFromExpression);
    }

    public static void RenderSubAction(this HtmlHelper helper, string actionName)
    {
        helper.RenderSubAction(actionName, null);
    }

    public static void RenderSubAction(this HtmlHelper helper, string actionName, string controllerName)
    {
        helper.RenderSubAction(actionName, controllerName, null);
    }

    public static void RenderSubAction(this HtmlHelper helper, string actionName, string controllerName,
        object routeValues)
    {
        helper.RenderSubAction(actionName, controllerName, new RouteValueDictionary(routeValues));
    }

    public static void RenderSubAction(this HtmlHelper helper, string actionName, string controllerName,
        RouteValueDictionary routeValues)
    {
        // merge the current route values with any supplied, then override action / controller
        RouteValueDictionary dictionary = routeValues != null
            ? new RouteValueDictionary(routeValues)
            : new RouteValueDictionary();
        foreach (var pair in helper.ViewContext.RouteData.Values)
        {
            if (!dictionary.ContainsKey(pair.Key))
            {
                dictionary.Add(pair.Key, pair.Value);
            }
        }
        if (!string.IsNullOrEmpty(actionName))
        {
            dictionary["action"] = actionName;
        }
        if (!string.IsNullOrEmpty(controllerName))
        {
            dictionary["controller"] = controllerName;
        }
        helper.RenderRoute(dictionary);
    }

    public static void RenderRoute(this HtmlHelper helper, RouteValueDictionary routeValues)
    {
        var routeData = new RouteData();
        foreach (var pair in routeValues)
        {
            routeData.Values.Add(pair.Key, pair.Value);
        }
        HttpContextBase httpContext = new OverrideRequestHttpContextWrapper(HttpContext.Current);
        var context = new RequestContext(httpContext, routeData);
        // propagate the parent's request-validation setting to the child action
        bool validateRequest = helper.ViewContext.Controller.ValidateRequest;
        new RenderSubActionMvcHandler(context, validateRequest).ProcessRequestInternal(httpContext);
    }

    #region Nested type: RenderSubActionMvcHandler

    private class RenderSubActionMvcHandler : MvcHandler
    {
        private bool _validateRequest;

        public RenderSubActionMvcHandler(RequestContext context, bool validateRequest)
            : base(context)
        {
            _validateRequest = validateRequest;
        }

        protected override void AddVersionHeader(HttpContextBase httpContext) {}

        public void ProcessRequestInternal(HttpContextBase httpContext)
        {
            AddVersionHeader(httpContext);
            string requiredString = RequestContext.RouteData.GetRequiredString("controller");
            IControllerFactory controllerFactory = ControllerBuilder.Current.GetControllerFactory();
            IController controller = controllerFactory.CreateController(RequestContext, requiredString);
            if (controller == null)
            {
                throw new InvalidOperationException(string.Format(CultureInfo.CurrentUICulture,
                    "The IControllerFactory '{0}' did not return a controller for a controller named '{1}'.",
                    new object[] { controllerFactory.GetType(), requiredString }));
            }
            try
            {
                ((ControllerBase) controller).ValidateRequest = _validateRequest;
                controller.Execute(RequestContext);
            }
            finally
            {
                controllerFactory.ReleaseController(controller);
            }
        }
    }

    private class OverrideHttpMethodHttpRequestWrapper : HttpRequestWrapper
    {
        public OverrideHttpMethodHttpRequestWrapper(HttpRequest httpRequest)
            : base(httpRequest)
        {
        }

        // always report GET so child actions route to the GET version
        public override string HttpMethod
        {
            get { return "GET"; }
        }
    }

    private class OverrideRequestHttpContextWrapper : HttpContextWrapper
    {
        private readonly HttpContext _httpContext;

        public OverrideRequestHttpContextWrapper(HttpContext httpContext)
            : base(httpContext)
        {
            _httpContext = httpContext;
        }

        public override HttpRequestBase Request
        {
            get { return new OverrideHttpMethodHttpRequestWrapper(_httpContext.Request); }
        }
    }

    #endregion
}


NHibernate.Search using Lucene.NET Full Text Index (3)

In Part 1 we looked at how to create a full-text index of NHibernate persisted domain objects using the Lucene.NET project. Part 2 then looked at how to query the index complete with query-parsing and hit-highlighting of the results.

Now that we have a full-text index there are other things that we can use it for. The easiest and most useful is probably adding a ‘similar items’ feature, where the system can automatically display related items based on the text that they share in common. While it isn’t exact, the results are often surprisingly good. A human editor could probably pick out some links with more finesse, but that quickly becomes an impossible task as the number of items grows – and the human will typically resort to searching for similar items using the index anyway, so why not automate it?!

This feature can be used to display related web pages or blog entries or, in this case, related books. It probably isn’t too far off from the system that Amazon uses. The benefit is that as new content is being added, the top related items can constantly be updated – even for existing items in the system. So, for example, if a new Harry Potter book is released then the existing books can immediately start linking to it and vice-versa or if a company starts offering a new training course or product then any related pages will immediately start to link together.

While it sounds complicated, it is actually quite easy thanks to the contrib assemblies provided with Lucene.NET. In fact, it’s so simple it’s almost trivial so this won’t be a long post!

First, we need to add a new reference to the SimilarityNet.dll assembly (part of Lucene.NET contrib). This provides a SimilarityQueries class which contains a FormSimilarQuery method. Calling this with a piece of text (from an existing field), an analyzer and the field name will produce a boolean query using every unique word, where all words are optional. If we repeat this with each field, boosting the relevance of the most important ones (such as the title), then we end up with a query that will look for every word in each field of the original item.

To quote the Lucene documentation:

The philosophy behind this method is “two documents are similar if they share lots of words”. Note that behind the scenes, Lucene’s scoring algorithm will tend to give two documents a higher similarity score if they share more uncommon words.

What this means in practice is that the more unique a word is, the more likely it will be taken into account when ranking the similar items. So, if our original book has ‘Agile’ in the title and words such as ‘scrum’ and ‘backlog’ in the summary then chances are we will find other books that also have these more unique words … and it’s very likely that they will be related to our original book.

Of course, when we search our index for books with all these words there is going to be one obvious match – the original book! In fact, this should be the first result returned, so we could either skip it when creating the result-set (looking for the same unique Id rather than just skipping the first result, to be safe) or, as in the example below, use a boolean search and specifically exclude the Id of the source item from the query. I haven’t experimented to see which one is quicker, but I prefer to let Lucene do all the work – I trust it, and it saves me writing more code or getting back results that I am just going to discard, which feels wrong.

Here is the code to find the best 4 similar matches to any book passed in. Note that I include the Authors and Publisher fields when doing the comparison so it will tend to favour books by the same author or publisher – you will need to experiment to see what makes most sense for your application and usage.

/// <summary>
/// Gets similar books.
/// </summary>
/// <param name="book">The book.</param>
/// <returns></returns>
public override IList<IBook> GetSimilarBooks(IBook book)
{
    IFullTextSession session = (IFullTextSession)NHibernateHelper.GetCurrentSession();
    Analyzer analyzer = new StandardAnalyzer();
    BooleanQuery query = new BooleanQuery();

    Query title = Similarity.Net.SimilarityQueries.FormSimilarQuery(book.Title, analyzer, "Title", null);
    title.SetBoost(10);
    query.Add(title, BooleanClause.Occur.SHOULD);

    if (book.Summary != null) {
        Query summary =
            Similarity.Net.SimilarityQueries.FormSimilarQuery(book.Summary, analyzer, "Summary", null);
        summary.SetBoost(5);
        query.Add(summary, BooleanClause.Occur.SHOULD);
    }

    if (book.Authors != null) {
        Query authors =
            Similarity.Net.SimilarityQueries.FormSimilarQuery(book.Authors, analyzer, "Authors", null);
        query.Add(authors, BooleanClause.Occur.SHOULD);
    }

    if (book.Publisher != null) {
        Query publisher =
            Similarity.Net.SimilarityQueries.FormSimilarQuery(book.Publisher, analyzer, "Publisher", null);
        query.Add(publisher, BooleanClause.Occur.SHOULD);
    }
    // avoid the book being similar to itself!
    query.Add(new TermQuery(new Term("Id", book.Id.ToString())), BooleanClause.Occur.MUST_NOT);

    IQuery nhQuery = session.CreateFullTextQuery(query, new Type[] { typeof(Book) })
                            .SetMaxResults(4);

    IList<IBook> books = nhQuery.List<IBook>();
    return books;
}


That about wraps it up for using NHibernate and Lucene. I’m expecting things to change when the new NHibernate version 2.0 is released, so I’ll probably post again with any updates when it is. Also, there are a few other features available in Lucene which I may blog about, such as using Synonyms for the ‘did you mean …’ type suggestions.

Please let me know if there is anything that I haven’t explained particularly well or you would like to see more about.

NHibernate.Search using Lucene.NET Full Text Index (2)

In NHibernate.Search using Lucene.NET Full Text Index (Part 1) we looked at setting up the NHibernate.Search extension to add full-text searching of NHibernate-persisted objects.

Next, we’ll look at how we can perform Google-like searches using the index and some tips on displaying the results including highlighting the search-terms.

Our Book class has the Title, Summary, Authors and Publisher fields indexed, so we’ll allow searching in any of these fields. However, if a search-term exists in the title it is probably more relevant than if it just exists in the summary, so we want to give more priority to certain fields than to others. Likewise, we probably want to be able to specify which fields to search on; otherwise we would get books that merely mention “Martin Fowler” in the summary when we may only want books that have “Martin Fowler” as an author, for example.

Also worth mentioning is the Summary field. In the Book class there is a SummaryHtml field which (you’ll never guess) contains the Html summary retrieved from Amazon, and also a Summary field which is the one that is actually indexed. In the full app this plain-text field is generated from the Html content by stripping the markup. The reason we want a version of the Summary in plain text is to make indexing easier / more accurate (no HTML tags) and also to allow result fragments to be created: imagine if a section of the SummaryHtml was output – it could potentially split across an Html element or attribute (producing invalid markup) or include the opening tag but not the matching closing one (producing runaway bold-text, for instance).
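The original doesn’t show the conversion helper it uses; as one hedged possibility, HtmlAgilityPack can produce the plain text:

// a minimal sketch: convert the stored Html summary to plain text for indexing
// (assumes HtmlAgilityPack; the post doesn't name the helper it actually uses)
using HtmlAgilityPack;

static string HtmlToPlainText(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    // InnerText drops the tags; DeEntitize converts &amp; etc. back to characters
    return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
}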

Back to our example though. To be able to show the highlighted search terms in the results I found it easier to create a special BookSearchResult class that I can return from the data provider – the highlighting is something Lucene.NET can do for us and avoids us having to write our own presentation code to handle it. Here is the class:

/// <summary>
/// A wrapper for a book object returned from a full text index query
/// with additional properties for highlighted segments
/// </summary>
public class BookSearchResult : IBookSearchResult
{
    private readonly IBook _book;
    private string _highlightedTitle;
    private string _highlightedSummary;
    private string _highlightedAuthors;
    private string _highlightedPublisher;

    /// <summary>
    /// Initializes a new instance of the <see cref="BookSearchResult"/> class.
    /// </summary>
    /// <param name="book">The book.</param>
    public BookSearchResult(IBook book)
    {
        _book = book;
    }

    /// <summary>
    /// Gets the book.
    /// </summary>
    /// <value>The book.</value>
    public IBook Book
    {
        get { return _book; }
    }

    /// <summary>
    /// Gets or sets the highlighted title.
    /// </summary>
    /// <value>The highlighted title.</value>
    public string HighlightedTitle
    {
        get
        {
            if (_highlightedTitle == null || _highlightedTitle.Length == 0)
            {
                return _book.Title;
            }
            return _highlightedTitle;
        }
        set { _highlightedTitle = value; }
    }

    /// <summary>
    /// Gets or sets the highlighted summary.
    /// </summary>
    /// <value>The highlighted summary.</value>
    public string HighlightedSummary
    {
        get
        {
            if (_highlightedSummary == null || _highlightedSummary.Length == 0)
            {
                if (_book.Summary == null || _book.Summary.Length < 300)
                {
                    return _book.Summary;
                }
                else
                {
                    return _book.Summary.Substring(0,300) + " ...";
                }
            }
            return _highlightedSummary;
        }
        set { _highlightedSummary = value; }
    }

    /// <summary>
    /// Gets or sets the highlighted authors.
    /// </summary>
    /// <value>The highlighted authors.</value>
    public string HighlightedAuthors
    {
        get
        {
            if (_highlightedAuthors == null || _highlightedAuthors.Length == 0)
            {
                return _book.Authors;
            }
            return _highlightedAuthors;
        }
        set { _highlightedAuthors = value; }
    }

    /// <summary>
    /// Gets or sets the highlighted publisher.
    /// </summary>
    /// <value>The highlighted publisher.</value>
    public string HighlightedPublisher
    {
        get
        {
            if (_highlightedPublisher == null || _highlightedPublisher.Length == 0)
            {
                return _book.Publisher;
            }
            return _highlightedPublisher;
        }
        set { _highlightedPublisher = value; }
    }
}


You’ll notice that the Highlighted… fields return the equivalent book field if the highlighted field does not exist. This just saves us having to check whether there is a highlighted term in each field when we’re building the search result list.

Our data provider will accept a single string consisting of the entered search-terms and return a list of BookSearchResult objects that match. Here is the code and I’ll then try and explain what it’s doing:

/// <summary>
/// Finds the books.
/// </summary>
/// <param name="query">The query.</param>
/// <returns></returns>
public override IList<IBookSearchResult> FindBooks(string query)
{
    IList<IBookSearchResult> results = new List<IBookSearchResult>();

    Analyzer analyzer = new SimpleAnalyzer();
    MultiFieldQueryParser parser = new MultiFieldQueryParser(
                                   new string[] { "Title", "Summary", "Authors", "Publisher"}, 
                                   analyzer);
    Query queryObj;

    try
    {
        queryObj = parser.Parse(query);
    }
    catch (ParseException)
    {
        // TODO: provide feedback to user on failed search expressions
        return results;
    }

    IFullTextSession session = (IFullTextSession) NHibernateHelper.GetCurrentSession();
    IQuery nhQuery = session.CreateFullTextQuery(queryObj, new Type[] {typeof (Book) } );

    IList<IBook> books = nhQuery.List<IBook>();

    IndexReader indexReader = IndexReader.Open(SearchFactory.GetSearchFactory(session)
                                         .GetDirectoryProvider(typeof (Book)).Directory);
    Query simplifiedQuery = queryObj.Rewrite(indexReader);

    SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<b class='term'>", "</b>");

    Highlighter hTitle = GetHighlighter(simplifiedQuery, formatter, "Title", 100);
    Highlighter hSummary = GetHighlighter(simplifiedQuery, formatter, "Summary", 200);
    Highlighter hAuthors = GetHighlighter(simplifiedQuery, formatter, "Authors", 100);
    Highlighter hPublisher = GetHighlighter(simplifiedQuery, formatter, "Publisher", 100);

    foreach(IBook book in books)
    {
        IBookSearchResult result = new BookSearchResult(book);

        TokenStream tsTitle = analyzer.TokenStream("Title", 
                              new System.IO.StringReader(book.Title ?? string.Empty));
        result.HighlightedTitle = hTitle.GetBestFragment(tsTitle, book.Title);

        TokenStream tsAuthors = analyzer.TokenStream("Authors",
                              new System.IO.StringReader(book.Authors ?? string.Empty));
        result.HighlightedAuthors = hAuthors.GetBestFragment(tsAuthors, book.Authors);

        TokenStream tsPublisher = analyzer.TokenStream("Publisher", 
                              new System.IO.StringReader(book.Publisher ?? string.Empty));
        result.HighlightedPublisher = hPublisher.GetBestFragment(tsPublisher, book.Publisher);

        TokenStream tsSummary = analyzer.TokenStream("Summary", 
                              new System.IO.StringReader(book.Summary ?? string.Empty));
        result.HighlightedSummary = hSummary.GetBestFragments(tsSummary, 
                                    book.Summary, 3, " ... <br /><br /> ... ");

        results.Add(result);
    }

    return results;
}

/// <summary>
/// Gets the highlighter for the given field.
/// </summary>
/// <param name="query">The query.</param>
/// <param name="formatter">The formatter.</param>
/// <param name="field">The field.</param>
/// <param name="fragmentSize">Size of the fragment.</param>
/// <returns></returns>
private static Highlighter GetHighlighter(Query query, Formatter formatter,
                                          string field, int fragmentSize)
{
    // create a new query to contain the terms
    BooleanQuery termsQuery = new BooleanQuery();

    // extract terms for this field only
    WeightedTerm[] terms = QueryTermExtractor.GetTerms(query, true, field);
    foreach (WeightedTerm term in terms)
    {
        // create new term query and add to list
        TermQuery termQuery = new TermQuery(new Term(field, term.GetTerm()));
        termsQuery.Add(termQuery, BooleanClause.Occur.SHOULD);
    }

    // create query scorer based on term queries (field specific)
    QueryScorer scorer = new QueryScorer(termsQuery);

    Highlighter highlighter = new Highlighter(formatter, scorer);
    highlighter.SetTextFragmenter(new SimpleFragmenter(fragmentSize));

    return highlighter;
}


First, we parse the user-entered query string, indicating that we want to match on the fields Title, Summary, Authors and Publisher using the MultiFieldQueryParser. This turns the user-entered search expression into Lucene-specific instructions. Most users when searching will enter a simple expression containing the words or phrase that they want to find. If the search term ‘XML’ is entered, for example, Lucene will convert this into the expression “Title:XML Summary:XML Authors:XML Publisher:XML” which effectively means “find any record where ‘XML’ exists in any of the fields”.

The user can enter specific instructions directly such as “Title:Architecture Authors:Fowler” which means “Find any books that have ‘Architecture’ in the Title field or ‘Fowler’ in the Authors field”. Boolean expressions can be used to control this further allowing “(Title:Architecture) AND (Authors:Fowler)” to find any books titled ‘Architecture’ authored by ‘Fowler’. When specific searches like this have been entered then the MultiFieldQueryParser doesn’t expand the search to include all fields (except for un-field-prefixed words and phrases).

Incidentally, in the original Book class we included attributes to control the indexing, such as the [Boost] attribute on the Title. This boosts the relevance of searches on certain fields, so a search for ‘XML’ in the Title and Summary of a document will rank books with ‘XML’ in the Title higher than books that only have ‘XML’ in the summary – they are more likely to be what the user is searching for in this case.

Lucene does provide many other ways to define a query but this is simple and easy for this example.

Once we have our Lucene query object we use this to create an NHibernate.Search full-text query to return Book objects. This is where NHibernate and Lucene meet (from a querying point of view). It is possible to combine full-text-queries of Lucene with NHibernate queries of the database – NHibernate.Search handles the searching and returns the relevant objects.

So, we now have a list of Book objects just the same as if it had come directly from NHibernate except that the results are in order based on the rank provided by the Lucene search.

Now, we’ll use another part of Lucene to highlight the matches. This is done using the SimpleHTMLFormatter, QueryScorer and Highlighter objects which combined allow us to get a fragment for each field with the search terms highlighted.

Note that the SimpleHtmlFormatter class is not in the main Lucene.Net.dll assembly but instead in a separate contrib assembly called Highlighter.Net.dll – there are also some other interesting utilities worth exploring in the contrib folder of the Lucene.NET distribution. Remember in Part 1 I mentioned that I had problems with assembly references and different versions of Lucene.Net.dll being used by NHibernate.Search so if you have problems building the solution after adding references to these contrib assemblies, consider building NHibernate.Search making sure that it references the same Lucene.Net.dll as the Lucene contrib assemblies were built against.

The Highlighter object for each field has to be based on the query terms for that field only, so the original query is re-written and split up so that only the terms searched for in that field are used. This isn’t strictly necessary, but I think it makes more sense: when you search for ‘Microsoft’ in the Title of a book, occurrences of ‘Microsoft’ in the Summary or Publisher fields are not highlighted, and the highlighted results then show clearly which found terms influenced the results. I have split this functionality into a separate GetHighlighter() method.

For example, without doing this a search for ‘Title:Microsoft’ incorrectly highlights the occurrences of ‘Microsoft’ found within the Author, Publisher and Summary fields, even though they did not really contribute to the Book being included in the results or its rank within them:

[Image: highlight_wrong]

By creating the proper Highlighter for each field, based on the terms used to search it, the search results can be shown correctly without highlighting the un-searched fields / terms:

[Image: highlight_correct]

Also, note that the fragments produced for the Summary are different – if separate terms are used for the Title and Summary then having the Title terms highlighted in the Summary would possibly produce incorrect or sub-standard fragments.

Having built our Highlighters we can then iterate over the results creating a BookSearchResult to wrap each book in the result set. The same analyzer used in the initial query is then used to get a TokenStream for each field which the Highlighter instance needs to create the highlighted fragment from.

For the Title, Authors and Publisher fields we return a single fragment, which will normally be the field itself with the highlighted search terms wrapped in <b class=’term’> … </b> Html tags (courtesy of the SimpleHtmlFormatter class). The highlighted Summary is set to the best 3 fragments separated by ‘… <br /><br /> … ’. However big the summary is, this ensures that the results contain a similar sized chunk of text with the best fragments shown (those containing the most highlighted terms).

Here is an example of the results for ‘Title:Software Summary:Requirements Authors:Steve’ after formatting and CSS applied to show the highlighted terms in yellow:

[Image: search_results]

Lucene.NET can do a lot more than I’ve shown here. I found the best resource for learning about how to use it is the ‘Lucene in Action’ book:

Lucene in Action (In Action series) by Otis Gospodnetic and Erik Hatcher

Note that this covers the Java version but applies equally well to the .NET port which is practically identical.

I hope this has been useful. In Part 3 I’ll try and demonstrate using the Lucene.NET index to find similar items based on the frequency of shared terms. This can be used to provide ‘other books you may like’ or ‘blog posts like this one’ type functionality.

NHibernate.Search using Lucene.NET Full Text Index (1)

Ayende added NHibernate.Search last year, but I've never seen a great deal of documentation or examples around it, so hopefully this post will help others get started with it.

Basically, this addition to NHibernate brings two of the best open source libraries together – NHibernate as the Object Relational Mapper that persists your objects to a database and Lucene.NET which provides full-text indexing and query support.

So how do you use it?

The first problem you will run into is actually finding it. Unfortunately the NHibernate release does not include it in the binaries, although it is there in the source. Download the latest version of the NHibernate source (1.2.1 GA as of writing) and compile it to produce the NHibernate.Search.dll assembly.

Before you do this though, you may want to also download the latest Lucene.NET release (2.0.004) and replace the Lucene.NET.dll assembly in the NHibernate lib\net2.0 folder (I'm assuming you are using .NET 2.0). While the bundled Lucene.NET library has the same version number and did work fine, the file sizes differ, and I ran into some problems when trying to use some of the extra Lucene.NET assemblies for hit-highlighting and similarity matching.

The first step is of course to add a reference to NHibernate.Search.dll to your Visual Studio.NET Project.

Next, you need to add some additional properties to the session-factory element of the NHibernate configuration section (normally stored in your web.config file):

<property name="hibernate.search.default.directory_provider">NHibernate.Search.Storage.FSDirectoryProvider, NHibernate.Search</property>
<property name="hibernate.search.default.indexBase">~/Index</property>

If you've used Lucene.NET much you will know that it has the concept of different directory providers for storing the index, such as RAM or FS (File System). The entries above indicate that we want the Lucene index to be stored on the file system and located in the /Index folder of the website (it could of course be outside the website mapped folder). It's well worth reading a book such as Lucene in Action to get a good idea of how Lucene works and what it can do (it's for the Java version but is still excellent for learning the .NET implementation).

The next step requires that you decorate your C# class with some attributes to control the indexing operation. Personally, I don't like this as it means I need to start referencing NHibernate and Lucene assemblies from my otherwise nice, clean POCO (Plain Old CLR/C# Classes) project. It would have been much nicer IMO if this information could have been put in the NHibernate .hbm.xml mapping files but it's a small price to pay and some people already use the attribute approach for NHibernate anyway.

Here is an example of a Book class for a library application with the additional attributes:

[Indexed(Index = "Book")]
public class Book : IBook
{
    private Guid _id;
    private string _title;
    private string _summary;
    private string _summaryHtml;
    private string _authors;
    private string _url;
    private string _smallImageUrl;
    private string _mediumImageUrl;
    private string _largeImageUrl;
    private string _isbn;
    private string _published;
    private string _publisher;
    private string _binding;

    [DocumentId]
    [FieldBridge(typeof(GuidBridge))]
    public Guid Id
    {
        get { return _id; }
        set { _id = value; }
    }

    [Field(Index.Tokenized, Store = Store.No)]
    [Analyzer(typeof(StandardAnalyzer))]
    [Boost(2)]
    public string Title
    {
        get { return _title; }
        set { _title = value; }
    }

    [Field(Index.Tokenized, Store = Store.No)]
    [Analyzer(typeof(StandardAnalyzer))]
    public string Summary
    {
        get { return _summary; }
        set { _summary = value; }
    }

    public string SummaryHtml
    {
        get
        {
            if (_summaryHtml == null || _summaryHtml.Length == 0)
            {
                return _summary;
            }
            return _summaryHtml;
        }
        set { _summaryHtml = value; }
    }

    [Field(Index.Tokenized, Store = Store.No)]
    [Analyzer(typeof(StandardAnalyzer))]
    public string Authors
    {
        get { return _authors; }
        set { _authors = value; }
    }

    public string Url
    {
        get { return _url; }
        set { _url = value; }
    }

    public string SmallImageUrl
    {
        get { return _smallImageUrl; }
        set { _smallImageUrl = value; }
    }

    public string MediumImageUrl
    {
        get { return _mediumImageUrl; }
        set { _mediumImageUrl = value; }
    }

    public string LargeImageUrl
    {
        get { return _largeImageUrl; }
        set { _largeImageUrl = value; }
    }

    [Field(Index.UnTokenized, Store = Store.Yes)]
    public string Isbn
    {
        get { return _isbn; }
        set { _isbn = value; }
    }

    [Field(Index.UnTokenized, Store = Store.No)]
    public string Published
    {
        get { return _published; }
        set { _published = value; }
    }

    [Field(Index.Tokenized, Store = Store.No)]
    [Analyzer(typeof(StandardAnalyzer))]
    public string Publisher
    {
        get { return _publisher; }
        set { _publisher = value; }
    }

    public string Binding
    {
        get { return _binding; }
        set { _binding = value; }
    }
}

Now we're ready to start using it from NHibernate. To do this we need to create a FullTextSession and use this instead of the regular NHibernate Session (which it wraps / extends):

ISession session = sessionFactory.OpenSession(new SearchInterceptor());
IFullTextSession fullTextSession = Search.CreateFullTextSession(session);

And that's it. You can use the IFullTextSession in place of the regular ISession (even casting it for places where you are just doing normal NHibernate operations). All the magic happens inside NHibernate.Search – when you add, update or delete records the 'documents' in the Lucene index are automatically updated which provides you with an excellent Full Text index without a Windows Service in sight!
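For example, persisting a new book through the full-text session indexes it as part of the same operation (a sketch using the Book class above):

// save a book as normal; NHibernate.Search updates the Lucene index too
Book book = new Book();
book.Id = Guid.NewGuid();
book.Title = "Lucene in Action";
book.Authors = "Otis Gospodnetic, Erik Hatcher";

using (ITransaction tx = fullTextSession.BeginTransaction())
{
    fullTextSession.Save(book);
    tx.Commit(); // the Lucene 'document' is written as part of the same unit of work
}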

You can check that it's working by looking in the Index folder – there should be a 'Book' folder containing the Lucene index files (with CFS extensions).

In the next post I'll demonstrate using the index to do some queries, including hit-highlighting for presenting the results, but for now you may want to download and try Luke – a Java program to browse Lucene index catalogs (the file format is identical between the two implementations).

Running .NET Apps in 32-bit mode on 64-bit Windows

The normal behavior for .NET 2.0 applications compiled with the default 'Any CPU' platform is to run as 32-bit on x86 (32-bit) Windows and as 64-bit on x64 (64-bit) Windows.

Occasionally, some apps won't run correctly – I've recently run into this with CCNetConfig (a CruiseControl.NET Configuration tool) and have seen it before with other tools. Another obscure scenario where it shows up is if you try to use the JET OleDB driver, which will fail in 64-bit mode because there isn't a 64-bit version (it has to be 32-bit).

Rather than have to recompile the app or even worse, run a 32-bit Virtual Machine, there is an easy way to force .NET to run an app in 32-bit mode using the 'CorFlags.exe' tool.

Depending on your system this may be installed in different places. I've seen it in different places on XP64 and Vista X64:

  • C:\Program Files\Microsoft SDKs\Windows\v6.0\Bin\x64\CorFlags.exe
  • C:\Program Files (x86)\Microsoft Visual Studio 8\SDK\v2.0\Bin\CorFlags.exe

Run this from the command line with the path / filename of the app you want to change and the switch /32BIT+ to turn on 32-bit mode, e.g.:

   CorFlags.exe TheApp.exe /32BIT+ 

If that fixes the problem then you know that it is a 64-bit issue. You can re-enable 64-bit operation for the app by turning off the 32-bit switch with the parameter /32BIT-, e.g.:

   CorFlags.exe TheApp.exe /32BIT- 
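
You can also run CorFlags with just the assembly path (no switch) to inspect the current settings – the output includes a 32BIT flag showing whether the bit is set:

   CorFlags.exe TheApp.exe 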

Voila … control over 32-bit and 64-bit execution without doing a recompile! I'm not 100% certain but I think that this switch sets the same flag that the 'x86' and 'Any CPU' targets set in Visual Studio.