You know how sometimes, as you get close to the end of a journey, something short starts to feel long? This will not be one of those cases, because the last leg of this journey covers the Dispatcher, and there is a lot to cover in a short space. For those of you who may have stumbled upon this article without reading the others in this series, let me first say welcome. You are joining us at the tail end of our journey: a four-part series about the Adobe Experience Manager infrastructure. To recap what you may have missed: the first article covers some AEM terms and basics, the second covers the Author server, and the third covers the Publish server. If you have no prior AEM knowledge, it may be worthwhile to give those a look. It’s OK, we will be right here when you get back. Civilization is near, my friends. So sit down, strap in, and hold on; we’re taking our final ride.
What is the Dispatcher?
To start, the Dispatcher, unlike the Author and Publish servers, is not a Java application. It is actually an httpd module, and it’s proprietary. The Dispatcher pulls double duty in the AEM stack: it serves as a caching server and as something similar to a Web Application Firewall (WAF). When it comes down to it, the Dispatcher is similar to other Apache-run web servers; you can use Apache mods to change how traffic is handled and how the static content is served. The part that is special is the Dispatcher handler, which is eventually passed the request and then goes to work. The Dispatcher first checks the cache root, which should be the same as the DocumentRoot (more on this later). If the requested file is not cached or has been marked as invalid, the Dispatcher connects to a Publish instance and passes along the request to be rendered. Once this is done, the Dispatcher takes the rendered asset, saves it to the cache root, and then serves this content to the end user.
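To make that flow concrete, here is a minimal Python sketch of the check-cache, render, store, serve cycle. It is a toy model and not how the module is actually written: the in-memory dictionary stands in for the cache root on disk, and handle_request and fake_publish are names I made up for illustration.

```python
from typing import Callable, Dict, List, Set


def handle_request(path: str,
                   cache: Dict[str, str],
                   invalid: Set[str],
                   render: Callable[[str], str]) -> str:
    """Serve from cache when possible, otherwise render on Publish and cache it."""
    if path in cache and path not in invalid:
        return cache[path]       # cache hit: Publish is never contacted
    body = render(path)          # not cached, or marked invalid: ask Publish
    cache[path] = body           # save the rendered output (the "cache root")
    invalid.discard(path)        # the fresh copy is valid again
    return body


calls: List[str] = []


def fake_publish(path: str) -> str:
    """Hypothetical stand-in for a Publish instance; counts how often it is hit."""
    calls.append(path)
    return f"<html>{path} v{len(calls)}</html>"


cache: Dict[str, str] = {}
invalid: Set[str] = set()
page = "/content/mysite/en.html"

first = handle_request(page, cache, invalid, fake_publish)   # rendered by Publish
second = handle_request(page, cache, invalid, fake_publish)  # served from cache
invalid.add(page)                                            # simulate an invalidation
third = handle_request(page, cache, invalid, fake_publish)   # rendered again
print(len(calls), first == second, third)
```

The point to notice is that Publish is only contacted on the first request and again after the page has been marked invalid.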
At a high level, how the Dispatcher works is fairly simple to understand; however, once you start digging into its configuration and customization it may seem a little overwhelming. Adobe has quite a bit of documentation on the Dispatcher, including explanations and examples of its configuration. If you are unsure what something does, this documentation may shed some light on it. I am dropping this disclaimer here so I don’t have to state it multiple times below: the configurations I show you are how we at Axis41 do things, and your situation or needs may vary. With that said, hopefully this will get you on the right path to success. Like the other articles, we are going to cover a base-level configuration as well as some added tips.
Dispatcher Install
As the Dispatcher is a module, some installation is required. As a general rule you should try to use the latest version of the Dispatcher. The Dispatcher comes packaged as an archive that is available for download from Adobe. As we use a Linux derivative of RHEL, I am going to use that as my example; much of the same information will apply even if you are running a different platform. The way the Dispatcher is packaged, you can drop the tar.gz file into the httpd directory and extract it from that location. This will add Dispatcher-specific files to your conf directory and a few files at the httpd directory level. You can then move the dispatcher .so file into the modules directory. For ease of use I would then recommend creating a symlink named mod_dispatcher.so that points to the versioned .so. Once this is done you can look at the files in the conf directory. One of those files, httpd.conf.disp2, is an example httpd.conf file that contains the Dispatcher IfModule configuration.
You have a few options here. The first option is my strong recommendation, as it can save you a lot of frustration later on. I will also mention a couple of other ways this could be handled, but seriously, just choose number one.
- Create a custom conf file with just the Dispatcher LoadModule and IfModule inside and then have Apache include your custom conf file. This is fairly easy to do, and has the added benefit that you don’t need to go mucking about in your httpd.conf file.
- Copy out the LoadModule and IfModule lines from the above mentioned httpd.conf.disp2 example file and move those lines into your existing httpd.conf file.
- Replace your existing httpd.conf with the example file. The only time I would even consider this is with a vanilla Apache installation, and even then I would still choose option one.
Once you have decided how you are going to implement the Dispatcher configurations, you can start customizing. To start you off, I would highly recommend setting DispatcherUseProcessedURL to “On”. This allows you to modify the request, such as with mod_rewrite, and then have the Dispatcher use the updated request.
Dispatcher.any
If you thought installing the Dispatcher and setting up the IfModule configuration was all that is needed, then I have a surprise for you. There is another, more comprehensive configuration file called dispatcher.any, and no, “any” is not a typo. This file is mainly responsible for how the Dispatcher will behave. Inside this file there are sections that are each responsible for a different part of this behavior. Again, in the interest of time, I am using the default dispatcher.any file, which is covered in Adobe’s documentation. What I am going to cover is some of the changes you might want to make, as well as some of the configurations to pay specific attention to. The /website section is normally where you will start to configure things. You can even set up multiple websites, each using the /virtualhosts section contained within to tie specific domains to specific Dispatcher configurations; just be aware that once you start down that road things can become tricky very quickly. We are going to stick with one website using a wildcard for all domains. The /clientheaders section uses a wildcard by default, and I would recommend replacing it with the specific headers you expect the Dispatcher to see.
/renders
This section sets up the Publish backend that the Dispatcher will connect to. The Dispatcher can be set up with connections to more than one Publish backend, which allows it to load balance between the Publish servers. If you are only using one Dispatcher, this may be the way to go. However, if you use two Dispatchers and two Publish servers, I would not recommend connecting each Dispatcher to both Publish servers, and here is why. In theory, having cross talk between two Publish servers and two Dispatchers seems like it would be a good thing. In practice, you can run into an issue where a Publish server tells the Dispatcher to invalidate a page, and then that Dispatcher, because of load balancing, requests the page from the other Publish server, which has not yet finished ingesting the content from Author. This results in either out-of-date content or content that is cached with errors on the page. In essence, you can end up with bad cache on one Dispatcher, and to further complicate the matter, load balancing can make it difficult to diagnose which Publish server served up the bad content. This is why we recommend a one-to-one Publish to Dispatcher configuration.
/filter
This section is where the WAF behavior comes into play. It is fairly straightforward if you look at the rules in the default file. The concept is to deny everything first and then whitelist only the requests that make sense for your project. This section is also set up so that the last matching rule wins. This means that if you place broad rules near the bottom, those rules may expose or open up paths that you previously meant to restrict. These filters can be expanded to address each part of the HTTP request, allowing you to move away from globs and wildcards to more secure multi-part filter rules. For example, using { /type "allow" /method "GET" /url "/content*" } tightens security when whitelisting the content path by only allowing GET requests to those paths. You can use a similar filter to only allow POST to a specific servlet: { /type "allow" /method "POST" /url "/bin/customservlet" }. If you are using a version earlier than 4.1.9, these expanded filters are not available and I would strongly recommend upgrading to a more recent version of the Dispatcher.
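If the "last rule wins" idea is hard to picture, this small Python model may help. It is only a sketch under my own assumptions: the rules are plain dictionaries rather than real dispatcher.any syntax, is_allowed is a made-up helper, and Python's fnmatch is a rough stand-in for the Dispatcher's own pattern matching.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

Rule = Dict[str, str]


def is_allowed(rules: List[Rule], method: str, url: str) -> bool:
    """Evaluate every rule in order; the last rule that matches decides."""
    decision = "deny"  # nothing matched yet: treat as denied
    for rule in rules:
        if "method" in rule and rule["method"] != method:
            continue   # rule is restricted to a different HTTP method
        if not fnmatchcase(url, rule.get("url", "*")):
            continue   # URL pattern does not match
        decision = rule["type"]
    return decision == "allow"


rules: List[Rule] = [
    {"type": "deny", "url": "*"},                                      # deny everything first
    {"type": "allow", "method": "GET", "url": "/content*"},            # GET only on content
    {"type": "allow", "method": "POST", "url": "/bin/customservlet"},  # POST to one servlet
]

print(is_allowed(rules, "GET", "/content/mysite/en.html"))   # allowed by the second rule
print(is_allowed(rules, "POST", "/content/mysite/en.html"))  # only the deny rule matches

# A broad rule added at the bottom quietly reopens everything under /bin.
too_broad = rules + [{"type": "allow", "url": "/bin*"}]
print(is_allowed(too_broad, "GET", "/bin/querybuilder.json"))
```

The last call is the trap described above: one broad allow at the end overrides the careful rules before it.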
Caching
The /cache section and caching on the Dispatcher in general really deserve to be singled out as they are one of the major functions of the dispatcher. Let’s start by talking about caching in general and then we will talk about the /cache section of the dispatcher.any file. First it’s important to point out that anything that is not cached or not able to be cached is sent back to the Publish server each time it is requested. As a general rule you want the Dispatcher to cache as much as possible (markup as well as content). This not only removes excess load from the Publish servers, it also greatly speeds up the delivery of content to the end user. The Dispatcher by default will try and cache everything with a few important exceptions.
- Missing extensions – If the requested path is missing an extension such as .html
- Method Type – If the method is not a GET request
- Errors – If the http response from the Publish server contains an error code, it will not be cached
- Http Header – If response.setHeader("Dispatcher", "no-cache"); is used
- Authorization Headers – If the request contains authorization headers
- Query strings – If the request contains a query string
- Rules – If the request does not match any /rules defined under the /cache section of the dispatcher.any file
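The list above can be read as a checklist that every request and response pair has to pass. Here is a rough Python sketch of that checklist. The Exchange class and reasons_not_cached function are hypothetical names of mine, header names are assumed to be lower-cased already, and the real module's checks are certainly more involved than this.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from posixpath import basename
from typing import Dict, List, Tuple


@dataclass
class Exchange:
    """One request/response pair as the cache sees it (hypothetical model)."""
    method: str
    path: str
    query: str = ""
    request_headers: Dict[str, str] = field(default_factory=dict)
    status: int = 200
    response_headers: Dict[str, str] = field(default_factory=dict)


def reasons_not_cached(x: Exchange, cache_rules: List[Tuple[str, str]]) -> List[str]:
    """Return every reason from the list above that applies; empty means cacheable."""
    reasons = []
    if "." not in basename(x.path):
        reasons.append("missing extension")
    if x.method != "GET":
        reasons.append("method is not GET")
    if x.status >= 400:
        reasons.append("error status from Publish")
    if x.response_headers.get("dispatcher") == "no-cache":
        reasons.append("Dispatcher: no-cache header")
    if "authorization" in x.request_headers:
        reasons.append("authorization header present")
    if x.query:
        reasons.append("query string present")
    decision = "deny"
    for pattern, rule_type in cache_rules:  # last matching /rules entry wins
        if fnmatchcase(x.path, pattern):
            decision = rule_type
    if decision != "allow":
        reasons.append("no allowing /rules entry")
    return reasons


default_rules = [("*", "allow")]  # mirrors the wildcard in the default file
print(reasons_not_cached(Exchange("GET", "/content/mysite/en.html"), default_rules))
print(reasons_not_cached(Exchange("POST", "/content/mysite/en", query="a=1"), default_rules))
```

An empty list means nothing in the checklist objected, so the response would be written to the cache.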
With the above list a few of these can be modified based on settings within the /cache section, which I will cover now. I also cover a couple other important properties to pay attention to in this section.
/allowAuthorized
This property allows the Dispatcher to cache content even if authorization is being used in the headers. Most of the time you would not want this behavior, as you would not want content that requires authentication to be served to someone who is not authorized. If you do intend to use authorization headers but would still like the benefits of cached content, there is the option of using /auth_checker, a configuration that may not have been included with the default file.
/ignoreUrlParams
This property can allow you to cache requests that carry query strings. It tells the Dispatcher which query parameters to ignore when it decides whether a request can be cached; when every parameter on a request is ignored, the page is cached and served as though the query string were not there, regardless of the parameter values. Even though this option is available to you, I would recommend avoiding query strings as much as possible and instead using a selector, such as page.string.html. A selector still lets you vary what is cached when a different string is provided. If you end up using a query string only for something like analytics tracking, which is fairly common, you can instead use something like mod_rewrite to leave the query in the URL the visitor sees but drop it from the request passed through to the Dispatcher.
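As I read Adobe's documentation, a parameter is ignored when a rule in this section allows it, and a request stays cacheable only if every parameter it carries is ignored. The Python sketch below models that reading. Treat it as an assumption to verify against your Dispatcher version; query_blocks_caching and the rule tuples are illustrative names, not real configuration syntax.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple
from urllib.parse import parse_qsl


def query_blocks_caching(query: str, ignore_rules: List[Tuple[str, str]]) -> bool:
    """True if at least one query parameter is NOT ignored, which prevents caching."""
    for name, _value in parse_qsl(query, keep_blank_values=True):
        ignored = False
        for pattern, rule_type in ignore_rules:  # last matching rule wins
            if fnmatchcase(name, pattern):
                ignored = (rule_type == "allow")
        if not ignored:
            return True
    return False


# Assumed setup: ignore analytics-style parameters, honor everything else.
ignore_rules = [("*", "deny"), ("utm_*", "allow")]

print(query_blocks_caching("utm_source=mail&utm_campaign=x", ignore_rules))  # all ignored
print(query_blocks_caching("utm_source=mail&page=2", ignore_rules))          # "page" counts
```

In the first call every parameter is ignored, so the page can be cached as if the query string were absent. In the second, the page parameter is not ignored, so the request goes back to Publish.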
/rules section
By default this section is set to cache everything with a wildcard. If for some reason there is a specific resource you do not want cached, you can create a rule under this section that denies caching for it.
/docroot
This property sets where the dispatcher will cache its assets on disk. This should end up being set to the same path as the Apache DocumentRoot. This sets Apache to treat the cache created by the Dispatcher as normal static assets of a website.
/statfileslevel
This property sets how deep in the directory structure the .stat files should be created, with 0 being the /docroot cache path.
/serveStaleOnError
This allows the Dispatcher to serve files from its cache even if they have been invalidated in the event that the Dispatcher is not able to reach the Publish backend.
Flushing
If you remember from the other articles, we have talked a little bit about flushing and the different types of flushing. As the following configurations go hand in hand, I have decided to include them in this section. We will start with the easier concept, which is /allowedClients. This configuration is near the end of the dispatcher.any file, and it controls who, or what hosts/IPs, are allowed to send flush requests to the Dispatcher. By default this section comes commented out, which, if left unchanged, would allow anyone to send a flush request and clear your Dispatcher’s cache. That could be used as part of a denial of service attack. I strongly recommend you set this section to deny all and then only allow the hosts or IPs you trust, such as the Publish and Author servers.
Now we are going to talk about .stat files. In my opinion this method of cache invalidation is not very efficient, and it can be a little difficult to wrap your head around, so I will try to keep it as simple as I can. First off, I mentioned /statfileslevel just above, which determines how deep the .stat files are created. The .stat file itself is a zero-byte file that the Dispatcher uses to help it determine whether cached files are invalid. When a file is requested, the Dispatcher checks the last modified date of the .stat file as well as that of the cached file; if the .stat file is newer than the cached file, the Dispatcher knows the cached file is stale. The Dispatcher will then check the /invalidate section, and anything stale that also matches a pattern in that section is considered invalid and requested again from the Publish server. I will use an example to help explain what I mean.
Let’s say you have a site under the path /content/mysite/en and a content author updates the contact.html page located at /content/mysite/en/about-us/contact.html. When that page is activated, the Dispatcher receives an invalidation for that same path located in the docroot, and this file is automatically considered invalid. Next, for every level the Dispatcher traverses to reach the contact.html page, it touches a .stat file, up to the number you define under /statfileslevel. So if you have that property set to 4, remembering that 0 is the root, it would touch the following .stat files:
/.stat
/content/.stat
/content/mysite/.stat
/content/mysite/en/.stat
/content/mysite/en/about-us/.stat
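If you want to work out the list for your own paths, the rule described above can be sketched in a few lines of Python. This is just my model of the behavior described in this article, with a made-up function name; it does not touch any real files.

```python
from typing import List


def touched_stat_files(invalidated_path: str, statfileslevel: int) -> List[str]:
    """List the .stat files touched when a page is invalidated (level 0 is the docroot)."""
    # Directories that contain the page, e.g. ["content", "mysite", "en", "about-us"]
    dirs = [part for part in invalidated_path.split("/") if part][:-1]
    stats = ["/.stat"]  # level 0: the docroot itself
    deepest = min(statfileslevel, len(dirs))
    for level in range(1, deepest + 1):
        stats.append("/" + "/".join(dirs[:level]) + "/.stat")
    return stats


for stat in touched_stat_files("/content/mysite/en/about-us/contact.html", 4):
    print(stat)
```

Running it with the contact.html example and a level of 4 prints the same five paths as the list above. Lowering the level to 2 stops at /content/mysite/.stat.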
Looking at the default /invalidate section rules we see { /glob "*.html" /type "allow" }. This means any .html file is allowed to be flushed, so any .html file located in the same directory where a .stat file was touched is now considered invalid by the Dispatcher. In our example you can see this would be the .html pages under /content/mysite/en/ that are mainly affected. If this only affected those files that would be one thing; however, there are more files that will also be considered invalid. If the Dispatcher doesn’t have a .stat file at the same level as the page that was requested, it will traverse up the directory structure to find the nearest .stat file.
So, keeping with the example above, if the next request was for /content/mysite/en/about-us/investors/how-to-invest.html, the Dispatcher would see there is no .stat file under /content/mysite/en/about-us/investors/, as that would be at level 5. It would then look for the nearest .stat file available, which in our case is /content/mysite/en/about-us/.stat, and use that as its .stat file when determining whether it should serve how-to-invest.html directly from cache or request it from the Publish server again before serving. In our example this would mean the Dispatcher now also considers how-to-invest.html invalid. The same goes for any other document that lives in cache under the /content/mysite/en/about-us directory. No matter how deep they are, they will always refer back to /content/mysite/en/about-us/.stat to see if they are invalid or not.
I should also mention the flip side. If you had a page /content/mysite/en/blog/post1.html, for example, it would not be invalid, as its nearest .stat file would be /content/mysite/en/blog/.stat, which was not touched by updating /content/mysite/en/about-us/contact.html. You should put serious thought into your configuration when it comes to cache invalidation. Depending on your site structure, /statfileslevel, and /invalidate rules, you could end up wiping out large sections of your cache by accident with a simple page update.
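To tie the whole example together, here is a small Python model of the lookup: walk up to the nearest .stat file, compare timestamps, then apply the /invalidate glob. The timestamps are invented numbers, the dictionary stands in for the .stat files on disk, and the function names are mine, so treat it as a sketch of the behavior described in this article rather than the module's real logic.

```python
from fnmatch import fnmatchcase
from posixpath import dirname
from typing import Dict, Optional


def nearest_stat_dir(path: str, stat_mtimes: Dict[str, float]) -> Optional[str]:
    """Walk up from the file's directory until a directory holding a .stat file is found."""
    directory = dirname(path)
    while True:
        if directory in stat_mtimes:
            return directory
        if directory == "/":
            return None  # reached the docroot without finding a .stat file
        directory = dirname(directory)


def is_invalid(path: str, cached_mtime: float, stat_mtimes: Dict[str, float],
               invalidate_glob: str = "*.html") -> bool:
    """Invalid means: stale versus the nearest .stat AND matched by /invalidate."""
    stat_dir = nearest_stat_dir(path, stat_mtimes)
    if stat_dir is None:
        return False
    stale = stat_mtimes[stat_dir] > cached_mtime
    return stale and fnmatchcase(path, invalidate_glob)


# Directory -> mtime of its .stat file. 200 means touched by the contact.html
# update; the blog .stat (100) was left alone. Cached files were written at 150.
stat_mtimes = {
    "/": 200, "/content": 200, "/content/mysite": 200,
    "/content/mysite/en": 200, "/content/mysite/en/about-us": 200,
    "/content/mysite/en/blog": 100,
}

print(is_invalid("/content/mysite/en/about-us/investors/how-to-invest.html", 150, stat_mtimes))
print(is_invalid("/content/mysite/en/blog/post1.html", 150, stat_mtimes))
print(is_invalid("/content/mysite/en/about-us/logo.png", 150, stat_mtimes))
```

The investors page is invalid because it falls back to the about-us .stat file, the blog post is still valid, and the .png is stale but survives because the "*.html" glob does not match it.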
Apache Config
When it comes to the Apache configuration, I am not really going to cover much in the way of Apache itself. Suffice it to say that this is an Apache server that runs a custom Dispatcher handler for serving content. Like I mentioned above this means that most of the configurations you might be used to with Apache will still apply. What I am going to cover here is some configurations you can use to make the Dispatcher work a little more smoothly.
First off we use a vhost configuration block and override the Apache DocumentRoot property as well as set the Apache handler to allow the dispatcher module to take over for serving content.
DocumentRoot /opt/aem/dispatcher/docroot
SetHandler dispatcher-handler
When we covered the Publish server we talked about setting up etc/map rules; now we will talk about the Dispatcher side of that coin. For this we use mod_rewrite on the Dispatcher in combination with the etc/map rules to strip the /content/mysite/en path from URLs. To do this you could add something like the rules listed below to the same vhost configuration I talked about above. They add the content path back onto the request behind the scenes before the Dispatcher processes it.
RewriteRule ^/$ /content/mysite/en.html [PT,L]
RewriteRule ^/index.html$ / [R=301,L]
RewriteCond %{REQUEST_URI} !^/content
RewriteCond %{REQUEST_URI} !^/etc
RewriteCond %{REQUEST_URI} !^/bin
RewriteCond %{REQUEST_URI} !^/lib
RewriteCond %{REQUEST_URI} !^/apps
RewriteCond %{REQUEST_URI} !^/mysite
RewriteCond %{REQUEST_URI} !^/en
RewriteCond %{REQUEST_URI} !^/dam
RewriteCond %{REQUEST_URI} !^/assets
RewriteRule ^/(.+)$ /content/mysite/en/$1 [PT,L]
The other mod_rewrite you can do in the vhost file that might make your life easier is mapping your DAM paths. The above covers content paths, but anything under the DAM might have a path like /content/dam/mysite/. My suggestion is to set up etc/mapping to have this path changed to something like /assets/. Oh, so you noticed that I excluded /assets in my block above. Well, that was intended, as now you can set up a rewrite rule to handle DAM assets.
RewriteRule ^/assets/(.+)$ /content/dam/mysite/$1 [PT,L]
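If you would like to sanity check which short URLs end up where, the rules above can be approximated in plain Python. This is only a model of the rewrite logic as I have written it in this article, not something Apache runs; the rewrite function and its action labels are made up for illustration.

```python
import re
from typing import Tuple

# Prefixes excluded by the RewriteCond lines; these are prefix matches, not whole segments.
EXCLUDED_PREFIXES = ("/content", "/etc", "/bin", "/lib", "/apps",
                     "/mysite", "/en", "/dam", "/assets")


def rewrite(uri: str) -> Tuple[str, str]:
    """Return (action, target) for a request URI, mimicking the vhost rules."""
    if uri == "/":
        return ("pass-through", "/content/mysite/en.html")
    if uri == "/index.html":
        return ("redirect-301", "/")
    assets = re.match(r"^/assets/(.+)$", uri)
    if assets:
        return ("pass-through", "/content/dam/mysite/" + assets.group(1))
    if not uri.startswith(EXCLUDED_PREFIXES):
        return ("pass-through", "/content/mysite/en" + uri)
    return ("unchanged", uri)


for uri in ("/", "/about-us/contact.html", "/assets/logo.png",
            "/etc/designs/site.css", "/entertainment.html"):
    print(uri, "->", rewrite(uri))
```

One thing this makes visible: because the conditions are prefix matches, a short URL such as /entertainment.html is caught by the !^/en exclusion and is not rewritten. If you have top-level pages that begin with one of the excluded prefixes, you may want tighter patterns.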
In the immortal words of Porky Pig, “Th-th-th-that’s all folks!” I know that this can all seem like an insurmountable task when you first start out, but hopefully these articles will speed you on your way. Now you have earned yourself an “I survived the desert of AEM” t-shirt, if such a thing existed. Make sure you take a look at some of the other articles available on this site, and as always, if you have questions or comments feel free to email info@aempodcast.com. Thank you for taking this journey with me.