Adding Content Sources to SharePoint Search
SharePoint gives you a great amount of versatility for search that extends beyond SharePoint content. As you might expect, SharePoint search allows you to crawl and index local content, but that ability can be extended to non-SharePoint content as well. There are two main components we will look at here: content sources and crawl schedules. While I will concentrate here on the overall picture, I will go into greater detail for each element in later posts.
Content sources are the first component of any SharePoint search crawl and index. A content source describes the type of information search is to crawl and index for your queries. The sources available to SharePoint search include:
- SharePoint Sites – This includes local and remote SharePoint sites. When indexing SharePoint sites, the default content access account is set in the search administration page. You can set specific crawl accounts as needed in the search crawl rules. SharePoint permissions are stored as part of the crawl and index.
- Web Sites – This includes non-SharePoint web sites that can be accessed by the crawling server. As an example, you can crawl and index a company wiki resource. As with SharePoint sites, you can specify a different crawl account via SharePoint crawl rules. Since this is a resource foreign to the internal Active Directory, permissions are not carried over.
- File Shares – As you might expect from the name, this allows you to index Windows file shares. The key here is to ensure that the search default content access account has at least READ privileges to the file share and that the share is accessible from the crawling server. As an added bonus, any Active Directory account permissions are crawled and indexed.
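Before adding a file share as a content source, it can help to verify from the crawling server that the share path is actually reachable and readable. A minimal sketch in Python, assuming a hypothetical UNC path; on a real farm you would run this while logged in as (or impersonating) the default content access account:

```python
import os

def share_is_readable(path: str) -> bool:
    """Return True if the path exists and the current account can read it."""
    return os.path.isdir(path) and os.access(path, os.R_OK)

# Hypothetical UNC path; replace with your own file share.
print(share_is_readable(r"\\fileserver\departments"))
```

If this prints False, fix share permissions or network access before scheduling the crawl, rather than discovering the failure in the crawl log.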
- Exchange Public Folders – As the name implies, this gives you the ability to crawl and index the now-deprecated Exchange Public Folders. Since this is a resource that uses Active Directory, ensure the default content access account has at least READ privileges. Active Directory account permissions are crawled and indexed.
- Line of Business Data – This works with an external data source in conjunction with the Business Data Connectivity Service Application. For this to work, you must have at least one Business Data Connectivity service application correctly configured and enabled. Like the Web Sites content source, Active Directory permissions are not applied.
- Custom Repository – This is a content source that is custom-written for the SharePoint site. It is beyond the scope of this article.
I will go into each of these content sources more in-depth in later blog posts.
The second component is scheduling when each individual content source is to be crawled, using either full or incremental schedules.
- Full – As the name implies, a full crawl looks at every single item in the content source, adding or removing indexed items as appropriate. This is a more intensive operation and can run for an extended time, depending on the size of the content source and the load on the crawling server. While you can have overlapping full crawls for different sources, you do not want them to start at the same time. Stagger your start times, and mix high-load crawls with lower-load crawls to minimize server load and total run time. I would suggest running a full crawl once a day during off-hours; it should be able to complete a cycle within that window. If it cannot, you should be looking at dedicated search crawlers and/or search's host distribution rules.
- Incremental – This type of crawl looks at items that have been added or changed since the last crawl, crawling and indexing items as needed. It is typically much less intensive and has a shorter run time. A typical schedule is an incremental crawl about every two hours throughout the day. This interval can be shortened to meet business requirements, but doing so will cause a higher load on the crawler and can affect the overall performance of the SharePoint farm. If business needs dictate shorter intervals, you should be looking at dedicated search crawlers and/or search's host distribution rules.
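The staggering advice above is simple arithmetic: spread full-crawl start times across the off-hours window instead of letting them all fire at once. A quick sketch of that planning step in Python; the source names, window start, and 30-minute spacing are illustrative assumptions, not values SharePoint requires:

```python
from datetime import datetime, timedelta

def stagger_starts(sources, window_start, interval_minutes=30):
    """Assign each content source a start time offset within the off-hours window."""
    return {
        name: window_start + timedelta(minutes=i * interval_minutes)
        for i, name in enumerate(sources)
    }

# Hypothetical content sources; off-hours window starting at 01:00.
starts = stagger_starts(
    ["SharePoint Sites", "File Shares", "Company Wiki"],
    datetime(2024, 1, 1, 1, 0),
)
for name, when in starts.items():
    print(f"{name}: {when:%H:%M}")
```

You would then enter each computed start time into that source's full-crawl schedule, putting the heaviest sources first so they get the longest uninterrupted stretch.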
Let's look at the steps needed to add a new content source. In this example, I'll add a SharePoint content source to my SharePoint search, crawling the SharePoint resource at http://sharepointvaquero.com.
First, we need to determine which server, or servers, will actually be doing the crawl of the SharePoint content resources. We do this to ensure the crawling server has access to the SharePoint resource. In most small farms it will be the same server on which you installed the Central Administration components. Open the SharePoint Search Administration page at:
Central Admin>Application Management>Manage Service Applications>SharePoint Server Search>Search Application Topology
Note: your SharePoint Server Search service application may be named differently. Look at the service application Type column to find 'Search Service Application'.
From my example you can see that the crawling will be done from server JBSP. On that server I will check to make sure it can reach the appropriate resources that need to be crawled. In this case I modified the HOSTS file on that server to ensure the crawler service was looking at the specific IP address of the target SharePoint content resource. You may, or may not, have to do this depending on your network structure.
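Whether the crawler reaches the target via DNS or a HOSTS entry, you can verify the resolution from the crawling server before kicking off a crawl. A small Python sketch; the hostname and IP below are examples, not real values from this farm:

```python
import socket

def resolves_to(hostname: str, expected_ip: str) -> bool:
    """Return True if the hostname resolves (IPv4) to the expected address."""
    try:
        return socket.gethostbyname(hostname) == expected_ip
    except socket.gaierror:
        return False

# Example check: does the crawl target resolve where we expect?
print(resolves_to("sharepointvaquero.com", "203.0.113.10"))
```

If this returns False after a HOSTS file edit, double-check the entry and remember that some services cache name lookups.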
Next, let's go to the Content Sources section within SharePoint Search:
Central Admin>Application Management>Manage Service Applications>SharePoint Server Search>Crawling>Content Sources
Here, you can see I already have several search content sources set up. I select:
New Content Source>SharePoint Sites
Enter the appropriate information: name, URL of the SharePoint resource, specific settings, and priority. From here you can also enter the crawl schedules as you need them. Here I'm showing an incremental crawl schedule set to run every two hours.
Once that is complete, select 'OK' and the source will be added to the SharePoint content sources.
Make sure to watch search load and overall performance, and make changes as needed.