Search engine technology is complicated. It requires working knowledge of several disciplines: HTML, HTTP, DNS, SQL and, of course, the programming language of your choice. In this knowledgebase we try to provide the information you need to create your own (web) search engine. We touch on the basics of all the technologies involved and sometimes elaborate on topics in the articles section of this website.
Although this knowledgebase is based upon web search, many of the ideas and theories apply to other search engines as well. It's definitely worth your time to at least scroll through some of the subjects that you might not directly need. There may be some hidden gems in there that tickle your brain a little bit.
Cheers and good luck!
Pick your poison: programming languages
Which programming language best fits your purposes? Several considerations may affect your choice. For example, check what the language offers for reading the contents of HTML pages:
- Is there a DOM (Document Object Model) library?
Reading a page through its DOM is probably the most convenient way for developers to get meaningful information from a structured or semi-structured document. HTML is semi-structured, so for many applications you can treat it as structured text.
- How easy is it to work with XML-like structures and how strict is this library?
A good way to read HTML pages in a structured way is to use HTML Tidy to transform the document to XHTML. The wonderful thing about XHTML is that its syntax is more or less valid XML, and converting from HTML to XHTML is not a huge deal in most cases. You can then read the XHTML page with an XML library and use all sorts of tools that you would normally use for XML documents. However, if your XML library has a zero-tolerance policy for document errors, you're in trouble, so always check its strictness as well.
- Does the programming language support something like XPath?
We will describe XPath in much greater detail later in the knowledgebase, but basically it's a query language for selecting specific nodes while traversing the DOM.
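To make these three checks concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not a recommendation for a particular language or library: the sample documents and the helper names (`links_via_xpath`, `links_tolerant`, `is_well_formed`) are made up for this example, the XHTML sample deliberately omits the XHTML namespace that real converted pages usually carry, and `xml.etree.ElementTree` supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: well-formed XHTML (no namespace, to keep the query short).
XHTML = """<html><body>
<h1>Example</h1>
<p>See <a href="/about">about</a> and <a href="https://example.org/">example</a>.</p>
</body></html>"""

# Hypothetical sample: typical real-world HTML with unclosed <p> and <br> tags.
BROKEN = '<html><body><p>Unclosed paragraph<br><a href="/x">x</a></body></html>'


def is_well_formed(text: str) -> bool:
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def links_via_xpath(xhtml: str) -> list[str]:
    """Select every <a> element that has an href, using an XPath-style query."""
    root = ET.fromstring(xhtml)
    return [a.get("href") for a in root.findall(".//a[@href]")]


class LinkCollector(HTMLParser):
    """Tolerant, event-based parser that records href values of <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def links_tolerant(html: str) -> list[str]:
    """Extract links without requiring the markup to be well-formed."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print(links_via_xpath(XHTML))   # query on well-formed XHTML
    print(is_well_formed(BROKEN))   # the strict parser rejects sloppy HTML
    print(links_tolerant(BROKEN))   # the tolerant parser still finds the link
```

The strict parser is pleasant to query but refuses the broken page outright, which is exactly the strictness problem described above. The tolerant parser accepts it, at the price of a more manual, event-driven style. Whatever language you pick, it is worth checking that both approaches are available.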
And then there are these considerations:
- What is the performance of your chosen language?
- How much does your chosen programming language slow down development? (How easy is debugging?)
The programming language you choose may be one of the limiting factors down the road. Our advice, however, is simple: don't make performance your main focus. Good code will be fast on any platform and bad code will be slow on any platform. Choose something you're great at, because your knowledge will be challenged.
By platform, we mean the infrastructure as a whole. Of course, we devote a whole section of this website to infrastructure, but it's good to at least mention it beforehand. There are many types of platforms available to us, unless you prefer to reinvent the wheel.
You can start developing a simple mechanism on a single machine. Getting the algorithms down before investing in infrastructure is always a good idea. First check if your plans are realistic and feasible.