It’s no secret that Google is the search engine king these days; the alternatives are mediocre, but they’re trying their best. I personally use DuckDuckGo, but it combines several search engines rather than being one built from the ground up.
I’m more of a front-end designer/developer, but from what I’ve seen over the past year, something like Go or Rust is necessary for a sturdy foundation, and Postgres seems to be the most popular database at the moment (and a document-oriented database should be used). Specific technologies are probably beside the point, but I am interested in implementation details and the reasons why certain things are done.
Anyone have insight into such a thing? High-level or in the weeds, I’m all ears.
The Cliqz blog might be a good place to start:
https://0x65.dev/blog/2019-12-06/building-a-search-engine-from-scratch.html
SMH, of course. Forgot they started blogging about this recently.
There is also https://blog.algolia.com/inside-the-algolia-engine-part-1-indexing-vs-search/
Oh that’s nice.
Not an expert - just a person who uses search engines - high level thoughts only.
The biggest challenge for a modern search engine is “spam” filtering. I.e. removing sites whose sole purpose in life is to get clicks, and who do so by generating fake content that users aren’t interested in.
These sites come in a lot of different forms. Some of them clone content from other sites and load it up with ads and tracking. Some use code to generate completely artificial content, with the goal of looking human enough that search engines pick it up and show it for keywords. Some are sites with a mix of valuable legitimate content and (often user-submitted) spam - I’m looking at Quora in particular here. Lots of them are legitimate sites that are churning out uninteresting blog posts to try to convince Google to rank them higher. Etc.
I strongly believe that this is why Google has become worse, not better. Their old algorithms were too vulnerable to spam. Their new algorithms are still somewhat vulnerable, but also sacrifice a lot to avoid the worst of it. They can’t easily win because spammers are reacting to everything that they do.
I’m not sure how you handle this problem. If you could do significantly better than Google, it would be a huge competitive advantage, but it’s not clear that that is possible. Alternatively, you could just try to match Google’s anti-spam and compete on other dimensions (e.g. DuckDuckGo seems to be taking this route, with privacy).
If I were developing a search engine, I would probably place a huge emphasis on links that I can be reasonably sure are organic. Scrape sites like reddit, tumblr, twitter, wikipedia, github, stackoverflow, etc. that have reasonably human (rather than robot) communities, put emphasis on the links they provide, and try to piggyback on their anti-spam. For example, if a post/poster gets deleted for posting spam on reddit, take note and be more cautious with the pages/domains they linked to.
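Roughly the kind of bookkeeping I have in mind, as a sketch (every name and the scoring formula here is made up purely for illustration):

```go
package main

import "fmt"

// domainStats is a hypothetical per-domain tally of link signals
// scraped from communities like reddit or stackoverflow.
type domainStats struct {
	OrganicLinks int // links from posts that survived moderation
	SpamLinks    int // links from posts/posters later removed as spam
}

// trustScore is one crude way to turn those tallies into a number in
// [0, 1]; a real ranker would be far more involved.
func (d domainStats) trustScore() float64 {
	total := d.OrganicLinks + d.SpamLinks
	if total == 0 {
		return 0.5 // no signal yet: stay neutral
	}
	return float64(d.OrganicLinks) / float64(total)
}

func observe(stats map[string]*domainStats, domain string, removedAsSpam bool) {
	s, ok := stats[domain]
	if !ok {
		s = &domainStats{}
		stats[domain] = s
	}
	if removedAsSpam {
		s.SpamLinks++
	} else {
		s.OrganicLinks++
	}
}

func main() {
	stats := map[string]*domainStats{}

	// Two posts linking to example.org survived moderation.
	observe(stats, "example.org", false)
	observe(stats, "example.org", false)
	// A post linking to spammy.example was deleted as spam:
	// be more cautious with that domain from now on.
	observe(stats, "spammy.example", true)

	for domain, s := range stats {
		fmt.Printf("%s -> trust %.2f\n", domain, s.trustScore())
	}
}
```

A real ranker would fold a signal like this into the overall score rather than using it directly.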
I’d probably cheat and whitelist the bigger domains that we as humans recognize as not-spam.
I’d give my users convenient ways of giving feedback that a certain site is spam. Possibly explicitly, possibly in the form of buttons that do things like “exclude this domain from my search results, permanently”; maybe do something like copying Gmail’s “mark spam” button, and have a “spam” tab on my search results.
Ah yes, good points. I’ve seen a couple StackExchange clones and bogus Quora threads while searching for obscure coding issues.
I like your ideas on flagging results and piggybacking, thanks!
Decentralise it. I want a completely subjective web. I want to subscribe to people whose views and opinions I like, and when I ‘search’ I want a breadth-first search of all their content, spending increasingly more search resources on stuff that’s more semantically relevant.
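A sketch of what that budgeted breadth-first walk over subscriptions could look like (toy code: the substring check stands in for real semantic relevance, and the `page`/`fetch` shapes are invented):

```go
package main

import (
	"fmt"
	"strings"
)

// page is a hypothetical unit of content published by someone you subscribe to.
type page struct {
	URL, Text string
	Links     []string
}

// search walks breadth-first outward from your subscriptions, spending a
// fixed budget of fetches. The substring check is a toy stand-in for
// semantic relevance; links from relevant pages are explored first, so
// more of the budget goes to the relevant parts of the graph.
func search(query string, subscriptions []string, fetch func(string) (page, bool), budget int) []page {
	var results []page
	queue := append([]string{}, subscriptions...)
	seen := map[string]bool{}

	for len(queue) > 0 && budget > 0 {
		url := queue[0]
		queue = queue[1:]
		if seen[url] {
			continue
		}
		seen[url] = true

		p, ok := fetch(url)
		if !ok {
			continue
		}
		budget--

		if strings.Contains(strings.ToLower(p.Text), strings.ToLower(query)) {
			results = append(results, p)
			queue = append(append([]string{}, p.Links...), queue...) // relevant: explore its links first
		} else {
			queue = append(queue, p.Links...) // less relevant: explore later
		}
	}
	return results
}

func main() {
	// Tiny in-memory "web" standing in for your subscriptions' content.
	web := map[string]page{
		"alice/feed":  {URL: "alice/feed", Text: "notes on search engines", Links: []string{"alice/post1"}},
		"alice/post1": {URL: "alice/post1", Text: "more about search ranking"},
		"bob/feed":    {URL: "bob/feed", Text: "woodworking projects"},
	}
	fetch := func(url string) (page, bool) { p, ok := web[url]; return p, ok }

	for _, p := range search("search", []string{"alice/feed", "bob/feed"}, fetch, 10) {
		fmt.Println(p.URL)
	}
}
```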
Huh, that’d be interesting. I’d imagine there’d be “major” nodes (Wikipedia could host their own, for example) or…actually, maybe not. Nodes would be as big as the owner’s time and server/database size.
I like this idea.
EDIT: Seems like you’ve described YaCy (I just learned of this too).
Now you have two problems!
This sounds like a friend-to-friend version of YaCy. I’d use it!
“Something like Go or Rust” is definitely not necessary and it doesn’t matter whether or not you use a document-oriented database. Your main concern is to have the computing power and connectivity to download and index the pages, especially if you want to even remotely compete with Google. This isn’t a task for a single computer, and you need to be able to add new servers on the fly.
With systems like this, even talking about a single database backend in the sense of “and there’s a few servers running Postgres” is silly. It wouldn’t scale. Can you imagine thousands and thousands of servers at different geographical locations, all consistent with each other?
You need to rethink your requirements, and the consistency is the first thing to go off the list. When a certain page gets removed from the index, for example, it doesn’t matter that one server stops serving it after 5 minutes and the other after 2 hours.
Some elementary (and extremely simplified) design could be something like this:
At a certain geographical location, you’ll have a bunch of nodes, where each node holds a certain part of the index, and also some front-facing servers. When a new search request comes in, you query all of these nodes (or only the relevant ones, if you can somehow make that happen) at once and then merge their answers into a single result you’ll report back to the user.
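The query path is then basically scatter/gather; a minimal sketch, with the shard interface invented just to show the shape:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// hit is one scored result coming back from an index shard.
type hit struct {
	URL   string
	Score float64
}

// shard stands in for whatever node holds one slice of the index.
type shard func(query string) []hit

// search fans the query out to every shard in parallel, then merges
// the partial answers into one ranked list for the user.
func search(query string, shards []shard, limit int) []hit {
	var (
		mu  sync.Mutex
		all []hit
		wg  sync.WaitGroup
	)
	for _, s := range shards {
		wg.Add(1)
		go func(s shard) {
			defer wg.Done()
			hits := s(query)
			mu.Lock()
			all = append(all, hits...)
			mu.Unlock()
		}(s)
	}
	wg.Wait()

	sort.Slice(all, func(i, j int) bool { return all[i].Score > all[j].Score })
	if len(all) > limit {
		all = all[:limit]
	}
	return all
}

func main() {
	// Two fake shards, each "holding" a different part of the index.
	shards := []shard{
		func(q string) []hit { return []hit{{"https://a.example", 0.9}} },
		func(q string) []hit { return []hit{{"https://b.example", 0.7}, {"https://c.example", 0.4}} },
	}
	for _, h := range search("yellow bus", shards, 10) {
		fmt.Printf("%.1f %s\n", h.Score, h.URL)
	}
}
```

The merge step is also where global scoring and deduplication would happen, which this sketch skips.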
Then you need to figure out the crawling and indexing. The indexing is likely to be enormously computationally expensive, because with full-text search you need to do some analysis on the text to actually make it searchable. When I search for “žlutý” on Google, which is the Czech word for “yellow”, it gives preference to the exact match “žlutý”, but also highlights the words žlutými, žlutá, žlutého, žlutém, žlutým, žlutě, žlutému, … (etc.), which are the different grammatical cases of the very same word.
In order to achieve this, you need to take the input text and reduce every word to a single canonical form. So you would turn the Czech sentence “Jezdili žlutými autobusy” into something like (jezdili, jet), (žlutými, žlutá), (autobusy, autobus). You then index all of these words, but give them different weights (an exact match is more valuable).
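A toy version of that indexing step (the little lemma table stands in for a real morphological analyzer, and the weights are arbitrary):

```go
package main

import (
	"fmt"
	"strings"
)

// posting records which document a term appears in and how strong the match is.
type posting struct {
	DocID  int
	Weight float64
}

// lemmas is a stand-in for a real morphological analyzer; it maps an
// inflected form to its base form, e.g. "žlutými" -> "žlutá".
var lemmas = map[string]string{
	"jezdili":  "jet",
	"žlutými":  "žlutá",
	"autobusy": "autobus",
}

// index adds both the surface form (high weight, exact match) and the
// base form (lower weight) to the inverted index.
func index(inv map[string][]posting, docID int, text string) {
	for _, word := range strings.Fields(strings.ToLower(text)) {
		inv[word] = append(inv[word], posting{docID, 1.0}) // exact form
		if lemma, ok := lemmas[word]; ok && lemma != word {
			inv[lemma] = append(inv[lemma], posting{docID, 0.5}) // base form
		}
	}
}

func main() {
	inv := map[string][]posting{}
	index(inv, 1, "Jezdili žlutými autobusy")

	// A search for the base form "žlutá" now finds the document,
	// just with a lower weight than an exact match would get.
	fmt.Println(inv["žlutá"])
	fmt.Println(inv["žlutými"])
}
```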
Once you do this once, you need to propagate the modifications over the entire system (so, say, from North America to Europe), but do it in a way that doesn’t overload the servers.
And I’m terribly oversimplifying here. I didn’t mention scoring, sorting the results, and a lot, lot of other things.
But I hope I made my point clear: Deciding what language would be fancier to use is somewhat of a smaller problem here.
Thank you for your insight, there’s so much I don’t know I don’t know.
Build discovery engines instead of search engines. Search minimizes the diversity of results; discovery maximizes diversity under similarity constraints. This is a much more interesting problem: which similarity and diversity constraints are appropriate?
As an example of the limitations of search, searching for a document by typing in the whole thing will just give you that same document back. Queries to a discovery engine would likely not be short strings, but entire documents.
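One common way to make “maximize diversity under similarity constraints” concrete is something like maximal marginal relevance: greedily pick items that are relevant to the query but unlike what you have already picked. A rough sketch, with a deliberately silly similarity function:

```go
package main

import (
	"fmt"
	"math"
)

// doc is a candidate result with a precomputed relevance to the query
// (which, for a discovery engine, might itself be an entire document).
type doc struct {
	ID        string
	Relevance float64
}

// discover greedily picks items that are relevant to the query but
// dissimilar to what has already been selected - roughly the
// maximal-marginal-relevance idea. sim returns a similarity in [0, 1];
// lambda trades relevance against diversity.
func discover(candidates []doc, sim func(a, b string) float64, lambda float64, k int) []doc {
	var selected []doc
	remaining := append([]doc{}, candidates...)

	for len(selected) < k && len(remaining) > 0 {
		bestIdx, bestScore := 0, math.Inf(-1)
		for i, c := range remaining {
			maxSim := 0.0
			for _, s := range selected {
				if v := sim(c.ID, s.ID); v > maxSim {
					maxSim = v
				}
			}
			if score := lambda*c.Relevance - (1-lambda)*maxSim; score > bestScore {
				bestIdx, bestScore = i, score
			}
		}
		selected = append(selected, remaining[bestIdx])
		remaining = append(remaining[:bestIdx], remaining[bestIdx+1:]...)
	}
	return selected
}

func main() {
	// Deliberately silly similarity: 1 if the first letter matches, else 0.
	sim := func(a, b string) float64 {
		if a[0] == b[0] {
			return 1
		}
		return 0
	}
	docs := []doc{{"apples", 0.9}, {"apricots", 0.85}, {"bicycles", 0.6}}
	// With k=2, the second pick is "bicycles" rather than the slightly
	// more relevant but near-duplicate "apricots".
	fmt.Println(discover(docs, sim, 0.7, 2))
}
```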
Do you have an example of such an engine?
E.g. PubMed recommends related articles.
I haven’t seen anything net-wide, though.
I would say that, although problematic, YouTube’s recommendation system is an example of a discovery engine.
A “search engine” is kind of a broad term. Are you talking about the thing that generates search results when you type in a query, or the thing that crawls webpages to get the data for those search results? Those are typically two separate entities and work asynchronously from one another. The search engine itself, the place you actually type in queries and view results, needs to be highly available, fast, and accurate. The crawler, on the other hand, just needs to be able to crawl as many pages as possible in the shortest amount of time.
If you have those components written in something super fast and performant (like Go or Rust), your next bottleneck is going to be the database. PostgreSQL is great and I love it for both personal and professional projects, but I’m not sure you’d be using it to its full potential if it’s just reading and writing to one big table all the time. You’re not working with a bunch of structured, inter-dependent data here, so I’m not sure you’d really be using all the great features that SQL databases provide. You might want to take a look at the various document-oriented databases out there, since they’re a bit easier to scale if you’re not trying to use any of the relational features you come to expect from SQL.
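To make the document-oriented point concrete, a crawled page could just be one self-contained record per URL, something like this (field names are invented for illustration):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// crawledPage is the kind of schema-light record a crawler might drop
// into a document store (Mongo, RethinkDB, Elasticsearch, ...): one
// self-contained document per page, keyed by URL, with no joins or
// inter-document constraints.
type crawledPage struct {
	URL       string    `json:"url"`
	Title     string    `json:"title"`
	Body      string    `json:"body"`
	Links     []string  `json:"links"`
	FetchedAt time.Time `json:"fetched_at"`
}

func main() {
	doc := crawledPage{
		URL:       "https://example.org/buses",
		Title:     "Yellow buses",
		Body:      "A page about public transport.",
		Links:     []string{"https://example.org/timetable"},
		FetchedAt: time.Now().UTC(),
	}
	out, _ := json.MarshalIndent(doc, "", "  ")
	fmt.Println(string(out)) // what you'd hand to the document store
}
```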
(disclaimer: I don’t have really any specific insight, nor do I work for any search engine companies, this is just what I would do if I wanted to implement a search engine from scratch)
I suppose I mean the entire thing, so both. I’ve heard of document-oriented databases but never looked into them. I’ll check those out.
EDIT: Oh, that’s just something like Mongo or Rethink. I’m currently using RethinkDB for my personal projects and like it a lot.
Yeah, or Elasticsearch, which essentially combines a document-oriented data store with the Lucene library for full-text search, communicating over a REST API. Just to add more fuel to the “language doesn’t matter all that much” fire: Elastic is written in Java mostly so that it can depend on Lucene, which is also written in Java. The one downside of writing your own search engine in Rust or Go or something else is that you won’t get to use Lucene, which has a lot of the text-parsing and query-syntax stuff already figured out for you…
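For a feel of what “document store plus full-text search behind a REST API” means in practice, this is roughly how you would hit Elasticsearch’s _search endpoint from Go (the index and field names are placeholders, and it assumes a node running on localhost:9200):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// A match query against a hypothetical "pages" index with a "body"
	// field; both names are placeholders, not anything standard.
	query := []byte(`{"query": {"match": {"body": "yellow bus"}}}`)

	resp, err := http.Post(
		"http://localhost:9200/pages/_search", // assumes a local Elasticsearch node
		"application/json",
		bytes.NewReader(query),
	)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // raw JSON containing the scored hits
}
```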
There are lots of Lucene-inspired things in many languages; tantivy is good. sonic, though, doesn’t depend on it while still being an Elastic-like thing.
Search systems usually avoid DBMSes like Postgres. For search, you don’t need the same kind of guarantees as for business applications. E.g. Elasticsearch does storage on its own.
Also, about web search specifically: you might be interested in https://www.yacy.net
The hard part of a search engine isn’t search, or even crawling. It’s the distributed compute platform on which it runs. It’s data-centre management, power management, cooling management, and finally, paying for all of it.
So basically, to be a real contender you’ve gotta already be flush with cash…and have other revenue streams.
Getting the search engine operational is one thing but I don’t think advertising is a good monetization strategy these days. But then, who in their right mind would run a powerful search engine for free?
I just want to mention that programming language is not critical to the success of this sort of thing at all. It’s all about the algorithm. You could use Visual Basic or anything more modern than UCSD Pascal and be fine. Interop with your data store and the network is the only real issue.
😅 Yeah, I definitely came at this from the wrong perspective.
To recap: the algorithm, the data store, the network, and the speed from search input to displayed results are what’s needed to succeed.
This community is awesome.
Why would languages that have existed for 10 years be necessary to build something that’s been around for 20 years? And why would such young languages be considered a “sturdy foundation”?
A search engine will have at least one crawler and at least one web server, and some way for them to speak to each other; how you fill in the details will determine what sort of service it provides.
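A bare-bones sketch of that split, with the crawler and the web-serving side talking over a channel (all names here are made up, and a real system would use a proper queue and index):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"
	"sync"
)

// page is whatever the crawler hands over to the indexing/serving side.
type page struct{ URL, Text string }

func main() {
	pages := make(chan page, 100) // the "some way they speak to each other"

	var mu sync.Mutex
	index := map[string][]string{} // term -> URLs (a stand-in for a real index)

	// Indexer: consumes whatever the crawler finds, whenever it finds it.
	go func() {
		for p := range pages {
			mu.Lock()
			for _, term := range strings.Fields(strings.ToLower(p.Text)) {
				index[term] = append(index[term], p.URL)
			}
			mu.Unlock()
		}
	}()

	// "Crawler": here it just feeds in two hard-coded pages.
	go func() {
		pages <- page{"https://example.org", "yellow buses"}
		pages <- page{"https://example.net", "red bicycles"}
	}()

	// Web server: the user-facing half, e.g. GET /search?q=buses
	http.HandleFunc("/search", func(w http.ResponseWriter, r *http.Request) {
		q := strings.ToLower(r.URL.Query().Get("q"))
		mu.Lock()
		urls := index[q]
		mu.Unlock()
		fmt.Fprintln(w, strings.Join(urls, "\n"))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```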
Something existing for x amount of years doesn’t mean it cannot be improved upon.
The languages I mentioned have been deployed to production on modern hardware for today’s products. They seem to be working. Furthermore, languages with active development and avid developers have a good chance of improving.
So you want to use new languages because you perceive them to be an improvement, not because they are necessary for a sturdy foundation.
Perl, Python, PHP, Lua, Erlang, Java, and C++ all have been deployed on modern hardware and have active development, and young languages are less sturdy than mature ones as a rule. There are differences in investment and rate of development, but I don’t usually want my sturdy foundations to be constantly innovating and reinventing themselves.
I recommend contacting the findx people. They tried.
https://web.archive.org/web/20190921180535/http://privacore.github.io/
Just reached out to the founder, thanks!