A new search engine for SEOs to play with

https://quickwit.io/blog/commoncrawl/

This blog post pairs best with our common-crawl demo and a glass of vin de Loire.

Six months ago, we founded Quickwit with the objective of building a new breed of full-text search engine that would be 10 times more cost-efficient on very large datasets. How do we intend to do this? Our search engine will search data straight from Amazon S3, achieving true decoupled compute and storage.

For distributed compute engines such as Presto or Spark, decoupled compute and storage has been a reality for a while. However, for search engines, this is a new trend.

In May 2020, Amazon Elasticsearch Service released UltraWarm, a solution for searching straight from Amazon S3 that, according to this AWS Online Tech Talk, can search petabytes of data in minutes.

In March 2021, Elastic also released frozen tier search, giving Elasticsearch users the ability to mount and start searching indexes stored on Amazon S3 in a dozen seconds. Both Amazon and Elastic built their solutions on a cunning caching layer for searching Elasticsearch indexes, and both communicate sparingly about their performance. We, on the other hand, had the luxury of designing our search engine from the ground up, which helps us deliver performance close to what an index residing on a local SSD can offer.

Our ascetic quest to truly separate compute and storage led us to a solution that is entirely stateless. Not only is our solution more cost-efficient, but our search clusters are also easier to operate. One can add or remove search instances in seconds. Multi-tenant search becomes trivial.

As we were developing our engine, we started having conversations with potential users. Often, as we chanted the virtues of our future solution, our claims were met with healthy skepticism.

Let's be honest, "10 times cheaper" has a bad teleshopping vibe and there is a plethora of bold performance claims in the database world.

A compelling demo is key to getting our point across.

So we set out to build a compelling demo based on a large dataset. The Common Crawl corpus, consisting of several billion web pages, appeared to be the best candidate. Our demo is simple: the user types the beginning of a phrase, and the app finds the most common adjectives or noun phrases that follow it in the 1 billion web pages we have indexed.

How does this demo work? At a high level, this demo is built on top of the following components:

  • an inverted search index stored on Amazon S3
  • a search engine that retrieves web pages matching a phrase query
  • an NLP server that extracts adjectives and noun phrases from matching snippets and computes the most common occurrences
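
Pieced together, the request flow looks roughly like the sketch below. The endpoint paths, parameter names, and response fields are hypothetical and only illustrate how the three components hand data to one another; they are not our actual API.

    import requests

    SEARCH_API = "http://search-cluster.internal/api/v1/search"   # hypothetical endpoint
    NLP_API = "http://nlp-server.internal/api/v1/top_phrases"     # hypothetical endpoint

    def complete_phrase(prefix: str, max_snippets: int = 18_000) -> list[tuple[str, int]]:
        # 1. Run a phrase query against the index stored on Amazon S3 and
        #    collect the matching snippets.
        search_response = requests.get(
            SEARCH_API,
            params={"query": f'"{prefix}"', "max_hits": max_snippets},
            timeout=30,
        )
        snippets = search_response.json()["snippets"]

        # 2. Ask the NLP server for the most common adjectives and noun phrases
        #    that follow the prefix in those snippets.
        nlp_response = requests.post(
            NLP_API,
            json={"prefix": prefix, "snippets": snippets},
            timeout=60,
        )
        # Returns (phrase, count) pairs sorted by count, ready for the word cloud.
        return nlp_response.json()["most_common"]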

Let’s dive deeper into each component.

Search index

The index for this demo contains 1 billion English pages extracted from the latest Common Crawl snapshot. We identified those pages using the excellent natural language detection library Whatlang. Our indexing engine, written in Rust, ran for approximately 24 hours on a single Amazon EC2 instance (m5d.8xlarge) to produce a 6.8 TB index stored on Amazon S3. We are satisfied with that throughput, considering that the engine also downloaded and decompressed (gzip) thousands of Common Crawl WET files, performed language detection on each page, stemmed each word in the corpus, and finally built and uploaded the index. The index is split into 180 shards to allow for parallel production and consumption. Quickwit usually performs shard pruning based on the shards’ metadata; unfortunately, Common Crawl does not offer such an opportunity.
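
The production indexer is written in Rust, but the per-page steps are easy to picture. Below is a minimal Python sketch of the filtering stage, using warcio to read WET records and the langdetect package as a stand-in for Whatlang; the helper name and the 1,000-character prefix heuristic are illustration choices, not what our engine actually does.

    from warcio.archiveiterator import ArchiveIterator
    from langdetect import detect   # stand-in for Whatlang in this sketch

    def english_pages(wet_gz_path: str):
        """Yield (url, text) for every English page in one Common Crawl WET file."""
        with open(wet_gz_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "conversion":   # WET files store page text as 'conversion' records
                    continue
                url = record.rec_headers.get_header("WARC-Target-URI")
                text = record.content_stream().read().decode("utf-8", errors="replace")
                try:
                    is_english = detect(text[:1000]) == "en"   # cheap check on a prefix of the page
                except Exception:
                    continue   # empty or undetectable content
                if is_english:
                    yield url, text

Each shard is then built from a slice of these pages, with stemming applied at indexing time, and uploaded to Amazon S3.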

Search engine

Our search engine is made up of stateless search instances that simply run our code and fetch index data directly from Amazon S3. They do not maintain a partial copy of the index or any other data structure on their local disks. This makes scaling the cluster, operating it, and handling failures trivially easy. When a query is submitted to the cluster, one instance, chosen randomly among the instances constituting the cluster, becomes the coordinator for the query. The coordinator determines the list of relevant shards for the query, evenly distributes the work among peer search instances, and waits for the partial results to come back. It then merges and sorts those partial results and returns them to the client. For this demo, we deployed our search engine on two Amazon EC2 instances (c5n.2xlarge). Those instances have good CPUs but also good network performance, which is key, as each instance issues many parallel requests to Amazon S3 and tends to be bound by the network.
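
The coordinator logic boils down to a scatter-gather pattern. The sketch below captures the idea; the search_on_peer callable is a placeholder for the RPC that asks one search instance to evaluate the query on a subset of shards, and the round-robin shard assignment is a simplification.

    from concurrent.futures import ThreadPoolExecutor
    from typing import Callable

    Hit = tuple[float, str]   # (score, document id)

    def coordinate_search(
        query: str,
        shards: list[str],
        peers: list[str],
        max_hits: int,
        search_on_peer: Callable[[str, str, list[str], int], list[Hit]],
    ) -> list[Hit]:
        # Spread the shards evenly across the peer search instances.
        assignments = {peer: shards[i::len(peers)] for i, peer in enumerate(peers)}

        # Fan out the partial searches in parallel and wait for all of them.
        with ThreadPoolExecutor(max_workers=len(peers)) as pool:
            partial_results = pool.map(
                lambda peer: search_on_peer(peer, query, assignments[peer], max_hits),
                peers,
            )

        # Merge the partial hit lists, sort by score, and keep the best hits overall.
        merged = [hit for hits in partial_results for hit in hits]
        merged.sort(key=lambda hit: hit[0], reverse=True)
        return merged[:max_hits]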

Different shades of latency

When asked about the performance and latency of our search engine, our answer is invariably the same: “it depends on the query.”

For a “simple” query that contains five or fewer query terms and is limited to the first hundred hits or so, our search engine is able to answer in one to two seconds. For instance, the latency for the query “+Barack AND +Obama” limited to 200 hits is 1.5 seconds on average. When an index and a query allow for shard pruning, we can sometimes answer in less than a second.

However, for more complex queries that require more bandwidth and processing time, latency increases. For instance, for this demo, we are not only interested in web pages that contain all the query terms but also in those that have them in the right order - think “Barack Obama is the president” vs. “The president is Barack Obama”. In information retrieval lingo, these are called phrase queries, and most search engines support them with a double-quotes syntax. To process phrase queries, we need to fetch an additional data structure that encodes the position of each token in the web pages. Furthermore, for the needs of this demo, we do not limit ourselves to the first n results. On the contrary, we want to retrieve all the web pages that match the phrase query, and for some very frequent terms, this can be a lot of pages. For that reason, we’ve capped the number of results returned per shard at 1,000. The phrase query “Barack Obama is” reaches that cap and returns 18,000 snippets in 12 to 15 seconds.
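
At the heart of phrase-query evaluation is a positional check. Here is a small self-contained sketch of that check for a single document, assuming we have already fetched the positions of each query term in that document: the document matches only if the terms appear at consecutive positions.

    def phrase_match(positions_per_term: list[set[int]]) -> bool:
        """positions_per_term[i] holds the positions of the i-th query term in one document."""
        first, *rest = positions_per_term
        return any(
            all(start + offset in positions for offset, positions in enumerate(rest, start=1))
            for start in first
        )

    # "barack" at 4, "obama" at 5, "is" at 6: consecutive positions, so it matches.
    assert phrase_match([{4, 17}, {5}, {6, 30}])
    # "the president is barack obama": all three terms occur, but out of order.
    assert not phrase_match([{3}, {4}, {2}])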

We are planning on releasing a proper benchmark of our search engine accompanied by a blog post in the upcoming months.

NLP server

The NLP server streams up to 18,000 matching snippets from the search engine and pushes them through a Python library called Pattern to identify and extract adjectives and noun phrases. The server counts each occurrence and, once all the snippets have been processed, returns the most common ones to the front-end, which then displays them as a word cloud.
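
As an illustration, here is a condensed Python sketch of that counting step. The function name and the exact extraction rules (noun-phrase chunks plus standalone adjectives, via Pattern's shallow parser) are a simplification of what the server actually does.

    from collections import Counter
    from pattern.en import parsetree

    def top_phrases(snippets: list[str], n: int = 50) -> list[tuple[str, int]]:
        counts = Counter()
        for snippet in snippets:
            for sentence in parsetree(snippet):
                # Count noun phrases such as "the 44th president" ...
                for chunk in sentence.chunks:
                    if chunk.type == "NP":
                        counts[chunk.string.lower()] += 1
                # ... and standalone adjectives such as "popular".
                for word in sentence.words:
                    if word.type and word.type.startswith("JJ"):
                        counts[word.string.lower()] += 1
        return counts.most_common(n)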

Cost

We estimate that our experiment costs less than $1,000 per month: storing the index on Amazon S3 costs $160 per month, and the small pool of two search instances costs $650 per month. We also incurred a one-time expense of $45 for indexing the dataset. In the future, we should be able to cut costs further by deploying our code on Amazon Graviton instances.

What’s next?

In the next few months, we will start open-sourcing our search engine. We will accompany the release with a series of blog posts explaining our unique technology and sharing our vision and roadmap.

In the meantime, if you have any questions, can think of an interesting use case for our search engine, or are interested in becoming a beta user, get in touch with us at hello@quickwit.io.
