Case Study: GitHub
How do you satisfy the search needs of GitHub’s 4 million users while simultaneously providing tactical
operational insights that help you iteratively improve customer service?
The Solution:
By using Elasticsearch to index over 8 million code repositories, along with critical event data.
“Search is at the core of GitHub,” says Tim Pease, an Operations Engineer at GitHub. “If you go to GitHub.com/search you can
search through repositories, users, issues, pull requests, and source code.”
One goal of GitHub’s Elasticsearch implementation is to index everything that is publicly available on GitHub.com and make
it easy to find. Of course, full-text searching is fully supported, but searching based on a wide variety of criteria is also possible
and dead simple.
“You can search for a project that uses Clojure as the primary language, and has had activity over the past month, and all this
functionality is powered by Elasticsearch,” says Pease.
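A sketch of what such a query might look like in the Elasticsearch query DSL, sent from Python with the requests library; the index name ("repositories") and the field names ("language", "pushed_at") are assumptions rather than GitHub's actual schema:

import json
import requests

# Repositories whose primary language is Clojure and that have seen a push
# within the past month (date math: now minus one month, rounded to the day).
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"language": "Clojure"}},
                {"range": {"pushed_at": {"gte": "now-1M/d"}}},
            ]
        }
    },
    "size": 10,
}

resp = requests.post(
    "http://localhost:9200/repositories/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"])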
Elasticsearch’s flexible storage and retrieval formats, which permit both highly structured and loosely structured data to co-exist
in search storage, along with Elasticsearch’s extensive set of search primitives, made search implementation straightforward.
“You can do lots of queries on that data using Elasticsearch that a standard SQL database won’t support,” notes Pease.
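As an illustration of that flexibility (not GitHub's actual design), the sketch below indexes a tightly structured repository record and a loosely structured event into the same hypothetical index; Elasticsearch's dynamic mapping picks up new fields as they appear, with no schema migration:

import requests

base = "http://localhost:9200/mixed-data"   # hypothetical index name

# A highly structured document...
requests.put(f"{base}/_doc/1", json={
    "repo_id": 42,
    "language": "Clojure",
    "stars": 310,
    "pushed_at": "2013-01-15",
})

# ...and a loosely structured event can live alongside it; its new fields
# are mapped dynamically when the document is indexed.
requests.put(f"{base}/_doc/2", json={
    "event": "push",
    "actor": "octocat",
    "payload": {"commits": 3, "branch": "main"},
})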
“Using Elasticsearch queries, we can quickly see every action the user has done,” says Pease. “This is a great way to see whether
an account has been stolen, hijacked, or whether the user has done something naughty.”
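A sketch of an audit-style query in that spirit, assuming a hypothetical events index with actor, action, and @timestamp fields:

import json
import requests

# Most recent actions by a single user, newest first.
query = {
    "query": {"term": {"actor": "some-user"}},
    "sort": [{"@timestamp": {"order": "desc"}}],
    "size": 50,
}

resp = requests.post(
    "http://localhost:9200/events/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)
for hit in resp.json()["hits"]["hits"]:
    src = hit["_source"]
    print(src["@timestamp"], src["action"])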
GitHub uses Elasticsearch’s histogram facet query capability, as well as other statistical facets, to track increases in the rate of
specific types of code exceptions. That process reveals bugs in their software systems.
“Elasticsearch’s histogram facet query capability performs extremely well. We’re looking to expand its use in that particular
application,” says Rodgers.
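The facet API referenced here was later superseded by Elasticsearch aggregations; on recent releases the same exception-rate histogram could be expressed as a date_histogram aggregation, sketched below with a hypothetical exceptions index and field names:

import json
import requests

# Hourly counts of one exception class; size 0 returns only the histogram buckets.
query = {
    "size": 0,
    "query": {"term": {"exception_class": "Timeout::Error"}},
    "aggs": {
        "exceptions_per_hour": {
            "date_histogram": {"field": "@timestamp", "calendar_interval": "hour"}
        }
    },
}

resp = requests.post(
    "http://localhost:9200/exceptions/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)
for bucket in resp.json()["aggregations"]["exceptions_per_hour"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])

A spike in one bucket relative to its neighbors is the kind of rate increase described above.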
Faced with the choice of sharding its own data in Solr to handle the load or moving to Elasticsearch, GitHub found the decision easy. “We decided to move to Elasticsearch because we figured they could shard things much better than we could,” says Pease.
Elasticsearch offers automatic shard rebalancing to increase performance and handle failover conditions. Replica shards are
automatically distributed to new nodes in a cluster and, in the case of node failure, shards are automatically migrated from failed
nodes to good nodes.
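A minimal sketch of the settings behind that behavior: an index is created with a primary shard count and a replica count, and Elasticsearch decides which nodes host each copy and re-allocates them when a node fails. The index name and numbers here are illustrative, not GitHub's configuration:

import requests

# Create an index; shard placement and failover are handled by the cluster.
requests.put("http://localhost:9200/code-search", json={
    "settings": {
        "number_of_shards": 4,      # primary shards, fixed at index creation
        "number_of_replicas": 1,    # one copy of each primary on a different node
    }
})

# The replica count can be raised or lowered on a live index at any time.
requests.put("http://localhost:9200/code-search/_settings", json={
    "index": {"number_of_replicas": 2}
})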
GitHub uses Elasticsearch to index new code as soon as users push it to a repository on GitHub. The new code becomes searchable almost immediately, with results returned for public repositories and, for logged-in users, any private repositories they can access.
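A sketch of that flow under the same assumptions as above (a hypothetical code-search index and document shape); by default a newly indexed document becomes searchable after the next refresh, roughly one second later:

import json
import time
import requests

base = "http://localhost:9200/code-search"

# Index a file's contents as soon as the push has been processed.
requests.put(f"{base}/_doc/42-README.md", json={
    "repo_id": 42,
    "path": "README.md",
    "content": "Hello, Elasticsearch!",
    "public": True,
})

# Wait out the default one-second refresh interval, then search.
time.sleep(1)
resp = requests.post(
    f"{base}/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps({"query": {"match": {"content": "Elasticsearch"}}}),
)
print(resp.json()["hits"]["total"])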
To optimize access to search data, GitHub uses sharding extensively. In GitHub’s main Elasticsearch cluster, they have about 128 shards, with each shard storing about 120 gigabytes.
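Shard counts and per-shard store sizes of a running cluster can be inspected with the _cat/shards API; a sketch:

import requests

# One row per shard: index name, shard number, primary or replica, doc count, on-disk size.
resp = requests.get(
    "http://localhost:9200/_cat/shards",
    params={"v": "true", "h": "index,shard,prirep,docs,store"},
)
print(resp.text)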
To optimize search within a single repository, GitHub uses the Elasticsearch routing parameter based on the repository ID. “That
allows us to put all the source code for a single repository on one shard,” says Pease. “If you’re on just a single repository page,
and you do a search there, that search actually hits just one shard. Those queries are about twice as fast as searches from the
main GitHub search page.”
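A sketch of that routing scheme using the standard routing parameter on both indexing and search requests, with the index name and document fields as assumptions:

import json
import requests

base = "http://localhost:9200/code-search"   # hypothetical index name
repo_id = "42"

# Route every document belonging to one repository to the same shard...
requests.put(
    f"{base}/_doc/42-src-core.clj",
    params={"routing": repo_id},
    json={"repo_id": 42, "path": "src/core.clj", "content": "(defn hello [] :world)"},
)

# ...then pass the same routing value at search time so only that shard is queried.
resp = requests.post(
    f"{base}/_search",
    params={"routing": repo_id},
    headers={"Content-Type": "application/json"},
    data=json.dumps({"query": {"match": {"content": "hello"}}}),
)
print(resp.json()["hits"]["total"])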
Elastic believes getting immediate, actionable insight from data matters. As the company behind the three open source projects — Elasticsearch, Logstash,
and Kibana — designed to take data from any source and search, analyze, and visualize it in real time, Elastic is helping people make sense of data. From stock
quotes to Twitter streams, Apache logs to WordPress blogs, our products are extending what’s possible with data, delivering on the promise that good things
come from connecting the dots.