Case Study Github

GitHub uses Elasticsearch to power search across over 8 million code repositories and 2 billion documents, enabling powerful search for both users and developers. Elasticsearch allows GitHub to scale effectively through robust sharding and queries to serve over 4 million users. GitHub also leverages Elasticsearch's analytic capabilities to monitor internal infrastructure for abuse, bugs, and more through advanced queries. This satisfies both regular users and developers through the Elasticsearch API.

Uploaded by Eranga Udesh

The Challenge:

How do you satisfy the search needs of GitHub’s 4 million users while simultaneously providing tactical
operational insights that help you iteratively improve customer service?

The Solution:
By using Elasticsearch to index over 8 million code repositories as well as indexing critical event data.

Enable Powerful Search For Both End-Users And Developers

• Scale out to meet the needs of burgeoning user base by migrating away from Apache Solr to Elasticsearch
• Index and query almost any type of publicly exposed data
• Enable deep programmatic search for developer applications
• Provide near real-time indexing as soon as users upload new data

Leverage Analytics On Search Data

• Reveal rogue users by querying indexed logging data
• Find software bugs within the GitHub platform by indexing all alerts, events, logs and tracking the rate of specific code exceptions
• Make queries that go beyond standard SQL

Sophisticated Searching For Sophisticated Users


Elasticsearch powers search on GitHub, the largest hosted revision control system in the world with a demanding customer base
of over 4 million technical users. GitHub uses Elasticsearch to continually index the data from an ever-growing store of over 8
million code repositories, comprising over 2 billion documents. Using Elasticsearch, GitHub was able to let users easily search
this data.

“Search is at the core of GitHub,” says Tim Pease, an Operations Engineer at GitHub. “If you go to GitHub.com/search you can
search through repositories, users, issues, pull requests, and source code.”

One goal of GitHub’s Elasticsearch implementation is to index everything that is publicly available on GitHub.com and make
it easy to find. Of course, full-text searching is fully supported, but searching based on a wide variety of criteria is also possible
and dead simple.
“You can search for a project that uses Clojure as the primary language, and has had activity over the past month, and all this
functionality is powered by Elasticsearch,” says Pease.
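
A search of the kind Pease describes can be expressed in Elasticsearch's Query DSL as a bool query with filter clauses. The following is a minimal sketch, not GitHub's actual query: the field names `language`, `pushed_at`, and `description` are hypothetical, and the body is only built and printed, not sent to a cluster.

```python
import json


def build_repo_query(language, active_since, text=None):
    """Build a search body for repositories.

    All field names here are assumptions made for illustration.
    """
    filters = [
        # Exact match on the repository's primary language.
        {"term": {"language": language}},
        # Only repositories with activity at or after the given point in time.
        {"range": {"pushed_at": {"gte": active_since}}},
    ]
    query = {"bool": {"filter": filters}}
    if text:
        # Optional full-text clause alongside the structured filters.
        query["bool"]["must"] = [{"match": {"description": text}}]
    return {"query": query}


# "now-1M" is Elasticsearch date math for "one month ago".
body = build_repo_query("clojure", "now-1M")
print(json.dumps(body, indent=2))
```

Passing a `text` argument adds a full-text clause to the same body, which is how structured criteria and full-text search can be combined in one request.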

Elasticsearch’s flexible storage and retrieval formats, which permit both highly structured and loosely structured data to co-exist
in search storage, along with Elasticsearch’s extensive set of search primitives, made search implementation straightforward.
“You can do lots of queries on that data using Elasticsearch that a standard SQL database won’t support,” notes Pease.

Powering Analytic Insights Behind The Firewall


GitHub utilizes Elasticsearch’s combination of search indexing and analytics capability to drive multiple projects. For example,
GitHub found that the analysis capabilities of Elasticsearch queries could be used on stored audit and logging data in order to
track users’ security-related activity.

“Using Elasticsearch queries, we can quickly see every action the user has done,” says Pease. “This is a great way to see whether
an account has been stolen, hijacked, or whether the user has done something naughty.”

To learn more about Elastic, contact sales@elastic.co | www.elastic.co


When GitHub was looking to track and analyze code exceptions generated by the various software components that power
GitHub.com, they originally used a popular NoSQL database. Code exceptions were stored in secondary indexes, and the
database's analysis features were used to analyze exceptions over time, with the results stored back into the database.
“It didn’t work very well for our use case,” remembered Grant Rodgers, a technical staff member at GitHub. “Once we moved
everything to Elasticsearch and used its histogram facet queries, everything worked really well.”

GitHub uses Elasticsearch’s histogram facet query capability, as well as other statistical facets, to track increases in the rate of
specific types of code exceptions. That process reveals bugs in their software systems.
“Elasticsearch’s histogram facet query capability performs extremely well. We’re looking to expand its use in that particular
application,” says Rodgers.
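
Later Elasticsearch releases replaced facets with aggregations, but the underlying idea is the same: count events per time bucket and watch for a rising rate. As a rough illustration of that computation, the sketch below buckets timestamps in plain Python and flags buckets whose count jumps. The sample timestamps, the one-hour interval, and the spike threshold are all invented for this example.

```python
from collections import Counter


def histogram(timestamps, interval):
    """Count events per bucket; each key is the bucket's start time.

    Buckets with no events are simply absent from the result.
    """
    counts = Counter((ts // interval) * interval for ts in timestamps)
    return dict(sorted(counts.items()))


def spikes(buckets, factor=2.0):
    """Return bucket keys whose count is at least `factor` times the
    count of the preceding bucket."""
    keys = list(buckets)
    return [cur for prev, cur in zip(keys, keys[1:])
            if buckets[cur] >= factor * buckets[prev]]


# Hypothetical exception timestamps, in seconds from an arbitrary start.
events = [10, 50, 3700, 3800, 7300, 7400, 7500, 7600, 7700]
per_hour = histogram(events, interval=3600)
print(per_hour)          # {0: 2, 3600: 2, 7200: 5}
print(spikes(per_hour))  # [7200]
```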

Scaling To Millions Of Users


GitHub originally used Solr for search, but found that Solr couldn’t scale effectively and was more difficult to manage.
“As more people started using GitHub, we quickly exceeded the storage space that one Solr cluster and Solr instance could
handle,” says Pease.

Faced with the choice of sharding its own data in Solr to handle the load or moving to Elasticsearch, GitHub found the
decision easy. “We decided to move to Elasticsearch because we figured they could shard things much better than we could,” says
Pease.

Elasticsearch offers automatic shard rebalancing to increase performance and handle failover conditions. Replica shards are
automatically distributed to new nodes in a cluster and, in the case of node failure, shards are automatically migrated from failed
nodes to good nodes.
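
The effect of moving shards off a failed node can be pictured with a toy model. This is a sketch under simplifying assumptions, not Elasticsearch's allocation algorithm: placement is round-robin, the node and shard names are hypothetical, and real clusters also promote replica copies and apply allocation rules that are not modelled here.

```python
def allocate(shards, nodes):
    """Spread shards across nodes round-robin (toy placement rule)."""
    placement = {node: [] for node in nodes}
    for i, shard in enumerate(shards):
        placement[nodes[i % len(nodes)]].append(shard)
    return placement


def fail_node(placement, failed):
    """Reassign the failed node's shards, one at a time, to whichever
    surviving node currently holds the fewest shards."""
    orphaned = placement.pop(failed)
    for shard in orphaned:
        target = min(placement, key=lambda node: len(placement[node]))
        placement[target].append(shard)
    return placement


cluster = allocate([f"shard-{i}" for i in range(6)],
                   ["node-a", "node-b", "node-c"])
fail_node(cluster, "node-b")
for node, held in cluster.items():
    print(node, held)
```

After the simulated failure, the two surviving nodes each hold three shards, so the load stays balanced without any manual step.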

Advanced Sharding For High Performance


With over 2 billion documents, all indexed by Elasticsearch, and with users constantly uploading and modifying code, search
performance is a key metric for the GitHub team. GitHub serves, on average, 300 search requests per minute.

GitHub uses Elasticsearch to index new code as soon as users push it to a repository on GitHub. The data becomes searchable
very soon after, and search results cover public repositories and, for logged-in users, any private repositories
they can access.

To optimize access to search data, GitHub uses sharding extensively. GitHub’s main Elasticsearch cluster has about 128
shards, each storing about 120 gigabytes.
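
Taken at face value, the two approximate figures imply roughly 15 terabytes of index data in the main cluster. The short calculation below makes that explicit. It treats 1 TB as 1024 GB, a convention chosen here, and the case study does not say whether the figures include replica shards.

```python
shards = 128          # approximate shard count from the case study
gb_per_shard = 120    # approximate size of each shard, in gigabytes

total_gb = shards * gb_per_shard
total_tb = total_gb / 1024  # assumes 1 TB = 1024 GB

print(total_gb)  # 15360
print(total_tb)  # 15.0
```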

To optimize search within a single repository, GitHub uses the Elasticsearch routing parameter based on the repository ID. “That
allows us to put all the source code for a single repository on one shard,” says Pease. “If you’re on just a single repository page,
and you do a search there, that search actually hits just one shard. Those queries are about twice as fast as searches from the
main GitHub search page.”
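
The reason a routed search touches only one shard can be sketched in a few lines. The example assumes the general scheme of hashing a routing key and taking the result modulo the shard count. `zlib.crc32` is a stand-in hash chosen for this sketch, not the hash Elasticsearch uses, and the repository ID is hypothetical.

```python
import zlib

NUM_SHARDS = 128  # approximate shard count quoted in the case study


def shard_for(routing_key, num_shards=NUM_SHARDS):
    """Map a routing key to a shard number.

    crc32 is a stand-in hash for illustration only.
    """
    return zlib.crc32(str(routing_key).encode("utf-8")) % num_shards


def shards_to_search(repo_id=None, num_shards=NUM_SHARDS):
    """Without a routing key every shard must be queried; with one,
    only the shard that key maps to."""
    if repo_id is None:
        return list(range(num_shards))
    return [shard_for(repo_id, num_shards)]


repo_id = 4242  # hypothetical repository ID
print(len(shards_to_search()))         # 128
print(len(shards_to_search(repo_id)))  # 1
```

Because every document indexed with the same routing key lands on the same shard, a search scoped to that key can skip the others entirely.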

GitHub’s benefits using Elasticsearch


Scale Effectively
• GitHub uses Elasticsearch’s robust sharding and advanced queries to serve up search across data in 4 million users’ code repositories.

High Performance
• GitHub uses Elasticsearch’s routing parameter and flexible sharding schemes to perform searches within a single repository on a single shard, doubling the speed at which results are served.

Analytics via Advanced Queries
• GitHub uses Elasticsearch’s histogram facet queries, as well as other Elasticsearch analytic queries, to monitor their internal infrastructure for abuse, bugs and more.

Satisfies Users and Developers
• Elasticsearch satisfies the search needs of both regular users, and, via the Elasticsearch API, application developers as well.

Elastic believes getting immediate, actionable insight from data matters. As the company behind the three open source projects — Elasticsearch, Logstash,
and Kibana — designed to take data from any source and search, analyze, and visualize it in real time, Elastic is helping people make sense of data. From stock
quotes to Twitter streams, Apache logs to WordPress blogs, our products are extending what’s possible with data, delivering on the promise that good things
come from connecting the dots.

