Google Summer of Code wrap-up: Apache Flink (previously Stratosphere)

Friday, September 26, 2014

We continue our Friday Google Summer of Code wrap-up series with Apache Flink (previously Stratosphere), which was a first-time participant in the program. Organization Administrator Robert Metzger talks below about their two successful student participants as well as their project’s transition to the Apache Software Foundation incubator program.
Apache Flink is a system for expressive, declarative, fast, and efficient data analysis. Flink combines the scalability and programming flexibility of distributed MapReduce-like platforms with the efficiency, out-of-core execution, and query optimization capabilities found in parallel databases.

We were accepted to this year’s Google Summer of Code (GSoC) under our former project name “Stratosphere”. But during the summer our project entered the incubator of the Apache Software Foundation (ASF). Incubation is a process for new projects to enter the umbrella of the ASF. As part of the process our project name was subsequently changed from Stratosphere to Flink.

Our move to the ASF also meant quite a few changes for us and our students during the course of their projects. Both mentors and students were able to learn together about the new processes required by the ASF and in the end this transition worked out quite well for everyone involved.

The acceptance of our project into GSoC was a huge, exciting accomplishment for all of the Flink / Stratosphere developers and especially thrilling for a first-time organization. We had two students this summer: Artem Tsikiridis and Frank Wu.

Artem worked on a full Hadoop MapReduce compatibility layer for Flink. Both Hadoop and Flink are distributed systems for processing huge amounts of data. Hadoop is an open source implementation of the MapReduce algorithm published by Google. It is widely used for a broad range of data intensive computing applications. Flink offers a broad range of operators and can be used to execute MapReduce-style applications.

Artem’s summer project concerned the implementation of a compatibility layer that exposes exactly the same APIs as Apache Hadoop. This feature allows existing Hadoop users to run their Hadoop jobs with Flink. Consequently, users are now able to utilize a faster execution engine for their existing code! Artem worked closely with the community and succeeded in bringing his changes into our main code line. His work will be available with the 0.7-incubating release of Apache Flink.

Frank Wu, our second GSoC student, worked on a large sub-project of Flink called Support for Streaming (Stratosphere Streaming). Frank initiated the development of the mini-batch processing API of Stratosphere Streaming, enabling operations on windows of tuples. Additionally, he contributed to both the iterative and stateful streaming solutions, two of the most challenging applications of streaming. Frank also provided numerous code examples for the topics he was working on. Like Artem's work, Frank's contributions will be available with the 0.7-incubating release of Apache Flink.

I would like to thank the mentors, Fabian Hüske and Marton Belassi, as well as our second organization administrator, Ufuk Celebi, for their help with Stratosphere/Flink’s GSoC participation in the summer of 2014.

By Robert Metzger, Organization Administrator for Apache Flink

Google Summer of Code 2014 by the numbers: Part three

Wednesday, September 24, 2014

In our third statistics post for Google Summer of Code (GSoC) 2014 we have a list of schools with the highest number of student participants. For the first time in seven years a new school has claimed the top spot: congratulations to the International Institute of Information Technology - Hyderabad.

[Table: schools ranked by number of accepted students, 2014 and 2013]

Student majors in 2014 were predominantly in Computer Science and other technical fields (as expected). But this year we also had students studying Anthropology, Cartography, Evolutionary Biology, Linguistics and even Metallurgy. GSoC certainly attracts a diverse set of students year after year! For more stats from 2014, check out the other posts in this series.

We’d like to thank the schools and professors that help spread the word about GSoC to their students. But don’t forget that students from any university are encouraged to participate! Reviewing statistics each year from GSoC is exciting, but being in “first place” is certainly not the most important part of the program. Our goal since the inception of GSoC is to get students involved in the creation of free and open source software, and to encourage contributions to projects that have the potential to make a difference worldwide — no matter what university the student attends.

By Mary Radomile, Open Source Programs

Google Summer of Code wrap-up: Twitter

Friday, September 19, 2014

Google Summer of Code 2014 has come to a close and news of the great work completed by our 1300+ student participants is starting to pour in. Our first student “wrap-up” post is from Twitter, a three-time Summer of Code participant. We’ll be featuring these stories on Fridays this fall.

For the third time, Twitter had the opportunity to participate in Google Summer of Code (GSoC), and we wanted to share news on the resulting open source activities. Unlike many GSoC participating organizations that focus on a single ecosystem, Twitter has a variety of projects that span multiple programming languages and communities. They include:  

Use zero-copy read path in @ApacheParquet
Sunyu Duan worked with mentors Julien Le Dem (@J_) and Gera Shegalov (@gerashegalov) on improving performance in Parquet by using the new ByteBuffer based APIs in Hadoop. As a result of the work over the summer, performance has improved up to 40% based on initial testing and the work will make its way into the next Parquet release.

A pluggable algorithm to choose next EventLoop in Netty
Jakob Buchgraber worked with mentor Norman Maurer (@normanmaurer) to add pluggable algorithm support to Netty’s event loop (see pull request). At the start of the summer, when a new EventLoop was needed to register a Channel, EventLoopGroup implementations used a round-robin-like algorithm to choose the next EventLoop. This was a problem because some event loops may become busier than others over time, and round-robin selection ignores that load. Jakob’s project adds support for pluggable selection algorithms to increase performance.
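To make the idea concrete, here is a minimal sketch of what a pluggable selection strategy could look like. It is not Netty's actual API: the LoopChooser interface, both implementations, and the use of a pending-task count as the load measure are all hypothetical names and assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical strategy interface for illustration only; this is not Netty's API.
interface LoopChooser {
    /** Returns the index of the loop that should take the next channel. */
    int next(int[] pendingTasksPerLoop);
}

/** Cycles through the loops in order, ignoring how busy each one is. */
final class RoundRobinChooser implements LoopChooser {
    private final AtomicInteger counter = new AtomicInteger();

    @Override
    public int next(int[] pendingTasksPerLoop) {
        // floorMod keeps the index non-negative even after the counter overflows.
        return Math.floorMod(counter.getAndIncrement(), pendingTasksPerLoop.length);
    }
}

/** Picks the loop with the fewest pending tasks (ties go to the lowest index). */
final class LeastLoadedChooser implements LoopChooser {
    @Override
    public int next(int[] pendingTasksPerLoop) {
        int best = 0;
        for (int i = 1; i < pendingTasksPerLoop.length; i++) {
            if (pendingTasksPerLoop[i] < pendingTasksPerLoop[best]) {
                best = i;
            }
        }
        return best;
    }
}

final class ChooserDemo {
    public static void main(String[] args) {
        int[] load = {5, 0, 9}; // made-up pending-task counts for three loops
        LoopChooser roundRobin = new RoundRobinChooser();
        LoopChooser leastLoaded = new LeastLoadedChooser();
        System.out.println(roundRobin.next(load));  // prints 0
        System.out.println(leastLoaded.next(load)); // prints 1
    }
}
```

Because both strategies sit behind the same interface, the caller can swap one for the other without changing how channels are registered, which is the essence of making the algorithm pluggable.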

Various compression codecs for Netty
Idel Pivnitskiy (@pivnitskiy) worked with mentor Trustin Lee (@trustin) to add multiple compression codecs (LZ4, FastLZ and BZip2) to the Netty project. Compression codecs cut network traffic and make it possible to build applications that transfer large amounts of data more quickly and efficiently.

Android Support For Pants
Mateo Rodriguez (@mateornaut) added Android support to the Pants build system (see commits) so Pants can build Android applications (APKs) on top of the many other languages and tools it supports.

A pure ZooKeeper client for Finagle 
Pierre-Antoine Ganaye (@pa-ganaye) was mentored by Evan Meagher (@evanm) to add a pure Apache ZooKeeper client to Finagle to improve performance (see project).

An SMTP client for Finagle
Lera Dymbitska (@suncelesta) worked with mentors Selvin George (@selvin) and Travis Brown (@travisbrown) to add SMTP protocol support to Finagle to improve performance (see pull request). Finagle strives to provide fully asynchronous protocol support, so SMTP support had to be built in rather than delegated to third-party libraries such as javamail and commons-email, which are synchronous by design.

Analyze Wikipedia using Cassovary
Szymon Matejczyk (@szymonmatejczyk) worked with mentors Pankaj Gupta (@pankaj) and Ajeet Grewal (@ajeet) to enable Cassovary to analyze Wikipedia data. This work also improved the performance of Cassovary when dealing with large graphs. See the commits associated with the project for details on how it was done.

We really enjoyed the opportunity to take part in GSoC again this year. Thanks again to our seven students, mentors and Google for the program. We hope to participate again next summer.

Chris Aniszczyk, Organization Administrator, Twitter

Security for the people

Thursday, September 18, 2014

A recent Pew study found that 86% of people surveyed had taken steps to protect their security online. This is great—more security is always good. However, if people are indeed working to protect themselves, why are we still seeing incidents, breaches, and confusion? In many cases these problems recur because the technology that allows people to secure their communications, content and online activity is too hard to use.

In other words, the tools for the job exist. But while many of these tools work technically, they don’t always work in ways that users expect. They introduce extra steps or are simply confusing and cumbersome. (“Is this a software bug, or am I doing something wrong?”) However elegant and intelligent the underlying technology (and much of it is truly miraculous), the results are in: if people can’t use it easily, many of them won’t.

We believe that people shouldn’t have to make a trade-off between security and ease of use. This is why we’re happy to support Simply Secure, a new organization dedicated to improving the usability and safety of open-source tools that help people secure their online lives.

Over the coming months, Simply Secure will be collaborating with open-source developers, designers, researchers, and others to take what’s there (groundbreaking work from efforts like Open Whisper Systems, The Guardian Project, Off-the-Record Messaging, and more) and work to make it easier to understand and use.

We’re excited for a future where people won’t have to choose between ease and security, and where tools that allow people to secure their communications, content, and online activity are as easy as choosing to use them.

By Meredith Whittaker, Open Source Research Lead and Ben Laurie, Senior Staff Security Engineer

gcloud-node - a Google Cloud Platform Client Library for Node.js

Tuesday, September 16, 2014

Today we are announcing a new category of client libraries that has been built specifically for Google Cloud Platform. The very first library, gcloud-node, is idiomatic and intuitive for Node.js developers. With today’s release, you can begin integrating Cloud Datastore and Cloud Storage into your Node.js applications, with more Cloud Platform APIs and programming languages planned.

 The easiest way to get started is by installing the gcloud package using npm:
$ npm install gcloud
With gcloud installed, your Node.js code is simpler to write, easier to read, and cleaner to integrate with your existing Node.js codebase. Take a look at the code required to retrieve entities from Datastore:
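var gcloud = require('gcloud');

// Create a Datastore dataset handle for the project.
var dataset = new gcloud.datastore.Dataset({
  projectId: 'my-project',
  keyFilename: '/path/to/keyfile.json' // Details at 
});

// Fetch the Product entity with ID 123 and log the result.
dataset.get(dataset.key('Product', 123), function(err, entity) {
  console.log(err, entity);
});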
gcloud is open-sourced on GitHub; check out the code, file issues, and contribute a PR. Contributors are welcome. Got questions? Post them on Stack Overflow with the [gcloud-node] tag. Learn more about the Client Library for Node.js at http://googlecloudplatform.github.io/gcloud-node/ and try gcloud-node today.

 -Posted by JJ Geewax, Software Engineer

Node.js is a trademark of Joyent, Inc. and npm is a trademark of npm, Inc.

CausalImpact: A new open-source package for estimating causal effects in time series

Wednesday, September 10, 2014

How can we measure the number of additional clicks or sales that an AdWords campaign generated? How can we estimate the impact of a new feature on app downloads? How do we compare the effectiveness of publicity across countries?

In principle, all of these questions can be answered through causal inference.

In practice, estimating a causal effect accurately is hard, especially when a randomised experiment is not available. One approach we've been developing at Google is based on Bayesian structural time-series models. We use these models to construct a synthetic control — what would have happened to our outcome metric in the absence of the intervention. This approach makes it possible to estimate the causal effect that can be attributed to the intervention, as well as its evolution over time.

We've been testing and applying structural time-series models for some time at Google. For example, we've used them to better understand the effectiveness of advertising campaigns and work out their return on investment. We've also applied the models to settings where a randomised experiment was available, to check how similar our effect estimates would have been without an experimental control.

Today, we're excited to announce the release of CausalImpact, an open-source R package that makes causal analyses simple and fast. With its release, all of our advertisers and users will be able to use the same powerful methods for estimating causal effects that we've been using ourselves.

Our main motivation behind creating the package has been to find a better way of measuring the impact of ad campaigns on outcomes. However, the CausalImpact package could be used for many other applications involving causal inference. Examples include problems found in economics, epidemiology, or the political and social sciences.

How the package works
The CausalImpact R package implements a Bayesian approach to estimating the causal effect of a designed intervention on a time series. Given a response time series (e.g., clicks) and a set of control time series (e.g., clicks in non-affected markets, clicks on other sites, or Google Trends data), the package constructs a Bayesian structural time-series model with a built-in spike-and-slab prior for automatic variable selection. This model is then used to predict the counterfactual, i.e., how the response metric would have evolved after the intervention if the intervention had not occurred.

As with all methods in causal inference, valid conclusions require us to check for any given situation whether key model assumptions are fulfilled. In the case of CausalImpact, we are looking for a set of control time series which are predictive of the outcome time series in the pre-intervention period. In addition, the control time series must not themselves have been affected by the intervention. For details, see Brodersen et al. (2014).

A simple example
The figure below shows an application of the R package. Based on the observed data before the intervention (black) and a control time series (not shown), the model has computed what would have happened after the intervention at time point 70 in the absence of the intervention (blue).

The difference between the actual observed data and the prediction during the post-intervention period is an estimate of the causal effect of the intervention. The first panel shows the observed and predicted response on the original scale. The second panel shows the difference between the two, i.e., the causal effect for each point in time. The third panel shows the individual causal effects added up in time.
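The arithmetic behind the second and third panels can be sketched in a few lines. This is only an illustration of the subtraction and running sum described above, not the package's implementation: the class and method names are hypothetical, the numbers are made up, and the Bayesian model that produces the counterfactual prediction (along with its uncertainty intervals) is not reproduced here.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the CausalImpact package.
final class EffectSketch {
    /** Pointwise effect: observed value minus counterfactual prediction at each time point. */
    static double[] pointwise(double[] observed, double[] predicted) {
        if (observed.length != predicted.length) {
            throw new IllegalArgumentException("series must have the same length");
        }
        double[] effect = new double[observed.length];
        for (int t = 0; t < observed.length; t++) {
            effect[t] = observed[t] - predicted[t];
        }
        return effect;
    }

    /** Cumulative effect: running sum of the pointwise effects. */
    static double[] cumulative(double[] pointwise) {
        double[] total = new double[pointwise.length];
        double sum = 0.0;
        for (int t = 0; t < pointwise.length; t++) {
            sum += pointwise[t];
            total[t] = sum;
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up post-intervention values, chosen only to keep the arithmetic easy to follow.
        double[] observed = {12.0, 15.0, 14.0};
        double[] predicted = {10.0, 11.0, 13.0};
        double[] effect = pointwise(observed, predicted);
        System.out.println(Arrays.toString(effect));             // prints [2.0, 4.0, 1.0]
        System.out.println(Arrays.toString(cumulative(effect))); // prints [2.0, 6.0, 7.0]
    }
}
```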
The script used to create the above figure is shown in the left part of the window below. Using package defaults means our analysis boils down to just a single line of code: a call to the function CausalImpact() in line 10. The right-hand side of the window shows the resulting numeric output. For details on how to customize the model, see the documentation.
How to get started
The best place to start is the package documentation. The package is hosted on GitHub and can be installed using:


By Kay H. Brodersen, Google

Software Freedom Conservancy and Google Summer of Code

Thursday, September 4, 2014

Today’s guest post comes from Bradley Kuhn, President of Software Freedom Conservancy. Conservancy and Google Summer of Code have had a long history as partners and in many cases Conservancy has made it possible for organizations to participate in our program. Read on for more details on how becoming a member of Conservancy can benefit your project in Google Summer of Code and beyond.

Software Freedom Conservancy, Inc. is a 501(c)(3) non-profit charity that serves as a home to Open Source and Free Software projects. In this post I'd like to discuss what that means and why such projects need a non-profit home. In short, Conservancy seeks to make the lives of Free Software developers easier, and it gives contributors much less administrative work to do outside of their area of focus (i.e., software development and documentation).

Google Summer of Code (GSoC) is a great example to show the value a non-profit home brings to Free Software projects. GSoC is likely the largest philanthropic program in the Open Source and Free Software community today. However, one of the most difficult things for organizations who seek to take advantage of such programs is the necessary administrative overhead. Google invests heavily in making it easy for organizations to participate in the program (for instance, by handling the details of stipend payments to students directly). However, to take full advantage of any philanthropic program, the benefiting organization has some work to do. For its member projects, Conservancy is the organization that gets that logistical work done.

Google donates $500 to each mentoring organization for every student that organization mentors. However, these funds need to go somewhere. If the funds go to an individual, there are two inherent problems. First, that individual is responsible for taxes on that income. Second, funds belonging to an organization as a whole are now in the bank account of a single project leader. Conservancy solves both of those problems. Because Conservancy is a tax-exempt charity, the mentor payments are available for organizational use under its tax exemption. Furthermore, Conservancy maintains earmarked funds for each of its projects. Conservancy keeps the mentor funds for the Free Software project, and the project leaders can later vote to make use of the funds in a manner that helps the project and Conservancy's charitable mission. Often, projects in Conservancy use these funds to send developers to important conferences to speak about the project and recruit new developers and users.

Google also offers to pay travel expenses for two mentors from each mentoring organization to attend the annual GSoC Mentor Summit. Conservancy handles this work on behalf of its member projects in two directions. First, for developers who don't have a credit card or otherwise are unable to pay for their own flight and receive reimbursement later, our staff book the flights on Conservancy's credit card. For the other travelers, Conservancy handles the reimbursement details. And on the back end, we handle all the overhead issues in requesting the POs from Google, invoicing for the funds, and tracking to ensure payment is made.

GSoC coordination is just one of the many things that Conservancy does every day for its member projects. If there's anything other than software development and documentation that you can imagine a project needs, Conservancy does that job for its member projects. This includes not only mundane items such as travel coordination, but also issues as complex as trademark filings and defense, copyright licensing advice and enforcement, governance coordination and mentoring, and fundraising for the projects. Some of Conservancy's member projects have been so successful that they've been able to fund developers’ salaries — often part-time but occasionally full-time — for many years to allow them to focus on improving the project's software for the public benefit.

If your project seeks help with regard to handling its GSoC funds and travel, or anything else mentioned on Conservancy's list of services to member projects, Conservancy is welcoming new applications for membership. Your project could join Conservancy's more than thirty other member projects and receive these important services to help your community grow and focus on its core mission of building software for the public good.

By Bradley M. Kuhn, President, Software Freedom Conservancy

Glide 3.0: a media management library for Android

Tuesday, September 2, 2014

Today we are happy to announce the first stable release of Glide 3.0. Glide is an open source media management framework for Android that wraps media decoding, memory and disk caching, and resource pooling into a simple and easy to use interface.

Glide is specifically designed not only to be easy to use, but also to make scrolling lists of images as smooth and pleasant to use as possible. To reduce stuttery scrolling in lists caused by garbage collections due to Bitmap allocations, Glide uses reference counting behind the scenes to track and reuse Bitmap objects. To maximize the number of Bitmaps that are re-used, Glide includes a Bitmap pool capable of pooling arbitrary sizes of Bitmaps. 
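The combination of reference counting and pooling can be illustrated with a small sketch. This is not Glide's internal code: BufferPool and CountedBuffer are hypothetical names, plain byte arrays stand in for Bitmaps so the example runs without Android, buffers are pooled by exact size only, and the sketch is not thread-safe.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical pool of byte buffers keyed by size; stands in for a Bitmap pool. */
final class BufferPool {
    private final Map<Integer, ArrayDeque<byte[]>> free = new HashMap<>();

    /** Reuses a pooled buffer of the requested size if one exists, otherwise allocates. */
    byte[] obtain(int size) {
        ArrayDeque<byte[]> bucket = free.get(size);
        byte[] pooled = (bucket == null) ? null : bucket.poll();
        return (pooled != null) ? pooled : new byte[size];
    }

    /** Makes a buffer available for reuse by later obtain() calls. */
    void recycle(byte[] buffer) {
        free.computeIfAbsent(buffer.length, k -> new ArrayDeque<>()).push(buffer);
    }
}

/** Wraps a buffer with a reference count and recycles it when the count reaches zero. */
final class CountedBuffer {
    private final byte[] data;
    private final BufferPool pool;
    private int refs = 0;

    CountedBuffer(byte[] data, BufferPool pool) {
        this.data = data;
        this.pool = pool;
    }

    /** Registers one more holder and hands out the buffer. */
    byte[] acquire() {
        refs++;
        return data;
    }

    /** Drops one holder; the last release returns the buffer to the pool. */
    void release() {
        if (refs <= 0) {
            throw new IllegalStateException("release without matching acquire");
        }
        if (--refs == 0) {
            pool.recycle(data);
        }
    }
}

final class PoolDemo {
    public static void main(String[] args) {
        BufferPool pool = new BufferPool();
        byte[] first = pool.obtain(1024);
        CountedBuffer counted = new CountedBuffer(first, pool);
        counted.acquire();
        counted.acquire();
        counted.release(); // one holder remains, so the buffer stays out of the pool
        counted.release(); // last holder gone, so the buffer goes back to the pool
        System.out.println(pool.obtain(1024) == first); // prints true: the same array is reused
    }
}
```

Reusing the same allocation instead of creating a new one is what avoids the garbage collection pauses that cause stuttering during scrolling.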

The 3.0 version of Glide includes a number of improvements, including support for animated GIF and video still decoding, improved lifecycle integration to intelligently pause and restart requests, and thumbnailing support. 

Despite all of the new features, Glide’s interface is still simple and easy to use. To display an image, video still, or animated GIF in a view, you still need only one line:


Glide will intelligently determine the type of media you’re trying to load, decode it, and return a drawable object that can display the animation or still image in the view. If you want to load specifically a Bitmap, you can do that too:


You can also do more complex transformations. For example, to upload the bytes of a 250px by 250px profile photo for a user:

.into(new SimpleTarget<byte[]>(250, 250) {
    @Override
    public void onResourceReady(byte[] data, GlideAnimation anim) {
        // Post your bytes to a background thread and upload them here.
    }
});

For a more complete list of changes, documentation, or to report issues, please see the GitHub page. To see Glide being used in a real application, check out the recently released source of the 2014 Google I/O app and their excellent blog post on image loading. Finally, for questions, comments, or suggestions, please join our discussion list.

By Sam Judd, Google Engineering
