
Githubtrends

Decoding the GitHub social coding phenomenon


Problem Overview and solution

Problem Statement

Find the relationship between a repo's watchers/committers and its popularity. How does a push/watch event by a highly popular user affect the growth curve of a GitHub repository?

Dataset

The dataset consists of JSON dumps of GitHub activity for various repositories and users. Please find the sample files here.
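To give a feel for working with such dumps, the sketch below tallies event types. It assumes one JSON object per line with a `type` field (for example `"WatchEvent"`); the actual sample files may be laid out differently, and `count_event_types` is a hypothetical helper name, not part of the project's code.

```python
import json
from collections import Counter


def count_event_types(lines):
    """Tally the "type" field across an iterable of JSON-encoded event lines.

    Assumption: each non-empty line holds one JSON object with a "type" key.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        event = json.loads(line)
        counts[event.get("type", "Unknown")] += 1
    return counts


# Tiny in-memory sample standing in for a real dump file (made-up records).
sample = [
    '{"type": "PushEvent", "actor": "alice"}',
    '{"type": "WatchEvent", "actor": "bob"}',
    '{"type": "PushEvent", "actor": "carol"}',
]
event_counts = count_event_types(sample)
print(event_counts.most_common())
```

Passing an open file object instead of the list works the same way, since the function only iterates over lines.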

Solution

The solution attempts to find correlations between user events on a repository and their effect on its growth curve. The approach is fairly straightforward.
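One simple way to quantify such an effect is to compare how fast a repository gains watchers just before and just after a given user's event. The sketch below is only an illustration of that idea, not the project's algorithm; the function name `growth_rates` and the one-day window are assumptions.

```python
from datetime import datetime, timedelta


def growth_rates(watch_times, event_time, window_days=1):
    """Return (before, after) rates in watches per day around event_time.

    watch_times: datetimes of WatchEvents on one repository.
    event_time:  datetime of the influential user's event.
    window_days: size of the comparison window on each side (an assumption).
    """
    window = timedelta(days=window_days)
    before = sum(1 for t in watch_times if event_time - window <= t < event_time)
    after = sum(1 for t in watch_times if event_time < t <= event_time + window)
    return before / window_days, after / window_days


# Made-up timestamps: 2 watches in the day before the event, 5 in the day after.
event = datetime(2012, 6, 2, 12, 0)
watches = [event - timedelta(hours=h) for h in (20, 5)]
watches += [event + timedelta(hours=h) for h in (1, 2, 3, 10, 23)]
before_rate, after_rate = growth_rates(watches, event)
print(before_rate, after_rate)
```

A noticeably higher "after" rate is only suggestive: other causes can produce the same jump, so it does not by itself show that the user's event was responsible.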

Algorithm

Implementation

The Python implementation of the above algorithm can be found here.

Genuineness of a user’s impact

Since a change in the growth curve of a repository need not be caused by the WatchEvent of a particular user, we try to estimate how genuine that user's WatchEvent is. The algorithm used for this is as follows.

User Dynamics

This section explains the methodology used to observe user behavior in the social coding context. Do users operate as a flock when one of them starts following a repo? Which users are the most “connected”, i.e. the users most likely to start watching a repo when one of them does?

Algorithm

Sub-Algorithm

Input: List of chronological user sequences, S
Output: Count map, M, representing the occurrence count of all sub-sequences

    For all sequences Q in S:
        For every possible permutation P of sequence Q:
            Check if P is already present in M:
                If True:
                    M[P]++      // Increase the occurrence count
                Else:
                    M[P] = 1    // First occurrence of P
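The counting step can be sketched in Python as below. This is a minimal illustration, not the project's script: the name `count_subsequences` and the `size` parameter are hypothetical. It also assumes that a "sub-sequence" keeps the chronological order of Q; replacing `combinations` with `itertools.permutations` would count every ordering instead.

```python
from collections import Counter
from itertools import combinations


def count_subsequences(sequences, size):
    """Count how often each ordered group of `size` users appears.

    sequences: list of user lists, each in chronological watch order.
    """
    counts = Counter()
    for seq in sequences:
        # set() counts a group once per sequence even if a user repeats in it.
        for group in set(combinations(seq, size)):
            counts[group] += 1  # Counter starts missing keys at 0
    return counts


# Placeholder user names, three made-up watch sequences.
watch_sequences = [
    ["u1", "u2", "u3"],
    ["u1", "u3"],
    ["u2", "u1", "u3"],
]
pair_counts = count_subsequences(watch_sequences, 2)
print(pair_counts[("u1", "u3")])
```

Using a `Counter` removes the explicit "already present" check, because a missing key is treated as a count of zero.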

Implementation

The Python implementation of the above algorithm can be found here. The script takes two command-line arguments: the first is the size of each user set and the second is the number of top sets to report. For example, to get the top 5 user sets of size 3 that show high connectivity, run:

    python getFlockUsers.py 3 5
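As an illustration of that interface, the sketch below parses the two arguments and picks the most frequent sets of the requested size. It is not the code of getFlockUsers.py; `parse_args`, `top_sets`, and the argument names are assumptions, and the sample counts are a few values copied from the Observations section.

```python
import argparse
from collections import Counter


def parse_args(argv):
    """Parse the two positional arguments: set size and number of sets."""
    parser = argparse.ArgumentParser(description="Report the most connected user sets.")
    parser.add_argument("set_size", type=int, help="number of users in each set")
    parser.add_argument("how_many", type=int, help="number of top sets to report")
    return parser.parse_args(argv)


def top_sets(counts, set_size, how_many):
    """Return the `how_many` most frequent user sets with `set_size` members."""
    sized = Counter({users: n for users, n in counts.items() if len(users) == set_size})
    return sized.most_common(how_many)


# Stand-in for the counts the real script derives from the dataset.
sample_counts = Counter({
    ("torifat", "mkol5222"): 642,
    ("fnu", "torifat"): 452,
    ("jasolko", "fnu"): 342,
    ("anggriawan", "rgigger", "jasolko"): 314,
    ("fnu", "jasolko", "torifat"): 134,
})

args = parse_args(["3", "5"])  # same arguments as: python getFlockUsers.py 3 5
result = top_sets(sample_counts, args.set_size, args.how_many)
print(result)
```

With only two sets of size 3 in the sample, asking for the top 5 simply returns both.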

Code setup

Here is a sample of generated plot images of the repo.

Plot properties

Observations

• Most of the plots show an increased growth rate after a high-profile user starts watching. The rate slowly saturates with time.
• If a user has a particularly high number of followers, the growth rate increases substantially.
• If a user has a lower-than-average number of followers (average calculated from the data), the growth rate is less likely to keep increasing, showing that the growth rate is mostly independent of that user's impact.
• The growth rates of very popular repos do not seem to differ much, even when a high-profile user starts watching.
• Most of the "social" effect is seen within 1 day of the event; similar behavior is also observed when news spreads on social networking sites like Facebook. The workflow is usually: a user watches a repo -> their followers get notified and follow the repo -> their followers do the same, and so on.
• Users do seem to behave in groups: a distinct set of users shows high co-incidence, i.e. if one user starts watching a repository, the other users in the set are most likely to follow it. Below are sets of two users, with their incidence counts:
    o (('torifat', 'mkol5222'), 642)
    o (('fnu', 'torifat'), 452)
    o (('jasolko', 'fnu'), 342)
    o (('hansstimer', 'payco'), 152)
    o (('rgigger', 'roundhead'), 32)

Below are sets of three users, with their incidence counts:

    o (('anggriawan', 'rgigger', 'jasolko'), 314)
    o (('fnu', 'jasolko', 'torifat'), 134)
    o (('payco', 'roundhead', 'rgigger'), 78)
    o (('fnu', 'hansstimer', 'torifat'), 34)
    o (('rgigger', 'anggriawan', 'payco'), 29)

Dataset statistics

Total Repos: 296456

Top 10 popular repos

    Repo_url                                    watchers   forks   stargazers

1   https://github.com/twitter/bootstrap        28810      0       5623
2   https://github.com/jquery/jquery            16167      16167   2025
3   https://github.com/joyent/node              14964      0       1794
4   https://github.com/h5bp/html5-boilerplate   14263      14263   2259
5   https://github.com/rails/rails              14158      0       3164
6   https://github.com/octocat/Spoon-Knife      10590      10590   8838
7   https://github.com/mxcl/homebrew            9602       9602    4358
8   https://github.com/bartaz/impress.js        8594       0       1277
9   https://github.com/documentcloud/backbone   8272       0       1099
10  https://github.com/mrdoob/three.js          7607       7607    972

Top 10 followed Users

1   defunkt
2   mojombo
3   torvalds
4   jeresig
5   schacon
6   paulirish
7   ryanb
8   pjhyett
9   visionmedia
10  dhh

Top 10 events by count

    eventType           count(eventType)

1   PushEvent           140380
2   CreateEvent         42900
3   WatchEvent          29360
4   IssueCommentEvent   20887
5   IssuesEvent         13682
6   ForkEvent           9967
7   GistEvent           9082
8   PullRequestEvent    8419
9   FollowEvent         7592
10  GollumEvent         4999

10 fastest growing repositories by number of watchers

    repo_url                                       FROM    TO

1   https://github.com/mbostock/d3                 5346    14027
2   https://github.com/twitter/bootstrap           27717   36352
3   https://github.com/textmate/textmate           1       6532
4   https://github.com/adobe/brackets              29      5935
5   https://github.com/rails/rails                 13986   18009
6   https://github.com/AFNetworking/AFNetworking   2557    6406
7   https://github.com/FortAwesome/Font-Awesome    3333    6442
8   https://github.com/xing/wysihtml5              603     3652
9   https://github.com/HPNeo/gmaps                 19      2943
10  https://github.com/ivaynberg/select2           49      2665
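The order of the rows above is consistent with ranking by absolute gain in watchers (TO minus FROM), although the page does not state the ranking rule. The sketch below shows that ranking on four rows taken from the table; `rank_by_gain` is a hypothetical helper name.

```python
def rank_by_gain(snapshots):
    """Sort (repo, first_count, last_count) rows by absolute watcher gain."""
    return sorted(snapshots, key=lambda row: row[2] - row[1], reverse=True)


# Rows copied from the table above, deliberately out of order.
rows = [
    ("rails/rails", 13986, 18009),
    ("textmate/textmate", 1, 6532),
    ("mbostock/d3", 5346, 14027),
    ("twitter/bootstrap", 27717, 36352),
]
ranked = rank_by_gain(rows)
for repo, start, end in ranked:
    print(repo, end - start)
```

Ranking by relative growth (TO divided by FROM) would give a very different order, with repos that started near zero, such as textmate/textmate, at the top.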

Top 10 Most active repos by event counts

    repo_url                                             eventType           count

2   https://github.com/eclipse/eclipse.platform.common   CreateEvent         11903
3   https://github.com/nyarlabo/websites                 PushEvent           4341
4   https://github.com/itroot/reach-github-limit         PushEvent           854
5   https://github.com/haskell/cabal                     IssueCommentEvent   518
6   https://github.com/rails/rails                       PullRequestEvent    515
7   https://github.com/entoo/portage                     PushEvent           442
8   https://github.com/pulWifi/pulWifi                   IssuesEvent         424
9   https://github.com/mxcl/homebrew                     IssueCommentEvent   406
10  https://github.com/twitter/bootstrap                 WatchEvent          373
11  https://github.com/KernCZ/tomato-firmware            CreateEvent         334