

Decoding the GitHub social coding phenomena


Problem Overview and solution

Problem Statement

Find the relationship between a repo's watchers/committers and its popularity. How does a push/watch event by a highly popular user affect the growth curve of a GitHub repository?


The dataset is in the form of JSON dumps of GitHub activity of various repositories and users. Please find the sample files here


The solution attempts to find correlations between user events on a repository and their effects on its growth curve. The approach is fairly straightforward.
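One way to picture the idea is to compare how quickly a repository gains watchers just before and just after a given WatchEvent. The following is only a minimal sketch under assumptions, not the project's implementation: the function names `watch_rate` and `rate_change` are hypothetical, the input is assumed to be a list of `datetime` timestamps of WatchEvents on one repo, and the one-day window is an illustrative default.

```python
from datetime import datetime, timedelta


def watch_rate(watch_times, start, end):
    """Watch events per day in the half-open interval [start, end)."""
    days = (end - start).total_seconds() / 86400.0
    if days <= 0:
        raise ValueError("end must be after start")
    count = sum(1 for t in watch_times if start <= t < end)
    return count / days


def rate_change(watch_times, event_time, window=timedelta(days=1)):
    """Return (rate_before, rate_after) around event_time.

    Hypothetical helper: the event at event_time itself is excluded so
    that it does not inflate the 'after' rate.
    """
    others = [t for t in watch_times if t != event_time]
    before = watch_rate(others, event_time - window, event_time)
    after = watch_rate(others, event_time, event_time + window)
    return before, after
```

A noticeably higher `rate_after` than `rate_before` would be consistent with the event having influenced the growth curve, although on its own it does not show causation.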



The Python implementation for the above algorithm can be found here.

Genuineness of a user’s impact

Since a change in the growth curve of a repository need not be caused by the WatchEvent of a particular user, we try to estimate how genuine that user's WatchEvent is. The algorithm used for this is as follows

User Dynamics

This section explains the methodology used to observe user behavior in the social coding context. Do users operate in flocks when one of them starts following a repo? Which users are most “connected”, i.e. which users are most likely to start watching a repo when one of them does?



Input : List of chronological user sequences, S
Output : Count map, M, representing the occurrence count of all sub-sequences

    For all sequences Q in S:
        For every possible permutation P of sequence Q:
            Check if P already present in M:
                If True:
                    M[P]++   //Increase the occurrence count
                Else:
                    M[P] = 1 //First occurrence of P


The Python implementation for the above algorithm can be found here. The script takes two command-line arguments: the first is the size of the user sets and the second is how many top sets to report. For example, if we need the top 5 user sets, each of size 3, that show high connectivity, issue the command python 3 5
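The counting step can be sketched in a few lines of Python. This is an illustrative sketch rather than the linked script: the names `count_user_sets` and `top_user_sets` are hypothetical, and it uses `itertools.combinations` to enumerate order-preserving sub-sequences of a fixed size, which is an assumption about how the user sets are formed.

```python
from collections import Counter
from itertools import combinations


def count_user_sets(sequences, set_size):
    """Count how often each ordered user set of length set_size occurs.

    sequences: iterable of chronological user lists, one per repo.
    """
    counts = Counter()
    for seq in sequences:
        # Drop repeated users while keeping chronological order, so a
        # user who appears twice in one sequence is counted once.
        users = list(dict.fromkeys(seq))
        for combo in combinations(users, set_size):
            counts[combo] += 1
    return counts


def top_user_sets(sequences, set_size, top_n):
    """Return the top_n most frequent user sets with their counts."""
    return count_user_sets(sequences, set_size).most_common(top_n)
```

Here `set_size` and `top_n` play the role of the two command-line arguments described above, so `top_user_sets(sequences, 3, 5)` would return the five most frequent sets of three users. Because `Counter` creates missing keys on first increment, no separate membership check is needed.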

Code setup

Here is a sample of generated plot images of the repo.

Plot properties


• Most of the plots show an increased growth rate after a high-profile user starts watching. The rate slowly saturates with time.
• If a user has a particularly high number of followers, the growth rate increases substantially.
• If a user has a lower than average number of followers (average calculated from the data), the growth rate is less likely to keep increasing, showing that it is mostly independent of that user's impact.
• The growth rates of very popular repos do not seem to differ much, even when a high-profile user starts watching.
• Most of the "social" effect is seen within 1 day of the event; similar behavior is also observed in news proliferation on social networking sites like Facebook. Here the workflow is usually: user watches a repo -> their followers get notified and follow the repo -> their followers, and so on.
• Users do seem to behave in groups: a distinct set of users shows high co-incidence, i.e. if one user starts watching a repo, the other users in the set are most likely to follow that repository. Below are sets of two users, with their incidence counts:
    o (('torifat', 'mkol5222'), 642)
    o (('fnu', 'torifat'), 452)
    o (('jasolko', 'fnu'), 342)
    o (('hansstimer', 'payco'), 152)
    o (('rgigger', 'roundhead'), 32)

Below are sets of three users, with their incidence counts:

    o (('anggriawan', 'rgigger', 'jasolko'), 314)
    o (('fnu', 'jasolko', 'torifat'), 134)
    o (('payco', 'roundhead', 'rgigger'), 78)
    o (('fnu', 'hansstimer', 'torifat'), 34)
    o (('rgigger', 'anggriawan', 'payco'), 29)

Dataset statistics

Total Repos: 296456

Top 10 popular repos

       Rank   Repo_url   watchers   forks   stargazers

       1                 28810      0       5623
       2                 16167      16167   2025
       3                 14964      0       1794
       4                 14263      14263   2259
       5                 14158      0       3164
       6                 10590      10590   8838
       7                 9602       9602    4358
       8                 8594       0       1277
       9                 8272       0       1099
       10                7607       7607    972

Top 10 followed Users

1 defunkt
2 mojombo
3 torvalds
4 jeresig
5 schacon
6 paulirish
7 ryanb
8 pjhyett
9 visionmedia
10 dhh

Top 10 events by count

      Rank   eventType           count(eventType)

      1      PushEvent           140380
      2      CreateEvent         42900
      3      WatchEvent          29360
      4      IssueCommentEvent   20887
      5      IssuesEvent         13682
      6      ForkEvent           9967
      7      GistEvent           9082
      8      PullRequestEvent    8419
      9      FollowEvent         7592
      10     GollumEvent         4999
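A tally like the one above can be produced directly from the JSON dumps. The sketch below is illustrative only: `count_event_types` is a hypothetical name, and it assumes the dump holds one JSON object per line, each with a top-level `type` field naming the event. The actual layout of the sample files may differ.

```python
import json
from collections import Counter


def count_event_types(lines):
    """Tally events by type from an iterable of JSON-encoded lines.

    Assumes one JSON object per line with a top-level "type" field;
    events lacking that field are counted under "Unknown".
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        event = json.loads(line)
        counts[event.get("type", "Unknown")] += 1
    return counts
```

Passing an open file object as `lines` and calling `.most_common(10)` on the result would give the ten most frequent event types in descending order.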

10 fastest growing repositories by number of watchers

            Rank   repo_url   FROM    TO

            1                 5346    14027
            2                 27717   36352
            3                 1       6532
            4                 29      5935
            5                 13986   18009
            6                 2557    6406
            7                 3333    6442
            8                 603     3652
            9                 19      2943
            10                49      2665

Top 10 Most active repos by event counts

      2      CreateEvent         11903
      3      PushEvent           4341
      4      PushEvent           854
      5      IssueCommentEvent   518
      6      PullRequestEvent    515
      7      PushEvent           442
      8      IssuesEvent         424
      9      IssueCommentEvent   406
      10     WatchEvent          373
      11     CreateEvent         334