Open Source Analysis

Team Members

Description

The popularity of programming languages has shifted dramatically over time, with individual languages rising and falling, and these fluctuations are often difficult to predict. For this project, we dive into a large dataset of contributions to open-source software and try to map what we hope will be an accurate representation of these trends. Acquired from libraries.io, this dataset of more than 397 million rows provides repository names, timestamps, programming languages, and many other attributes of open source projects on the internet.

Our primary goal is to gain insights about open source software by mining a dataset about open source projects. We focus on four main areas: tracking trends in programming languages, analyzing how popular repositories have changed over time, analyzing how contributions to those repositories have changed over time, and monitoring repository life cycles. We use various data mining techniques and tools to accomplish this goal. More information about this process can be found in our final paper.
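As a rough illustration of the first focus area, tracking trends in programming languages, the sketch below counts how many repositories were created per language in each year. It is not the code used in the project: the sample rows, the `(name, language, created)` layout, and the timestamp format are all assumptions made for the example, and the real libraries.io columns may differ.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical sample rows: (repository name, language, creation timestamp).
# The column layout and timestamp format are assumptions for illustration.
SAMPLE_ROWS = [
    ("repo-a", "Python", "2015-03-01 12:00:00 UTC"),
    ("repo-b", "Java", "2015-07-19 08:30:00 UTC"),
    ("repo-c", "Python", "2016-01-05 17:45:00 UTC"),
    ("repo-d", "Python", "2016-11-23 09:10:00 UTC"),
    ("repo-e", "Perl", "2016-02-14 22:00:00 UTC"),
]


def yearly_language_counts(rows):
    """Return {year: {language: number of repositories created that year}}."""
    counts = defaultdict(Counter)
    for _name, language, created in rows:
        # Only the leading YYYY-MM-DD part of the timestamp is needed.
        year = datetime.strptime(created[:10], "%Y-%m-%d").year
        counts[year][language] += 1
    return {year: dict(by_lang) for year, by_lang in sorted(counts.items())}


if __name__ == "__main__":
    for year, by_lang in yearly_language_counts(SAMPLE_ROWS).items():
        print(year, by_lang)
```

Comparing the per-year counts for one language against another gives a simple view of which languages are gaining or losing new repositories. For a dataset of this size, the same grouping would more likely be done in a database or a distributed framework than in plain Python.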

Questions and Answers

  1. What insights can we gain to improve the open source community further?
  2. How can we identify areas in the open source community that need improvement?
  3. Can we predict upcoming popular repositories?

Using various data mining tools and techniques, we were able to answer the questions defined above.

More detail about the process we used to reach our conclusions is described in our final paper.

Application

In our application, we focus on how our conclusions can help us make better, more useful decisions regarding open source projects. Open source projects play a significant role in software development: tens of thousands of them run worldwide, with millions of users relying on the software. Drawing applications from our data analysis helps us better understand the open source software that influences millions across the world.

Through our cluster analysis, we identified “Stars/Contributors,” “Stars/Forks,” and “Forks/Issues” as the best indicators of a repository's future growth and popularity. These relationships give developers working in open source software a general direction to work towards when creating new software. In particular, the “Forks/Issues” ratio shows tremendous potential for signaling future heavy traffic and contributions. By analyzing time series of different programming languages, we can also display numerous attributes that apply not only to open source software developers but to coders in general. For example, we identified that repositories using Python or Java are currently experiencing very high contributions, and the contrasting trends between dying and popular programming languages can help indicate whether a language is likely to die out soon or continue to prosper.
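To make the three ratios concrete, here is a minimal sketch of how they could be computed for a single repository. It only derives the ratio features and does not reproduce our cluster analysis. The dictionary keys (`stars`, `contributors`, `forks`, `issues`) and the choice to return `None` for a zero denominator are assumptions made for this example, not the dataset's actual field names.

```python
def safe_ratio(numerator, denominator):
    """Divide two counts, returning None when the denominator is zero."""
    return numerator / denominator if denominator else None


def popularity_ratios(repo):
    """Compute the three ratio features for one repository.

    `repo` is a dict with hypothetical keys 'stars', 'contributors',
    'forks', and 'issues'; real column names may differ.
    """
    return {
        "stars_per_contributor": safe_ratio(repo["stars"], repo["contributors"]),
        "stars_per_fork": safe_ratio(repo["stars"], repo["forks"]),
        "forks_per_issue": safe_ratio(repo["forks"], repo["issues"]),
    }


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    example = {"stars": 1200, "contributors": 40, "forks": 300, "issues": 60}
    print(popularity_ratios(example))
```

Returning `None` rather than raising an error keeps repositories with no issues or no forks in the feature table, so they can be filtered or imputed before any clustering step.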

Video Demonstration

In our final report, we discuss the technical aspects of data mining and what our team accomplished through the extraction and analysis of our dataset. We have also created a short video demonstrating our results and conclusions. The video can be found here.

Final Paper

The final paper can be found here.

Website

We have created a website that lets others learn more about our project and interact with a force graph to better understand the relationships behind the conclusions we've made here.
