The most popular programming languages have shifted dramatically over time, and these rises and falls are often difficult to predict. For this project, we dove into a large dataset of contributions to open-source software and tried to map what we hope will be an accurate representation of these trends. Acquired from libraries.io, this dataset of more than 397 million rows provides repository names, timestamps, programming languages, and many more attributes related to open source projects on the internet.
The primary goal of this project is to gain insights about open source software by mining a dataset of open source projects. We focus on four main areas: tracking trends in programming languages, analyzing how popular repositories have changed over time, analyzing how contributions to those repositories have changed over time, and monitoring repository life cycles. We use various data mining techniques and tools to accomplish this goal. More information about this process can be found in our final paper.
Clustering was essential to answering our core question: identifying areas in the open source community that need improvement. To find these deficiencies, we clustered repositories and then identified the clusters that are “lacking” and could benefit from extra community involvement. We implemented a K-Means clustering algorithm on the dataset. The diagram below is an example of using the center of each K-Means cluster to determine the average number of Open Issues/Contributors. Through an extensive process similar to this, we identified the 17 repositories most in need of community assistance.
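As a rough illustration of the clustering step described above, the sketch below runs a plain K-Means (Lloyd's algorithm) over a handful of made-up repositories and reads the open issues and contributors off each cluster center. The sample data, the `kmeans` helper, the choice of two features, and the reading of “Open Issues/Contributors” as a ratio are all assumptions for illustration; this is not the project's actual implementation or dataset.

```python
import random
from math import dist


def kmeans(points, k, iterations=50, seed=0):
    """Minimal Lloyd's algorithm. Returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct points as starting centers
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center in place
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, labels


# Hypothetical (open_issues, contributors) pairs, not taken from the real dataset.
repos = [(120, 3), (150, 4), (130, 2), (10, 40), (12, 55), (8, 48)]

centers, labels = kmeans(repos, k=2)
for issues, contributors in centers:
    ratio = issues / contributors
    print(f"center: {issues:.1f} open issues, {contributors:.1f} contributors, ratio {ratio:.1f}")
```

In this toy data, the cluster whose center has many open issues and few contributors is the kind of group the text calls “lacking”. On real data the features would normally be scaled first, since raw counts with very different ranges can dominate the distance.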
We used similar techniques to predict which repositories are likely to become popular. We found 348 repositories with the potential to become very large and heavily contributed to in the future. We understand that this may not be the best way to predict trends and future growth, but within the scope of this course, we are proud of the “predictions” we made. The graph below is an example of how contributions in JavaScript have changed over time.
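A time series like the JavaScript one mentioned above could be built by counting contributions per month for a given language. The following is a minimal sketch under assumed inputs: the `events` records, their ISO timestamp format, and the `monthly_counts` helper are hypothetical and do not reflect the actual schema of the libraries.io data.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (timestamp, language) contribution records.
events = [
    ("2016-01-15T10:00:00", "JavaScript"),
    ("2016-01-20T08:30:00", "JavaScript"),
    ("2016-01-22T12:00:00", "Python"),
    ("2016-02-03T09:00:00", "JavaScript"),
    ("2016-03-11T17:45:00", "JavaScript"),
    ("2016-03-12T11:10:00", "JavaScript"),
    ("2016-03-30T23:59:00", "JavaScript"),
]


def monthly_counts(events, language):
    """Count contributions per calendar month ("YYYY-MM") for one language."""
    counts = Counter()
    for timestamp, lang in events:
        if lang == language:
            month = datetime.fromisoformat(timestamp).strftime("%Y-%m")
            counts[month] += 1
    # Sort by month so the result can be plotted directly as a time series.
    return dict(sorted(counts.items()))


for month, count in monthly_counts(events, "JavaScript").items():
    print(month, count)
```

Comparing these monthly counts across languages is one simple way to see which languages are gaining or losing contributions.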
More detail about the process we used to reach our conclusions is described in our final paper.
In applying our results, we focus on how our conclusions can support better and more useful decisions about open source projects. Open source projects play a significant role in software development: tens of thousands of them run worldwide, with millions of users relying on the software. Drawing applications from our data analysis helps us better understand the open source software that influences millions of people across the world.
Through our cluster analysis, we identified “Stars/Contributors,” “Stars/Forks,” and “Forks/Issues” as the best indicators of a repository's future growth and popularity. These relationships give developers working in open source a general direction to work towards when creating new software. In particular, the “Forks/Issues” ratio shows tremendous potential as a signal of future heavy traffic and contributions. In addition, by analyzing time series of different programming languages, we can display numerous attributes that are useful not only to open source developers but to coders in general. For example, we found that repositories using Python or Java are currently receiving very high numbers of contributions, and comparing the trends of declining and popular programming languages can help indicate whether a language is likely to die out soon or continue to prosper.
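The three ratios named above are simple to derive from per-repository counts. The sketch below assumes a repository is represented as a dictionary with `stars`, `forks`, `contributors`, and `open_issues` keys; these field names, the `growth_ratios` helper, and the sample values are hypothetical and chosen only for illustration.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def growth_ratios(repo):
    """Compute the three indicator ratios for one repository record."""
    return {
        "stars_per_contributor": safe_ratio(repo["stars"], repo["contributors"]),
        "stars_per_fork": safe_ratio(repo["stars"], repo["forks"]),
        "forks_per_issue": safe_ratio(repo["forks"], repo["open_issues"]),
    }


# Hypothetical repository record with made-up counts.
example_repo = {"stars": 1200, "forks": 300, "contributors": 40, "open_issues": 60}
print(growth_ratios(example_repo))
```

Returning `None` for a zero denominator keeps repositories with no forks or no open issues from raising an error; how such cases should be treated in the analysis itself is a separate decision.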
In our final report, we discuss the technical aspects of data mining and what our team accomplished through the extraction and analysis of our dataset. We have also created a short video demoing our results and conclusions. The video can be found here.
The final paper can be found here.
We have created a website that allows others to learn more about our project and interact with a force graph to better understand the relationships behind the conclusions we've made here.