GitHub Copilot is an AI pair programmer developed by GitHub and OpenAI. It is designed to help users by autocompleting code: it draws context from comments and existing code and suggests individual lines and whole functions.
Although GitHub has claimed that its use of public data in Copilot's training set is a "fair use," there is no settled law that directly allows or forbids this kind of use. Moreover, although GitHub has made several statements on code ownership, code responsibility, originality, and privacy, concerns about each of these topics have kept appearing.
- 1 Technology
- 2 Origin
- 3 Accuracy
- 4 Achievements
- 5 Comparison with Related Products, Models, and Techniques
- 6 Ethical Issues
- 6.1 Copyright
- 6.1.1 Legality of Using Public Data to Train Machine Learning Systems
- 6.1.2 Open Source Code Protecting Mechanisms
- 6.1.3 Commercial Product
- 6.1.4 Threat Toward Originality
- 6.1.5 Ownership and Responsibility
- 6.1.6 Text and Data Mining May Not Be Copyright Infringement
- 6.2 Concerns From Developers
- 6.3 The Privacy of Projects After Using Copilot
- 6.4 Reliability of Code
- 6.5 Replacing Developers
- 7 References
Technology
GitHub Copilot is powered by a distinct production version of Codex, a GPT language model fine-tuned on publicly available code from GitHub and evaluated on its Python code-writing capabilities. The idea grew out of the observation that GPT-3, a language model not explicitly trained for code generation, could generate simple programs from Python docstrings. Codex was first applied to coding scenarios, but it is expected to be adopted in more fields.
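The observation that a language model can complete a function from its docstring can be made concrete with a small example. The snippet below is purely illustrative: both the prompt and the completion are hand-written here, not actual Codex output.

```python
# Illustrative only: the kind of prompt a Codex-style model receives
# (a signature plus a docstring) and the kind of body it is expected
# to complete. The completion below is hand-written, not model output.

PROMPT = '''def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards,
    ignoring case."""
'''

# A completion a code model might plausibly produce for the prompt above:
def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards,
    ignoring case."""
    normalized = s.lower()
    return normalized == normalized[::-1]
```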
Copilot was trained on repositories of public code contributed by a network of developers on the GitHub platform.
GitHub has put a few filters in place to prevent Copilot from generating offensive language, but the possibility of undesired outputs, including biased, discriminatory, abusive, or offensive ones, still remains.
Origin
This project is a result of Microsoft's $1 billion investment in OpenAI, the research firm now led by former Y Combinator president Sam Altman.
Accuracy
GitHub benchmarked Copilot against a set of Python functions that have test coverage in open-source repositories. They blanked out the function bodies and asked GitHub Copilot to fill them in. The model got the body right 43% of the time on the first try, and 57% of the time when allowed ten attempts.
Achievements
As of October 2021, GitHub reported that about 30 percent of new code on its platform was being written with the support of GitHub Copilot.
Comparison with Related Products, Models, and Techniques
Kite is an AI programming assistant that supports code completions for developers. Kite and GitHub Copilot have been regarded as alternatives to each other. Compared with GitHub Copilot, Kite is currently integrated with a wider choice of code editors. The model used by GitHub Copilot is a modified version of GPT-3, while the model used by Kite is GPT-2. The training set used by GitHub Copilot contains more lines of code than the one used by Kite.
In order to produce suggestions, GitHub Copilot has to upload parts of the file the user is editing, although GitHub has stated that it does not collect any private code. Kite has stated that it is "fully functional for the most part without an internet connection" and that it does not send any code, or any byproducts of editing code, to the cloud. Both GitHub Copilot and Kite send some information about how the user interacts with the product in order to improve the product itself.
Tabnine is an AI programming assistant that supports code completions for developers. Tabnine and GitHub Copilot have been regarded as alternatives to each other. Compared with GitHub Copilot, Tabnine is currently integrated with a wider choice of code editors. The model used by GitHub Copilot is a modified version of GPT-3, while the model used by Tabnine is GPT-2.
While GitHub Copilot has to keep uploading the part of the code a user is editing in order to provide suggestions, Tabnine has stated that its product "will continue to have an offline mode that does not require an Internet connection and does not send any data to the cloud service."
Besides supporting individual developers, Tabnine also provides a Team Learning Algorithm, which gathers the code, preferences, and patterns of a team of developers while "continuously learning and adapting."
GPT-3, or the third-generation Generative Pre-trained Transformer, is a neural-network machine learning model trained on internet data to generate any type of text. GPT-3 has been used to produce and classify text that resembles natural language written by humans. OpenAI took part in the development of both GPT-3 and Codex, the model that powers GitHub Copilot.
In some cases where a relatively simple program or function is required, both GitHub Copilot and GPT-3 can produce output that is close to the answer, but the output given by GitHub Copilot tends to be better styled as code and is also executable. This likely reflects their design goals: GPT-3 is primarily a text-completion tool that generates the next characters or words based on previous predictions, while Codex was fine-tuned on publicly available code from GitHub specifically for code writing.
According to geneticprogramming.com, Genetic Programming (GP) is a type of Evolutionary Algorithm (EA), a subset of machine learning. EAs are used to discover solutions to problems humans do not know how to solve directly. Free of human preconceptions or biases, the adaptive nature of EAs can generate solutions that are comparable to, and often better than, the best human efforts.
GitHub Copilot aims specifically at the functionality of autocompleting code, while Genetic Programming has been used more in the field of automatic program synthesis.
According to two benchmarks, GitHub Copilot performs more maturely on program synthesis and is mature enough to be supportive in real programming activities. In addition, the code produced by Genetic Programming tends to be less readable, and Genetic Programming can also take longer to run.
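The evolutionary-algorithm idea behind Genetic Programming can be illustrated with a toy sketch: keep a population of candidate solutions, score them with a fitness function, select the best, and mutate them. Real GP evolves tree-structured programs; here, as a simplifying assumption, the "program" is just a pair of coefficients (a, b) for the expression a*x + b.

```python
import random

# Toy evolutionary-algorithm sketch (assumption: the "program" is a
# coefficient pair for a*x + b, not a real GP expression tree).
random.seed(0)

TARGET = lambda x: 3 * x + 1          # the behavior we want to evolve
XS = range(-5, 6)                     # sample inputs for fitness scoring

def fitness(ind):
    """Total error of a*x + b against the target (lower is better)."""
    a, b = ind
    return sum(abs((a * x + b) - TARGET(x)) for x in XS)

def mutate(ind):
    """Randomly nudge each coefficient by -1, 0, or +1."""
    a, b = ind
    return (a + random.choice((-1, 0, 1)), b + random.choice((-1, 0, 1)))

population = [(0, 0)] * 20
for _ in range(200):
    survivors = sorted(population, key=fitness)[:5]            # selection
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(15)]              # variation

best = min(population, key=fitness)
print("best individual:", best, "error:", fitness(best))
```

Because the best survivors are always carried forward, the error can only shrink over generations; the loop reliably drifts toward the target coefficients (3, 1).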
Ethical Issues
Copyright
Building a training set for an artificial intelligence algorithm always involves collecting examples of the relevant type of data. Copilot was trained on public GitHub repositories under any license, containing billions of lines of public code contributed by more than 73 million developers on the GitHub platform. GitHub has claimed that the model analyzes the training set and generates code from it, rather than searching it. It has also admitted that GitHub Copilot sometimes does generate code identical to code in the training set, but that this appears mostly when the user has not provided enough unique code inside the program.
GitHub's CEO Nat Friedman stated: "In general: (1) training ML systems on public data is fair use (2) the output belongs to the operator, just like with a compiler."
Legality of Using Public Data to Train Machine Learning Systems
As of June 2021, the US government had not published any official document directly declaring the legality of using publicly available data to train artificial intelligence algorithms in this way, and the question has not been tested in court either. The Free Software Foundation (FSF) has begun funding a call for papers to examine the legal issues around GitHub Copilot. The questions posed by the FSF cover developers' concerns over whether the usage can be regarded as "fair use," whether the code output by the program can give rise to copyright infringement, the ability of GitHub Copilot to detect license violations, and so on.
In the Scenario of Applying English Law
Training a model requires data as input, and for a model intended to power a programming assistant, that data has to be code. For a model, taking code as input involves an activity similar to copying the code or text. Under English law, that act is reserved to the owner of the text, so unless someone has obtained statutory permission, the copying cannot lawfully be carried out. Although there are provisions under which statutory permission may be available to some people or organizations, most of them apply only when the purpose is "non-commercial."
According to GitHub's terms of service, GitHub "need[s] the legal right to do things like host Your Content, publish it, and share it." As a result, at least for code published on the GitHub platform, GitHub may have the right to copy it, and thus may also have the right to use code on the platform in the training process of GitHub Copilot (and other models). Even so, there may be projects whose licenses allow GitHub to copy the code but still forbid GitHub from using it in this way.
However, it is still unclear whether English law applies to the act of training a machine learning system, or to some narrower scope.
Open Source Code Protecting Mechanisms
Out of a desire to protect open-source code from being overly exploited by deep learning models, motivated by the discussions around GitHub Copilot and similar products that take advantage of public data, a prototype called CoProtector has been built. Its researchers state that CoProtector "utilizes data poisoning techniques to arm source code repositories for defending against such exploits." The researchers believe that, depending on the license used by an open-source project, the code inside the project may not be directly usable for free, and that such behavior could amount to copyright infringement.
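One simplified way to picture this family of defenses is watermarking: plant a distinctive, harmless token in a repository, then check model outputs for it; if a model later reproduces the token, that suggests the repository was used for training. The sketch below illustrates only that general idea, not CoProtector's actual poisoning technique, and the helper names are made up for illustration.

```python
import uuid

# Illustrative sketch of code "watermarking" against unauthorized
# training use. This is a simplification for explanation, NOT
# CoProtector's actual data-poisoning mechanism.

def plant_watermark(source: str, mark: str) -> str:
    """Append a distinctive no-op comment to a source file."""
    return source + f"\n# watermark: {mark}\n"

def output_contains_watermark(model_output: str, mark: str) -> bool:
    """If a model emits the planted mark, the repository was likely
    included in its training data."""
    return mark in model_output

mark = "wm-" + str(uuid.uuid4())  # unique marker per repository
protected = plant_watermark("def add(a, b):\n    return a + b", mark)

print(output_contains_watermark(protected, mark))             # True
print(output_contains_watermark("def add(a, b): ...", mark))  # False
```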
Commercial Product
According to 17 U.S.C. 107 - Limitations on exclusive rights: Fair use, one of the factors for fair use is "the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes." Although users may use GitHub Copilot as a tool to produce nonprofit products, GitHub Copilot also allows its functionality, as well as output derived from a training set that includes publicly accessible data, to be used in commercial scenarios.
Threat Toward Originality
GitHub has stated that for Copilot, "the vast majority of the code that it suggests is uniquely generated and has never been seen before." It has also stated that it is working on a filter to track cases of replication and to decrease the chance of such cases appearing.
However, a question that remains unclear is whether GitHub Copilot, or any other AI programming assistant based on a natural language processing model, actually produces new code. GitHub has not proved that Copilot generates new code, rather than producing different combinations of existing code in order to evade the restrictions of copyright.
Ownership and Responsibility
According to GitHub, when a user uses GitHub Copilot to support their programming, "the code you write with its help belongs to you, and you are responsible for it." Without this statement, the output given by GitHub Copilot might be interpreted as being owned by GitHub, which would not only strip the product of value for users but could also expose GitHub to lawsuits over product output that is outside its control.
However, there are cases in which GitHub Copilot could produce code under a particular license and insert it into a project whose own license is incompatible with it, without the user of GitHub Copilot noticing. As a result, if the user is the owner of the produced code, this could create a legal issue for the user.
Besides, if the user owns the output of GitHub Copilot, then because GitHub uses at least some information about the interaction between a user and Copilot's suggested code to train the model, it is possible that GitHub has infringed rights reserved to the user under copyright law.
In addition, the training set used by the model that powers GitHub Copilot may also contain code that itself infringes copyright, so it is unclear whether GitHub Copilot, which takes advantage of the problematic code, should also be regarded as taking part in the infringement.
Text and Data Mining May Not Be Copyright Infringement
From one point of view, it is possible that text and data mining need not be heavily restricted by copyright, and the same could apply to the training process behind GitHub Copilot. Collecting public data can be regarded as a form of "reading and processing information." The complication is that when most digital products perform the act of "reading" information online, an act of "copying" must be executed first. However, it has been argued that "policymakers and courts have long recognized that digital technology would be completely unusable if every technical copy required permission." One such example was given by the EU in 2001, which at that time "allowed such temporary, ephemeral acts of copying, which are part of a technical process, without restriction – despite the protests of the entertainment industry at the time."
Concerns From Developers
Within a week of GitHub's announcement of Copilot, several developers voiced their concerns about the copyright issue on Twitter. One of the posts, which had earned more than 3,000 likes at the time, stated: "GitHub scraped your code. And they plan to charge you for copilot after you help train it further."
Some developers are concerned about responsibility for license violations. Although GitHub has stated that only 0.1% of the code generated by Copilot is recited from the training set, developers are unsure whether the remaining 99.9% can be regarded as combinations of existing programming projects. Moreover, if a violation of a code license occurs, it is unclear whether the company that developed Copilot, the developer who used it, or the company or organization benefiting from it should face the legal consequences.
The Privacy of Projects After Using Copilot
GitHub has stated: "In order to generate suggestions, GitHub Copilot transmits part of the file you are editing to the service." It has denied that private code is collected by Copilot, but admitted that it collects each user's choices of whether or not to accept each suggestion given by Copilot. How closely the collected information can reveal what a developer is doing inside a private project remains unclear, and the impact may differ across projects with different security levels: developers of open-source projects may be able to tolerate a relatively high level of information exposure, while developers of military projects have to be wary of any level of external data collection.
Reliability of Code
On Copilot's official website, GitHub states that Copilot does not produce perfect code: "GitHub Copilot tries to understand your intent and to generate the best code it can, but the code it suggests may not always work, or even make sense." GitHub also states that users are in charge of the code, and thus have the responsibility of testing, reviewing, and vetting the code suggestions given by Copilot.
There are cases in which GitHub Copilot can produce untrustworthy code. The model that powers GitHub Copilot was trained on a set of code projects that includes unvetted ones, so the resulting model may produce code that is executable but buggy. Researchers generated 89 test scenarios "relevant to high-risk cybersecurity weaknesses," drawn in part from MITRE's "Top 25" Common Weakness Enumeration list, yielding 1,689 programs; around 40% of the completions produced by GitHub Copilot were found to be vulnerable. However, since GitHub Copilot is a closed-source system that is close to a black box, and since it is powered by a generative model whose output for the same input may differ each time, the experimental results may not be reproducible.
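The kind of weakness the study tested for can be illustrated with a classic SQL-injection pattern (CWE-89, one of the weakness classes on MITRE's list). The code below is hand-written for illustration, not actual Copilot output: the first lookup builds the query by string formatting, so attacker input can alter the query's structure, while the second uses a parameterized query.

```python
import sqlite3

# Illustrative example of a common weakness class (CWE-89, SQL injection)
# of the kind the study tested for; hand-written, not Copilot output.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

def lookup_vulnerable(name: str):
    # BAD: string formatting lets input rewrite the query structure.
    return conn.execute(
        f"SELECT secret FROM users WHERE name = '{name}'"
    ).fetchall()

def lookup_safe(name: str):
    # GOOD: a parameterized query treats input purely as data.
    return conn.execute(
        "SELECT secret FROM users WHERE name = ?", (name,)
    ).fetchall()

# Attacker-controlled input that dumps every row from the vulnerable query:
payload = "' OR '1'='1"
print(lookup_vulnerable(payload))  # leaks all secrets
print(lookup_safe(payload))        # returns nothing
```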
Besides, the training set for GitHub Copilot may include insecure code. Even if every project in the training set is secure on its own, there may be dangerous ways of reusing its code snippets, and even if every project states the danger in its README, the model behind GitHub Copilot may not be able to understand such manually written warnings. This can result in GitHub Copilot producing problematic code.
Replacing Developers
In June 2021, after Copilot was first announced, people began voicing worries about whether GitHub Copilot, or other AI programming assistants, would gradually replace developers. The model behind GitHub Copilot works by taking the prompts given by users as input and then predicting the coding goal while remaining under the users' control. In LSE Business Review, Ravi Sawhney wrote that the current version of GitHub Copilot has been shown by example to generate executable code that matches its context, while whether it is a useful tool for improving programmers' productivity remains uncertain. Although GitHub Copilot already showed, when first released, the ability to lower the bar for becoming a programmer, Sawhney believes there is still no sign of it replacing current programmers.
References
- GitHub copilot · your AI pair programmer. GitHub Copilot. (n.d.). Retrieved January 27, 2022, from https://copilot.github.com/
- Sawers, P. (2021, June 29). GitHub launches copilot to power pair programming with ai. VentureBeat. Retrieved January 27, 2022, from https://venturebeat.com/2021/06/29/github-launches-copilot-to-power-pair-programming-with-ai/
- Gershgorn, D. (2021, June 29). GitHub and OpenAI launch a new AI tool that generates its own code. The Verge. Retrieved January 27, 2022, from https://www.theverge.com/2021/6/29/22555777/github-openai-ai-tool-autocomplete-code
- Gershgorn, D. (2021, July 7). GitHub's automatic coding tool rests on untested legal ground. The Verge. Retrieved January 27, 2022, from https://www.theverge.com/2021/7/7/22561180/github-copilot-legal-copyright-fair-use-public-code
- MoneyControl. (n.d.). Explained: Everything you need to know about github copilot. Moneycontrol. Retrieved January 27, 2022, from https://www.moneycontrol.com/news/technology/explained-everything-you-need-to-know-about-github-copilot-7920251.html
- Howard, G. D. (2021). GitHub Copilot: Copyright, Fair Use, Creativity, Transformativity, and Algorithms.
- Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., ... & Zaremba, W. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- Yahoo! (n.d.). Ai programming tool copilot helps write up to 30% of code on github. Yahoo! News. Retrieved January 27, 2022, from https://news.yahoo.com/ai-programming-tool-copilot-helps-153003394.html?guccounter=1&guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&guce_referrer_sig=AQAAALpM1T7xnc9DESuPjGd-jHOy8tJZPKl0NXQfgqmWbutVv5qspfKUJjgreY3hcRCinYM5yz3NOW7syx4v1pImWFuUhA99mTeb3AUBvWiwChbN9mIbqdl3X_cHBA1BikviAMQAn07FaCm6NbAhu7rakAf8HWSTN1Q46wjMZnUYS8bN
- Kite - free AI coding assistant and code auto-complete plugin. Code Faster with Kite. (n.d.). Retrieved January 28, 2022, from https://www.kite.com/
- Ramnani, M. (2022, January 1). Top 8 alternatives to github copilot. Analytics India Magazine. Retrieved January 28, 2022, from https://analyticsindiamag.com/top-8-alternatives-to-github-copilot/
- Compare GitHub Copilot vs. Kite vs. Tabnine in 2022. Slashdot. (n.d.). Retrieved January 28, 2022, from https://slashdot.org/software/comparison/GitHub-Copilot-vs-Kite-vs-Tabnine/
- Kite. (n.d.). FAQ. Kite Help Desk. Retrieved January 28, 2022, from https://help.kite.com/article/105-faq
- Tabnine. (n.d.). Code faster with AI code completions. Retrieved January 28, 2022, from https://www.tabnine.com/
- Weiss, D. (2021, June 10). Tabnine is now part of Codota. Tabnine Blog. Retrieved January 28, 2022, from https://www.tabnine.com/blog/tabnine-part-of-codota/
- Schmelzer, R. (2021, June 11). What is GPT-3? Everything you need to know. SearchEnterpriseAI. Retrieved January 28, 2022, from https://www.techtarget.com/searchenterpriseai/definition/GPT-3
- Wikimedia Foundation. (2022, January 27). GPT-3. Wikipedia. Retrieved January 28, 2022, from https://en.wikipedia.org/wiki/GPT-3
- Gupta, K. (2021, December 29). Who writes better code: GitHub Copilot or GPT-3? Medium. Retrieved January 28, 2022, from https://python.plainenglish.io/who-writes-better-code-github-copilot-or-gpt-3-9e7441650c9b
- About GP. Genetic Programming. (2019, June 1). Retrieved January 28, 2022, from https://geneticprogramming.com/
- Sobania, D., Briesch, M., & Rothlauf, F. (2021). Choose Your Programming Copilot: A Comparison of the Program Synthesis Performance of GitHub Copilot and Genetic Programming. arXiv preprint arXiv:2111.07875.
- Mark A. Lemley and Bryan Casey. (2021, March 20). Fair learning. Texas Law Review. Retrieved January 27, 2022, from https://texaslawreview.org/fair-learning/
- Research recitation. GitHub Docs. (n.d.). Retrieved January 27, 2022, from https://docs.github.com/en/github/copilot/research-recitation
- In general: (1) training ML systems on public data is fair use (2) the output be... | Hacker News. (n.d.). Retrieved January 27, 2022, from https://news.ycombinator.com/item?id=27678354
- Krill, P. (2021, August 2). GitHub copilot is 'unacceptable and unjust,' says Free Software Foundation. InfoWorld. Retrieved January 27, 2022, from https://www.infoworld.com/article/3627319/github-copilot-is-unacceptable-and-unjust-says-free-software-foundation.html
- Neil. (2021, June 30). GitHub Copilot: initial thoughts from an English law perspective. decoded.legal. Retrieved January 28, 2022, from https://decoded.legal/blog/2021/06/github-copilot-initial-thoughts-from-an-english-law-perspective
- GitHub terms of service. GitHub Docs. (n.d.). Retrieved January 28, 2022, from https://docs.github.com/en/github/site-policy/github-terms-of-service
- Sun, Z., Du, X., Song, F., Ni, M., & Li, L. (2021). CoProtector: Protect Open-Source Code against Unauthorized Training Usage with Data Poisoning. arXiv preprint arXiv:2110.12925.
- GitHub copilot and license restrictions. zephyrtronium. (n.d.). Retrieved January 28, 2022, from https://zephyrtronium.github.io/articles/copilot.html
- Tsai, M. J. (n.d.). GitHub copilot and copyright. Michael Tsai - Blog - GitHub Copilot and Copyright. Retrieved January 28, 2022, from https://mjtsai.com/blog/2021/07/07/github-copilot-and-copyright/
- Felix Reda. (n.d.). GitHub copilot is not infringing your copyright. Felix Reda. Retrieved January 28, 2022, from https://felixreda.eu/2021/07/github-copilot-is-not-infringing-your-copyright/
- Surur. (2021, July 7). GitHub copilot receives criticism from copyright enthusiasts. MSPoweruser. Retrieved January 27, 2022, from https://mspoweruser.com/github-copilot-receives-criticism-from-copyright-enthusiasts/
- Martins, S. (2021, July 16). 4 concerns I have about github copilot. Medium. Retrieved January 27, 2022, from https://betterprogramming.pub/4-concerns-about-github-copilot-b9214d5416fa
- Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., & Karri, R. (2021). An Empirical Cybersecurity Evaluation of GitHub Copilot's Code Contributions. arXiv preprint arXiv:2108.09293.
- Ramel, D. (2021, June 30). Will AI replace developers? GitHub Copilot revives existential threat angst. Visual Studio Magazine. Retrieved January 27, 2022, from https://visualstudiomagazine.com/articles/2021/06/30/github-copilot-comments.aspx
- Sawhney, R. (2021). Can artificial intelligence make software development more productive?. LSE Business Review.