GitHub Copilot

From SI410
Revision as of 07:09, 27 January 2022 by Zyongtai (Talk | contribs) (Ethical Issues)


GitHub Copilot is an AI pair programmer developed by GitHub and OpenAI.[1] It was designed to help users by autocompleting code.[2] It draws context from comments and code, and suggests individual lines and whole functions.[1] GitHub Copilot is powered by OpenAI Codex, an AI system created by OpenAI.[1] The GitHub Copilot technical preview is available as an extension for Visual Studio Code, Neovim, and the JetBrains suite of IDEs.[1] GitHub announced GitHub Copilot on 29 June 2021.[3] The programming languages it currently supports include Python, JavaScript, TypeScript, Ruby, Java, and Go, but it also provides autocompletion for languages it does not explicitly aim to support.[4]

Although GitHub claimed that its use of public data in Copilot's training set is "fair use," there is no settled law that directly allows or forbids this use.[5]

Technology

GitHub Copilot is powered by a distinct production version of Codex, a GPT language model fine-tuned on publicly available code from GitHub. The researchers who introduced Codex studied its Python code-writing capabilities.[6] The idea came from the observation that GPT-3, a language model that was not explicitly trained for code generation, can generate simple programs from Python docstrings.[6]
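To illustrate what "generating a program from a Python docstring" means, here is a minimal sketch. The function name, docstring, and body are hypothetical and hand-written for this article; they are not actual Codex or Copilot output. The signature and docstring stand in for the prompt a model would see, and the body stands in for the kind of completion it would be asked to produce.

```python
def is_palindrome(text: str) -> bool:
    """Return True if text reads the same forwards and backwards,
    ignoring case and any non-alphanumeric characters."""
    # Everything above this comment plays the role of the prompt.
    # The lines below play the role of a suggested completion
    # (written by hand here, as an illustration only).
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    return cleaned == cleaned[::-1]
```

For example, calling `is_palindrome("A man, a plan, a canal: Panama")` checks the cleaned letters against their reverse.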

Copilot was trained on public code written by the network of developers on the GitHub platform.[4]

GitHub has put a few filters in place to prevent Copilot from generating offensive language, but it can still produce undesired outputs, including biased, discriminatory, abusive, or offensive ones.[1][2]

Origin

This project is a result of Microsoft's $1 billion investment in OpenAI, the research firm now led by former Y Combinator president Sam Altman.[2]

Accuracy

GitHub benchmarked Copilot against a set of Python functions that have test coverage in open source repositories. They blanked out the function bodies and asked GitHub Copilot to fill them in. The model got this right 43% of the time on the first try, and 57% of the time when allowed 10 attempts.[1]
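The benchmarking procedure described above can be sketched roughly as follows. This is an assumption-laden illustration, not GitHub's actual harness: the helper names `passes_tests` and `solved_within`, the toy `add` task, and the two candidate completions are all hypothetical. The idea is that a blanked-out function counts as solved within k attempts if any of the first k candidate bodies passes the existing tests.

```python
from typing import Callable, Dict, List


def passes_tests(candidate_src: str, func_name: str,
                 tests: List[Callable[[Callable], bool]]) -> bool:
    """Define the candidate function from source and run every test on it."""
    namespace: Dict[str, object] = {}
    try:
        exec(candidate_src, namespace)   # define the candidate function
        func = namespace[func_name]
        return all(test(func) for test in tests)
    except Exception:
        # Syntax errors, missing names, and runtime failures all count as a miss.
        return False


def solved_within(candidates: List[str], func_name: str,
                  tests: List[Callable[[Callable], bool]], k: int) -> bool:
    """True if any of the first k candidate completions passes all tests."""
    return any(passes_tests(src, func_name, tests) for src in candidates[:k])


# Hypothetical task: the body of `add` was blanked out and two
# completions were proposed; the first is wrong, the second is right.
candidates = [
    "def add(a, b):\n    return a - b\n",
    "def add(a, b):\n    return a + b\n",
]
tests = [lambda f: f(2, 3) == 5, lambda f: f(-1, 1) == 0]

first_try = solved_within(candidates, "add", tests, k=1)
ten_tries = solved_within(candidates, "add", tests, k=10)
```

Averaging `first_try` and `ten_tries` over many functions would give figures of the same kind as the reported first-try and 10-attempt success rates. Running untrusted generated code with `exec` is unsafe outside a sandbox; it is used here only to keep the sketch short.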

Achievements

As of October 2021, GitHub said that about 30 percent of new code on its platform was being written with the help of GitHub Copilot.[7]

Ethical Issues

Copyright

Training an artificial intelligence model always involves collecting examples of the relevant type.[8] Copilot is trained on public GitHub repositories of any license, which contain billions of lines of public code contributed by more than 73 million developers on the GitHub platform.[3][4] GitHub claimed that the model analyzes the training set and generates new code rather than searching it.[1] GitHub also acknowledged that Copilot sometimes reproduces code verbatim from the training set, but said this happens mostly when the user has not provided enough unique context in the program.[9]

GitHub's CEO Nat Friedman stated: "In general: (1) training ML systems on public data is fair use (2) the output belongs to the operator, just like with a compiler."[10]

Legality of Using Public Data to Train Machine Learning Systems

As of June 2021, the US government had not published any official document that directly addresses the legality of using publicly available data to train artificial intelligence algorithms in this way.[3]

Concerns from Developers

Within a week of GitHub's announcement of Copilot, several developers voiced concerns about the copyright issue on Twitter.[11] One post, which had earned more than 3,000 likes at the time, stated: "GitHub scraped your code. And they plan to charge you for copilot after you help train it further."[11]

Commercial Product

GitHub Copilot is currently free for a limited number of users.[1] GitHub states: "If the technical preview is successful, our plan is to build a commercial version of GitHub Copilot in the future."[1]

References

  1. GitHub Copilot · Your AI pair programmer. GitHub Copilot. (n.d.). Retrieved January 27, 2022, from https://copilot.github.com/
  2. Gershgorn, D. (2021, June 29). GitHub and OpenAI launch a new AI tool that generates its own code. The Verge. Retrieved January 27, 2022, from https://www.theverge.com/2021/6/29/22555777/github-openai-ai-tool-autocomplete-code
  4. Moneycontrol. (n.d.). Explained: Everything you need to know about GitHub Copilot. Moneycontrol. Retrieved January 27, 2022, from https://www.moneycontrol.com/news/technology/explained-everything-you-need-to-know-about-github-copilot-7920251.html
  5. Howard, G. D. (2021). GitHub Copilot: Copyright, Fair Use, Creativity, Transformativity, and Algorithms.
  6. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., ... & Zaremba, W. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
  7. Yahoo! (n.d.). AI programming tool Copilot helps write up to 30% of code on GitHub. Yahoo! News. Retrieved January 27, 2022, from https://news.yahoo.com/ai-programming-tool-copilot-helps-153003394.html?guccounter=1&guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&guce_referrer_sig=AQAAALpM1T7xnc9DESuPjGd-jHOy8tJZPKl0NXQfgqmWbutVv5qspfKUJjgreY3hcRCinYM5yz3NOW7syx4v1pImWFuUhA99mTeb3AUBvWiwChbN9mIbqdl3X_cHBA1BikviAMQAn07FaCm6NbAhu7rakAf8HWSTN1Q46wjMZnUYS8bN
  8. Lemley, M. A., & Casey, B. (2021, March 20). Fair learning. Texas Law Review. Retrieved January 27, 2022, from https://texaslawreview.org/fair-learning/
  9. Research recitation. GitHub Docs. (n.d.). Retrieved January 27, 2022, from https://docs.github.com/en/github/copilot/research-recitation
  10. In general: (1) training ML systems on public data is fair use (2) the output be... Hacker News. (n.d.). Retrieved January 27, 2022, from https://news.ycombinator.com/item?id=27678354