Sometimes major shifts happen virtually unnoticed. On May 5, IBM announced Project CodeNet to very little media or academic attention.
CodeNet is a follow-up to ImageNet, a large-scale dataset of images and their descriptions; the images are free for non-commercial uses. ImageNet is now central to the progress of deep learning computer vision.
CodeNet is an attempt to do for Artificial Intelligence (AI) coding what ImageNet did for computer vision: it is a dataset of over 14 million code samples, covering 50 programming languages, written to solve 4,000 coding problems. The dataset also contains additional metadata, such as the amount of memory the software needs to run and the log output of the running code.
GPT-3, OpenAI’s industry-leading NLP model, has been used to generate a website or app from a plain-language description of what you want. Soon after IBM’s news, Microsoft announced it had secured exclusive rights to GPT-3.
Microsoft also owns GitHub, the largest collection of open source code on the internet, which it acquired in 2018. The company has added to GitHub’s potential with GitHub Copilot, an AI assistant. When the programmer describes the action they want to code, Copilot generates a code sample that could achieve what they specified. The programmer can then accept the AI-generated sample, edit it or reject it, which drastically simplifies the coding process. Copilot is a huge step towards natural language coding (NLC), but it is not there yet.
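The accept, edit or reject workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `suggest_code`, `Review`, `Decision` and `review_loop` are hypothetical names invented for this example, and the "assistant" is a canned lookup table, not Copilot or any real API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    ACCEPT = "accept"
    EDIT = "edit"
    REJECT = "reject"


@dataclass
class Review:
    decision: Decision
    edited_code: Optional[str] = None  # only used when decision is EDIT


def suggest_code(prompt: str) -> str:
    """Hypothetical stand-in for an AI assistant: returns a canned suggestion."""
    canned = {
        "add two numbers": "def add(a, b):\n    return a + b\n",
    }
    return canned.get(prompt, "# no suggestion available\n")


def review_loop(prompt: str, reviewer: Callable[[str], Review]) -> Optional[str]:
    """Generate a suggestion and apply the programmer's decision to it."""
    suggestion = suggest_code(prompt)
    review = reviewer(suggestion)
    if review.decision is Decision.ACCEPT:
        return suggestion
    if review.decision is Decision.EDIT:
        return review.edited_code
    return None  # rejected: nothing is added to the codebase


# Three programmers reacting differently to the same suggestion.
accepted = review_loop("add two numbers", lambda code: Review(Decision.ACCEPT))
edited = review_loop(
    "add two numbers",
    lambda code: Review(Decision.EDIT, code.replace("add", "add_ints")),
)
rejected = review_loop("add two numbers", lambda code: Review(Decision.REJECT))
```

In this sketch the human stays the final gatekeeper: the assistant only proposes code, and nothing reaches the codebase unless the reviewer accepts or edits it.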
From "Google and Microsoft are creating a monopoly on coding in plain language":
There is also reason to believe that such technologies will be dominated by platform corporations because of the way machine learning works. In theory, programs such as Copilot improve as they are fed new data: the more they are used, the better they become. Platforms that already have the most users therefore keep pulling ahead, which makes it harder for new competitors to break in, even if they have a stronger or more ethical product.
Unless there is a serious counter-effort, it seems likely that large capitalist conglomerates will be the gatekeepers of the next coding revolution.