Training an OpenAI-Quality Text Embedding Model from Scratch


Text embeddings are an integral component of modern NLP applications, powering retrieval-augmented generation (RAG) for LLMs and semantic search. The highest-quality text embedding models are closed source, and access to them is gated via the APIs of leading AI companies. This talk describes how Nomic AI trained nomic-embed-text-v1 - the first fully auditable text embedding model, with open data, open weights, and open training code, that outperforms OpenAI Ada-002. You will learn how text embedding models are trained, the training decisions that impact model capabilities, and tips for successfully using embeddings in your production applications.
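Text embedding models of this kind are commonly trained with a contrastive objective over (query, document) pairs, where other documents in the batch serve as negatives. As a rough illustration only (not the talk's actual training code), here is a minimal NumPy sketch of an InfoNCE-style loss with in-batch negatives; the function name, temperature value, and random stand-in embeddings are all illustrative assumptions:

```python
import numpy as np

def info_nce_loss(q, d, temperature=0.05):
    """Contrastive loss: q[i] should match d[i], with d[j != i] as negatives.

    q, d: (batch, dim) arrays of query/document embeddings (illustrative).
    """
    # L2-normalize so the dot product is cosine similarity
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = q @ d.T / temperature  # (batch, batch) scaled similarities
    # Numerically stable log-softmax over each row
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The "correct" document for query i sits on the diagonal
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 16))
# Matched pairs (small perturbation) give a much lower loss than random pairs
matched_loss = info_nce_loss(queries, queries + 0.01 * rng.normal(size=(8, 16)))
random_loss = info_nce_loss(queries, rng.normal(size=(8, 16)))
```

With matched pairs the diagonal similarities dominate and the loss approaches zero, while with unrelated documents it approaches log(batch_size); this gap is what drives the embedding space to cluster semantically related text.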


Andriy is the co-founder and CTO of Nomic AI, a venture-backed start-up on a mission to democratize access to artificial intelligence. Prior to Nomic, Andriy was an early engineer at RadAI, where he trained multi-billion-parameter LLMs to assist radiologists, and a Ph.D. student at NYU's Courant Institute of Mathematical Sciences. He cares about making AI systems, and the data they are trained on, more accessible to everyone.

Open Data Science
One Broadway
Cambridge, MA 02142
