Multi-Modal hackathon by Weaviate at AIx Innovation Summit and ODSC West 2023 Data Science Conference
Nov 3, 2023
I wanted to experiment more with multimodal generative AI after the Next Gen AI hackathon. There I had seen two students mixing generative AI art and augmented reality; their artwork was very rudimentary (even though collaborative). My thinking was that one modality should be text-to-3D objects, and the generated art could then be minted as an NFT (Non-Fungible Token), bringing blockchain and Web3 into the picture and unifying several popular buzzwords: augmented reality, generative AI, blockchain, NFT, Web3.
ODSC (Open Data Science Conference) West was coming up in the Bay Area, with an AIx Innovation Summit included as well. It caught my eye that a multimodal hackathon was happening as part of the event. I started reaching out on social media again to recruit a team, and within a few weeks I had a team lined up of individuals with amazing credentials and pedigree:
- Avi Rao: front-end main role for the hackathon; AR/VR, blockchain, NFT, crypto, and web3 specialist
- Yvonne Fang: front-end main role for the hackathon; also an AR/VR, generative AI art, and web3 specialist
- Andrew Savala: back-end role for the hackathon; also a full-stack engineer and AI/ML specialist
- Quinton Mills: front-end role for the hackathon; also UX design and management competencies
- Myself: back-end, GCP functions, team organization
Quinton joined us on site, while the rest of us had already been chatting about the direction we would take. My vision was:
- We would generate a 3D object from a user-supplied prompt / instructions.
- We would also optionally generate a short companion music clip for the 3D object based on a user prompt / instructions.
- The generated assets would be indexed by a multimodal embedding model and stored in our back-end.
- The user could view the generated artwork in augmented reality while listening to the optional companion music.
- The user could also search for artwork by keywords or a description.
- The user would have the opportunity to mint an NFT of an artwork.
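To make this concrete, here is a minimal back-end sketch of the flow above; every helper and URL in it is a placeholder for the real services (Meshy, a text-to-music model, Weaviate, NFTPort), not an actual API:

```python
from dataclasses import dataclass
from typing import Optional
from uuid import uuid4

# Placeholder generators -- in the real project these would call Meshy and
# a text-to-music model; the URLs returned here are fake.
def generate_3d_asset(prompt: str) -> str:
    return f"https://assets.example.com/{uuid4()}.glb"

def generate_music(prompt: str) -> str:
    return f"https://assets.example.com/{uuid4()}.mp3"

@dataclass
class Artwork:
    prompt: str
    model_url: str
    music_url: Optional[str] = None   # optional companion music
    nft_tx: Optional[str] = None      # filled in once the NFT is minted

def create_artwork(prompt: str, with_music: bool = False) -> Artwork:
    """Generate the assets for a user prompt; indexing, AR viewing, and minting come later."""
    art = Artwork(prompt=prompt, model_url=generate_3d_asset(prompt))
    if with_music:
        art.music_url = generate_music(prompt)
    return art

if __name__ == "__main__":
    print(create_artwork("a glossy yellow rubber ducky", with_music=True))
```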
Before the event, I started to look around for text-to-3D and text-to-music generative AI models. I quickly realized that 3D object generation is a much harder problem than I thought, and it is somewhat of a niche segment. You can find hundreds if not thousands of unimodal text LLMs, but there are only a handful of text-to-3D models around. Furthermore, out of the few models I found, several don't provide a ready-to-use asset that I could feed into a framework for display. Some models can generate a point cloud and perform a certain degree of tessellation, but I needed something that produces a fully "baked" model; I couldn't delve into NP-hard problems around tessellation.
Finally, I found Meshy, a paid API used by game developers to obtain 3D assets, so it had to produce decent enough models. Unfortunately, we could still observe issues such as the Janus problem (where the generated rubber ducky has multiple faces and beaks on its head or other body parts), as well as other anatomical (in the case of a living creature) or structural errors.
Three separate rubber ducky models generated via direct text-to-3D, suffering from the Janus problem and/or other anatomical problems:
Fortunately, the above examples are direct text-to-3D generation results. I was able to achieve much better results with a two-stage generation: first text to image, then image to 3D. The generated rubber ducky images were far better, and from those the model could produce cleaner 3D models with the texture one would expect. Even then there were still challenges, such as swapped colors (the duck's body is red and the beak is yellow), keeping the main subject in the center, and preferably having a neutral mono-color diffuse background with no other objects.
Text -> image -> 3D object generations with color problems or unintended multiple objects:
Text -> image -> 3D object generations where the subject is cropped and partially off-screen:
These can be improved by prompt engineering. The music generation part wasn't trivial either. I could find more models than for 3D object generation, some of them free or open source, but I was dealing with significant API call delays: 10+ seconds to generate a 30-second music clip.
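Here is a rough sketch of the two-stage pipeline with the kind of prompt engineering mentioned above. I'm assuming an OpenAI image model for the first stage (we may have used something else), and the Meshy endpoint and payload field below are from memory rather than verified documentation:

```python
import os
import requests
from openai import OpenAI

# Prompt engineering hints to keep the subject centered on a plain background.
PROMPT_SUFFIX = ", single object centered in frame, neutral mono-color background, no other objects"

def text_to_image(prompt: str) -> str:
    """Stage 1: text -> image (sketched with OpenAI's image API)."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    result = client.images.generate(
        model="dall-e-3", prompt=prompt + PROMPT_SUFFIX, n=1, size="1024x1024"
    )
    return result.data[0].url

def image_to_3d(image_url: str) -> dict:
    """Stage 2: image -> 3D model via Meshy.
    The endpoint path and payload field are assumptions -- consult the Meshy docs;
    the real flow also involves polling an async task until the .glb asset is ready."""
    response = requests.post(
        "https://api.meshy.ai/v1/image-to-3d",                       # assumed endpoint
        headers={"Authorization": f"Bearer {os.environ['MESHY_API_KEY']}"},
        json={"image_url": image_url},                               # assumed field name
        timeout=30,
    )
    return response.json()

if __name__ == "__main__":
    image_url = text_to_image("a glossy yellow rubber ducky toy")
    print(image_to_3d(image_url))
```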
Another important piece of the puzzle is the multimodal embedding model, which was provided by the hackathon organizer, Weaviate. Embedding models are responsible for placing an input (text snippet, audio, image) into a high-dimensional embedding space. The model is optimized through a long training process so that conceptually close inputs end up close in the embedding space (ideally as close as possible), while different concepts end up farther apart (ideally as far away as possible). When I say high-dimensional space, I really mean it; here are, for example, the typical dimensionalities of OpenAI embedding model variations:
| Model name | Code name (for API) | Dimensions |
|---|---|---|
| Ada | ada-002 | 1536 |
| Babbage | babbage-001 | 2048 |
| Curie | curie-001 | 4096 |
| Davinci | davinci-001 | 12288 |
These dimensions are latent, so you cannot assign a human concept to them, like a dimension for animals, food, cars, etc. This is similar to dimensionality reduction or transformation by PCA (Principal Component Analysis): the resulting dimensions are truly machine-generated (though guided by the long training process to optimize for the given criteria). The more dimensions there are, the more discriminative the model can be; however, more dimensions require more storage, and vector search can take proportionally longer as well.
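To make "close in the embedding space" concrete, here is a minimal sketch of how closeness is typically measured, using made-up low-dimensional vectors in place of real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means identical direction, 0.0 unrelated, -1.0 opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" (real ones have 1536+ latent dimensions).
duck_image  = np.array([0.9, 0.1, 0.0, 0.2])
duck_text   = np.array([0.8, 0.2, 0.1, 0.1])
metal_music = np.array([0.0, 0.9, 0.8, 0.1])

print(cosine_similarity(duck_image, duck_text))    # high: same concept, different modality
print(cosine_similarity(duck_image, metal_music))  # low: unrelated concepts
```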
The results can be extremely interesting, especially for multimodal embeddings. I was planning the following demo examples for given search keywords:
| Search keyword | Close 3D object | Close music snippet |
|---|---|---|
| “salsa” | A bowl of salsa | Salsa music |
| “metal” | Bronze sculpture | Heavy metal music |
| “rubber ducky” | Rubber ducky | Old MacDonald Had a Farm |
Salsa 3D model; the texture shows pretty amazing detail with proper salsa ingredients, though the see-through nature is not intended:
We ran into issues hosting Weaviate's multimodal embedding model. Andrew Savala came to the rescue by setting up a virtual machine locally on his laptop while I tried to deploy a cloud instance. We ended up using Andrew's instance, with a backup local instance on the laptop I used for testing.
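For reference, here is a minimal sketch of how artworks could be wired into Weaviate, assuming the Python v3 client and the multi2vec-bind (ImageBind) multimodal module; the class and property names are illustrative, not our exact schema:

```python
import weaviate

client = weaviate.Client("http://localhost:8080")  # Andrew's instance in our case

# Class whose vectors come from the multimodal (ImageBind) vectorizer module.
artwork_class = {
    "class": "Artwork",
    "vectorizer": "multi2vec-bind",
    "moduleConfig": {
        "multi2vec-bind": {
            "textFields": ["prompt"],
            "imageFields": ["preview"],   # base64-encoded preview render of the 3D model
            "audioFields": ["music"],     # base64-encoded companion music clip
        }
    },
    "properties": [
        {"name": "prompt", "dataType": ["text"]},
        {"name": "preview", "dataType": ["blob"]},
        {"name": "music", "dataType": ["blob"]},
        {"name": "modelUrl", "dataType": ["text"]},
    ],
}
client.schema.create_class(artwork_class)

# Keyword search across all modalities: "salsa" should surface both the
# salsa bowl model and the salsa music snippet.
results = (
    client.query.get("Artwork", ["prompt", "modelUrl"])
    .with_near_text({"concepts": ["salsa"]})
    .with_limit(3)
    .do()
)
print(results)
```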
As for the front-end, Avi Rao and Yvonne worked hard to get an 8th Wall application up and running. 8th Wall killed two birds with one stone: it has web3 integrated, and it also supports mixed reality applications. I was scrambling both to generate assets with Meshy and the audio model and to get the Cloud Functions up and running so we could wire the front-end and the back-end together. It's a recurring problem that Google Cloud Functions throw CORS errors when the front-end tries to call them, even when I try to enable CORS in all the known ways.
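For completeness, here is the CORS handling pattern we kept fighting with, sketched as a Python HTTP Cloud Function; the function name and payload are illustrative:

```python
import json
import functions_framework

CORS_HEADERS = {
    "Access-Control-Allow-Origin": "*",            # or the 8th Wall app's origin
    "Access-Control-Allow-Methods": "POST, OPTIONS",
    "Access-Control-Allow-Headers": "Content-Type",
    "Access-Control-Max-Age": "3600",
}

@functions_framework.http
def generate_asset(request):
    """HTTP Cloud Function the front-end calls to kick off asset generation."""
    if request.method == "OPTIONS":
        # Answer the browser's CORS preflight before doing any work.
        return ("", 204, CORS_HEADERS)

    prompt = (request.get_json(silent=True) or {}).get("prompt", "")
    # ... call the generation back-end here ...
    body = json.dumps({"status": "accepted", "prompt": prompt})
    return (body, 200, {**CORS_HEADERS, "Content-Type": "application/json"})
```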
The NFT minting portion of the application would have been done with NFTPort, a blockchain NFT management API meant to be as handy for NFTs as Stripe is for fiat transactions. We didn't have time to get that part integrated, but I'm super proud of what the team accomplished. We also had a Weaviate 3D logo presented as an augmented reality object in 8th Wall. During our pitch, the sub-teams took turns explaining the details of our unique project.
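Had we gotten to the minting part, the call would have looked roughly like the sketch below; the NFTPort endpoint, chain, and field names are assumptions from memory rather than verified documentation:

```python
import os
import requests

NFTPORT_EASY_MINT = "https://api.nftport.xyz/v0/mints/easy/urls"  # assumed endpoint

def mint_artwork(name: str, description: str, file_url: str, wallet: str) -> dict:
    """Mint a generated artwork to the user's wallet via an NFTPort-style easy-mint call."""
    response = requests.post(
        NFTPORT_EASY_MINT,
        headers={"Authorization": os.environ["NFTPORT_API_KEY"]},
        json={
            "chain": "polygon",            # a testnet would be the sensible hackathon choice
            "name": name,
            "description": description,
            "file_url": file_url,          # e.g. the generated .glb or a preview render
            "mint_to_address": wallet,
        },
        timeout=30,
    )
    return response.json()
```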
The conference was awesome as well. I ended up participating in a presentation challenge and made friends with established presenters such as Cal Al-Dhudhib, other distinguished speakers, and the conference organizers. I would be happy to return in 2024. I must emphasize that all of my teammates showed such professionalism and dedication that I would, hands down, work with them again any time in the future.