Personalization is what I missed at the Samsung NextGen AI hackathon

Sep 25, 2023

The title of this post is meant to rhyme with “Attention is all you need”. After not pursuing multimodal ideas in favor of a business angle at the Intel Innovation Bridge hackathon, I was keen to do something multimodal this time. Samsung organized the Next Gen AI hackathon at the breathtaking, historic Fort Mason Center for Arts & Culture.

Besides the historical buildings, we had a view of Alcatraz, and there was even another tech event in another building.

I started to recruit a team on social media channels, and I cannot remember exactly where from, but Yiru and Coco He expressed interest. The maximum team size was four, and they also knew a mobile engineer, so with that the team was complete. Some weeks before the event our mobile engineer got stuck in China due to COVID, so I managed to recruit Kevin Moore, a GDE (Google Developer Expert) for Android and Flutter, for the front-end role. The team:

  • Yiru and Coco He: product ideation, management, presentation
  • Kevin Moore: front-end
  • Csaba Toth: back-end and Generative AI expertise

I must emphasize before I get into the details of our project: all of my peers showed such professionalism and dedication that I would hands down work with them any time in a work setting.

We had thrown around ideas before the event; I was up for anything, I just wanted something multimodal. Coco and Yiru like to travel, and they were thinking about a travel agent, something for digital nomads and travelers. They homed in on the problem that we all have trouble organizing our photos, and we’d also like to present them to our friends or to a wider audience on social media.

Coco and Yiru put the product under scrutiny and interviewed some people, and we landed on an agent that can automatically categorize photos and write blog posts and social media entries with very few instructions. For example, the agent can sense which days were sunny vs. cloudy and reflect that arc in a blog post. But primarily, the agent can group photos by keywords and then summarize the experience captured in the matching photos. It is also possible to group by place or date range.

The front-end was an Android mobile application written in Kotlin using Jetpack Compose. This would leave the possibility open to port it to iOS, web, or other platforms via KMP (Kotlin Multiplatform) and CMP (Compose Multiplatform).

As for the back-end, I immediately hit serious roadblocks. I was planning to use Google’s Imagen, an image + text multimodal model offered on Vertex AI. The plan was:

  1. I would use the image multimodal model’s descriptive power to store a very verbose description for each image we encounter. The same model would also generate keywords for each photo. This would be a pre-computation step for the app: each time an image lands in the media storage, a Google Cloud Function would be triggered to perform these two actions. We stored the resulting data in Google BigQuery. (A sketch of this function follows the list.)
  2. Later, the application user would specify the desired date range, and the back-end would present the unified set of keywords for the images in that scope. The user could then pick a keyword, and the blog post would be generated along with the image references.
  3. For the blog post generation step I was using the PaLM 2 LLM (Large Language Model) via a MakerSuite (now Google AI Studio) API key, because it was so much easier to set up and use than the Vertex AI Model Garden route (also sketched below).
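To make the pre-computation step concrete, here is a minimal sketch of what such a Cloud Function could look like. This is not our exact code: `describe_image` and `extract_keywords` are hypothetical placeholders for whichever multimodal model does the captioning (Imagen on Vertex AI, or later Chooch), and the BigQuery table name is made up for illustration.

```python
# Minimal sketch of the pre-computation step: a 1st gen Cloud Function
# triggered when a new object lands in the media Cloud Storage bucket.
from datetime import datetime, timezone

from google.cloud import bigquery

BQ_TABLE = "my-project.photo_agent.photo_descriptions"  # hypothetical table

bq_client = bigquery.Client()


def describe_image(gcs_uri: str) -> str:
    """Placeholder: ask the image + text model for a verbose description."""
    raise NotImplementedError


def extract_keywords(gcs_uri: str) -> list[str]:
    """Placeholder: ask the image + text model for keywords."""
    raise NotImplementedError


def on_image_uploaded(event: dict, context) -> None:
    """Background function entry point for the Cloud Storage trigger."""
    gcs_uri = f"gs://{event['bucket']}/{event['name']}"

    row = {
        "image_uri": gcs_uri,
        "description": describe_image(gcs_uri),
        "keywords": extract_keywords(gcs_uri),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

    # Stream the row into BigQuery so it can later be queried by date range and keyword.
    errors = bq_client.insert_rows_json(BQ_TABLE, [row])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```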
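The blog generation step looked roughly like the sketch below, assuming the `google.generativeai` package and the `text-bison-001` PaLM 2 model as the MakerSuite API key exposed them back then; the prompt wording is just an illustration, not what we shipped.

```python
# Sketch of blog post generation with PaLM 2 via a MakerSuite API key.
import os

import google.generativeai as palm

palm.configure(api_key=os.environ["PALM_API_KEY"])


def generate_blog_post(keyword: str, descriptions: list[str]) -> str:
    """Turn the stored image descriptions matching a keyword into a blog post."""
    prompt = (
        f"Write a friendly travel blog post about '{keyword}'.\n"
        "Base it on these photo descriptions, in chronological order:\n"
        + "\n".join(f"- {d}" for d in descriptions)
    )
    completion = palm.generate_text(
        model="models/text-bison-001",
        prompt=prompt,
        temperature=0.7,
        max_output_tokens=1024,
    )
    return completion.result
```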

So here was the foundational problem, and how I tried to work around it:

  • Imagen was extremely terse when describing our photos, no matter how much prompt engineering I tried.
  • Then I thought I would use an open LLaVA (LLaMA 2-based) model. I looked around quickly on Hugging Face, Replicate, and AnyScale, but was not able to find a model that I could productionize and call as an API. One of the main problems was how to prompt such a model: at that time Hugging Face’s transformers library did not support it natively.
  • There were some AWS engineers on-site, and I thought I might utilize SageMaker; however, it turned out that I didn’t have quota for good enough GPU instances. The GPU shortage has since eased, but back then my request was approved too late for the hackathon.
  • I tried to reach out to Intel engineers because their LLaVA LLaMA 2 was running super fast on Intel Cloud; they optimized everything for Intel Max server GPUs and Habana Gaudi2 accelerators. However, I was not able to make the connection during the weekend.
  • Finally, the AWS engineers pointed me to an image + text Generative AI service called Chooch, which was able to describe the images with the level of detail we needed.
  • I should also mention that I communicated Imagen’s terse behavior to Google on multiple channels, and the newer Google Gemini models overcome this terseness.

Coco and Yiru did an excellent job preparing our pitches and managing our progress. We didn’t know it beforehand, but the evaluation didn’t put that much weight on the project pitches. We didn’t make it to the finalist round, but I gained extremely valuable input at the food cart social as the event was closing. I wanted to pick the brain of Arthur Soroken (MakerSuite belonged to him at Google, so he was happy to hear I was using PaLM 2) about what we could have done better. He pointed out that we were missing a personalized experience. At first I didn’t understand, since we all work off of very personal images. But then I realized that essentially we were just missing RAG (Retrieval Augmented Generation).

Here is an example: let’s imagine Yiru was at Stonehenge ten years ago, but the weather was very cloudy and rainy. When she revisits Stonehenge ten years later, our software recognizes that she was there before, and the generated blog post could reflect on that: “After 10 years, fortunately you had a perfect sunny day at the historic site” (or similar). To implement that, we could have added embedding of the extra information, vector database indexing, and RAG. So when I prepared the prompt for the blog generation, we would also stuff in the relevant retrieved experiences from the past. RAG is almost standard these days, but I had a bit of tunnel vision while scrambling to make the other parts work. Even right there, speaking with Arthur, I realized what great advice he had given, and I’ll always remember it: personalization!
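In code, the missing piece could have been as small as the sketch below: embed the stored descriptions, retrieve the ones closest to the current trip, and stuff them into the prompt. The `embed_text` helper is a hypothetical placeholder for whatever embedding model one would pick, and plain cosine similarity stands in for a proper vector database.

```python
# Minimal RAG sketch: retrieve past experiences similar to the current trip
# and prepend them to the blog generation prompt.
import numpy as np


def embed_text(text: str) -> np.ndarray:
    """Placeholder: return an embedding vector for the text."""
    raise NotImplementedError


def retrieve_similar(query: str, past_descriptions: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k stored descriptions most similar to the query."""
    q = embed_text(query)
    q = q / np.linalg.norm(q)
    scored = []
    for desc in past_descriptions:
        v = embed_text(desc)
        scored.append((float(np.dot(q, v / np.linalg.norm(v))), desc))
    scored.sort(reverse=True)
    return [desc for _, desc in scored[:top_k]]


def personalized_prompt(current_trip: str, past_descriptions: list[str]) -> str:
    """Stuff the retrieved past experiences into the blog generation prompt."""
    memories = retrieve_similar(current_trip, past_descriptions)
    return (
        "The traveler previously experienced:\n"
        + "\n".join(f"- {m}" for m in memories)
        + f"\n\nNow write a blog post about: {current_trip}\n"
        "Reference the past experiences where relevant (for example, weather differences)."
    )
```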

It was interesting to see the other teams’ diverse projects. One that captured my attention was by two Stanford students, who combined art, Augmented Reality, and Generative AI: they presented semi-interactive artwork in an augmented way. Another project fine-tuned on the websites you visited on the fly, so you could converse about their content with the power of LLMs. After almost everyone left, only three attendees and the organizer were still chatting in the hacker room. It was an experience I’ll never forget.

Let me close with Coco and Yiru’s present: Chinese mooncakes. I tasted both red bean paste-filled and lotus seed paste-filled ones, and they were delicious!
