Gen AI & ML
The demand for high-quality product visuals is ever-present in the online marketplace. Balancing the need for speed, budget, and artistic control can be a challenge. Can generative AI automate product imagery while maintaining artistic integrity? This article explores this possibility, using an experiment with ComfyUI to find a balance between automation and artistry.
Can a predefined workflow, tailored to a specific product category, generate diverse images with consistent style and quality?
What does effective automation look like in this context?
Following an initial exploration of home furniture in the previous article, perfume was chosen as the category for this article as a strategic test of automated image generation. Perfume's association with elegance makes it an ideal test case for assessing whether generative AI can capture subtle brand nuances. The article focuses on automating lifestyle shots of perfume bottles in curated settings. The intent is to create specialized workflows for specific product categories while ensuring consistent and aesthetically pleasing results.
To simulate a creative process designing perfume product lifestyle imagery, three assumptions are made:
Desired aesthetics, such as soft lighting, rich colors, and a sense of depth, can be predefined.
The composition keeps the product centered, with accent elements drawn from the inspiration imagery.
Thematic backdrops are aligned with the perfume's scent profile. In this research, I defined two backdrop themes: floral and beach.
The primary automation goals are:
Creative Direction Input: A user-friendly interface for providing visual feedback and adjustments.
Consistency: A unified visual style across all product images.
Scalability and Efficiency: Easy scaling of image production to meet demands.
Here is how the automation workflow breaks down into four parts:
User-Friendly Front-End UI: The front-end UI prioritizes simplicity and ease of use - a simple webpage with a chat interface, where users upload a product image and enter a text prompt to select the background theme, Floral or Beach.
Automation Functions: These connect the front-end UI with the ComfyUI backend running on a remote server (see the sketch after this breakdown).
ComfyUI Backend Mechanics: A remote server configured with ComfyUI and its dependencies. Automated processes include image masking, prompt-based background theme generation, compositing, relighting, and color correction.
Output Retrieval: Generated images are automatically saved to a designated location, e.g. Google Drive.
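To make the middle layer concrete, here is a minimal sketch of how an automation function might queue a job on ComfyUI's HTTP API, assuming a workflow exported in API format; the server address, node id, workflow file, and prompt text are placeholders rather than the actual project configuration.

```python
import json
import urllib.request

COMFYUI_URL = "http://127.0.0.1:8188"  # placeholder address of the remote ComfyUI server

def queue_workflow(workflow_path: str, theme_prompt: str) -> str:
    """Load an API-format workflow JSON, inject the theme prompt, and queue it on ComfyUI."""
    with open(workflow_path) as f:
        workflow = json.load(f)

    # Node id "6" is a placeholder for the CLIPTextEncode node that holds the background
    # prompt; the actual id depends on the exported workflow.
    workflow["6"]["inputs"]["text"] = theme_prompt

    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(f"{COMFYUI_URL}/prompt", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["prompt_id"]

# Example: queue the floral backdrop for an uploaded product image.
prompt_id = queue_workflow("perfume_workflow_api.json",
                           "perfume bottle on a marble table, soft floral backdrop, warm light")
print("Queued ComfyUI job:", prompt_id)
```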
The results demonstrate the automation potential, with the web user interface communicating with a local server and output saved to Google Drive. They also validated the process's efficiency, with rendering times of 30-120 seconds per image on a GPU (RTX 3070 or better). Image quality and theme consistency are demonstrated below.
Recently I've been experimenting with FLUX.1 Tools and the FLUX Pro Fine-Tuning API to explore how generative AI can streamline product image creation. This led me to two key questions:
What might generative AI content automation look like in practice if only API calls are used?
How far can a single product image be transformed into compelling lifestyle imagery (in contrast to LoRA fine-tuning, which requires about 20 images)?
Inspired by Anthropic's blog post about Agentic Workflows, I hypothesized that structured, step-by-step generation—using techniques like prompt chaining—could lead to more predictable and controlled results.
🔹 Baseline: Single-shot approach - Quick but inconsistent, with AI hallucinations affecting product representation.
🔹 Experimental Condition: Multi-step, agentic-like workflow with human intervention - a structured process produced better outcomes (sketched below):
1. Focused only on background inference.
2. Added over-the-table elements as a separate pass.
3. Used Fill (outpainting) to expand the composition dynamically.
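To illustrate the prompt-chaining structure, here is a minimal sketch of the three-step flow. The API URL, endpoint names, payload and response fields, credentials, and file paths are all placeholders, not the actual FLUX provider interface; the point is the chain, where each step consumes the previous step's output together with a narrower prompt.

```python
import base64
from pathlib import Path
import requests

API_URL = "https://api.example.com/flux"   # placeholder; substitute the real provider endpoint
API_KEY = "YOUR_API_KEY"                   # placeholder credential

def flux_step(endpoint: str, payload: dict) -> bytes:
    """Hypothetical helper: call one generation endpoint and return the resulting image bytes."""
    resp = requests.post(f"{API_URL}/{endpoint}", json=payload,
                         headers={"Authorization": f"Bearer {API_KEY}"}, timeout=120)
    resp.raise_for_status()
    return base64.b64decode(resp.json()["image"])   # response field name is an assumption

def encode(image_bytes: bytes) -> str:
    return base64.b64encode(image_bytes).decode()

product = Path("perfume.png").read_bytes()          # placeholder product image

step1 = flux_step("fill", {                         # 1. infer only the background around the product
    "image": encode(product),
    "prompt": "soft floral backdrop, studio lighting",
})
step2 = flux_step("fill", {                         # 2. add over-the-table elements as a separate pass
    "image": encode(step1),
    "prompt": "scattered rose petals on the table surface",
})
step3 = flux_step("expand", {                       # 3. outpaint to widen the composition
    "image": encode(step2),
    "prompt": "extend the scene, matching lighting and palette",
})
Path("lifestyle_shot.png").write_bytes(step3)
```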
Although this simplistic approach would benefit from additional processes (compositing, relighting, color correction, etc.), it demonstrates that automating sequential workflows can improve consistency, usability, and creative control when generating AI-driven content.
The experiment deliberately focused on utilizing only APIs to showcase the workflow's capabilities. While this approach demonstrates the potential of automation, it also reveals that certain nuanced, creative decision-making is still best achieved with tools like ComfyUI. As such tools continue to evolve, they provide greater flexibility for experimenting and refining AI-driven content generation. Once workflows are well-defined, structuring them into sequential steps—with human oversight—enhances reliability, ensuring a balance between automation and creative control.
How do you see AI shaping creative workflows in your field? What key considerations do you find most important? Would love to hear your thoughts!
This Retrieval-Augmented Generation (RAG) chatbot focuses on the Americans with Disabilities Act (ADA) standards, with the goal of providing precise, actionable guidance on ADA design standards.
The ADA-RAG chatbot is designed specifically to field queries related to ADA standards, delivering exact sections, chapters, and bullet points that guide architects, designers, and compliance officers. This tool does not stray from its primary function; if asked about unrelated topics, it maintains focus by not providing an answer, emphasizing its specialized nature. The ADA-RAG chatbot stands as a prime example of how targeted technological solutions can streamline complex information retrieval, making compliance more straightforward and accessible.
Technical Specifications
RAG Process: data loading, splitting, embedding, and storing (see the indexing sketch after this list).
Hosting: VM instance in the cloud for dependable access and scalability.
Environment: Dockerized image on Linux for reliability and easy updates.
LLM: Google Gemini for the LLM and text embeddings.
Database: Pinecone vector store.
Debugging and Monitoring: LangSmith
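As a rough illustration of the RAG process above, here is a minimal indexing-and-retrieval sketch assuming LangChain with the Gemini and Pinecone integrations; the package names, embedding model id, chunking parameters, PDF path, and index name are assumptions rather than the production configuration.

```python
# Assumes GOOGLE_API_KEY and PINECONE_API_KEY are set in the environment.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# 1. Load: pull the ADA standards document (hypothetical local PDF path).
docs = PyPDFLoader("ada_standards.pdf").load()

# 2. Split: chunk the text so each retrieved passage maps to a section or bullet point.
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150).split_documents(docs)

# 3. Embed + 4. Store: embed with Gemini and push the vectors into a Pinecone index.
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vectorstore = PineconeVectorStore.from_documents(chunks, embeddings, index_name="ada-standards")

# Retrieval: fetch the most relevant sections for a compliance question.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
print(retriever.invoke("What is the minimum clear width for an accessible route?"))
```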
Generative AI is opening new avenues in eCommerce, transforming content creation to captivate and inspire customers. My latest experiment explores how varied product imagery can enhance visualization—recognizing it as a key driver for KPIs like conversion rates and average order value.
In this experiment, I trained a Low-Rank Adaptation (LoRA) model with Flux.1-Dev using about 20 reference images of a puffer bag, an accessory with distinct appeal. The resulting photorealistic images showcase the product’s charm across diverse lifestyle settings, suggesting new possibilities for engaging customers in dynamic ways. These visuals highlight how AI innovation can align seamlessly with content needs in eCommerce.
This exploration supports the idea that AI can be applied effectively to deliver inspiring outcomes and shape future business strategy.
Building on my last experiment with conversational AI on Apple Vision Pro (https://lnkd.in/gF6BBENE), I've now enhanced the prototype to include multimodal inputs. Watch the examples in this video.
The latest iteration integrates cutting-edge features such as speech-to-text (STT), Voice Activity Detection (VAD), real-time camera image capture ("See what I see"), and advanced text-to-speech (TTS). These combined techniques create a bridge between human interaction and AI functionality, resulting in a more intuitive, seamless, and natural experience. From consulting with a virtual interior designer to asking for dinner recipe suggestions, translating foreign languages in real time, or getting instant tips, everyday tasks become more accessible through enriched interactions that understand both your voice and visual context.
This advancement underscores AI's power to create new possibilities for simplifying our daily routines and making life more delightful.
Where else do you see this technology having a transformative impact on our lives?
Generative AI continues to amaze me! I’ve been experimenting a bit for a few years, and the pace of evolution is astonishing.
Back in 2021, I immersed myself in Pix2Pix and StyleGAN2-ADA, training models on thousands of carefully crafted image pairs to identify edge contours and color blocks, transforming them into vivid imagery. Inspired by Egon Schiele's bold lines and emotive portraits, I used some of his sketches as my experimental canvas. Despite long cloud-based training sessions (72+ hours!) and unpredictable results, the potential was clear.
Now, in 2024, I've resumed this journey using a local setup with pre-trained model checkpoints like Dreamshaper (SD1.5), Epicrealism (SD1.5), and Flux 1.1. The difference is extraordinary! Running complex generative tasks locally on my laptop feels surreal compared to just a few years ago. See the video below.
Next on my horizon is exploring ControlNet and fine-tuning pre-trained models with my own custom materials.
After months of grappling with a growing collection of personal notes, I’m thrilled to introduce a solution that significantly eases the burden: the RAG Notebook Analyzer. This tool employs Retrieval-Augmented Generation technology and offers flexible data processing either offline on your local machine or online using APIs like Gemini or OpenAI.
Inspiration from Personal Challenges
My journey into effective Personal Knowledge Management (PKM) has been fraught with challenges. The primary issues were the management of overlapping topics and the vast quantity of notes that made traditional retrieval methods cumbersome. Coupled with significant concerns about data privacy, these challenges necessitated a new approach.
Introducing the RAG Notebook Analyzer
Determined to solve these problems, I developed the RAG Notebook Analyzer. Here’s a breakdown of the solution:
Flexible Data Processing: Choose between offline processing for privacy or online for efficiency. Our tool accommodates over 300 documents in Markdown and PDF formats.
Enhanced Data Handling: Documents are intelligently split and merged for deep analysis. This setup supports a wide array of document types and large collection sizes.
Customizable User Experience: Users can select from various LLMs and adjust search parameters for tailored output. The entire system is framed in an intuitive app interface using Streamlit, making it accessible and interactive (see the sketch after this list).
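For a sense of how these options could surface in the UI, here is a minimal Streamlit sketch; the widget labels, model names, and parameter ranges are illustrative assumptions rather than the app's actual settings.

```python
import streamlit as st

st.title("RAG Notebook Analyzer")

with st.sidebar:
    mode = st.radio("Processing mode", ["Offline (local)", "Online (API)"])
    llm_choice = st.selectbox("LLM", ["Gemini", "OpenAI", "Local model"])
    top_k = st.slider("Retrieved chunks (k)", min_value=1, max_value=10, value=4)
    chunk_size = st.slider("Chunk size", min_value=200, max_value=2000, value=1000, step=100)

query = st.text_input("Ask your notes a question")
if query:
    # Placeholder for the actual retrieval + generation call into the RAG pipeline.
    st.write(f"Would query {llm_choice} in '{mode}' mode with k={top_k} and chunk_size={chunk_size}.")
```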
Results and Broader Impact
This personalized system isn't just an organizational tool—it transforms the way I interact with my information, making note management not only manageable but also enjoyable. Moreover, the implications of this technology extend far beyond personal use.
The RAG Notebook Analyzer holds immense potential for professionals and teams. Imagine the efficiency gains for legal professionals analyzing legal documents, enterprise teams collaborating on projects, or researchers sifting through vast datasets.
Future enhancements I aim to incorporate:
Inclusion of Multimodal Documents: Expanding to process images, videos, and other non-textual data.
Wider Applications: This technology is ripe for adaptation in sectors like legal documentation, enterprise knowledge management, clinical support systems, and customer service.
Curious about uncovering deeper insights through data visualization?
I recently built a tiny Streamlit app designed to make it easy to compare multiple data sources and analyze trends and hidden patterns. I picked the stock market example given the rich data sources available as input. It's not financial advice, but rather a fun and simple way to explore data visualization! The line graph approach used here isn't limited to stock data; it can be applied to a variety of fields, for instance comparing year-over-year (YoY) sales trends or tracking the effectiveness of different marketing campaigns. Line graphs are powerful for identifying both immediate patterns and long-term shifts.
The key goals for the app:
- Define custom time frames and data input frequencies
- Overlay multiple datasets to identify potential correlations or trends
- Visualize moving averages to compare short-term fluctuations with long-term patterns
Here’s a snapshot of what you can explore: How do different stocks respond to market events? What trends emerge over a week, a month, or even a year? Can layering the data reveal hidden correlations? And what insights can moving averages provide about short-term versus long-term movements?
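Here is a minimal sketch of how such an overlay-and-moving-average view could be built, assuming pandas, Streamlit, and the yfinance package for example price data; the tickers, dates, and window sizes are illustrative only.

```python
import pandas as pd
import streamlit as st
import yfinance as yf

tickers = st.multiselect("Datasets to overlay", ["AAPL", "MSFT", "GOOG"], default=["AAPL", "MSFT"])
start = st.date_input("Start date", pd.to_datetime("2023-01-01"))
window = st.slider("Moving-average window (days)", 5, 90, 30)

if tickers:
    prices = yf.download(tickers, start=start)["Close"]   # one price column per ticker
    normalized = prices / prices.iloc[0]                  # rebase so trends are comparable
    smoothed = normalized.rolling(window).mean()          # long-term pattern vs. daily noise

    st.line_chart(normalized)                             # raw overlaid series
    st.line_chart(smoothed)                               # moving averages
```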
How do you think data visualization can bring more insight and context to complex datasets? I'd love to hear your thoughts!
This is an exploration of a conversational AI prototype using OpenAI APIs and SwiftUI on Apple Vision Pro. The goal is to understand the potential of spatial computing to create truly immersive AI interactions. 🗣️💬🤖
The experiment focused on defining different AI agent roles (e.g. Philosopher, Therapist, Interior Designer) and integrating user voice input with AI agent voice output. The goal? To create an intuitive and natural user dialogue experience within the immersive environment of Apple Vision Pro.
Imagine This:
You're in the middle of a design project, seeking inspiration. With a simple voice command, you invite a virtual interior designer into your workspace.
You find yourself in a thought-provoking conversation with a philosophy professor, walking alongside you in the digital world.
Or, feeling overwhelmed, you call upon a virtual therapist. Their calming voice appears beside you as you walk through a peaceful forest setting, offering support and guidance.
Key Features
Expert Friends: Need design help? Chat with the Interior Designer. Feeling overwhelmed? Talk to the Therapist. Want to ponder life? Have a conversation with the Philosopher Professor.
Your Voice is Key: Simply speak your thoughts – no typing needed! The AI listens naturally.
AI that Gets You: You can control the amount of content the AI agent generates, making the conversation perfect for your needs.
Immersive Voice: The AI's text comes to life with natural-sounding voice output.
Your Chat Window: You can easily see chat history, double-check your input, and refer back to any useful information later.
You're in Control: A simple stop button lets you pause or end the AI's voice output.
This is just the beginning of the possibilities spatial computing offers for conversational AI. I'm eager to continue exploring multi-modal input, more natural voice models, and the full potential of spatial computing for collaborative AI.
Introducing a Python AI web app that transcribes audio/video meeting notes, translates between languages, and extracts insights from various file types!
Challenges:
As a knowledge worker, capturing notes during meetings or while learning can be time-consuming and distracting. Sometimes, I’m so focused on taking notes that I miss key parts of the conversation. Other times, I’m in a rush, and my handwritten notes are barely legible!
Solution: A Python Streamlit app
I explored several AI-driven solutions to address these issues, focusing on:
- Audio and video transcription, covering both practical and economical approaches (see the sketch after this list).
- Speech-to-text solutions, comparing cloud-based services and local implementations.
- Direct transcription from video files, eliminating the need for manual audio extraction.
- Identifying and annotating different speakers and marking timestamps.
- Multilingual transcription and translation, allowing for processing materials in various languages.
- Expanding beyond video and audio, looking into extracting insights from images, text files, and PDFs.
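As one example of the local route, here is a minimal transcription sketch using the open-source whisper package; the model size and file name are assumptions, and the cloud-based, diarization, and document-processing options explored are not shown.

```python
import whisper

model = whisper.load_model("base")                 # small, CPU-friendly checkpoint
# Whisper hands video files to ffmpeg directly, so no manual audio extraction is needed.
# task="translate" would instead translate non-English speech into English.
result = model.transcribe("meeting_recording.mp4", task="transcribe")

print(result["language"])                          # detected language
for segment in result["segments"]:                 # timestamped segments for annotation
    print(f"[{segment['start']:7.2f}s - {segment['end']:7.2f}s] {segment['text']}")
```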
Outcome:
A Python Streamlit web app tailored to my workflow that handles video and audio transcriptions, supports multilingual translations, and processes multimodal information—everything from text files to images and PDFs.
The multi-modal LLM prototype I developed aims to revolutionize learning through AI integration.
Features
Options to select LLM models
Adjustable LLM temperature (trading predictability against randomness)
YouTube video visualizer
Summarized notes
(new) Time codes for easier reference
👉 Check out the prototype code here: https://lnkd.in/g8tbwR9w
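Here is a rough sketch of how such a summary could be produced, assuming the youtube-transcript-api package and Google's google-generativeai SDK; the API key, model name, and video id are placeholders, and the actual prototype may be wired differently.

```python
import google.generativeai as genai
from youtube_transcript_api import YouTubeTranscriptApi

genai.configure(api_key="YOUR_GOOGLE_API_KEY")     # placeholder credential
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

video_id = "YOUR_VIDEO_ID"                         # placeholder YouTube video id
transcript = YouTubeTranscriptApi.get_transcript(video_id)
text = " ".join(f"[{int(s['start'])}s] {s['text']}" for s in transcript)

response = model.generate_content(
    "Summarize the key points of this video and cite the time codes you relied on:\n\n" + text
)
print(response.text)
```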
Why This Matters
In today's era of knowledge accessibility, platforms like YouTube offer a wealth of information. However, navigating through vast content can be daunting. LLMs like the one in this prototype play a vital role by summarizing lengthy videos, enabling quick understanding of key points.
This project showcases the impact of LLMs, specifically Google's multimodal LLM Gemini, in streamlining online learning. Easily summarize YouTube videos to make informed choices on what to explore further, optimizing your learning experience.
Conclusion
This prototype underscores the transformative potential of LLMs in reshaping online education. By embracing these tools, we enhance time management and content engagement. This project reflects the innovation and engineering excellence essential in driving meaningful solutions within the AI landscape.
It's exciting to complete the certificate for the AI Product Management specialization from Duke University! This program included three comprehensive courses: Machine Learning Foundations for Product Managers, Managing Machine Learning Projects, and Human Factors in AI.
This specialization provided a well-rounded mix of practical and academic insights into managing the design and development of machine learning products, focusing on best practices, AI project leadership, and building human-centered AI solutions that prioritize privacy and ethics.
This learning experience helped me bridge the gap between theoretical knowledge and real-world AI product work, particularly in applying systematic frameworks and collaborating with data science and AI experts. I’ve gained fresh perspectives on refining my approach, making me more intentional and strategic in the AI product development process by leveraging proven frameworks and best practices.
With this enhanced skill set and newfound perspectives, I’m excited to further navigate the rapidly evolving field of AI product management!
For those interested, here are my capstone projects:
Experimenting with NVIDIA's Instant-ngp to train a NeRF model in seconds.
ML Model: NVlabs Instant-ngp
Environment: Windows 10 Anaconda
To experiment with Instant-ngp, I recorded a 20-second, 1080p video of a toy car with a mobile phone. After preparing my own NeRF dataset from the video clip, I started the interactive training and rendering in the UI. I am blown away by the results and by how researchers have accelerated NeRF training from hours down to seconds! With the ability to view NeRFs in real time and to generate 3D geometry outputs, NeRF points to an exciting future for AI and 3D visualization.
This is an experiment on generating photorealistic synthetic human faces using StyleGAN2-ADA. Here's the training result in video format.
Dataset: Flickr-Faces-HQ Dataset (FFHQ)
ML Model: StyleGAN2-ada
Environment: Google Colab Pro
To train the model, I used the first 6,000 images at 1K resolution. With the trained model, I projected images into the latent space. The result is a progression of synthetic faces that share similar visual landmarks and features, and even glasses and hair styles!
Exploring ML for Shoe Design using StyleGAN2-ADA. Here's the training result in video format.
Dataset: Shoes dataset from Kaggle with 7,000 images
ML Model: StyleGAN2-ada
Environment: Google Colab Pro
My hypothesis was that ML could assist and inspire the product design process using the StyleGAN2 model. By projecting inspiration images into the latent space, the following videos show the meandering progression.
First, using a regular shoe as the inspiration (left) to see what variations of the target images (right) the ML model could generate.
Let's try something different. How about a red fox?
What about a cat?
Experimentation of Neural Style Transfer with TensorFlow on Colab
Environment: TensorFlow 2.0 on Colab
An experiment in art style transfer, or pastiche: transforming an image with any selected artistic style by pairing original content (the source) with an art piece (the influence).
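One lightweight way to reproduce this kind of style transfer today is the pre-trained arbitrary-image-stylization model on TensorFlow Hub; the sketch below assumes that model and local image paths, and may differ from the optimization-based tutorial approach used in the original experiment.

```python
import numpy as np
import PIL.Image
import tensorflow as tf
import tensorflow_hub as hub

def load_image(path, size=(512, 512)):
    # Read an image file into a float32 batch of shape [1, H, W, 3] in [0, 1]
    # (fixed size for simplicity; this distorts the aspect ratio).
    img = tf.io.read_file(path)
    img = tf.image.decode_image(img, channels=3, dtype=tf.float32)
    img = tf.image.resize(img, size)
    return img[tf.newaxis, :]

# Pre-trained fast style transfer model from TF Hub.
hub_model = hub.load("https://tfhub.dev/google/magenta/arbitrary-image-stylization-v1-256/2")

content = load_image("content.jpg")   # source photo (hypothetical path)
style = load_image("style.jpg")       # art piece used as the influence (hypothetical path)

stylized = hub_model(tf.constant(content), tf.constant(style))[0]
PIL.Image.fromarray(np.uint8(stylized[0].numpy() * 255)).save("stylized.jpg")
```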
This is an exploration of using TensorFlow with a pre-trained library, deployed through Android Studio to an Android mobile device.
Environment: Android Studio, Android Device