Showing 50 open source projects for "visual framework"

  • 1
    1D Visual Tokenization and Generation

    This repo contains the code for 1D tokenizer and generator

    The 1D Visual Tokenization and Generation project from ByteDance introduces a “one-dimensional” tokenizer for images. Instead of representing an image as a large grid of 2D tokens, as many prior generative and image-modeling systems do, it compresses the image into as few as 32 discrete tokens (more are optional). The resulting compact representation speeds up generation and reconstruction while retaining strong fidelity.
    Downloads: 0 This Week
  • 2
    InternGPT

    Open source demo platform where you can easily showcase your AI models

    ...The framework connects multiple specialized AI models that perform tasks such as object detection, segmentation, captioning, and visual editing while coordinating them through a central conversational interface. This architecture enables the system to plan actions, execute visual operations, and return results in a coherent dialogue with the user.
    Downloads: 0 This Week
  • 3
    Self-Operating Computer

    A framework to enable multimodal models to operate a computer

    ...Notably, it was the first known project to implement a multimodal model capable of viewing and controlling a computer screen. The framework supports features like Optical Character Recognition (OCR) and Set-of-Mark (SoM) prompting to enhance visual grounding capabilities. It is designed to be compatible with macOS, Windows, and Linux (with X server installed), and is released under the MIT license.
    Downloads: 8 This Week
  • 4
    InternLM-XComposer-2.5

    InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System

    ...It incorporates visual understanding modules that allow the model to analyze images and integrate them into coherent narrative outputs. The framework also supports tasks such as image captioning, multimodal reasoning, and layout generation for structured visual documents. By combining language generation with visual composition capabilities, the system enables new forms of content creation that integrate written explanations with automatically generated visual components.
    Downloads: 0 This Week
  • 5
    Skywork-R1V4

    Skywork-R1V is an advanced multimodal AI model series

    Skywork-R1V is an open-source multimodal reasoning model designed to extend the capabilities of large language models into vision-language tasks that require complex logical reasoning. The project introduces a model architecture that transfers the reasoning abilities of advanced text-based models into visual domains so the system can interpret images and perform multi-step reasoning about them. Instead of retraining both language and vision models from scratch, the framework uses a lightweight visual projection layer that connects a pretrained vision backbone with a reasoning-capable language model. This design allows the model to analyze images while maintaining strong textual reasoning performance, enabling tasks such as solving visual math problems, interpreting scientific diagrams, and answering questions about images.
    Downloads: 0 This Week
  • 6
    R1-V

    Witness the aha moment of VLM with less than $3

    R1-V is an initiative aimed at enhancing the generalization capabilities of Vision-Language Models (VLMs) through Reinforcement Learning with Verifiable Rewards (RLVR). The project focuses on building a comprehensive framework that emphasizes algorithm enhancement, efficiency optimization, and task diversity to achieve general vision-language intelligence and visual/GUI agents. The team's long-term goal is to contribute impactful open-source research in this domain.
    Downloads: 0 This Week
  • 7
    DriveLM

    Driving with Graph Visual Question Answering

    DriveLM is a research-oriented framework and dataset designed to explore how vision-language models can be integrated into autonomous driving systems. The project introduces a new paradigm called graph visual question answering that structures reasoning about driving scenes through interconnected tasks such as perception, prediction, planning, and motion control. Instead of treating autonomous driving as a purely sensor-driven pipeline, DriveLM frames it as a reasoning problem where models answer structured questions about the environment to guide decision making. ...
    Downloads: 1 This Week
  • 8
    UFO³

    Weaving the Digital Agent Galaxy

    UFO is an open-source framework developed by Microsoft for building intelligent agents that automate interactions with graphical user interfaces on the Windows operating system. The system allows users to issue natural language instructions that are translated into automated actions across multiple desktop applications. Using a dual-agent architecture, the framework analyzes both visual interface elements and system control structures in order to understand how applications should be manipulated. ...
    Downloads: 0 This Week
  • 9
    Open-AutoGLM

    An open phone agent model & framework

    ...Unlike traditional automation scripts that depend on brittle heuristics, Open-AutoGLM uses pretrained large language and vision-language models to interpret visual context and natural language instructions, giving the agent robust adaptability across apps and interfaces.
    Downloads: 8 This Week
  • 10
    ComfyUI-LTXVideo

    LTX-Video Support for ComfyUI

    ComfyUI-LTXVideo is a bridge between ComfyUI’s node-based generative workflow environment and the LTX-Video multimedia processing framework, enabling creators to orchestrate complex video tasks within a visual graph paradigm. Instead of writing code to apply effects, transitions, edits, and data flows, users can assemble nodes that represent video inputs, transformations, and outputs, letting them prototype and automate video production pipelines visually. This integration empowers non-programmers and rapid-iteration teams to harness the performance of LTX-Video while maintaining the clarity and flexibility of a dataflow graph model. ...
    Downloads: 3 This Week
  • 11
    LlamaGen

    Autoregressive Model Beats Diffusion

    LlamaGen is an open-source research project that introduces a new approach to image generation by applying the autoregressive next-token prediction paradigm used in large language models to visual generation tasks. Instead of relying on diffusion models, the framework treats images as sequences of tokens that can be generated progressively using transformer architectures similar to those used for text generation. The project explores how scaling autoregressive models and improving image tokenization techniques can produce competitive results compared with modern diffusion-based image generators. ...
    Downloads: 0 This Week
  • 12
    LISA

    LISA: Reasoning Segmentation via Large Language Model

    LISA is an open-source multimodal AI system designed to enable language models to perform pixel-level reasoning and segmentation tasks on images. The project introduces a framework where a large language model can interpret natural language instructions and produce segmentation masks that highlight relevant regions in an image. Instead of relying solely on predefined object categories, the model is capable of reasoning about complex textual queries and translating them into visual segmentation outputs. This approach allows the system to identify objects or regions in images based on semantic descriptions, contextual reasoning, and world knowledge. ...
    Downloads: 0 This Week
  • 13
    DINOv3

    Reference PyTorch implementation and models for DINOv3

    DINOv3 is the third-generation iteration of Meta’s self-supervised visual representation learning framework, building upon the ideas from DINO and DINOv2. It continues the paradigm of learning strong image representations without labels using teacher–student distillation, but introduces a simplified and more scalable training recipe that performs well across datasets and architectures. DINOv3 removes the need for complex augmentations or momentum encoders, streamlining the pipeline while maintaining or improving feature quality. ...
    Downloads: 18 This Week
  • 14
    firerpa LAMDA

    The most powerful Android RPA agent framework

    lamda is an Android RPA agent framework that provides visual remote desktop control and automation at scale, geared toward testing, automation validation, and device management. It exposes a clean UI to monitor and interact with connected devices and includes tooling to script actions reliably across apps and OS versions. The project emphasizes low-friction setup and powerful control primitives so teams can move from interactive validation to repeatable automation.
    Downloads: 1 This Week
  • 15
    PaperBanana

    Extension of Google Research’s PaperBanana

    PaperBanana is an open-source agentic framework designed to automatically generate publication-quality academic diagrams and statistical plots directly from text descriptions. The project focuses on helping researchers, educators, and data scientists transform conceptual descriptions of figures into structured visual outputs suitable for research papers, presentations, and technical reports.
    Downloads: 0 This Week
  • 16
    ViMax

    Director, Screenwriter, Producer, and Video Generator All-in-One

    ViMax is an open-source framework for performing large-scale multi-modal vision-language modeling and reasoning by combining powerful image encoders with advanced language models to solve complex visual tasks. It integrates components like visual encoders, cross-modal fusion techniques, and reasoning modules so that users can go beyond simple captioning or classification to perform tasks such as visual question answering, multi-image inference, and structured scene understanding. ...
    Downloads: 0 This Week
  • 17
    Janus

    Unified Multimodal Understanding and Generation Models

    Janus is a sophisticated open-source project from DeepSeek AI that aims to unify both visual understanding and image generation in a single model architecture. Rather than having separate systems for “look and describe” and “prompt and generate”, Janus uses an autoregressive transformer framework with a decoupled visual encoder—allowing it to ingest images for comprehension and to produce images from text prompts with shared internal representations.
    Downloads: 1 This Week
  • 18
    Agent S

    Agent S: an open agentic framework that uses computers like a human

    ...Agent S combines powerful foundation models (such as GPT-5) with grounding models like UI-TARS to translate visual inputs into precise executable actions. It supports flexible deployment via CLI, SDK, or cloud, and integrates with multiple model providers including OpenAI, Anthropic, Gemini, Azure, and Hugging Face endpoints. With optional local code execution, reflection mechanisms, and compositional planning, Agent S provides a scalable and research-driven framework for building advanced computer-use agents.
    Downloads: 3 This Week
  • 19
    HunyuanWorld 1.0

    Generating Immersive, Explorable, and Interactive 3D Worlds

    ...HunyuanWorld-1.0 surpasses existing open-source methods in visual quality and geometric consistency, demonstrated by superior scores in BRISQUE, NIQE, Q-Align, and CLIP metrics.
    Downloads: 14 This Week
  • 20
    VideoRAG

    VideoRAG: Chat with Your Videos

    VideoRAG is a retrieval-augmented generation (RAG) framework tailored for video content that enables AI systems to answer questions, summarize, and reason over long videos by combining visual embeddings with contextual search. The system works by first breaking video into clips, extracting visual and audio-textual features, and indexing them into embeddings, then using an LLM with a retriever to pull relevant segments on demand.
    Downloads: 0 This Week
  • 21
    ManiSkill

    SAPIEN Manipulation Skill Framework

    ManiSkill is a benchmark platform for training and evaluating reinforcement learning agents on dexterous manipulation tasks using physics-based simulations. Developed by Hao Su Lab, it focuses on robotic manipulation with diverse, high-quality 3D tasks designed to challenge perception, control, and planning in robotics. ManiSkill provides both low-level control and visual observation spaces for realistic learning scenarios.
    Downloads: 1 This Week
  • 22
    LTX-2

    Python inference and LoRA trainer package for the LTX-2 audio–video

    ...The framework targets both interactive graphical applications and media-rich experiences, making it a solid foundation for games, creative tools, or visualization systems that demand both performance and flexibility. While being low-level, it also provides sensible defaults and helper abstractions that reduce boilerplate and help teams maintain clear, maintainable code.
    Downloads: 85 This Week
  • 23
    AppAgent

    Multimodal Agents as Smartphone Users, an LLM-based multimodal agent

    AppAgent is an open-source multimodal agent framework designed to enable large language models to operate smartphone applications through natural interactions with graphical user interfaces. The system allows an AI agent to interpret visual information from the screen and translate natural language instructions into actions such as tapping, swiping, and navigating between application screens.
    Downloads: 0 This Week
  • 24
    Watermark Anything

    Official implementation of Watermark Anything with Localized Messages

    Watermark Anything (WAM) is an advanced deep learning framework for embedding and detecting localized watermarks in digital images. Developed by Facebook Research, it provides a robust, flexible system that allows users to insert one or multiple watermarks within selected image regions while maintaining visual quality and recoverability. Unlike traditional watermarking methods that rely on uniform embedding, WAM supports spatially localized watermarks, enabling targeted protection of specific image regions or objects. ...
    Downloads: 2 This Week
  • 25
    GLM-4.5V

    GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning

    GLM-4.5V is the preceding iteration in the GLM-V series that laid much of the groundwork for general multimodal reasoning and vision-language understanding. It embodies the design philosophy of mixing visual and textual modalities into a unified model capable of general-purpose reasoning, content understanding, and generation, while already supporting a wide variety of tasks: from image captioning and visual question answering to content recognition, GUI-based agents, video understanding, and long-document interpretation. GLM-4.5V emerged from a training framework that leverages scalable reinforcement learning (with curriculum sampling) to boost performance across tasks ranging from STEM problem solving to long-context reasoning, giving it broad applicability beyond narrow benchmarks. ...
    Downloads: 2 This Week