Multimodal Processing

Medium · Prompt Engineering Challenge

Process multimodal information including text and images

Challenge Description

Multimodal processing is the ability to handle several types of data at once, such as text, images, and audio. In this challenge, you design a prompt that lets the AI interpret input combining multiple modalities and reason across them. Note that images and audio reach the model as text descriptions, as the prompt template shows.

Challenge Goals

Write a prompt that enables the AI to:

  1. Identify each modality present in the input (text, image descriptions, audio descriptions, etc.)
  2. Understand how the modalities relate to and complement one another
  3. Analyze the input as a whole, drawing on every modality
  4. Reason across modalities and state the resulting conclusions
  5. Handle cases where a modality is missing or the modalities contradict each other

Requirements

  • The prompt must handle input that combines multiple modalities
  • The analysis should reflect an integrated understanding of all modalities, not just one
  • When modalities conflict, the output should identify the conflict and resolve it reasonably
  • The output should include an analysis of each modality and an overall conclusion
  • The reasoning should be clear and traceable

Prompt Template

[Your prompt here]

Input Content:
Text: {Text content}
Image Description: {Image description}
Audio Description: {Audio description}
Task: {Specific task}
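As a rough illustration of how the template above might be filled in before it is sent to the model, here is a minimal Python sketch. The function name `build_input`, the `MISSING` placeholder text, and the choice to mark absent modalities explicitly are all assumptions made for this example; the challenge does not specify how its test harness assembles the input.

```python
# Hypothetical helper: assembles the final model input from the challenge template.
TEMPLATE = """{prompt}

Input Content:
Text: {text}
Image Description: {image}
Audio Description: {audio}
Task: {task}"""

# Assumed marker for a modality that was not supplied (not defined by the challenge).
MISSING = "(not provided)"


def build_input(prompt, task, text=None, image=None, audio=None):
    """Fill the template, marking empty or absent modalities explicitly."""

    def fill(value):
        # Treat None and whitespace-only strings as a missing modality.
        return value.strip() if value and value.strip() else MISSING

    return TEMPLATE.format(
        prompt=prompt.strip(),
        text=fill(text),
        image=fill(image),
        audio=fill(audio),
        task=task.strip(),
    )
```

Marking a missing modality explicitly, rather than leaving the field blank, gives the prompt something concrete to react to when it handles incomplete input.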

Automated Testing Process

  1. Your prompt is run against each test case
  2. The output is checked for:
    • A separate analysis of each modality
    • A cross-modal correlation analysis
    • A comprehensive conclusion
  3. The accuracy and depth of the analysis are evaluated
  4. The handling of conflicts is checked for reasonableness
  5. The clarity of the output structure is verified
  6. A final score (out of 10) is calculated
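The structural part of this check (step 2) can be approximated locally while iterating on a prompt. The sketch below is only an assumption about what such a check might look like: the real grader is not documented here, and the heading names are taken from the expected output example rather than from any published rubric. It does not attempt to score accuracy, depth, or conflict handling.

```python
import re

# Assumed section names, based on the headings in the expected output example.
REQUIRED_SECTIONS = {
    "per_modality": r"analysis of each modality",
    "cross_modal": r"cross-modal correlation",
    "conclusion": r"comprehensive conclusion",
}


def check_structure(output: str) -> dict:
    """Report which required sections appear as markdown headings in the output."""
    headings = [
        m.group(1).strip().lower()
        for m in re.finditer(r"^#{1,6}\s*(.+)$", output, re.MULTILINE)
    ]
    return {
        name: any(re.search(pattern, heading) for heading in headings)
        for name, pattern in REQUIRED_SECTIONS.items()
    }
```

A model output that passes this check still needs a manual read for the remaining criteria, since heading presence says nothing about the quality of the analysis.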

Expected Output Example (for Case 1)

Multimodal Analysis Result:

## Analysis of Each Modality
**Text Modality**: The user has a positive evaluation of the phone's camera function, but is not satisfied with the battery life.
**Image Modality**: The photo quality is indeed very high, with vibrant colors and rich details, which confirms the camera's performance.
**Audio Modality**: The camera operation is smooth, and the shutter sound is crisp, indicating that the camera is responsive.

## Cross-modal Correlation
- The image modality supports the positive evaluation of the camera function in the text.
- The audio modality further confirms the smoothness of the camera operation.
- The information from the three modalities is highly consistent regarding the camera function.

## Comprehensive Conclusion
  The user's evaluation of this phone's camera function is accurate. The photo quality is indeed excellent, and the operating experience is good. However, the battery life issue does exist and needs improvement. Overall, this is a phone with an excellent camera function but with room for improvement in battery life.

## Confidence: 90%
Reason: The multimodal information is highly consistent, and the analysis result is reliable.
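If your prompt asks for output in the format above, the per-modality findings and the confidence value can be pulled out for quick comparison across test runs. This is a minimal sketch that assumes the exact line shapes of the example (`**<Name> Modality**: ...` and `## Confidence: NN%`); the function name `parse_result` is hypothetical, and outputs that deviate from this format will simply yield fewer fields.

```python
import re


def parse_result(output: str):
    """Extract per-modality findings and the confidence percentage, if present."""
    # Lines such as "**Text Modality**: ..." become {"Text": "..."}.
    modalities = dict(
        re.findall(r"^\*\*(\w+) Modality\*\*:\s*(.+)$", output, re.MULTILINE)
    )
    # A heading such as "## Confidence: 90%" becomes the integer 90.
    match = re.search(r"^## Confidence:\s*(\d{1,3})%", output, re.MULTILINE)
    confidence = int(match.group(1)) if match else None
    return modalities, confidence
```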

Submit your solution and share your ideas and techniques with the community!

Tips for Better Prompts

  • Be specific and clear about what you want
  • Provide context and examples when helpful
  • Use appropriate tone and style for your audience
  • Test and iterate to improve your results
