TOEIC Speaking Q3-4 Describe a Picture: The 30-Second Structured Response

The picture loads. A workplace scene — five people around a conference table, papers spread, one person standing at a whiteboard. You have 45 seconds to prepare and 30 seconds to describe it. You notice everything: the colors, the body language, the clock on the wall, the laptop on the table, the fact that one person looks bored. When the record light comes on, you start: "This picture shows a meeting room with some people..." — and suddenly 10 seconds have gone and you're still scene-setting. You rush. You describe three features in fragments. The timer cuts off mid-sentence.

Q3-4 punishes candidates who describe pictures the way they'd describe pictures in conversation. Thirty seconds is not a speech; it's a structured short-form delivery with a fixed three-tier shape. The rater isn't evaluating how much you noticed in the picture — they're evaluating whether you organized a clear description under time pressure, using appropriate vocabulary, grammatically accurate sentences, and cohesive connectives.

The responses that earn 3/3 are the ones in which the candidate delivers 5-6 grammatically complete sentences in the structure the rubric rewards. The responses that land at 1-2 are the ones in which the candidate crams 8 features into fragments or spends all 30 seconds on scene-setting.

What Describe a Picture Actually Looks Like

Q3 and Q4 are the third and fourth items on TOEIC Speaking. Each presents a single photograph on screen. You have 45 seconds of preparation time and 30 seconds to describe it.

Feature	Describe a Picture (Q3-4)
Task count	2
Prep time each	45 seconds
Speak time each	30 seconds
Input	A single photograph
Typical content	Workplace scenes, everyday life, outdoor/indoor environments
Rubric (0-3)	Pronunciation, Intonation/Stress, Grammar, Vocabulary, Cohesion
Expected sentence count	5-6 complete sentences
Total time for both tasks	~2.5 minutes of the roughly 20-minute test

The rubric expands from Q1-2's two dimensions to five. Pronunciation and Intonation/Stress remain. Grammar, Vocabulary, and Cohesion get added. That expansion is the reason Q3-4 is harder than Q1-2 even though it looks simpler — now you're composing original sentences under time pressure, and five criteria evaluate them simultaneously.

The 30-second cap is the single biggest design constraint. Thirty seconds at a natural English pace is 70-85 words — roughly 5-6 sentences. Everything about preparation and delivery has to respect that ceiling.
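The word budget is simple arithmetic. The sketch below assumes a speaking pace of 140 to 170 words per minute, a range back-derived from the 70-85 word figure above rather than taken from any official source; `word_budget` is a hypothetical helper name used only for illustration.

```python
def word_budget(seconds: float, words_per_minute: float) -> int:
    """Approximate number of words that fit in `seconds` at a given pace."""
    return round(seconds * words_per_minute / 60)

# Assumed pace range (not an official figure): 140-170 words per minute.
SLOW_WPM = 140
FAST_WPM = 170

low = word_budget(30, SLOW_WPM)
high = word_budget(30, FAST_WPM)
print(f"30 seconds holds roughly {low}-{high} words")
```

If your own recorded pace is slower, plug it in and plan for the smaller budget rather than speeding up.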

The Three-Tier Structure That Wins 3s

Raters are trained to hear a structured response. The expected shape, stripped down to essentials:

Tier 1 — Opening (1 sentence, ~5 seconds). A general statement that locates and labels the scene. "This picture shows a meeting room," "This looks like a busy office lobby," "This appears to be a restaurant during lunchtime." One sentence, one purpose: orient the listener.

Tier 2 — Main features (3-4 sentences, ~20 seconds). The heart of the response. Describe the people, their actions, the objects they're interacting with, and the layout. Use prepositions of position (in the background, on the left, next to, behind, across from). Use present continuous for ongoing actions.

Tier 3 — Inference or closing observation (1 sentence, ~5 seconds). One sentence that goes slightly beyond pure description — the atmosphere, an inference about the situation, or a concluding summary. "Overall, the people seem focused on an important discussion," "It appears to be early morning based on the lighting," "Everyone looks professional and engaged."

Five to six sentences total. The structure is not optional — it's what separates Vocabulary 3 / Cohesion 3 from Vocabulary 2 / Cohesion 1. A response that lists eight features without the three-tier frame scores lower than a response with the frame even if the framed response mentions fewer features.
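As a sanity check, the three tiers can be written down as a small table and summed. This is only a sketch of the budget described above; the `TIERS` layout and variable names are illustrative, and the per-tier seconds are the approximate figures given in the text.

```python
# (tier name, min sentences, max sentences, approximate seconds), as listed above
TIERS = [
    ("Opening", 1, 1, 5),
    ("Main features", 3, 4, 20),
    ("Inference or closing", 1, 1, 5),
]

total_seconds = sum(seconds for _, _, _, seconds in TIERS)
min_sentences = sum(low for _, low, _, _ in TIERS)
max_sentences = sum(high for _, _, high, _ in TIERS)

print(f"{total_seconds} seconds, {min_sentences}-{max_sentences} sentences")
```

The totals come out to the full 30-second window and the 5-6 sentence target, which is why there is no slack for a second opening sentence or a fifth main feature.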

Cohesion Markers That Prop Up the Rubric

Cohesion is the scored criterion that test-takers most often neglect. It measures whether your sentences connect smoothly, with one flowing into the next rather than dropping as standalone fragments.

High-value cohesion markers for Q3-4:

Purpose	Markers
Existence / introduction	There is / There are, I can see, The picture shows
Location	On the left / right, In the background / foreground, In the middle, Next to, Behind, Across from
Addition	Also, In addition, Besides that, Another [thing] is
Contrast	However, On the other hand, Meanwhile
Concluding / summarizing	Overall, In general, It seems that, Based on the picture

Scattered use of three or four of these markers in a 5-6 sentence response is enough for a Cohesion 3. No markers at all — just strings of declarative sentences — typically caps Cohesion at 1-2.

Avoid marker overload. Using a connective in every sentence reads as forced. Three or four well-placed markers per response is the sweet spot.
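When reviewing a practice transcript, it can help to list which markers you actually used. The following is a minimal sketch, assuming a plain-text transcript: `MARKERS` is a subset of the table above, `find_markers` is a hypothetical helper, and the sample response is invented for illustration. It only detects phrases; it does not judge whether they are well placed.

```python
import re

# Subset of the marker table above; matching is case-insensitive.
MARKERS = [
    "there is", "there are", "i can see", "the picture shows",
    "on the left", "on the right", "in the background", "in the foreground",
    "in the middle", "next to", "behind", "across from",
    "also", "in addition", "besides that",
    "however", "on the other hand", "meanwhile",
    "overall", "in general", "it seems that", "based on the picture",
]

def find_markers(response: str) -> list[str]:
    """Return the markers that occur in the response as whole words or phrases."""
    text = response.lower()
    return [m for m in MARKERS if re.search(rf"\b{re.escape(m)}\b", text)]

# Hypothetical practice response, not an official sample.
sample = (
    "This picture shows a meeting room. "
    "On the left, a presenter is pointing at a whiteboard. "
    "Overall, everyone looks focused."
)
print(find_markers(sample))
```

A very short list suggests adding a location or summarizing marker; a list nearly as long as the response has sentences suggests trimming.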

Vocabulary That Scores and Vocabulary That Sinks

The Vocabulary criterion rewards accuracy and appropriateness, not rarity. A response with clear, accurate everyday words scores higher than a response with sophisticated vocabulary used slightly incorrectly.

Categories worth building active stock in:

People descriptors. Not just "man / woman" but customer, employee, waiter, passenger, pedestrian, office worker, receptionist, presenter, colleague. Naming the likely role places the scene in a specific workplace context.

Action verbs in present continuous. Holding, gesturing, pointing, typing, examining, explaining, reviewing, arranging, unpacking, loading, serving, waiting, walking past. A response anchored on vague verbs (doing, having, being) scores lower than one with specific action verbs.

Location prepositional phrases. In the foreground, in the middle of the room, to the left of the table, on the wall behind them, across the aisle, next to the window. Precise location anchors the scene and also counts toward Cohesion.

Atmosphere / inference adjectives. Focused, casual, formal, busy, empty, crowded, well-lit, outdoor, modern, professional, relaxed. These are what Tier 3 inference sentences are built from.

Avoid vague placeholders: stuff, things, doing stuff, some people, or plain "people" when a role label would fit. Also avoid vocabulary above your comfort zone: a word you're only 60% sure of produces more errors than a safer synonym.

The 45-Second Preparation Routine

The 45-second prep window is longer than the speak window. Use all of it: a response planned in advance comes out tighter.

A 45-second prep routine that works:

0-10s — Scan and classify. Quickly identify the setting (office? restaurant? park? street? airport?), the approximate number of people, and the main activity. Pick a genre label: workplace meeting, retail service, outdoor scene, social gathering.

10-25s — Plan Tier 2. Identify 3-4 main features you will describe. For each, mentally draft the sentence — subject + verb + object + modifier. Don't try to remember exact words; have the sentence's shape ready.

25-35s — Plan the opening and closing. Opening sentence: one clean "This picture shows..." type line that names the scene. Closing sentence: one inference or overall observation.

35-45s — Silently rehearse the opening. The first five seconds of your response set the rater's impression. Don't start cold.

By the time the record light turns on, you have a five-to-six-sentence plan in your head, with the opening rehearsed and the main features sequenced. You're not composing under pressure; you're executing a plan.
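For self-timed practice, the routine above can be expressed as a simple schedule. This is a sketch only: `PREP_PHASES` and `phase_at` are hypothetical names, and the boundaries are the ones listed in the routine. A practice timer could call `phase_at` once per second to show which step you should be on.

```python
# (start second, end second, phase), following the routine described above
PREP_PHASES = [
    (0, 10, "Scan and classify"),
    (10, 25, "Plan Tier 2"),
    (25, 35, "Plan the opening and closing"),
    (35, 45, "Silently rehearse the opening"),
]

def phase_at(elapsed_seconds: float) -> str:
    """Return the prep phase for a given elapsed time, or 'Speak' once prep ends."""
    if elapsed_seconds < 0:
        raise ValueError("elapsed_seconds must be non-negative")
    for start, end, name in PREP_PHASES:
        if start <= elapsed_seconds < end:
            return name
    return "Speak"

for t in (5, 12, 30, 40, 45):
    print(f"{t:>2}s: {phase_at(t)}")
```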

Prioritization Is the Hidden Skill

Thirty seconds is short enough that what you leave out matters as much as what you include. Strong responses prioritize ruthlessly:

People and their actions > static objects (if a person is visible, lead with them; don't spend 8 seconds on a shelf of books when there's a person at a desk)
Specific before general (a waiter serving coffee > people in a restaurant)
Main subject first, background features last
One inference > three shallow observations

The rater-visible signal of good prioritization is that your 30 seconds feel complete, not cut off mid-thought. A response that describes 3-4 features fully beats one that rushes through 6 features as fragments.

Common Pitfalls That Drop Scores

Running out of time before Tier 3. Spending too long on the opening or on scene-setting so that the response ends mid-feature list. Fix: drill with a 30-second timer; learn how many sentences fit.

Awkward silence mid-response. Pausing for 3+ seconds to search for a word. Fix: have backup phrasings — if you can't remember a word, switch to "a person is working at something on the table" rather than pausing to recall the specific item name.

Passive-voice overuse. "The meeting is being held by several people" instead of "Several people are holding a meeting." Passive can work occasionally but overusing it telegraphs uncertainty. Default to active.

Repetitive sentence openers. Every sentence starting with "There is..." or "I can see..." Fix: vary openings by using location phrases ("In the background, several people...") or actions ("Two workers are loading...").

Inferring things that aren't supported. "They're planning a birthday party" when nothing in the picture suggests a party. Fix: keep inference cautious. "It seems to be a casual meeting" is safer than a specific claim.

Describing what's not there. Mentioning objects you imagine but can't actually see. Fix: everything you mention should be pointable-at in the image.

Worked Example

A photograph: a street vendor's fruit stall with the vendor adjusting fruit displays, two customers looking at apples, and bicycles in the background.

Weak response (describes too much, no structure):

"There's a fruit stand and a man is working and there are apples and some bananas and two customers and bicycles and maybe it's a market..."

One long run-on sentence. No structure. Tier 1/2/3 all jammed together. Cohesion 1. Vocabulary 1. Grammar 1.

Strong response (structured, prioritized, uses markers):

"This picture shows an outdoor fruit stall at what looks like a market. In the foreground, a vendor is adjusting a display of apples. Two customers are standing next to the stall, examining the fruit. In the background, I can see several bicycles parked along the street. Overall, the scene appears to be a busy morning at a local market."

Five sentences. Clear Tier 1 opening ("This picture shows..."). Tier 2 with specific actions ("adjusting," "examining"), location markers ("In the foreground," "In the background"), and specific people roles ("vendor," "customers"). Tier 3 inference ("a busy morning at a local market"). Cohesion 3. Vocabulary 3. Grammar 3.

Both responses describe roughly the same content. By the ratings above, the structured one scores two points higher on three of the five criteria: Grammar, Vocabulary, and Cohesion.
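A quick way to check a drafted response against the targets in this article is to count its sentences and words. The sketch below is an illustration only: `count_sentences`, `count_words`, and the 85-word cap (the upper end of the 70-85 word figure given earlier) are assumptions for this example, and the punctuation-based sentence split is a rough heuristic, not how raters score.

```python
import re

# The strong response from the worked example above.
STRONG_RESPONSE = (
    "This picture shows an outdoor fruit stall at what looks like a market. "
    "In the foreground, a vendor is adjusting a display of apples. "
    "Two customers are standing next to the stall, examining the fruit. "
    "In the background, I can see several bicycles parked along the street. "
    "Overall, the scene appears to be a busy morning at a local market."
)

MAX_WORDS = 85  # assumed ceiling, taken from the 70-85 word estimate

def count_sentences(text: str) -> int:
    """Count chunks ending in ., ! or ? (a rough heuristic)."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text)
    return len([p for p in parts if p.strip()])

def count_words(text: str) -> int:
    """Count runs of letters and apostrophes as words."""
    return len(re.findall(r"[A-Za-z']+", text))

sentences = count_sentences(STRONG_RESPONSE)
words = count_words(STRONG_RESPONSE)
on_target = 5 <= sentences <= 6 and words <= MAX_WORDS
print(sentences, words, on_target)
```

The strong response comes out at five sentences and stays under the word ceiling, which leaves a small margin for natural pauses.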

Picture Genres and What Each One Demands

TOEIC Q3-4 pictures span several recurring genres. Each rewards a slightly different descriptive emphasis.

Workplace meeting or office scene. People seated around a table, at computers, in conference rooms. Prioritize: role labels (manager, colleagues, presenter), posture and gesture (leaning forward, pointing at a screen, taking notes), atmosphere (formal, collaborative, focused). A plausible inference sentence here is about the meeting's likely purpose or mood.

Retail or service environment. Customers and staff in stores, restaurants, cafes, banks. Prioritize: customer-staff relationship (a cashier is helping a customer, a waiter is taking an order), the transaction or interaction in progress, the kind of establishment. Inference sentence often locates the time of day or business type.

Outdoor public scene. Streets, parks, markets, plazas. Prioritize: the main activity if any (a vendor is arranging produce, pedestrians are crossing the street), the setting (urban, residential, commercial), atmospheric details (sunny, crowded, early morning). Inference sentence often speculates about the location's function or time of day.

Transportation and travel. Airports, train stations, bus stops, airplane interiors. Prioritize: travelers and their luggage or positions (a passenger is checking his boarding pass, travelers are waiting in line), infrastructure (boarding gate, information board, baggage carousel), the specific transit moment (pre-boarding, arrival, in transit).

Home or personal scene. Kitchens, living rooms, gardens, private settings. Prioritize: the activity (someone is preparing a meal, reading a book), the domestic setting details (kitchen counter, bookshelf, sofa), and the casual mood typical of home scenes.

Naming the genre in your opening sentence ("This picture shows an outdoor market," "This appears to be a formal meeting in a conference room") frames the rest of your description and cues the rater that you identified the scene correctly.

What to Drill

Timed 30-second deliveries. Practice with a strict 30-second timer. Record and replay. Count whether you delivered 5-6 complete sentences.

Vocabulary building by category. Maintain and grow active stock in the four vocabulary categories above (people roles, action verbs, location phrases, atmosphere adjectives), plus scene-type labels for your opening sentence. Review weekly.

Prep-routine drilling. Practice the 45-second prep routine until it's automatic. Many candidates waste prep time staring blankly or trying to plan sentences word for word; the routine above keeps prep productive.

Picture diversity. TOEIC Q3-4 pictures come from many genres (workplace meetings, retail environments, transportation, outdoor public spaces, personal home settings). Practice across genres so no category catches you flat.

Shadowing model responses. Listen to high-scoring model responses, shadow them for rhythm and sentence flow. This internalizes the pacing of 5-6 sentences in 30 seconds.

How Q3-4 Feeds Your Overall Speaking Score

Q3-4 is where candidates whose Q1-2 was strong first discover the rubric expands. Pronunciation and Intonation/Stress still count — but Grammar, Vocabulary, and Cohesion now join them. A candidate who landed 3/3 on Q1-2 can easily drop to 2/3 on Q3-4 if grammar fragments or cohesion gaps surface. Conversely, Q3-4 is also where candidates with weaker pronunciation but strong organization can claw back points, because cohesion and grammar are about sentence structure — not sound.

Strong Q3-4 performance also prepares you for Q5-7 and Q8-10, which keep these five criteria and add relevance and completeness of content, with response times of 30 seconds or less and much tighter prep windows. Mastering the three-tier structure here carries over.

On ExamRift, TOEIC Speaking Q3-4 practice is built around the three-tier structure and five-criteria rubric. Every practice picture includes a 45-second prep timer, a 30-second recording window, and AI-evaluated feedback on each of the five rubric dimensions. Sample responses are provided at each scoring level (1, 2, 3) so you can hear what separates a 2 from a 3 in concrete terms. The practice bank spans workplace, retail, outdoor, transit, and home scenes — the full genre range Q3-4 draws from — and the dashboard tracks which rubric dimension is capping your score.

Q3-4 should be the reliable 3/3 block that lifts your overall Speaking score into the 160+ range. With a disciplined 45-second prep, a three-tier delivery, and scored practice on all five rubric dimensions, it becomes exactly that.

Ready to master the 30-second structured response? Practice TOEIC Speaking Q3-4 on ExamRift with timed prep, AI feedback on all five rubric dimensions, and sample responses at every scoring level.