
Which data/report exactly feels "off", and where? A little context will help me pinpoint it faster. First, confirm a few things:
- Which table/report/metric is wrong? What are the expected vs. actual values?
- When did the anomaly start? Is it a persistent drift, or an occasional spike/cliff?
- Any recent changes (ETL/SQL, models, upstream APIs, event tracking, timezone/partition settings)?
- Which warehouse/dialect are you on (Postgres/BigQuery/Snowflake/ClickHouse…)? What are the primary-key and timestamp column names?
Meanwhile, self-check these high-yield items (generic SQL templates below; adjust function names for your dialect):
- Freshness and row-count trend
-- row counts per day and the latest partition
select date_trunc('day', created_at) d, count(*) c
from your_table
where created_at >= now() - interval '14 day'
group by 1 order by 1;
-- latest event / load time
select max(created_at) as max_event_time, max(ingested_at) as max_load_time from your_table;
- Duplicates and primary-key uniqueness
-- does the primary key repeat?
select id, count(*) c
from your_table
group by 1
having count(*) > 1
order by c desc
limit 50;
- Join fan-out (multiple matches per key)
-- estimate per-key multiplicity on each side
select
(select avg(cnt) from (select key, count(*) cnt from A group by key) t) as a_avg_mult,
(select avg(cnt) from (select key, count(*) cnt from B group by key) t) as b_avg_mult;
-- does the row count multiply after the join?
select count(*) as after_join
from A join B using (key);
select (select count(*) from A) as a_cnt, (select count(*) from B) as b_cnt;
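To see the fan-out effect concretely, here is a minimal self-contained sketch using Python's built-in sqlite3 module. The table names A/B and the column `key` mirror the templates above; the rows are invented purely for the demo:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table A(key int, amount int);
    create table B(key int, label text);
    insert into A values (1, 10), (2, 20), (3, 30);
    -- key=1 appears twice in B, so the join will fan out
    insert into B values (1, 'x'), (1, 'y'), (2, 'z'), (3, 'w');
""")
a_cnt = con.execute("select count(*) from A").fetchone()[0]
after_join = con.execute(
    "select count(*) from A join B using (key)"
).fetchone()[0]
print(a_cnt, after_join)  # 3 rows in A become 4 after the join: key=1 fans out
```

If `after_join` is noticeably larger than either input's row count, one side of the join is not unique on the join key, and any summed metric downstream will be inflated.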
- NULL rate / default-value spikes (broken event tracking, parse failures)
select date_trunc('day', created_at) d,
avg((target_col is null)::int)::float as null_rate
from your_table
group by 1 order by 1;
- Unit/timezone shifts (K→CNY, ms→s, UTC→local)
-- sudden change of scale: compare percentiles
select
percentile_disc(0.5) within group (order by metric) as p50,
percentile_disc(0.9) within group (order by metric) as p90,
percentile_disc(0.99) within group (order by metric) as p99
from your_table
where created_at >= now() - interval '7 day';
- Dedup/window logic no longer working
-- does a business key that should be unique appear more than once per window?
select business_key, date_trunc('day', created_at) d, count(*) c
from your_table
group by 1,2 having count(*)>1
order by c desc limit 50;
- Filter regression (WHERE clause lost or gained a condition)
-- last 7 days vs. the 7 days before
with cur as (
select sum(metric) m from your_table
where created_at >= now()-interval '7 day'
),
prev as (
select sum(metric) m from your_table
where created_at >= now()-interval '14 day' and created_at < now()-interval '7 day'
)
select cur.m as cur_7d, prev.m as prev_7d, cur.m::float / nullif(prev.m, 0) - 1 as wow
from cur, prev;
Paste one concrete metric/SQL/chart that looks wrong (or a table name plus the key columns: primary key, event time, measure columns), and I can:
- Review the SQL/model and give a minimal reproduction plus a fix;
- Write a set of quick "health check" queries/tests (uniqueness, referential integrity, distribution drift);
- If you use dbt, add tests and freshness monitoring so it doesn't happen again.
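For the dbt route, a sketch of what that could look like; the source, model, and column names here are placeholders, not from your project:

```yaml
version: 2

sources:
  - name: raw                      # placeholder source name
    loaded_at_field: ingested_at   # column dbt uses to measure freshness
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: your_table

models:
  - name: your_model               # placeholder model name
    columns:
      - name: id
        tests:
          - unique
          - not_null
```

`dbt source freshness` then alerts when loads stall, and `dbt test` catches primary-key duplicates and NULLs before they reach the report.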