The More I Look at the Data, the More Something Seems Off (the deeper you dig, the fishier it gets)

Planning a troubleshooting approach

I realize that I have the capability to run shell commands, but I should first ask the user for specifics about their project. I need details like which dataset, table, or report they're working with, the expected behavior, when things started going wrong, and any recent changes. I'm thinking of providing a checklist highlighting common issues and including a couple of SQL snippets. It seems straightforward; I can skip the planning tool for this investigation.

Creating a troubleshooting checklist

I'm thinking about how to effectively assist with this issue. I'll start by asking four quick questions: where the problem is, what the symptoms are, when it started, and what changed. I should also clarify how the data is "off", for example totals dropping, spikes, duplicates, or nulls, and identify the data source or location.

Next, I want to provide immediate checks with SQL templates, such as checking freshness and trends, identifying duplicates, assessing NULL rates, detecting outliers, checking for timezone shifts, and verifying join counts.

Structuring a data review process

I'm focusing on setting up a clear approach for data verification. I should suggest checking any recent changes in the WHERE clause, pipeline runs, and git diffs. It might help if the user shares a sample or SQL/model for review. I can offer to run local commands if they're able to provide access to the repo.

![](/assets/images/9867CBB5.jpg)

I'm thinking of organizing everything under a "Data Verification" title and using structured bullets. I'll include quick clarification questions first, then suggest they check high-impact items with SQL snippet templates. I'll keep it generic and adaptable across dialects. Finally, I can propose additional support, like reviewing queries or running tests, all while keeping the tone friendly in Chinese.

Which dataset/report exactly is "off", and in what way? Give me a bit of context and I can pinpoint it faster. First, help me confirm a few things:

  • Which table/report/metric is affected? What is the expected vs. actual result?
  • When did the anomaly start? Is it a sustained deviation or occasional spikes/cliffs?
  • Any recent changes shipped (ETL/SQL, models, upstream interfaces, event tracking, timezone/partition settings)?
  • Which warehouse/dialect are you on (Postgres/BigQuery/Snowflake/ClickHouse…)? What are the primary key and time column names?

In the meantime, run through these high-hit-rate self-checks (generic SQL templates below; adjust function names for your dialect):

  • Freshness and row-count trend
-- row count per day and the most recent partition
select date_trunc('day', created_at) d, count(*) c
from your_table
where created_at >= now() - interval '14 day'
group by 1 order by 1;

-- timestamp of the latest batch/partition
select max(created_at) as max_event_time, max(ingested_at) as max_load_time from your_table;
  • Duplicates and primary-key uniqueness
-- any duplicate primary keys?
select id, count(*) c
from your_table
group by 1
having count(*) > 1
order by c desc
limit 50;
  • Join fan-out (multiple matches)
-- estimate join cardinality (avg rows per key on each side)
select
  (select avg(cnt) from (select key, count(*) cnt from A group by key) t) as a_avg_mult,
  (select avg(cnt) from (select key, count(*) cnt from B group by key) t) as b_avg_mult;
-- did the row count multiply after the join?
select count(*) as after_join
from A join B using (key);
select (select count(*) from A) as a_cnt, (select count(*) from B) as b_cnt;
  • NULL rate / default-value spike (broken event tracking, parse failures)
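-- daily NULL rate of target_col (a jump usually means broken tracking or failed parsing)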
select date_trunc('day', created_at) d,
  avg((target_col is null)::int)::float as null_rate
from your_table
group by 1 order by 1;
  • Unit/timezone shift (K→CNY, ms→s, UTC→local); see the hour-of-day sketch right after this checklist
-- sudden change in scale: compare percentiles
select
  percentile_disc(0.5) within group (order by metric) as p50,
  percentile_disc(0.9) within group (order by metric) as p90,
  percentile_disc(0.99) within group (order by metric) as p99
from your_table
where created_at >= now() - interval '7 day';
  • Dedup/window logic breaking down
-- are business keys that should be unique appearing more than once within the time window?
select business_key, date_trunc('day', created_at) d, count(*) c
from your_table
group by 1,2 having count(*)>1
order by c desc limit 50;
  • Filter regression (WHERE clause lost or gained conditions)
-- compare the metric: last 7 days vs. the 7 days before
with cur as (
  select sum(metric) m from your_table
  where created_at >= now()-interval '7 day'
),
prev as (
  select sum(metric) m from your_table
  where created_at >= now()-interval '14 day' and created_at < now()-interval '7 day'
)
select cur.m as cur_7d, prev.m as prev_7d, cur.m::float / nullif(prev.m, 0) - 1 as wow
from cur, prev;
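
For the unit/timezone item above, a complementary check is to compare the hour-of-day distribution before and after the date you suspect something changed: if the daily peak moves by a fixed number of hours, that points to a timezone shift rather than a real change in behavior. A minimal Postgres-flavored sketch; the cutoff '2024-05-01' is a placeholder for your suspected change date:

-- hour-of-day distribution, before vs. after the suspected change date
select
  case when created_at < date '2024-05-01' then 'before' else 'after' end as period,
  extract(hour from created_at) as hour_of_day,
  count(*) as c
from your_table
where created_at >= date '2024-05-01' - interval '14 day'
group by 1, 2
order by 1, 2;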

Paste one concrete metric/SQL/chart that looks wrong (or the table name plus the key columns: primary key, event time, measure columns), and I can:

  1. Review the SQL/model and give you a minimal reproduction plus suggested fixes;
  2. Write a set of quick "health check" queries/tests (uniqueness, referential integrity, distribution drift);
  3. If you use dbt, add tests and freshness monitoring so this doesn't happen again (see the sketch below).
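
As a concrete example of points 2 and 3: a referential-integrity check is just a query that returns the violating rows, and the same SQL can later be dropped into a dbt project as a singular test (a .sql file under tests/ that fails the run if it returns any rows). The table and column names below (orders.user_id referencing users.id) are placeholders for illustration; inside dbt you would reference the models via ref() instead of raw table names, and freshness monitoring is configured separately in the source YAML.

-- orphaned foreign keys: orders whose user_id has no matching row in users
-- (in a dbt project, save as e.g. tests/assert_orders_user_has_match.sql)
select o.id, o.user_id
from orders o
left join users u on u.id = o.user_id
where o.user_id is not null
  and u.id is null
limit 50;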