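What is the best way to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows)?
The rows will not be perfect duplicates, of course, because of the RowID identity field.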
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
Assuming there are no nulls, GROUP BY the unique columns and SELECT the MIN (or MAX) RowId as the row to keep. Then delete every row whose RowId is not among the ones kept:
DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
If you have a GUID instead of an integer, you can replace
MIN(RowId)
with
CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
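Before running the GROUP BY delete above against a large table, it may be worth previewing how many rows it would remove. The following is only a sketch against the MyTable definition from the question; it reuses the same derived table and swaps the DELETE for a COUNT, so nothing is modified.

```sql
-- Dry run: count the rows that the LEFT OUTER JOIN delete would remove.
SELECT COUNT(*) AS RowsToDelete
FROM MyTable
LEFT OUTER JOIN (
    -- One surviving RowId per distinct (Col1, Col2, Col3) combination.
    SELECT MIN(RowId) AS RowId
    FROM MyTable
    GROUP BY Col1, Col2, Col3
) AS KeepRows ON MyTable.RowId = KeepRows.RowId
WHERE KeepRows.RowId IS NULL;  -- no match means the row is a duplicate
```

If the count looks plausible, the DELETE itself can be wrapped in BEGIN TRANSACTION and COMMIT so it can still be rolled back after inspecting the result.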
Another possible way of doing this is:
;
--Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
ORDER BY ( SELECT 0)) RN
FROM #MyTable)
DELETE FROM cte
WHERE RN > 1;
I am using ORDER BY (SELECT 0) above because it is arbitrary which row to keep in the event of a tie.
To keep the latest one in RowID order, for example, you could use ORDER BY RowID DESC.
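A sketch of the ORDER BY RowID DESC variant, written against the permanent MyTable from the question rather than the #MyTable temp table used above (that substitution is an assumption):

```sql
;
-- Number the rows in each duplicate group, newest RowID first,
-- then delete everything except the first row of each group.
WITH cte
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                                   ORDER BY RowID DESC) AS RN
         FROM   MyTable)
DELETE FROM cte
WHERE  RN > 1;
```

Changing DESC to ASC keeps the oldest row of each group instead.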
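Execution Plans
The execution plan for this is often simpler and more efficient than the one in the accepted answer, because it does not require the self join.
This is not always the case, however. One situation where the GROUP BY solution might be preferred is when a hash aggregate would be chosen in preference to a stream aggregate.
The ROW_NUMBER solution will always give pretty much the same plan, whereas the GROUP BY strategy is more flexible.
Factors that might favor the hash aggregate approach are:
No useful index on the partitioning columns
Relatively few groups, with relatively many duplicates in each group
In extreme versions of the second case (very few groups, each with many duplicates), one could also consider simply inserting the rows to keep into a new table, running TRUNCATE on the original, and copying them back. Compared with deleting a very high proportion of the rows, this minimizes logging.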
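The copy-out, truncate, copy-back idea could look roughly like the sketch below. It rests on several assumptions that are not in the original answer: #KeepRows is a hypothetical temp table name, no foreign keys or indexed views reference MyTable (TRUNCATE TABLE fails otherwise), and nothing else writes to the table while this runs.

```sql
BEGIN TRANSACTION;

-- 1. Save one row per distinct (Col1, Col2, Col3) combination.
SELECT MIN(RowID) AS RowID, Col1, Col2, Col3
INTO   #KeepRows            -- hypothetical temp table name
FROM   MyTable
GROUP  BY Col1, Col2, Col3;

-- 2. Empty the original table (far less log than a large DELETE).
TRUNCATE TABLE MyTable;

-- 3. Copy the saved rows back, preserving their original RowID values.
SET IDENTITY_INSERT MyTable ON;

INSERT INTO MyTable (RowID, Col1, Col2, Col3)
SELECT RowID, Col1, Col2, Col3
FROM   #KeepRows;

SET IDENTITY_INSERT MyTable OFF;

COMMIT TRANSACTION;

DROP TABLE #KeepRows;
```

TRUNCATE TABLE resets the identity seed, so after the copy-back new rows continue from the highest RowID that was kept, which may reuse values that belonged to removed duplicates.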
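There is a good article on removing duplicates on the Microsoft Support site. It is pretty conservative (they have you do everything in separate steps), but it should work well against large tables.
I have used self-joins to do this in the past, although it could probably be tidied up with a HAVING clause: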
DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
AND dupes.secondDupField = fullTable.secondDupField
AND dupes.uniqueField > fullTable.uniqueField
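The following query is useful for deleting duplicate rows. The table in this example has ID as an identity column, and the columns that hold duplicate data are Column1, Column2 and Column3.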
DELETE FROM TableName
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableName
GROUP BY Column1,
Column2,
Column3
/*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
nullable. Because of semantics of NOT IN (NULL) including the clause
below can simplify the plan*/
HAVING MAX(ID) IS NOT NULL)
The following script shows GROUP BY, HAVING and ORDER BY used in one query, and returns the duplicated column values together with their counts.
SELECT YourColumnName,
COUNT(*) TotalCount
FROM YourTableName
GROUP BY YourColumnName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
delete t1
from table t1, table t2
where t1.columnA = t2.columnA
and t1.rowid>t2.rowid
Postgres:
delete
from table t1
using table t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid
DELETE LU
FROM (SELECT *,
Row_number()
OVER (
partition BY col1, col2, col3
ORDER BY rowid DESC) [Row]
FROM mytable) LU
WHERE [row] > 1
This will delete the duplicate rows, keeping only the first row of each group:
DELETE
FROM
Mytable
WHERE
RowID NOT IN (
SELECT
MIN(RowID)
FROM
Mytable
GROUP BY
Col1,
Col2,
Col3
)
See ( http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server )
I would prefer a CTE for deleting duplicate rows from a SQL Server table.
I strongly recommend following this article: http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
Keeping the original:
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
Without keeping the original:
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
To fetch the duplicate rows:
SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING COUNT(*) > 1
To delete the duplicate rows:
DELETE users
WHERE rowid NOT IN
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);
A quick way to remove rows that are exact duplicates in every column (for small tables):
select distinct * into t2 from t1;
delete from t1;
insert into t1 select * from t2;
drop table t2;