How can I remove duplicate rows?

What is the best way to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows)?

The rows will not be perfect duplicates, of course, because of the RowID identity field.

MyTable

RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null

Answers

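Assuming there are no nulls, GROUP BY the columns that define a duplicate and SELECT the MIN (or MAX) RowId as the row to keep. Then delete every row whose RowId is not among the rows to keep: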

DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
   SELECT MIN(RowId) as RowId, Col1, Col2, Col3 
   FROM MyTable 
   GROUP BY Col1, Col2, Col3
) as KeepRows ON
   MyTable.RowId = KeepRows.RowId
WHERE
   KeepRows.RowId IS NULL
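Before running a delete like this on a large table, it can help to preview the rows it would remove. The following is a minimal sketch that reuses the same join as a SELECT. It assumes the MyTable definition from the question and, like the delete itself, assumes the grouped columns contain no nulls.

```sql
-- Preview only: lists the rows the DELETE above would remove, i.e. every row
-- except the one with the lowest RowId in each Col1/Col2/Col3 group.
SELECT MyTable.*
FROM MyTable
LEFT OUTER JOIN (
   SELECT MIN(RowId) AS RowId, Col1, Col2, Col3
   FROM MyTable
   GROUP BY Col1, Col2, Col3
) AS KeepRows ON
   MyTable.RowId = KeepRows.RowId
WHERE
   KeepRows.RowId IS NULL
ORDER BY
   MyTable.Col1, MyTable.Col2, MyTable.Col3, MyTable.RowId;
```

If this returns no rows, the table has no duplicates on those three columns.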

If you have a GUID instead of an integer, you can replace

MIN(RowId)

with

CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
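Put together, the GUID variant might look like the sketch below. It assumes the key column is named MyGuidColumn (the name used in the expression above) and is of type uniqueidentifier. The round trip through char(36) is there because SQL Server does not allow MIN directly on a uniqueidentifier. The surviving row is simply the one whose GUID sorts lowest as text, which is arbitrary but deterministic.

```sql
-- Sketch: keep one row per Col1/Col2/Col3 group when the key is a GUID.
-- MyGuidColumn is an assumed column name, not part of the question's table.
DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
   SELECT CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn))) AS KeepGuid,
          Col1, Col2, Col3
   FROM MyTable
   GROUP BY Col1, Col2, Col3
) AS KeepRows ON
   MyTable.MyGuidColumn = KeepRows.KeepGuid
WHERE
   KeepRows.KeepGuid IS NULL;
```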

Another possible way of doing this is

; 

--Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 
                                       ORDER BY ( SELECT 0)) RN
         FROM   #MyTable)
DELETE FROM cte
WHERE  RN > 1;

I am using ORDER BY (SELECT 0) above because it is arbitrary which row is kept in the event of a tie.

To keep the latest row in RowID order, for example, you could use ORDER BY RowID DESC.
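As a sketch of that variant, here is the same CTE delete with a deterministic ordering. It is written against the question's MyTable rather than the temporary table used above, which is an assumption about where you would run it.

```sql
-- Keep the row with the highest RowID in each Col1/Col2/Col3 group;
-- every other row in the group gets RN > 1 and is deleted.
WITH cte
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                                       ORDER BY RowID DESC) AS RN
         FROM   MyTable)
DELETE FROM cte
WHERE  RN > 1;
```

Switching to ORDER BY RowID ASC would keep the earliest row instead.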

Execution Plans

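The execution plan for this is often simpler and more efficient than the one for the accepted answer, because it does not require the self join.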

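This is not always the case, however. One situation where the GROUP BY solution might be preferred is when a hash aggregate would be chosen in preference to a stream aggregate.

The ROW_NUMBER solution will always give pretty much the same plan, whereas the GROUP BY strategy is more flexible.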

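Factors that might favour the hash aggregate approach are

  • No useful index on the partitioning columns
  • Relatively few groups, with relatively many duplicates in each group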

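In extreme versions of the second case (very few groups, each with many duplicates), you could also consider simply inserting the rows to keep into a new table, then running TRUNCATE on the original and copying them back. This minimises logging compared with deleting a very high proportion of the rows.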
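The copy, truncate, and reload idea could be sketched roughly as follows. This is not part of the original answer. It assumes the question's MyTable definition, that no foreign keys reference MyTable (TRUNCATE TABLE would fail otherwise), that you have permission to truncate, and that nothing else writes to the table while it runs. #KeepRows is a hypothetical temporary table name.

```sql
BEGIN TRANSACTION;

-- 1. Copy one row per group (the lowest RowID) into a holding table.
--    #KeepRows is an illustrative name, not from the original post.
SELECT MIN(RowID) AS RowID, Col1, Col2, Col3
INTO   #KeepRows
FROM   MyTable
GROUP  BY Col1, Col2, Col3;

-- 2. Empty the original table with minimal logging.
TRUNCATE TABLE MyTable;

-- 3. Copy the kept rows back, preserving their original RowID values.
SET IDENTITY_INSERT MyTable ON;

INSERT INTO MyTable (RowID, Col1, Col2, Col3)
SELECT RowID, Col1, Col2, Col3
FROM   #KeepRows;

SET IDENTITY_INSERT MyTable OFF;

COMMIT TRANSACTION;

DROP TABLE #KeepRows;
```

Wrapping the steps in a transaction means a failure part-way through can be rolled back instead of leaving the table empty. If the original RowID values do not matter, the IDENTITY_INSERT steps can be dropped and the rows reinserted without the RowID column.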

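There is a good article on removing duplicates on the Microsoft Support site. It is pretty conservative (it has you do everything in separate steps), but it should work well against large tables.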

I have used self-joins to do this in the past, although it could probably be done more neatly with a HAVING clause:

DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField 
AND dupes.secondDupField = fullTable.secondDupField 
AND dupes.uniqueField > fullTable.uniqueField

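The following query is useful for deleting duplicate rows. The table in this example has ID as an identity column, and the columns that contain duplicate data are Column1, Column2, and Column3.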

DELETE FROM TableName
WHERE  ID NOT IN (SELECT MAX(ID)
                  FROM   TableName
                  GROUP  BY Column1,
                            Column2,
                            Column3
                  /*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
                    nullable. Because of semantics of NOT IN (NULL) including the clause
                    below can simplify the plan*/
                  HAVING MAX(ID) IS NOT NULL)

The following script shows GROUP BY, HAVING, and ORDER BY used together in one query. It returns each duplicated column value along with its count.

SELECT YourColumnName,
       COUNT(*) TotalCount
FROM   YourTableName
GROUP  BY YourColumnName
HAVING COUNT(*) > 1
ORDER  BY COUNT(*) DESC

delete t1
from table t1, table t2
where t1.columnA = t2.columnA
and t1.rowid>t2.rowid

Postgres:

delete
from table t1
using table t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid

DELETE LU 
FROM   (SELECT *, 
               Row_number() 
                 OVER ( 
                   partition BY col1, col2, col3 
                   ORDER BY rowid DESC) [Row] 
        FROM   mytable) LU 
WHERE  [row] > 1

This will delete the duplicate rows, keeping only the first row (the lowest RowID) in each group:

DELETE
FROM
    Mytable
WHERE
    RowID NOT IN (
        SELECT
            MIN(RowID)
        FROM
            Mytable
        GROUP BY
            Col1,
            Col2,
            Col3
    )

See http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server

I would prefer a CTE for deleting duplicate rows from a SQL Server table.

I strongly recommend following this article: http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/

Keeping the original:

WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)

DELETE FROM CTE WHERE RN<>1

Without keeping the original:

WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
 
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)

To fetch the duplicate rows:

SELECT
name, email, COUNT(*)
FROM 
users
GROUP BY
name, email
HAVING COUNT(*) > 1

To delete the duplicate rows:

DELETE users 
WHERE rowid NOT IN 
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);

A quick way to delete rows that are duplicated in every column (for small tables):

select  distinct * into t2 from t1;
delete from t1;
insert into t1 select *  from t2;
drop table t2;