
Python Data Cleaning: A Tutorial on Tidying and Deduplicating URL Fields


In data analysis and processing, tidying and deduplicating URL fields is a common task. A URL column may contain duplicate links, incomplete links, or parameters that need further handling. This article shows how to tidy and deduplicate a URL field with Python.

1. Import the Required Libraries

First, we import some commonly used Python libraries: pandas for data handling and urllib.parse for parsing URLs.

import pandas as pd
from urllib.parse import urlparse, urlunparse

2. Create Sample Data

Suppose we have a DataFrame containing URLs, as shown below:

data = {
    'url': [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page1?query=1',
        'http://example.com/page1',
        'https://example.com/page2#section1',
        'https://example.com/page3',
        'https://example.com/page1?query=2',
        'https://example.com/page4',
        'https://example.com/page4',
        'https://example.com/page5'
    ]
}

df = pd.DataFrame(data)
print(df)

Output:

                                  url
0           https://example.com/page1
1           https://example.com/page2
2   https://example.com/page1?query=1
3            http://example.com/page1
4  https://example.com/page2#section1
5           https://example.com/page3
6   https://example.com/page1?query=2
7           https://example.com/page4
8           https://example.com/page4
9           https://example.com/page5

3. URL Normalization

Before deduplicating, we need to normalize the URLs. Here, normalization includes:

  • Unifying the scheme (e.g., converting http to https)
  • Removing query parameters and fragment identifiers (e.g., ?query=1 and #section1)

def normalize_url(url):
    # Parse the URL into its components
    parsed_url = urlparse(url)

    # Unify the scheme to https
    if parsed_url.scheme == 'http':
        parsed_url = parsed_url._replace(scheme='https')

    # Drop the query string and fragment identifier
    parsed_url = parsed_url._replace(query='', fragment='')

    # Reassemble the normalized URL
    normalized_url = urlunparse(parsed_url)

    return normalized_url

# Apply the normalization function
df['normalized_url'] = df['url'].apply(normalize_url)
print(df)

Output:

                                  url             normalized_url
0           https://example.com/page1  https://example.com/page1
1           https://example.com/page2  https://example.com/page2
2   https://example.com/page1?query=1  https://example.com/page1
3            http://example.com/page1  https://example.com/page1
4  https://example.com/page2#section1  https://example.com/page2
5           https://example.com/page3  https://example.com/page3
6   https://example.com/page1?query=2  https://example.com/page1
7           https://example.com/page4  https://example.com/page4
8           https://example.com/page4  https://example.com/page4
9           https://example.com/page5  https://example.com/page5
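
Note that dropping every query parameter is a deliberate simplification: it treats ?query=1 and ?query=2 as the same page. If some parameters actually select different content, you can keep a whitelist of them instead. Below is a minimal sketch; KEEP_PARAMS is a hypothetical whitelist that you would choose based on your own data:

from urllib.parse import parse_qsl, urlencode

# Hypothetical whitelist of parameters that really change the page content
KEEP_PARAMS = {'id', 'page'}

def normalize_url_keep_params(url):
    parsed = urlparse(url)._replace(fragment='')
    if parsed.scheme == 'http':
        parsed = parsed._replace(scheme='https')
    # Keep only whitelisted parameters, sorted so the key is order-independent
    kept = sorted((k, v) for k, v in parse_qsl(parsed.query) if k in KEEP_PARAMS)
    return urlunparse(parsed._replace(query=urlencode(kept)))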

4. URL Deduplication

Now we can deduplicate based on the normalized URLs.

# Deduplicate on the normalized URL. Calling .copy() avoids a
# SettingWithCopyWarning when we add columns to df_unique below.
df_unique = df.drop_duplicates(subset=['normalized_url']).copy()

print(df_unique)

Output:

                                  url             normalized_url
0           https://example.com/page1  https://example.com/page1
1           https://example.com/page2  https://example.com/page2
5           https://example.com/page3  https://example.com/page3
7           https://example.com/page4  https://example.com/page4
9           https://example.com/page5  https://example.com/page5
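
As a quick sanity check, value_counts shows how many raw URLs collapsed into each normalized one:

# Count how many raw URLs map to each normalized URL
print(df['normalized_url'].value_counts())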

5. Further Processing

If you need to process the URLs further, for example to extract the domain or the path, you can use the urlparse function.

# Extract the domain (netloc) and path from each normalized URL
df_unique['domain'] = df_unique['normalized_url'].apply(lambda x: urlparse(x).netloc)
df_unique['path'] = df_unique['normalized_url'].apply(lambda x: urlparse(x).path)

print(df_unique)

Output:

                                  url             normalized_url       domain    path
0           https://example.com/page1  https://example.com/page1  example.com  /page1
1           https://example.com/page2  https://example.com/page2  example.com  /page2
5           https://example.com/page3  https://example.com/page3  example.com  /page3
7           https://example.com/page4  https://example.com/page4  example.com  /page4
9           https://example.com/page5  https://example.com/page5  example.com  /page5
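
Real-world URL columns are rarely this clean. A value like example.com/page1 has no scheme, so urlparse puts the whole string into path and netloc comes back empty. A defensive sketch, under the assumption that scheme-less values should be read as hosts rather than paths:

def safe_parse(url):
    # Without a scheme, urlparse treats 'example.com/page1' as a bare path
    if '://' not in url:
        url = 'https://' + url  # assumption: scheme-less values start with a host
    return urlparse(url)

print(safe_parse('example.com/page1').netloc)  # example.com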

6. Summary

With the steps above, we have successfully normalized and deduplicated the URL field. These steps make it easier to handle and analyze URL data during data cleaning.
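
If you need a stricter notion of "the same URL", the normalizer can go further. Here is a sketch under a few extra assumptions (explicit default ports and trailing slashes are treated as insignificant; adjust these rules to your data):

def strict_normalize_url(url):
    parsed = urlparse(url)
    # urlparse already lowercases the hostname; drop explicit default ports
    host = parsed.hostname or ''
    if parsed.port and parsed.port not in (80, 443):
        host = f'{host}:{parsed.port}'
    # Assumption: /page1/ and /page1 name the same resource
    path = parsed.path.rstrip('/') or '/'
    return urlunparse(('https', host, path, '', '', ''))

print(strict_normalize_url('HTTP://Example.com:80/page1/'))  # https://example.com/page1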

Complete Code

import pandas as pd
from urllib.parse import urlparse, urlunparse

# Create sample data
data = {
    'url': [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page1?query=1',
        'http://example.com/page1',
        'https://example.com/page2#section1',
        'https://example.com/page3',
        'https://example.com/page1?query=2',
        'https://example.com/page4',
        'https://example.com/page4',
        'https://example.com/page5'
    ]
}

df = pd.DataFrame(data)

# URL normalization function
def normalize_url(url):
    parsed_url = urlparse(url)
    if parsed_url.scheme == 'http':
        parsed_url = parsed_url._replace(scheme='https')
    parsed_url = parsed_url._replace(query='', fragment='')
    normalized_url = urlunparse(parsed_url)
    return normalized_url

# Apply the normalization function
df['normalized_url'] = df['url'].apply(normalize_url)

# Deduplicate (copy so that adding columns below does not
# trigger a SettingWithCopyWarning)
df_unique = df.drop_duplicates(subset=['normalized_url']).copy()

# Extract the domain and path
df_unique['domain'] = df_unique['normalized_url'].apply(lambda x: urlparse(x).netloc)
df_unique['path'] = df_unique['normalized_url'].apply(lambda x: urlparse(x).path)

print(df_unique)

With this tutorial, you should be able to tidy and deduplicate URL fields with ease. I hope it helps with your data analysis work!