Web scraping carries inherent risks and should be performed ethically. This guide demonstrates practical implementations using three popular .NET libraries.
Core Components
1. Base Interface and Model
```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

public interface IHotNews
{
    Task<IList<HotNews>> GetHotNewsAsync();
}

public class HotNews
{
    public string Title { get; set; }
    public string Url { get; set; }
}
```
Implementation Examples
1. HtmlAgilityPack
Official Resources:
Installation:
```powershell
Install-Package HtmlAgilityPack
```
Blog Post Scraper:
```csharp
public class HotNewsHtmlAgilityPack : IHotNews
{
    public async Task<IList<HotNews>> GetHotNewsAsync()
    {
        var web = new HtmlWeb();
        var doc = await web.LoadFromWebAsync("https://www.cnblogs.com/");

        // SelectNodes returns null (not an empty collection) when the XPath matches nothing
        var nodes = doc.DocumentNode.SelectNodes("//*[@id='post_list']/article/section/div/a");
        if (nodes == null)
            return new List<HotNews>();

        return nodes.Select(node => new HotNews
        {
            Title = node.InnerText,
            Url = node.GetAttributeValue("href", "")
        }).ToList();
    }
}
```
Console Output Example:
```text
24 articles scraped
[Title 1] https://example.com/post1
[Title 2] https://example.com/post2
...
```
2. AngleSharp
Official Resources:
Installation:
```powershell
Install-Package AngleSharp
```
CSS Selector Implementation:
```csharp
public class HotNewsAngleSharp : IHotNews
{
    public async Task<IList<HotNews>> GetHotNewsAsync()
    {
        var config = Configuration.Default.WithDefaultLoader();
        var context = BrowsingContext.New(config);
        var doc = await context.OpenAsync("https://www.cnblogs.com");

        return doc.QuerySelectorAll("article.post-item")
            .Select(item => item.QuerySelector("section > div > a"))
            .Where(link => link != null) // skip items without a title link
            .Select(link => new HotNews
            {
                Title = link.TextContent,
                Url = link.GetAttribute("href")
            }).ToList();
    }
}
```
Output Verification:
Same structured output as HtmlAgilityPack implementation.
3. PuppeteerSharp (SPA Support)
Official Resources:
Installation:
```powershell
Install-Package PuppeteerSharp
```
Core Workflow:
```csharp
// Download a compatible Chromium build (first run only)
await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);
using var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });

// Scrape SPA content after JavaScript has rendered
var page = await browser.NewPageAsync();
await page.GoToAsync("https://juejin.im", WaitUntilNavigation.Networkidle0);

// Example 1: Get the fully rendered HTML
var html = await page.GetContentAsync();

// Example 2: Save a screenshot
await page.ScreenshotAsync("juejin.png");

// Example 3: Generate a PDF
await page.PdfAsync("juejin.pdf");
```
Key Features:
- Handles JavaScript-rendered pages
- Automated browser interactions
- File export capabilities (PNG/PDF)
Execution Setup
```csharp
// Program.cs
static async Task Main(string[] args)
{
    var services = new ServiceCollection()
        .AddSingleton<IHotNews, HotNewsHtmlAgilityPack>() // swap in HotNewsAngleSharp to switch implementations
        .BuildServiceProvider();

    var scraper = services.GetRequiredService<IHotNews>();
    var results = await scraper.GetHotNewsAsync();

    Console.WriteLine($"Scraped {results.Count} items:");
    foreach (var item in results) // IList<T> has no ForEach method, so iterate directly
        Console.WriteLine($"{item.Title.PadRight(50)}\t{item.Url}");
}
```
Key Considerations
- Legality: Always verify a website's scraping policy (check its robots.txt)
- Rate Limiting: Implement delays between requests
- Data Parsing: Combine XPath/CSS selectors with regex for complex extraction
- Error Handling: Use try-catch blocks for network instability
- SPA Handling: PuppeteerSharp requires Chrome runtime (~150MB download)
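Rate limiting and error handling can be combined in one small helper. The sketch below is illustrative (the `ScraperUtils` class and `WithRetryAsync` name are not from any library): it retries a failed fetch a few times, waiting between attempts, and rethrows once the attempt budget is exhausted.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class ScraperUtils
{
    // Retry an async operation up to maxAttempts times, pausing between tries.
    // Catch the transient exception type your HTTP client actually throws;
    // HttpRequestException is used here as a common default.
    public static async Task<T> WithRetryAsync<T>(
        Func<Task<T>> action, int maxAttempts = 3, TimeSpan? delay = null)
    {
        var wait = delay ?? TimeSpan.FromSeconds(2);
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await action();
            }
            catch (HttpRequestException) when (attempt < maxAttempts)
            {
                await Task.Delay(wait); // simple fixed back-off; also acts as a rate limiter
            }
        }
    }
}
```

A scraper could wrap each page fetch with it, e.g. `await ScraperUtils.WithRetryAsync(() => web.LoadFromWebAsync(url))`.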
For production scenarios, consider:
- Proxy rotation
- User-agent randomization
- Headless browser pooling
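User-agent randomization can be as simple as picking a header from a small pool per request. A minimal sketch (the `UserAgents` class and its sample strings are hypothetical; use current, complete browser strings in production):

```csharp
using System;
using System.Net.Http;

public static class UserAgents
{
    // Hypothetical sample pool; replace with real, up-to-date browser strings.
    private static readonly string[] Pool =
    {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
    };

    // Note: System.Random is not thread-safe; prefer Random.Shared on .NET 6+.
    private static readonly Random Rng = new Random();

    public static string Pick() => Pool[Rng.Next(Pool.Length)];

    public static HttpClient NewClientWithRandomAgent()
    {
        var client = new HttpClient();
        client.DefaultRequestHeaders.UserAgent.ParseAdd(Pick());
        return client;
    }
}
```

Proxy rotation follows the same pattern, selecting a proxy per request via `HttpClientHandler.Proxy` instead of a header.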
Complete code samples available on GitHub.