This Python-based project consists of two main scripts:
- Data Fetcher and Analyzer (`gpu_all_info_10_page_auto_stop.py`): A web scraping tool that extracts GPU listings from "Tori.fi", specifically from the computer components category in the Uusimaa region. It captures the title, price, description, link, posting date, and condition of each listing and saves them to a CSV file.
- Data Cleaner (`gpu_all_info_data_cleaning.py`): Enriches the data set by categorizing GPUs, identifying VRAM, and extracting GPU variants from the scraped data. It takes the output of the first script as its input.

All other scripts are variations of these two and show the journey, starting from the basic functions in the base_code directory.

Example of the cleaned output:

Title | Price | GPU Category | VRAM | GPU Variant | Description | Link | Date | Condition | Timestamp |
---|---|---|---|---|---|---|---|---|---|
Rtx 3090 24gb, helsinki | 1100.00 | 3090 | 24gb | None | Lisätiedot Hankin tämän tekoälyllä tehtävään kuvanmuokkaukseen, mutta en enää tarvitse sitä. Ei ole louhittu, eikä juurikaan muuten käytetty. | Link | 26 syyskuuta 08:43 | Erinomainen | 01:28:26 2023-10-02 |
GeForce GTX 1050 Ti GAMING X 4G | 50.00 | 1050 | Unknown | TI | Lisätiedot Muutaman vuoden käytössä ollut normaali kuntoinen näytönohjain. | Link | 26 syyskuuta 08:38 | Erinomainen | 01:28:27 2023-10-02 |
Amd Radeon RX 570 Series näytönohjain | 120.00 | 570 | 8 gb | None | Lisätiedot Kunto: Hyvä.Käyttöaika: 3 vuotta.Muisti: 8 GB. | Link | 25 syyskuuta 19:59 | Hyvä | 01:28:28 2023-10-02 |

Features of the Data Fetcher and Analyzer:
- Web Scraping: Uses `requests` and `BeautifulSoup` for scraping GPU listings.
- Data Extraction: Captures title, price, description, link, posting date, and condition.
- Data Recording: Saves data in `gpu_all_info.csv`.
- Link Tracking: Keeps a record of processed links in `processed_links.txt`.
- Error Handling: Manages HTTP request failures and missing data.
- Timestamps: Includes timestamps for data fetch time.
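
The data recording, link tracking, and timestamp features could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the function names `load_processed_links` and `record_listing` and the column order are assumptions. The file names come from the list above, and the timestamp format (`%H:%M:%S %Y-%m-%d`) follows the sample table.

```python
import csv
from datetime import datetime
from pathlib import Path

# Column order is an assumption based on the fields this README lists.
FIELDNAMES = ["Title", "Price", "Description", "Link", "Date", "Condition", "Timestamp"]


def load_processed_links(path):
    """Return the set of links already recorded, or an empty set if the file is missing."""
    path = Path(path)
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def record_listing(listing, csv_path, links_path, processed):
    """Append one listing to the CSV unless its link was seen before.

    Returns True if the listing was written, False if it was skipped.
    """
    if listing["Link"] in processed:
        return False

    csv_path = Path(csv_path)
    write_header = not csv_path.exists() or csv_path.stat().st_size == 0
    # Stamp the row with the fetch time, e.g. "01:28:26 2023-10-02".
    row = dict(listing, Timestamp=datetime.now().strftime("%H:%M:%S %Y-%m-%d"))

    with csv_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        if write_header:
            writer.writeheader()
        writer.writerow(row)

    # Remember the link on disk and in memory so later runs skip it.
    with Path(links_path).open("a", encoding="utf-8") as handle:
        handle.write(listing["Link"] + "\n")
    processed.add(listing["Link"])
    return True
```

In this sketch a run would call `load_processed_links("processed_links.txt")` once and then `record_listing(...)` for every scraped listing, so a listing whose link is already on file is skipped instead of being written to `gpu_all_info.csv` twice.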

Features of the Data Cleaner:
- Pandas for Data Manipulation: Utilizes `pandas` for handling the CSV file.
- GPU Categorization: Classifies GPUs based on predefined keywords.
- VRAM Identification: Extracts VRAM details from the listings.
- GPU Variant Extraction: Identifies specific GPU variants like 'XT', 'TI', or 'Super'.
- New Columns: Adds 'GPU_Category', 'VRAM', and 'GPU_Variant' columns to the dataset.
- Output File: Generates a cleaned and enhanced CSV file named `gpu_cleaned_data.csv`.
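
As a rough sketch of how the three derived columns might be computed, the following uses plain regular expressions. It is an illustration under stated assumptions, not the cleaner script itself: the keyword list `GPU_KEYWORDS`, the function names, the exact patterns, and the "Unknown" fallback for the category are all hypothetical.

```python
import re

# Hypothetical keyword list; the real script's predefined keywords may differ.
GPU_KEYWORDS = ["3090", "1050", "570"]

# Assumed patterns: "<number>gb" with an optional space, and a standalone variant token.
VRAM_PATTERN = re.compile(r"\b\d+\s?gb\b")
VARIANT_PATTERN = re.compile(r"\b(xt|ti|super)\b")


def gpu_category(title):
    """Return the first keyword found as a standalone token in the title."""
    text = title.lower()
    for keyword in GPU_KEYWORDS:
        if re.search(rf"\b{re.escape(keyword)}\b", text):
            return keyword
    return "Unknown"


def vram(title, description):
    """Return the first '<number>gb' mention in the title or description."""
    match = VRAM_PATTERN.search(f"{title} {description}".lower())
    return match.group(0) if match else "Unknown"


def gpu_variant(title):
    """Return 'XT', 'TI', or 'SUPER' if the title contains one, else 'None'."""
    match = VARIANT_PATTERN.search(title.lower())
    return match.group(1).upper() if match else "None"


def clean_row(row):
    """Return a copy of the row with the three derived columns added."""
    return {
        **row,
        "GPU_Category": gpu_category(row["Title"]),
        "VRAM": vram(row["Title"], row["Description"]),
        "GPU_Variant": gpu_variant(row["Title"]),
    }
```

Applied to the three rows of the sample table, this sketch yields the same GPU Category, VRAM, and GPU Variant values that the table shows. For instance, the "4G" in the second title is not followed by a "b", so its VRAM stays "Unknown". With pandas, the same functions could be applied row by row, for example through `DataFrame.apply` with `axis=1`.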
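
Usage of the Data Fetcher and Analyzer:
- Install Dependencies: Ensure `requests` and `bs4` (Beautiful Soup, installed as the `beautifulsoup4` package) are installed. The `csv` module is part of the Python standard library and needs no installation.
- Run the Script: Execute it in a Python environment to start data scraping.
- Output: Check `gpu_all_info.csv` and `processed_links.txt`.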

Usage of the Data Cleaner:
- Run the Script: Run the data cleaner script after the first script has finished, since it processes that script's output.
- Output: Generates `gpu_cleaned_data.csv` with the added columns and cleaned data.

Both scripts are for educational purposes and demonstrate web scraping and data cleaning. Users must adhere to the "Tori.fi" terms of service and consider the legal and ethical implications of web scraping and data processing.
Feel free to explore the code and reach out with any questions or suggestions!