RAAFYA Web Scraper is a versatile web scraping tool designed to extract structured data from web pages. It utilizes Selenium to fetch HTML content, converts it to markdown, and extracts relevant information using OpenAI's GPT models. The scraped data is then formatted and saved as JSON, CSV, or markdown files.
- Web Scraping with Selenium: Fetches and cleans HTML content from the web.
- Markdown Conversion: Converts HTML content to markdown for easy readability.
- Dynamic Data Extraction: Uses OpenAI models to extract specific fields from the scraped content.
- Cost Calculation: Automatically calculates the cost of token usage for OpenAI models.
- Data Export: Exports the extracted data in JSON, CSV, and markdown formats.
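Conceptually, these features chain into a single fetch → markdown → extract → save pipeline. The sketch below illustrates that flow; it is not the actual code in `scraper.py`, and the target URL, field list, model name, and per-token prices are placeholder assumptions.

```python
# Minimal sketch of the fetch -> markdown -> extract -> save pipeline.
# Illustrative only: scraper.py's real functions, prompts, and prices may differ.
import json
import os

import html2text                      # assumed HTML-to-markdown converter
from openai import OpenAI
from selenium import webdriver

url = "https://example.com"           # placeholder target page
fields = ["title", "price"]           # placeholder fields to extract

# 1. Fetch the page with Selenium.
driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source
driver.quit()

# 2. Convert the HTML to markdown so the model sees readable text.
markdown = html2text.HTML2Text().handle(html)

# 3. Ask an OpenAI model to return the requested fields as JSON.
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
response = client.chat.completions.create(
    model="gpt-4o-mini",              # assumed model name
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": f"Extract these fields as JSON: {fields}"},
        {"role": "user", "content": markdown},
    ],
)
data = json.loads(response.choices[0].message.content)

# 4. Estimate cost from token usage (example per-million-token prices).
usage = response.usage
cost = usage.prompt_tokens * 0.15 / 1e6 + usage.completion_tokens * 0.60 / 1e6

# 5. Save the structured result.
os.makedirs("output", exist_ok=True)
with open("output/data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2)
print(f"Extracted {len(fields)} fields for an estimated ${cost:.6f}")
```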
- Clone the Repository:

  Run `git clone https://github.com/yazanrisheh/AI-Webscraper.git`, then `cd AI-Webscraper`.
- Install Dependencies:

  Ensure you have Python 3.8 or later installed, then install the required Python packages with `pip install -r requirements.txt`.
- Download ChromeDriver:

  Download the ChromeDriver build compatible with your operating system from this link. Place `chromedriver.exe` in the `./chromedriver-win64/` directory, or update the path in the `scraper.py` script accordingly.
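
  If you keep the driver somewhere else, Selenium only needs to be pointed at the new executable path. A minimal sketch, assuming `scraper.py` constructs its driver roughly like this (the actual variable names in the script may differ):

  ```python
  from selenium import webdriver
  from selenium.webdriver.chrome.service import Service

  # Path to the ChromeDriver executable; adjust if you placed it elsewhere.
  CHROMEDRIVER_PATH = "./chromedriver-win64/chromedriver.exe"

  driver = webdriver.Chrome(service=Service(executable_path=CHROMEDRIVER_PATH))
  ```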
- Set Up Environment Variables:

  Create a `.env` file in the root directory and add your OpenAI API key as `OPENAI_API_KEY=your_openai_api_key_here`. Alternatively, copy the provided `.env.example` with `cp .env.example .env` and fill in your API key.
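
  At runtime the key is typically loaded with `python-dotenv`. A minimal sketch of that pattern (assuming the package is installed via `requirements.txt`):

  ```python
  import os

  from dotenv import load_dotenv
  from openai import OpenAI

  load_dotenv()  # reads OPENAI_API_KEY from .env into the process environment
  client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
  ```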
- Run the Web Scraper:

  Modify the URL and the fields you want to scrape in the `scraper.py` file, then execute `python scraper.py`.
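
  The exact variable names inside `scraper.py` may differ, but the edit usually amounts to changing a URL string and a list of field names, for example (hypothetical values):

  ```python
  # Hypothetical configuration values; the real names in scraper.py may differ.
  url = "https://news.ycombinator.com"       # page to scrape
  fields = ["title", "points", "author"]     # fields the model should extract
  ```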
- Scraped Data Output:

  The scraped data will be saved in the `output/` directory as markdown and JSON files.
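
  Because the JSON output is plain structured data, it is easy to post-process, for example by flattening it into a CSV with pandas (the file names below are examples; the scraper generates its own output file names):

  ```python
  import json

  import pandas as pd

  # Example file names; the scraper generates its own output file names.
  with open("output/data.json", "r", encoding="utf-8") as f:
      records = json.load(f)

  pd.json_normalize(records).to_csv("output/data.csv", index=False)
  ```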
- Start the Streamlit App:

  Launch the Streamlit app to scrape websites interactively with `streamlit run streamlit.py`.
- Interact with the App:

  - Enter the URL of the webpage you want to scrape.
  - Specify the fields you want to extract.
  - Click the "Scrape" button to initiate the scraping process.
- Download Results:

  After scraping, you can download the results in JSON, CSV, or markdown format directly from the app.
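
  A minimal sketch of how a Streamlit page like `streamlit.py` typically wires these steps together; the widget labels, the `scrape` stand-in function, and the result shape are assumptions, not the app's actual code:

  ```python
  import json

  import streamlit as st


  def scrape(url, fields):
      # Stand-in for the project's real scraping logic.
      return {field: None for field in fields}


  st.title("RAAFYA Web Scraper")
  url = st.text_input("URL to scrape")
  fields_text = st.text_input("Fields to extract (comma-separated)", "title, price")

  if st.button("Scrape") and url:
      fields = [f.strip() for f in fields_text.split(",") if f.strip()]
      result = scrape(url, fields)
      st.json(result)
      st.download_button(
          "Download JSON",
          data=json.dumps(result, indent=2),
          file_name="scraped_data.json",
          mime="application/json",
      )
  ```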
- `scraper.py`: Contains the main logic for web scraping, data cleaning, markdown conversion, and data formatting.
- `streamlit.py`: Provides a Streamlit interface for interactive scraping and data download.
- `requirements.txt`: Lists the Python dependencies required for the project.
- `.env.example`: Example environment variable file for the OpenAI API key.
- `output/`: Directory where the scraped data is saved.
Contributions are welcome! If you have suggestions or improvements, please create an issue or submit a pull request.
This project is licensed under the MIT License.
For any inquiries or issues, feel free to contact me at yazanrisheh@hotmail.com or +971509108917.