Analysis of fake news language evolution in time
As part of the Texas A&M – University of Cyprus Student Exchange Program, this is a research internship project for Summer 2019. As technology evolved to stop the propagation of fake news, propagandists and people who deliberately share false content adapted. Analyzing the evolution of fake news language provides new cues for detecting fake news.
Because some fake news articles may have been removed from the web, this project uses the Web Archive's snapshots, where webpages remain available over time.
The project is divided into three steps:
- Data Crawling and Web Scraping
  - Using the Scrapy framework, the project breaks this process into two crawlers:
    - `cdx.py` spider: collects valid snapshots from the Wayback CDX Server API, deploys crawlers to the snapshot URLs to extract links with Scrapy's link extractor, and then inserts the collected URLs into MongoDB.
    - `url_article.py` spider: starts by running aggregations on the urls collection (the default in `config.py`) in MongoDB, filters the aggregated URLs, and parses each article with the Newspaper3k library. The spider inserts the articles' metadata into an article collection or a filter collection.
  - The separate crawlers can be replaced by the `article.py` spider, which combines the two: it both collects and filters URLs from Wayback CDX Server API snapshots and then crawls the articles' URLs, skipping the insertions into the urls collection. (At this time, it has not been fully tested for functionality.)
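As a rough sketch of the first step, the Wayback CDX Server API can be queried for a domain's snapshots and its JSON rows turned into snapshot URLs. The endpoint, query parameters, and JSON layout below follow the public CDX API, but the helper names (`cdx_query_url`, `snapshot_urls`) are illustrative, not the actual functions in `cdx.py`:

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def cdx_query_url(domain, from_ts, to_ts):
    """Build a CDX Server API query for snapshots of `domain` between two timestamps."""
    params = {
        "url": domain,
        "matchType": "domain",
        "from": from_ts,
        "to": to_ts,
        "output": "json",
        "filter": "statuscode:200",  # keep only successful captures
    }
    return CDX_ENDPOINT + "?" + urlencode(params)

def snapshot_urls(cdx_rows):
    """Convert CDX JSON rows (first row = field names) into Wayback snapshot URLs."""
    if not cdx_rows:
        return []
    fields = cdx_rows[0]
    return [
        "https://web.archive.org/web/{timestamp}/{original}".format(**dict(zip(fields, row)))
        for row in cdx_rows[1:]
    ]

# Example CDX response (shape matches output=json; the values are made up).
sample = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20190601000000", "http://example.com/", "text/html", "200", "ABC123", "1234"],
]
print(snapshot_urls(sample)[0])
# -> https://web.archive.org/web/20190601000000/http://example.com/
```

In the actual spiders, each resulting snapshot URL would be handed to Scrapy as a request and the extracted links stored in MongoDB.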
- Text Analytics and Natural Language Processing (NLP)
  - This project employs Check-It's¹ feature engineering component, which divides linguistic features into:
    - Part-of-Speech
    - Readability and Vocabulary Richness
    - Sentiment Analysis
    - Surface and Syntax Punctuation
  - The code for the feature engineering is not yet publicly available, but this repository will be updated when it becomes available. For now, this article's sentiment analysis section provides an alternative to Check-It's sentiment score.
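Since Check-It's sentiment score is unavailable, a stand-in can be as simple as a lexicon-based polarity ratio. The tiny word lists and `sentiment_score` helper below are purely illustrative of the idea; a real replacement would use a full sentiment lexicon or an NLP library:

```python
# Illustrative positive/negative word lists; a real lexicon is far larger.
POSITIVE = {"good", "great", "honest", "true", "accurate"}
NEGATIVE = {"bad", "fake", "false", "hoax", "misleading"}

def sentiment_score(text):
    """Return a polarity score in [-1, 1]: (pos - neg) / matched words, 0 if none match."""
    words = (w.strip(".,!?;:") for w in text.lower().split())
    pos = neg = 0
    for w in words:
        pos += w in POSITIVE
        neg += w in NEGATIVE
    matched = pos + neg
    return (pos - neg) / matched if matched else 0.0

print(sentiment_score("This fake hoax is misleading."))  # -1.0
print(sentiment_score("A great and accurate report."))   # 1.0
```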
- Statistical Analysis on Time-series Data
  - This step takes the CSV file of features extracted in the previous step. The pandas library converts the CSV file into a data frame, and the Matplotlib and seaborn libraries plot the data as time-series graphs. Drawing statistical conclusions from the plots requires applying one's own knowledge and intuition.
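The time-series step can be sketched as follows. The column names (`date`, `sentiment`) are hypothetical stand-ins for whatever the feature CSV actually contains, and the inline records take the place of `pd.read_csv` on that file:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs headless
import matplotlib.pyplot as plt

# Stand-in for pd.read_csv("features.csv"): hypothetical per-article features.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2019-01-05", "2019-01-20", "2019-02-10", "2019-02-25"]),
        "sentiment": [0.25, 0.75, -0.25, 0.75],
    }
)

# Average each feature per month to turn per-article rows into a time series.
monthly = df.set_index("date").resample("M").mean()
print(monthly["sentiment"].tolist())  # [0.5, 0.25] (January and February means)

# Plot the monthly trend; seaborn could be layered on top for styling.
ax = monthly["sentiment"].plot(title="Monthly mean sentiment")
ax.set_ylabel("sentiment")
plt.savefig("sentiment_trend.png")
```

Resampling to a monthly mean is only one choice; a different window (weekly, quarterly) may suit sparser or denser snapshot data.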
Ultimately, though, the main contribution of this project is the data crawling and web scraping component, intended as a foundation for future work.
¹ Demetris Paschalides, Alexandros Kornilakis, Chrysovalantis Christodoulou, Rafael Andreou, George Pallis, Marios D. Dikaiakos, and Evangelos Markatos. 2019. Check-It: A Plugin for Detecting and Reducing the Spread of Fake News and Misinformation on the Web. arXiv:1905.04260. https://arxiv.org/abs/1905.04260v1
To get a local copy up and running, follow these simple steps.
- Clone the repo
  ```sh
  git clone https://github.com/anguyen120/fake-news-in-time.git
  ```
- Go inside the repo folder
  ```sh
  cd /folder/to/fake-news-in-time
  ```
- Install pip packages
  ```sh
  pip3 install -r requirements.txt
  ```
If you plan on storing data in MongoDB, be sure to have it running beforehand. If not, make the necessary adjustments in the code to handle your preferred storage.
Before starting, it is recommended to check each component's respective `*config.py`:

- `scrapy_config.py` is located in `.../fake-news-in-time/scrapy_archive/archive/`
- `scrapy_config.py` is located in `.../fake-news-in-time/feature_engineering/`
- `timeseries_config.py` is located in `.../fake-news-in-time/timeseries/`
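The contents of these files are not documented here, but a Scrapy-to-MongoDB config typically holds connection details and collection names. Everything below, names and values alike, is a guessed example rather than the real contents of `scrapy_config.py`:

```python
# Hypothetical scrapy_config.py contents -- adjust to your own setup.
MONGO_URI = "mongodb://localhost:27017"  # MongoDB must be reachable here
MONGO_DATABASE = "fake_news_in_time"     # database holding all collections
URLS_COLLECTION = "urls"                 # snapshot URLs collected by cdx.py
ARTICLE_COLLECTION = "articles"          # parsed article metadata
FILTER_COLLECTION = "filtered"           # URLs rejected during filtering
```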
A curated list of fake and factual news sites is provided. The lists are influenced by the blacklist in Check-It, slightly modified using Newspaper3k's popular URLs function for the factual news sites. You are, of course, more than welcome to use your own.
To deploy a spider, go to your terminal:

```sh
cd folder/to/fake-news-in-time/scrapy_archive/archive/
```
Depending on your preferences, this component can be launched with the `cdx.py` spider followed by the `url_article.py` spider, or with the `article.py` spider alone.

For the `cdx.py` spider followed by the `url_article.py` spider:

```sh
scrapy crawl cdx
```

Once the urls collection reaches an appropriate size, call the `url_article.py` spider:

```sh
scrapy crawl url_article
```

For the `article.py` spider:

```sh
scrapy crawl article
```
Running multiple spiders in the same process is highly encouraged; if you are interested, Scrapy provides documentation on how to do so.
As previously mentioned, Check-It's feature engineering, which this component relies on, is not publicly available at this time. For now, `feature.py` is provided as a skeleton for aggregating the article collection and storing the extracted features in a CSV file.
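The output stage of such a skeleton might look like the following. The `articles` records and the two toy features (word count, punctuation count) are invented for illustration, standing in for documents aggregated from the article collection and for Check-It's real features:

```python
import csv

# Stand-in for documents aggregated from the article collection in MongoDB.
articles = [
    {"url": "http://example.com/a", "date": "2019-06-01", "text": "Short example text."},
    {"url": "http://example.com/b", "date": "2019-07-01", "text": "Another, longer example text!"},
]

def extract_features(article):
    """Toy feature extractor: word and punctuation counts stand in for real features."""
    text = article["text"]
    return {
        "url": article["url"],
        "date": article["date"],
        "word_count": len(text.split()),
        "punctuation_count": sum(ch in ".,!?;:" for ch in text),
    }

# Write one CSV row per article, ready for the time-series step.
rows = [extract_features(a) for a in articles]
with open("features.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
```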
Before running this, there should be a CSV file containing the articles' extracted features in the same directory as `timeseries.py` (`.../fake-news-in-time/timeseries/`).
To run `timeseries.py`, go to the time-series component path in your terminal:

```sh
cd folder/to/fake-news-in-time/timeseries/
```

Run the script:

```sh
python3 timeseries.py
```
See the open issues for a list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See `LICENSE` for more information.
Alan Nguyen - anguyen120@protonmail.com
Project Link: https://github.com/anguyen120/fake-news-in-time
- Demetris "Jimmy" Paschalides
- LInC
- www.flaticon.com
- Web free icon made by Pixelmeetup from www.flaticon.com is licensed by CC 3.0 BY
- Network free icon made by Smashicons from www.flaticon.com is licensed by CC 3.0 BY
- Statistics free icon made by Eucalyp from www.flaticon.com is licensed by CC 3.0 BY