accept paper IDs with or without a dot #136

Open
wants to merge 61 commits into
base: master
9b6c805
accept paper IDs without a dot
Dabbrivia Jan 26, 2020
a4be9d8
accept pid with or without dot
Dabbrivia Jan 27, 2020
4206d3c
Multiprocessing etc. for 200k articles
Dabbrivia Feb 10, 2020
55fa5cc
Merge branch 'master' of https://github.com/Dabbrivia/arxiv-sanity-pr…
Dabbrivia Feb 10, 2020
f77b3f8
multiprocessing for 200k+ articles
Dabbrivia Feb 10, 2020
0bee725
merged with the prod server code
Dabbrivia Feb 11, 2020
a5cff8d
use export.arxiv.org for metadata fetching
Dabbrivia Feb 11, 2020
f2eae10
list of all current papers in metadata db for shell etc.
Dabbrivia Feb 11, 2020
468c0d8
use extended regex for arxiv ID schemas
Dabbrivia Feb 14, 2020
99f5032
Update main.html
Dabbrivia Feb 16, 2020
0c0ac4f
Merge branch 'master' of https://github.com/Dabbrivia/arxiv-sanity-pr…
Dabbrivia Feb 16, 2020
dca72b3
Update main.html
Dabbrivia Feb 16, 2020
5e382a4
Merge branch 'master' of https://github.com/Dabbrivia/arxiv-sanity-pr…
Dabbrivia Feb 16, 2020
f0a1538
adjusted the header to reflect the impressum
Dabbrivia Feb 16, 2020
b6eec66
dummy for pdfs that failed conversion to pdf as opposed to not yet pr…
Feb 16, 2020
dece94f
reverting to the original directory for thumbs but simlinking it (/da…
Feb 17, 2020
b62cac2
fixed merge error in compute_batch
Dabbrivia Feb 18, 2020
43cce28
getting only new and missing pdfs from export.arxiv.org. using tarbal…
Feb 18, 2020
17f1fb3
moved tfidf.p to data directory as it's overt 2Gb
Dabbrivia Feb 19, 2020
915f8fc
moved tfidf.p to data directory as it's overt 2Gb
Dabbrivia Feb 19, 2020
16847eb
fixed merge error in compute_batch
Dabbrivia Feb 19, 2020
747950c
correction in the file naming schema for tarbals
Feb 19, 2020
0c5c89c
Merge branch 'txt_as_pdf' of github.com:Dabbrivia/arxiv-sanity-preser…
Feb 19, 2020
8add49c
Merge branch 'master' into txt_as_pdf
Dabbrivia Feb 19, 2020
37d3031
Merge pull request #3 from Dabbrivia/txt_as_pdf
Dabbrivia Feb 19, 2020
5b7f631
all intermediate pickles to /data/pickles as they are large
Feb 19, 2020
5d034f9
astype(np.float32) makes the matrix smaller
Feb 19, 2020
b875c5b
Revert "astype(np.float32) makes the matrix smaller"
Feb 19, 2020
4edfd4e
use dir_basename_from_pid from utils.py
Feb 19, 2020
436856e
use num_partitions everywhere for Pool, otherwise not all cores used
Feb 19, 2020
e114244
correct messages for writing
Feb 19, 2020
a4c842e
added a todo
Feb 19, 2020
e96a63d
ready to run daily updates
Feb 19, 2020
5dd8ddf
Merge branch 'master' of github.com:Dabbrivia/arxiv-sanity-preserver
Feb 19, 2020
8366519
harvest cs as well
Feb 19, 2020
8f59118
sleep 1 sec every 4 requests as per arXiv guidelines
Feb 20, 2020
76c877b
sleep 20 sec every 500 requests just in case
Feb 20, 2020
19ebe8f
start worker instance after pdfs are downloaded as it takes a long time
Feb 20, 2020
43e8e0a
download urls in 4 threads throttled according to arxiv rules
Feb 20, 2020
423cac5
more intelligence about aws and debugging
Feb 20, 2020
e74d42c
commented crontab settings
Dabbrivia Feb 22, 2020
4ba3659
works in cron
Feb 24, 2020
0688631
run updates via cron
Dabbrivia Feb 24, 2020
9c0220f
run server on aws instance reboot
Dabbrivia Feb 26, 2020
d5b8643
add shebang
Dabbrivia Feb 26, 2020
1ea06e7
run twitter daemon in background
Dabbrivia Feb 26, 2020
45c8c87
run twitter_daemon.py, grep monitoring GETs out
Dabbrivia Feb 26, 2020
e7f0d3f
execute rights for owner
Feb 26, 2020
e7bb0fc
show verbose account creation help and fix download_pdfs
Feb 27, 2020
519d2c8
mitigate api change in werkzeug
Dabbrivia Jul 23, 2020
f623ab4
requirements for OAI_seed_db.py
Dabbrivia Jul 23, 2020
befb6ff
calculate correct number of partitions for single core
Dabbrivia Jul 23, 2020
6d9a337
Update daily_update.sh
Dabbrivia Jul 24, 2020
21b2b9f
Update download_pdfs.py
Dabbrivia Jul 24, 2020
b6279d1
content of twitter.txt and secret.txt
Dabbrivia Jul 25, 2020
5aa14f8
Merge branch 'master' of https://github.com/Dabbrivia/arxiv-sanity-pr…
Dabbrivia Jul 27, 2020
26b5a9c
process all db files on multicore
Dabbrivia Jul 28, 2020
8cc71f2
use Abbriva DEMO title for templates
Dabbrivia Aug 2, 2020
90eeec2
use tar instead of rsync, transfer as.db
Dabbrivia Aug 3, 2020
7c4cc2c
create all needed directories on /data/
Dabbrivia Aug 3, 2020
0679a1d
nginx proxy to serve on port 80
Dabbrivia Aug 25, 2020
110 changes: 110 additions & 0 deletions OAI_seed_db.py
@@ -0,0 +1,110 @@
"""
Queries arxiv OAI and downloads paper XML data.

This script was adapted from another arXiv-metadata project on GitHub. I
should cite them here, but I need to find the url again.
"""

import os
import time
import datetime
import dateutil
import pickle
import random
import argparse
import urllib.request
import re
import requests

from utils import Config, safe_pickle_dump
from lxml import etree, objectify
from parse_OAI_XML import parse_xml

if __name__ == "__main__":

    # parse input arguments
    parser = argparse.ArgumentParser()
    parser.add_argument('--set', type=str,
                        default='physics:cond-mat',
                        #default='physics:hep-th',
                        help='category used for arxiv OAI of form physics:arxivcat')
    parser.add_argument('--from-date', type=str, default=datetime.date.isoformat(datetime.date.today()-datetime.timedelta(1)), help='Start date in YYYY-MM-DD')
    parser.add_argument('--until-date', type=str, default=datetime.date.isoformat(datetime.date.today()), help='End date in YYYY-MM-DD, default is today')
    args = parser.parse_args()

    # misc hardcoded variables
    resume_re = re.compile(r".*<resumptionToken.*?>(.*?)</resumptionToken>.*")
    base_url = 'http://export.arxiv.org/oai2?' # base api query url
    req = {u"verb": "ListRecords",
           u"metadataPrefix": u"arXivRaw", u"set": args.set, u"from": args.from_date, u"until": args.until_date,}
    print('Searching arXiv with query: '+str(req))

    max_tries = 10

    num_added_total = 0
    failures = 0
    count = 0
    while True:
        # Send the request.
        r = requests.post(base_url, data=req)

        # Handle the response.
        code = r.status_code
        print("Received Response Code:", code)

        if code == 503:
            # Asked to retry
            to = int(r.headers["retry-after"])
            print(u"Got 503. Retrying after {0:d} seconds.".format(to))

            time.sleep(to)
            failures += 1
            if failures >= max_tries:
                print(u"Failed too many times...")
                break

        elif code == 200:
            failures = 0

            # Write to file.
            content = r.text
            #print(content)
            count += 1

            #Save a backup of xml from arXiv in case screw up parsing (don't bother them too often)
            file_name = u"raw"+datetime.date.isoformat(datetime.date.today())+"-{0:08d}.xml".format(count)
            print(u"Writing to: {0}".format(file_name))
            with open(file_name, u"w") as f:
                f.write(content)

            #Call a function from parse_xml.py to convert OAI-RAW to API format
            parse_xml(file_name)
            #num_added_total += num_added

            # Look for a resumption token.
            token = resume_re.search(content)
            if token is None:
                break
            token = token.groups()[0]

            # If there isn't one, we're all done.
            if token == "":
                print(u"All done.")
                break

            print(u"Resumption token: {0}.".format(token))

            # If there is a resumption token, rebuild the request.
            req = {u"verb": u"ListRecords",
                   u"resumptionToken": token}

            # Pause so as not to get banned.
            to = 20
            print(u"Sleeping for {0:d} seconds so as not to get banned."
                  .format(to))
            time.sleep(to)

        else:
            # Wha happen'?
            r.raise_for_status()
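
The paging logic in `OAI_seed_db.py` depends on pulling the `resumptionToken` out of each OAI response and turning it into the next request. A minimal sketch of that one step in isolation follows. The helper names `extract_resumption_token` and `next_request` are hypothetical and not part of this PR, and the pattern is a slightly simplified variant of the script's regular expression.

```python
import re

# Variant of the pattern used in OAI_seed_db.py: the leading/trailing ".*"
# are dropped (redundant with re.search), and re.DOTALL is an addition so a
# token element that spans several lines would still match.
RESUME_RE = re.compile(r"<resumptionToken.*?>(.*?)</resumptionToken>", re.DOTALL)


def extract_resumption_token(xml_text):
    """Return the resumption token, or None when there is nothing left to fetch.

    Hypothetical helper: a missing element and an empty element are both
    treated as "harvest complete", as in the script's loop.
    """
    match = RESUME_RE.search(xml_text)
    if match is None:
        return None
    token = match.group(1).strip()
    return token or None


def next_request(xml_text):
    """Build the follow-up ListRecords request dict, or None when done."""
    token = extract_resumption_token(xml_text)
    if token is None:
        return None
    return {"verb": "ListRecords", "resumptionToken": token}
```

Keeping this step as a pure function makes it possible to check the paging behaviour against saved `raw*.xml` backups without sending further requests to arXiv.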

11 changes: 10 additions & 1 deletion README.md
@@ -44,6 +44,13 @@ The processing pipeline requires you to run a series of scripts, and at this sta

Optionally you can also run the `twitter_daemon.py` in a screen session, which uses your Twitter API credentials (stored in `twitter.txt`) to query Twitter periodically looking for mentions of papers in the database, and writes the results to the pickle file `twitter.p`.

Structure of `twitter.txt` (one value per line):
<pre>consumer_key
consumer_secret
access_token_key
access_token_secret
</pre>
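
As an illustration of the layout above, here is a minimal sketch of reading the four values. It assumes one credential per line in exactly the listed order. How `twitter_daemon.py` actually parses the file is not shown in this diff, and `read_twitter_credentials` is a hypothetical helper.

```python
from collections import namedtuple

# Field order follows the README listing above.
TwitterCredentials = namedtuple(
    "TwitterCredentials",
    ["consumer_key", "consumer_secret", "access_token_key", "access_token_secret"],
)


def read_twitter_credentials(path="twitter.txt"):
    """Read four credentials, one per line, ignoring blank lines."""
    with open(path, "r") as f:
        lines = [line.strip() for line in f if line.strip()]
    if len(lines) != 4:
        raise ValueError(
            "expected 4 non-empty lines in %s, found %d" % (path, len(lines))
        )
    return TwitterCredentials(*lines)
```

Failing loudly on a wrong line count is a design choice of this sketch: a truncated file is easier to diagnose at startup than as an authentication error later.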

I have a simple shell script that runs these commands one by one, and every day I run this script to fetch new papers, incorporate them into the database, and recompute all tfidf vectors/classifiers. More details on this process below.

**protip: numpy/BLAS**: The script `analyze.py` does quite a lot of heavy lifting with numpy. I recommend that you carefully set up your numpy to use BLAS (e.g. OpenBLAS), otherwise the computations will take a long time. With ~25,000 papers and ~5000 users the script runs in several hours on my current machine with a BLAS-linked numpy.
@@ -52,7 +59,7 @@ I have a simple shell script that runs these commands one by one, and every day

If you'd like to run the flask server online (e.g. AWS) run it as `python serve.py --prod`.

You also want to create a `secret_key.txt` file and fill it with random text (see top of `serve.py`).
You also want to create a `secret_key.txt` file and fill it with random text (see top of `serve.py`), for example with `cat /dev/urandom | base64 | head -c 1000 > secret_key.txt`.
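
For systems without the shell tools above, a roughly equivalent sketch in Python is shown here. `write_secret_key` is a hypothetical helper, not part of the repository. Unlike the shell `base64` output, the result contains no line breaks.

```python
import base64
import os


def write_secret_key(path="secret_key.txt", num_chars=1000):
    """Write `num_chars` characters of base64-encoded random data to `path`."""
    # Every 3 random bytes become 4 base64 characters, so generate slightly
    # more than needed and trim to the requested length.
    num_bytes = (num_chars * 3) // 4 + 3
    key = base64.b64encode(os.urandom(num_bytes)).decode("ascii")[:num_chars]
    with open(path, "w") as f:
        f.write(key)
    return key
```

`os.urandom` draws from the operating system's random source, the same kind of source as `/dev/urandom` in the one-liner.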

### Current workflow

@@ -67,6 +74,8 @@ python analyze.py
python buildsvm.py
python make_cache.py
```
### Crontab entry
```21 04 * * * . /home/ubuntu/.profile; echo "START $(date)">>/data/daily_update.log; /home/ubuntu/arxiv-sanity-preserver/daily_update.sh >>/data/daily_update.log 2>&1; echo "FINISH $(date)">>/data/daily_update.log```

I run the server in a screen session, so `screen -S serve` to create it (or `-r` to reattach to it) and run:

96 changes: 75 additions & 21 deletions analyze.py
@@ -1,3 +1,5 @@
# -*- coding: utf-8 -*-
# vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4
"""
Reads txt files of all papers and computes tfidf vectors for all papers.
Dumps results to file tfidf.p
@@ -9,50 +11,69 @@
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

from utils import Config, safe_pickle_dump
from utils import Config, safe_pickle_dump, dir_basename_from_pid
from joblib import Parallel, delayed

import multiprocessing
import pandas as pd
import numpy as np
from multiprocessing import Pool
import scipy.sparse as sp

import regex

seed(1337)
max_train = 5000 # max number of tfidf training documents (chosen randomly), for memory efficiency
max_train = 25000 # max number of tfidf training documents (chosen randomly), for memory efficiency
max_features = 5000

# read database
db = pickle.load(open(Config.db_path, 'rb'))

# read all text files for all papers into memory

def read_txt_path(p):
    with open(p, 'r') as f:
        try: # some problems with unicode may arise
            txt = f.read()
        except UnicodeDecodeError: # treat undecodable files as empty
            txt = ""
    return txt

txt_paths, pids = [], []
n = 0
for pid,j in db.items():
n += 1
idvv = '%sv%d' % (j['_rawid'], j['_version'])
txt_path = os.path.join('data', 'txt', idvv) + '.pdf.txt'

txt_path = os.path.join(Config.txt_dir, dir_basename_from_pid(pid,j)+".txt")

if os.path.isfile(txt_path): # some pdfs dont translate to txt
with open(txt_path, 'r') as f:
txt = f.read()
txt = read_txt_path(txt_path)

if len(txt) > 1000 and len(txt) < 500000: # 500K is VERY conservative upper bound
txt_paths.append(txt_path) # todo later: maybe filter or something some of them
pids.append(idvv)
print("read %d/%d (%s) with %d chars" % (n, len(db), idvv, len(txt)))
#print("read %d/%d (%s) with %d chars" % (n, len(db), idvv, len(txt)))
else:
print("skipped %d/%d (%s) with %d chars: suspicious!" % (n, len(db), idvv, len(txt)))
pass
else:
print("could not find %s in txt folder." % (txt_path, ))
print("in total read in %d text files out of %d db entries." % (len(txt_paths), len(db)))

# compute tfidf vectors with scikits
v = TfidfVectorizer(input='content',
encoding='utf-8', decode_error='replace', strip_accents='unicode',
lowercase=True, analyzer='word', stop_words='english',
v = TfidfVectorizer(input='content',
encoding='utf-8', decode_error='replace', strip_accents='unicode',
lowercase=True, analyzer='word', stop_words='english',
token_pattern=r'(?u)\b[a-zA-Z_][a-zA-Z0-9_]+\b',
ngram_range=(1, 2), max_features = max_features,
ngram_range=(1, 2), max_features = max_features,
norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=True,
max_df=1.0, min_df=1)

# create an iterator object to conserve memory
def make_corpus(paths):
for p in paths:
with open(p, 'r') as f:
txt = f.read()
yield txt
yield read_txt_path(p)

# train
train_txt_paths = list(txt_paths) # duplicate
@@ -62,12 +83,39 @@ def make_corpus(paths):
train_corpus = make_corpus(train_txt_paths)
v.fit(train_corpus)

# export texts for topic modelling
corpus = make_corpus(txt_paths) # don't forget to rewind
pattern = regex.compile(r'((?=[^!?.,\ ])\W|\d)+', regex.UNICODE) # raw string avoids invalid-escape warnings
clean_txt=(pattern.sub(' ',str(text)[:1000]) for text in corpus)
texts_df=pd.DataFrame(clean_txt, columns=['Text',])
texts_df.to_excel('diego_texts.xlsx',index=True)
del corpus

# https://github.com/rafaelvalero/ParallelTextProcessing/blob/master/parallelizing_text_processing.ipynb
num_cores = multiprocessing.cpu_count()
num_partitions = num_cores-1 if num_cores > 1 else 1 # I like to leave some cores for other processes
print('num_partitions',num_partitions)

#TODO we actually don't need a dataframe, transform corpus to np.array directly
def parallelize_dataframe(df, func):
    a = np.array_split(df, num_partitions)
    del df
    pool = Pool(num_partitions)
    sparse_mtrx = sp.vstack(pool.map(func, a), format='csr')
    pool.close()
    pool.join()
    return sparse_mtrx

def transform_func(data):
    tfidf_matrix = v.transform(data["text"])
    return tfidf_matrix

# transform
print("transforming %d documents..." % (len(txt_paths), ))
corpus = make_corpus(txt_paths)
X = v.transform(corpus)
print(v.vocabulary_)
print(X.shape)
data_pd = pd.DataFrame(corpus)
data_pd.rename(columns = {0:'text'},inplace = True)
X = parallelize_dataframe(data_pd, transform_func)

# write full matrix out
out = {}
@@ -83,12 +131,10 @@ def make_corpus(paths):
out['ptoi'] = { x:i for i,x in enumerate(pids) } # pid to ix in X mapping
print("writing", Config.meta_path)
safe_pickle_dump(out, Config.meta_path)
del out
del data_pd

print("precomputing nearest neighbor queries in batches...")
X = X.todense() # originally it's a sparse matrix
sim_dict = {}
batch_size = 200
for i in range(0,len(pids),batch_size):
def compute_batch(i):
i1 = min(len(pids), i+batch_size)
xquery = X[i:i1] # BxD
ds = -np.asarray(np.dot(X, xquery.T)) #NxD * DxB => NxB
@@ -97,5 +143,13 @@ def make_corpus(paths):
sim_dict[pids[i+j]] = [pids[q] for q in list(IX[:50,j])]
print('%d/%d...' % (i, len(pids)))


print("precomputing nearest neighbor queries in batches...")
X = X.todense().astype(np.float32) # originally it's a sparse matrix
sim_dict = {}
batch_size = 200
Parallel( n_jobs=-1, prefer="threads", verbose=5)(
delayed(compute_batch)(i) for i in range(0,len(pids),batch_size))

print("writing", Config.sim_path)
safe_pickle_dump(sim_dict, Config.sim_path)
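
The parallel transform in `analyze.py` relies on two small rules: leave one core free but never go below one partition, and split the corpus into contiguous chunks so that stacking the per-chunk results restores the original row order (which `pids` depends on). A standard-library sketch of both rules is shown here. `choose_num_partitions` and `split_into_partitions` are hypothetical names, and the chunk sizing imitates the documented behaviour of `numpy.array_split` rather than calling it.

```python
import multiprocessing


def choose_num_partitions(num_cores=None):
    """Leave one core free, but never drop below one partition."""
    if num_cores is None:
        num_cores = multiprocessing.cpu_count()
    return num_cores - 1 if num_cores > 1 else 1


def split_into_partitions(items, num_partitions):
    """Split a list into `num_partitions` contiguous, near-equal chunks.

    Same sizing rule as numpy.array_split: the first
    len(items) % num_partitions chunks receive one extra element.
    """
    base, extra = divmod(len(items), num_partitions)
    chunks, start = [], 0
    for k in range(num_partitions):
        size = base + (1 if k < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks
```

Because the chunks are contiguous and processed with an order-preserving map, concatenating the results row by row keeps row `i` aligned with `pids[i]`.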
21 changes: 21 additions & 0 deletions dabbrivia_list_db.py
@@ -0,0 +1,21 @@
"""
Reads the paper database and prints the relative PDF path of every paper.
"""
import os
import pickle

from utils import Config, safe_pickle_dump

# read database
db = pickle.load(open(Config.db_path, 'rb'))

# print one relative PDF path per paper, depending on the arXiv ID schema
for pid,j in db.items() :
    if j['_rawid'][:4].isdigit() and '.' in j['_rawid']:
        # ID with a dot: <first four digits>/<id>.pdf
        print(j['_rawid'][:4]+'/'+j['_rawid']+'.pdf')
    elif '/' in j['_rawid']:
        # ID of the form archive/number: <first four digits of number>/<archive><number>.pdf
        print(j['_rawid'].split("/")[1][:4]+'/'+"".join(j['_rawid'].split("/"))+'.pdf')
    else:
        # bare number without a dot: take the archive from the primary category
        print(j['_rawid'][:4]+'/'+j['arxiv_primary_category']['term'].split(".")[0]+j['_rawid']+'.pdf')
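
The three branches in `dabbrivia_list_db.py` can be read as a single mapping from a raw ID to a relative PDF path. The sketch below restates that mapping as a function so it can be checked on sample inputs. `pdf_relpath` is a hypothetical name, and the sample IDs in the usage note are illustrative rather than taken from the database.

```python
def pdf_relpath(rawid, primary_category=None):
    """Map a raw arXiv ID to '<yymm>/<name>.pdf', following the script above.

    `primary_category` is only consulted for bare IDs that carry neither a
    dot nor an archive prefix.
    """
    if rawid[:4].isdigit() and "." in rawid:
        # ID with a dot: directory is the first four digits
        return rawid[:4] + "/" + rawid + ".pdf"
    if "/" in rawid:
        # archive/number: drop the slash, directory from the number part
        archive, number = rawid.split("/", 1)
        return number[:4] + "/" + archive + number + ".pdf"
    # bare number: prepend the archive part of the primary category
    archive = primary_category.split(".")[0]
    return rawid[:4] + "/" + archive + rawid + ".pdf"
```

For example, `pdf_relpath("1501.00001")` gives `1501/1501.00001.pdf`, while `pdf_relpath("hep-th/9901001")` and `pdf_relpath("9901001", "hep-th")` both give `9901/hep-th9901001.pdf`.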
