Upload to the Internet Archive #63
Comments
I've gotten a To test it out, I uploaded the big VA report from earlier this year.
Which made this:
Interestingly, 10 minutes after upload, the Internet Archive auto-produced a
I've sent an email to the Archive asking for guidance or documentation on how we can best structure the collection. In the meantime, I may just upload everything now, once, and worry about creating a sophisticated script for managing cost-effective sync and re-uploading of metadata later. |
Also, the Internet Archive has absolutely insane public logging for all this.
|
Some more pages relevant to our collection:
The Internet Archive is extremely cool. |
I'm glad that storing this stuff on the Archive is going so well. It's really the perfect home for this stuff. Well, I mean, a .gov site is the perfect home for this. But, short of that, the Archive is the best home for this. |
The plan now, after talking with IA, is to store each report as an "item" in the collection, rather than putting them in one bucket. An "item" (bucket) is supposed to have the same metadata for everything. Currently, the https://archive.org/details/unitedstates-data
The Archive is willing to make a Collection for the items, but needs at least 50 "items" uploaded. So each item would be its own bucket, with an ID something like
FWIW, the fact that we haven't resolved @divergentdave's work on finding duplicate IDs across years wouldn't come into play here if I put the year in the ID. The year is a more brittle piece of data than I'd prefer to put in the ID of the item, though. |
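To make the one-item-per-report idea concrete, here is a rough sketch of how an identifier with the year in it could be built. The exact format isn't spelled out in this comment, so the `prefix.inspector-year-report_id` pattern is an assumption modeled on an item URL posted further down this thread, and `item_identifier` is a hypothetical helper, not something in the repo. The character whitelist is also my understanding of what the Archive accepts in identifiers, not a confirmed rule.

```python
import re

# Assumed prefix, taken from an item URL that appears later in this thread.
ITEM_PREFIX = "us-inspectors-general"


def item_identifier(inspector, year, report_id, prefix=ITEM_PREFIX):
    """Build an Internet Archive item identifier for one report.

    Hypothetical helper: the prefix.inspector-year-report_id pattern is an
    assumption, not a format confirmed by the Archive or by this project.
    """
    # Replace any run of characters outside letters, digits, '.', '_' and '-'
    # with a single '-', then trim stray dashes from the ends.
    safe_report_id = re.sub(r"[^A-Za-z0-9._-]+", "-", str(report_id)).strip("-")
    return "%s.%s-%s-%s" % (prefix, inspector, year, safe_report_id)


print(item_identifier("treasury", 2014, "OIG-14-023"))
```

Since the year is part of the identifier, two reports that share a report_id in different years get different items, which is the trade-off discussed above.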
It seems like Internet Archive item identifiers can't easily be changed once uploaded (or deleted, naturally). If we use our report_id in the item identifier, we'll need to make sure we've fixed all our outstanding QA issues before we start uploading (particularly same-year duplicate IDs and bad 404 pages, and maybe duplicate files). It might also be a good idea to manually review new reports going forward before sending them to IA, in case one of the scrapers starts emitting spurious reports. |
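A pre-upload check for same-year duplicate IDs could look roughly like the sketch below. The report dicts and their keys (`inspector`, `year`, `report_id`, `url`) are assumptions for illustration; the real scraper output may be shaped differently, and `find_duplicate_ids` is a hypothetical name.

```python
from collections import defaultdict


def find_duplicate_ids(reports):
    """Return {(inspector, year, report_id): [urls]} for keys seen more than once.

    `reports` is a hypothetical list of dicts; the key names are assumptions.
    """
    seen = defaultdict(list)
    for report in reports:
        key = (report["inspector"], report["year"], report["report_id"])
        seen[key].append(report["url"])
    # Keep only the keys that more than one report claims.
    return {key: urls for key, urls in seen.items() if len(urls) > 1}


sample = [
    {"inspector": "va", "year": 2014, "report_id": "A-1", "url": "https://example.gov/a1.pdf"},
    {"inspector": "va", "year": 2014, "report_id": "A-1", "url": "https://example.gov/a1-copy.pdf"},
    {"inspector": "va", "year": 2013, "report_id": "A-1", "url": "https://example.gov/old-a1.pdf"},
]
for key, urls in find_duplicate_ids(sample).items():
    print(key, urls)
```

Anything this reports would need fixing (or a manual look) before its identifier gets baked into an Archive item.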
I agree that we should get our ID QA in order before submitting everything...and you've basically done that, which is outstanding. I do need to regenerate my archive. But, I think the downside of uploading duplicate or wrongly ID'd content to the Archive is low. It'll happen, we'll make a good faith effort to keep it in order (and automating the running of the |
For anyone watching this thread, I'm doing some work that will build into a general-purpose Internet Archive uploader, at https://github.com/konklone/bit-voyage. |
So Harvard's Perma.cc automatically uploads to the Internet Archive, using ia-wrapper.
    def upload_to_internet_archive(self, link_guid):
        # setup
        asset = Asset.objects.get(link_id=link_guid)
        link = asset.link
        identifier = settings.INTERNET_ARCHIVE_IDENTIFIER_PREFIX + link_guid
        warc_path = os.path.join(asset.base_storage_path, asset.warc_capture)

        # create IA item for this capture
        item = internetarchive.get_item(identifier)
        metadata = {
            'collection': settings.INTERNET_ARCHIVE_COLLECTION,
            'mediatype': 'web',
            'date': link.creation_timestamp,
            'title': 'Perma Capture %s' % link_guid,
            'creator': 'Perma.cc',

            # custom metadata
            'submitted_url': link.submitted_url,
            'perma_url': "http://%s/%s" % (settings.HOST, link_guid)
        }

        # upload
        with default_storage.open(warc_path, 'rb') as warc_file:
            success = item.upload(warc_file,
                                  metadata=metadata,
                                  access_key=settings.INTERNET_ARCHIVE_ACCESS_KEY,
                                  secret_key=settings.INTERNET_ARCHIVE_SECRET_KEY,
                                  verbose=True,
                                  debug=True)
            if success:
                print("Succeeded.")
            else:
                print("Failed.")
                self.retry(exc=Exception("Internet Archive reported upload failure."))
|
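For our reports, the same `get_item` / `upload` pattern from the Perma.cc snippet might be adapted along these lines. This is only a sketch: the report dict keys, the `report_url` custom field, and the function names are all hypothetical, and the collection name is a placeholder. The `texts` mediatype is my guess at the right value for PDF reports.

```python
def report_metadata(report, collection):
    """Build item metadata for one report (field choices are assumptions)."""
    return {
        "collection": collection,
        "mediatype": "texts",
        "title": report["title"],
        "date": report["published_on"],
        # custom metadata, mirroring Perma.cc's 'submitted_url' field
        "report_url": report["url"],
    }


def upload_report(identifier, pdf_path, metadata, access_key, secret_key):
    """Upload one file to its own item, using the calls shown in the snippet above."""
    # Imported here so report_metadata() can be used without the package installed.
    import internetarchive

    item = internetarchive.get_item(identifier)
    return item.upload(pdf_path,
                       metadata=metadata,
                       access_key=access_key,
                       secret_key=secret_key,
                       verbose=True)


example = report_metadata(
    {"title": "Example audit", "published_on": "2014-01-15",
     "url": "https://example.gov/report.pdf"},
    "example-collection",  # placeholder, not a real collection name
)
print(example["title"], example["mediatype"])
```

Keeping the metadata builder separate from the network call makes it easy to review what would be sent before anything is actually uploaded.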
A zip viewer for the contents of the bulk data zip I just uploaded: https://ia902205.us.archive.org/zipview.php?zip=/25/items/us-inspectors-general.bulk/us-inspectors-general.bulk.zip
Intended landing page for the bulk data file: https://archive.org/details/us-inspectors-general.bulk
There's no automatic download link for an entire collection, so I'll plan to upload every item in the collection individually, and then upload a bulk file separately.
I have an individual report uploaded and successfully rendering in the Archive's book viewer here: https://archive.org/details/us-inspectors-general.treasury-2014-OIG-14-023 |
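Building the separate bulk file could be as simple as zipping the local data directory. The sketch below is a minimal version using only the standard library; `build_bulk_zip` and the directory layout are assumptions, not how the uploaded zip was actually produced.

```python
import os
import zipfile


def build_bulk_zip(data_dir, zip_path):
    """Zip every file under data_dir, storing paths relative to data_dir.

    Returns the number of files written. Hypothetical helper.
    """
    count = 0
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED, allowZip64=True) as bulk:
        for root, dirs, files in os.walk(data_dir):
            dirs.sort()  # walk subdirectories in a stable order
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Don't add the archive to itself if it lives inside data_dir.
                if os.path.abspath(full_path) == os.path.abspath(zip_path):
                    continue
                bulk.write(full_path, os.path.relpath(full_path, data_dir))
                count += 1
    return count
```

Called as something like `build_bulk_zip("data", "us-inspectors-general.bulk.zip")`, the result would then be uploaded as its own item, separately from the per-report items.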
Using their S3-compatible API: http://archive.org/help/abouts3.txt
I have an Archive account, under eric@konklone.com, and I generated my S3(-like) credentials. I'm not actually sure whether the code to do this upload belongs in this repository -- it could just as easily be a script in a public repo on my own account that runs as a cron on the same box -- but I'm including it here to solicit discussion, and to publicize that I want to get this stuff into the Archive.
I'll also be contacting the Archive directly to see if they have any above-and-beyond interest in this collection.
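For reference, a PUT against the S3-like API might be assembled roughly as below. This is a sketch based on my reading of the abouts3.txt document linked above: the `s3.us.archive.org` endpoint, the `LOW key:secret` authorization header, the `x-amz-auto-make-bucket` header, and the `x-archive-meta-*` headers should all be checked against that document before relying on them. `build_upload_request` is a hypothetical helper, and it only builds the request without sending it.

```python
import urllib.parse
import urllib.request

# Endpoint as I understand it from abouts3.txt; treat as an assumption.
S3_ENDPOINT = "http://s3.us.archive.org"


def build_upload_request(identifier, filename, data, access_key, secret_key, metadata):
    """Build (but do not send) a PUT request for one file in one item."""
    headers = {
        "authorization": "LOW %s:%s" % (access_key, secret_key),
        # Ask the Archive to create the item (bucket) if it doesn't exist yet.
        "x-amz-auto-make-bucket": "1",
    }
    # Item metadata travels as x-archive-meta-<field> headers.
    for key, value in metadata.items():
        headers["x-archive-meta-%s" % key] = str(value)
    url = "%s/%s/%s" % (S3_ENDPOINT, identifier, urllib.parse.quote(filename))
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")


request = build_upload_request(
    "example-item", "report.pdf", b"%PDF-", "ACCESS", "SECRET",
    {"title": "Example report", "mediatype": "texts"},
)
print(request.get_method(), request.full_url)
```

Actually sending it would be a matter of passing the request to `urllib.request.urlopen`, with real credentials loaded from somewhere outside the repo.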
/cc @waldoj @spulec
Resources:
Todos:
unitedstates-data
_meta.xml
to the bucket (done automatically, actually)s3help
in the subject)--archive
.