#DIGREPO-935: Process Merritt Atom Feed to ingest ETDs #184

Open
wants to merge 35 commits into
base: master
Changes from 29 commits
5c18a57
Display merritt ark on ETD view page
supsrinivas Aug 8, 2019
719db6a
Add Feedjira to parse merritt atom feed.
supsrinivas Aug 21, 2019
2bcb172
Namespacing models to fit the bill on Rails 5
supsrinivas Sep 11, 2019
a96713d
Add merritt_id to solr as a stored_searchable index
supsrinivas Sep 12, 2019
49def01
Add merritt related config
supsrinivas Sep 16, 2019
36fe751
Slim down merritt related tables
supsrinivas Sep 16, 2019
593937a
Add merritt importer
supsrinivas Sep 18, 2019
e2c6534
Supplemental files get their own subdirectory
supsrinivas Sep 18, 2019
08dae69
Updates fixing errors from 1st run of import:merritt_etds task
supsrinivas Sep 23, 2019
4471969
Spec-ing out etd importer .escape_encodings & .get_content
supsrinivas Sep 23, 2019
bf492da
Moving files around. naming is hard.
supsrinivas Sep 23, 2019
d44d5f4
Another naming change.
supsrinivas Sep 24, 2019
7cdc275
Add Merritt::IngestEtd to handle ingesting.
supsrinivas Oct 3, 2019
a022778
DO NOT EXPOSE CREDS to Merritt
supsrinivas Oct 3, 2019
af18cb1
rubocop fixes
supsrinivas Oct 3, 2019
7e0323e
Proquest::XML Specs much?
supsrinivas Oct 3, 2019
bdd1818
Rubocop fixes
supsrinivas Oct 3, 2019
f8662b8
Goodbye Httparty!
supsrinivas Oct 7, 2019
0a17700
Add .env to handle merritt creds.
supsrinivas Oct 10, 2019
33e2119
Merritt::Feed table is not set up to capture last parsed info for ato…
supsrinivas Oct 14, 2019
626876c
Adding error handling to import:merritt_etds task. Updates from code …
supsrinivas Oct 16, 2019
1597761
Trying to fix travis build error: 'Expected feature release number in…
supsrinivas Oct 16, 2019
1e9c724
I never learn on the double quoted string check. I have used single q…
supsrinivas Oct 16, 2019
d7c4cbd
Fixing ObjectFactoryWriter#put WebMock::NetConnectNotAllowedError spe…
supsrinivas Oct 17, 2019
742e62f
Rubocop corrections
supsrinivas Oct 17, 2019
e1e9454
Found and fixed a bug in Academic Department field.
supsrinivas Oct 17, 2019
7e786ae
Academic Department ended up with a sneaky space between University a…
supsrinivas Oct 17, 2019
69fb05b
Require files needed to run import:merritt_etds task
supsrinivas Oct 18, 2019
df0e92d
Add genre attribute to ETD.
supsrinivas Oct 23, 2019
9e2b851
Add Genre i.e. form_of_work to merritt ETDs
supsrinivas Oct 30, 2019
4a8d682
Cut back on calls to DB.
supsrinivas Nov 4, 2019
4270218
DIGREPO-950 Place a comma after surname for degree supervisor
supsrinivas Nov 4, 2019
cc68993
DIGREPO-951 Add space between University and Department in degree gra…
supsrinivas Nov 4, 2019
0d21722
DIGREPO-954 Strip whitespace from dissertation keywords
supsrinivas Nov 4, 2019
5ed2bed
DIGREPO-956 Add ETD metadata property Physical Description
supsrinivas Nov 5, 2019
2 changes: 2 additions & 0 deletions .env
@@ -0,0 +1,2 @@
MERRITT_USER=tjohnson
MERRITT_PASSWORD=fakepwd
1 change: 1 addition & 0 deletions .travis.yml
@@ -1,6 +1,7 @@
---
sudo: false
language: ruby
dist: trusty

cache:
bundler: true
3 changes: 3 additions & 0 deletions Gemfile
@@ -9,6 +9,7 @@ end
gem "puma"
gem "rails", "~> 5.1.5"

gem "feedjira"
gem "font-awesome-sass"
gem "jquery-rails", "~> 4.0"
gem "sass-rails", "~> 5.0"
@@ -24,6 +25,7 @@ gem "blacklight_range_limit"
gem "curation_concerns", "~> 1.7.7"
gem "ezid-client", "~> 1.2"
gem "hydra-role-management"
gem "iso639"
gem "linked_vocabs",
git: "https://github.com/projecthydra-labs/linked_vocabs.git"
gem "marc"
@@ -59,6 +61,7 @@ gem "american_date", "~> 1.1.0"
group :development, :test do
gem "awesome_print"
gem "byebug"
gem "dotenv"
gem "factory_bot_rails"

# Used exact gem versions for solr_wrapper and fcrepo_wrapper
13 changes: 11 additions & 2 deletions Gemfile.lock
@@ -241,6 +241,7 @@ GEM
activesupport
diff-lcs (1.3)
dot-properties (0.1.3)
dotenv (2.7.5)
dropbox_api (0.1.10)
faraday (~> 0.9, ~> 0.8)
oauth2 (~> 1.1)
@@ -298,6 +299,9 @@ GEM
faraday (>= 0.7.4, < 1.0)
fcrepo_wrapper (0.7.0)
ruby-progressbar
feedjira (3.0.0)
loofah (>= 2.2.1)
sax-machine (>= 1.0)
ffi (1.9.23)
flot-rails (0.0.7)
jquery-rails
@@ -391,6 +395,7 @@ GEM
activesupport (<= 6)
inflecto (0.0.2)
iso-639 (0.2.8)
iso639 (1.3.2)
jbuilder (2.7.0)
activesupport (>= 4.2.0)
multi_json (>= 1.2)
@@ -721,7 +726,7 @@ GEM
multipart-post
oauth2
ruby-progressbar (1.9.0)
rubyzip (1.2.1)
rubyzip (1.2.3)
safe_yaml (1.0.4)
sass (3.5.6)
sass-listen (~> 4.0.0)
@@ -734,6 +739,7 @@ GEM
sprockets (>= 2.8, < 4.0)
sprockets-rails (>= 2.0, < 4.0)
tilt (>= 1.1, < 3)
sax-machine (1.3.2)
scrub_rb (1.0.1)
settingslogic (2.0.9)
signet (0.8.1)
@@ -864,13 +870,16 @@ DEPENDENCIES
ci_reporter
curation_concerns (~> 1.7.7)
database_cleaner
dotenv
ezid-client (~> 1.2)
factory_bot_rails
fcrepo_wrapper (= 0.7.0)
feedjira
font-awesome-sass
highline
http_logger
hydra-role-management
iso639
jbuilder (~> 2.0)
jquery-rails (~> 4.0)
kaminari_route_prefix
@@ -915,4 +924,4 @@ DEPENDENCIES
webmock

BUNDLED WITH
1.16.2
1.17.1
3 changes: 3 additions & 0 deletions app/controllers/catalog_controller.rb
@@ -312,6 +312,9 @@ def enforce_show_permissions(_opts = {})
config.add_show_field solr_name("identifier", :displayable),
label: "ARK"

config.add_show_field solr_name("merritt_id", :displayable),
label: "Merritt ARK"

config.add_show_field solr_name("accession_number", :symbol),
label: "Local Identifier"

7 changes: 4 additions & 3 deletions app/indexers/etd_indexer.rb
@@ -42,10 +42,11 @@ def dissertation
object.dissertation_year.first
end

# Derive department by stripping "UC, SB" from the degree grantor field
def department(solr_doc)
Array(solr_doc["degree_grantor_tesim"]).map do |a|
a.sub(/^University of California, Santa Barbara\. /, "")
# "degree_grantor_label_tesim"
# ["University of California, Santa Barbara.Computer Science"]
Array(solr_doc["degree_grantor_label_tesim"]).map do |a|
a.split(".").last.strip
end
end
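
Review note: a minimal sketch of what the new `department` mapping does to a grantor label, assuming labels look like the example in the comment above (institution, a period, then the department). `department_from_labels` is a hypothetical standalone name used only for illustration; it is not part of this PR.

```ruby
# Hypothetical standalone version of the department mapping, for illustration only.
def department_from_labels(labels)
  # Array() turns nil into [] and wraps a single string in an array.
  Array(labels).map do |label|
    # Keep whatever follows the last period, minus surrounding whitespace.
    label.split(".").last.strip
  end
end

with_space = department_from_labels(
  ["University of California, Santa Barbara. Computer Science"]
)
without_space = department_from_labels(
  "University of California, Santa Barbara.Computer Science"
)
missing = department_from_labels(nil)
```

Because the split keeps only the text after the last period, a department name that itself contains a period would be truncated, so this approach relies on labels having a single separator.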

3 changes: 0 additions & 3 deletions app/models/concerns/metadata.rb
@@ -39,9 +39,6 @@ module Metadata
index.as :displayable
end

# For Merritt ARKs
property :merritt_id, predicate: RDF::Vocab::DC11.identifier

property :accession_number,
predicate: RDF::URI(
"http://opaquenamespace.org/ns/cco/accessionNumber"
5 changes: 5 additions & 0 deletions app/models/etd.rb
@@ -33,6 +33,11 @@ class ETD < ActiveFedora::Base
index.as :displayable
end

# For Merritt ARKs
property :merritt_id, predicate: ::RDF::Vocab::DC11.identifier do |index|
index.as :displayable, :stored_searchable
end

include NestedAttributes

has_subresource :proquest
49 changes: 49 additions & 0 deletions app/models/merritt/feed.rb
@@ -0,0 +1,49 @@
# frozen_string_literal: true

class Merritt::Feed < ActiveRecord::Base
self.table_name_prefix = "merritt_"

HOME = "https://merritt.cdlib.org"

def self.etd_feed_url(page = 1)
HOME + "/object/recent.atom?collection=ark:/13030/m5pv6m0x&page=#{page}"
end

def self.parse(page = 1)
url = etd_feed_url(page)
uri = URI.parse(url)
request = Net::HTTP::Get.new(uri.request_uri)
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
response = http.request request

if response.code.to_i != 200
raise StandardError,
"Error in Merritt::Feed.parse: "\
"#{url} returned #{response.code} #{response.message}"
end

xml = response.body
Feedjira.parse(xml)
end

# Feed pages are available as
# links in the following order
# current, first, last & next
def self.current_page
parse.links.first.split("&").last.split("=").last.to_i
end

def self.first_page
parse.links[1].split("&").last.split("=").last.to_i
end

def self.last_page
parse.links[2].split("&").last.split("=").last.to_i
end

def self.last_modified
parse.last_modified
end
end
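
Review note: the page helpers above read the page number by position (last `&` segment, last `=` segment), which works as long as `page` stays the final query parameter. A minimal sketch of reading it by name instead, assuming the feed links are plain URL strings as the `split` calls imply. `page_from_link` and the default of page 1 are assumptions for illustration, not part of this PR.

```ruby
require "uri"

# Hypothetical helper: read the "page" query parameter from a feed link.
def page_from_link(link)
  query = URI.parse(link).query.to_s
  params = URI.decode_www_form(query).to_h
  # Falling back to page 1 when the parameter is absent is an assumption.
  params.fetch("page", "1").to_i
end

link = "https://merritt.cdlib.org/object/recent.atom" \
       "?collection=ark:/13030/m5pv6m0x&page=3"

by_name = page_from_link(link)
by_position = link.split("&").last.split("=").last.to_i

# Parameter order no longer matters when looking the value up by name.
reordered = page_from_link(
  "https://merritt.cdlib.org/object/recent.atom" \
  "?page=7&collection=ark:/13030/m5pv6m0x"
)
```

Each of `current_page`, `first_page`, `last_page` and `last_modified` also calls `parse`, so each one triggers its own HTTP request; parsing once and passing the result along may be worth considering.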
2 changes: 2 additions & 0 deletions config/application.yml
@@ -29,6 +29,7 @@ defaults: &defaults
oai_record_prefix: "oai:ucsb"
oai_identifier_prefix: "https://alexandria.ucsb.edu/lib/"
uploads_dir: <%= ENV['ADRL_UPLOADS'] || Rails.root.join("tmp", "uploads") %>
merritt_etd_dir: <%= ENV['ADRL_MERRITT_ETD_DIR'] || Rails.root.join("tmp", "merrittetds") %>

development:
<<: *defaults
@@ -44,3 +45,4 @@ production:
log_dir: /var/log/samvera
minter_state: /opt/alexandria/shared/minter-state
uploads_dir: /opt/alexandria/uploads
merritt_etd_dir: /opt/alexandria/merrittetds
14 changes: 14 additions & 0 deletions config/merritt.yml
@@ -0,0 +1,14 @@
defaults: &defaults
merritt_user: <%= ENV.fetch("MERRITT_USER") %>
merritt_pwd: <%= ENV.fetch("MERRITT_PASSWORD") %>
merritt_etd_dir: <%= ENV['ADRL_MERRITT_ETD_DIR'] || Rails.root.join("tmp", "merrittetds") %>

development:
<<: *defaults

test:
<<: *defaults

production:
<<: *defaults
merritt_etd_dir: /opt/alexandria/merrittetds
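
Review note: this file mixes two lookup styles. `ENV.fetch` raises when the variable is unset, while `ENV[...] || default` falls back silently. A minimal sketch of the difference, using a plain Hash as a stand-in for `ENV` so the example has no side effects (both respond to `fetch` and `[]` in the same way for this purpose).

```ruby
# A plain Hash stands in for ENV; only MERRITT_USER is "set" here.
env = { "MERRITT_USER" => "tjohnson" }

# fetch returns the value when the key exists.
user = env.fetch("MERRITT_USER")

# [] returns nil for a missing key, so the default after || is used.
etd_dir = env["ADRL_MERRITT_ETD_DIR"] || File.join("tmp", "merrittetds")

# fetch raises KeyError for a missing key instead of returning nil.
missing_password =
  begin
    env.fetch("MERRITT_PASSWORD")
  rescue KeyError => e
    e.class
  end
```

In practice this means loading the `merritt` config should fail in any environment where the credentials are not set, which is presumably why the `.env` file above supplies placeholder values.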
9 changes: 9 additions & 0 deletions db/migrate/20191010183125_create_merritt_feeds.rb
@@ -0,0 +1,9 @@
class CreateMerrittFeeds < ActiveRecord::Migration[5.1]
def change
create_table :merritt_feeds do |t|
t.integer :last_parsed_page, index: true, null: false
t.datetime :last_modified, index: true
t.timestamps
end
end
end
11 changes: 10 additions & 1 deletion db/schema.rb
@@ -10,7 +10,7 @@
#
# It's strongly recommended that you check this file into your version control system.

ActiveRecord::Schema.define(version: 20170308175556) do
ActiveRecord::Schema.define(version: 20191010183125) do

create_table "bookmarks", force: :cascade do |t|
t.integer "user_id", null: false
@@ -56,6 +56,15 @@
t.index ["user_id"], name: "index_curation_concerns_operations_on_user_id"
end

create_table "merritt_feeds", force: :cascade do |t|
t.integer "last_parsed_page", null: false
t.datetime "last_modified"
t.datetime "created_at", null: false
t.datetime "updated_at", null: false
t.index ["last_modified"], name: "index_merritt_feeds_on_last_modified"
t.index ["last_parsed_page"], name: "index_merritt_feeds_on_last_parsed_page"
end

create_table "searches", force: :cascade do |t|
t.text "query_params"
t.integer "user_id"
3 changes: 1 addition & 2 deletions doc/ingesting.md
@@ -255,7 +255,6 @@ What happens when you run `bin/ingest -f etd /path/to/etds/etdadmin_upload*`, th
}
```


2. Next, `bin/ingest` passes the path to the XML file of each ETD to
{Importer::ETDParser.parse_file}, which parses the XML and queries
the SRU API, returning a single string of MARC containing the
@@ -276,7 +275,7 @@ What happens when you run `bin/ingest -f etd /path/to/etds/etdadmin_upload*`, th
5. {ObjectFactoryWriter#put} tidies up the information extracted from
the MARC and passes it (as `attributes`) and the ETD hash to
{ObjectFactoryWriter#build_object}, which in turn creates the
{Importer::Factory} for ETDs.
{Importer::Factory} for ETDs, i.e. {ETDFactory}.

6. {Importer::Factory::ETDFactory} is what saves the ETD into Fedora.
It inherits from {Importer::Factory::ObjectFactory} and most of its
9 changes: 9 additions & 0 deletions lib/identifier.rb
@@ -10,4 +10,13 @@ def self.ark_to_noid(ark)
def self.ark_to_id(ark)
ark_to_noid(ark)
end

def self.merritt_ark_to_noid(ark)
return unless (matches = %r{\Aark:/\d{5}/(m\w{7,9})\z}.match(ark))
matches[1]
end

def self.merritt_ark_to_id(ark)
merritt_ark_to_noid(ark)
end
end
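
Review note: a minimal sketch of how the Merritt ARK pattern behaves on a few inputs. The constant name `MERRITT_ARK` and the non-Merritt sample ARK are assumptions for illustration. The sketch anchors the pattern with `\A` and `\z`, which match only at the start and end of the whole string.

```ruby
# Hypothetical constant name; five-digit NAAN, then an "m"-prefixed noid.
MERRITT_ARK = %r{\Aark:/\d{5}/(m\w{7,9})\z}

def merritt_ark_to_noid(ark)
  match = MERRITT_ARK.match(ark.to_s)
  # Return the captured noid, or nil when the string is not a Merritt ARK.
  match && match[1]
end

noid = merritt_ark_to_noid("ark:/13030/m5bp580k")
# A noid that does not start with "m" is rejected.
not_merritt = merritt_ark_to_noid("ark:/48907/f3cz3536")
# With \A and \z, an ARK embedded in a multi-line string is rejected too.
multiline = merritt_ark_to_noid("junk\nark:/13030/m5bp580k")
```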
101 changes: 101 additions & 0 deletions lib/merritt/import_etd.rb
@@ -0,0 +1,101 @@
# frozen_string_literal: true

require "net/https"
require "uri"

module Merritt::ImportEtd
# metadata, dissertation
# & supplemental files
def self.import(etd)
# tmp/merrittetds
merritt_etd_dir = Rails.application.config_for(:merritt)["merritt_etd_dir"]
Dir.mkdir merritt_etd_dir unless File.directory? merritt_etd_dir
# tmp/merrittetds/download_m5bp580k
download_path = File.join(merritt_etd_dir, "download_#{ark(etd)}")
Dir.mkdir download_path unless File.directory? download_path

[metadata_url(etd), dissertation_url(etd)].each do |url|
# tmp/merrittetds/download_m5bp580k/Soriano_ucsb_0035N_14420_DATA.xml
# tmp/merrittetds/download_m5bp580k/Soriano_ucsb_0035N_14420.pdf
file_path = File.join(download_path, escape_encodings(url))
create_proquest_file(file_path, url)
end

if supp_urls(etd).present?
# tmp/merrittetds/download_m5bp580k/supplements
supp_path = File.join(download_path, "supplements")
Dir.mkdir(supp_path) unless File.directory?(supp_path)

supp_urls(etd).each do |supp_url|
# tmp/merrittetds/download_m5bp580k/supplements/av.jpeg
supp_file_path = File.join(supp_path,
escape_encodings(supp_url))
create_proquest_file(supp_file_path, supp_url)
end
end
download_path
end

def self.ark(etd)
etd.entry_id.split("/").last
end

# includes proquest pdf, xml
# & optional supplemental files
def self.proquest_files(etd)
etd.links.select { |l| l.match("ucsb_0035") }
end

def self.metadata_file(etd)
proquest_files(etd).find { |f| f.match("xml") }
end

def self.dissertation_file(etd)
proquest_files(etd).find { |f| f.match("pdf") }
end

def self.supp_files(etd)
proquest_files(etd) - [metadata_file(etd), dissertation_file(etd)]
end

def self.metadata_url(etd)
Merritt::Feed::HOME + metadata_file(etd)
end

def self.dissertation_url(etd)
Merritt::Feed::HOME + dissertation_file(etd)
end

def self.supp_urls(etd)
supp_files(etd).map { |f| Merritt::Feed::HOME + f }
end

def self.get_content(url)
uri = URI.parse(url)
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
request = Net::HTTP::Get.new(uri.request_uri)
request.basic_auth(Rails.application.config_for(:merritt)["merritt_user"],
Rails.application.config_for(:merritt)["merritt_pwd"])
response = http.request(request)
raise Net::HTTPError.new(response.body, response) if response.code != "200"
response.body
end

def self.escape_encodings(url)
# double unescape urls to convert encodings
# like %252F => %2F => / in urls as shown below
# https://merritt.cdlib.org/d/ark:/13030/m5bp580k/9
# /producer/Soriano_ucsb_0035N_67/av.jpeg
str = CGI.unescape(CGI.unescape(url))
# keep only the final path segment, i.e. the file name
str.split("/").last
end

def self.create_proquest_file(file_path, url)
File.open(file_path, "wb") do |f|
f.write get_content(url)
end
end
end
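
Review note: a minimal sketch of the double unescape in `escape_encodings`, to show why one pass is not enough for links whose slashes are encoded twice. `file_name_from_url` is a hypothetical name, and the sample URL is an assumed shape based on the comment in the method, not a real feed entry.

```ruby
require "cgi"

# Hypothetical helper mirroring the intent of escape_encodings.
def file_name_from_url(url)
  # First pass: %252F -> %2F. Second pass: %2F -> /
  decoded = CGI.unescape(CGI.unescape(url))
  # The file name is the last path segment after decoding.
  decoded.split("/").last
end

url = "https://merritt.cdlib.org/d/ark%3A%2F13030%2Fm5bp580k/9/" \
      "producer%252FSoriano_ucsb_0035N_67%252Fav.jpeg"

once = CGI.unescape(url).split("/").last
name = file_name_from_url(url)
```

After a single pass the last segment is still `producer%2FSoriano_ucsb_0035N_67%2Fav.jpeg`; only the second pass exposes `av.jpeg`. Note that `CGI.unescape` also converts `+` to a space, so a file name containing a literal plus sign would be altered.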