feat: tool and scripts to interactively explore webgraphs #16

Merged on Jul 16, 2024 (1 commit)

@sebastian-nagel commented Apr 15, 2024

  1. build the cc-webgraph project, see README.md
  2. download into the current directory
    1. the webgraph files (*.graph, *.properties)
    2. the webgraph files of the transpose of the graph
    3. the vertex file
  3. build the vertex name map by running the following command (here for the graph cc-main-2023-24-sep-nov-feb-domain):
    `.../cc-webgraph/src/script/webgraph_ranking/graph_explore_build_vertex_map.sh cc-main-2023-24-sep-nov-feb-domain cc-main-2023-24-sep-nov-feb-domain-vertices.txt.gz`
  4. launch the JShell to explore the graph:
    jshell --class-path .../cc-webgraph/target/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar
    |  Welcome to JShell -- Version 17.0.10
    |  For an introduction type: /help intro
    
    jshell> import org.commoncrawl.webgraph.explore.GraphExplorer
    
    jshell> GraphExplorer g = new GraphExplorer("cc-main-2023-24-sep-nov-feb-domain")
    g ==> org.commoncrawl.webgraph.explore.GraphExplorer@34b7bfc0
    
    jshell> g.indegree("org.commoncrawl")
    $3 ==> 1693
    
    jshell> g.outdegree("org.commoncrawl")
    $4 ==> 253
    
    jshell> g.ls("org.commoncrawl")
    0: #152649      ai.spawning
    1: #1964504     au.com.spatialsource
    2: #2163216     au.net.cedar
    3: #2893468     be.youtu
    4: #3311953     blog.liyanxu
    5: #5719144     cern.home
    6: #6874710     cl.uchile
    

See also the graph exploration tutorial
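
The vertex labels in the `ls` output above are domain names in reversed notation (`org.commoncrawl`, `au.com.spatialsource`), so the top-level domain is the first dot-separated component. A minimal sketch of counting top-level domains over such a list follows. The class `TldCountSketch` and its methods are hypothetical names for illustration only; they are not the API added by this pull request.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not part of cc-webgraph.
public class TldCountSketch {

    /** Returns the first component of a reversed domain name, e.g. "org" for "org.commoncrawl". */
    static String tld(String label) {
        int dot = label.indexOf('.');
        return dot < 0 ? label : label.substring(0, dot);
    }

    /** Counts how often each top-level domain occurs, sorted by TLD. */
    static Map<String, Integer> countTlds(List<String> labels) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String label : labels) {
            counts.merge(tld(label), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Labels copied from the g.ls("org.commoncrawl") listing above.
        List<String> labels = List.of(
            "ai.spawning", "au.com.spatialsource", "au.net.cedar",
            "be.youtu", "blog.liyanxu", "cern.home", "cl.uchile");
        System.out.println(countTlds(labels));
        // prints {ai=1, au=2, be=1, blog=1, cern=1, cl=1}
    }
}
```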

@sebastian-nagel force-pushed the explore-graphs branch 3 times, most recently from 153db36 to 8fffff2 (June 29, 2024 12:48)
@sebastian-nagel marked this pull request as ready for review (June 29, 2024 12:49)
- the class GraphExplorer allows exploring webgraphs using the JShell
- the class Graph holds all webgraph-related data as memory-mapped data:
  the graph, its transpose and the map to translate between vertex labels
  and IDs. It provides methods to access successors and predecessors, etc.
- the script graph_explore_download_webgraph.sh downloads all files
  required for exploring a graph
- the script graph_explore_build_vertex_map.sh builds a map of vertex
  labels to vertex IDs and verifies that all graph files required for
  graph exploration are downloaded.
- utility methods
  - get a common subset (intersection) or the union
    of the successors or predecessors of a list of vertices
  - class CountingMergedIntIterator to count occurrences of integers
    given a list of int iterators as input
  - print list of vertices
  - load and save vertex lists from/to files
  - count top-level domains in lists of vertices
- JShell script to load a graph
- tutorial / quick start graph exploration
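
To illustrate the idea of counting occurrences of integers over a list of int iterators, here is a minimal sketch. It is not the `CountingMergedIntIterator` class from this pull request: the class name `MergedCountSketch`, the record `Count` and the method `countMerged` are hypothetical, and the sketch assumes that every input iterator yields its values in ascending order (as sorted successor lists would).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PrimitiveIterator;
import java.util.PriorityQueue;
import java.util.stream.IntStream;

// Hypothetical illustration, not the implementation in cc-webgraph.
public class MergedCountSketch {

    /** One (value, count) pair of the merged output. */
    public record Count(int value, int count) {}

    /** Cursor over one input iterator, holding its current head value. */
    private static final class Cursor {
        final PrimitiveIterator.OfInt it;
        int head;

        Cursor(PrimitiveIterator.OfInt it) {
            this.it = it;
            this.head = it.nextInt();
        }

        /** Moves to the next value; returns false if the input is exhausted. */
        boolean advance() {
            if (!it.hasNext()) return false;
            head = it.nextInt();
            return true;
        }
    }

    /** Merges ascending int iterators and counts how often each value occurs. */
    public static List<Count> countMerged(List<PrimitiveIterator.OfInt> inputs) {
        PriorityQueue<Cursor> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.head, b.head));
        for (PrimitiveIterator.OfInt it : inputs) {
            if (it.hasNext()) queue.add(new Cursor(it));
        }
        List<Count> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            int value = queue.peek().head;
            int count = 0;
            // Pop every cursor whose head equals the current smallest value.
            while (!queue.isEmpty() && queue.peek().head == value) {
                Cursor c = queue.poll();
                count++;
                if (c.advance()) queue.add(c);
            }
            result.add(new Count(value, count));
        }
        return result;
    }

    public static void main(String[] args) {
        List<PrimitiveIterator.OfInt> inputs = List.of(
            IntStream.of(1, 3, 5).iterator(),
            IntStream.of(3, 5, 7).iterator(),
            IntStream.of(5, 9).iterator());
        for (Count c : countMerged(inputs)) {
            System.out.println(c.value() + " x" + c.count());
        }
    }
}
```

With duplicate-free inputs, the values whose count equals the number of inputs form the intersection, and all emitted values together form the union, which is one way the two set operations listed above could be derived from a single merge pass.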
@sebastian-nagel merged commit 15d1daf into main (Jul 16, 2024)
3 checks passed
@sebastian-nagel deleted the explore-graphs branch (July 16, 2024 10:58)