Added Implementation for TableParser #46

nnilayy · 2023-11-10T05:34:06Z

The pull request introduces a new TableParser module, which would help to parse the TableElement objects and retrieve the parsed table data accordingly. To utilize it, follow these steps:

Obtain the desired TableElement object.
Instantiate TableParser by passing the TableElement object to the constructor.
Invoke the parse method to process the table.
Retrieve the parsed data as either a:

DataFrame using the table_as_df method
JSON using the table_as_json method

Example:

Elijas · 2023-11-11T02:07:47Z

Hello @nnilayy,

Thank you for the pull request!

I wanted to suggest a small enhancement for your development workflow. Have you considered integrating the Ruff linter into your IDE? It's a fantastic tool that can automatically resolve various issues upon saving a file. This could significantly streamline your coding process and reduce the time spent on manual corrections.

For the current pull request, there's no need for you to take any action; I'll take care of applying the necessary fixes. However, it might be a valuable addition to future work.

Warm regards, and happy coding! 🙌

P.S. Here's an example output of the Ruff plugin:

Elijas · 2023-11-11T05:14:58Z

@nnilayy,

Firstly, a big thank you for your dedication and hard work on the table parsing feature – it's a crucial part of our project at Alphanome AI!

I've taken the liberty of refining a few minor points (with Pull Request #47) highlighted by the Ruff linter and other checks. For your reference, you can run these checks on your own using ruff check --fix sec_parser/ && mypy && pytest tests/unit/. Alternatively, if you have Task installed, simply execute task pre-commit-checks to run all the checks as outlined in our Taskfile.yml.

Unit Tests and Quality Assurance

Ensuring the quality of our parser is pivotal. We're focusing on introducing robust Unit Tests to achieve this. You can understand more about our testing philosophy here.

Here's what I've added to guide our testing efforts:

A sample table from Apple's latest 10-Q report as a test case:

sec-parser/tests/unit/semantic_elements/table_element/_data_for_table_parser.py

Lines 1 to 5 in 46b5ff8

    
           TABLE_10Q_AAPL_0000320193_23_000077__001 = """ 
        
           <div style="margin-top:6pt;text-align:justify"> 
        
            <table style="border-collapse:collapse;display:inline-table;margin-bottom:5pt;vertical-align:text-bottom;width:100.000%"> 
        
             <tr> 
        
              <td style="width:1.0%">

The expected parsing output for this table:

sec-parser/tests/unit/semantic_elements/table_element/test_table_parser.py

Lines 11 to 30 in 46b5ff8

    
           @pytest.mark.parametrize( 
        
               ("name", "html_input", "expected_output"), 
        
               tests := [ 
        
                   ( 
        
                       "simple table", 
        
                       TABLE_10Q_AAPL_0000320193_23_000077__001, 
        
                       pd.DataFrame( 
        
                           data={ 
        
                               "1": [ 
        
                                   "Due after 1 year through 5 years", 
        
                                   "Due after 5 years through 10 years", 
        
                                   "Due after 10 years", 
        
                                   "Total fair value", 
        
                               ], 
        
                               "2": ["$", "11148", "16646", "$"], 
        
                               "3": ["76267.0", "11148.0", "16646.0", "104061.0"], 
        
                           }, 
        
                       ), 
        
                   ), 
        
               ],

A visual comparison of the input and expected output:

Could you review these test cases and adjust them if they don't align with your expectations for the output? Also, it would be great if you could add more diverse or edge-case examples to our test suite (you could do it just by adding more items to the tests := [ list). This would immensely help in covering a wider range of scenarios and ensuring robustness in our parser.

Thank you for your continuous contributions and support. Your efforts are greatly valued in our community!

nnilayy · 2023-11-11T08:22:19Z

Hey Elijas,
Thank you so much for refining the code on my behalf and for your feedback regarding the use of Ruff Linter and other IDE tools. My apologies for a sloppy commit history. I'm gonna work on keeping up with the common practice and the latest IDE tools to make the workflow smoother and more professional.

As for the tests, I'll work on adding more edge case tests for the parser. As for the output, I do have something different in mind, as I think the parser could be refined even more to reduce data redundancy. As in this example, if we see the output data frame, for columns 2 & 3, the rows except $ signs are the same, so they could be merged, Ultimately leaving a 2-column data frame, the implementation for which is already in the parser. So yeah regarding the output and further refining the parser, I would love to discuss that with you in Discord.

Again thank you so much for the feedback and for merging this pr.

Elijas · 2023-11-11T08:33:14Z

refining the code on my behalf

I forgot to add link to the refinement PR, in case you'd want to have a reference -- #47

sloppy commit history.

No worries at all, Github provides with a very convenient "Squash and merge" functionality, which overwrites all PR commit history with 1 commit with a custom message. So for contributor convenience, we're just doing 1 PR = 1 commit 👍

I'm gonna work on keeping up

Once again, no worries at all -- not having latest linters and stuff like that should never be a reason to re-consider creating a PR or delay submitting a contribution 🙌

Thanks again for the PR! 🎉

nnilayy added 6 commits November 10, 2023 09:00

Added Implementation for TableParser

bc674aa

Added Implementation for TableParser

83476a2

Added Implementation for TableParser

2bdd2d4

Added Implementation for TableParser

37f4acb

Added Implementation for TableParser

0a7cd3a

Added Implementation for TableParser

13bf8e5

nnilayy requested a review from Elijas as a code owner November 10, 2023 05:34

nnilayy added 3 commits November 10, 2023 11:13

Update table_parser.py

4071670

Changed Line Formatting of Code

887c494

Replaced range-len with enumerate

c42a86c

Elijas merged commit 9d5c8c3 into alphanome-ai:main Nov 11, 2023
0 of 3 checks passed

Elijas mentioned this pull request Nov 11, 2023

feat(table_parser): clean up, add unit tests #47

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added Implementation for TableParser #46

Added Implementation for TableParser #46

nnilayy commented Nov 10, 2023 •

edited

Loading

Elijas commented Nov 11, 2023 •

edited

Loading

Elijas commented Nov 11, 2023 •

edited

Loading

nnilayy commented Nov 11, 2023

Elijas commented Nov 11, 2023 •

edited

Loading

Added Implementation for TableParser #46

Added Implementation for TableParser #46

Conversation

nnilayy commented Nov 10, 2023 • edited Loading

Elijas commented Nov 11, 2023 • edited Loading

Elijas commented Nov 11, 2023 • edited Loading

Unit Tests and Quality Assurance

nnilayy commented Nov 11, 2023

Elijas commented Nov 11, 2023 • edited Loading

nnilayy commented Nov 10, 2023 •

edited

Loading

Elijas commented Nov 11, 2023 •

edited

Loading

Elijas commented Nov 11, 2023 •

edited

Loading

Elijas commented Nov 11, 2023 •

edited

Loading