Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Implementation for TableParser #46

Merged
merged 9 commits into from
Nov 11, 2023
Merged

Conversation

nnilayy
Copy link
Contributor

@nnilayy nnilayy commented Nov 10, 2023

The pull request introduces a new TableParser module, which would help to parse the TableElement objects and retrieve the parsed table data accordingly. To utilize it, follow these steps:

  1. Obtain the desired TableElement object.
  2. Instantiate TableParser by passing the TableElement object to the constructor.
  3. Invoke the parse method to process the table.
  4. Retrieve the parsed data as either a:
  • DataFrame using the table_as_df method
  • JSON using the table_as_json method

Example:
TableParser

@nnilayy nnilayy requested a review from Elijas as a code owner November 10, 2023 05:34
@Elijas
Copy link
Contributor

Elijas commented Nov 11, 2023

Hello @nnilayy,

Thank you for the pull request!

I wanted to suggest a small enhancement for your development workflow. Have you considered integrating the Ruff linter into your IDE? It's a fantastic tool that can automatically resolve various issues upon saving a file. This could significantly streamline your coding process and reduce the time spent on manual corrections.

For the current pull request, there's no need for you to take any action; I'll take care of applying the necessary fixes. However, it might be a valuable addition to future work.

Warm regards, and happy coding! 🙌


P.S. Here's an example output of the Ruff plugin:
image

@Elijas Elijas merged commit 9d5c8c3 into alphanome-ai:main Nov 11, 2023
0 of 3 checks passed
@Elijas
Copy link
Contributor

Elijas commented Nov 11, 2023

@nnilayy,

Firstly, a big thank you for your dedication and hard work on the table parsing feature – it's a crucial part of our project at Alphanome AI!

I've taken the liberty of refining a few minor points (with Pull Request #47) highlighted by the Ruff linter and other checks. For your reference, you can run these checks on your own using ruff check --fix sec_parser/ && mypy && pytest tests/unit/. Alternatively, if you have Task installed, simply execute task pre-commit-checks to run all the checks as outlined in our Taskfile.yml.

Unit Tests and Quality Assurance

Ensuring the quality of our parser is pivotal. We're focusing on introducing robust Unit Tests to achieve this. You can understand more about our testing philosophy here.

Here's what I've added to guide our testing efforts:

  1. A sample table from Apple's latest 10-Q report as a test case:

TABLE_10Q_AAPL_0000320193_23_000077__001 = """
<div style="margin-top:6pt;text-align:justify">
<table style="border-collapse:collapse;display:inline-table;margin-bottom:5pt;vertical-align:text-bottom;width:100.000%">
<tr>
<td style="width:1.0%">

  1. The expected parsing output for this table:

@pytest.mark.parametrize(
("name", "html_input", "expected_output"),
tests := [
(
"simple table",
TABLE_10Q_AAPL_0000320193_23_000077__001,
pd.DataFrame(
data={
"1": [
"Due after 1 year through 5 years",
"Due after 5 years through 10 years",
"Due after 10 years",
"Total fair value",
],
"2": ["$", "11148", "16646", "$"],
"3": ["76267.0", "11148.0", "16646.0", "104061.0"],
},
),
),
],

  1. A visual comparison of the input and expected output:
    Input and Output Visual

Could you review these test cases and adjust them if they don't align with your expectations for the output? Also, it would be great if you could add more diverse or edge-case examples to our test suite (you could do it just by adding more items to the tests := [ list). This would immensely help in covering a wider range of scenarios and ensuring robustness in our parser.

Thank you for your continuous contributions and support. Your efforts are greatly valued in our community!

@nnilayy
Copy link
Contributor Author

nnilayy commented Nov 11, 2023

Hey Elijas,
Thank you so much for refining the code on my behalf and for your feedback regarding the use of Ruff Linter and other IDE tools. My apologies for a sloppy commit history. I'm gonna work on keeping up with the common practice and the latest IDE tools to make the workflow smoother and more professional.

As for the tests, I'll work on adding more edge case tests for the parser. As for the output, I do have something different in mind, as I think the parser could be refined even more to reduce data redundancy. As in this example, if we see the output data frame, for columns 2 & 3, the rows except $ signs are the same, so they could be merged, Ultimately leaving a 2-column data frame, the implementation for which is already in the parser. So yeah regarding the output and further refining the parser, I would love to discuss that with you in Discord.

Again thank you so much for the feedback and for merging this pr.

@Elijas
Copy link
Contributor

Elijas commented Nov 11, 2023

refining the code on my behalf

I forgot to add link to the refinement PR, in case you'd want to have a reference -- #47

sloppy commit history.

No worries at all, Github provides with a very convenient "Squash and merge" functionality, which overwrites all PR commit history with 1 commit with a custom message. So for contributor convenience, we're just doing 1 PR = 1 commit 👍

I'm gonna work on keeping up

Once again, no worries at all -- not having latest linters and stuff like that should never be a reason to re-consider creating a PR or delay submitting a contribution 🙌

Thanks again for the PR! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants