Python API for Hacker News (github.com)
48 points by karangoeluw 200 days ago | comments


mapleoin 200 days ago | link

I tried building a REST API once for a challenge if anyone is interested: https://github.com/mapleoin/newhackers

-----

thejosh 200 days ago | link

Does it use https://www.hnsearch.com/api ?

-----

bnejad 200 days ago | link

Not OP, but from a quick glance at the source it doesn't appear to. It downloads the pages and uses Beautiful Soup (a Python HTML parsing library).

-----

jonesetc 200 days ago | link

Nope, it's scraping.

-----

Xeoncross 200 days ago | link

Is that an official API? How long has it been around?

-----

karangoeluw 200 days ago | link

Completely unofficial. I started creating it a month ago.

-----

dholowiski 200 days ago | link

Wow, that's great. I use another one and it's quite unreliable. Thanks!

-----

karangoeluw 199 days ago | link

You can use mine and compare the two, and based on your feedback either I or any other dev can improve it.

-----

karangoeluw 200 days ago | link

It scrapes and slices.

-----

eeadc 200 days ago | link

That library is in many ways deprecated and broken: First, it uses only old-style classes because it doesn't inherit from object explicitly. Furthermore, it uses print in a method; it would be more "Pythonic" to return a str object formatted using str.format.

I think the future is Python 3, and new implementations in Python 2 syntax are simply unnecessary. I would suggest using Python-3-style syntax that is also valid in Python 2.7 (which isn't hard).
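A minimal sketch of the two points above, assuming a hypothetical Story class with made-up fields (the library's actual attributes may differ). Inheriting from object makes it a new-style class under Python 2, and the method returns a formatted string instead of printing it. The syntax is meant to run under both 2.7 and 3.

```python
from __future__ import print_function  # print is a function on Python 2.7 too


class Story(object):  # explicit object base: new-style class on Python 2
    """Hypothetical container; the field names are assumptions."""

    def __init__(self, title, points):
        self.title = title
        self.points = points

    def summary(self):
        # Return a str built with str.format instead of printing in the method.
        return "{0} ({1} points)".format(self.title, self.points)


story = Story("Python API for Hacker News", 48)
print(story.summary())
```

Leaving the print call to the caller means the same string can also go to a log, a template, or a test.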

-----

karangoeluw 199 days ago | link

> First, it uses only old-style classes because it doesn't inherit from object explicitly.

Please explain this further.

> usage of Python-3-style syntax, which is also valid in Python 2.7

Will do this

-----

mapleoin 199 days ago | link

See http://docs.python.org/2/reference/datamodel.html#new-style-... for the distinction

-----

karangoeluw 196 days ago | link

Alright. Fixed.

-----

gpsarakis 200 days ago | link

Nice effort. Just a few remarks:

- You should certainly use Requests http://docs.python-requests.org/en/latest/

- The Story class seems somewhat redundant. You could possibly use collections.namedtuple as a container for properties or simply a dictionary. The print_story method could just be the __str__ special method.

- JSON output would be useful.
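The last two remarks could look roughly like this. It is only a sketch: the field names and the to_json helper are assumptions, not the library's API, and the comment count in the example is invented for illustration.

```python
import json
from collections import namedtuple

# Hypothetical fields; the real Story class may carry different ones.
_StoryBase = namedtuple("Story", ["rank", "title", "points", "num_comments"])


class Story(_StoryBase):
    """Immutable story record with a readable string form."""

    __slots__ = ()  # keep instances as light as a plain namedtuple

    def __str__(self):
        # Replaces a separate print_story method: print(story) just works.
        return "{0}. {1} ({2} points, {3} comments)".format(*self)


def to_json(story):
    # _asdict() maps each field name to its value, which json can serialize.
    return json.dumps(story._asdict(), sort_keys=True)


story = Story(rank=1, title="Python API for Hacker News", points=48, num_comments=34)
print(story)
print(to_json(story))
```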

-----

karangoeluw 199 days ago | link

I will try and implement these. Thanks for the suggestions.

-----

karangoeluw 196 days ago | link

Fixed.

-----

Sharma 200 days ago | link

I think screen scraping is not allowed by HN. A few tries with these APIs might get your IP banned!

-----

scott_s 200 days ago | link

I don't think there's a prohibition on screen scraping, but if you make too many requests to the server in a certain amount of time, your IP will be banned to prevent the server from melting.
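One way a scraper can stay under such a limit is to space out its own requests. The sketch below assumes a hypothetical Throttle helper; the thread does not state the actual limit HN enforces, so any interval passed in is a guess.

```python
import time


class Throttle(object):
    """Enforce a minimum delay between calls, e.g. between page fetches."""

    def __init__(self, min_interval, clock=time.time, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the logic can be tested
        self._sleep = sleep
        self._last = None        # time of the previous call, if any

    def wait(self):
        """Sleep until min_interval has passed since the last call.

        Returns the number of seconds slept.
        """
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

A caller would create one instance, for example Throttle(30.0), and call wait() before each fetch.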

-----

sprizzle 200 days ago | link

The robots.txt file doesn't seem to disallow scraping. https://news.ycombinator.com/robots.txt

-----

karangoeluw 200 days ago | link

Scraping the listing pages seems allowed though.

-----

EugeneOZ 200 days ago | link

Agreed. Also, HN has RSS.

-----

addflip 200 days ago | link

I don't get why you're using a try except block for the num_comments variable. You shouldn't be casting to an int if it doesn't have the attribute.

-----

karangoeluw 200 days ago | link

The meta text on any page can be this:

> 21 points by johns 15 minutes ago | discuss

or

> 152 points by ar7hur 3 hours ago | 58 comments

If the regex matches (case 2), then I cast it to an int. Otherwise (case 1), there are 0 comments.
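The two cases can also be handled without a try/except by checking for a match before casting. This is a sketch, not the library's actual code; the function name and the pattern are assumptions.

```python
import re

# Matches "58 comments" or "1 comment" in the meta line.
_COMMENTS_RE = re.compile(r"(\d+)\s+comments?")


def parse_num_comments(meta_text):
    """Return the comment count from a story's meta line, or 0 if none is shown."""
    match = _COMMENTS_RE.search(meta_text)
    # Only cast when the pattern matched (case 2); "discuss" means no comments yet.
    return int(match.group(1)) if match else 0


print(parse_num_comments("21 points by johns 15 minutes ago | discuss"))
print(parse_num_comments("152 points by ar7hur 3 hours ago | 58 comments"))
```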

-----

sprizzle 200 days ago | link

It's silly to use BeautifulSoup to parse the page when you could use a simple RegEx:

<td class=\"title\"><a href=\"(.?)\"(.?)>(.?)</a>(.?)</td>

-----

kaeawc 200 days ago | link

"HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain"

http://stackoverflow.com/questions/1732348/regex-match-open-...

-----

michaelmcmillan 200 days ago | link

I am willing to sacrifice my soul and everything that is holy.

-----

karangoeluw 200 days ago | link

Using regex to parse HTML is probably the single worst thing you can do.

-----

lloeki 200 days ago | link

Crafting a wide purpose regex to parse whatever HTML comes in is bad.

Building a regex to extract relevant data from simple, fixed-form page data, bypassing tags irrelevant to the problem at hand is not.

-----

untothebreach 200 days ago | link

...until the HTML changes.

I haven't looked at their parsing code, so I have no idea if it is any better than using a regex, but if the regex assumes too much, simply reordering the attributes in a tag (or something similar) could break a regex-based solution.
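To illustrate the attribute-order point, here is a sketch using Python 3's standard html.parser rather than BeautifulSoup. The markup is hypothetical (only the td class="title" cell is taken from the regex posted in this thread), and TitleLinkParser is an invented name.

```python
from html.parser import HTMLParser  # Python 3 standard library


class TitleLinkParser(HTMLParser):
    """Collect (href, text) for links inside <td class="title"> cells."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_title_cell = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # a dict lookup does not care about attribute order
        if tag == "td" and attrs.get("class") == "title":
            self._in_title_cell = True
        elif tag == "a" and self._in_title_cell and "href" in attrs:
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None
        elif tag == "td":
            self._in_title_cell = False


def title_links(html):
    parser = TitleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup: the same link with its attributes in a different order.
a = '<td class="title"><a href="http://example.com/" rel="nofollow">Example</a></td>'
b = '<td class="title"><a rel="nofollow" href="http://example.com/">Example</a></td>'
print(title_links(a) == title_links(b))
```

A pattern that hard-codes `<a href=` directly after the tag name would miss the second variant, while the parser yields the same link for both.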

-----

joshbaptiste 200 days ago | link

Some people, when confronted with a problem... bah you know the rest.

-----

Goranek 200 days ago | link

BeautifulSoup is great, as long as you're using the open source HTML5 parser from Google. https://github.com/google/gumbo-parser

-----

sprizzle 200 days ago | link

Arg, there should be asterisks after every period.
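With the asterisks restored, the pattern posted earlier would presumably read as below. The sample row is hypothetical markup, not copied from HN, and the variable names are made up for the example.

```python
import re

# The earlier pattern with ".?" read as ".*?" (lazy "match anything").
TITLE_RE = re.compile(r'<td class="title"><a href="(.*?)"(.*?)>(.*?)</a>(.*?)</td>')

# Hypothetical row in the shape the pattern expects.
row = '<td class="title"><a href="http://example.com/" rel="nofollow">Example</a> (example.com)</td>'

match = TITLE_RE.search(row)
if match:
    href, extra_attrs, title, trailer = match.groups()
    print(href, title)
```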

-----



