What a nice project. What inspired this initially?
FYI there's a broken link in your readme:
https://rumca-js.github.io/internet full internet search
hobs 2 hours ago [-]
Can’t you just request ICANN’s zone files and have the canonical list of the day?
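For anyone curious what that would look like in practice, here is a minimal sketch of pulling registered domains out of a TLD zone file. It assumes the file is in the usual master-file layout with one record per line (`owner ttl class type rdata`) and fully qualified owner names; the function name `domains_from_zone` and the sample records are made up for illustration. Note that zone files of this kind cover generic TLDs only, so country-code domains would still be missing from the list.

```python
def domains_from_zone(lines, tld):
    """Collect second-level domains that have NS records in a TLD zone file.

    Assumes one record per line: owner ttl class type rdata,
    with fully qualified owner names (trailing dot).
    """
    suffix = "." + tld.lower().strip(".") + "."
    found = set()
    for raw in lines:
        # Drop comments and skip blank lines and $ORIGIN/$TTL directives.
        line = raw.split(";", 1)[0].strip()
        if not line or line.startswith("$"):
            continue
        fields = line.split()
        # Only delegation (NS) records identify a registered domain.
        if len(fields) < 5 or fields[3].lower() != "ns":
            continue
        owner = fields[0].lower()
        if not owner.endswith(suffix):
            continue
        # Keep only the label directly under the TLD.
        label = owner[: -len(suffix)].split(".")[-1]
        if label:
            found.add(label + suffix.rstrip("."))
    return found


sample = [
    "$ORIGIN com.",
    "example.com. 172800 in ns ns1.example.net.",
    "example.com. 172800 in ns ns2.example.net.",
    "Other.COM. 172800 IN NS ns1.other.com.",
    "ns1.other.com. 172800 in a 192.0.2.1",
    "; a comment line",
]
print(sorted(domains_from_zone(sample, "com")))
```

The result is a list of names, not of pages, so it would only seed a crawler rather than replace one.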
didip 2 hours ago [-]
This is amazing. Thanks for sharing!
bufferoverflow 1 hours ago [-]
[dead]
lxe 19 minutes ago [-]
This is a cool hobby project, but why is this notable? Why a FastCompany article? I'm trying to figure out anything that sets this apart from thousands of other little hobby search projects.
I understand companies like Perplexity or Brave or DuckDuckGo "rivaling Google", but building a hobby index and crawler is nice, and worthy of a "Show HN: "... but an actual media article?
gowld 43 seconds ago [-]
It's only notable as a clickbait narrative for ignorant readers -- FastCompany's target market
luizfelberti 4 hours ago [-]
I was trying to do this in 2023! The hardest part about building a search engine is not the actual searching though, it is (like others here have pointed out), building your index and crawling the (extremely adversarial) internet, especially when you're running the thing from a single server in your own home without fancy rotating IPs.
I hope this guy succeeds and becomes another reference in the community like the marginalia dude. This makes me want to give my project another go...
While the index is currently not open source, it should be at some point. Maybe when they get out of the beta stage (?); details are still unclear.
3RTB297 13 minutes ago [-]
You know, it's possible the cure to an adversarial internet is to just have some non-profit serve as a repo for a universal clearnet index that anyone can access to build their own search engine. That way we don't have endless captchas and anubis and Cloudflare tests every time I try and look for a recipe online. Why send AI scrapers to crawl literally everything when you're getting the data for free?
I'll add it to the mile-long list of things that should exist and be online public goods.
moduspol 3 hours ago [-]
Is the common crawl usable for something like this?
Too bad it doesn't support android. It is much more energy efficient than anything else I can spare (for 100% uptime contribution)
ge96 3 hours ago [-]
The IP thing is interesting, I was trying to make this CSGO bot one time to scrape steam's prices and there are proxy services out there you rent, tried at least one and it was blocked by steam. So I wonder if people buy real IPs.
kccqzy 3 hours ago [-]
Yeah people buy residential IPs on the black market. They are essentially infected home PCs and botnets.
The crawl seems hard, but the difference between having something and not having it is very obvious. Ordering the results is not. What should go on page 200, and do those results still count as having them?
lucb1e 2 hours ago [-]
It claims I reached the article limit. The last time I saw a fastcompany link must have been a decade ago! I was nostalgically looking forward to reading another article of theirs. Alas...
> The secret to making it all happen? Large language models. “What I’m doing is actually very traditional search,” Pearce says. “It’s what Google did probably 20 years ago, except the only tweak is that I do use AI to do keyword expansion and assist with the context understanding
> Fellow ambitious hobbyist Wilson Lin, who on his personal blog <https://blog.wilsonl.in/search-engine/> recently described his efforts to create a search engine of his own, took the opposite approach from Pearce.
> And then there’s the concept of doing a small-site search, along the lines of the noncommercial search engine Marginalia <https://marginalia-search.com>, which favors small sites over Big Tech
And the obvious answer to the title: "Why the laundry room? Two reasons: Heat and noise." It runs on a 32-core AMD EPYC 7532, half a terabyte of RAM, and "all in, cost $5,000, with about $3,000 of that going toward storage"
udkl 1 hours ago [-]
I absolutely devoured Wilson Lin's articles recently. They are very high quality and informative for any amateur interested in search engines and LLMs! - https://blog.wilsonl.in/search-engine/
cheema33 5 hours ago [-]
I tried the search site at https://searcha.page/ by searching for something random and got the following message:
"An error has occurred building the search results."
eschulz 7 minutes ago [-]
Before this happened to me, my first search returned an impressive SERP.
authnopuz 5 hours ago [-]
hug of death? I fear the temperature will get very high in his laundry room
DannyBee 5 hours ago [-]
I'm sure it depends on how much laundry he is doing - his dryer is probably heated entirely by servers.
He can then exhaust the remaining server heat through the dryer vent stack.
debo_ 4 hours ago [-]
Keep going. I love dry humor.
ArekDymalski 3 hours ago [-]
Until the exhaust starts "Feeling leaky" I guess.
'Google rival' is quite a stretch, surely 'search engine' is not just more accurate, but clearer too with all that Google does today, as if that's new.
ofrzeta 3 hours ago [-]
"The beefy CPU running this setup, a 32-core AMD EPYC 7532, underlines just how fast technology moves. At the time of its release in 2020, the processor alone would have cost more than $3,000. It can now be had on eBay for less than $200"
why do I never get deals like that when I am shopping for the homelab on eBay?
progval 3 hours ago [-]
You need to spend a lot of time looking through badly labeled offers, and be willing to buy from sellers with no reputation.
Gormo 31 minutes ago [-]
TheServerStore.com often has good deals. I actually bought a brand new 64-core EPYC 7702 server with 256 GB RAM and 8TB NVMe storage for about $3K fully assembled earlier this year.
robrtsql 3 hours ago [-]
I searched "AMD EPYC 7532" and there are a ton of listings for $150-$200. Are you just regretful that it wasn't like this when you were shopping parts for your homelab?
throwawayffffas 1 hours ago [-]
I got a 7551p plus motherboard and ram for about 600 bucks from China this January. I may have overpaid but it works great, and gets the job done.
_fat_santa 3 hours ago [-]
Not for a CPU but earlier this year I bought a Thinkpad workstation off eBay for $500. It's a machine from 2020 and when it was new cost $5,700.
I see this for pretty much all hardware out on eBay, just go back 5 years and watch the price fall 10x.
saalweachter 3 hours ago [-]
Has eBay fixed their "and then they ship you a box of rocks" problem?
I feel like there was a five year span where everyone I talked to said buying or selling electronics on eBay was a nightmare, so I'm a little curious if I need to re-evaluate my priors.
apetresc 2 hours ago [-]
My understanding is that eBay sides with the buyer on all disputes, to the point of ridiculousness. So you should be fine.
The real issue is being a seller and solving the "and then the customer claims I shipped them a box of rocks" problem.
buildbot 2 hours ago [-]
Yep selling is way more risky. Ebay might be the most safe (refund wise) marketplace for buyers… I have more trouble with amazon.
buildbot 2 hours ago [-]
Yes, it’s extremely rare to be stuck with a broken/wrong/missing item as a buyer on eBay. Selling is quite risky in some ways because eBay will nearly always side with a buyer. Every missing or broken thing I have purchased has been refunded or replaced. On the other hand, 3 things I have sold were claimed to not arrive. The only case where eBay decided in my favor was when the buyer had signed for the package in a literal USPS office :)
throwawayffffas 1 hours ago [-]
You don't get that with used old stuff, you get it with unrealistic low prices for new stuff.
A 7532 CPU is now e-waste for all the datacenters out there, so 1/10 of the original price is reasonable, but the latest Nvidia GPU for 200 bucks is obviously a scam.
accrual 2 hours ago [-]
> Has eBay fixed their "and then they ship you a box of rocks" problem?
I've personally never had that problem after over a decade and hundreds of purchases on eBay. I've had some defective parts, but never outright fraud. IME eBay favors buyers.
ThatMedicIsASpy 3 hours ago [-]
Epyc7000+MB+256GB-512GB RAM (from china) usually starts at 800 euros + import tax
risico 11 minutes ago [-]
One of my dream projects as well, sadly it feels a lot harder to crawl the internet these days, as others have said around here as well.
What are some good practices these days to ensure a good crawl/scrape? Invest in proxies, preferably residential?
phendrenad2 2 hours ago [-]
This is a cool project, and I hope he has fun with it.
I've daydreamed about how I'd create my own search engine so, so many times. But I always run into an impassable wall: The internet now isn't at all the same as the internet in 1999.
Discovery isn't really that useful. If you find someone's self-hosted blog about dinosaurs, it probably hasn't been updated since 2004, all the links and images are broken, and it's just thoroughly upstaged by Wikipedia and the Smithsonian. Sure, it's fun to find these quirky sites, but they aren't as valuable as they once were.
We've basically come full circle to the AOL model, where there are "hubs" of content that cater to specific categories. YouTube has ALL the long-form essays. Tiktok has ALL the humorous videos. Medium has ALL the opinion pieces. Reddit has ALL the flame wars. Mayo Clinic has ALL the drug side-effects. Amazon has ALL the shopping. Ebay has ALL the collectables.
None of these big companies want nasty little web crawlers poking and prodding their site. But they accept Google crawlers, because Google brings them users. Are they going to be that friendly to your crawler?
Of course, I still dream. Maybe a hub-based internet needs a hub-aware search engine?
_joel 30 minutes ago [-]
The photo of the power socket right next to the sink looks safe
The great thing about this is that with the decentralization/recentralization of the Web, it may become easier for certain people to roll their own search engines for their respective communities and crawl/index pages only according to their shared tastes.
The bad thing about this is...read above.
iam_saurabh 3 hours ago [-]
I love stories like this—tech history is full of scrappy beginnings. Even if this project doesn’t succeed, it reminds us that giant companies aren’t unshakable.
When I started using it (~2 years ago), it was necessary. Google was simply not solving any of my actual issues (software related).
Now it seems that Google might have improved a bit. I check from time to time and the gap isn't as huge as when Kagi started.
shayway 4 hours ago [-]
How does your experience with Searcha compare? It seems to be down at the moment.
the_third_wave 4 hours ago [-]
Do Kagi users get paid for shilling the company? Nearly all threads relating to the subject of search have a few mentions of the glory of Kagi, often including links to the site. I suspect this is not as effective as the Kagi crew thinks since there is likely to be a large overlap between their potential customers and those who are really turned off by such shilling.
dawnerd 3 hours ago [-]
Flip side how much does Google pay you to defend their monopoly? Kagi is a solid product with a team that clearly cares about what they’re building. They’re transparent and post change logs when things update. I simply trust them infinitely more than Google.
hamdingers 4 hours ago [-]
Have you considered it's a good product that causes its users to become advocates?
> The effect is most likely to occur when there are no obvious reasons for performing the task. Because expending effort to perform a useless or unenjoyable task, or experiencing unpleasant consequences in doing so, is cognitively inconsistent (see cognitive dissonance), people are assumed to shift their evaluations of the task in a positive direction to restore consistency.
TIL about effort justification! I think signing up for Kagi is not particularly effort-intensive however.
datadrivenangel 3 hours ago [-]
Kagi customer here. Not getting paid to shill. I think it's worth occasionally mentioning alternatives that are good enough to pay for so that other people know there are other people using other options.
But full disclosure, sometimes I'm using DuckDuckGo and it's also good enough most of the time that I occasionally forget until I go down some rabbit hole and realize that I'm using the wrong search engine.
testdelacc1 3 hours ago [-]
Disclaimer: Not a Kagi user. Unlikely to use it.
I just don’t understand people who get so upset that someone might like something enough to talk about liking it. So upset that they won’t ever try the thing. Like … ok I guess? You do you. It’s just a strange way to make decisions.
At least this is just a consumer product. Worse is when people here say they make technical decisions using the same process. They’d black list certain tech because they’ve heard people talking about how it solved their problems. Also ok, but now I know I should avoid them professionally.
mdaniel 3 hours ago [-]
I get the impression it's the volume of the folks who sing its praises. There was a web3 crowd for a while, Bitwarden champions would show up to any mention of a password manager, and (ahem) some AI champions can be over the top
In all of these cases, a reasonable counterpoint is that if it were that applicable for all audiences, one wouldn't need to sing its praises, it would sing its own praises
ufmace 3 hours ago [-]
It sings its own praises... how exactly? Maybe by a bunch of happy users talking about how they like it and it's a better solution to the problem that the thread or article is about without being explicitly paid? Which is exactly what's happening here and some people are complaining about it?
testdelacc1 3 hours ago [-]
How does a password manager sing its own praises?
koakuma-chan 3 hours ago [-]
I tried it, it's slow and bad and free tier is only 100 requests, and it's too expensive, and price is unjustified. I use gemini with google search grounding.
alexjplant 3 hours ago [-]
I understand skepticism in the age of LLM-generated content and CAPTCHA-solving bots. What I don't understand is why people choose such weird hills to die on and think that posting about it will accomplish anything. Do you think people will read your comment and go "gee, I was going to use Kagi but now I won't because this random person has a bad feeling about a series of comments they remember seeing"?
I signed up for a specialist forum not too long ago and posted an honest review of a product because I hadn't been able to find one anywhere on the internet. Immediately a bunch of people accused me of being a "shill" for a direct-to-consumer business that's been powered by a Yahoo storefront for the last 20 years, as though a business that's run by a guy with an AOL e-mail address is sophisticated enough to figure out Fiverr and astroturf their reputation on a phpBB forum.
Think about it for just a moment - do you really think that the Hacker News audience is large enough or full of enough tastemakers to sway an alternative search engine's market share? It isn't. If Kagi wanted to do that they'd hire TikTok influencers.
throwaway290 2 hours ago [-]
no one else would pay for search. people on HN is probably 90% of their total possible market.
lelandbatey 4 hours ago [-]
Nope, it's just a nice thing I like. It is nearly the platonic ideal of a search engine for me. It causes me no problems and doesn't try to sell me garbage.
It's like discovering that there's a better pair of shoes that's more comfortable. Everybody can use a slightly improved, more comfortable pair of shoes, so it comes up frequently.
tmdetect 4 hours ago [-]
Kagi is a polished product. This is drying someone's laundry.
Google was invented many years ago by two guys in a dorm room and since then there's been so many white papers and advancements in the public sphere and the actual underlying problem has not changed that much, that it seems like it could be done by a small group or independent person.
dec0dedab0de 4 hours ago [-]
Crawling is much more difficult than it used to be. Significantly more content is behind a login, Javascript is required for way more than it should be, and almost the entire web is behind cloudflare or another type of captcha.
marginalia_nu 1 hours ago [-]
These things are actually fairly small problems.
The parts that absolutely require JS can't be reliably linked to and nobody indexes that stuff. Most apparent SPAs serve an HTML alternative if you don't claim to be a web browser in the UA.
Cloudflare and the like are also fairly easy to deal with as long as your crawler is well behaved. You can register the fingerprint and mostly get access to Cloudflare-protected websites.
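One common reading of "well behaved" is honoring robots.txt and any crawl delay it requests. Below is a minimal sketch using Python's standard `urllib.robotparser`; the user agent string `hobby-crawler` and the robots.txt content are made up for illustration, and this says nothing about how Marginalia or Cloudflare's bot registration actually work.

```python
from urllib.robotparser import RobotFileParser


def build_policy(robots_txt):
    """Parse robots.txt text (already downloaded) into a policy object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt served by some site.
ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

policy = build_policy(ROBOTS)
agent = "hobby-crawler"  # hypothetical, honest (non-browser) user agent

for url in ("https://example.com/blog/", "https://example.com/private/x"):
    verdict = "fetch" if policy.can_fetch(agent, url) else "skip"
    print(verdict, url)

# Seconds to wait between requests to this host, or None if unspecified.
print("delay:", policy.crawl_delay(agent))
```

In a real crawler the parsed policy would be cached per host and the delay enforced by the fetch scheduler.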
non_aligned 4 hours ago [-]
I think there are two factors that helped Google. First, the search engine landscape back then was absolutely abysmal. I'm sure someone will chime in saying that it's abysmal today as well, but the reality is that 99%+ of consumer searches get good results today. And that's simply because the nature of search has changed: we have billions of people using the internet, and they overwhelmingly just search for products to buy, local restaurants that offer takeout, or for familiar pop content to watch or listen to. And there's some SEO spam there, but also pretty fierce quality assurance by search engines.
Second, the internet was different: when all nerds declared that Google is good, that was CNN-grade newsworthy (and CNN used to matter a lot more back then), simply because the internet seemed kinda important, but there was no other authority on the topic. Today, that's not the case. If you need someone to opine on the internet on air, you invite some political pundit or a business analyst.
So no, I don't think you can repeat the success of Google the same way. It was a product of its time.
snek_case 2 hours ago [-]
Google maps is probably a big moat that's very hard to replicate. You can't as easily just crawl all of that data. It's not easy to generate directions. The average user doesn't want to use your search engine for one thing and Google for everything else, they just want a one stop shop for search.
That's what I was expecting this submission to be about, although to be honest I'm not certain that Marginalia would want the influx of a fastcompany sized tire kicking
marginalia_nu 1 hours ago [-]
To be fair I'm on a colocated server now. No more apartment hosting for me.
jrm4 4 hours ago [-]
More to the point, it's a shame that we can't collectively grok (dammit, they took that from us too) concepts like "personal" and/or "curated" directories, e.g. individual and group wikis and so forth on perhaps more directed topics with lists of good links.
cosmicgadget 2 hours ago [-]
Other than the obvious (but surmountable) technical challenges with crawling and indexing, trying to establish "goodness" for a given user is tough. For a blogger it will be "hey, you are reading this so you probably like what I like". That's often true but as soon as you try to have a centralized service with arbitrary users, it is hard to do anything better than filtering purely commercial content.
sdf4j 3 hours ago [-]
what you mean we can't? there are a lot of curated content directories out there.
jrm4 3 hours ago [-]
Right, I suppose I mean "getting more people to think about why a few of these bookmarked for your favorite topics, especially tied to a trustworthy person, is a million times better than just hitting up Google."
Or, perhaps, a "a better Google should just take you to these."
Something like that.
ambicapter 4 hours ago [-]
Google basically invented the modern cloud in order to efficiently use the hardware necessary to actually build those search engine indices. It's not really a question of implementing a good algorithm and away we go.
CalRobert 4 hours ago [-]
Among other things, I think crawling is a lot harder now.
lif 4 hours ago [-]
Provided they have the kind of massive government support Google has had from the get-go, sure!
OutOfHere 4 hours ago [-]
The actual underlying problem has changed altogether. Pagerank is easily gamed by SEO.
Search candidates and rankings now require assessment by LLM. Moreover, as a default, users want the results intelligently synthesized into a text response with references rather than as raw results.
Crawling too requires innovative approaches to bypass server filters.
I doubt any independent person can afford to run a vector database or LLMs at immense scale.
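To make the "easily gamed" point concrete, here is a minimal sketch of textbook PageRank by power iteration, with a toy link farm pointed at one page. The graph, the page names, and the damping value of 0.85 are illustrative assumptions; this is not a claim about what any production search engine runs today.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets the baseline "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# A tiny honest web: "spam" exists but nobody links to it.
honest = {"a": ["b"], "b": ["a"], "spam": ["a"]}

# Same web plus 20 throwaway pages that all link to "spam".
farmed = dict(honest)
for i in range(20):
    farmed[f"farm{i}"] = ["spam"]

before = pagerank(honest)["spam"]
after = pagerank(farmed)["spam"]
print(f"spam rank without farm: {before:.3f}, with farm: {after:.3f}")
```

The farm pages have no inbound links themselves, yet their baseline share is enough to lift the target, which is why link-based signals need additional defenses.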
kcbanner 4 hours ago [-]
> users want the results intelligently synthesized into a text response with references rather than as raw results.
The reason I pay for Kagi is that I specifically don't want this to occur.
OutOfHere 4 hours ago [-]
If you pay for a service (web search) that 99.9% use for free, you're an extreme outlier, and not necessarily a justifiable one either. After all, DDG, Google and various others still have raw results for free.
Workaccount2 4 hours ago [-]
How much do you technologically relate to the average person on the street though?
Every person I have seen (outside the tiny tech bubble) google something has just read the AI overview without skipping a beat.
yepitwas 4 hours ago [-]
That's worrisome since I've seen those be for-sure wrong a pretty high percentage of the time.
[EDIT] Incidentally, are there any sites that do actual web search any more, better than Yandex? I'd rather avoid a Russian site if I can, but there are whole topics where it's impossible to find anything useful on heavily "massaged" allegedly-Web-search-but-not-really sites like Google and DDG (Bing), but I can find what I want on page 1 or 2 of a Yandex search. Is Kagi as good as that, or is their index simply ignoring a whole bunch of the Web like so many others? I don't mind paying.
degamad 3 hours ago [-]
Google "Web" results (not the default results you get when you search) still seem okay for me. You can force them with the udm=14 url trick, or select the "Web" tab in the results. No AI, no images or shopping results, and slightly better text results.
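A minimal sketch of the udm=14 trick as a URL builder, in case someone wants to wire it into a browser keyword or a script. The helper name `google_web_only_url` is made up; the only parts taken from the comment above are the `udm=14` parameter and its effect.

```python
from urllib.parse import urlencode


def google_web_only_url(query):
    """Build a Google search URL that requests the plain "Web" results tab."""
    return "https://www.google.com/search?" + urlencode({"q": query, "udm": "14"})


print(google_web_only_url("venison tenderloin"))
```

Browsers that support custom search engines can use the same pattern with a `%s` placeholder in place of the query.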
franktankbank 3 hours ago [-]
Yep, same here. Ask it "should I wash venison tenderloin" and you get an initial "No, because" followed by a general "yes, it's important to clean, including with water" in the longer description. Wow, a self-contradictory answer! Good job!
jkestner 4 hours ago [-]
We’re being force fed them. I’m an AI hater and I catch myself reading those sometimes.
Yes, people want the answer directly. Google wants you to stay on their site to read some mishmash. I think the ideal would be to immediately go to the source’s site.
throwmeaway222 4 hours ago [-]
At this point the web is also so centralized you only need 3 bookmarks these days (your news, youtube and Amazon)
A search is just learning what you don't know and AI does a better job than search has ever done for me - and I'm in tech.
freeopinion 1 hours ago [-]
> users want the results intelligently synthesized into a text response with references rather than as raw results
This leads directly to another big change.
People used to submit their sites to search engines and now they might actively block search engines. So a search engine author might have to spend a lot of effort in adversarial games.
ricardo81 4 hours ago [-]
>Pagerank
Also a lot of site owners are reluctant to link out. So much so that 'nofollow' had been reduced to a hint rather than a directive.
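For anyone building a link graph, here is a minimal sketch of separating nofollow links from the rest with Python's standard `html.parser`. The class name `LinkCollector` and the sample HTML are made up for illustration; how much weight to give the nofollow bucket is a ranking decision this sketch does not make.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect <a href> targets, split by whether rel contains "nofollow"."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel is a space-separated token list; a bare attribute has value None.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


page = """
<p><a href="https://example.com/a">plain</a>
<a rel="nofollow noopener" href="https://example.com/b">hinted</a>
<a name="anchor-only">no href</a></p>
"""

collector = LinkCollector()
collector.feed(page)
print("followed:", collector.followed)
print("nofollow:", collector.nofollow)
```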
iamacyborg 4 hours ago [-]
> Moreover, as a default, users want the results intelligently synthesized into a text response with references rather than as raw results.
Citation needed
OutOfHere 4 hours ago [-]
You mean all the users of chat services aren't evidence? Chat services increasingly incorporate web links for references in their responses, and this is as the users seek. The tide continues to shift from traditional search to LLM synthesis.
iamacyborg 4 hours ago [-]
I suspect there are more users of traditional search than there are of llm chat apps.
freeopinion 1 hours ago [-]
I suspect that chat apps dominate (80+%?) the under-20 demographic, and have a sizable chunk of the under-30 demographic. Within the next five years it will probably represent 50+% of total search traffic. Maybe it already does. It makes sense that any search site that wants to be in the game tomorrow would keep racing down the AI chat path.
vlucas 4 hours ago [-]
> “I think it’s definitely lowered the barrier,” Lin says of the LLM’s role in enabling DIY search engines. “To me, it seems like the only barrier to actually competing with Google, creating an alternate search engine, is not so much the technology, it’s mostly the market forces.”
Oh sweet summer child
HardCodedBias 3 hours ago [-]
I know that Google engineers have a cushy life but I actually find it unlikely that a guy, who isn't attempting some radical new type of search (like pagerank back in the day) can hope to compete with the orgs in Google who support search.
Again, those orgs are likely too comfortable and less productive than people would like, but we're talking about many-many thousands and depending upon how you define "the work" of search upwards of 10k.
I didn't see any new secret sauce in the article, and Google has said that since 2015 (?) Google Brain has been involved in search.
This is not to say that Google couldn't be dislodged by search via LLM or similar, that is "new" research.
freeopinion 55 minutes ago [-]
If you wrote that 100 people could outwork one person, I'd nod my head. If you wrote that 10k people could outwork 1k people, I'd shrug. If you tell me that 100 people can combine to tie my shoe faster than I can, I'd question that.
Building a state-of-the-art search engine is not shoelaces. But upwards of 10k workers is not impressive in the right direction.
One person starting out with anything at all can quickly grow into one person with one or two really innovative ideas. One or two good ideas can catch fire pretty quickly. Don't be too dismissive.
p3rls 2 hours ago [-]
i've been thinking that google could use its own AI to evaluate URLs instead of relying on pagerank and backlinks which are almost completely valueless as a signal in 2025. in my niche there's more slop than ever being produced daily and it's all hitting rank 1. it's tragic what google is doing to the internet.
Oarch 4 hours ago [-]
I'm sure there's a money laundering joke in here somewhere
mooiedingen 3 hours ago [-]
Nothing new as it has been done before, the concept is simple enough:
step 1: indexer, solr/lucene
Step 2: crawler of which there are several foss, build one yourself?
or you just run yacy which is a combo of the above, combine it with an oldschool searx instance and you will be granted the title as seeker by the spirit of Fravia+ who was elder of the searchlores!!! Not only will you filter crap made by machine learning models, but thou shall find what thou seek! I refuse to call a 16 line long for loop triggering in memory loaded tokenized data where data can be anything from a scientific paper hallucinated by a chatbot to a message between two lovers anything intelligent for it is not intelligence but a blob of tokenized fcking data in memory getting triggered for an output by a derp with a 16 line long for loop!!!
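The indexer step can be sketched in a few lines as a toy inverted index with AND queries. This is only an illustration of the concept, not how Solr, Lucene, or YaCy are implemented; the document IDs and texts are made up, and there is no stemming or ranking.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    "u1": "Dinosaur blog about fossils",
    "u2": "Fossil fuel prices",
    "u3": "dinosaur fossils museum",
}
index = build_index(docs)
print(sorted(search(index, "dinosaur fossils")))
```

A crawler would feed `build_index` with fetched pages; everything hard about search (ranking, freshness, spam) sits on top of this.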
I have 1542766 domains. Might not be much, but it is an honest work.
It is available as a github repo, so anybody that wants to start crawling has some initial data to kick off.
Links
https://github.com/rumca-js/Internet-Places-Database
https://commoncrawl.org
https://www.proxyrack.com/residential-proxies/
https://archive.is/HA7y4
Some bits and pieces:
> his new search engine, the robust Search-a-Page <https://searcha.page>, which has a privacy-focused variant called Seek Ninja <https://seek.ninja>
- SearchaPage - Web Search Engine https://searcha.page/
- Seek Ninja - Stealthy Search Engine https://seek.ninja/
Both of them are erroring out right now?
[1] https://en.wikipedia.org/wiki/Effort_justification
I’m not following you.
https://dictionary.apa.org/effort-justification
But full disclosure, sometimes I'm using DuckDuckGo and it's also good enough most of the time that I occasionally forget until I go down some rabbit hole and realize that I'm using the wrong search engine.
I just don’t understand people who get so upset that someone might like something enough to talk about liking it. So upset that they won’t ever try the thing. Like … ok I guess? You do you. It’s just a strange way to make decisions.
At least this is just a consumer product. Worse is when people here say they make technical decisions using the same process. They’d black list certain tech because they’ve heard people talking about how it solved their problems. Also ok, but now I know I should avoid them professionally.
In all of these cases, a reasonable counterpoint is that if it were that applicable for all audiences, one wouldn't need to sing its praises, it would sing its own praises
I signed up for a specialist forum not too long ago and posted an honest review of a product because I hadn't been able to find one anywhere on the internet. Immediately a bunch of people accused me of being a "shill" for a direct-to-consumer business that's been powered by a Yahoo storefront for the last 20 years, as though a business that's run by a guy with an AOL e-mail address is sophisticated enough to figure out Fiverr and astroturf their reputation on a phpBB forum.
Think about it for just a moment - do you really think that the Hacker News audience is large enough or full of enough tastemakers to sway an alternative search engine's market share? It isn't. If Kagi wanted to do that they'd hire TikTok influencers.
It's like discovering that there's a better pair of shoes that's more comfortable. Everybody can use a slightly improved, more comfortable pair of shoes, so it comes up frequently.
Google was invented many years ago by two guys in a dorm room. Since then there have been so many white papers and advancements in the public sphere, and the actual underlying problem has changed so little, that it seems like it could be done by a small group or an independent person.
The parts that absolutely require JS can't be reliably linked to and nobody indexes that stuff. Most apparent SPAs serve an HTML alternative if you don't claim to be a web browser in the UA.
Cloudflare and the like are also fairly easy to deal with as long as your crawler is well behaved. You can register the fingerprint and mostly get access to Cloudflare-protected websites.
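For what "well behaved" might look like in practice, here is a rough sketch, assuming it means an honest, identifiable User-Agent plus a per-host delay between requests. The `HobbyCrawler` string, the bot-info URL, and the 5 second delay are illustrative assumptions, not anything Cloudflare specifies, and the fingerprint registration itself is not shown.

```python
from urllib.parse import urlsplit
from urllib.request import Request

# Hypothetical crawler identity: not a browser string, with a contact URL.
USER_AGENT = "HobbyCrawler/0.1 (+https://example.com/bot-info)"

class PolitenessGate:
    """Tracks, per host, the earliest time the next request may start."""

    def __init__(self, min_delay_seconds: float = 5.0):
        self.min_delay = min_delay_seconds
        self._next_allowed = {}  # host -> timestamp

    def wait_time(self, url: str, now: float) -> float:
        """Seconds to sleep before fetching `url`; also reserves that slot."""
        host = urlsplit(url).hostname or ""
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + self.min_delay
        return start - now

def build_request(url: str) -> Request:
    """Build a request that identifies the crawler instead of posing as a browser."""
    return Request(url, headers={"User-Agent": USER_AGENT})

if __name__ == "__main__":
    import time
    gate = PolitenessGate()
    for url in ("https://example.com/a", "https://example.com/b"):
        print(url, "wait", gate.wait_time(url, time.monotonic()))
```

The actual fetch (for example `urllib.request.urlopen(build_request(url))` after sleeping for the returned wait) is left out so the sketch runs without touching the network.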
Second, the internet was different: when all nerds declared that Google is good, that was CNN-grade newsworthy (and CNN used to matter a lot more back then), simply because the internet seemed kinda important, but there was no other authority on the topic. Today, that's not the case. If you need someone to opine on the internet on air, you invite some political pundit or a business analyst.
So no, I don't think you can repeat the success of Google the same way. It was a product of its time.
Or, perhaps, "a better Google should just take you to these."
Something like that.
Search candidates and rankings now require assessment by LLM. Moreover, as a default, users want the results intelligently synthesized into a text response with references rather than as raw results.
Crawling too requires innovative approaches to bypass server filters.
I doubt any independent person can afford to run a vector database or LLMs at immense scale.
The reason I pay for Kagi is that I specifically don't want this to occur.
Every person I have seen (outside the tiny tech bubble) google something has just read the AI overview without skipping a beat.
[EDIT] Incidentally, are there any sites that do actual web search any more, better than Yandex? I'd rather avoid a Russian site if I can, but there are whole topics where it's impossible to find anything useful on heavily "massaged" allegedly-Web-search-but-not-really sites like Google and DDG (Bing), but I can find what I want on page 1 or 2 of a Yandex search. Is Kagi as good as that, or is their index simply ignoring a whole bunch of the Web like so many others? I don't mind paying.
Yes, people want the answer directly. Google wants you to stay on their site to read some mishmash. I think the ideal would be to immediately go to the source’s site.
A search is just learning what you don't know and AI does a better job than search has ever done for me - and I'm in tech.
This leads directly to another big change.
People used to submit their sites to search engines and now they might actively block search engines. So a search engine author might have to spend a lot of effort in adversarial games.
Also, a lot of site owners are reluctant to link out. So much so that 'nofollow' has been reduced to a hint rather than a directive.
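As a rough illustration of the hint-versus-directive distinction, a crawler could record `rel="nofollow"` as a flag on each link instead of dropping the link outright, and leave it to ranking to decide how much the flag matters. This is only a sketch with Python's standard `html.parser`; the sample HTML and the `LinkCollector` name are invented, and how any real engine weighs the flag is not something this code claims to know.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects (href, is_nofollow) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        # rel holds space-separated tokens, e.g. "nofollow ugc"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_tokens))

def extract_links(html_text: str):
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links

if __name__ == "__main__":
    sample = '<a href="/a">one</a> <a rel="nofollow ugc" href="/b">two</a>'
    for href, nofollow in extract_links(sample):
        print(href, "(hint: nofollow)" if nofollow else "")
```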
Citation needed
Oh sweet summer child
Again, those orgs are likely too comfortable and less productive than people would like, but we're talking about many-many thousands and depending upon how you define "the work" of search upwards of 10k.
I didn't see any new secret sauce in the article, and Google has said that since 2015 (?) Google Brain has been involved in search.
This is not to say that Google couldn't be dislodged by search via LLM or similar, that is "new" research.
Building a state-of-the-art search engine is not shoelaces. But upwards of 10k workers is not impressive in the right direction.
One person starting out with anything at all can quickly grow into one person with one or two really innovative ideas. One or two good ideas can catch fire pretty quickly. Don't be too dismissive.