I shall conquor the worl..... Ooooo, pretty blinking LED lights!
Joined: Dec 2006 Gender: Male Posts: 623 Location: In my mind
Re: Need help - 23 Gig File, Indexing it... « Reply #15 on Sept 15, 2009, 9:57pm »
Well, with a different coding approach I was able to cut the indexing time from 3 days down to 10 hours. That's still not good enough; I want around 3 hours total. So I'm still toying with ideas here.
Joined: Dec 2006 Gender: Male Posts: 623 Location: In my mind
Re: Need help - 23 Gig File, Indexing it... « Reply #17 on Sept 18, 2009, 1:34am »
Well, we've made some headway on the Rev forums on this too. We've got the speed issue pretty much resolved, but have run into one little issue during the indexing routine. Somehow, at some point, our character offset becomes corrupt.
But I think someone posted a possible fix earlier today also. Too tired to check on it now though.
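The thread does not say what corrupted the offset. One common way offsets drift in this kind of indexing, offered here only as a possibility and not as the confirmed cause, is counting characters in one place and bytes in another when the file contains multi-byte UTF-8 text. A minimal Python sketch of that mismatch, using a made-up sample string:

```python
# Hypothetical sample text; the accented character takes two bytes in UTF-8.
text = "café <title>Example</title>"
data = text.encode("utf-8")

char_offset = text.index("<title>")   # position counted in characters
byte_offset = data.index(b"<title>")  # position counted in bytes

# The byte offset runs ahead by one for every extra byte before the match,
# so seeking in a file with a character count lands in the wrong place.
drift = byte_offset - char_offset
print(char_offset, byte_offset, drift)
```

If the index stores one kind of offset and the reader seeks with the other, the error grows the further into the file you go.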
I think we're almost there!
I sure hope so. I've only found 3 offline Wikipedia viewers. Of the 3, only 2 are capable of viewing the entire database; the third contains only a fixed set of articles (under 2,000, when there are over 3 million in the full database). The other two are good and handle the entire db, but their viewers suffer from formatting issues: one has trouble with tables, and the other leaves extraneous data within the article, which makes it nearly impossible to focus on the articles at all.
Joined: Nov 2007 Gender: Male Posts: 110 Location: Marshall
Re: Need help - 23 Gig File, Indexing it... « Reply #18 on Sept 18, 2009, 8:45am »
Hello Garrett,
I don't know if you can use something like this or not. I got it from the PB forum a few years ago. It's an INSTR replacement that's supposed to be faster than INSTR.
It was faster than PB's INSTR, but it may not be in the software you use.
I'll post it just in case you can use something like this and you are using INSTR.
Code:
'----------------------------------------------------------------------------
' INSTR replacement, very fast when doing repeated search for long strings
' in long strings. Original code by Steve Hutchesson.
' Adjusted to work as direct replacement for INSTR by Borje Hagsten.
' No negative startpos for backwards search though, sorry.
'
' startpos can range from 1 to LEN(MainStr) - LEN(SearchStr).
' Returned position is 1-based, just like INSTR. Return is zero if not found.
'------------------------------------------------------------------------------
FUNCTION bmINSTR(BYVAL startpos AS LONG, m AS STRING, s AS STRING) AS LONG
  #REGISTER NONE
  LOCAL eLen AS LONG, cval AS LONG
  LOCAL lpSource AS LONG, lnSource AS LONG, lpSearch AS LONG, lnSearch AS LONG
  LOCAL shift_table AS STRING * 1024

  ! mov esi, s               ; store Search string ptr and len
  ! mov esi, [esi]
  ! cmp esi, 0               ; if esi is zero - no search string
  ! je Cleanup2              ; then get out
  ! mov lpSearch, esi
  ! mov esi, [esi-4]
  ! mov lnSearch, esi

  ! cmp lnSearch, 3          ; check search len
  ! jb ShortWordScan         ; if shorter than 3, use INSTR instead

  ! mov esi, m               ; store Main string ptr and len
  ! mov esi, [esi]
  ! cmp esi, 0               ; if esi is zero - no main string
  ! je Cleanup2              ; then get out
  ! mov lpSource, esi
  ! mov esi, [esi-4]
  ! mov lnSource, esi

  ! cmp startpos, 0          ; check startpos
  ! je OKsize2               ; if zero, ok
  ! dec startpos             ; else assume > 0 - decrease, since bmINSTR is 0-based

OKsize2:
  ! mov esi, lpSource
  ! add esi, lnSource
  ! sub esi, lnSearch
  ! mov eLen, esi            ; set Exit Length

  ' ----------------------------------------
  ' load shift table with value in lnSearch
  ' ----------------------------------------
  ! mov ecx, 256
  ! mov eax, lnSearch
  ! lea edi, shift_table
  ! rep stosd

  ' ----------------------------------------------
  ' load descending count values into shift table
  ' ----------------------------------------------
  ! mov ecx, lnSearch        ; SubString length in ECX
  ! dec ecx                  ; correct for zero based index
  ! mov esi, lpSearch        ; address of SubString in ESI
  ! lea edi, shift_table
  ! xor eax, eax

Write_Shift_Chars2:
  ! mov al, [esi]            ; get the character
  ! inc esi                  ; next one
  ! mov [edi+eax*4], ecx     ; write shift for each character
  ! dec ecx                  ; to ascii location in table
  ! jnz Write_Shift_Chars2

  ' -----------------------------
  ' set up for main compare loop
  ' -----------------------------
  ! mov ecx, lnSearch
  ! dec ecx
  ! mov cval, ecx

  ! mov esi, lpSource
  ! mov edi, lpSearch
  ! add esi, startpos        ; add starting position

Pre_Cmp2:
  ! cmp esi, eLen            ; test exit length
  ! ja No_Match2
  ! xor eax, eax             ; reset EAX
  ! mov ecx, cval            ; reset counter in compare loop

Cmp_Loop2:
  ! mov al, [esi+ecx]
  ! cmp al, [edi+ecx]        ; cmp characters in ESI / EDI
  ! jne Get_Shift2           ; if not equal, get next shift
  ! dec ecx
  ! jns Cmp_Loop2

  ! jmp Match2

Get_Shift2:
  ! mov eax, shift_table[eax*4] ; get char shift value
  ! cmp eax, edx             ; is eax pattern length ?
  ! jne Set_Suffix_Shift2    ; if not, jump to Set_Suffix_Shift2
  ! lea esi, [esi+ecx+1]     ; add bad char shift
  ! jmp Pre_Cmp2
  ' ***************************************************************

Match2:
  ! sub esi, lpSource        ; sub source from ESI
  ! mov eax, esi             ; put length in eax
  ! inc eax                  ; adjust for 1-based return
  ! jmp Cleanup2             ; exit

No_Match2:
  ! mov eax, 0               ; set value for no match

Cleanup2:
  ! mov FUNCTION, eax        ; return 1-based position
  EXIT FUNCTION

ShortWordScan:               ' if search len is < 3
  FUNCTION = INSTR(startpos, m, s)
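
For anyone who doesn't read x86 assembly, the shift-table idea behind the routine above can be sketched in a few lines of Python. This is not a translation of the PB code: it is the simpler Horspool variant of Boyer-Moore, and the name `horspool_find` is made up for the illustration. It keeps the same conventions as the listing (1-based start and return value, zero when nothing is found).

```python
def horspool_find(main: bytes, search: bytes, start: int = 1) -> int:
    """Return the 1-based position of search in main at or after start, else 0."""
    n, m = len(main), len(search)
    if m == 0 or m > n:
        return 0

    # Shift table: a byte that is not in the pattern lets us skip m positions.
    shift = [m] * 256
    # Bytes that occur in the pattern (except its last byte) get a smaller
    # shift: the distance from their last occurrence to the pattern's end.
    for i in range(m - 1):
        shift[search[i]] = m - 1 - i

    pos = max(start, 1) - 1  # convert the 1-based start to a 0-based index
    while pos <= n - m:
        # Compare right to left, as the assembly loop does.
        j = m - 1
        while j >= 0 and main[pos + j] == search[j]:
            j -= 1
        if j < 0:
            return pos + 1  # full match, report 1-based position
        # Horspool rule: shift by the table entry for the byte currently
        # aligned with the last position of the pattern.
        pos += shift[main[pos + m - 1]]
    return 0


text = b"hello world, hello there"
print(horspool_find(text, b"hello"))     # first occurrence
print(horspool_find(text, b"hello", 2))  # next occurrence after position 1
```

The speedup over a plain scan comes from the table: on a mismatch the search window can jump ahead by up to the full length of the search string instead of one byte at a time, which is why it pays off most for long search strings.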
Joined: Dec 2006 Gender: Male Posts: 623 Location: In my mind
Re: Need help - 23 Gig File, Indexing it... « Reply #20 on Sept 19, 2009, 2:07pm »
~!HOLD THE PRESSES!~
I've received a lot of help from a lot of people over on the Rev forums for this project, and with their help, we've not only achieved my goal, but surpassed it!!!
My goal was to be able to index the entire 23 gig file in 3 hours, which is about what the other programs like this take.
It took a mere 32 minutes total to completely index the entire 23 gig XML file. The end result is an index listing 9,013,937 entries, of which only around 3 million are actual articles. The rest are extraneous entries which serve as redirects to the articles or are image references or category references. In my final version I'll be implementing a filter which will weed out as much of the extraneous stuff as possible. The index file only weighs in at 312 MB.
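The actual indexer was written in Rev and is not shown in this thread, so the following is only a rough Python sketch of the general idea: walk the dump once, remember the byte offset where each page starts, and store it against the page title. It assumes `<page>` and `<title>…</title>` each sit on their own line, which should be checked against the real dump. The function name `build_title_index` and the `skip_prefixes` list are made up for the illustration. Redirects are not filtered here, since spotting them means looking at the page body.

```python
import io


def build_title_index(stream, skip_prefixes=("Image:", "File:", "Category:")):
    """Return (title, byte_offset_of_page_start) pairs from a binary stream."""
    index = []
    offset = 0          # byte offset of the start of the current line
    page_start = None   # offset of the most recent <page> line, if any
    for line in stream:  # binary mode: each line is bytes, newline included
        stripped = line.strip()
        if stripped == b"<page>":
            page_start = offset
        elif (page_start is not None
              and stripped.startswith(b"<title>")
              and stripped.endswith(b"</title>")):
            title = stripped[len(b"<title>"):-len(b"</title>")].decode("utf-8")
            # Hypothetical filter: drop namespaces that are not articles.
            if not title.startswith(skip_prefixes):
                index.append((title, page_start))
            page_start = None
        offset += len(line)  # count bytes, not characters
    return index


# Tiny made-up sample standing in for the real dump.
sample = (
    "<mediawiki>\n"
    "  <page>\n"
    "    <title>Café</title>\n"
    "  </page>\n"
    "  <page>\n"
    "    <title>Category:Food</title>\n"
    "  </page>\n"
    "  <page>\n"
    "    <title>Tea</title>\n"
    "  </page>\n"
    "</mediawiki>\n"
).encode("utf-8")

for title, off in build_title_index(io.BytesIO(sample)):
    print(title, off)
```

Because the offsets are byte counts taken from a file opened in binary mode, a viewer can `seek()` straight to a page without reading anything before it, which is what keeps the index small compared to the dump itself.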
I can't thank everyone over there enough for all the help they gave me. I can't even thank you guys enough for trying to help me out on this as well.
I'm just so excited at the end result that frankly I'm still a bit in shock at the final total time!
I'm not done with this project yet, of course. There are still many things to do, but that was the biggest hurdle right there. I had already done much of the other work needed, such as searching, grabbing data from the file, parsing the data, and displaying it in a formatted view. But there's still more work to be done.
I hope this can become one of the more viable offline Wikipedia programs available today.
It'll be open source, so if anyone is interested in it at a later date, you can grab the source and do as you wish with it.
Joined: Dec 2006 Gender: Male Posts: 721 Location: Northwoods Wisconsin
Re: Need help - 23 Gig File, Indexing it... « Reply #21 on Sept 20, 2009, 5:12am »
Congrats, Garrett!
Now you can begin work on health care reform, world hunger, and global warming! All it takes is perseverance and creative thinking. Looks like you've got the right tools for the job.
Joined: Dec 2006 Gender: Male Posts: 623 Location: In my mind
Re: Need help - 23 Gig File, Indexing it... « Reply #24 on Oct 28, 2009, 10:18pm »
Jerry.. that's a sweet little device! But I'd like it if they went with a color screen and tossed in the images that go with the articles. Maybe they'll do that in a year or two.
BTW, I had to shelve my project because classes started last month and I was not done with it. I left off dealing with formatting issues in articles that have errors in their formatting markers.
If anyone is interested in the sources for this, either to derive a project or to continue this one, feel free to PM me with an email address to send a zip file to (without the Wikipedia database; download that from their site instead).